[PATCH review 0/9] Call for testing and review of mount detach fixes

* [PATCH review 0/9] Call for testing and review of mount detach fixes
@ 2015-01-02 21:42 ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-02 21:42 UTC (permalink / raw)
  To: Linux Containers, linux-fsdevel
  Cc: Richard Weinberger, Al Viro, Andrey Vagin, Andy Lutomirski

Way back in October Andrey Vagin reported that umount(MNT_DETACH) could
be used to defeat MNT_LOCKED.

That MNT_DETACH is allowed in user namespaces comes from my early
misunderstanding of what MNT_DETACH does.  My mistake.

To avoid breaking existing userspace the conflict between MNT_DETACH
and MNT_LOCKED is fixed by leaving locked umounts attached in the mount
hash table until the last reference goes away.

While investigating this issue I also found an issue with
__detach_mounts.  The code was unnecessarily and incorrectly triggering
mount propagation, resulting in too many mounts going away when a
directory is deleted, and too many CPU cycles burned while doing that.

For those who like to see everything in a single tree the code is at:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

Eric W. Biederman (9):
      mnt: Improve the umount_tree flags
      mnt: Don't propagate umounts in __detach_mounts
      mnt: In umount_tree reuse mnt_list instead of mnt_hash
      mnt: Add MNT_UMOUNT flag
      mnt: Delay removal from the mount hash.
      mnt: Factor out __detach_mnt from detach_mnt
      mnt: Simplify umount_tree
      mnt: Remove redundant NULL tests in namespace_unlock
      mnt: Honor MNT_LOCKED when detaching mounts

 fs/namespace.c        | 150 +++++++++++++++++++++++++++++++-------------------
 fs/pnode.c            |   8 +--
 fs/pnode.h            |   3 +-
 include/linux/mount.h |   1 +
 4 files changed, 99 insertions(+), 63 deletions(-)

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH review 1/9] mnt: Improve the umount_tree flags
       [not found] ` <871tncuaf6.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-01-02 21:52   ` Eric W. Biederman
  2015-01-02 21:52   ` [PATCH review 2/9] mnt: Don't propagate umounts in __detach_mounts Eric W. Biederman
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-02 21:52 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel

- Remove the unneeded declaration from pnode.h
- Mark umount_tree static as it has no callers outside of namespace.c
- Define an enumeration of umount_tree's flags.
- Pass umount_tree's flags in by name

This removes the magic numbers 0, 1 and 2, making the code a little
clearer, and makes it possible for there to be lazy unmounts that don't
propagate, which is what __detach_mounts actually wants, for example.
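
As a reading aid, the correspondence between the old magic numbers and
the new flags can be sketched in userspace C.  The enum values are the
ones from the diff below; legacy_how_to_flags() is a hypothetical helper
that exists only for this illustration and is not part of the patch.

```c
#include <assert.h>

/* Flag names and values as introduced by this patch. */
enum umount_tree_flags {
	UMOUNT_SYNC = 1,
	UMOUNT_PROPAGATE = 2,
};

/*
 * Hypothetical helper (not in the patch): translate an old "how"
 * magic number into the equivalent set of named flags.
 */
static int legacy_how_to_flags(int how)
{
	int flags = 0;

	assert(how >= 0 && how <= 2);
	if (how)		/* old test "if (how)": propagate the umount */
		flags |= UMOUNT_PROPAGATE;
	if (how < 2)		/* old test "if (how < 2)": set MNT_SYNC_UMOUNT */
		flags |= UMOUNT_SYNC;
	return flags;
}
```

Under this mapping no old value yields an empty flag set, so a lazy
unmount that does not propagate could not be requested before.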

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 31 ++++++++++++++++---------------
 fs/pnode.h     |  1 -
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index cd1e9681a0cf..5bb96c440b31 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1323,14 +1323,15 @@ static inline void namespace_lock(void)
 	down_write(&namespace_sem);
 }
 
+enum umount_tree_flags {
+	UMOUNT_SYNC = 1,
+	UMOUNT_PROPAGATE = 2,
+};
 /*
  * mount_lock must be held
  * namespace_sem must be held for write
- * how = 0 => just this tree, don't propagate
- * how = 1 => propagate; we know that nobody else has reference to any victims
- * how = 2 => lazy umount
  */
-void umount_tree(struct mount *mnt, int how)
+static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 {
 	HLIST_HEAD(tmp_list);
 	struct mount *p;
@@ -1344,7 +1345,7 @@ void umount_tree(struct mount *mnt, int how)
 	hlist_for_each_entry(p, &tmp_list, mnt_hash)
 		list_del_init(&p->mnt_child);
 
-	if (how)
+	if (how & UMOUNT_PROPAGATE)
 		propagate_umount(&tmp_list);
 
 	hlist_for_each_entry(p, &tmp_list, mnt_hash) {
@@ -1352,7 +1353,7 @@ void umount_tree(struct mount *mnt, int how)
 		list_del_init(&p->mnt_list);
 		__touch_mnt_namespace(p->mnt_ns);
 		p->mnt_ns = NULL;
-		if (how < 2)
+		if (how & UMOUNT_SYNC)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 		if (mnt_has_parent(p)) {
 			hlist_del_init(&p->mnt_mp_list);
@@ -1457,14 +1458,14 @@ static int do_umount(struct mount *mnt, int flags)
 
 	if (flags & MNT_DETACH) {
 		if (!list_empty(&mnt->mnt_list))
-			umount_tree(mnt, 2);
+			umount_tree(mnt, UMOUNT_PROPAGATE);
 		retval = 0;
 	} else {
 		shrink_submounts(mnt);
 		retval = -EBUSY;
 		if (!propagate_mount_busy(mnt, 2)) {
 			if (!list_empty(&mnt->mnt_list))
-				umount_tree(mnt, 1);
+				umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
 			retval = 0;
 		}
 	}
@@ -1496,7 +1497,7 @@ void __detach_mounts(struct dentry *dentry)
 	lock_mount_hash();
 	while (!hlist_empty(&mp->m_list)) {
 		mnt = hlist_entry(mp->m_list.first, struct mount, mnt_mp_list);
-		umount_tree(mnt, 2);
+		umount_tree(mnt, UMOUNT_PROPAGATE);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
@@ -1658,7 +1659,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
 out:
 	if (res) {
 		lock_mount_hash();
-		umount_tree(res, 0);
+		umount_tree(res, UMOUNT_SYNC);
 		unlock_mount_hash();
 	}
 	return q;
@@ -1682,7 +1683,7 @@ void drop_collected_mounts(struct vfsmount *mnt)
 {
 	namespace_lock();
 	lock_mount_hash();
-	umount_tree(real_mount(mnt), 0);
+	umount_tree(real_mount(mnt), UMOUNT_SYNC);
 	unlock_mount_hash();
 	namespace_unlock();
 }
@@ -1865,7 +1866,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
  out_cleanup_ids:
 	while (!hlist_empty(&tree_list)) {
 		child = hlist_entry(tree_list.first, struct mount, mnt_hash);
-		umount_tree(child, 0);
+		umount_tree(child, UMOUNT_SYNC);
 	}
 	unlock_mount_hash();
 	cleanup_group_ids(source_mnt, NULL);
@@ -2045,7 +2046,7 @@ static int do_loopback(struct path *path, const char *old_name,
 	err = graft_tree(mnt, parent, mp);
 	if (err) {
 		lock_mount_hash();
-		umount_tree(mnt, 0);
+		umount_tree(mnt, UMOUNT_SYNC);
 		unlock_mount_hash();
 	}
 out2:
@@ -2416,7 +2417,7 @@ void mark_mounts_for_expiry(struct list_head *mounts)
 	while (!list_empty(&graveyard)) {
 		mnt = list_first_entry(&graveyard, struct mount, mnt_expire);
 		touch_mnt_namespace(mnt->mnt_ns);
-		umount_tree(mnt, 1);
+		umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
 	}
 	unlock_mount_hash();
 	namespace_unlock();
@@ -2487,7 +2488,7 @@ static void shrink_submounts(struct mount *mnt)
 			m = list_first_entry(&graveyard, struct mount,
 						mnt_expire);
 			touch_mnt_namespace(m->mnt_ns);
-			umount_tree(m, 1);
+			umount_tree(m, UMOUNT_PROPAGATE|UMOUNT_SYNC);
 		}
 	}
 }
diff --git a/fs/pnode.h b/fs/pnode.h
index 4a246358b031..16afc3d6d2f2 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -47,7 +47,6 @@ int get_dominating_id(struct mount *mnt, const struct path *root);
 unsigned int mnt_get_count(struct mount *mnt);
 void mnt_set_mountpoint(struct mount *, struct mountpoint *,
 			struct mount *);
-void umount_tree(struct mount *, int);
 struct mount *copy_tree(struct mount *, struct dentry *, int);
 bool is_path_reachable(struct mount *, struct dentry *,
 			 const struct path *root);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 2/9] mnt: Don't propagate umounts in __detach_mounts
       [not found] ` <871tncuaf6.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-01-02 21:52   ` [PATCH review 1/9] mnt: Improve the umount_tree flags Eric W. Biederman
@ 2015-01-02 21:52   ` Eric W. Biederman
  2015-01-02 21:52   ` [PATCH review 3/9] mnt: In umount_tree reuse mnt_list instead of mnt_hash Eric W. Biederman
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-02 21:52 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel

Invoking mount propagation from __detach_mounts is inefficient and
wrong.

It is inefficient because __detach_mounts already walks the list of
mounts where something needs to be done, and mount propagation
walks some subset of those mounts again.

It is actively wrong because if the dentry that is passed to
__detach_mounts is not part of the path to a mount that mount should
not be affected.

change_mnt_propagation(p,MS_PRIVATE) modifies the mount propagation
tree of a master mount so its slaves are connected to another master
if possible, which means even removing a mount from the middle of a
mount tree with __detach_mounts will not deprive any mount of propagated
mount events.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 5bb96c440b31..07d0562290a5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1497,7 +1497,7 @@ void __detach_mounts(struct dentry *dentry)
 	lock_mount_hash();
 	while (!hlist_empty(&mp->m_list)) {
 		mnt = hlist_entry(mp->m_list.first, struct mount, mnt_mp_list);
-		umount_tree(mnt, UMOUNT_PROPAGATE);
+		umount_tree(mnt, 0);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 3/9] mnt: In umount_tree reuse mnt_list instead of mnt_hash
       [not found] ` <871tncuaf6.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-01-02 21:52   ` [PATCH review 1/9] mnt: Improve the umount_tree flags Eric W. Biederman
  2015-01-02 21:52   ` [PATCH review 2/9] mnt: Don't propagate umounts in __detach_mounts Eric W. Biederman
@ 2015-01-02 21:52   ` Eric W. Biederman
  2015-01-02 21:52   ` [PATCH review 4/9] mnt: Add MNT_UMOUNT flag Eric W. Biederman
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-02 21:52 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel

umount_tree builds a list of mounts that need to be unmounted.
Utilize mnt_list for this purpose instead of mnt_hash as mnt_list is
an ordinary list_head, allowing the use of list_splice and list_move
instead of rolling our own.

This also begins to allow keeping a mount on the mnt_hash after it is
unmounted, which is necessary for a properly functioning MNT_LOCKED
implementation.
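
The gather-then-splice pattern described above can be sketched in
userspace C.  This is only an illustration: struct list_head, list_move
and list_splice here are minimal stand-ins written for this example
(the kernel's own versions live in include/linux/list.h), and
struct fake_mount, list_init, list_unlink, list_count and
gather_and_splice are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Unlink an entry from whatever list it is currently on. */
static void list_unlink(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* Move an entry to the front of another list (same idea as list_move). */
static void list_move(struct list_head *e, struct list_head *head)
{
	list_unlink(e);
	e->next = head->next;
	e->prev = head;
	head->next->prev = e;
	head->next = e;
}

/* Join all of "src" onto the front of "dst"; "src" is left stale. */
static void list_splice(struct list_head *src, struct list_head *dst)
{
	if (list_empty(src))
		return;
	src->next->prev = dst;
	src->prev->next = dst->next;
	dst->next->prev = src->prev;
	dst->next = src->next;
}

/* Hypothetical, heavily simplified mount with only a list linkage. */
struct fake_mount {
	int id;
	struct list_head mnt_list;
};

static size_t list_count(const struct list_head *h)
{
	const struct list_head *p;
	size_t n = 0;

	for (p = h->next; p != h; p = p->next)
		n++;
	return n;
}

/* Gather "n" mounts onto a temporary list, then hand them to "unmounted". */
static void gather_and_splice(struct fake_mount *m, size_t n,
			      struct list_head *unmounted)
{
	struct list_head tmp;
	size_t i;

	assert(unmounted);
	list_init(&tmp);
	for (i = 0; i < n; i++)
		list_move(&m[i].mnt_list, &tmp);
	list_splice(&tmp, unmounted);
}
```

With an ordinary list_head the hand-off to the unmounted list is a
single splice, where the hlist version had to patch the next and pprev
pointers by hand.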

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 45 +++++++++++++++++++--------------------------
 fs/pnode.c     |  6 +++---
 fs/pnode.h     |  2 +-
 3 files changed, 23 insertions(+), 30 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 07d0562290a5..44478b6e3719 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1285,23 +1285,22 @@ int may_umount(struct vfsmount *mnt)
 
 EXPORT_SYMBOL(may_umount);
 
-static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
+static LIST_HEAD(unmounted);	/* protected by namespace_sem */
 
 static void namespace_unlock(void)
 {
 	struct mount *mnt;
-	struct hlist_head head = unmounted;
+	LIST_HEAD(head);
 
-	if (likely(hlist_empty(&head))) {
+	if (likely(list_empty(&unmounted))) {
 		up_write(&namespace_sem);
 		return;
 	}
 
-	head.first->pprev = &head.first;
-	INIT_HLIST_HEAD(&unmounted);
+	list_splice_init(&unmounted, &head);
 
 	/* undo decrements we'd done in umount_tree() */
-	hlist_for_each_entry(mnt, &head, mnt_hash)
+	list_for_each_entry(mnt, &head, mnt_list)
 		if (mnt->mnt_ex_mountpoint.mnt)
 			mntget(mnt->mnt_ex_mountpoint.mnt);
 
@@ -1309,9 +1308,9 @@ static void namespace_unlock(void)
 
 	synchronize_rcu();
 
-	while (!hlist_empty(&head)) {
-		mnt = hlist_entry(head.first, struct mount, mnt_hash);
-		hlist_del_init(&mnt->mnt_hash);
+	while (!list_empty(&head)) {
+		mnt = list_first_entry(&head, struct mount, mnt_list);
+		list_del_init(&mnt->mnt_list);
 		if (mnt->mnt_ex_mountpoint.mnt)
 			path_put(&mnt->mnt_ex_mountpoint);
 		mntput(&mnt->mnt);
@@ -1333,24 +1332,25 @@ enum umount_tree_flags {
  */
 static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 {
-	HLIST_HEAD(tmp_list);
+	LIST_HEAD(tmp_list);
 	struct mount *p;
-	struct mount *last = NULL;
 
-	for (p = mnt; p; p = next_mnt(p, mnt)) {
-		hlist_del_init_rcu(&p->mnt_hash);
-		hlist_add_head(&p->mnt_hash, &tmp_list);
-	}
+	/* Gather the mounts to umount */
+	for (p = mnt; p; p = next_mnt(p, mnt))
+		list_move(&p->mnt_list, &tmp_list);
 
-	hlist_for_each_entry(p, &tmp_list, mnt_hash)
+	/* Hide the mounts from lookup_mnt and mnt_mounts */
+	list_for_each_entry(p, &tmp_list, mnt_list) {
+		hlist_del_init_rcu(&p->mnt_hash);
 		list_del_init(&p->mnt_child);
+	}
 
+	/* Add propogated mounts to the tmp_list */
 	if (how & UMOUNT_PROPAGATE)
 		propagate_umount(&tmp_list);
 
-	hlist_for_each_entry(p, &tmp_list, mnt_hash) {
+	list_for_each_entry(p, &tmp_list, mnt_list) {
 		list_del_init(&p->mnt_expire);
-		list_del_init(&p->mnt_list);
 		__touch_mnt_namespace(p->mnt_ns);
 		p->mnt_ns = NULL;
 		if (how & UMOUNT_SYNC)
@@ -1367,15 +1367,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 			p->mnt_mp = NULL;
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
-		last = p;
-	}
-	if (last) {
-		last->mnt_hash.next = unmounted.first;
-		if (unmounted.first)
-			unmounted.first->pprev = &last->mnt_hash.next;
-		unmounted.first = tmp_list.first;
-		unmounted.first->pprev = &unmounted.first;
 	}
+	list_splice(&tmp_list, &unmounted);
 }
 
 static void shrink_submounts(struct mount *mnt);
diff --git a/fs/pnode.c b/fs/pnode.c
index 260ac8f898a4..bf012af709dd 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -384,7 +384,7 @@ static void __propagate_umount(struct mount *mnt)
 		if (child && list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
 			hlist_del_init_rcu(&child->mnt_hash);
-			hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
+			list_move_tail(&child->mnt_list, &mnt->mnt_list);
 		}
 	}
 }
@@ -396,11 +396,11 @@ static void __propagate_umount(struct mount *mnt)
  *
  * vfsmount lock must be held for write
  */
-int propagate_umount(struct hlist_head *list)
+int propagate_umount(struct list_head *list)
 {
 	struct mount *mnt;
 
-	hlist_for_each_entry(mnt, list, mnt_hash)
+	list_for_each_entry(mnt, list, mnt_list)
 		__propagate_umount(mnt);
 	return 0;
 }
diff --git a/fs/pnode.h b/fs/pnode.h
index 16afc3d6d2f2..aa6d65df7204 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -40,7 +40,7 @@ static inline void set_mnt_shared(struct mount *mnt)
 void change_mnt_propagation(struct mount *, int);
 int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
 		struct hlist_head *);
-int propagate_umount(struct hlist_head *);
+int propagate_umount(struct list_head *);
 int propagate_mount_busy(struct mount *, int);
 void mnt_release_group_id(struct mount *);
 int get_dominating_id(struct mount *mnt, const struct path *root);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 4/9] mnt: Add MNT_UMOUNT flag
       [not found] ` <871tncuaf6.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-01-02 21:52   ` [PATCH review 3/9] mnt: In umount_tree reuse mnt_list instead of mnt_hash Eric W. Biederman
@ 2015-01-02 21:52   ` Eric W. Biederman
  2015-01-02 21:52   ` [PATCH review 5/9] mnt: Delay removal from the mount hash Eric W. Biederman
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-02 21:52 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel

In some instances it is necessary to know if the unmounting
process has begun on a mount.  Add MNT_UMOUNT to make that reliably
testable.

This flag gets used later in fixing locked mounts in MNT_DETACH.
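
A rough userspace sketch of how such a flag is set and tested follows.
The flag values are copied from the include/linux/mount.h hunk below;
struct fake_mount, mark_umount_started() and umount_started() are
hypothetical names used only for this illustration.

```c
#include <assert.h>

/* Values copied from the include/linux/mount.h hunk in this patch. */
#define MNT_DOOMED		0x1000000
#define MNT_SYNC_UMOUNT		0x2000000
#define MNT_MARKED		0x4000000
#define MNT_UMOUNT		0x8000000

/* Hypothetical stand-in holding only the flags word of a mount. */
struct fake_mount {
	int mnt_flags;
};

/* Record that unmounting has begun; other flag bits are left alone. */
static void mark_umount_started(struct fake_mount *m)
{
	assert(m);
	m->mnt_flags |= MNT_UMOUNT;
}

/* Report whether unmounting has begun on this mount. */
static int umount_started(const struct fake_mount *m)
{
	assert(m);
	return (m->mnt_flags & MNT_UMOUNT) != 0;
}
```

Because MNT_UMOUNT occupies its own bit, setting it does not disturb
the neighbouring internal flags.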

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c        | 4 +++-
 fs/pnode.c            | 1 +
 include/linux/mount.h | 1 +
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 44478b6e3719..60d4160cd2f4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1336,8 +1336,10 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	struct mount *p;
 
 	/* Gather the mounts to umount */
-	for (p = mnt; p; p = next_mnt(p, mnt))
+	for (p = mnt; p; p = next_mnt(p, mnt)) {
+		p->mnt.mnt_flags |= MNT_UMOUNT;
 		list_move(&p->mnt_list, &tmp_list);
+	}
 
 	/* Hide the mounts from lookup_mnt and mnt_mounts */
 	list_for_each_entry(p, &tmp_list, mnt_list) {
diff --git a/fs/pnode.c b/fs/pnode.c
index bf012af709dd..ac3aa0d43b90 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -384,6 +384,7 @@ static void __propagate_umount(struct mount *mnt)
 		if (child && list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
 			hlist_del_init_rcu(&child->mnt_hash);
+			child->mnt.mnt_flags |= MNT_UMOUNT;
 			list_move_tail(&child->mnt_list, &mnt->mnt_list);
 		}
 	}
diff --git a/include/linux/mount.h b/include/linux/mount.h
index c2c561dc0114..564beeec5d83 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -61,6 +61,7 @@ struct mnt_namespace;
 #define MNT_DOOMED		0x1000000
 #define MNT_SYNC_UMOUNT		0x2000000
 #define MNT_MARKED		0x4000000
+#define MNT_UMOUNT		0x8000000
 
 struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 5/9] mnt: Delay removal from the mount hash.
       [not found] ` <871tncuaf6.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                     ` (3 preceding siblings ...)
  2015-01-02 21:52   ` [PATCH review 4/9] mnt: Add MNT_UMOUNT flag Eric W. Biederman
@ 2015-01-02 21:52   ` Eric W. Biederman
  2015-01-02 21:52   ` [PATCH review 6/9] mnt: Factor out __detach_mnt from detach_mnt Eric W. Biederman
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-02 21:52 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel

- Modify __lookup_mnt_last to ignore mounts that have MNT_UMOUNT set.
- Don't remove mounts from the mount hash table in propagate_umount.
- Don't remove mounts from the mount hash table in umount_tree before
  the entire list of mounts to be umounted is selected.
- Remove mounts from the mount hash table as the last thing that
  happens in the case where a mount has a parent in umount_tree.
  Mounts without parents are not hashed (by definition).

This paves the way for delaying removal from the mount hash table even
further and fixing the MNT_LOCKED vs MNT_DETACH issue.
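
The lookup change in the first item can be sketched in userspace C.
This is a simplified model, not the kernel code: the hash chain is an
array, the (parent, mountpoint) key is a pair of integers, and
struct fake_mount and lookup_last() are hypothetical names.  As in the
kernel loop, entries sharing a key are assumed to be adjacent.

```c
#include <assert.h>
#include <stddef.h>

#define MNT_UMOUNT 0x8000000	/* value from patch 4/9 */

/* Hypothetical simplified mount; (parent, mountpoint) is the hash key. */
struct fake_mount {
	int parent;
	int mountpoint;
	int flags;
};

/*
 * Scan one hash chain (modelled as an array) and return the last entry
 * for (parent, mountpoint) that has not started unmounting, or NULL.
 */
static const struct fake_mount *
lookup_last(const struct fake_mount *chain, size_t n,
	    int parent, int mountpoint)
{
	const struct fake_mount *res = NULL;
	size_t i = 0;

	assert(chain != NULL || n == 0);

	/* Find the first entry with a matching key. */
	while (i < n && (chain[i].parent != parent ||
			 chain[i].mountpoint != mountpoint))
		i++;

	/* Walk the run of matching entries, skipping MNT_UMOUNT ones. */
	for (; i < n; i++) {
		if (chain[i].parent != parent ||
		    chain[i].mountpoint != mountpoint)
			break;
		if (!(chain[i].flags & MNT_UMOUNT))
			res = &chain[i];
	}
	return res;
}
```

A mount that is still hashed but already marked MNT_UMOUNT is thus
invisible to the lookup, which is what lets removal from the hash table
be delayed.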

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 13 ++++++++-----
 fs/pnode.c     |  1 -
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 60d4160cd2f4..a8afec9c81b6 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -623,14 +623,17 @@ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
  */
 struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry)
 {
-	struct mount *p, *res;
-	res = p = __lookup_mnt(mnt, dentry);
+	struct mount *p, *res = NULL;
+	p = __lookup_mnt(mnt, dentry);
 	if (!p)
 		goto out;
+	if (!(p->mnt.mnt_flags & MNT_UMOUNT))
+		res = p;
 	hlist_for_each_entry_continue(p, mnt_hash) {
 		if (&p->mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry)
 			break;
-		res = p;
+		if (!(p->mnt.mnt_flags & MNT_UMOUNT))
+			res = p;
 	}
 out:
 	return res;
@@ -1341,9 +1344,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		list_move(&p->mnt_list, &tmp_list);
 	}
 
-	/* Hide the mounts from lookup_mnt and mnt_mounts */
+	/* Hide the mounts from mnt_mounts */
 	list_for_each_entry(p, &tmp_list, mnt_list) {
-		hlist_del_init_rcu(&p->mnt_hash);
 		list_del_init(&p->mnt_child);
 	}
 
@@ -1367,6 +1369,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 			p->mnt_mountpoint = p->mnt.mnt_root;
 			p->mnt_parent = p;
 			p->mnt_mp = NULL;
+			hlist_del_init_rcu(&p->mnt_hash);
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
diff --git a/fs/pnode.c b/fs/pnode.c
index ac3aa0d43b90..c27ae38ee250 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -383,7 +383,6 @@ static void __propagate_umount(struct mount *mnt)
 		 */
 		if (child && list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
-			hlist_del_init_rcu(&child->mnt_hash);
 			child->mnt.mnt_flags |= MNT_UMOUNT;
 			list_move_tail(&child->mnt_list, &mnt->mnt_list);
 		}
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 5/9] mnt: Delay removal from the mount hash.
  2015-01-02 21:42 ` Eric W. Biederman
                   ` (4 preceding siblings ...)
  (?)
@ 2015-01-02 21:52 ` Eric W. Biederman
  -1 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-02 21:52 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski, Chen Hanxiao,
	Richard Weinberger, Andrey Vagin, Al Viro

- Modify __lookup_mnt_last to ignore mounts that have MNT_UMOUNT set.
- Don't remove mounts from the mount hash table in __propagate_umount
- Don't remove mounts from the mount hash table in umount_tree before
  the entire list of mounts to be umounted is selected.
- Remove mounts from the mount hash table as the last thing that
  happens in the case where a mount has a parent in umount_tree.
  Mounts without parents are not hashed (by definition).

This paves the way for delaying removal from the mount hash table even
further and fixing the MNT_LOCKED vs MNT_DETACH issue.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 13 ++++++++-----
 fs/pnode.c     |  1 -
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 60d4160cd2f4..a8afec9c81b6 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -623,14 +623,17 @@ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
  */
 struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry)
 {
-	struct mount *p, *res;
-	res = p = __lookup_mnt(mnt, dentry);
+	struct mount *p, *res = NULL;
+	p = __lookup_mnt(mnt, dentry);
 	if (!p)
 		goto out;
+	if (!(p->mnt.mnt_flags & MNT_UMOUNT))
+		res = p;
 	hlist_for_each_entry_continue(p, mnt_hash) {
 		if (&p->mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry)
 			break;
-		res = p;
+		if (!(p->mnt.mnt_flags & MNT_UMOUNT))
+			res = p;
 	}
 out:
 	return res;
@@ -1341,9 +1344,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		list_move(&p->mnt_list, &tmp_list);
 	}
 
-	/* Hide the mounts from lookup_mnt and mnt_mounts */
+	/* Hide the mounts from mnt_mounts */
 	list_for_each_entry(p, &tmp_list, mnt_list) {
-		hlist_del_init_rcu(&p->mnt_hash);
 		list_del_init(&p->mnt_child);
 	}
 
@@ -1367,6 +1369,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 			p->mnt_mountpoint = p->mnt.mnt_root;
 			p->mnt_parent = p;
 			p->mnt_mp = NULL;
+			hlist_del_init_rcu(&p->mnt_hash);
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
diff --git a/fs/pnode.c b/fs/pnode.c
index ac3aa0d43b90..c27ae38ee250 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -383,7 +383,6 @@ static void __propagate_umount(struct mount *mnt)
 		 */
 		if (child && list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
-			hlist_del_init_rcu(&child->mnt_hash);
 			child->mnt.mnt_flags |= MNT_UMOUNT;
 			list_move_tail(&child->mnt_list, &mnt->mnt_list);
 		}
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 6/9] mnt: Factor out __detach_mnt from detach_mnt
       [not found] ` <871tncuaf6.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                     ` (4 preceding siblings ...)
  2015-01-02 21:52   ` [PATCH review 5/9] mnt: Delay removal from the mount hash Eric W. Biederman
@ 2015-01-02 21:52   ` Eric W. Biederman
  2015-01-02 21:52   ` [PATCH review 7/9] mnt: Simplify umount_tree Eric W. Biederman
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-02 21:52 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

An upcoming change is going to need a version of detach_mnt
that leaves the mount on the parent's mnt_mounts list.  Create
that version of detach_mnt now and call it __detach_mnt.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index a8afec9c81b6..c3f526ce0522 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -789,13 +789,12 @@ static void __touch_mnt_namespace(struct mnt_namespace *ns)
 /*
  * vfsmount lock must be held for write
  */
-static void detach_mnt(struct mount *mnt, struct path *old_path)
+static void __detach_mnt(struct mount *mnt, struct path *old_path)
 {
 	old_path->dentry = mnt->mnt_mountpoint;
 	old_path->mnt = &mnt->mnt_parent->mnt;
 	mnt->mnt_parent = mnt;
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
-	list_del_init(&mnt->mnt_child);
 	hlist_del_init_rcu(&mnt->mnt_hash);
 	hlist_del_init(&mnt->mnt_mp_list);
 	put_mountpoint(mnt->mnt_mp);
@@ -805,6 +804,15 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
 /*
  * vfsmount lock must be held for write
  */
+static void detach_mnt(struct mount *mnt, struct path *old_path)
+{
+	__detach_mnt(mnt, old_path);
+	list_del_init(&mnt->mnt_child);
+}
+
+/*
+ * vfsmount lock must be held for write
+ */
 void mnt_set_mountpoint(struct mount *mnt,
 			struct mountpoint *mp,
 			struct mount *child_mnt)
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 6/9] mnt: Factor out __detach_mnt from detach_mnt
  2015-01-02 21:42 ` Eric W. Biederman
                   ` (5 preceding siblings ...)
  (?)
@ 2015-01-02 21:52 ` Eric W. Biederman
  -1 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-02 21:52 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski, Chen Hanxiao,
	Richard Weinberger, Andrey Vagin, Al Viro

An upcoming change is going to need a version of detach_mnt
that leaves the mount on the parent's mnt_mounts list.  Create
that version of detach_mnt now and call it __detach_mnt.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index a8afec9c81b6..c3f526ce0522 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -789,13 +789,12 @@ static void __touch_mnt_namespace(struct mnt_namespace *ns)
 /*
  * vfsmount lock must be held for write
  */
-static void detach_mnt(struct mount *mnt, struct path *old_path)
+static void __detach_mnt(struct mount *mnt, struct path *old_path)
 {
 	old_path->dentry = mnt->mnt_mountpoint;
 	old_path->mnt = &mnt->mnt_parent->mnt;
 	mnt->mnt_parent = mnt;
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
-	list_del_init(&mnt->mnt_child);
 	hlist_del_init_rcu(&mnt->mnt_hash);
 	hlist_del_init(&mnt->mnt_mp_list);
 	put_mountpoint(mnt->mnt_mp);
@@ -805,6 +804,15 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
 /*
  * vfsmount lock must be held for write
  */
+static void detach_mnt(struct mount *mnt, struct path *old_path)
+{
+	__detach_mnt(mnt, old_path);
+	list_del_init(&mnt->mnt_child);
+}
+
+/*
+ * vfsmount lock must be held for write
+ */
 void mnt_set_mountpoint(struct mount *mnt,
 			struct mountpoint *mp,
 			struct mount *child_mnt)
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 7/9] mnt: Simplify umount_tree
       [not found] ` <871tncuaf6.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                     ` (5 preceding siblings ...)
  2015-01-02 21:52   ` [PATCH review 6/9] mnt: Factor out __detach_mnt from detach_mnt Eric W. Biederman
@ 2015-01-02 21:52   ` Eric W. Biederman
  2015-01-02 21:52   ` [PATCH review 8/9] mnt: Remove redundant NULL tests in namespace_unlock Eric W. Biederman
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-02 21:52 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Replace the open-coded copy of __detach_mnt in umount_tree with a call
to __detach_mnt.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c3f526ce0522..9fae55f2242e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1368,16 +1368,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		if (how & UMOUNT_SYNC)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 		if (mnt_has_parent(p)) {
-			hlist_del_init(&p->mnt_mp_list);
-			put_mountpoint(p->mnt_mp);
 			mnt_add_count(p->mnt_parent, -1);
-			/* move the reference to mountpoint into ->mnt_ex_mountpoint */
-			p->mnt_ex_mountpoint.dentry = p->mnt_mountpoint;
-			p->mnt_ex_mountpoint.mnt = &p->mnt_parent->mnt;
-			p->mnt_mountpoint = p->mnt.mnt_root;
-			p->mnt_parent = p;
-			p->mnt_mp = NULL;
-			hlist_del_init_rcu(&p->mnt_hash);
+			__detach_mnt(p, &p->mnt_ex_mountpoint);
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 7/9] mnt: Simplify umount_tree
  2015-01-02 21:42 ` Eric W. Biederman
                   ` (6 preceding siblings ...)
  (?)
@ 2015-01-02 21:52 ` Eric W. Biederman
  -1 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-02 21:52 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski, Chen Hanxiao,
	Richard Weinberger, Andrey Vagin, Al Viro

Replace the open-coded copy of __detach_mnt in umount_tree with a call
to __detach_mnt.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c3f526ce0522..9fae55f2242e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1368,16 +1368,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		if (how & UMOUNT_SYNC)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 		if (mnt_has_parent(p)) {
-			hlist_del_init(&p->mnt_mp_list);
-			put_mountpoint(p->mnt_mp);
 			mnt_add_count(p->mnt_parent, -1);
-			/* move the reference to mountpoint into ->mnt_ex_mountpoint */
-			p->mnt_ex_mountpoint.dentry = p->mnt_mountpoint;
-			p->mnt_ex_mountpoint.mnt = &p->mnt_parent->mnt;
-			p->mnt_mountpoint = p->mnt.mnt_root;
-			p->mnt_parent = p;
-			p->mnt_mp = NULL;
-			hlist_del_init_rcu(&p->mnt_hash);
+			__detach_mnt(p, &p->mnt_ex_mountpoint);
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 8/9] mnt: Remove redundant NULL tests in namespace_unlock
       [not found] ` <871tncuaf6.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                     ` (6 preceding siblings ...)
  2015-01-02 21:52   ` [PATCH review 7/9] mnt: Simplify umount_tree Eric W. Biederman
@ 2015-01-02 21:52   ` Eric W. Biederman
  2015-01-02 21:52   ` [PATCH review 9/9] mnt: Honor MNT_LOCKED when detaching mounts Eric W. Biederman
  2015-01-05 20:45   ` [PATCH review 0/11 Call for testing and review of mount detach fixes (take 2) Eric W. Biederman
  9 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-02 21:52 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

mntget, mntput, dput and path_put all test their arguments to see if
they are NULL before taking any action, so testing for NULL in
namespace_unlock is redundant.

Remove the redundant checks making namespace_unlock a little
shorter and easier to read.

This also makes it possible for mnt_ex_mountpoint.mnt to be NULL,
allowing a dentry to be put without a mount.  This will be needed
in __detach_mounts when detaching already unmounted children,
as part of the fix for MNT_DETACH on MNT_LOCKED mounts.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
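A small userspace sketch of why the caller-side tests are redundant.
The sketch_* names are hypothetical stand-ins for the refcounted
objects and for mntget/mntput/dput/path_put; the only property taken
from the description above is that the get/put helpers do nothing when
handed NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a refcounted object such as a mount or a dentry. */
struct sketch_ref {
	int count;
};

/* NULL-tolerant get: takes a reference only when there is an object. */
struct sketch_ref *sketch_get(struct sketch_ref *r)
{
	if (r)
		r->count++;
	return r;
}

/* NULL-tolerant put: drops a reference only when there is an object. */
void sketch_put(struct sketch_ref *r)
{
	if (r)
		r->count--;
}

/* Stand-in for struct path: a (mount, dentry) pair. */
struct sketch_path {
	struct sketch_ref *mnt;
	struct sketch_ref *dentry;
};

/*
 * Put both halves of the pair.  Because each put tolerates NULL, a
 * pair with mnt == NULL still drops its dentry reference and the
 * caller needs no test of its own.
 */
void sketch_path_put(const struct sketch_path *p)
{
	sketch_put(p->dentry);
	sketch_put(p->mnt);
}
```

The case of interest is a pair with a dentry but no mount: the put
still releases the dentry and simply skips the missing mount.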
 fs/namespace.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9fae55f2242e..3769dbd040c1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1312,8 +1312,7 @@ static void namespace_unlock(void)
 
 	/* undo decrements we'd done in umount_tree() */
 	list_for_each_entry(mnt, &head, mnt_list)
-		if (mnt->mnt_ex_mountpoint.mnt)
-			mntget(mnt->mnt_ex_mountpoint.mnt);
+		mntget(mnt->mnt_ex_mountpoint.mnt);
 
 	up_write(&namespace_sem);
 
@@ -1322,8 +1321,7 @@ static void namespace_unlock(void)
 	while (!list_empty(&head)) {
 		mnt = list_first_entry(&head, struct mount, mnt_list);
 		list_del_init(&mnt->mnt_list);
-		if (mnt->mnt_ex_mountpoint.mnt)
-			path_put(&mnt->mnt_ex_mountpoint);
+		path_put(&mnt->mnt_ex_mountpoint);
 		mntput(&mnt->mnt);
 	}
 }
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 8/9] mnt: Remove redundant NULL tests in namespace_unlock
  2015-01-02 21:42 ` Eric W. Biederman
                   ` (7 preceding siblings ...)
  (?)
@ 2015-01-02 21:52 ` Eric W. Biederman
  -1 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-02 21:52 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski, Chen Hanxiao,
	Richard Weinberger, Andrey Vagin, Al Viro

mntget, mntput, dput and path_put all test their arguments to see if
they are NULL before taking any action, so testing for NULL in
namespace_unlock is redundant.

Remove the redundant checks making namespace_unlock a little
shorter and easier to read.

This also makes it possible for mnt_ex_mountpoint.mnt to be NULL,
allowing a dentry to be put without a mount.  This will be needed
in __detach_mounts when detaching already unmounted children,
as part of the fix for MNT_DETACH on MNT_LOCKED mounts.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9fae55f2242e..3769dbd040c1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1312,8 +1312,7 @@ static void namespace_unlock(void)
 
 	/* undo decrements we'd done in umount_tree() */
 	list_for_each_entry(mnt, &head, mnt_list)
-		if (mnt->mnt_ex_mountpoint.mnt)
-			mntget(mnt->mnt_ex_mountpoint.mnt);
+		mntget(mnt->mnt_ex_mountpoint.mnt);
 
 	up_write(&namespace_sem);
 
@@ -1322,8 +1321,7 @@ static void namespace_unlock(void)
 	while (!list_empty(&head)) {
 		mnt = list_first_entry(&head, struct mount, mnt_list);
 		list_del_init(&mnt->mnt_list);
-		if (mnt->mnt_ex_mountpoint.mnt)
-			path_put(&mnt->mnt_ex_mountpoint);
+		path_put(&mnt->mnt_ex_mountpoint);
 		mntput(&mnt->mnt);
 	}
 }
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 9/9] mnt: Honor MNT_LOCKED when detaching mounts
       [not found] ` <871tncuaf6.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                     ` (7 preceding siblings ...)
  2015-01-02 21:52   ` [PATCH review 8/9] mnt: Remove redundant NULL tests in namespace_unlock Eric W. Biederman
@ 2015-01-02 21:52   ` Eric W. Biederman
  2015-01-05 20:45   ` [PATCH review 0/11 Call for testing and review of mount detach fixes (take 2) Eric W. Biederman
  9 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-02 21:52 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Modify umount(MNT_DETACH) to keep mounts in the hash table that are
locked to their parent mounts, when the parent is lazily unmounted.
In doing this invert the reference count so that the parent holds a
reference to the children instead of the children holding a reference
to the parent.

Then in mntput_no_expire detach the children and in cleanup_mnt mntput
the children and dput the dentry they were mounted on.

In __detach_mounts if there are any mounts that have been unmounted
but still are on the list of mounts of a mountpoint, detach those
mounts and schedule them to be mntput and their reference to the dentry
to be put when it becomes safe to sleep.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
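The test that decides whether a child stays attached to its unmounted
parent is a mask-and-compare, which is easy to misread.  Here is a
standalone sketch of just that predicate.  The SKETCH_* values are
made up for illustration and are not the kernel's flag values; only
the shape of the expression follows the umount_tree hunk below.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, chosen only so that the bits are distinct. */
#define SKETCH_MNT_LOCKED      0x1u
#define SKETCH_MNT_SYNC_UMOUNT 0x2u
#define SKETCH_MNT_UMOUNT      0x4u

/*
 * A child is kept attached when its parent is being unmounted and the
 * child is locked but not part of a synchronous unmount.  Masking with
 * both bits and comparing against LOCKED alone expresses "LOCKED set
 * and SYNC_UMOUNT clear" in a single comparison.
 */
bool sketch_keep_attached(unsigned int parent_flags, unsigned int child_flags)
{
	return (parent_flags & SKETCH_MNT_UMOUNT) &&
	       ((child_flags & (SKETCH_MNT_LOCKED | SKETCH_MNT_SYNC_UMOUNT)) ==
		SKETCH_MNT_LOCKED);
}
```

Any other bits in the child's flags are masked away before the
comparison, so they do not affect the result.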
 fs/namespace.c | 47 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 43 insertions(+), 4 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 3769dbd040c1..5373343da715 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1017,6 +1017,17 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	return ERR_PTR(err);
 }
 
+static void mntput_children(struct mount *mnt)
+{
+	struct mount *p, *tmp;
+
+	list_for_each_entry_safe(p, tmp, &mnt->mnt_mounts, mnt_child) {
+		list_del_init(&p->mnt_child);
+		path_put(&p->mnt_ex_mountpoint);
+		mntput(&p->mnt);
+	}
+}
+
 static void cleanup_mnt(struct mount *mnt)
 {
 	/*
@@ -1030,6 +1041,8 @@ static void cleanup_mnt(struct mount *mnt)
 	 * so mnt_get_writers() below is safe.
 	 */
 	WARN_ON(mnt_get_writers(mnt));
+	if (unlikely(!list_empty(&mnt->mnt_mounts)))
+		mntput_children(mnt);
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	fsnotify_vfsmount_delete(&mnt->mnt);
@@ -1080,6 +1093,15 @@ static void mntput_no_expire(struct mount *mnt)
 	rcu_read_unlock();
 
 	list_del(&mnt->mnt_instance);
+
+	if (unlikely(!list_empty(&mnt->mnt_mounts))) {
+		struct mount *p;
+		list_for_each_entry(p, &mnt->mnt_mounts,  mnt_child) {
+			__detach_mnt(p, &p->mnt_ex_mountpoint);
+			/* No need to mntput mnt */
+			p->mnt_ex_mountpoint.mnt = NULL;
+		}
+	}
 	unlock_mount_hash();
 
 	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
@@ -1342,7 +1364,7 @@ enum umount_tree_flags {
 static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 {
 	LIST_HEAD(tmp_list);
-	struct mount *p;
+	struct mount *tmp, *p;
 
 	/* Gather the mounts to umount */
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
@@ -1359,7 +1381,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	if (how & UMOUNT_PROPAGATE)
 		propagate_umount(&tmp_list);
 
-	list_for_each_entry(p, &tmp_list, mnt_list) {
+	list_for_each_entry_safe(p, tmp, &tmp_list, mnt_list) {
 		list_del_init(&p->mnt_expire);
 		__touch_mnt_namespace(p->mnt_ns);
 		p->mnt_ns = NULL;
@@ -1367,7 +1389,15 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 		if (mnt_has_parent(p)) {
 			mnt_add_count(p->mnt_parent, -1);
-			__detach_mnt(p, &p->mnt_ex_mountpoint);
+			if ((p->mnt_parent->mnt.mnt_flags & MNT_UMOUNT) &&
+			    ((p->mnt.mnt_flags & (MNT_LOCKED|MNT_SYNC_UMOUNT)) == MNT_LOCKED)) {
+				/* Don't mntput p in namespace_unlock */
+				list_del_init(&p->mnt_list);
+				/* Don't forget about p */
+				list_add_tail(&p->mnt_child, &p->mnt_parent->mnt_mounts);
+			} else {
+				__detach_mnt(p, &p->mnt_ex_mountpoint);
+			}
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
@@ -1493,7 +1523,16 @@ void __detach_mounts(struct dentry *dentry)
 	lock_mount_hash();
 	while (!hlist_empty(&mp->m_list)) {
 		mnt = hlist_entry(mp->m_list.first, struct mount, mnt_mp_list);
-		umount_tree(mnt, 0);
+		if (mnt->mnt.mnt_flags & MNT_UMOUNT) {
+			struct mount *p, *tmp;
+			list_for_each_entry_safe(p, tmp, &mnt->mnt_mounts,  mnt_child) {
+				detach_mnt(p, &p->mnt_ex_mountpoint);
+				/* p->mnt_parent has already been mntput */
+				p->mnt_ex_mountpoint.mnt = NULL;
+				list_add_tail(&p->mnt_list, &unmounted);
+			}
+		}
+		else umount_tree(mnt, 0);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 9/9] mnt: Honor MNT_LOCKED when detaching mounts
  2015-01-02 21:42 ` Eric W. Biederman
                   ` (8 preceding siblings ...)
  (?)
@ 2015-01-02 21:52 ` Eric W. Biederman
       [not found]   ` <1420235574-15177-9-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  -1 siblings, 1 reply; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-02 21:52 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski, Chen Hanxiao,
	Richard Weinberger, Andrey Vagin, Al Viro

Modify umount(MNT_DETACH) to keep mounts in the hash table that are
locked to their parent mounts, when the parent is lazily unmounted.
In doing this invert the reference count so that the parent holds a
reference to the children instead of the children holding a reference
to the parent.

Then in mntput_no_expire detach the children and in cleanup_mnt mntput
the children and dput the dentry they were mounted on.

In __detach_mounts if there are any mounts that have been unmounted
but still are on the list of mounts of a mountpoint, detach those
mounts and schedule them to be mntput and their reference to the dentry
to be put when it becomes safe to sleep.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 47 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 43 insertions(+), 4 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 3769dbd040c1..5373343da715 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1017,6 +1017,17 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	return ERR_PTR(err);
 }
 
+static void mntput_children(struct mount *mnt)
+{
+	struct mount *p, *tmp;
+
+	list_for_each_entry_safe(p, tmp, &mnt->mnt_mounts, mnt_child) {
+		list_del_init(&p->mnt_child);
+		path_put(&p->mnt_ex_mountpoint);
+		mntput(&p->mnt);
+	}
+}
+
 static void cleanup_mnt(struct mount *mnt)
 {
 	/*
@@ -1030,6 +1041,8 @@ static void cleanup_mnt(struct mount *mnt)
 	 * so mnt_get_writers() below is safe.
 	 */
 	WARN_ON(mnt_get_writers(mnt));
+	if (unlikely(!list_empty(&mnt->mnt_mounts)))
+		mntput_children(mnt);
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	fsnotify_vfsmount_delete(&mnt->mnt);
@@ -1080,6 +1093,15 @@ static void mntput_no_expire(struct mount *mnt)
 	rcu_read_unlock();
 
 	list_del(&mnt->mnt_instance);
+
+	if (unlikely(!list_empty(&mnt->mnt_mounts))) {
+		struct mount *p;
+		list_for_each_entry(p, &mnt->mnt_mounts,  mnt_child) {
+			__detach_mnt(p, &p->mnt_ex_mountpoint);
+			/* No need to mntput mnt */
+			p->mnt_ex_mountpoint.mnt = NULL;
+		}
+	}
 	unlock_mount_hash();
 
 	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
@@ -1342,7 +1364,7 @@ enum umount_tree_flags {
 static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 {
 	LIST_HEAD(tmp_list);
-	struct mount *p;
+	struct mount *tmp, *p;
 
 	/* Gather the mounts to umount */
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
@@ -1359,7 +1381,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	if (how & UMOUNT_PROPAGATE)
 		propagate_umount(&tmp_list);
 
-	list_for_each_entry(p, &tmp_list, mnt_list) {
+	list_for_each_entry_safe(p, tmp, &tmp_list, mnt_list) {
 		list_del_init(&p->mnt_expire);
 		__touch_mnt_namespace(p->mnt_ns);
 		p->mnt_ns = NULL;
@@ -1367,7 +1389,15 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 		if (mnt_has_parent(p)) {
 			mnt_add_count(p->mnt_parent, -1);
-			__detach_mnt(p, &p->mnt_ex_mountpoint);
+			if ((p->mnt_parent->mnt.mnt_flags & MNT_UMOUNT) &&
+			    ((p->mnt.mnt_flags & (MNT_LOCKED|MNT_SYNC_UMOUNT)) == MNT_LOCKED)) {
+				/* Don't mntput p in namespace_unlock */
+				list_del_init(&p->mnt_list);
+				/* Don't forget about p */
+				list_add_tail(&p->mnt_child, &p->mnt_parent->mnt_mounts);
+			} else {
+				__detach_mnt(p, &p->mnt_ex_mountpoint);
+			}
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
@@ -1493,7 +1523,16 @@ void __detach_mounts(struct dentry *dentry)
 	lock_mount_hash();
 	while (!hlist_empty(&mp->m_list)) {
 		mnt = hlist_entry(mp->m_list.first, struct mount, mnt_mp_list);
-		umount_tree(mnt, 0);
+		if (mnt->mnt.mnt_flags & MNT_UMOUNT) {
+			struct mount *p, *tmp;
+			list_for_each_entry_safe(p, tmp, &mnt->mnt_mounts,  mnt_child) {
+				detach_mnt(p, &p->mnt_ex_mountpoint);
+				/* p->mnt_parent has already been mntput */
+				p->mnt_ex_mountpoint.mnt = NULL;
+				list_add_tail(&p->mnt_list, &unmounted);
+			}
+		}
+		else umount_tree(mnt, 0);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* Re: [PATCH review 9/9] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]   ` <1420235574-15177-9-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2015-01-03  2:27     ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-03  2:27 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Richard Weinberger,
	Andrey Vagin, Al Viro, Andy Lutomirski

"Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> writes:

> Modify umount(MNT_DETACH) to keep mounts in the hash table that are
> locked to their parent mounts, when the parent is lazily unmounted.
> In doing this invert the reference count so that the parent holds a
> reference to the children instead of the children holding a reference
> to the parent.
>
> Then in mntput_no_expire detach the children and in cleanup_mnt mntput
> the children and dput the dentry they were mounted on.
>
> In __detach_mounts if there are any mounts that have been unmounted
> but still are on the list of mounts of a mountpoint, detach those
> mounts and schedule them to be mntput and their reference to the dentry
> to be put when it becomes safe to sleep.

Arr.  This isn't quite enough.  It does not properly handle the
combination of mount propagation, MNT_DETACH and MNT_LOCKED.

I am not even certain MNT_DETACH needs to be involved.  Back to banging
my head against this one.  Sigh.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH review 0/11 Call for testing and review of mount detach fixes (take 2)
       [not found] ` <871tncuaf6.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                     ` (8 preceding siblings ...)
  2015-01-02 21:52   ` [PATCH review 9/9] mnt: Honor MNT_LOCKED when detaching mounts Eric W. Biederman
@ 2015-01-05 20:45   ` Eric W. Biederman
       [not found]     ` <87mw5xq7lt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                       ` (10 more replies)
  9 siblings, 11 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:45 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Way back in October Andrey Vagin reported that umount(MNT_DETACH) could
be used to defeat MNT_LOCKED.  As I worked to fix this I discovered
that, combined with mount propagation and an appropriate selection of
shared subtrees, a reference to a directory on an unmounted filesystem
is not necessary.

That MNT_DETACH is allowed in a user namespace in a form that can break
MNT_LOCKED comes from my early misunderstanding of what MNT_DETACH does.

To avoid breaking existing userspace the conflict between MNT_DETACH and
MNT_LOCKED is fixed by leaving mounts that are locked to their parents
in the mount hash table until the last reference goes away.

While investigating this issue I also found an issue with
__detach_mounts.  The code was unnecessarily and incorrectly triggering
mount propagation, resulting in too many mounts going away when a
directory is deleted, and too many cpu cycles burned while doing that.

For those who like to see everything in a single tree the code is at:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

Eric W. Biederman (11):
      mnt: Improve the umount_tree flags
      mnt: Don't propagate umounts in __detach_mounts
      mnt: In umount_tree reuse mnt_list instead of mnt_hash
      mnt: Add MNT_UMOUNT flag
      mnt: Delay removal from the mount hash.
      mnt: Factor out __detach_mnt from detach_mnt
      mnt: Simplify umount_tree
      mnt: Remove redundant NULL tests in namespace_unlock
      mnt: On an unmount propagate clearing of MNT_LOCKED
      mnt: Don't propagate unmounts to locked mounts
      mnt: Honor MNT_LOCKED when detaching mounts

 fs/namespace.c        | 152 +++++++++++++++++++++++++++++++-------------------
 fs/pnode.c            |  60 +++++++++++++++++---
 fs/pnode.h            |   7 ++-
 include/linux/mount.h |   1 +
 4 files changed, 154 insertions(+), 66 deletions(-)

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH review 01/11] mnt: Improve the umount_tree flags
       [not found]     ` <87mw5xq7lt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-01-05 20:46       ` Eric W. Biederman
  2015-01-05 20:46       ` [PATCH review 02/11] mnt: Don't propagate umounts in __detach_mounts Eric W. Biederman
                         ` (10 subsequent siblings)
  11 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

- Remove the unneeded declaration from pnode.h
- Mark umount_tree static as it has no callers outside of namespace.c
- Define an enumeration of umount_tree's flags.
- Pass umount_tree's flags in by name

This removes the magic numbers 0, 1 and 2, making the code a little
clearer, and makes it possible for there to be lazy unmounts that don't
propagate, which is what __detach_mounts actually wants, for example.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
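For anyone checking the call-site conversions, here is a sketch of how
the old magic numbers map onto the new flags.  The enum values are the
ones this patch introduces; the enum tag and the helper
sketch_legacy_how_to_flags are hypothetical and exist only in this
note.  The two tests in the helper are the ones umount_tree used
before this change.

```c
#include <assert.h>

/* Same names and values as the enumeration added by this patch. */
enum sketch_umount_tree_flags {
	UMOUNT_SYNC = 1,
	UMOUNT_PROPAGATE = 2,
};

/*
 * Translate a legacy "how" value (0, 1 or 2) into the named flags by
 * replaying the two tests the old code applied to it.
 */
int sketch_legacy_how_to_flags(int how)
{
	int flags = 0;

	if (how)	/* old code propagated the umount when how != 0 */
		flags |= UMOUNT_PROPAGATE;
	if (how < 2)	/* old code set MNT_SYNC_UMOUNT when how < 2 */
		flags |= UMOUNT_SYNC;

	return flags;
}
```

The combination with neither flag set has no legacy equivalent, which
is the lazy unmount without propagation mentioned above.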
 fs/namespace.c | 31 ++++++++++++++++---------------
 fs/pnode.h     |  1 -
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index cd1e9681a0cf..5bb96c440b31 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1323,14 +1323,15 @@ static inline void namespace_lock(void)
 	down_write(&namespace_sem);
 }
 
+enum umount_tree_flags {
+	UMOUNT_SYNC = 1,
+	UMOUNT_PROPAGATE = 2,
+};
 /*
  * mount_lock must be held
  * namespace_sem must be held for write
- * how = 0 => just this tree, don't propagate
- * how = 1 => propagate; we know that nobody else has reference to any victims
- * how = 2 => lazy umount
  */
-void umount_tree(struct mount *mnt, int how)
+static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 {
 	HLIST_HEAD(tmp_list);
 	struct mount *p;
@@ -1344,7 +1345,7 @@ void umount_tree(struct mount *mnt, int how)
 	hlist_for_each_entry(p, &tmp_list, mnt_hash)
 		list_del_init(&p->mnt_child);
 
-	if (how)
+	if (how & UMOUNT_PROPAGATE)
 		propagate_umount(&tmp_list);
 
 	hlist_for_each_entry(p, &tmp_list, mnt_hash) {
@@ -1352,7 +1353,7 @@ void umount_tree(struct mount *mnt, int how)
 		list_del_init(&p->mnt_list);
 		__touch_mnt_namespace(p->mnt_ns);
 		p->mnt_ns = NULL;
-		if (how < 2)
+		if (how & UMOUNT_SYNC)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 		if (mnt_has_parent(p)) {
 			hlist_del_init(&p->mnt_mp_list);
@@ -1457,14 +1458,14 @@ static int do_umount(struct mount *mnt, int flags)
 
 	if (flags & MNT_DETACH) {
 		if (!list_empty(&mnt->mnt_list))
-			umount_tree(mnt, 2);
+			umount_tree(mnt, UMOUNT_PROPAGATE);
 		retval = 0;
 	} else {
 		shrink_submounts(mnt);
 		retval = -EBUSY;
 		if (!propagate_mount_busy(mnt, 2)) {
 			if (!list_empty(&mnt->mnt_list))
-				umount_tree(mnt, 1);
+				umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
 			retval = 0;
 		}
 	}
@@ -1496,7 +1497,7 @@ void __detach_mounts(struct dentry *dentry)
 	lock_mount_hash();
 	while (!hlist_empty(&mp->m_list)) {
 		mnt = hlist_entry(mp->m_list.first, struct mount, mnt_mp_list);
-		umount_tree(mnt, 2);
+		umount_tree(mnt, UMOUNT_PROPAGATE);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
@@ -1658,7 +1659,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
 out:
 	if (res) {
 		lock_mount_hash();
-		umount_tree(res, 0);
+		umount_tree(res, UMOUNT_SYNC);
 		unlock_mount_hash();
 	}
 	return q;
@@ -1682,7 +1683,7 @@ void drop_collected_mounts(struct vfsmount *mnt)
 {
 	namespace_lock();
 	lock_mount_hash();
-	umount_tree(real_mount(mnt), 0);
+	umount_tree(real_mount(mnt), UMOUNT_SYNC);
 	unlock_mount_hash();
 	namespace_unlock();
 }
@@ -1865,7 +1866,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
  out_cleanup_ids:
 	while (!hlist_empty(&tree_list)) {
 		child = hlist_entry(tree_list.first, struct mount, mnt_hash);
-		umount_tree(child, 0);
+		umount_tree(child, UMOUNT_SYNC);
 	}
 	unlock_mount_hash();
 	cleanup_group_ids(source_mnt, NULL);
@@ -2045,7 +2046,7 @@ static int do_loopback(struct path *path, const char *old_name,
 	err = graft_tree(mnt, parent, mp);
 	if (err) {
 		lock_mount_hash();
-		umount_tree(mnt, 0);
+		umount_tree(mnt, UMOUNT_SYNC);
 		unlock_mount_hash();
 	}
 out2:
@@ -2416,7 +2417,7 @@ void mark_mounts_for_expiry(struct list_head *mounts)
 	while (!list_empty(&graveyard)) {
 		mnt = list_first_entry(&graveyard, struct mount, mnt_expire);
 		touch_mnt_namespace(mnt->mnt_ns);
-		umount_tree(mnt, 1);
+		umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
 	}
 	unlock_mount_hash();
 	namespace_unlock();
@@ -2487,7 +2488,7 @@ static void shrink_submounts(struct mount *mnt)
 			m = list_first_entry(&graveyard, struct mount,
 						mnt_expire);
 			touch_mnt_namespace(m->mnt_ns);
-			umount_tree(m, 1);
+			umount_tree(m, UMOUNT_PROPAGATE|UMOUNT_SYNC);
 		}
 	}
 }
diff --git a/fs/pnode.h b/fs/pnode.h
index 4a246358b031..16afc3d6d2f2 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -47,7 +47,6 @@ int get_dominating_id(struct mount *mnt, const struct path *root);
 unsigned int mnt_get_count(struct mount *mnt);
 void mnt_set_mountpoint(struct mount *, struct mountpoint *,
 			struct mount *);
-void umount_tree(struct mount *, int);
 struct mount *copy_tree(struct mount *, struct dentry *, int);
 bool is_path_reachable(struct mount *, struct dentry *,
 			 const struct path *root);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 02/11] mnt: Don't propagate umounts in __detach_mounts
       [not found]     ` <87mw5xq7lt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-01-05 20:46       ` [PATCH review 01/11] mnt: Improve the umount_tree flags Eric W. Biederman
@ 2015-01-05 20:46       ` Eric W. Biederman
  2015-01-05 20:46       ` [PATCH review 03/11] mnt: In umount_tree reuse mnt_list instead of mnt_hash Eric W. Biederman
                         ` (9 subsequent siblings)
  11 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Invoking mount propagation from __detach_mounts is inefficient and
wrong.

It is inefficient because __detach_mounts already walks the list of
mounts where something needs to be done, and mount propagation
walks some subset of those mounts again.

It is actively wrong because if the dentry that is passed to
__detach_mounts is not part of the path to a mount, that mount should
not be affected.

change_mnt_propagation(p,MS_PRIVATE) modifies the mount propagation
tree of a master mount so its slaves are connected to another master
if possible.  This means that even removing a mount from the middle of
a mount tree with __detach_mounts will not deprive any mount of
propagated mount events.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 5bb96c440b31..07d0562290a5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1497,7 +1497,7 @@ void __detach_mounts(struct dentry *dentry)
 	lock_mount_hash();
 	while (!hlist_empty(&mp->m_list)) {
 		mnt = hlist_entry(mp->m_list.first, struct mount, mnt_mp_list);
-		umount_tree(mnt, UMOUNT_PROPAGATE);
+		umount_tree(mnt, 0);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 02/11] mnt: Don't propagate umounts in __detach_mounts
  2015-01-05 20:45   ` [PATCH review 0/11 Call for testing and review of mount detach fixes (take 2) Eric W. Biederman
       [not found]     ` <87mw5xq7lt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-01-05 20:46     ` Eric W. Biederman
  2015-01-05 20:46     ` [PATCH review 03/11] mnt: In umount_tree reuse mnt_list instead of mnt_hash Eric W. Biederman
                       ` (8 subsequent siblings)
  10 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski, Chen Hanxiao,
	Richard Weinberger, Andrey Vagin, Al Viro

Invoking mount propagation from __detach_mounts is inefficient and
wrong.

It is inefficient because __detach_mounts already walks the list of
mounts where something needs to be done, and mount propagation
walks some subset of those mounts again.

It is actively wrong because if the dentry that is passed to
__detach_mounts is not part of the path to a mount, that mount should
not be affected.

change_mnt_propagation(p,MS_PRIVATE) modifies the mount propagation
tree of a master mount so its slaves are connected to another master
if possible.  This means that even removing a mount from the middle of
a mount tree with __detach_mounts will not deprive any mount of
propagated mount events.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 5bb96c440b31..07d0562290a5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1497,7 +1497,7 @@ void __detach_mounts(struct dentry *dentry)
 	lock_mount_hash();
 	while (!hlist_empty(&mp->m_list)) {
 		mnt = hlist_entry(mp->m_list.first, struct mount, mnt_mp_list);
-		umount_tree(mnt, UMOUNT_PROPAGATE);
+		umount_tree(mnt, 0);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 03/11] mnt: In umount_tree reuse mnt_list instead of mnt_hash
       [not found]     ` <87mw5xq7lt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-01-05 20:46       ` [PATCH review 01/11] mnt: Improve the umount_tree flags Eric W. Biederman
  2015-01-05 20:46       ` [PATCH review 02/11] mnt: Don't propagate umounts in __detach_mounts Eric W. Biederman
@ 2015-01-05 20:46       ` Eric W. Biederman
  2015-01-05 20:46       ` [PATCH review 04/11] mnt: Add MNT_UMOUNT flag Eric W. Biederman
                         ` (8 subsequent siblings)
  11 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

umount_tree builds a list of mounts that need to be unmounted.
Utilize mnt_list for this purpose instead of mnt_hash, as mnt_list is
an ordinary list_head, allowing the use of list_splice and list_move
instead of rolling our own.

This also begins to allow keeping a mount on the mnt_hash after it is
unmounted, which is necessary for a properly functioning MNT_LOCKED
implementation.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
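Illustration only, not part of the patch: a minimal userspace sketch of
the gather-then-splice pattern that an ordinary list_head allows.  The
list helpers below are simplified stand-ins written for this note, not
the kernel's <linux/list.h>; struct toy_mount, toy_umount() and
list_count() are hypothetical names.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

#define LIST_HEAD_INIT(name) { &(name), &(name) }

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h; h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Unlink entry from its current list and put it at the front of head. */
static void list_move(struct list_head *entry, struct list_head *head)
{
	list_del(entry);
	list_add(entry, head);
}

/* Join every element of list onto the front of head; list is left stale. */
static void list_splice(const struct list_head *list, struct list_head *head)
{
	struct list_head *first = list->next, *last = list->prev;

	if (list_empty(list))
		return;
	first->prev = head;
	last->next = head->next;
	head->next->prev = last;
	head->next = first;
}

static size_t list_count(const struct list_head *h)
{
	size_t n = 0;

	for (const struct list_head *p = h->next; p != h; p = p->next)
		n++;
	return n;
}

/* Toy mount: only the list linkage matters for this sketch. */
struct toy_mount { int id; struct list_head mnt_list; };

/* Gather n mounts on a temporary list, then hand them over in one splice. */
static void toy_umount(struct toy_mount *mounts, size_t n,
		       struct list_head *unmounted)
{
	struct list_head tmp_list;

	INIT_LIST_HEAD(&tmp_list);
	for (size_t i = 0; i < n; i++)
		list_move(&mounts[i].mnt_list, &tmp_list);
	list_splice(&tmp_list, unmounted);
}
```

Because list_move unlinks the entry from whichever list it was on, the
gather loop needs no separate removal step, and a single splice
replaces hand-written pointer fix-ups.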
 fs/namespace.c | 45 +++++++++++++++++++--------------------------
 fs/pnode.c     |  6 +++---
 fs/pnode.h     |  2 +-
 3 files changed, 23 insertions(+), 30 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 07d0562290a5..44478b6e3719 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1285,23 +1285,22 @@ int may_umount(struct vfsmount *mnt)
 
 EXPORT_SYMBOL(may_umount);
 
-static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
+static LIST_HEAD(unmounted);	/* protected by namespace_sem */
 
 static void namespace_unlock(void)
 {
 	struct mount *mnt;
-	struct hlist_head head = unmounted;
+	LIST_HEAD(head);
 
-	if (likely(hlist_empty(&head))) {
+	if (likely(list_empty(&unmounted))) {
 		up_write(&namespace_sem);
 		return;
 	}
 
-	head.first->pprev = &head.first;
-	INIT_HLIST_HEAD(&unmounted);
+	list_splice_init(&unmounted, &head);
 
 	/* undo decrements we'd done in umount_tree() */
-	hlist_for_each_entry(mnt, &head, mnt_hash)
+	list_for_each_entry(mnt, &head, mnt_list)
 		if (mnt->mnt_ex_mountpoint.mnt)
 			mntget(mnt->mnt_ex_mountpoint.mnt);
 
@@ -1309,9 +1308,9 @@ static void namespace_unlock(void)
 
 	synchronize_rcu();
 
-	while (!hlist_empty(&head)) {
-		mnt = hlist_entry(head.first, struct mount, mnt_hash);
-		hlist_del_init(&mnt->mnt_hash);
+	while (!list_empty(&head)) {
+		mnt = list_first_entry(&head, struct mount, mnt_list);
+		list_del_init(&mnt->mnt_list);
 		if (mnt->mnt_ex_mountpoint.mnt)
 			path_put(&mnt->mnt_ex_mountpoint);
 		mntput(&mnt->mnt);
@@ -1333,24 +1332,25 @@ enum umount_tree_flags {
  */
 static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 {
-	HLIST_HEAD(tmp_list);
+	LIST_HEAD(tmp_list);
 	struct mount *p;
-	struct mount *last = NULL;
 
-	for (p = mnt; p; p = next_mnt(p, mnt)) {
-		hlist_del_init_rcu(&p->mnt_hash);
-		hlist_add_head(&p->mnt_hash, &tmp_list);
-	}
+	/* Gather the mounts to umount */
+	for (p = mnt; p; p = next_mnt(p, mnt))
+		list_move(&p->mnt_list, &tmp_list);
 
-	hlist_for_each_entry(p, &tmp_list, mnt_hash)
+	/* Hide the mounts from lookup_mnt and mnt_mounts */
+	list_for_each_entry(p, &tmp_list, mnt_list) {
+		hlist_del_init_rcu(&p->mnt_hash);
 		list_del_init(&p->mnt_child);
+	}
 
+	/* Add propogated mounts to the tmp_list */
 	if (how & UMOUNT_PROPAGATE)
 		propagate_umount(&tmp_list);
 
-	hlist_for_each_entry(p, &tmp_list, mnt_hash) {
+	list_for_each_entry(p, &tmp_list, mnt_list) {
 		list_del_init(&p->mnt_expire);
-		list_del_init(&p->mnt_list);
 		__touch_mnt_namespace(p->mnt_ns);
 		p->mnt_ns = NULL;
 		if (how & UMOUNT_SYNC)
@@ -1367,15 +1367,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 			p->mnt_mp = NULL;
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
-		last = p;
-	}
-	if (last) {
-		last->mnt_hash.next = unmounted.first;
-		if (unmounted.first)
-			unmounted.first->pprev = &last->mnt_hash.next;
-		unmounted.first = tmp_list.first;
-		unmounted.first->pprev = &unmounted.first;
 	}
+	list_splice(&tmp_list, &unmounted);
 }
 
 static void shrink_submounts(struct mount *mnt);
diff --git a/fs/pnode.c b/fs/pnode.c
index 260ac8f898a4..bf012af709dd 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -384,7 +384,7 @@ static void __propagate_umount(struct mount *mnt)
 		if (child && list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
 			hlist_del_init_rcu(&child->mnt_hash);
-			hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
+			list_move_tail(&child->mnt_list, &mnt->mnt_list);
 		}
 	}
 }
@@ -396,11 +396,11 @@ static void __propagate_umount(struct mount *mnt)
  *
  * vfsmount lock must be held for write
  */
-int propagate_umount(struct hlist_head *list)
+int propagate_umount(struct list_head *list)
 {
 	struct mount *mnt;
 
-	hlist_for_each_entry(mnt, list, mnt_hash)
+	list_for_each_entry(mnt, list, mnt_list)
 		__propagate_umount(mnt);
 	return 0;
 }
diff --git a/fs/pnode.h b/fs/pnode.h
index 16afc3d6d2f2..aa6d65df7204 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -40,7 +40,7 @@ static inline void set_mnt_shared(struct mount *mnt)
 void change_mnt_propagation(struct mount *, int);
 int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
 		struct hlist_head *);
-int propagate_umount(struct hlist_head *);
+int propagate_umount(struct list_head *);
 int propagate_mount_busy(struct mount *, int);
 void mnt_release_group_id(struct mount *);
 int get_dominating_id(struct mount *mnt, const struct path *root);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 03/11] mnt: In umount_tree reuse mnt_list instead of mnt_hash
  2015-01-05 20:45   ` [PATCH review 0/11 Call for testing and review of mount detach fixes (take 2) Eric W. Biederman
       [not found]     ` <87mw5xq7lt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-01-05 20:46     ` [PATCH review 02/11] mnt: Don't propagate umounts in __detach_mounts Eric W. Biederman
@ 2015-01-05 20:46     ` Eric W. Biederman
  2015-01-05 20:46     ` [PATCH review 04/11] mnt: Add MNT_UMOUNT flag Eric W. Biederman
                       ` (7 subsequent siblings)
  10 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski, Chen Hanxiao,
	Richard Weinberger, Andrey Vagin, Al Viro

umount_tree builds a list of mounts that need to be unmounted.
Utilize mnt_list for this purpose instead of mnt_hash, as mnt_list is
an ordinary list_head, allowing the use of list_splice and list_move
instead of rolling our own.

This also begins to allow keeping a mount on the mnt_hash after it is
unmounted, which is necessary for a properly functioning MNT_LOCKED
implementation.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 45 +++++++++++++++++++--------------------------
 fs/pnode.c     |  6 +++---
 fs/pnode.h     |  2 +-
 3 files changed, 23 insertions(+), 30 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 07d0562290a5..44478b6e3719 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1285,23 +1285,22 @@ int may_umount(struct vfsmount *mnt)
 
 EXPORT_SYMBOL(may_umount);
 
-static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
+static LIST_HEAD(unmounted);	/* protected by namespace_sem */
 
 static void namespace_unlock(void)
 {
 	struct mount *mnt;
-	struct hlist_head head = unmounted;
+	LIST_HEAD(head);
 
-	if (likely(hlist_empty(&head))) {
+	if (likely(list_empty(&unmounted))) {
 		up_write(&namespace_sem);
 		return;
 	}
 
-	head.first->pprev = &head.first;
-	INIT_HLIST_HEAD(&unmounted);
+	list_splice_init(&unmounted, &head);
 
 	/* undo decrements we'd done in umount_tree() */
-	hlist_for_each_entry(mnt, &head, mnt_hash)
+	list_for_each_entry(mnt, &head, mnt_list)
 		if (mnt->mnt_ex_mountpoint.mnt)
 			mntget(mnt->mnt_ex_mountpoint.mnt);
 
@@ -1309,9 +1308,9 @@ static void namespace_unlock(void)
 
 	synchronize_rcu();
 
-	while (!hlist_empty(&head)) {
-		mnt = hlist_entry(head.first, struct mount, mnt_hash);
-		hlist_del_init(&mnt->mnt_hash);
+	while (!list_empty(&head)) {
+		mnt = list_first_entry(&head, struct mount, mnt_list);
+		list_del_init(&mnt->mnt_list);
 		if (mnt->mnt_ex_mountpoint.mnt)
 			path_put(&mnt->mnt_ex_mountpoint);
 		mntput(&mnt->mnt);
@@ -1333,24 +1332,25 @@ enum umount_tree_flags {
  */
 static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 {
-	HLIST_HEAD(tmp_list);
+	LIST_HEAD(tmp_list);
 	struct mount *p;
-	struct mount *last = NULL;
 
-	for (p = mnt; p; p = next_mnt(p, mnt)) {
-		hlist_del_init_rcu(&p->mnt_hash);
-		hlist_add_head(&p->mnt_hash, &tmp_list);
-	}
+	/* Gather the mounts to umount */
+	for (p = mnt; p; p = next_mnt(p, mnt))
+		list_move(&p->mnt_list, &tmp_list);
 
-	hlist_for_each_entry(p, &tmp_list, mnt_hash)
+	/* Hide the mounts from lookup_mnt and mnt_mounts */
+	list_for_each_entry(p, &tmp_list, mnt_list) {
+		hlist_del_init_rcu(&p->mnt_hash);
 		list_del_init(&p->mnt_child);
+	}
 
+	/* Add propogated mounts to the tmp_list */
 	if (how & UMOUNT_PROPAGATE)
 		propagate_umount(&tmp_list);
 
-	hlist_for_each_entry(p, &tmp_list, mnt_hash) {
+	list_for_each_entry(p, &tmp_list, mnt_list) {
 		list_del_init(&p->mnt_expire);
-		list_del_init(&p->mnt_list);
 		__touch_mnt_namespace(p->mnt_ns);
 		p->mnt_ns = NULL;
 		if (how & UMOUNT_SYNC)
@@ -1367,15 +1367,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 			p->mnt_mp = NULL;
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
-		last = p;
-	}
-	if (last) {
-		last->mnt_hash.next = unmounted.first;
-		if (unmounted.first)
-			unmounted.first->pprev = &last->mnt_hash.next;
-		unmounted.first = tmp_list.first;
-		unmounted.first->pprev = &unmounted.first;
 	}
+	list_splice(&tmp_list, &unmounted);
 }
 
 static void shrink_submounts(struct mount *mnt);
diff --git a/fs/pnode.c b/fs/pnode.c
index 260ac8f898a4..bf012af709dd 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -384,7 +384,7 @@ static void __propagate_umount(struct mount *mnt)
 		if (child && list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
 			hlist_del_init_rcu(&child->mnt_hash);
-			hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
+			list_move_tail(&child->mnt_list, &mnt->mnt_list);
 		}
 	}
 }
@@ -396,11 +396,11 @@ static void __propagate_umount(struct mount *mnt)
  *
  * vfsmount lock must be held for write
  */
-int propagate_umount(struct hlist_head *list)
+int propagate_umount(struct list_head *list)
 {
 	struct mount *mnt;
 
-	hlist_for_each_entry(mnt, list, mnt_hash)
+	list_for_each_entry(mnt, list, mnt_list)
 		__propagate_umount(mnt);
 	return 0;
 }
diff --git a/fs/pnode.h b/fs/pnode.h
index 16afc3d6d2f2..aa6d65df7204 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -40,7 +40,7 @@ static inline void set_mnt_shared(struct mount *mnt)
 void change_mnt_propagation(struct mount *, int);
 int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
 		struct hlist_head *);
-int propagate_umount(struct hlist_head *);
+int propagate_umount(struct list_head *);
 int propagate_mount_busy(struct mount *, int);
 void mnt_release_group_id(struct mount *);
 int get_dominating_id(struct mount *mnt, const struct path *root);
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 04/11] mnt: Add MNT_UMOUNT flag
       [not found]     ` <87mw5xq7lt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                         ` (2 preceding siblings ...)
  2015-01-05 20:46       ` [PATCH review 03/11] mnt: In umount_tree reuse mnt_list instead of mnt_hash Eric W. Biederman
@ 2015-01-05 20:46       ` Eric W. Biederman
  2015-01-05 20:46       ` [PATCH review 05/11] mnt: Delay removal from the mount hash Eric W. Biederman
                         ` (7 subsequent siblings)
  11 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

In some instances it is necessary to know if the unmounting
process has begun on a mount.  Add MNT_UMOUNT to make that reliably
testable.

This flag gets used in fixing the handling of locked mounts with
MNT_DETACH.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c        | 4 +++-
 fs/pnode.c            | 1 +
 include/linux/mount.h | 1 +
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 44478b6e3719..60d4160cd2f4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1336,8 +1336,10 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	struct mount *p;
 
 	/* Gather the mounts to umount */
-	for (p = mnt; p; p = next_mnt(p, mnt))
+	for (p = mnt; p; p = next_mnt(p, mnt)) {
+		p->mnt.mnt_flags |= MNT_UMOUNT;
 		list_move(&p->mnt_list, &tmp_list);
+	}
 
 	/* Hide the mounts from lookup_mnt and mnt_mounts */
 	list_for_each_entry(p, &tmp_list, mnt_list) {
diff --git a/fs/pnode.c b/fs/pnode.c
index bf012af709dd..ac3aa0d43b90 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -384,6 +384,7 @@ static void __propagate_umount(struct mount *mnt)
 		if (child && list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
 			hlist_del_init_rcu(&child->mnt_hash);
+			child->mnt.mnt_flags |= MNT_UMOUNT;
 			list_move_tail(&child->mnt_list, &mnt->mnt_list);
 		}
 	}
diff --git a/include/linux/mount.h b/include/linux/mount.h
index c2c561dc0114..564beeec5d83 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -61,6 +61,7 @@ struct mnt_namespace;
 #define MNT_DOOMED		0x1000000
 #define MNT_SYNC_UMOUNT		0x2000000
 #define MNT_MARKED		0x4000000
+#define MNT_UMOUNT		0x8000000
 
 struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 04/11] mnt: Add MNT_UMOUNT flag
  2015-01-05 20:45   ` [PATCH review 0/11 Call for testing and review of mount detach fixes (take 2) Eric W. Biederman
                       ` (2 preceding siblings ...)
  2015-01-05 20:46     ` [PATCH review 03/11] mnt: In umount_tree reuse mnt_list instead of mnt_hash Eric W. Biederman
@ 2015-01-05 20:46     ` Eric W. Biederman
  2015-01-05 20:46     ` [PATCH review 06/11] mnt: Factor out __detach_mnt from detach_mnt Eric W. Biederman
                       ` (6 subsequent siblings)
  10 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski, Chen Hanxiao,
	Richard Weinberger, Andrey Vagin, Al Viro

In some instances it is necessary to know if the unmounting
process has begun on a mount.  Add MNT_UMOUNT to make that reliably
testable.

This flag gets used in fixing the handling of locked mounts with
MNT_DETACH.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c        | 4 +++-
 fs/pnode.c            | 1 +
 include/linux/mount.h | 1 +
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 44478b6e3719..60d4160cd2f4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1336,8 +1336,10 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	struct mount *p;
 
 	/* Gather the mounts to umount */
-	for (p = mnt; p; p = next_mnt(p, mnt))
+	for (p = mnt; p; p = next_mnt(p, mnt)) {
+		p->mnt.mnt_flags |= MNT_UMOUNT;
 		list_move(&p->mnt_list, &tmp_list);
+	}
 
 	/* Hide the mounts from lookup_mnt and mnt_mounts */
 	list_for_each_entry(p, &tmp_list, mnt_list) {
diff --git a/fs/pnode.c b/fs/pnode.c
index bf012af709dd..ac3aa0d43b90 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -384,6 +384,7 @@ static void __propagate_umount(struct mount *mnt)
 		if (child && list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
 			hlist_del_init_rcu(&child->mnt_hash);
+			child->mnt.mnt_flags |= MNT_UMOUNT;
 			list_move_tail(&child->mnt_list, &mnt->mnt_list);
 		}
 	}
diff --git a/include/linux/mount.h b/include/linux/mount.h
index c2c561dc0114..564beeec5d83 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -61,6 +61,7 @@ struct mnt_namespace;
 #define MNT_DOOMED		0x1000000
 #define MNT_SYNC_UMOUNT		0x2000000
 #define MNT_MARKED		0x4000000
+#define MNT_UMOUNT		0x8000000
 
 struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 05/11] mnt: Delay removal from the mount hash.
       [not found]     ` <87mw5xq7lt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                         ` (3 preceding siblings ...)
  2015-01-05 20:46       ` [PATCH review 04/11] mnt: Add MNT_UMOUNT flag Eric W. Biederman
@ 2015-01-05 20:46       ` Eric W. Biederman
  2015-01-05 20:46       ` [PATCH review 06/11] mnt: Factor out __detach_mnt from detach_mnt Eric W. Biederman
                         ` (6 subsequent siblings)
  11 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

- Modify __lookup_mnt_last to ignore mounts that have MNT_UMOUNT set.
- Don't remove mounts from the mount hash table in propagate_umount.
- Don't remove mounts from the mount hash table in umount_tree before
  the entire list of mounts to be umounted is selected.
- Remove mounts from the mount hash table as the last thing that
  happens in the case where a mount has a parent in umount_tree.
  Mounts without parents are not hashed (by definition).

This paves the way for delaying removal from the mount hash table even
further and fixing the MNT_LOCKED vs MNT_DETACH issue.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
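Illustration only, not part of the patch: a minimal userspace sketch of
a "last match" lookup that skips entries already marked MNT_UMOUNT.
The hash chain is modelled as a plain array, which is an assumption
made for brevity; struct toy_mount and toy_lookup_last() are
hypothetical names.  Only the MNT_UMOUNT value comes from the series.

```c
#include <stddef.h>

#define MNT_UMOUNT 0x8000000	/* value from the MNT_UMOUNT patch */

/* Toy stand-in for struct mount: just enough to model one hash chain. */
struct toy_mount {
	int parent_id;		/* stands in for mnt_parent */
	int mountpoint_id;	/* stands in for mnt_mountpoint */
	int flags;		/* stands in for mnt.mnt_flags */
};

static int toy_matches(const struct toy_mount *m, int parent_id,
		       int mountpoint_id)
{
	return m->parent_id == parent_id && m->mountpoint_id == mountpoint_id;
}

/*
 * Return the last mount in the run of consecutive matches that is not
 * being unmounted, or NULL if there is none.
 */
static const struct toy_mount *toy_lookup_last(const struct toy_mount *chain,
					       size_t n, int parent_id,
					       int mountpoint_id)
{
	const struct toy_mount *res = NULL;
	size_t i = 0;

	/* Find the first match on the chain. */
	while (i < n && !toy_matches(&chain[i], parent_id, mountpoint_id))
		i++;

	/* Walk the consecutive matches, remembering the last live one. */
	for (; i < n && toy_matches(&chain[i], parent_id, mountpoint_id); i++) {
		if (!(chain[i].flags & MNT_UMOUNT))
			res = &chain[i];
	}
	return res;
}
```

The point of the sketch is that a mount can stay on the chain after
unmounting has begun and still be invisible to this lookup, because the
flag test, not removal from the chain, decides what is returned.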
 fs/namespace.c | 13 ++++++++-----
 fs/pnode.c     |  1 -
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 60d4160cd2f4..a8afec9c81b6 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -623,14 +623,17 @@ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
  */
 struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry)
 {
-	struct mount *p, *res;
-	res = p = __lookup_mnt(mnt, dentry);
+	struct mount *p, *res = NULL;
+	p = __lookup_mnt(mnt, dentry);
 	if (!p)
 		goto out;
+	if (!(p->mnt.mnt_flags & MNT_UMOUNT))
+		res = p;
 	hlist_for_each_entry_continue(p, mnt_hash) {
 		if (&p->mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry)
 			break;
-		res = p;
+		if (!(p->mnt.mnt_flags & MNT_UMOUNT))
+			res = p;
 	}
 out:
 	return res;
@@ -1341,9 +1344,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		list_move(&p->mnt_list, &tmp_list);
 	}
 
-	/* Hide the mounts from lookup_mnt and mnt_mounts */
+	/* Hide the mounts from mnt_mounts */
 	list_for_each_entry(p, &tmp_list, mnt_list) {
-		hlist_del_init_rcu(&p->mnt_hash);
 		list_del_init(&p->mnt_child);
 	}
 
@@ -1367,6 +1369,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 			p->mnt_mountpoint = p->mnt.mnt_root;
 			p->mnt_parent = p;
 			p->mnt_mp = NULL;
+			hlist_del_init_rcu(&p->mnt_hash);
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
diff --git a/fs/pnode.c b/fs/pnode.c
index ac3aa0d43b90..c27ae38ee250 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -383,7 +383,6 @@ static void __propagate_umount(struct mount *mnt)
 		 */
 		if (child && list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
-			hlist_del_init_rcu(&child->mnt_hash);
 			child->mnt.mnt_flags |= MNT_UMOUNT;
 			list_move_tail(&child->mnt_list, &mnt->mnt_list);
 		}
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 06/11] mnt: Factor out __detach_mnt from detach_mnt
       [not found]     ` <87mw5xq7lt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                         ` (4 preceding siblings ...)
  2015-01-05 20:46       ` [PATCH review 05/11] mnt: Delay removal from the mount hash Eric W. Biederman
@ 2015-01-05 20:46       ` Eric W. Biederman
  2015-01-05 20:46       ` [PATCH review 07/11] mnt: Simplify umount_tree Eric W. Biederman
                         ` (5 subsequent siblings)
  11 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

An upcoming change is going to need a version of detach_mnt
that leaves the mount on the parent's mnt_mounts list.  Create
that version of detach_mnt now and call it __detach_mnt.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
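Illustration only, not part of the patch: a minimal userspace sketch of
the split between a core detach that leaves the mount on its parent's
child list and a wrapper that also removes it.  struct toy_mount,
toy_detach_core() and toy_detach() are hypothetical names, and the
boolean fields are simplified stand-ins for the real list linkage.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct mount. */
struct toy_mount {
	struct toy_mount *parent;	/* a detached mount is its own parent */
	bool hashed;			/* stands in for mnt_hash membership */
	bool on_parent_children;	/* stands in for mnt_child linkage */
};

/*
 * Core detach (models __detach_mnt): reparent the mount to itself and
 * unhash it, but leave it on the old parent's child list.  Returns the
 * old parent, standing in for old_path.
 */
static struct toy_mount *toy_detach_core(struct toy_mount *mnt)
{
	struct toy_mount *old_parent = mnt->parent;

	mnt->parent = mnt;
	mnt->hashed = false;
	return old_parent;
}

/* Full detach (models detach_mnt): core detach plus child-list removal. */
static struct toy_mount *toy_detach(struct toy_mount *mnt)
{
	struct toy_mount *old_parent = toy_detach_core(mnt);

	mnt->on_parent_children = false;
	return old_parent;
}
```

Existing callers keep the full behaviour through the wrapper, while a
later caller that must keep the child linkage can use the core helper.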
 fs/namespace.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index a8afec9c81b6..c3f526ce0522 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -789,13 +789,12 @@ static void __touch_mnt_namespace(struct mnt_namespace *ns)
 /*
  * vfsmount lock must be held for write
  */
-static void detach_mnt(struct mount *mnt, struct path *old_path)
+static void __detach_mnt(struct mount *mnt, struct path *old_path)
 {
 	old_path->dentry = mnt->mnt_mountpoint;
 	old_path->mnt = &mnt->mnt_parent->mnt;
 	mnt->mnt_parent = mnt;
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
-	list_del_init(&mnt->mnt_child);
 	hlist_del_init_rcu(&mnt->mnt_hash);
 	hlist_del_init(&mnt->mnt_mp_list);
 	put_mountpoint(mnt->mnt_mp);
@@ -805,6 +804,15 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
 /*
  * vfsmount lock must be held for write
  */
+static void detach_mnt(struct mount *mnt, struct path *old_path)
+{
+	__detach_mnt(mnt, old_path);
+	list_del_init(&mnt->mnt_child);
+}
+
+/*
+ * vfsmount lock must be held for write
+ */
 void mnt_set_mountpoint(struct mount *mnt,
 			struct mountpoint *mp,
 			struct mount *child_mnt)
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 06/11] mnt: Factor out __detach_mnt from detach_mnt
  2015-01-05 20:45   ` [PATCH review 0/11 Call for testing and review of mount detach fixes (take 2) Eric W. Biederman
                       ` (3 preceding siblings ...)
  2015-01-05 20:46     ` [PATCH review 04/11] mnt: Add MNT_UMOUNT flag Eric W. Biederman
@ 2015-01-05 20:46     ` Eric W. Biederman
  2015-01-05 20:46     ` [PATCH review 07/11] mnt: Simplify umount_tree Eric W. Biederman
                       ` (5 subsequent siblings)
  10 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski, Chen Hanxiao,
	Richard Weinberger, Andrey Vagin, Al Viro

An upcoming change is going to need a version of detach_mnt
that leaves the mount on the parent's mnt_mounts list.  Create
that version of detach_mnt now and call it __detach_mnt.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index a8afec9c81b6..c3f526ce0522 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -789,13 +789,12 @@ static void __touch_mnt_namespace(struct mnt_namespace *ns)
 /*
  * vfsmount lock must be held for write
  */
-static void detach_mnt(struct mount *mnt, struct path *old_path)
+static void __detach_mnt(struct mount *mnt, struct path *old_path)
 {
 	old_path->dentry = mnt->mnt_mountpoint;
 	old_path->mnt = &mnt->mnt_parent->mnt;
 	mnt->mnt_parent = mnt;
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
-	list_del_init(&mnt->mnt_child);
 	hlist_del_init_rcu(&mnt->mnt_hash);
 	hlist_del_init(&mnt->mnt_mp_list);
 	put_mountpoint(mnt->mnt_mp);
@@ -805,6 +804,15 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
 /*
  * vfsmount lock must be held for write
  */
+static void detach_mnt(struct mount *mnt, struct path *old_path)
+{
+	__detach_mnt(mnt, old_path);
+	list_del_init(&mnt->mnt_child);
+}
+
+/*
+ * vfsmount lock must be held for write
+ */
 void mnt_set_mountpoint(struct mount *mnt,
 			struct mountpoint *mp,
 			struct mount *child_mnt)
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 07/11] mnt: Simplify umount_tree
       [not found]     ` <87mw5xq7lt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                         ` (5 preceding siblings ...)
  2015-01-05 20:46       ` [PATCH review 06/11] mnt: Factor out __detach_mnt from detach_mnt Eric W. Biederman
@ 2015-01-05 20:46       ` Eric W. Biederman
  2015-01-05 20:46       ` [PATCH review 08/11] mnt: Remove redundant NULL tests in namespace_unlock Eric W. Biederman
                         ` (4 subsequent siblings)
  11 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Replace the open-coded copy of __detach_mnt in umount_tree with a call
to __detach_mnt.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c3f526ce0522..9fae55f2242e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1368,16 +1368,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		if (how & UMOUNT_SYNC)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 		if (mnt_has_parent(p)) {
-			hlist_del_init(&p->mnt_mp_list);
-			put_mountpoint(p->mnt_mp);
 			mnt_add_count(p->mnt_parent, -1);
-			/* move the reference to mountpoint into ->mnt_ex_mountpoint */
-			p->mnt_ex_mountpoint.dentry = p->mnt_mountpoint;
-			p->mnt_ex_mountpoint.mnt = &p->mnt_parent->mnt;
-			p->mnt_mountpoint = p->mnt.mnt_root;
-			p->mnt_parent = p;
-			p->mnt_mp = NULL;
-			hlist_del_init_rcu(&p->mnt_hash);
+			__detach_mnt(p, &p->mnt_ex_mountpoint);
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 07/11] mnt: Simplify umount_tree
  2015-01-05 20:45   ` [PATCH review 0/11 Call for testing and review of mount detach fixes (take 2) Eric W. Biederman
                       ` (4 preceding siblings ...)
  2015-01-05 20:46     ` [PATCH review 06/11] mnt: Factor out __detach_mnt from detach_mnt Eric W. Biederman
@ 2015-01-05 20:46     ` Eric W. Biederman
  2015-01-05 20:46     ` [PATCH review 08/11] mnt: Remove redundant NULL tests in namespace_unlock Eric W. Biederman
                       ` (4 subsequent siblings)
  10 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski, Chen Hanxiao,
	Richard Weinberger, Andrey Vagin, Al Viro

Replace the open-coded copy of __detach_mnt in umount_tree with a call
to __detach_mnt.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c3f526ce0522..9fae55f2242e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1368,16 +1368,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		if (how & UMOUNT_SYNC)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 		if (mnt_has_parent(p)) {
-			hlist_del_init(&p->mnt_mp_list);
-			put_mountpoint(p->mnt_mp);
 			mnt_add_count(p->mnt_parent, -1);
-			/* move the reference to mountpoint into ->mnt_ex_mountpoint */
-			p->mnt_ex_mountpoint.dentry = p->mnt_mountpoint;
-			p->mnt_ex_mountpoint.mnt = &p->mnt_parent->mnt;
-			p->mnt_mountpoint = p->mnt.mnt_root;
-			p->mnt_parent = p;
-			p->mnt_mp = NULL;
-			hlist_del_init_rcu(&p->mnt_hash);
+			__detach_mnt(p, &p->mnt_ex_mountpoint);
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 08/11] mnt: Remove redundant NULL tests in namespace_unlock
       [not found]     ` <87mw5xq7lt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                         ` (6 preceding siblings ...)
  2015-01-05 20:46       ` [PATCH review 07/11] mnt: Simplify umount_tree Eric W. Biederman
@ 2015-01-05 20:46       ` Eric W. Biederman
  2015-01-05 20:46       ` [PATCH review 09/11] mnt: On an unmount propagate clearing of MNT_LOCKED Eric W. Biederman
                         ` (3 subsequent siblings)
  11 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

mntget, mntput, dput and path_put all test their arguments to see if
they are NULL before taking any action, so testing for NULL in
namespace_unlock is redundant.

Remove the redundant checks, making namespace_unlock a little
shorter and easier to read.

This also makes it possible for mnt_ex_mountpoint.mnt to be NULL,
which allows putting a dentry without a mount.  This will be needed
in __detach_mounts when detaching already unmounted children,
as part of the fix for MNT_DETACH on MNT_LOCKED mounts.
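
The reasoning above can be sketched in userspace as follows.  This is
only an illustration under assumptions: struct toy_ref, struct toy_path
and the toy_* functions are hypothetical and merely mimic the
NULL-tolerant get/put behavior the message describes.  Because the
get/put helpers ignore NULL, a path whose mount half is NULL can be put
without any test in the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference-counted object. */
struct toy_ref {
	int count;
};

/* Take a reference; a NULL argument is silently ignored. */
struct toy_ref *toy_get(struct toy_ref *r)
{
	if (r)
		r->count++;
	return r;
}

/* Drop a reference; a NULL argument is silently ignored. */
void toy_put(struct toy_ref *r)
{
	if (r)
		r->count--;
}

/* Hypothetical pair of references, loosely modeled on struct path. */
struct toy_path {
	struct toy_ref *mnt;    /* may be NULL */
	struct toy_ref *dentry;
};

/* Drop both halves; no NULL test is needed here. */
void toy_path_put(const struct toy_path *p)
{
	toy_put(p->dentry);
	toy_put(p->mnt);
}
```

With helpers of this shape, a caller-side "if (p.mnt)" guard adds
nothing, and dropping it lets a path with only a dentry be released.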

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9fae55f2242e..3769dbd040c1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1312,8 +1312,7 @@ static void namespace_unlock(void)
 
 	/* undo decrements we'd done in umount_tree() */
 	list_for_each_entry(mnt, &head, mnt_list)
-		if (mnt->mnt_ex_mountpoint.mnt)
-			mntget(mnt->mnt_ex_mountpoint.mnt);
+		mntget(mnt->mnt_ex_mountpoint.mnt);
 
 	up_write(&namespace_sem);
 
@@ -1322,8 +1321,7 @@ static void namespace_unlock(void)
 	while (!list_empty(&head)) {
 		mnt = list_first_entry(&head, struct mount, mnt_list);
 		list_del_init(&mnt->mnt_list);
-		if (mnt->mnt_ex_mountpoint.mnt)
-			path_put(&mnt->mnt_ex_mountpoint);
+		path_put(&mnt->mnt_ex_mountpoint);
 		mntput(&mnt->mnt);
 	}
 }
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 08/11] mnt: Remove redundant NULL tests in namespace_unlock
  2015-01-05 20:45   ` [PATCH review 0/11 Call for testing and review of mount detach fixes (take 2) Eric W. Biederman
                       ` (5 preceding siblings ...)
  2015-01-05 20:46     ` [PATCH review 07/11] mnt: Simplify umount_tree Eric W. Biederman
@ 2015-01-05 20:46     ` Eric W. Biederman
  2015-01-05 20:46     ` [PATCH review 09/11] mnt: On an unmount propagate clearing of MNT_LOCKED Eric W. Biederman
                       ` (3 subsequent siblings)
  10 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski, Chen Hanxiao,
	Richard Weinberger, Andrey Vagin, Al Viro

mntget, mntput, dput and path_put all test their arguments to see if
they are NULL before taking any action, so testing for NULL in
namespace_unlock is redundant.

Remove the redundant checks, making namespace_unlock a little
shorter and easier to read.

This also makes it possible for mnt_ex_mountpoint.mnt to be NULL,
which allows putting a dentry without a mount.  This will be needed
in __detach_mounts when detaching already unmounted children,
as part of the fix for MNT_DETACH on MNT_LOCKED mounts.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9fae55f2242e..3769dbd040c1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1312,8 +1312,7 @@ static void namespace_unlock(void)
 
 	/* undo decrements we'd done in umount_tree() */
 	list_for_each_entry(mnt, &head, mnt_list)
-		if (mnt->mnt_ex_mountpoint.mnt)
-			mntget(mnt->mnt_ex_mountpoint.mnt);
+		mntget(mnt->mnt_ex_mountpoint.mnt);
 
 	up_write(&namespace_sem);
 
@@ -1322,8 +1321,7 @@ static void namespace_unlock(void)
 	while (!list_empty(&head)) {
 		mnt = list_first_entry(&head, struct mount, mnt_list);
 		list_del_init(&mnt->mnt_list);
-		if (mnt->mnt_ex_mountpoint.mnt)
-			path_put(&mnt->mnt_ex_mountpoint);
+		path_put(&mnt->mnt_ex_mountpoint);
 		mntput(&mnt->mnt);
 	}
 }
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 09/11] mnt: On an unmount propagate clearing of MNT_LOCKED
       [not found]     ` <87mw5xq7lt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                         ` (7 preceding siblings ...)
  2015-01-05 20:46       ` [PATCH review 08/11] mnt: Remove redundant NULL tests in namespace_unlock Eric W. Biederman
@ 2015-01-05 20:46       ` Eric W. Biederman
  2015-01-05 20:46       ` [PATCH review 10/11] mnt: Don't propagate unmounts to locked mounts Eric W. Biederman
                         ` (2 subsequent siblings)
  11 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

A prerequisite of calling umount_tree is that the point where the tree
is mounted is valid to unmount.

If we are propagating the effect of the unmount, clear MNT_LOCKED in
every instance where the same filesystem is mounted on the same
mountpoint in the mount tree, as we know (by virtue of the fact
that umount_tree was called) that it is safe to reveal what
is at that mountpoint.
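
A simplified userspace sketch of the flag clearing described above
follows.  It makes several assumptions: struct toy_child, TOY_LOCKED and
toy_propagate_unlock are hypothetical names, the flag value is made up,
and the propagation walk is reduced to a flat array of mounts that each
record which mountpoint they sit on.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_LOCKED 0x1u /* illustrative value, not the kernel's */

/* Hypothetical child mount as seen from one peer of the parent. */
struct toy_child {
	int mountpoint_id;  /* which mountpoint the child sits on */
	unsigned int flags;
};

/*
 * Clear TOY_LOCKED on every child mounted on mountpoint_id and
 * return how many children actually had the flag set.
 */
size_t toy_propagate_unlock(struct toy_child *c, size_t n, int mountpoint_id)
{
	size_t cleared = 0;

	for (size_t i = 0; i < n; i++) {
		if (c[i].mountpoint_id != mountpoint_id)
			continue;
		if (c[i].flags & TOY_LOCKED)
			cleared++;
		c[i].flags &= ~TOY_LOCKED;
	}
	return cleared;
}
```

Children on other mountpoints keep their flag, which mirrors the
restriction to "the same mountpoint" in the description.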

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c |  3 +++
 fs/pnode.c     | 20 ++++++++++++++++++++
 fs/pnode.h     |  1 +
 3 files changed, 24 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 3769dbd040c1..ca801b41c643 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1344,6 +1344,9 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	LIST_HEAD(tmp_list);
 	struct mount *p;
 
+	if (how & UMOUNT_PROPAGATE)
+		propagate_mount_unlock(mnt);
+
 	/* Gather the mounts to umount */
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
 		p->mnt.mnt_flags |= MNT_UMOUNT;
diff --git a/fs/pnode.c b/fs/pnode.c
index c27ae38ee250..89890293dd0a 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -362,6 +362,26 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
 }
 
 /*
+ * Clear MNT_LOCKED when it can be shown to be safe.
+ *
+ * mount_lock lock must be held for write
+ */
+void propagate_mount_unlock(struct mount *mnt)
+{
+	struct mount *parent = mnt->mnt_parent;
+	struct mount *m, *child;
+
+	BUG_ON(parent == mnt);
+
+	for (m = propagation_next(parent, parent); m;
+			m = propagation_next(m, parent)) {
+		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
+		if (child)
+			child->mnt.mnt_flags &= ~MNT_LOCKED;
+	}
+}
+
+/*
  * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
  * parent propagates to.
  */
diff --git a/fs/pnode.h b/fs/pnode.h
index aa6d65df7204..af47d4bd7b31 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -42,6 +42,7 @@ int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
 		struct hlist_head *);
 int propagate_umount(struct list_head *);
 int propagate_mount_busy(struct mount *, int);
+void propagate_mount_unlock(struct mount *);
 void mnt_release_group_id(struct mount *);
 int get_dominating_id(struct mount *mnt, const struct path *root);
 unsigned int mnt_get_count(struct mount *mnt);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 09/11] mnt: On an unmount propagate clearing of MNT_LOCKED
  2015-01-05 20:45   ` [PATCH review 0/11 Call for testing and review of mount detach fixes (take 2) Eric W. Biederman
                       ` (6 preceding siblings ...)
  2015-01-05 20:46     ` [PATCH review 08/11] mnt: Remove redundant NULL tests in namespace_unlock Eric W. Biederman
@ 2015-01-05 20:46     ` Eric W. Biederman
  2015-01-05 20:46     ` [PATCH review 10/11] mnt: Don't propagate unmounts to locked mounts Eric W. Biederman
                       ` (2 subsequent siblings)
  10 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski, Chen Hanxiao,
	Richard Weinberger, Andrey Vagin, Al Viro

A prerequisite of calling umount_tree is that the point where the tree
is mounted is valid to unmount.

If we are propagating the effect of the unmount, clear MNT_LOCKED in
every instance where the same filesystem is mounted on the same
mountpoint in the mount tree, as we know (by virtue of the fact
that umount_tree was called) that it is safe to reveal what
is at that mountpoint.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c |  3 +++
 fs/pnode.c     | 20 ++++++++++++++++++++
 fs/pnode.h     |  1 +
 3 files changed, 24 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 3769dbd040c1..ca801b41c643 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1344,6 +1344,9 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	LIST_HEAD(tmp_list);
 	struct mount *p;
 
+	if (how & UMOUNT_PROPAGATE)
+		propagate_mount_unlock(mnt);
+
 	/* Gather the mounts to umount */
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
 		p->mnt.mnt_flags |= MNT_UMOUNT;
diff --git a/fs/pnode.c b/fs/pnode.c
index c27ae38ee250..89890293dd0a 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -362,6 +362,26 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
 }
 
 /*
+ * Clear MNT_LOCKED when it can be shown to be safe.
+ *
+ * mount_lock lock must be held for write
+ */
+void propagate_mount_unlock(struct mount *mnt)
+{
+	struct mount *parent = mnt->mnt_parent;
+	struct mount *m, *child;
+
+	BUG_ON(parent == mnt);
+
+	for (m = propagation_next(parent, parent); m;
+			m = propagation_next(m, parent)) {
+		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
+		if (child)
+			child->mnt.mnt_flags &= ~MNT_LOCKED;
+	}
+}
+
+/*
  * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
  * parent propagates to.
  */
diff --git a/fs/pnode.h b/fs/pnode.h
index aa6d65df7204..af47d4bd7b31 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -42,6 +42,7 @@ int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
 		struct hlist_head *);
 int propagate_umount(struct list_head *);
 int propagate_mount_busy(struct mount *, int);
+void propagate_mount_unlock(struct mount *);
 void mnt_release_group_id(struct mount *);
 int get_dominating_id(struct mount *mnt, const struct path *root);
 unsigned int mnt_get_count(struct mount *mnt);
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 10/11] mnt: Don't propagate unmounts to locked mounts
       [not found]     ` <87mw5xq7lt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                         ` (8 preceding siblings ...)
  2015-01-05 20:46       ` [PATCH review 09/11] mnt: On an unmount propagate clearing of MNT_LOCKED Eric W. Biederman
@ 2015-01-05 20:46       ` Eric W. Biederman
  2015-01-05 20:46       ` [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts Eric W. Biederman
  2015-04-03  1:53       ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
  11 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

If the first mount in a shared subtree is locked, don't unmount the
shared subtree.

This is ensured by walking through the mounts, parents before children,
and marking a mount as a candidate for unmounting if it is not locked,
or if it is locked but its parent is marked.

This allows a recursive mount detach to propagate through a set of
mounts when unmounting them would not reveal what is under any locked
mount.
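
The marking rule described above can be modeled in a few lines of
userspace C.  This sketch is an assumption-laden simplification: struct
toy_mount and toy_mark_candidates are hypothetical, the mounts live in
an array ordered so that every parent precedes its children, and the
propagation across peers done by the real code is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mount record; parent is an array index or -1. */
struct toy_mount {
	int parent;   /* index of the parent, must be lower than own index */
	bool locked;  /* models MNT_LOCKED */
	bool marked;  /* models the "may be unmounted" mark */
};

/*
 * Walk parents before children.  A mount is marked when it is not
 * locked, or when it is locked but its parent has been marked.
 */
void toy_mark_candidates(struct toy_mount *m, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		bool parent_marked = m[i].parent >= 0 &&
				     m[m[i].parent].marked;

		m[i].marked = !m[i].locked || parent_marked;
	}
}
```

A locked mount under a marked parent goes away together with that
parent, so nothing beneath it is revealed; a locked mount whose parent
stays is left unmarked.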

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/pnode.c | 32 +++++++++++++++++++++++++++++---
 fs/pnode.h |  1 +
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/fs/pnode.c b/fs/pnode.c
index 89890293dd0a..6367e1e435c6 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -382,6 +382,26 @@ void propagate_mount_unlock(struct mount *mnt)
 }
 
 /*
+ * Mark all mounts that the MNT_LOCKED logic will allow to be unmounted.
+ */
+static void mark_umount_candidates(struct mount *mnt)
+{
+	struct mount *parent = mnt->mnt_parent;
+	struct mount *m;
+
+	BUG_ON(parent == mnt);
+
+	for (m = propagation_next(parent, parent); m;
+			m = propagation_next(m, parent)) {
+		struct mount *child = __lookup_mnt_last(&m->mnt,
+						mnt->mnt_mountpoint);
+		if (child && (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m))) {
+			SET_MNT_MARK(child);
+		}
+	}
+}
+
+/*
  * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
  * parent propagates to.
  */
@@ -398,10 +418,13 @@ static void __propagate_umount(struct mount *mnt)
 		struct mount *child = __lookup_mnt_last(&m->mnt,
 						mnt->mnt_mountpoint);
 		/*
-		 * umount the child only if the child has no
-		 * other children
+		 * umount the child only if the child has no children
+		 * and the child is marked safe to unmount.
 		 */
-		if (child && list_empty(&child->mnt_mounts)) {
+		if (!child || !IS_MNT_MARKED(child))
+			continue;
+		CLEAR_MNT_MARK(child);
+		if (list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
 			child->mnt.mnt_flags |= MNT_UMOUNT;
 			list_move_tail(&child->mnt_list, &mnt->mnt_list);
@@ -420,6 +443,9 @@ int propagate_umount(struct list_head *list)
 {
 	struct mount *mnt;
 
+	list_for_each_entry_reverse(mnt, list, mnt_list)
+		mark_umount_candidates(mnt);
+
 	list_for_each_entry(mnt, list, mnt_list)
 		__propagate_umount(mnt);
 	return 0;
diff --git a/fs/pnode.h b/fs/pnode.h
index af47d4bd7b31..0fcdbe7ca648 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -19,6 +19,7 @@
 #define IS_MNT_MARKED(m) ((m)->mnt.mnt_flags & MNT_MARKED)
 #define SET_MNT_MARK(m) ((m)->mnt.mnt_flags |= MNT_MARKED)
 #define CLEAR_MNT_MARK(m) ((m)->mnt.mnt_flags &= ~MNT_MARKED)
+#define IS_MNT_LOCKED(m) ((m)->mnt.mnt_flags & MNT_LOCKED)
 
 #define CL_EXPIRE    		0x01
 #define CL_SLAVE     		0x02
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 10/11] mnt: Don't propagate unmounts to locked mounts
  2015-01-05 20:45   ` [PATCH review 0/11 Call for testing and review of mount detach fixes (take 2) Eric W. Biederman
                       ` (7 preceding siblings ...)
  2015-01-05 20:46     ` [PATCH review 09/11] mnt: On an unmount propagate clearing of MNT_LOCKED Eric W. Biederman
@ 2015-01-05 20:46     ` Eric W. Biederman
  2015-01-05 20:46     ` [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts Eric W. Biederman
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
  10 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski, Chen Hanxiao,
	Richard Weinberger, Andrey Vagin, Al Viro

If the first mount in a shared subtree is locked, don't unmount the
shared subtree.

This is ensured by walking through the mounts, parents before children,
and marking a mount as a candidate for unmounting if it is not locked,
or if it is locked but its parent is marked.

This allows a recursive mount detach to propagate through a set of
mounts when unmounting them would not reveal what is under any locked
mount.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/pnode.c | 32 +++++++++++++++++++++++++++++---
 fs/pnode.h |  1 +
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/fs/pnode.c b/fs/pnode.c
index 89890293dd0a..6367e1e435c6 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -382,6 +382,26 @@ void propagate_mount_unlock(struct mount *mnt)
 }
 
 /*
+ * Mark all mounts that the MNT_LOCKED logic will allow to be unmounted.
+ */
+static void mark_umount_candidates(struct mount *mnt)
+{
+	struct mount *parent = mnt->mnt_parent;
+	struct mount *m;
+
+	BUG_ON(parent == mnt);
+
+	for (m = propagation_next(parent, parent); m;
+			m = propagation_next(m, parent)) {
+		struct mount *child = __lookup_mnt_last(&m->mnt,
+						mnt->mnt_mountpoint);
+		if (child && (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m))) {
+			SET_MNT_MARK(child);
+		}
+	}
+}
+
+/*
  * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
  * parent propagates to.
  */
@@ -398,10 +418,13 @@ static void __propagate_umount(struct mount *mnt)
 		struct mount *child = __lookup_mnt_last(&m->mnt,
 						mnt->mnt_mountpoint);
 		/*
-		 * umount the child only if the child has no
-		 * other children
+		 * umount the child only if the child has no children
+		 * and the child is marked safe to unmount.
 		 */
-		if (child && list_empty(&child->mnt_mounts)) {
+		if (!child || !IS_MNT_MARKED(child))
+			continue;
+		CLEAR_MNT_MARK(child);
+		if (list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
 			child->mnt.mnt_flags |= MNT_UMOUNT;
 			list_move_tail(&child->mnt_list, &mnt->mnt_list);
@@ -420,6 +443,9 @@ int propagate_umount(struct list_head *list)
 {
 	struct mount *mnt;
 
+	list_for_each_entry_reverse(mnt, list, mnt_list)
+		mark_umount_candidates(mnt);
+
 	list_for_each_entry(mnt, list, mnt_list)
 		__propagate_umount(mnt);
 	return 0;
diff --git a/fs/pnode.h b/fs/pnode.h
index af47d4bd7b31..0fcdbe7ca648 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -19,6 +19,7 @@
 #define IS_MNT_MARKED(m) ((m)->mnt.mnt_flags & MNT_MARKED)
 #define SET_MNT_MARK(m) ((m)->mnt.mnt_flags |= MNT_MARKED)
 #define CLEAR_MNT_MARK(m) ((m)->mnt.mnt_flags &= ~MNT_MARKED)
+#define IS_MNT_LOCKED(m) ((m)->mnt.mnt_flags & MNT_LOCKED)
 
 #define CL_EXPIRE    		0x01
 #define CL_SLAVE     		0x02
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]     ` <87mw5xq7lt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                         ` (9 preceding siblings ...)
  2015-01-05 20:46       ` [PATCH review 10/11] mnt: Don't propagate unmounts to locked mounts Eric W. Biederman
@ 2015-01-05 20:46       ` Eric W. Biederman
  2015-04-03  1:53       ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
  11 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Modify umount(MNT_DETACH) to keep mounts that are locked to their
parent mounts in the hash table when the parent is lazily unmounted.
In doing this, invert the reference count so that the parent holds a
reference to the children instead of the children holding a reference
to the parent.

Then in mntput_no_expire detach the children, and in cleanup_mnt mntput
the children and dput the dentry they were mounted on.

In __detach_mounts, if there are any mounts that have been unmounted
but are still on the list of mounts of a mountpoint, detach those
mounts and schedule them to be mntput, and their reference to the dentry
to be put, when it becomes safe to sleep.
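
The "locked and lazily unmounted" condition is decided by a two-bit
mask test in the patch below.  The following userspace sketch shows that
test in isolation.  The TOY_* constants and toy_locked_and_lazy are
hypothetical, and the bit values are chosen for illustration only; they
are not the kernel's MNT_* values.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; the values are assumptions. */
enum {
	TOY_LOCKED      = 0x1, /* models MNT_LOCKED */
	TOY_SYNC_UMOUNT = 0x2, /* models MNT_SYNC_UMOUNT */
	TOY_OTHER       = 0x4  /* any unrelated flag */
};

/*
 * True only when the locked bit is set and the synchronous-umount
 * bit is clear; unrelated bits are masked out first.
 */
bool toy_locked_and_lazy(unsigned int flags)
{
	return (flags & (TOY_LOCKED | TOY_SYNC_UMOUNT)) == TOY_LOCKED;
}
```

Comparing the masked value against the single bit checks "set" and "not
set" in one expression, instead of two separate tests.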

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
 fs/pnode.h     |  2 ++
 2 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ca801b41c643..689a27d6e950 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1017,6 +1017,17 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	return ERR_PTR(err);
 }
 
+static void mntput_children(struct mount *mnt)
+{
+	struct mount *p, *tmp;
+
+	list_for_each_entry_safe(p, tmp, &mnt->mnt_mounts, mnt_child) {
+		list_del_init(&p->mnt_child);
+		path_put(&p->mnt_ex_mountpoint);
+		mntput(&p->mnt);
+	}
+}
+
 static void cleanup_mnt(struct mount *mnt)
 {
 	/*
@@ -1030,6 +1041,8 @@ static void cleanup_mnt(struct mount *mnt)
 	 * so mnt_get_writers() below is safe.
 	 */
 	WARN_ON(mnt_get_writers(mnt));
+	if (unlikely(!list_empty(&mnt->mnt_mounts)))
+		mntput_children(mnt);
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	fsnotify_vfsmount_delete(&mnt->mnt);
@@ -1080,6 +1093,15 @@ static void mntput_no_expire(struct mount *mnt)
 	rcu_read_unlock();
 
 	list_del(&mnt->mnt_instance);
+
+	if (unlikely(!list_empty(&mnt->mnt_mounts))) {
+		struct mount *p;
+		list_for_each_entry(p, &mnt->mnt_mounts,  mnt_child) {
+			__detach_mnt(p, &p->mnt_ex_mountpoint);
+			/* No need to mntput mnt */
+			p->mnt_ex_mountpoint.mnt = NULL;
+		}
+	}
 	unlock_mount_hash();
 
 	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
@@ -1342,7 +1364,7 @@ enum umount_tree_flags {
 static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 {
 	LIST_HEAD(tmp_list);
-	struct mount *p;
+	struct mount *tmp, *p;
 
 	if (how & UMOUNT_PROPAGATE)
 		propagate_mount_unlock(mnt);
@@ -1362,7 +1384,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	if (how & UMOUNT_PROPAGATE)
 		propagate_umount(&tmp_list);
 
-	list_for_each_entry(p, &tmp_list, mnt_list) {
+	list_for_each_entry_safe(p, tmp, &tmp_list, mnt_list) {
 		list_del_init(&p->mnt_expire);
 		__touch_mnt_namespace(p->mnt_ns);
 		p->mnt_ns = NULL;
@@ -1370,7 +1392,14 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 		if (mnt_has_parent(p)) {
 			mnt_add_count(p->mnt_parent, -1);
-			__detach_mnt(p, &p->mnt_ex_mountpoint);
+			if (IS_MNT_LOCKED_AND_LAZY(p)) {
+				/* Don't mntput p in namespace_unlock */
+				list_del_init(&p->mnt_list);
+				/* Don't forget about p */
+				list_add_tail(&p->mnt_child, &p->mnt_parent->mnt_mounts);
+			} else {
+				__detach_mnt(p, &p->mnt_ex_mountpoint);
+			}
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
@@ -1496,7 +1525,16 @@ void __detach_mounts(struct dentry *dentry)
 	lock_mount_hash();
 	while (!hlist_empty(&mp->m_list)) {
 		mnt = hlist_entry(mp->m_list.first, struct mount, mnt_mp_list);
-		umount_tree(mnt, 0);
+		if (mnt->mnt.mnt_flags & MNT_UMOUNT) {
+			struct mount *p, *tmp;
+			list_for_each_entry_safe(p, tmp, &mnt->mnt_mounts,  mnt_child) {
+				detach_mnt(p, &p->mnt_ex_mountpoint);
+				/* p->mnt_parent has already been mntput */
+				p->mnt_ex_mountpoint.mnt = NULL;
+				list_add_tail(&p->mnt_list, &unmounted);
+			}
+		}
+		else umount_tree(mnt, 0);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
diff --git a/fs/pnode.h b/fs/pnode.h
index 0fcdbe7ca648..7114ce6e6b9e 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -20,6 +20,8 @@
 #define SET_MNT_MARK(m) ((m)->mnt.mnt_flags |= MNT_MARKED)
 #define CLEAR_MNT_MARK(m) ((m)->mnt.mnt_flags &= ~MNT_MARKED)
 #define IS_MNT_LOCKED(m) ((m)->mnt.mnt_flags & MNT_LOCKED)
+#define IS_MNT_LOCKED_AND_LAZY(m) \
+	(((m)->mnt.mnt_flags & (MNT_LOCKED|MNT_SYNC_UMOUNT)) == MNT_LOCKED)
 
 #define CL_EXPIRE    		0x01
 #define CL_SLAVE     		0x02
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
  2015-01-05 20:45   ` [PATCH review 0/11 Call for testing and review of mount detach fixes (take 2) Eric W. Biederman
                       ` (8 preceding siblings ...)
  2015-01-05 20:46     ` [PATCH review 10/11] mnt: Don't propagate unmounts to locked mounts Eric W. Biederman
@ 2015-01-05 20:46     ` Eric W. Biederman
       [not found]       ` <1420490787-14387-11-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2015-01-07 18:43       ` Al Viro
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
  10 siblings, 2 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-05 20:46 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski, Chen Hanxiao,
	Richard Weinberger, Andrey Vagin, Al Viro

Modify umount(MNT_DETACH) to keep mounts that are locked to their
parent mounts in the hash table when the parent is lazily unmounted.
In doing this, invert the reference count so that the parent holds a
reference to the children instead of the children holding a reference
to the parent.

Then in mntput_no_expire detach the children, and in cleanup_mnt mntput
the children and dput the dentry they were mounted on.

In __detach_mounts, if there are any mounts that have been unmounted
but are still on the list of mounts of a mountpoint, detach those
mounts and schedule them to be mntput, and their reference to the dentry
to be put, when it becomes safe to sleep.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
 fs/pnode.h     |  2 ++
 2 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ca801b41c643..689a27d6e950 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1017,6 +1017,17 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	return ERR_PTR(err);
 }
 
+static void mntput_children(struct mount *mnt)
+{
+	struct mount *p, *tmp;
+
+	list_for_each_entry_safe(p, tmp, &mnt->mnt_mounts, mnt_child) {
+		list_del_init(&p->mnt_child);
+		path_put(&p->mnt_ex_mountpoint);
+		mntput(&p->mnt);
+	}
+}
+
 static void cleanup_mnt(struct mount *mnt)
 {
 	/*
@@ -1030,6 +1041,8 @@ static void cleanup_mnt(struct mount *mnt)
 	 * so mnt_get_writers() below is safe.
 	 */
 	WARN_ON(mnt_get_writers(mnt));
+	if (unlikely(!list_empty(&mnt->mnt_mounts)))
+		mntput_children(mnt);
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	fsnotify_vfsmount_delete(&mnt->mnt);
@@ -1080,6 +1093,15 @@ static void mntput_no_expire(struct mount *mnt)
 	rcu_read_unlock();
 
 	list_del(&mnt->mnt_instance);
+
+	if (unlikely(!list_empty(&mnt->mnt_mounts))) {
+		struct mount *p;
+		list_for_each_entry(p, &mnt->mnt_mounts,  mnt_child) {
+			__detach_mnt(p, &p->mnt_ex_mountpoint);
+			/* No need to mntput mnt */
+			p->mnt_ex_mountpoint.mnt = NULL;
+		}
+	}
 	unlock_mount_hash();
 
 	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
@@ -1342,7 +1364,7 @@ enum umount_tree_flags {
 static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 {
 	LIST_HEAD(tmp_list);
-	struct mount *p;
+	struct mount *tmp, *p;
 
 	if (how & UMOUNT_PROPAGATE)
 		propagate_mount_unlock(mnt);
@@ -1362,7 +1384,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	if (how & UMOUNT_PROPAGATE)
 		propagate_umount(&tmp_list);
 
-	list_for_each_entry(p, &tmp_list, mnt_list) {
+	list_for_each_entry_safe(p, tmp, &tmp_list, mnt_list) {
 		list_del_init(&p->mnt_expire);
 		__touch_mnt_namespace(p->mnt_ns);
 		p->mnt_ns = NULL;
@@ -1370,7 +1392,14 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 		if (mnt_has_parent(p)) {
 			mnt_add_count(p->mnt_parent, -1);
-			__detach_mnt(p, &p->mnt_ex_mountpoint);
+			if (IS_MNT_LOCKED_AND_LAZY(p)) {
+				/* Don't mntput p in namespace_unlock */
+				list_del_init(&p->mnt_list);
+				/* Don't forget about p */
+				list_add_tail(&p->mnt_child, &p->mnt_parent->mnt_mounts);
+			} else {
+				__detach_mnt(p, &p->mnt_ex_mountpoint);
+			}
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
@@ -1496,7 +1525,16 @@ void __detach_mounts(struct dentry *dentry)
 	lock_mount_hash();
 	while (!hlist_empty(&mp->m_list)) {
 		mnt = hlist_entry(mp->m_list.first, struct mount, mnt_mp_list);
-		umount_tree(mnt, 0);
+		if (mnt->mnt.mnt_flags & MNT_UMOUNT) {
+			struct mount *p, *tmp;
+			list_for_each_entry_safe(p, tmp, &mnt->mnt_mounts,  mnt_child) {
+				detach_mnt(p, &p->mnt_ex_mountpoint);
+				/* p->mnt_parent has already been mntput */
+				p->mnt_ex_mountpoint.mnt = NULL;
+				list_add_tail(&p->mnt_list, &unmounted);
+			}
+		}
+		else umount_tree(mnt, 0);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
diff --git a/fs/pnode.h b/fs/pnode.h
index 0fcdbe7ca648..7114ce6e6b9e 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -20,6 +20,8 @@
 #define SET_MNT_MARK(m) ((m)->mnt.mnt_flags |= MNT_MARKED)
 #define CLEAR_MNT_MARK(m) ((m)->mnt.mnt_flags &= ~MNT_MARKED)
 #define IS_MNT_LOCKED(m) ((m)->mnt.mnt_flags & MNT_LOCKED)
+#define IS_MNT_LOCKED_AND_LAZY(m) \
+	(((m)->mnt.mnt_flags & (MNT_LOCKED|MNT_SYNC_UMOUNT)) == MNT_LOCKED)
 
 #define CL_EXPIRE    		0x01
 #define CL_SLAVE     		0x02
-- 
2.2.1
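
A note on the new pnode.h macro: IS_MNT_LOCKED_AND_LAZY() masks two flag
bits and requires that only MNT_LOCKED remains, so it is true for a locked
mount that has not been marked MNT_SYNC_UMOUNT. A minimal userspace sketch
of that mask-and-compare follows. The DEMO_* bit values and the helper
names are placeholders invented for illustration; they are not the
kernel's real flag values or functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bits for illustration only; the real values live in
 * include/linux/mount.h and are not reproduced here. */
#define DEMO_MNT_LOCKED      0x1u
#define DEMO_MNT_SYNC_UMOUNT 0x2u

/* Same shape as IS_MNT_LOCKED_AND_LAZY(): mask both bits, then require
 * that only the locked bit survives. */
static bool demo_locked_and_lazy(unsigned int flags)
{
	unsigned int mask = DEMO_MNT_LOCKED | DEMO_MNT_SYNC_UMOUNT;

	return (flags & mask) == DEMO_MNT_LOCKED;
}

/* Walks the four flag combinations; aborts if the predicate is wrong. */
static void demo_check_truth_table(void)
{
	assert(!demo_locked_and_lazy(0));
	assert(demo_locked_and_lazy(DEMO_MNT_LOCKED));
	assert(!demo_locked_and_lazy(DEMO_MNT_SYNC_UMOUNT));
	assert(!demo_locked_and_lazy(DEMO_MNT_LOCKED | DEMO_MNT_SYNC_UMOUNT));
}
```

Only one of the four combinations satisfies the predicate, and unrelated
flag bits outside the mask do not affect the result.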


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]       ` <1420490787-14387-11-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2015-01-07 18:43         ` Al Viro
  0 siblings, 0 replies; 240+ messages in thread
From: Al Viro @ 2015-01-07 18:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Mon, Jan 05, 2015 at 02:46:27PM -0600, Eric W. Biederman wrote:
> Modify umount(MNT_DETACH) to keep mounts in the hash table that are
> locked to their parent mounts, when the parent is lazily unmounted.
> In doing this invert the reference count so that the parent holds a
> reference to the children instead of the children holding a reference
> to the parent.
> 
> Then in mntput_no_expire detach the children and in cleanup_mnt mntput
> the children and dput the dentry they were mounted on.
> 
> In __detach_mounts if there are any mounts that have been unmounted
> but still are on the list of mounts of a mountpoint, detach those
> mounts and schedule them to be mntput and their reference to the dentry
> to be put when it becomes safe to sleep.

Explicit description of your new refcounting rules, please.  What's more,
how do those non-pinning children interact with e.g. copy_tree()?

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
  2015-01-05 20:46     ` [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts Eric W. Biederman
       [not found]       ` <1420490787-14387-11-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2015-01-07 18:43       ` Al Viro
       [not found]         ` <20150107184334.GZ22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-01-07 19:30         ` Eric W. Biederman
  1 sibling, 2 replies; 240+ messages in thread
From: Al Viro @ 2015-01-07 18:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-fsdevel, Serge E. Hallyn,
	Andy Lutomirski, Chen Hanxiao, Richard Weinberger, Andrey Vagin

On Mon, Jan 05, 2015 at 02:46:27PM -0600, Eric W. Biederman wrote:
> Modify umount(MNT_DETACH) to keep mounts in the hash table that are
> locked to their parent mounts, when the parent is lazily unmounted.
> In doing this invert the reference count so that the parent holds a
> reference to the children instead of the children holding a reference
> to the parent.
> 
> Then in mntput_no_expire detach the children and in cleanup_mnt mntput
> the children and dput the dentry they were mounted on.
> 
> In __detach_mounts if there are any mounts that have been unmounted
> but still are on the list of mounts of a mountpoint, detach those
> mounts and schedule them to be mntput and their reference to the dentry
> to be put when it becomes safe to sleep.

Explicit description of your new refcounting rules, please.  What's more,
how do those non-pinning children interact with e.g. copy_tree()?

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]         ` <20150107184334.GZ22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-01-07 19:28           ` Al Viro
       [not found]             ` <20150107192807.GA22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-01-07 19:30           ` Eric W. Biederman
  1 sibling, 1 reply; 240+ messages in thread
From: Al Viro @ 2015-01-07 19:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Wed, Jan 07, 2015 at 06:43:34PM +0000, Al Viro wrote:
> On Mon, Jan 05, 2015 at 02:46:27PM -0600, Eric W. Biederman wrote:
> > Modify umount(MNT_DETACH) to keep mounts in the hash table that are
> > locked to their parent mounts, when the parent is lazily unmounted.
> > In doing this invert the reference count so that the parent holds a
> > reference to the children instead of the children holding a reference
> > to the parent.
> > 
> > Then in mntput_no_expire detach the children and in cleanup_mnt mntput
> > the children and dput the dentry they were mounted on.
> > 
> > In __detach_mounts if there are any mounts that have been unmounted
> > but still are on the list of mounts of a mountpoint, detach those
> > mounts and schedule them to be mntput and their reference to the dentry
> > to be put when it becomes safe to sleep.
> 
> Explicit description of your new refcounting rules, please.

I really hate those path_put(&p->mnt_ex_mountpoint) in the resulting code...
Can you ever get to your mntput_children() with non-NULL ->mnt_ex_mountpoint.mnt
in one of the survivors?  Looks like it shouldn't be possible at all...

What's to prevent such a survivor (not contributing to the refcount of
parent, to be dropped when parent gets killed) in the child list of a parent
that is still normally mounted?  And what happens if it's hit again, in
addition to "what does /proc/mounts of the namespace the parent's in look
like"?

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]         ` <20150107184334.GZ22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-01-07 19:28           ` Al Viro
@ 2015-01-07 19:30           ` Eric W. Biederman
  1 sibling, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-07 19:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Mon, Jan 05, 2015 at 02:46:27PM -0600, Eric W. Biederman wrote:
>> Modify umount(MNT_DETACH) to keep mounts in the hash table that are
>> locked to their parent mounts, when the parent is lazily unmounted.
>> In doing this invert the reference count so that the parent holds a
>> reference to the children instead of the children holding a reference
>> to the parent.
>> 
>> Then in mntput_no_expire detach the children and in cleanup_mnt mntput
>> the children and dput the dentry they were mounted on.
>> 
>> In __detach_mounts if there are any mounts that have been unmounted
>> but still are on the list of mounts of a mountpoint, detach those
>> mounts and schedule them to be mntput and their reference to the dentry
>> to be put when it becomes safe to sleep.
>
> Explicit description of your new refcounting rules, please.  What's more,
> how do those non-pinning children interact with e.g. copy_tree()?

The parents hold a reference on the children, and the parent keeps track
of its children through the mnt_mounts list.  The parent's reference to
a child is held until the final mntput of the parent.

As for how those mounts interact with copy_tree, they aren't designed
to.  I had overlooked that collect_mounts is weird and if the proper
race exists can be called on an unmounted tree.  So I expect the
interaction with copy_tree is buggy.  I will look at that and see what I
can do to fix that (it shouldn't be hard).  I expect I can just return
an error if the mount has been unmounted.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
  2015-01-07 18:43       ` Al Viro
       [not found]         ` <20150107184334.GZ22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-01-07 19:30         ` Eric W. Biederman
       [not found]           ` <87h9w2gzht.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  1 sibling, 1 reply; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-07 19:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Linux Containers, linux-fsdevel, Serge E. Hallyn,
	Andy Lutomirski, Chen Hanxiao, Richard Weinberger, Andrey Vagin

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Mon, Jan 05, 2015 at 02:46:27PM -0600, Eric W. Biederman wrote:
>> Modify umount(MNT_DETACH) to keep mounts in the hash table that are
>> locked to their parent mounts, when the parent is lazily unmounted.
>> In doing this invert the reference count so that the parent holds a
>> reference to the children instead of the children holding a reference
>> to the parent.
>> 
>> Then in mntput_no_expire detach the children and in cleanup_mnt mntput
>> the children and dput the dentry they were mounted on.
>> 
>> In __detach_mounts if there are any mounts that have been unmounted
>> but still are on the list of mounts of a mountpoint, detach those
>> mounts and schedule them to be mntput and their reference to the dentry
>> to be put when it becomes safe to sleep.
>
> Explicit description of your new refcounting rules, please.  What's more,
> how do those non-pinning children interact with e.g. copy_tree()?

The parents hold a reference on the children, and the parent keeps track
of its children through the mnt_mounts list.  The parent's reference to
a child is held until the final mntput of the parent.

As for how those mounts interact with copy_tree, they aren't designed
to.  I had overlooked that collect_mounts is weird and if the proper
race exists can be called on an unmounted tree.  So I expect the
interaction with copy_tree is buggy.  I will look at that and see what I
can do to fix that (it shouldn't be hard).  I expect I can just return
an error if the mount has been unmounted.

Eric




^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]             ` <20150107192807.GA22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-01-07 19:53               ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-07 19:53 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Wed, Jan 07, 2015 at 06:43:34PM +0000, Al Viro wrote:
>> On Mon, Jan 05, 2015 at 02:46:27PM -0600, Eric W. Biederman wrote:
>> > Modify umount(MNT_DETACH) to keep mounts in the hash table that are
>> > locked to their parent mounts, when the parent is lazily unmounted.
>> > In doing this invert the reference count so that the parent holds a
>> > reference to the children instead of the children holding a reference
>> > to the parent.
>> > 
>> > Then in mntput_no_expire detach the children and in cleanup_mnt mntput
>> > the children and dput the dentry they were mounted on.
>> > 
>> > In __detach_mounts if there are any mounts that have been unmounted
>> > but still are on the list of mounts of a mountpoint, detach those
>> > mounts and schedule them to be mntput and their reference to the dentry
>> > to be put when it becomes safe to sleep.
>> 
>> Explicit description of your new refcounting rules, please.
>
> I really hate those path_put(&p->mnt_ex_mountpoint) in the resulting code...
> Can you ever get to your mntput_children() with non-NULL ->mnt_ex_mountpoint.mnt
> in one of the survivors?  Looks like it shouldn't be possible at all...

> What's to prevent such a survivor (not contributing to the refcount of
> parent, to be dropped when parent gets killed) in the child list of a parent
> that is still normally mounted?

umount_tree.

In the base case we know the parent is unmounted because we were called
on the parent.

In the propagation case the code goes through one pass to identify trees
where the parent -> child (MNT_LOCKED) state allows the subtree to be
unmounted and we perform the normal "are all of the children gone" test.

That is what the entire MNT_MARK business is about.

> And what happens if it's hit again,

If I understand the question such a mount won't be hit again because the
case you describe can't happen.

> in addition to "what does /proc/mounts of the namespace the parent's in look
> like"?

These mounts are unmounted so don't show up in /proc/mounts. 

If the umount propagates somewhere the parents can't be detached from,
the propagation is discarded.  Just like we discard propagation when there
are additional children.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]           ` <87h9w2gzht.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-01-07 20:52             ` Al Viro
       [not found]               ` <20150107205239.GB22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  0 siblings, 1 reply; 240+ messages in thread
From: Al Viro @ 2015-01-07 20:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

On Wed, Jan 07, 2015 at 01:30:22PM -0600, Eric W. Biederman wrote:

> > Explicit description of your new refcounting rules, please.  What's more,
> > how do those non-pinning children interact with e.g. copy_tree()?
> 
> The parents hold a reference on the children, and the parent keeps track
> of its children through the mnt_mounts list.  The parent's reference to
> a child is held until the final mntput of the parent.

This is obvious crap.  It doesn't match the resulting code, not to mention
making no sense whatsoever.

If you (with your resulting tree) mount ten filesystems on /tmp/[0-9],
you will have refcount of vfsmount on /tmp incremented by 10.  Moreover,
you really don't want /tmp *not* to be busy after having done that.

The current rules are
	* take the number of external references
	* add 1 if it's reachable from hash table
with mnt->mnt_parent being a counting reference unless it points to mnt
itself.
One extra wart is namespace_unlock() treatment of ex-mountpoints - we
have decremented their refcounts early (during umount_tree()) to avoid
false positives on "is it busy" checks, but once we are done with calculating
the set of victims we want proper ordering on the fs shutdown vs dput() of
mountpoints, so namespace_unlock() starts with unrolling these decrements
and then replays them in proper order.

You have modified that, but your description above doesn't match what you
are doing.  Moreover, in absence of those locked-and-lazy you *still* have
the same rules as before.  Full description of the new rules, please.
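
The current rules as stated above can be sketched as a toy userspace
model: the expected count is the external references, plus 1 if the mount
is reachable from the hash table, plus 1 for every other mount whose
parent pointer targets it. This is only an illustration under stated
assumptions, not kernel code. struct toy_mount and both functions are
hypothetical names, and counting the namespace's reference to its root as
an external reference is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct mount; every name here is hypothetical. */
struct toy_mount {
	struct toy_mount *parent; /* points to itself for a root */
	int external_refs;        /* references other than ->mnt_parent */
	bool hashed;              /* reachable from the mount hash table */
};

/* Expected refcount under the rules as stated above: external references,
 * plus 1 if hashed, plus 1 for each other mount whose parent is m. */
static int toy_expected_count(const struct toy_mount *m,
			      const struct toy_mount *all, size_t n)
{
	int count = m->external_refs + (m->hashed ? 1 : 0);
	size_t i;

	for (i = 0; i < n; i++)
		if (&all[i] != m && all[i].parent == m)
			count++;
	return count;
}

/* Root, /tmp on root, ten mounts on /tmp/[0-9]; returns /tmp's count. */
static int toy_tmp_with_ten_children(void)
{
	struct toy_mount all[12] = { { 0 } };
	size_t i;

	all[0].parent = &all[0];      /* namespace root, self-parented */
	all[0].external_refs = 1;     /* assumed: namespace's ref to its root */
	all[1].parent = &all[0];      /* /tmp */
	all[1].hashed = true;
	for (i = 2; i < 12; i++) {    /* /tmp/0 .. /tmp/9 */
		all[i].parent = &all[1];
		all[i].hashed = true;
	}
	return toy_expected_count(&all[1], all, 12);
}
```

In this model /tmp ends up with a count of 11: one for being hashed and
ten from the children mounted on it, which is the "incremented by 10"
case described above.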

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]               ` <20150107205239.GB22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-01-07 21:51                 ` Eric W. Biederman
  2015-01-08  0:22                   ` Al Viro
       [not found]                   ` <87iogi8dka.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 2 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-07 21:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Wed, Jan 07, 2015 at 01:30:22PM -0600, Eric W. Biederman wrote:
>
>> > Explicit description of your new refcounting rules, please.  What's more,
>> > how do those non-pinning children interact with e.g. copy_tree()?
>> 
>> The parents hold a reference on the children, and the parent keeps track
>> of its children through the mnt_mounts list.  The parent's reference to
>> a child is held until the final mntput of the parent.
>
> This is obvious crap.  It doesn't match the resulting code, not to mention
> making no sense whatsoever.
>
> If you (with your resulting tree) mount ten filesystems on /tmp/[0-9],
> you will have refcount of vfsmount on /tmp incremented by 10.  Moreover,
> you really don't want /tmp *not* to be busy after having done that.

You are complaining that I did not describe the part of the rules that
are not new?

> The current rules are
> 	* take the number of external references
> 	* add 1 if it's reachable from hash table
> with mnt->mnt_parent being a counting reference unless it points to mnt
> itself.
> One extra wart is namespace_unlock() treatment of ex-mountpoints - we
> have decremented their refcounts early (during umount_tree()) to avoid
> false positives on "is it busy" checks, but once we are done with calculating
> the set of victims we want proper ordering on the fs shutdown vs dput() of
> mountpoints, so namespace_unlock() starts with unrolling these decrements
> and then replays them in proper order.
>
> You have modified that, but your description above doesn't match what you
> are doing.  Moreover, in absence of those locked-and-lazy you *still* have
> the same rules as before.  Full description of the new rules, please.

Al, I think you are being pig headed to be pig headed; you don't seem to be
trying to understand, so I don't know if this conversation will be
productive.  I did describe how the rules change in the specific case
that they change, which is what I thought you were asking for.

I disagree with how you have described the current rules.

Currently the reference count on a struct mount is:

- incremented by 1 for every child listed in mnt_mounts.
- incremented by 1 for every ordinary user that does not know
  the intimate details of how mounts are implemented.
- incremented by 1 if the mount is on the mnt_list and is mounted.
- Mounts that have been unmounted do not have children.

What I have introduced is a state where a mount that is unmounted has
children, and the connection between the mount and those children
remains in the mount hash table.

In this new case it becomes possible for a mount with children to reach
the final mntput.

In this new case an unmounted mount with children that reaches mntput
will hold a reference to each of its children until it reaches mntput,
and there the reference will be dropped.

To get into that state in umount_tree the code drops the reference to the
parent from children as normal.  The code repairs the child list which
was torn down early.  The code takes mnt_list and sets mnt_ns = NULL so
the mount is clearly unmounted.  The reference count from being mounted
is given to the mount parent.

At the parent's final mntput the children are walked through and removed
from the mount hash table, and in cleanup_mnt where we don't have an
ultra deep stack and are not holding locks, the dentry is dput and
the child mount is mntput.

For fs/namespace.c this is not a particularly complicated scenario, and not
particularly weird.  And it is a scenario that matches the description:
a parent holds a reference to the child instead of the other way around.
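
The hand-over described above can be sketched as a toy userspace model.
It is only an illustration: struct toy_mnt, toy_put() and
toy_lazy_umount_locked() are hypothetical names, the fixed-size kept[]
array stands in for the mnt_mounts list, and the recursion in toy_put()
replaces the deferral to cleanup_mnt that the patch description mentions.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_KEPT 4

/* Toy stand-in for struct mount; every name here is hypothetical. */
struct toy_mnt {
	int count;                          /* reference count */
	struct toy_mnt *parent;
	struct toy_mnt *kept[TOY_MAX_KEPT]; /* children this mount holds a ref on */
	int nkept;
	bool freed;
};

/* Drop one reference; on the final put, drop the references held on kept
 * children. The sketch recurses where the patch defers to cleanup_mnt. */
static void toy_put(struct toy_mnt *m)
{
	int i;

	assert(m->count > 0);
	if (--m->count > 0)
		return;
	for (i = 0; i < m->nkept; i++)
		toy_put(m->kept[i]);
	m->nkept = 0;
	m->freed = true;
}

/* Lazy unmount of a locked child: the child stops pinning its parent, and
 * the reference the child held for being mounted is handed to the parent. */
static void toy_lazy_umount_locked(struct toy_mnt *child)
{
	struct toy_mnt *parent = child->parent;

	assert(parent->nkept < TOY_MAX_KEPT);
	parent->count--;                       /* child -> parent ref dropped */
	parent->kept[parent->nkept++] = child; /* parent now owns child's ref */
}

/* Returns true if the child survives until the parent's final put and is
 * released by it. */
static bool toy_child_dies_with_parent(void)
{
	struct toy_mnt parent = { .count = 2 }; /* one external user + one child */
	struct toy_mnt child = { .count = 1, .parent = &parent }; /* mounted */
	bool alive_before;

	toy_lazy_umount_locked(&child);
	alive_before = !parent.freed && !child.freed && parent.count == 1;
	toy_put(&parent);                       /* the external user lets go */
	return alive_before && parent.freed && child.freed;
}
```

In this model the child no longer keeps the parent busy, yet it stays
alive for exactly as long as something else holds the parent.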

Now that I have fleshed out my description if you continue to think it
is obvious crap and doesn't match the resulting code, your analysis
is bollocks.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]                   ` <87iogi8dka.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-01-08  0:22                     ` Al Viro
  0 siblings, 0 replies; 240+ messages in thread
From: Al Viro @ 2015-01-08  0:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

On Wed, Jan 07, 2015 at 03:51:17PM -0600, Eric W. Biederman wrote:

> I disagree with how you have described the current rules.

In which situation that description does not match what we do?

> Currently the reference count on a struct mount is:
> 
> - incremented by 1 for every child listed in mnt_mounts.

Yes - their ->mnt_parent does it.

> - incremented by 1 for every ordinary user that does not know
>   the intimate details of how mounts are implemented.

That'd be references other than ->mnt_parent.

> - incremented by 1 if the mount is on the mnt_list and is mounted.

== reachable from hashtable, except that you include the reference from
namespace to its root mount into that; in my description that goes as
external reference.

> - Mounts that have been unmounted do not mounted do not have children.

Meaning?

> What I have introduced is a state where a mount that is unmounted has
> children, and the connection between the mount and those children
> remains in the mount hash table.
> 
> In this new case it becomes possible for a mount with children to reach
> the final mntput.

> In this new case an unmounted mount with children that reaches mntput
> will hold a reference to each of its children until it reaches mntput,
> and there the reference will be dropped.
> 
> To get into that state in umount_tree the code drops the reference to the
> parent from children as normal.  The code repairs the child list which
> was torn down early.  The code takes mnt_list and sets mnt_ns = NULL so
> the mount is clearly unmounted.  The reference count from being mounted
> is given to the mount parent.
> 
> 
> At the parent's final mntput the children are walked through and removed
> from the mount hash table, and in cleanup_mnt where we don't have an
> ultra deep stack and are not holding locks, the dentry is dput and
> the child mount is mntput.

You are describing what your code does.  Thank you, but I *can* read C just
as easy as your text (easier, TBH).  What I'm asking for is data structures
invariants.  As in, "outside of such and such locks, the following predicates
are true"...

If I understand what you are saying, you have
	* subset of mounts (ones that had MNT_LOCKED and went through
MNT_DETACH umount).
	* mounts in that subset are
		* hashed
		* present in their parents' lists of children
		* do _NOT_ contribute to parents' refcounts
		* do _NOT_ participate in event propagation
		* removed from the lists of children and dropped upon
parent's final mntput (dropping being done on shallow stack, just before the
fs shutdown of parent).

I can buy that, *IF* one could add
		* are guaranteed to be unreachable from any namespace root.
to the above.  And yes, your logics around marking them is probably enough
for that.  However, if we go that way, I don't understand why do we need
to involve MNT_LOCKED.

Look: your new rules for that subset are actually not that different from
the old ones - they *are* hashed, so by the old rules we'd added 1 to
refcount anyway.  The only period when that is not true is from __detach_mnt()
in mntput_no_expire() to cleanup_mnt().  The real difference, AFAICS, is
that their ->mnt_parent does not contribute to refcount of its targets.

If the root of such a detached tree gets the last reference dropped,
it gets killed as usual *and* its children become roots of detached trees.
That is protected by mount hash lock (which prevents somebody inside a subtree
from traversing .. and grabbing a reference to root after we'd started to
detach stuff from it).

Fine, that makes sense, but IMO it would make a lot of sense to have a variant
of MNT_DETACH that would do that for *all* mounts, locked or not.  I'd be
tempted to make MNT_DETACH itself do that, except that the current behaviour
(dropping eagerly) is useful in situations like "we have a stuck filesystem
and a bunch of stuff under it; let's do umount -l on that shit, hopefully
the non-busy filesystems mounted down there will get a clean shutdown out
of that".

I seriously dislike your use of path_put() on mnt_ex_mountpoint - it's
actively misleading.  We bloody better _not_ have non-NULL
->mnt_ex_mountpoint.mnt in your cleanup_mnt(), for obvious reasons.  So
it's just dput() + do something random if non-NULL mnt found.

Not sure if I like the use of mnt_child between mntput_no_expire() and
cleanup_mnt() - it's probably safe, but the fewer lists we modify outside of
mount hash lock, the better; hell knows, I'll need to stare at that code
a bit more.  FWIW, AFAICS the refcount rules with your variant are
	* external references contribute
	* mnt_parent contributes unless it points to ourselves *or* mnt_ns is
NULL
	* reachable from mount hash => add 1
	* in addition to the wart around namespace_unlock(), we have a similar
wart between mntput_no_expire() and cleanup_mnt(), only there we thread the
suckers on mnt_child instead of mnt_list and use slightly different logics to
prevent shutdown of parent fs before the dput(mountpoint).
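
The first three refcount rules listed above for the variant can be
sketched as a toy userspace model. This is a sketch under stated
assumptions, not kernel code: all names are hypothetical, has_ns models
"mnt_ns is not NULL", and reading "mnt_ns is NULL" as referring to the
child whose ->mnt_parent is in question is an assumption of the sketch.
The transient warts around namespace_unlock() and cleanup_mnt() are not
modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct mount; every name here is hypothetical. */
struct toy_vmnt {
	struct toy_vmnt *parent; /* points to itself for a root */
	int external_refs;
	bool hashed;             /* reachable from the mount hash */
	bool has_ns;             /* models mnt_ns != NULL */
};

/* Does c's parent pointer count toward the parent's refcount? Assumed
 * reading: not for a root, and not once c's namespace pointer is gone. */
static bool toy_pins_parent(const struct toy_vmnt *c)
{
	return c->parent != c && c->has_ns;
}

/* External references, plus 1 if hashed, plus 1 per pinning child. */
static int toy_variant_count(const struct toy_vmnt *m,
			     const struct toy_vmnt *all, size_t n)
{
	int count = m->external_refs + (m->hashed ? 1 : 0);
	size_t i;

	for (i = 0; i < n; i++)
		if (&all[i] != m && all[i].parent == m &&
		    toy_pins_parent(&all[i]))
			count++;
	return count;
}

/* One parent with a single external reference and two hashed children;
 * returns the parent's expected count. */
static int toy_parent_count(bool children_in_ns)
{
	struct toy_vmnt all[3] = { { 0 } };
	size_t i;

	all[0].parent = &all[0];
	all[0].external_refs = 1;
	for (i = 1; i < 3; i++) {
		all[i].parent = &all[0];
		all[i].hashed = true;
		all[i].has_ns = children_in_ns;
	}
	return toy_variant_count(&all[0], all, 3);
}
```

With ordinary children the parent's count is 3; with locked-and-lazy
survivors (hashed, but with no namespace) it is 1, so the survivors stay
on the child list without keeping the parent busy.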

I really hate synchronize_rcu() in namespace_unlock() ;-/  Not your fault,
but it's a PITA every time one does analysis of all that code...  It's
about legitimize_mnt() vs. umount() - we want to make sure that any
transient bumps of refcount from attempts to legitimize a lazy reference
to vfsmount getting killed by umount() will be undone *before* we get to
mntput_no_expire() in namespace_unlock().  It's only an issue for sync
umounts, so we are fine here, but it definitely needs commenting upon -
some vfsmounts *can* escape it with that approach, but they are not going
to be marked sync-umount, so we don't need to worry about them...

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
  2015-01-07 21:51                 ` Eric W. Biederman
@ 2015-01-08  0:22                   ` Al Viro
  2015-01-08  3:02                     ` Al Viro
       [not found]                     ` <20150108002227.GC22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
       [not found]                   ` <87iogi8dka.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  1 sibling, 2 replies; 240+ messages in thread
From: Al Viro @ 2015-01-08  0:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-fsdevel, Serge E. Hallyn,
	Andy Lutomirski, Chen Hanxiao, Richard Weinberger, Andrey Vagin,
	Linus Torvalds

On Wed, Jan 07, 2015 at 03:51:17PM -0600, Eric W. Biederman wrote:

> I disagree with how you have described the current rules.

In which situation that description does not match what we do?

> Currently the reference count on a struct mount is:
> 
> - incremented by 1 for every child listed in mnt_mounts.

Yes - their ->mnt_parent does it.

> - incremented by 1 for every ordinary user that does not know
>   the intimate details of how mounts are implemented.

That'd be references other than ->mnt_parent.

> - incremented by 1 if the mount is on the mnt_list and is mounted.

== reachable from hashtable, except that you include the reference from
namespace to its root mount into that; in my description that goes as
external reference.

> - Mounts that have been unmounted do not mounted do not have children.

Meaning?

> What I have introduced is a state where a mount that is unmounted has
> children, and the connection between the mount and those children
> remains in the mount hash table.
> 
> In this new case it becomes possible for a mount with children to reach
> the final mntput.

> In this new case an unmounted mount with children that reaches mntput
> will hold a reference to each of its children until it reaches mntput,
> and there the reference will be dropped.
> 
> To get into that state in umount_tree the code drops the reference to the
> parent from children as normal.  The code repairs the child list which
> was torn down early.  The code takes mnt_list and sets mnt_ns = NULL so
> the mount is clearly unmounted.  The reference count from being mounted
> is given to the mount parent.
> 
> 
> At the parent's final mntput the children are walked through and removed
> from the mount hash table, and in cleanup_mnt where we don't have an
> ultra deep stack and are not holding locks, the dentry is dput and
> the child mount is mntput.

You are describing what your code does.  Thank you, but I *can* read C just
as easy as your text (easier, TBH).  What I'm asking for is data structures
invariants.  As in, "outside of such and such locks, the following predicates
are true"...

If I understand what you are saying, you have
	* subset of mounts (ones that had MNT_LOCKED and went through
MNT_DETACH umount).
	* mounts in that subset are
		* hashed
		* present in their parents' lists of children
		* do _NOT_ contribute to parents' refcounts
		* do _NOT_ participate in event propagation
		* removed from the lists of children and dropped upon
parent's final mntput (dropping being done on shallow stack, just before the
fs shutdown of parent).

I can buy that, *IF* one could add
		* are guaranteed to be unreachable from any namespace root.
to the above.  And yes, your logics around marking them is probably enough
for that.  However, if we go that way, I don't understand why do we need
to involve MNT_LOCKED.

Look: your new rules for that subset are actually not that different from
the old ones - they *are* hashed, so by the old rules we'd added 1 to
refcount anyway.  The only period when that is not true is from __detach_mnt()
in mntput_no_expire() to cleanup_mnt().  The real difference, AFAICS, is
that their ->mnt_parent does not contribute to refcount of its targets.

If the root of such a detached tree gets the last reference dropped,
it gets killed as usual *and* its children become roots of detached trees.
That is protected by mount hash lock (which prevents somebody inside a subtree
from traversing .. and grabbing a reference to root after we'd started to
detach stuff from it).

Fine, that makes sense, but IMO it would make a lot of sense to have a variant
of MNT_DETACH that would do that for *all* mounts, locked or not.  I'd be
tempted to make MNT_DETACH itself do that, except that the current behaviour
(dropping eagerly) is useful in situations like "we have a stuck filesystem
and a bunch of stuff under it; let's do umount -l on that shit, hopefully
the non-busy filesystems mounted down there will get a clean shutdown out
of that".

I seriously dislike your use of path_put() on mnt_ex_mountpoint - it's
actively misleading.  We bloody better _not_ have non-NULL
->mnt_ex_mountpoint.mnt in your cleanup_mnt(), for obvious reasons.  So
it's just dput() + do something random if non-NULL mnt found.

Not sure if I like the use of mnt_child between mntput_no_expire() and
cleanup_mnt() - it's probably safe, but the fewer lists we modify outside of
mount hash lock, the better; hell knows, I'll need to stare at that code
a bit more.  FWIW, AFAICS the refcount rules with your variant are
	* external references contribute
	* mnt_parent contributes unless it points to ourselves *or* mnt_ns is
NULL
	* reachable from mount hash => add 1
	* in addition to the wart around namespace_unlock(), we have a similar
wart between mntput_no_expire() and cleanup_mnt(), only there we thread the
suckers on mnt_child instead of mnt_list and use slightly different logic to
prevent shutdown of parent fs before the dput(mountpoint).
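
The first three rules in that list can be written down as a small userspace model.  This is only a sketch of one reading of them, not kernel code: struct model_mount, parent_link_counts() and expected_refcount() are hypothetical names, and the assumption that "mnt_ns is NULL" refers to the child's own mnt_ns is an interpretation of the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; field names echo struct mount only loosely. */
struct model_mount {
	struct model_mount *mnt_parent;	/* points to itself for a root */
	const void *mnt_ns;		/* NULL once out of any namespace */
	bool hashed;			/* reachable from the mount hash */
	int external_refs;		/* references held from outside */
};

/*
 * Does m's ->mnt_parent contribute to the refcount of its target?
 * Assumed reading: not if it points to m itself or m->mnt_ns is NULL.
 */
static bool parent_link_counts(const struct model_mount *m)
{
	return m->mnt_parent != m && m->mnt_ns != NULL;
}

/* Refcount that the rules predict for target, given every mount in all[]. */
static int expected_refcount(const struct model_mount *target,
			     const struct model_mount *all, size_t n)
{
	int count;
	size_t i;

	assert(target != NULL);
	count = target->external_refs;		/* external references */
	if (target->hashed)			/* reachable from mount hash */
		count++;
	for (i = 0; i < n; i++)			/* contributing ->mnt_parent */
		if (all[i].mnt_parent == target && parent_link_counts(&all[i]))
			count++;
	return count;
}
```

With a root that has one ordinary child and one "locked-and-lazy" child (hashed, ->mnt_parent still set, mnt_ns == NULL), only the ordinary child adds to the root's count, while each child still gets its +1 from being hashed.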

I really hate synchronize_rcu() in namespace_unlock() ;-/  Not your fault,
but it's a PITA every time one does analysis of all that code...  It's
about legitimize_mnt() vs. umount() - we want to make sure that any
transient bumps of refcount from attempts to legitimize a lazy reference
to vfsmount getting killed by umount() will be undone *before* we get to
mntput_no_expire() in namespace_unlock().  It's only an issue for sync
umounts, so we are fine here, but it definitely needs commenting upon -
some vfsmounts *can* escape it with that approach, but they are not going
to be marked sync-umount, so we don't need to worry about them...

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]                     ` <20150108002227.GC22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-01-08  3:02                       ` Al Viro
  2015-01-08 22:32                       ` Al Viro
  1 sibling, 0 replies; 240+ messages in thread
From: Al Viro @ 2015-01-08  3:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

On Thu, Jan 08, 2015 at 12:22:27AM +0000, Al Viro wrote:
> Not sure if I like the use of mnt_child between mntput_no_expire() and
> cleanup_mnt() - it's probably safe, but the fewer lists we modify outside of
> mount hash lock, the better; hell knows, I'll need to stare at that code
> a bit more.  FWIW, AFAICS the refcount rules with your variant are
> 	* external references contribute
> 	* mnt_parent contributes unless it points to ourselves *or* mnt_ns is
> NULL
> 	* reachable from mount hash => add 1
> 	* in addition to the wart around namespace_unlock(), we have a similar
> wart between mntput_no_expire() and cleanup_mnt(), only there we thread the
> suckers on mnt_child instead of mnt_list and use slightly different logics to
> prevent shutdown of parent fs before the dput(mountpoint).

BTW, why do you use detach_mnt() in __detach_mounts()?  Unless I'm missing
something really subtle, __detach_mnt() is the right thing there...

And while we are at it, all other callers of detach_mnt() are followed by
attach_mnt() within the same namespace *and* all attach_mnt() follow
detach_mnt().  So why bother with mnt_list in either?  Or, put it
another way, why have __detach_mnt() separate from detach_mnt()?

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
  2015-01-08  0:22                   ` Al Viro
@ 2015-01-08  3:02                     ` Al Viro
       [not found]                       ` <20150108030229.GD22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
       [not found]                     ` <20150108002227.GC22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  1 sibling, 1 reply; 240+ messages in thread
From: Al Viro @ 2015-01-08  3:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-fsdevel, Serge E. Hallyn,
	Andy Lutomirski, Chen Hanxiao, Richard Weinberger, Andrey Vagin,
	Linus Torvalds

On Thu, Jan 08, 2015 at 12:22:27AM +0000, Al Viro wrote:
> Not sure if I like the use of mnt_child between mntput_no_expire() and
> cleanup_mnt() - it's probably safe, but the fewer lists we modify outside of
> mount hash lock, the better; hell knows, I'll need to stare at that code
> a bit more.  FWIW, AFAICS the refcount rules with your variant are
> 	* external references contribute
> 	* mnt_parent contributes unless it points to ourselves *or* mnt_ns is
> NULL
> 	* reachable from mount hash => add 1
> 	* in addition to the wart around namespace_unlock(), we have a similar
> wart between mntput_no_expire() and cleanup_mnt(), only there we thread the
> suckers on mnt_child instead of mnt_list and use slightly different logics to
> prevent shutdown of parent fs before the dput(mountpoint).

BTW, why do you use detach_mnt() in __detach_mounts()?  Unless I'm missing
something really subtle, __detach_mnt() is the right thing there...

And while we are at it, all other callers of detach_mnt() are followed by
attach_mnt() within the same namespace *and* all attach_mnt() follow
detach_mnt().  So why bother with mnt_list in either?  Or, put it
another way, why have __detach_mnt() separate from detach_mnt()?

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]                       ` <20150108030229.GD22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-01-08  3:11                         ` Al Viro
  0 siblings, 0 replies; 240+ messages in thread
From: Al Viro @ 2015-01-08  3:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

On Thu, Jan 08, 2015 at 03:02:29AM +0000, Al Viro wrote:
> BTW, why do you use detach_mnt() in __detach_mounts()?  Unless I'm missing
> something really subtle, __detach_mnt() is the right thing there...
> 
> And while we are at it, all other callers of detach_mnt() are followed by
> attach_mnt() within the same namespace *and* all attach_mnt() follow
> detach_mnt().  So why bother with mnt_list in either?  Or, put it
> another way, why have __detach_mnt() separate from detach_mnt()?

I really don't like the look of __detach_mounts() after those changes.
Suppose we have a single "locked-and-lazy" vfsmount mounted on that
dentry.  With nothing mounted under it.  Your loop will do absolutely
nothing to it - it'll just keep spinning.

Why are you doing anything to the stuff mounted under that one, anyway?
It's this sucker you want to detach and kill...  Normal case (umount_tree())
will detach the vfsmount we are giving it; this one should have the same
effect...

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]                     ` <20150108002227.GC22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-01-08  3:02                       ` Al Viro
@ 2015-01-08 22:32                       ` Al Viro
       [not found]                         ` <20150108223212.GF22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-01-09 20:31                         ` Al Viro
  1 sibling, 2 replies; 240+ messages in thread
From: Al Viro @ 2015-01-08 22:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

On Thu, Jan 08, 2015 at 12:22:27AM +0000, Al Viro wrote:

> 	* external references contribute
> 	* mnt_parent contributes unless it points to ourselves *or* mnt_ns is
> NULL
> 	* reachable from mount hash => add 1
> 	* in addition to the wart around namespace_unlock(), we have a similar
> wart between mntput_no_expire() and cleanup_mnt(), only there we thread the
> suckers on mnt_child instead of mnt_list and use slightly different logics to
> prevent shutdown of parent fs before the dput(mountpoint).

BTW, there's something I wanted to do for a while - these games with
delaying dput(mountpoint) look a whole lot like a job for mnt_pin_kill().

Look: even now we have (at the exit from umount_tree()) a bunch of remnant
mountpoints associated with ex-parents.  And we want to have them taken
out and shot before said ex-parents shut down.  The tricky part here is
ordering - at the beginning of namespace_unlock() we have a bunch of
mounts about to be dropped; quite a few might be keeping references to
the ex-mountpoints.  We want mntput() on all of those *and* we want
dput() on said ex-mountpoints.  We'd already dropped the references to
ex-parents (in umount_tree()).  The trouble being, some of those ex-mountpoints
might be on other victim mounts and we really do NOT want to drop such a victim
until all dput() within it had been done.

That's why we play those games with bumping ex-parent mount counts first,
so that dput() is immediately followed with matching mntput().

But we don't have to do it that way.  Alternative, and IMO a cleaner one,
would be to embed struct fs_pin into struct mount and, in umount_tree()
simply add that fs_pin->m_list into the ->mnt_pins of parent and ->s_list
into a global list (replacement of 'unmounted').  With ->kill() callback
dropping references to dentry of mountpoint *and* containing mount.  Then
we don't need to care about the ordering at all - we just go through the
global list of fs_pin and do ->kill() on all of them, in whatever order they
happen to be.  If mount gets dropped while there still are ex-mountpoints on
it waiting to be dropped - fine, we'll just drop them there and then (from
mnt_kill_pins()).  No games with refcounts that way, no transiently busy
mounts, which we do have right now.

And that scheme actually covers your "delayed detaching" just fine - if we
want detaching to be delayed until the parent really buys it, all we need to
do is to skip dropping the child from mnt_mounts of parent and putting fs_pin
of child on the global list.  And detach_mount() on all remaining children
at mntput_no_expire() time, as you do, only with removal from child list.

No changes needed in cleanup_mnt() that way - remaining fs_pin of children
will play out, yielding the effect you want.  

The problem with that approach (and the reason why it hadn't been done yet)
is that we'll need some careful tweaking of locking in fs/fs_pin.c.  Back
in 3.17 I really wanted to get the kernel/acct.c crap out of the way, so that
umount-on-rmdir stuff could be done safely.  So the minimal solution went in,
with plans for using it to avoid the headache with ordering left for later.
I had completely missed the possibility to do less eager dissolving of
detached trees that way (and it was a long-standing wishlist thing) back
then, or I probably would've bitten the bullet and gone for it.

What you are proposing is
	a) easily expressed with that scheme (behaviour is almost identical
to what your series does, except that we'd empty mnt_child right in
mntput_no_expire(), rather than keeping it until cleanup_mnt()) and
	b) easily generalized to MNT_DETACH with lazy dissolving; in fact,
the only difference would be to treat *all* submounts as "don't put on
global list, don't remove from child list", not just the MNT_LOCKED ones.

I'll play around with that today and tomorrow; hopefully, I'll have a postable
variant by the weekend...

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]                         ` <20150108223212.GF22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-01-09 20:31                           ` Al Viro
  0 siblings, 0 replies; 240+ messages in thread
From: Al Viro @ 2015-01-09 20:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

On Thu, Jan 08, 2015 at 10:32:13PM +0000, Al Viro wrote:
> 	a) easily expressed with that scheme (behaviour is almost identical
> to what your series does, except that we'd empty mnt_child right in
> mntput_no_expire(), rather than keeping it until cleanup_mnt()) and
> 	b) easily generalized to MNT_DETACH with lazy dissolving; in fact,
> the only difference would be to treat *all* submounts as "don't put on
> global list, don't remove from child list", not just the MNT_LOCKED ones.
> 
> I'll play around with that today and tomorrow; hopefully, I'll have a postable
> variant by the weekend...

Hmm...  Linus, what do you think of the following?

struct foo {
	int done;
	wait_queue_head_t wait;
	...
};

void kill_foo(struct foo *p)
{
	wait_queue_t wait;
	wait.flags = WQ_FLAG_EXCLUSIVE;
	wait.private = current;
	wait.func = autoremove_wake_function;
	spin_lock_irq(&p->wait.lock);
	if (likely(!p->done)) {
		p->done = -1;
		spin_unlock_irq(&p->wait.lock);
		rcu_read_unlock();
		/* do cleanup */
		...
		/* remove references to *p */
		...
		spin_lock_irq(&p->wait.lock);
		p->done = 1;	
		wake_up_locked(&p->wait);
		spin_unlock_irq(&p->wait.lock);
		/* RCU-schedule freeing of p */
		...
		return;
	}
	if (p->done > 0) {
		spin_unlock_irq(&p->wait.lock);
		rcu_read_unlock();
		return;
	}
	__add_wait_queue_tail(&p->wait, &wait);
	while (1) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		spin_unlock_irq(&p->wait.lock);
		rcu_read_unlock();
		schedule();
		rcu_read_lock();
		if (likely(list_empty(&wait.task_list)))
			break;
		/* OK, we know p couldn't have been freed yet */
		spin_lock_irq(&p->wait.lock);
		if (p->done > 0) {
			spin_unlock_irq(&p->wait.lock);
			break;
		}
	}
	rcu_read_unlock();
}

AFAICS, that ought to be safe - the first caller makes sure that everybody
else will wait for it to finish, everybody else (ones who come via references
before the first one has removed them) will be waiting for the first one
to do wakeup *and* will take care not to touch the victim unless they know
it's still there.

Do you see any problems with the above?  I wonder if I ended up open-coding
something already existing in there...

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
  2015-01-08 22:32                       ` Al Viro
       [not found]                         ` <20150108223212.GF22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-01-09 20:31                         ` Al Viro
  2015-01-09 21:30                           ` Eric W. Biederman
       [not found]                           ` <20150109203126.GI22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  1 sibling, 2 replies; 240+ messages in thread
From: Al Viro @ 2015-01-09 20:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-fsdevel, Serge E. Hallyn,
	Andy Lutomirski, Chen Hanxiao, Richard Weinberger, Andrey Vagin,
	Linus Torvalds

On Thu, Jan 08, 2015 at 10:32:13PM +0000, Al Viro wrote:
> 	a) easily expressed with that scheme (behaviour is almost identical
> to what your series does, except that we'd empty mnt_child right in
> mntput_no_expire(), rather than keeping it until cleanup_mnt()) and
> 	b) easily generalized to MNT_DETACH with lazy dissolving; in fact,
> the only difference would be to treat *all* submounts as "don't put on
> global list, don't remove from child list", not just the MNT_LOCKED ones.
> 
> I'll play around with that today and tomorrow; hopefully, I'll have a postable
> variant by the weekend...

Hmm...  Linus, what do you think of the following?

struct foo {
	int done;
	wait_queue_head_t wait;
	...
};

void kill_foo(struct foo *p)
{
	wait_queue_t wait;
	wait.flags = WQ_FLAG_EXCLUSIVE;
	wait.private = current;
	wait.func = autoremove_wake_function;
	spin_lock_irq(&p->wait.lock);
	if (likely(!p->done)) {
		p->done = -1;
		spin_unlock_irq(&p->wait.lock);
		rcu_read_unlock();
		/* do cleanup */
		...
		/* remove references to *p */
		...
		spin_lock_irq(&p->wait.lock);
		p->done = 1;	
		wake_up_locked(&p->wait);
		spin_unlock_irq(&p->wait.lock);
		/* RCU-schedule freeing of p */
		...
		return;
	}
	if (p->done > 0) {
		spin_unlock_irq(&p->wait.lock);
		rcu_read_unlock();
		return;
	}
	__add_wait_queue_tail(&p->wait, &wait);
	while (1) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		spin_unlock_irq(&p->wait.lock);
		rcu_read_unlock();
		schedule();
		rcu_read_lock();
		if (likely(list_empty(&wait.task_list)))
			break;
		/* OK, we know p couldn't have been freed yet */
		spin_lock_irq(&p->wait.lock);
		if (p->done > 0) {
			spin_unlock_irq(&p->wait.lock);
			break;
		}
	}
	rcu_read_unlock();
}

AFAICS, that ought to be safe - the first caller makes sure that everybody
else will wait for it to finish, everybody else (ones who come via references
before the first one has removed them) will be waiting for the first one
to do wakeup *and* will take care not to touch the victim unless they know
it's still there.

Do you see any problems with the above?  I wonder if I ended up open-coding
something already existing in there...

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]                           ` <20150109203126.GI22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-01-09 21:30                             ` Eric W. Biederman
  2015-01-10  5:32                             ` Eric W. Biederman
  1 sibling, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-09 21:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Thu, Jan 08, 2015 at 10:32:13PM +0000, Al Viro wrote:
>> 	a) easily expressed with that scheme (behaviour is almost identical
>> to what your series does, except that we'd empty mnt_child right in
>> mntput_no_expire(), rather than keeping it until cleanup_mnt()) and
>> 	b) easily generalized to MNT_DETACH with lazy dissolving; in fact,
>> the only difference would be to treat *all* submounts as "don't put on
>> global list, don't remove from child list", not just the MNT_LOCKED ones.

I have no problem with treating all mounts whose parents are also being
unmounted as "don't put on global list, don't remove from child list"

Replacing the mnt_ex_mountpoint struct path with a fs_pin seems
reasonable.

We need to be careful with mounts that have parents that are not being
unmounted, with respect to the mount hash table but that just looks like
a detail in the overall scheme of things.

And if delaying the dissolution of mounts in the mount detach case is a
long held wishlist item I am all for going there.  I have a patch I was
playing with that did that, I just didn't include it because that is a
bug fix kind of thing.

>> I'll play around with that today and tomorrow; hopefully, I'll have a postable
>> variant by the weekend...
>
> Hmm...  Linus, what do you think of the following?
>
> struct foo {
> 	int done;
> 	wait_queue_head_t wait;
> 	...
> };
>

Where does the rcu_read_lock() happen?
I assume this is kill from fs_pin.kill?

> void kill_foo(struct foo *p)
> {
>         wait_queue_t wait;
> 	wait.flags = WQ_FLAG_EXCLUSIVE;
> 	wait.private = current;
> 	wait.func = autoremove_wake_function;
> 	spin_lock_irq(&p->wait.lock);
> 	if (likely(!p->done)) {
> 		p->done = -1;
> 		spin_unlock_irq(&p->wait.lock);
> 		rcu_read_unlock();
> 		/* do cleanup */
> 		...
> 		/* remove references to *p */
> 		...
> 		spin_lock_irq(&p->wait.lock);
> 		p->done = 1;	
> 		wake_up_locked(&p->wait);
> 		spin_unlock_irq(&p->wait.lock);
> 		/* RCU-schedule freeing of p */
> 		...
> 		return;
> 	}
> 	if (p->done > 0) {
> 		spin_unlock_irq(&p->wait.lock);
> 		rcu_read_unlock();
> 		return;
> 	}
> 	__add_wait_queue_tail(&p->wait, &wait);
> 	while (1) {
> 		set_current_state(TASK_UNINTERRUPTIBLE);
> 		spin_unlock_irq(&p->wait.lock);
> 		rcu_read_unlock();
> 		schedule();
> 		rcu_read_lock();
> 		if (likely(list_empty(&wait.task_list)))
> 			break;
> 		/* OK, we know p couldn't have been freed yet */
> 		spin_lock_irq(&p->wait.lock);
> 		if (p->done > 0) {
> 			spin_unlock_irq(&p->wait.lock);
> 			break;
> 		}
> 	}
> 	rcu_read_unlock();
> }
>
> AFAICS, that ought to be safe - the first caller makes sure that everybody
> else will wait for it to finish, everybody else (ones who come via references
> before the first one has removed them) will be waiting for the first one
> to do wakeup *and* will take care not to touch the victim unless they know
> it's still there.
>
> Do you see any problems with the above?  I wonder if I ended up open-coding
> something already existing in there...

Your struct foo looks an awful lot like struct completion at first
glance.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
  2015-01-09 20:31                         ` Al Viro
@ 2015-01-09 21:30                           ` Eric W. Biederman
       [not found]                             ` <87k30vwskd.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
       [not found]                           ` <20150109203126.GI22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  1 sibling, 1 reply; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-09 21:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Linux Containers, linux-fsdevel, Serge E. Hallyn,
	Andy Lutomirski, Chen Hanxiao, Richard Weinberger, Andrey Vagin,
	Linus Torvalds

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Thu, Jan 08, 2015 at 10:32:13PM +0000, Al Viro wrote:
>> 	a) easily expressed with that scheme (behaviour is almost identical
>> to what your series does, except that we'd empty mnt_child right in
>> mntput_no_expire(), rather than keeping it until cleanup_mnt()) and
>> 	b) easily generalized to MNT_DETACH with lazy dissolving; in fact,
>> the only difference would be to treat *all* submounts as "don't put on
>> global list, don't remove from child list", not just the MNT_LOCKED ones.

I have no problem with treating all mounts whose parents are also being
unmounted as "don't put on global list, don't remove from child list"

Replacing the mnt_ex_mountpoint struct path with a fs_pin seems
reasonable.

We need to be careful with mounts that have parents that are not being
unmounted, with respect to the mount hash table but that just looks like
a detail in the overall scheme of things.

And if delaying the dissolution of mounts in the mount detach case is a
long held wishlist item I am all for going there.  I have a patch I was
playing with that did that, I just didn't include it because that is a
bug fix kind of thing.

>> I'll play around with that today and tomorrow; hopefully, I'll have a postable
>> variant by the weekend...
>
> Hmm...  Linus, what do you think of the following?
>
> struct foo {
> 	int done;
> 	wait_queue_head_t wait;
> 	...
> };
>

Where does the rcu_read_lock() happen?
I assume this is kill from fs_pin.kill?

> void kill_foo(struct foo *p)
> {
>         wait_queue_t wait;
> 	wait.flags = WQ_FLAG_EXCLUSIVE;
> 	wait.private = current;
> 	wait.func = autoremove_wake_function;
> 	spin_lock_irq(&p->wait.lock);
> 	if (likely(!p->done)) {
> 		p->done = -1;
> 		spin_unlock_irq(&p->wait.lock);
> 		rcu_read_unlock();
> 		/* do cleanup */
> 		...
> 		/* remove references to *p */
> 		...
> 		spin_lock_irq(&p->wait.lock);
> 		p->done = 1;	
> 		wake_up_locked(&p->wait);
> 		spin_unlock_irq(&p->wait.lock);
> 		/* RCU-schedule freeing of p */
> 		...
> 		return;
> 	}
> 	if (p->done > 0) {
> 		spin_unlock_irq(&p->wait.lock);
> 		rcu_read_unlock();
> 		return;
> 	}
> 	__add_wait_queue_tail(&p->wait, &wait);
> 	while (1) {
> 		set_current_state(TASK_UNINTERRUPTIBLE);
> 		spin_unlock_irq(&p->wait.lock);
> 		rcu_read_unlock();
> 		schedule();
> 		rcu_read_lock();
> 		if (likely(list_empty(&wait.task_list)))
> 			break;
> 		/* OK, we know p couldn't have been freed yet */
> 		spin_lock_irq(&p->wait.lock);
> 		if (p->done > 0) {
> 			spin_unlock_irq(&p->wait.lock);
> 			break;
> 		}
> 	}
> 	rcu_read_unlock();
> }
>
> AFAICS, that ought to be safe - the first caller makes sure that everybody
> else will wait for it to finish, everybody else (ones who come via references
> before the first one has removed them) will be waiting for the first one
> to do wakeup *and* will take care not to touch the victim unless they know
> it's still there.
>
> Do you see any problems with the above?  I wonder if I ended up open-coding
> something already existing in there...

Your struct foo looks an awful lot like struct completion at first
glance.

Eric


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]                             ` <87k30vwskd.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-01-09 22:17                               ` Al Viro
       [not found]                                 ` <20150109221715.GN22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  0 siblings, 1 reply; 240+ messages in thread
From: Al Viro @ 2015-01-09 22:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

On Fri, Jan 09, 2015 at 03:30:10PM -0600, Eric W. Biederman wrote:

> Where does the rcu_read_lock() happen?
> I assume this is kill from fs_pin.kill?

This is what I'd rather have *calling* fs_pin.kill (the ... part in there
being the callback, with wakeup done as part of pin_remove(), which would
be called by ->kill()).  The thing is, I don't want mnt_pin_kill() et al.
to grab refcount on fs_pin (or for the refcount to be necessary there).

IOW, any refcounting belongs on the same level as ->kill() implementation
itself; for ex-mountpoint-related ones we'd need none whatsoever (->kill()
would do
	dput(ex-mountpoint dentry);
	pin_remove(pin);
	mntput_no_expire(containing struct mount);
and that would be it), for kernel/acct.c ones we do need some refcounting,
but only for the local reasons - note that it's playing directly with
refcount anyway, which is a pretty clear indication that we'd be better off
with fs/fs_pin.c _not_ messing with that refcount in the first place.
Getting rid of pin_put() also wouldn't hurt.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]                                 ` <20150109221715.GN22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-01-09 22:25                                   ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-09 22:25 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Fri, Jan 09, 2015 at 03:30:10PM -0600, Eric W. Biederman wrote:
>
>> Where does the rcu_read_lock() happen?
>> I assume this is kill from fs_pin.kill?
>
> This is what I'd rather have *calling* fs_pin.kill (the ... part in there
> being the callback, with wakeup done as part of pin_remove(), which would
> be called by ->kill()).  The thing is, I don't want mnt_pin_kill() et al.
> to grab refcount on fs_pin (or for the refcount to be necessary there).
>
> IOW, any refcounting belongs on the same level as ->kill() implementation
> itself; for ex-mountpoint-related ones we'd need none whatsoever (->kill()
> would do
> 	dput(ex-mountpoint dentry);
> 	pin_remove(pin);
> 	mntput_no_expire(containing struct mount);
> and that would be it), for kernel/acct.c ones we do need some refcounting,
> but only for the local reasons - note that it's playing directly with
> refcount anyway, which is a pretty clear indication that we'd be better off
> with fs/fs_pin.c _not_ messing with that refcount in the first place.
> Getting rid of pin_put() also wouldn't hurt.

Got it.  I agree the infrastructure related to fs_pin is pretty awkward
right now.

I am digging in to see if I can figure out what awkwardness you are
seeing that calls for the suggested changes, and then I will give you my
feedback.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]                           ` <20150109203126.GI22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-01-09 21:30                             ` Eric W. Biederman
@ 2015-01-10  5:32                             ` Eric W. Biederman
       [not found]                               ` <87h9vzryio.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  1 sibling, 1 reply; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-10  5:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Thu, Jan 08, 2015 at 10:32:13PM +0000, Al Viro wrote:
>> 	a) easily expressed with that scheme (behaviour is almost identical
>> to what your series does, except that we'd empty mnt_child right in
>> mntput_no_expire(), rather than keeping it until cleanup_mnt()) and
>> 	b) easily generalized to MNT_DETACH with lazy dissolving; in fact,
>> the only difference would be to treat *all* submounts as "don't put on
>> global list, don't remove from child list", not just the MNT_LOCKED ones.
>> 
>> I'll play around with that today and tomorrow; hopefully, I'll have a postable
>> variant by the weekend...
>
> Hmm...  Linus, what do you think of the following?

You are just about defining a mutex here.  The one difference is the
single use property that makes this a good cleanup primitive.

WQ_FLAG_EXCLUSIVE appears wrong.  We want to remove all of the waiters
from the list and wake them all up.  Otherwise if you have 3 different
kernel threads all trying to clean up the same structure at the
same time (say remount, mntput and acct_off) one of them will be left
waiting forever.
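
To make the wake-everybody requirement concrete, here is a rough userspace analogue using pthreads.  It is only a sketch: struct one_shot, one_shot_kill() and race_killers() are hypothetical names, and a mutex plus condition variable stand in for the wait queue.  It models the single-use "first caller cleans up, the rest wait" behaviour and deliberately leaves out the RCU side, so nothing here addresses the lifetime of the object being killed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct foo in the quoted code. */
struct one_shot {
	pthread_mutex_t lock;
	pthread_cond_t wait;
	int done;	/* 0: untouched, -1: cleanup running, 1: finished */
	int cleanups;	/* how many times the cleanup body actually ran */
};

static void one_shot_init(struct one_shot *p)
{
	pthread_mutex_init(&p->lock, NULL);
	pthread_cond_init(&p->wait, NULL);
	p->done = 0;
	p->cleanups = 0;
}

/* Returns 1 if this caller performed the cleanup, 0 if it only waited. */
static int one_shot_kill(struct one_shot *p)
{
	pthread_mutex_lock(&p->lock);
	if (p->done == 0) {
		p->done = -1;
		pthread_mutex_unlock(&p->lock);
		p->cleanups++;			/* cleanup runs unlocked */
		pthread_mutex_lock(&p->lock);
		p->done = 1;
		/* wake *all* waiters; waking only one could strand the rest */
		pthread_cond_broadcast(&p->wait);
		pthread_mutex_unlock(&p->lock);
		return 1;
	}
	while (p->done < 0)			/* cleanup still in progress */
		pthread_cond_wait(&p->wait, &p->lock);
	pthread_mutex_unlock(&p->lock);
	return 0;
}

static void *killer_thread(void *arg)
{
	return (void *)(intptr_t)one_shot_kill(arg);
}

/* Run n threads (1..16) against the same object; count the cleaners. */
static int race_killers(struct one_shot *p, int n)
{
	pthread_t tid[16];
	int i, winners = 0;

	assert(n > 0 && n <= 16);
	for (i = 0; i < n; i++)
		pthread_create(&tid[i], NULL, killer_thread, p);
	for (i = 0; i < n; i++) {
		void *ret;

		pthread_join(tid[i], &ret);
		winners += (int)(intptr_t)ret;
	}
	return winners;
}
```

With three racing callers, as in the remount/mntput/acct_off example, exactly one should do the cleanup and the other two should return once it has finished.  The broadcast is what makes that independent of how many of them were already asleep.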

I don't believe rcu anything in this function itself buys you anything,
but structuring this primitive so that it can be called from an rcu list
traversal seems interesting.  

It might be simpler to have a single fs_pin_mutex that guards all of
the paths to a fs_pin so we don't have to be this clever.  But I expect
there are weird lock ordering issues that being this smart allows us to
avoid.  Certainly a primitive that allows us to clean things up without
any locks held is easier to maintain.

> struct foo {
> 	int done;
> 	wait_queue_head_t wait;
> 	...
> };
>
> void kill_foo(struct foo *p)
> {
>         wait_queue_t wait;
> 	wait.flags = WQ_FLAG_EXCLUSIVE;
> 	wait.private = current;
> 	wait.func = autoremove_wake_function;
> 	spin_lock_irq(&p->wait.lock);
> 	if (likely(!p->done)) {
> 		p->done = -1;
> 		spin_unlock_irq(&p->wait.lock);
> 		rcu_read_unlock();
> 		/* do cleanup */
> 		...
> 		/* remove references to *p */
> 		...
> 		spin_lock_irq(&p->wait.lock);
> 		p->done = 1;	
> 		wake_up_locked(&p->wait);
> 		spin_unlock_irq(&p->wait.lock);
> 		/* RCU-schedule freeing of p */
> 		...
> 		return;
> 	}
> 	if (p->done > 0) {
> 		spin_unlock_irq(&p->wait.lock);
> 		rcu_read_unlock();
> 		return;
> 	}
> 	__add_wait_queue_tail(&p->wait, &wait);
> 	while (1) {
> 		set_current_state(TASK_UNINTERRUPTIBLE);
> 		spin_unlock_irq(&p->wait.lock);
> 		rcu_read_unlock();
> 		schedule();
> 		rcu_read_lock();
> 		if (likely(list_empty(&wait.task_list)))
> 			break;
> 		/* OK, we know p couldn't have been freed yet */
> 		spin_lock_irq(&p->wait.lock);
> 		if (p->done > 0) {
> 			spin_unlock_irq(&p->wait.lock);
> 			break;
> 		}
> 	}
> 	rcu_read_unlock();
> }
>
> AFAICS, that ought to be safe - the first caller makes sure that everybody
> else will wait for it to finish, everybody else (ones who come via references
> before the first one has removed them) will be waiting for the first one
> to do wakeup *and* will take care not to touch the victim unless they knows
> it's still there.
>
> Do you see any problems with the above?  I wonder if I ended up open-coding
> something already existing in there...

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]                               ` <87h9vzryio.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-01-10  5:51                                 ` Al Viro
       [not found]                                   ` <20150110055148.GY22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  0 siblings, 1 reply; 240+ messages in thread
From: Al Viro @ 2015-01-10  5:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

On Fri, Jan 09, 2015 at 11:32:47PM -0600, Eric W. Biederman wrote:

> I don't believe rcu anything in this function itself buys you anything,
> but structuring this primitive so that it can be called from an rcu list
> traversal seems interesting.  

???

Without RCU, what would prevent it being freed right under us?

The whole point is to avoid pinning it down - as it is, we can have
several processes call ->kill() on the same object.  The first one
would end up doing cleanup, the rest would wait *without* *affecting*
*fs_pin* *lifetime*.

Note that I'm using autoremove there for wait.func(), then in the wait
loop I check (without locks) wait.task_list being empty.  It is racy;
deliberately so.  All I really care about in there is checking that
wait.func has not been called until after rcu_read_lock().  If that is
true, we know that p->wait hadn't been woken until that point, i.e.
p hadn't reached rcu delay on the way to being freed until after our
rcu_read_lock().  Ergo, it can't get freed until we do rcu_read_unlock()
and we can safely take p->wait.lock.

RCU is very much relevant there.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]                                   ` <20150110055148.GY22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-01-11  2:00                                     ` Al Viro
  2015-01-16 18:29                                       ` Eric W. Biederman
       [not found]                                       ` <20150111020030.GF22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  0 siblings, 2 replies; 240+ messages in thread
From: Al Viro @ 2015-01-11  2:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

On Sat, Jan 10, 2015 at 05:51:48AM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 11:32:47PM -0600, Eric W. Biederman wrote:
> 
> > I don't believe rcu anything in this function itself buys you anything,
> > but structuring this primitive so that it can be called from an rcu list
> > traversal seems interesting.  
> 
> ???
> 
> Without RCU, what would prevent it being freed right under us?
> 
> The whole point is to avoid pinning it down - as it is, we can have
> several processes call ->kill() on the same object.  The first one
> would end up doing cleanup, the rest would wait *without* *affecting*
> *fs_pin* *lifetime*.
> 
> Note that I'm using autoremove there for wait.func(), then in the wait
> loop I check (without locks) wait.task_list being empty.  It is racy;
> deliberately so.  All I really care about in there is checking that
> wait.func has not been called until after rcu_read_lock().  If that is
> true, we know that p->wait hadn't been woken until that point, i.e.
> p hadn't reached rcu delay on the way to being freed until after our
> rcu_read_lock().  Ergo, it can't get freed until we do rcu_read_unlock()
> and we can safely take p->wait.lock.
> 
> RCU is very much relevant there.

FWIW, I've just pushed a completely untested tree in #experimental-fs_pin;
it definitely will be reordered, etc., probably with quite a few of the
patches from the beginning of your series mixed in, but the current tree
in there should show at least what I'm aiming at.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]                                       ` <20150111020030.GF22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-01-11  2:50                                         ` Al Viro
  2015-01-16 18:29                                         ` Eric W. Biederman
  1 sibling, 0 replies; 240+ messages in thread
From: Al Viro @ 2015-01-11  2:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

On Sun, Jan 11, 2015 at 02:00:30AM +0000, Al Viro wrote:

> FWIW, I've just pushed a completely untested tree in #experimental-fs_pin;
> it definitely will be reordered, etc., probably with quite a few of the
> patches from the beginning of your series mixed in, but the current tree
> in there should show at least what I'm aiming at.

... and shockingly enough, it even seems to work, once I added the
missing init_waitqueue_head() in pin_insert_group()...

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]                                       ` <20150111020030.GF22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-01-11  2:50                                         ` Al Viro
@ 2015-01-16 18:29                                         ` Eric W. Biederman
  1 sibling, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-16 18:29 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Sat, Jan 10, 2015 at 05:51:48AM +0000, Al Viro wrote:
>> On Fri, Jan 09, 2015 at 11:32:47PM -0600, Eric W. Biederman wrote:
>> 
>> > I don't believe rcu anything in this function itself buys you anything,
>> > but structuring this primitive so that it can be called from an rcu list
>> > traversal seems interesting.  
>> 
>> ???
>> 
>> Without RCU, what would prevent it being freed right under us?
>> 
>> The whole point is to avoid pinning it down - as it is, we can have
>> several processes call ->kill() on the same object.  The first one
>> would end up doing cleanup, the rest would wait *without* *affecting*
>> *fs_pin* *lifetime*.
>> 
>> Note that I'm using autoremove there for wait.func(), then in the wait
>> loop I check (without locks) wait.task_list being empty.  It is racy;
>> deliberately so.  All I really care about in there is checking that
>> wait.func has not been called until after rcu_read_lock().  If that is
>> true, we know that p->wait hadn't been woken until that point, i.e.
>> p hadn't reached rcu delay on the way to being freed until after our
>> rcu_read_lock().  Ergo, it can't get freed until we do rcu_read_unlock()
>> and we can safely take p->wait.lock.
>> 
>> RCU is very much relevant there.
>
> FWIW, I've just pushed a completely untested tree in #experimental-fs_pin;
> it definitely will be reordered, etc., probably with quite a few of the
> patches from the beginning of your series mixed in, but the current tree
> in there should show at least what I'm aiming at.

I have merged the work you have been doing and what I have been doing
and posted it to a branch #for-testing of my user-namespace.git tree.

And yes I managed to make the core of the pin primitive not care about
rcu, and I think I will need that property to clean up some of the
weirdness that I still see with using fs_pin.

pin_insert does not wind up being a clean primitive: adding to both
lists at the same time does not lead to particularly clean or
obvious locking rules, or to a clean locking implementation.

Still the code works and is a good starting point for further discussion
and thinking.  I am posting the code while I go off to see if I can spot
better ways to clean some of these things up.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
  2015-01-11  2:00                                     ` Al Viro
@ 2015-01-16 18:29                                       ` Eric W. Biederman
       [not found]                                       ` <20150111020030.GF22149-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  1 sibling, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-01-16 18:29 UTC (permalink / raw)
  To: Al Viro
  Cc: Linux Containers, linux-fsdevel, Serge E. Hallyn,
	Andy Lutomirski, Chen Hanxiao, Richard Weinberger, Andrey Vagin,
	Linus Torvalds

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Sat, Jan 10, 2015 at 05:51:48AM +0000, Al Viro wrote:
>> On Fri, Jan 09, 2015 at 11:32:47PM -0600, Eric W. Biederman wrote:
>> 
>> > I don't believe rcu anything in this function itself buys you anything,
>> > but structuring this primitive so that it can be called from an rcu list
>> > traversal seems interesting.  
>> 
>> ???
>> 
>> Without RCU, what would prevent it being freed right under us?
>> 
>> The whole point is to avoid pinning it down - as it is, we can have
>> several processes call ->kill() on the same object.  The first one
>> would end up doing cleanup, the rest would wait *without* *affecting*
>> *fs_pin* *lifetime*.
>> 
>> Note that I'm using autoremove there for wait.func(), then in the wait
>> loop I check (without locks) wait.task_list being empty.  It is racy;
>> deliberately so.  All I really care about in there is checking that
>> wait.func has not been called until after rcu_read_lock().  If that is
>> true, we know that p->wait hadn't been woken until that point, i.e.
>> p hadn't reached rcu delay on the way to being freed until after our
>> rcu_read_lock().  Ergo, it can't get freed until we do rcu_read_unlock()
>> and we can safely take p->wait.lock.
>> 
>> RCU is very much relevant there.
>
> FWIW, I've just pushed a completely untested tree in #experimental-fs_pin;
> it definitely will be reordered, etc., probably with quite a few of the
> patches from the beginning of your series mixed in, but the current tree
> in there should show at least what I'm aiming at.

I have merged the work you have been doing and what I have been doing
and posted it to a branch #for-testing of my user-namespace.git tree.

And yes I managed to make the core of the pin primitive not care about
rcu, and I think I will need that property to clean up some of the
weirdness that I still see with using fs_pin.

pin_insert does not wind up being a clean primitive: adding to both
lists at the same time does not lead to particularly clean or
obvious locking rules, or to a clean locking implementation.

Still the code works and is a good starting point for further discussion
and thinking.  I am posting the code while I go off to see if I can spot
better ways to clean some of these things up.

Eric


^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH review 0/19] Locked mount and loopback mount fixes
       [not found]     ` <87mw5xq7lt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                         ` (10 preceding siblings ...)
  2015-01-05 20:46       ` [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts Eric W. Biederman
@ 2015-04-03  1:53       ` Eric W. Biederman
  11 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:53 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Omar Sandoval,
	Willy Tarreau

Way back in October Andrey Vagin reported that umount(MNT_DETACH) could
be used to defeat MNT_LOCKED.  As I worked to fix this I discovered
that combined with mount propagation and an appropriate selection of
shared subtrees a reference to a directory on an unmounted filesystem is
not necessary.

That MNT_DETACH is allowed in user namespace in a form that can break
MNT_LOCKED comes from my early misunderstanding of what MNT_DETACH does.

To avoid breaking existing userspace the conflict between MNT_DETACH and
MNT_LOCKED is fixed by leaving mounts that are locked to their parents
in the mount hash table until the last reference goes away.

While investigating this issue I also found an issue with
__detach_mounts.  The code was unnecessarily and incorrectly triggering
mount propagation, resulting in too many mounts going away when a
directory is deleted, and too many cpu cycles being burned while doing
that.

Looking some more I realized that by only keeping MNT_LOCKED mounts
connected, __detach_mounts still had the potential to leak information,
so I tweaked the code to keep everything locked together that possibly
could be.

In the middle of all of this bug hunting and fixing it was reported that
with a strategically placed rename, ".." on bind mounts could go up
past the root of the bind mount.  This turned out to be very easy to
understand and test for, but tricky to actually fix in a way that would
not slow down path name lookups in the common case.

These fixes are against v4.0-rc6, which has all of Al's new fs_pin
code.

I have tested the code and I don't see any issues but as I am human I
may have missed a corner case or two.  So any feedback is appreciated.

For those who like to see everything in a single tree the code is at:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

Eric W. Biederman (19):
      mnt: Use hlist_move_list in namespace_unlock
      mnt: Improve the umount_tree flags
      mnt: Don't propagate umounts in __detach_mounts
      mnt: In umount_tree reuse mnt_list instead of mnt_hash
      mnt: Add MNT_UMOUNT flag
      mnt: Delay removal from the mount hash.
      mnt: On an unmount propagate clearing of MNT_LOCKED
      mnt: Don't propagate unmounts to locked mounts
      mnt: Fail collect_mounts when applied to unmounted mounts
      mnt: Factor out unhash_mnt from detach_mnt and umount_tree
      mnt: Factor umount_mnt from umount_tree
      fs_pin: Allow for the possibility that m_list or s_list go unused.
      mnt: Honor MNT_LOCKED when detaching mounts
      mnt: Fix the error check in __detach_mounts
      mnt: Update detach_mounts to leave mounts connected
      mnt: Track which mounts use a dentry as root.
      vfs: Test for and handle paths that are unreachable from their mnt_root
      vfs: Handle mounts whose parents are unreachable from their mountpoint
      vfs: Do not allow escaping from bind mounts.

 fs/dcache.c            |  35 +++++-
 fs/fs_pin.c            |   4 +-
 fs/internal.h          |   2 +
 fs/mount.h             |   8 ++
 fs/namei.c             |  34 +++++-
 fs/namespace.c         | 325 +++++++++++++++++++++++++++++++++++++++++--------
 fs/pnode.c             |  60 +++++++--
 fs/pnode.h             |   7 +-
 include/linux/dcache.h |   7 ++
 include/linux/fs_pin.h |   2 +
 include/linux/mount.h  |   3 +
 include/linux/namei.h  |   2 +
 12 files changed, 424 insertions(+), 65 deletions(-)

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH review 0/19] Locked mount and loopback mount fixes
  2015-01-05 20:45   ` [PATCH review 0/11 Call for testing and review of mount detach fixes (take 2) Eric W. Biederman
                       ` (9 preceding siblings ...)
  2015-01-05 20:46     ` [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts Eric W. Biederman
@ 2015-04-03  1:53     ` Eric W. Biederman
  2015-04-03  1:56       ` [PATCH review 01/19] mnt: Use hlist_move_list in namespace_unlock Eric W. Biederman
                         ` (21 more replies)
  10 siblings, 22 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:53 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Al Viro, Andy Lutomirski, Serge E. Hallyn,
	Richard Weinberger, Andrey Vagin, Jann Horn, Willy Tarreau,
	Omar Sandoval

Way back in October Andrey Vagin reported that umount(MNT_DETACH) could
be used to defeat MNT_LOCKED.  As I worked to fix this I discovered
that combined with mount propagation and an appropriate selection of
shared subtrees a reference to a directory on an unmounted filesystem is
not necessary.

That MNT_DETACH is allowed in user namespace in a form that can break
MNT_LOCKED comes from my early misunderstanding of what MNT_DETACH does.

To avoid breaking existing userspace the conflict between MNT_DETACH and
MNT_LOCKED is fixed by leaving mounts that are locked to their parents
in the mount hash table until the last reference goes away.

While investigating this issue I also found an issue with
__detach_mounts.  The code was unnecessarily and incorrectly triggering
mount propagation, resulting in too many mounts going away when a
directory is deleted, and too many cpu cycles being burned while doing
that.

Looking some more I realized that by only keeping MNT_LOCKED mounts
connected, __detach_mounts still had the potential to leak information,
so I tweaked the code to keep everything locked together that possibly
could be.

In the middle of all of this bug hunting and fixing it was reported that
with a strategically placed rename, ".." on bind mounts could go up
past the root of the bind mount.  This turned out to be very easy to
understand and test for, but tricky to actually fix in a way that would
not slow down path name lookups in the common case.

These fixes are against v4.0-rc6, which has all of Al's new fs_pin
code.

I have tested the code and I don't see any issues but as I am human I
may have missed a corner case or two.  So any feedback is appreciated.

For those who like to see everything in a single tree the code is at:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

Eric W. Biederman (19):
      mnt: Use hlist_move_list in namespace_unlock
      mnt: Improve the umount_tree flags
      mnt: Don't propagate umounts in __detach_mounts
      mnt: In umount_tree reuse mnt_list instead of mnt_hash
      mnt: Add MNT_UMOUNT flag
      mnt: Delay removal from the mount hash.
      mnt: On an unmount propagate clearing of MNT_LOCKED
      mnt: Don't propagate unmounts to locked mounts
      mnt: Fail collect_mounts when applied to unmounted mounts
      mnt: Factor out unhash_mnt from detach_mnt and umount_tree
      mnt: Factor umount_mnt from umount_tree
      fs_pin: Allow for the possibility that m_list or s_list go unused.
      mnt: Honor MNT_LOCKED when detaching mounts
      mnt: Fix the error check in __detach_mounts
      mnt: Update detach_mounts to leave mounts connected
      mnt: Track which mounts use a dentry as root.
      vfs: Test for and handle paths that are unreachable from their mnt_root
      vfs: Handle mounts whose parents are unreachable from their mountpoint
      vfs: Do not allow escaping from bind mounts.

 fs/dcache.c            |  35 +++++-
 fs/fs_pin.c            |   4 +-
 fs/internal.h          |   2 +
 fs/mount.h             |   8 ++
 fs/namei.c             |  34 +++++-
 fs/namespace.c         | 325 +++++++++++++++++++++++++++++++++++++++++--------
 fs/pnode.c             |  60 +++++++--
 fs/pnode.h             |   7 +-
 include/linux/dcache.h |   7 ++
 include/linux/fs_pin.h |   2 +
 include/linux/mount.h  |   3 +
 include/linux/namei.h  |   2 +
 12 files changed, 424 insertions(+), 65 deletions(-)

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH review  01/19] mnt: Use hlist_move_list in namespace_unlock
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 02/19] mnt: Improve the umount_tree flags Eric W. Biederman
                           ` (19 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Omar Sandoval,
	Willy Tarreau

Small cleanup to make the code more readable and maintainable.

Signed-off-by: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)
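
For reviewers unfamiliar with the helper, a minimal userspace model of
what hlist_move_list does, as a sketch only: the struct layouts are
reduced copies of the kernel's hlist types, the parameter names from and
to are mine, and nothing here is meant to replace reading
include/linux/list.h.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced userspace copies of the kernel's hlist types. */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

static int hlist_empty(const struct hlist_head *h)
{
	return h->first == NULL;
}

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Move the whole chain from one head to another, leaving "from" empty. */
static void hlist_move_list(struct hlist_head *from, struct hlist_head *to)
{
	to->first = from->first;
	if (to->first)
		/* The first node must point back at its new head. */
		to->first->pprev = &to->first;
	from->first = NULL;
}
```

Because the helper copes with an empty list on its own, the emptiness
test no longer has to come first to protect the pprev fixup, and it can
move to after up_write().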

diff --git a/fs/namespace.c b/fs/namespace.c
index 82ef1405260e..e1ee57206eef 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1298,17 +1298,15 @@ static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
 
 static void namespace_unlock(void)
 {
-	struct hlist_head head = unmounted;
+	struct hlist_head head;
 
-	if (likely(hlist_empty(&head))) {
-		up_write(&namespace_sem);
-		return;
-	}
+	hlist_move_list(&unmounted, &head);
 
-	head.first->pprev = &head.first;
-	INIT_HLIST_HEAD(&unmounted);
 	up_write(&namespace_sem);
 
+	if (likely(hlist_empty(&head)))
+		return;
+
 	synchronize_rcu();
 
 	group_pin_kill(&head);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  01/19] mnt: Use hlist_move_list in namespace_unlock
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
@ 2015-04-03  1:56       ` Eric W. Biederman
  2015-04-03  1:56       ` [PATCH review 02/19] mnt: Improve the umount_tree flags Eric W. Biederman
                         ` (20 subsequent siblings)
  21 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Andrey Vagin, Al Viro, Jann Horn,
	Willy Tarreau, Omar Sandoval

Small cleanup to make the code more readable and maintainable.

Signed-off-by: Eric Biederman <ebiederm@xmission.com>
---
 fs/namespace.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 82ef1405260e..e1ee57206eef 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1298,17 +1298,15 @@ static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
 
 static void namespace_unlock(void)
 {
-	struct hlist_head head = unmounted;
+	struct hlist_head head;
 
-	if (likely(hlist_empty(&head))) {
-		up_write(&namespace_sem);
-		return;
-	}
+	hlist_move_list(&unmounted, &head);
 
-	head.first->pprev = &head.first;
-	INIT_HLIST_HEAD(&unmounted);
 	up_write(&namespace_sem);
 
+	if (likely(hlist_empty(&head)))
+		return;
+
 	synchronize_rcu();
 
 	group_pin_kill(&head);
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  02/19] mnt: Improve the umount_tree flags
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-04-03  1:56         ` [PATCH review 01/19] mnt: Use hlist_move_list in namespace_unlock Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 03/19] mnt: Don't propagate umounts in __detach_mounts Eric W. Biederman
                           ` (18 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Omar Sandoval,
	Willy Tarreau

- Remove the unneeded declaration from pnode.h
- Mark umount_tree static as it has no callers outside of namespace.c
- Define an enumeration of umount_tree's flags.
- Pass umount_tree's flags in by name

This removes the magic numbers 0, 1 and 2, making the code a little
clearer, and makes it possible for there to be lazy unmounts that don't
propagate, which is what __detach_mounts actually wants, for example.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 31 ++++++++++++++++---------------
 fs/pnode.h     |  1 -
 2 files changed, 16 insertions(+), 16 deletions(-)
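
As a reading aid for the call-site conversions below, the old values of
"how" map onto the new flags as sketched here.  how_to_flags is a
hypothetical helper written only for this note; it is not part of the
patch, and it simply replays the two tests that umount_tree used to
perform on the magic number.

```c
#include <assert.h>

enum umount_tree_flags {
	UMOUNT_SYNC = 1,
	UMOUNT_PROPAGATE = 2,
};

/* Hypothetical helper, not in the patch: translate the old magic number. */
static int how_to_flags(int how)
{
	int flags = 0;

	if (how)		/* old test guarding propagate_umount() */
		flags |= UMOUNT_PROPAGATE;
	if (how < 2)		/* old test guarding MNT_SYNC_UMOUNT */
		flags |= UMOUNT_SYNC;
	return flags;
}
```

So 0 becomes UMOUNT_SYNC, 1 becomes UMOUNT_PROPAGATE|UMOUNT_SYNC and 2
becomes UMOUNT_PROPAGATE, which matches each converted caller in the
diff.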

diff --git a/fs/namespace.c b/fs/namespace.c
index e1ee57206eef..e06e36777b90 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1317,14 +1317,15 @@ static inline void namespace_lock(void)
 	down_write(&namespace_sem);
 }
 
+enum umount_tree_flags {
+	UMOUNT_SYNC = 1,
+	UMOUNT_PROPAGATE = 2,
+};
 /*
  * mount_lock must be held
  * namespace_sem must be held for write
- * how = 0 => just this tree, don't propagate
- * how = 1 => propagate; we know that nobody else has reference to any victims
- * how = 2 => lazy umount
  */
-void umount_tree(struct mount *mnt, int how)
+static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 {
 	HLIST_HEAD(tmp_list);
 	struct mount *p;
@@ -1337,7 +1338,7 @@ void umount_tree(struct mount *mnt, int how)
 	hlist_for_each_entry(p, &tmp_list, mnt_hash)
 		list_del_init(&p->mnt_child);
 
-	if (how)
+	if (how & UMOUNT_PROPAGATE)
 		propagate_umount(&tmp_list);
 
 	while (!hlist_empty(&tmp_list)) {
@@ -1347,7 +1348,7 @@ void umount_tree(struct mount *mnt, int how)
 		list_del_init(&p->mnt_list);
 		__touch_mnt_namespace(p->mnt_ns);
 		p->mnt_ns = NULL;
-		if (how < 2)
+		if (how & UMOUNT_SYNC)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 
 		pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt, &unmounted);
@@ -1445,14 +1446,14 @@ static int do_umount(struct mount *mnt, int flags)
 
 	if (flags & MNT_DETACH) {
 		if (!list_empty(&mnt->mnt_list))
-			umount_tree(mnt, 2);
+			umount_tree(mnt, UMOUNT_PROPAGATE);
 		retval = 0;
 	} else {
 		shrink_submounts(mnt);
 		retval = -EBUSY;
 		if (!propagate_mount_busy(mnt, 2)) {
 			if (!list_empty(&mnt->mnt_list))
-				umount_tree(mnt, 1);
+				umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
 			retval = 0;
 		}
 	}
@@ -1484,7 +1485,7 @@ void __detach_mounts(struct dentry *dentry)
 	lock_mount_hash();
 	while (!hlist_empty(&mp->m_list)) {
 		mnt = hlist_entry(mp->m_list.first, struct mount, mnt_mp_list);
-		umount_tree(mnt, 2);
+		umount_tree(mnt, UMOUNT_PROPAGATE);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
@@ -1646,7 +1647,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
 out:
 	if (res) {
 		lock_mount_hash();
-		umount_tree(res, 0);
+		umount_tree(res, UMOUNT_SYNC);
 		unlock_mount_hash();
 	}
 	return q;
@@ -1670,7 +1671,7 @@ void drop_collected_mounts(struct vfsmount *mnt)
 {
 	namespace_lock();
 	lock_mount_hash();
-	umount_tree(real_mount(mnt), 0);
+	umount_tree(real_mount(mnt), UMOUNT_SYNC);
 	unlock_mount_hash();
 	namespace_unlock();
 }
@@ -1853,7 +1854,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
  out_cleanup_ids:
 	while (!hlist_empty(&tree_list)) {
 		child = hlist_entry(tree_list.first, struct mount, mnt_hash);
-		umount_tree(child, 0);
+		umount_tree(child, UMOUNT_SYNC);
 	}
 	unlock_mount_hash();
 	cleanup_group_ids(source_mnt, NULL);
@@ -2033,7 +2034,7 @@ static int do_loopback(struct path *path, const char *old_name,
 	err = graft_tree(mnt, parent, mp);
 	if (err) {
 		lock_mount_hash();
-		umount_tree(mnt, 0);
+		umount_tree(mnt, UMOUNT_SYNC);
 		unlock_mount_hash();
 	}
 out2:
@@ -2404,7 +2405,7 @@ void mark_mounts_for_expiry(struct list_head *mounts)
 	while (!list_empty(&graveyard)) {
 		mnt = list_first_entry(&graveyard, struct mount, mnt_expire);
 		touch_mnt_namespace(mnt->mnt_ns);
-		umount_tree(mnt, 1);
+		umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
 	}
 	unlock_mount_hash();
 	namespace_unlock();
@@ -2475,7 +2476,7 @@ static void shrink_submounts(struct mount *mnt)
 			m = list_first_entry(&graveyard, struct mount,
 						mnt_expire);
 			touch_mnt_namespace(m->mnt_ns);
-			umount_tree(m, 1);
+			umount_tree(m, UMOUNT_PROPAGATE|UMOUNT_SYNC);
 		}
 	}
 }
diff --git a/fs/pnode.h b/fs/pnode.h
index 4a246358b031..16afc3d6d2f2 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -47,7 +47,6 @@ int get_dominating_id(struct mount *mnt, const struct path *root);
 unsigned int mnt_get_count(struct mount *mnt);
 void mnt_set_mountpoint(struct mount *, struct mountpoint *,
 			struct mount *);
-void umount_tree(struct mount *, int);
 struct mount *copy_tree(struct mount *, struct dentry *, int);
 bool is_path_reachable(struct mount *, struct dentry *,
 			 const struct path *root);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  02/19] mnt: Improve the umount_tree flags
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
  2015-04-03  1:56       ` [PATCH review 01/19] mnt: Use hlist_move_list in namespace_unlock Eric W. Biederman
@ 2015-04-03  1:56       ` Eric W. Biederman
  2015-04-03  1:56       ` [PATCH review 03/19] mnt: Don't propagate umounts in __detach_mounts Eric W. Biederman
                         ` (19 subsequent siblings)
  21 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Andrey Vagin, Al Viro, Jann Horn,
	Willy Tarreau, Omar Sandoval

- Remove the unneeded declaration from pnode.h
- Mark umount_tree static as it has no callers outside of namespace.c
- Define an enumeration of umount_tree's flags.
- Pass umount_tree's flags in by name

This removes the magic numbers 0, 1 and 2, making the code a little
clearer, and makes it possible for there to be lazy unmounts that don't
propagate, which is what __detach_mounts actually wants, for example.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 31 ++++++++++++++++---------------
 fs/pnode.h     |  1 -
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index e1ee57206eef..e06e36777b90 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1317,14 +1317,15 @@ static inline void namespace_lock(void)
 	down_write(&namespace_sem);
 }
 
+enum umount_tree_flags {
+	UMOUNT_SYNC = 1,
+	UMOUNT_PROPAGATE = 2,
+};
 /*
  * mount_lock must be held
  * namespace_sem must be held for write
- * how = 0 => just this tree, don't propagate
- * how = 1 => propagate; we know that nobody else has reference to any victims
- * how = 2 => lazy umount
  */
-void umount_tree(struct mount *mnt, int how)
+static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 {
 	HLIST_HEAD(tmp_list);
 	struct mount *p;
@@ -1337,7 +1338,7 @@ void umount_tree(struct mount *mnt, int how)
 	hlist_for_each_entry(p, &tmp_list, mnt_hash)
 		list_del_init(&p->mnt_child);
 
-	if (how)
+	if (how & UMOUNT_PROPAGATE)
 		propagate_umount(&tmp_list);
 
 	while (!hlist_empty(&tmp_list)) {
@@ -1347,7 +1348,7 @@ void umount_tree(struct mount *mnt, int how)
 		list_del_init(&p->mnt_list);
 		__touch_mnt_namespace(p->mnt_ns);
 		p->mnt_ns = NULL;
-		if (how < 2)
+		if (how & UMOUNT_SYNC)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 
 		pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt, &unmounted);
@@ -1445,14 +1446,14 @@ static int do_umount(struct mount *mnt, int flags)
 
 	if (flags & MNT_DETACH) {
 		if (!list_empty(&mnt->mnt_list))
-			umount_tree(mnt, 2);
+			umount_tree(mnt, UMOUNT_PROPAGATE);
 		retval = 0;
 	} else {
 		shrink_submounts(mnt);
 		retval = -EBUSY;
 		if (!propagate_mount_busy(mnt, 2)) {
 			if (!list_empty(&mnt->mnt_list))
-				umount_tree(mnt, 1);
+				umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
 			retval = 0;
 		}
 	}
@@ -1484,7 +1485,7 @@ void __detach_mounts(struct dentry *dentry)
 	lock_mount_hash();
 	while (!hlist_empty(&mp->m_list)) {
 		mnt = hlist_entry(mp->m_list.first, struct mount, mnt_mp_list);
-		umount_tree(mnt, 2);
+		umount_tree(mnt, UMOUNT_PROPAGATE);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
@@ -1646,7 +1647,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
 out:
 	if (res) {
 		lock_mount_hash();
-		umount_tree(res, 0);
+		umount_tree(res, UMOUNT_SYNC);
 		unlock_mount_hash();
 	}
 	return q;
@@ -1670,7 +1671,7 @@ void drop_collected_mounts(struct vfsmount *mnt)
 {
 	namespace_lock();
 	lock_mount_hash();
-	umount_tree(real_mount(mnt), 0);
+	umount_tree(real_mount(mnt), UMOUNT_SYNC);
 	unlock_mount_hash();
 	namespace_unlock();
 }
@@ -1853,7 +1854,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
  out_cleanup_ids:
 	while (!hlist_empty(&tree_list)) {
 		child = hlist_entry(tree_list.first, struct mount, mnt_hash);
-		umount_tree(child, 0);
+		umount_tree(child, UMOUNT_SYNC);
 	}
 	unlock_mount_hash();
 	cleanup_group_ids(source_mnt, NULL);
@@ -2033,7 +2034,7 @@ static int do_loopback(struct path *path, const char *old_name,
 	err = graft_tree(mnt, parent, mp);
 	if (err) {
 		lock_mount_hash();
-		umount_tree(mnt, 0);
+		umount_tree(mnt, UMOUNT_SYNC);
 		unlock_mount_hash();
 	}
 out2:
@@ -2404,7 +2405,7 @@ void mark_mounts_for_expiry(struct list_head *mounts)
 	while (!list_empty(&graveyard)) {
 		mnt = list_first_entry(&graveyard, struct mount, mnt_expire);
 		touch_mnt_namespace(mnt->mnt_ns);
-		umount_tree(mnt, 1);
+		umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
 	}
 	unlock_mount_hash();
 	namespace_unlock();
@@ -2475,7 +2476,7 @@ static void shrink_submounts(struct mount *mnt)
 			m = list_first_entry(&graveyard, struct mount,
 						mnt_expire);
 			touch_mnt_namespace(m->mnt_ns);
-			umount_tree(m, 1);
+			umount_tree(m, UMOUNT_PROPAGATE|UMOUNT_SYNC);
 		}
 	}
 }
diff --git a/fs/pnode.h b/fs/pnode.h
index 4a246358b031..16afc3d6d2f2 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -47,7 +47,6 @@ int get_dominating_id(struct mount *mnt, const struct path *root);
 unsigned int mnt_get_count(struct mount *mnt);
 void mnt_set_mountpoint(struct mount *, struct mountpoint *,
 			struct mount *);
-void umount_tree(struct mount *, int);
 struct mount *copy_tree(struct mount *, struct dentry *, int);
 bool is_path_reachable(struct mount *, struct dentry *,
 			 const struct path *root);
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 03/19] mnt: Don't propagate umounts in __detach_mounts
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-04-03  1:56         ` [PATCH review 01/19] mnt: Use hlist_move_list in namespace_unlock Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 02/19] mnt: Improve the umount_tree flags Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 04/19] mnt: In umount_tree reuse mnt_list instead of mnt_hash Eric W. Biederman
                           ` (17 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel, Jann Horn, Omar Sandoval,
	Willy Tarreau

Invoking mount propagation from __detach_mounts is inefficient and
wrong.

It is inefficient because __detach_mounts already walks the list of
mounts where something needs to be done, and mount propagation
walks some subset of those mounts again.

It is actively wrong because if the dentry that is passed to
__detach_mounts is not part of the path to a mount, that mount should
not be affected.

change_mnt_propagation(p,MS_PRIVATE) modifies the mount propagation
tree of a master mount so its slaves are connected to another master
if possible, which means even removing a mount from the middle of a
mount tree with __detach_mounts will not deprive any mount of
propagated mount events.
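
The re-homing of slaves described in the last paragraph can be pictured
with a toy userspace model.  This is only a sketch under simplifying
assumptions, not the kernel's implementation: struct toy_mount and
toy_make_private() are made-up names, peer groups are reduced to an
integer id, and the choice of "first peer found" as the new master is an
assumption of the model.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct mount; only propagation links are modelled. */
struct toy_mount {
	int peer_group;			/* 0 means "not shared" */
	struct toy_mount *master;	/* NULL means "no master" */
};

/*
 * Toy version of making a mount private: every slave of 'p' is handed
 * to a peer of 'p' if one exists, otherwise to p's own master.
 */
static void toy_make_private(struct toy_mount *all, size_t n,
			     struct toy_mount *p)
{
	struct toy_mount *new_master;
	size_t i;

	assert(p != NULL);
	new_master = p->master;

	/* Prefer a surviving peer of 'p' as the new master. */
	if (p->peer_group != 0) {
		for (i = 0; i < n; i++) {
			if (&all[i] != p &&
			    all[i].peer_group == p->peer_group) {
				new_master = &all[i];
				break;
			}
		}
	}

	/* Re-home the slaves so they keep receiving propagation events. */
	for (i = 0; i < n; i++)
		if (all[i].master == p)
			all[i].master = new_master;

	/* 'p' itself leaves the propagation tree. */
	p->peer_group = 0;
	p->master = NULL;
}
```

In this model a slave only ends up without a master when the departing
mount had neither a peer nor a master of its own.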

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index e06e36777b90..c68d9fc912e7 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1485,7 +1485,7 @@ void __detach_mounts(struct dentry *dentry)
 	lock_mount_hash();
 	while (!hlist_empty(&mp->m_list)) {
 		mnt = hlist_entry(mp->m_list.first, struct mount, mnt_mp_list);
-		umount_tree(mnt, UMOUNT_PROPAGATE);
+		umount_tree(mnt, 0);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 04/19] mnt: In umount_tree reuse mnt_list instead of mnt_hash
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (2 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 03/19] mnt: Don't propagate umounts in __detach_mounts Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 05/19] mnt: Add MNT_UMOUNT flag Eric W. Biederman
                           ` (16 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel, Jann Horn, Omar Sandoval,
	Willy Tarreau

umount_tree builds a list of mounts that need to be unmounted.
Utilize mnt_list for this purpose instead of mnt_hash.  This begins to
allow keeping a mount on the mnt_hash after it is unmounted, which is
necessary for a properly functioning MNT_LOCKED implementation.

The fact that mnt_list is an ordinary list, which makes list_move
available, is a nice bonus.
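
To illustrate why an ordinary list is convenient here, the following is a
minimal userspace sketch of the "gather" step.  It re-creates a tiny
circular doubly linked list by hand; the helper names (init_list_head,
list_unlink, list_insert_head, toy_list_move, gather) and struct toy_mount
are stand-ins chosen for this example, not the kernel's list.h API, though
toy_list_move() follows the same unlink-then-add-at-head idea as
list_move().

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace re-creation of a circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Entries must be initialized or already on a list before unlinking. */
static void list_unlink(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

static void list_insert_head(struct list_head *e, struct list_head *head)
{
	e->next = head->next;
	e->prev = head;
	head->next->prev = e;
	head->next = e;
}

/* Same idea as list_move(): unlink, then add at the head of 'head'. */
static void toy_list_move(struct list_head *e, struct list_head *head)
{
	list_unlink(e);
	list_insert_head(e, head);
}

struct toy_mount {
	int id;
	struct list_head mnt_list;
};

/* Gather every mount in 'victims' onto 'tmp_list'; returns the count moved. */
static int gather(struct toy_mount *victims, int n, struct list_head *tmp_list)
{
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		toy_list_move(&victims[i].mnt_list, tmp_list);
	return n;
}
```

A single move both removes the mount from whatever list it was on and
queues it for unmounting, and it leaves mnt_hash untouched, which is the
point of the change.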

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 20 +++++++++++---------
 fs/pnode.c     |  6 +++---
 fs/pnode.h     |  2 +-
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c68d9fc912e7..54cbef129f4a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1327,23 +1327,25 @@ enum umount_tree_flags {
  */
 static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 {
-	HLIST_HEAD(tmp_list);
+	LIST_HEAD(tmp_list);
 	struct mount *p;
 
-	for (p = mnt; p; p = next_mnt(p, mnt)) {
-		hlist_del_init_rcu(&p->mnt_hash);
-		hlist_add_head(&p->mnt_hash, &tmp_list);
-	}
+	/* Gather the mounts to umount */
+	for (p = mnt; p; p = next_mnt(p, mnt))
+		list_move(&p->mnt_list, &tmp_list);
 
-	hlist_for_each_entry(p, &tmp_list, mnt_hash)
+	/* Hide the mounts from lookup_mnt and mnt_mounts */
+	list_for_each_entry(p, &tmp_list, mnt_list) {
+		hlist_del_init_rcu(&p->mnt_hash);
 		list_del_init(&p->mnt_child);
+	}
 
+	/* Add propogated mounts to the tmp_list */
 	if (how & UMOUNT_PROPAGATE)
 		propagate_umount(&tmp_list);
 
-	while (!hlist_empty(&tmp_list)) {
-		p = hlist_entry(tmp_list.first, struct mount, mnt_hash);
-		hlist_del_init_rcu(&p->mnt_hash);
+	while (!list_empty(&tmp_list)) {
+		p = list_first_entry(&tmp_list, struct mount, mnt_list);
 		list_del_init(&p->mnt_expire);
 		list_del_init(&p->mnt_list);
 		__touch_mnt_namespace(p->mnt_ns);
diff --git a/fs/pnode.c b/fs/pnode.c
index 260ac8f898a4..bf012af709dd 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -384,7 +384,7 @@ static void __propagate_umount(struct mount *mnt)
 		if (child && list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
 			hlist_del_init_rcu(&child->mnt_hash);
-			hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
+			list_move_tail(&child->mnt_list, &mnt->mnt_list);
 		}
 	}
 }
@@ -396,11 +396,11 @@ static void __propagate_umount(struct mount *mnt)
  *
  * vfsmount lock must be held for write
  */
-int propagate_umount(struct hlist_head *list)
+int propagate_umount(struct list_head *list)
 {
 	struct mount *mnt;
 
-	hlist_for_each_entry(mnt, list, mnt_hash)
+	list_for_each_entry(mnt, list, mnt_list)
 		__propagate_umount(mnt);
 	return 0;
 }
diff --git a/fs/pnode.h b/fs/pnode.h
index 16afc3d6d2f2..aa6d65df7204 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -40,7 +40,7 @@ static inline void set_mnt_shared(struct mount *mnt)
 void change_mnt_propagation(struct mount *, int);
 int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
 		struct hlist_head *);
-int propagate_umount(struct hlist_head *);
+int propagate_umount(struct list_head *);
 int propagate_mount_busy(struct mount *, int);
 void mnt_release_group_id(struct mount *);
 int get_dominating_id(struct mount *mnt, const struct path *root);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  05/19] mnt: Add MNT_UMOUNT flag
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (3 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 04/19] mnt: In umount_tree reuse mnt_list instead of mnt_hash Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 06/19] mnt: Delay removal from the mount hash Eric W. Biederman
                           ` (15 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel, Jann Horn, Omar Sandoval,
	Willy Tarreau

In some instances it is necessary to know if the unmounting
process has begun on a mount.  Add MNT_UMOUNT to make that reliably
testable.

This flag is used when fixing locked mounts in MNT_DETACH.
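
As a quick illustration of what "reliably testable" amounts to, here is a
userspace sketch.  The four flag values are copied from the
include/linux/mount.h hunk below; struct toy_mount, toy_begin_umount() and
toy_umount_started() are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values copied from the include/linux/mount.h hunk in this patch. */
#define MNT_DOOMED		0x1000000
#define MNT_SYNC_UMOUNT		0x2000000
#define MNT_MARKED		0x4000000
#define MNT_UMOUNT		0x8000000

/* Toy stand-in for the flags word of a mount. */
struct toy_mount {
	int mnt_flags;
};

/* Mark the mount the way umount_tree does when it gathers victims. */
static void toy_begin_umount(struct toy_mount *m)
{
	assert(m != NULL);
	m->mnt_flags |= MNT_UMOUNT;
}

/* A single bit test answers "has unmounting begun on this mount?". */
static int toy_umount_started(const struct toy_mount *m)
{
	return (m->mnt_flags & MNT_UMOUNT) != 0;
}
```

Because MNT_UMOUNT occupies its own bit, setting it does not disturb the
neighbouring internal flags.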

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c        | 4 +++-
 fs/pnode.c            | 1 +
 include/linux/mount.h | 1 +
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 54cbef129f4a..d1708147eb45 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1331,8 +1331,10 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	struct mount *p;
 
 	/* Gather the mounts to umount */
-	for (p = mnt; p; p = next_mnt(p, mnt))
+	for (p = mnt; p; p = next_mnt(p, mnt)) {
+		p->mnt.mnt_flags |= MNT_UMOUNT;
 		list_move(&p->mnt_list, &tmp_list);
+	}
 
 	/* Hide the mounts from lookup_mnt and mnt_mounts */
 	list_for_each_entry(p, &tmp_list, mnt_list) {
diff --git a/fs/pnode.c b/fs/pnode.c
index bf012af709dd..ac3aa0d43b90 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -384,6 +384,7 @@ static void __propagate_umount(struct mount *mnt)
 		if (child && list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
 			hlist_del_init_rcu(&child->mnt_hash);
+			child->mnt.mnt_flags |= MNT_UMOUNT;
 			list_move_tail(&child->mnt_list, &mnt->mnt_list);
 		}
 	}
diff --git a/include/linux/mount.h b/include/linux/mount.h
index c2c561dc0114..564beeec5d83 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -61,6 +61,7 @@ struct mnt_namespace;
 #define MNT_DOOMED		0x1000000
 #define MNT_SYNC_UMOUNT		0x2000000
 #define MNT_MARKED		0x4000000
+#define MNT_UMOUNT		0x8000000
 
 struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  06/19] mnt: Delay removal from the mount hash.
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (4 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 05/19] mnt: Add MNT_UMOUNT flag Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 07/19] mnt: On an unmount propagate clearing of MNT_LOCKED Eric W. Biederman
                           ` (14 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel, Jann Horn, Omar Sandoval,
	Willy Tarreau

- Modify __lookup_mnt_last to ignore mounts that have MNT_UMOUNT set.
- Don't remove mounts from the mount hash table in propagate_umount.
- Don't remove mounts from the mount hash table in umount_tree before
  the entire list of mounts to be unmounted is selected.
- Remove mounts from the mount hash table as the last thing that
  happens in the case where a mount has a parent in umount_tree.
  Mounts without parents are not hashed (by definition).

This paves the way for delaying removal from the mount hash table even
further and fixing the MNT_LOCKED vs MNT_DETACH issue.
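
The lookup change in the first bullet can be sketched in userspace as
follows.  This is a toy analogue, not the kernel function: the hash chain
is modelled as a plain array, parents and mountpoints are reduced to
integer ids, and struct toy_mount and toy_lookup_last() are made-up names.
The MNT_UMOUNT value is the one added in patch 05/19.

```c
#include <assert.h>
#include <stddef.h>

#define MNT_UMOUNT 0x8000000	/* value from patch 05/19 */

struct toy_mount {
	int parent;	/* id of the parent mount */
	int mountpoint;	/* id of the mountpoint dentry */
	int mnt_flags;
};

/*
 * Toy analogue of the modified lookup: 'chain' plays the role of one
 * hash chain.  Return the last entry mounted on (parent, mountpoint)
 * that is not marked MNT_UMOUNT, or NULL if there is none.
 */
static struct toy_mount *toy_lookup_last(struct toy_mount *chain, size_t n,
					 int parent, int mountpoint)
{
	struct toy_mount *res = NULL;
	size_t i = 0;

	assert(chain != NULL || n == 0);

	/* Find the first match, as the plain lookup would. */
	while (i < n && (chain[i].parent != parent ||
			 chain[i].mountpoint != mountpoint))
		i++;

	/* Walk the run of matching entries, skipping ones being unmounted. */
	for (; i < n; i++) {
		if (chain[i].parent != parent ||
		    chain[i].mountpoint != mountpoint)
			break;
		if (!(chain[i].mnt_flags & MNT_UMOUNT))
			res = &chain[i];
	}
	return res;
}
```

With this shape, a mount can stay on the chain after unmounting has begun
without being handed out to new lookups.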

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 13 ++++++++-----
 fs/pnode.c     |  1 -
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index d1708147eb45..083e3401a808 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -632,14 +632,17 @@ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
  */
 struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry)
 {
-	struct mount *p, *res;
-	res = p = __lookup_mnt(mnt, dentry);
+	struct mount *p, *res = NULL;
+	p = __lookup_mnt(mnt, dentry);
 	if (!p)
 		goto out;
+	if (!(p->mnt.mnt_flags & MNT_UMOUNT))
+		res = p;
 	hlist_for_each_entry_continue(p, mnt_hash) {
 		if (&p->mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry)
 			break;
-		res = p;
+		if (!(p->mnt.mnt_flags & MNT_UMOUNT))
+			res = p;
 	}
 out:
 	return res;
@@ -1336,9 +1339,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		list_move(&p->mnt_list, &tmp_list);
 	}
 
-	/* Hide the mounts from lookup_mnt and mnt_mounts */
+	/* Hide the mounts from mnt_mounts */
 	list_for_each_entry(p, &tmp_list, mnt_list) {
-		hlist_del_init_rcu(&p->mnt_hash);
 		list_del_init(&p->mnt_child);
 	}
 
@@ -1365,6 +1367,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 			p->mnt_mountpoint = p->mnt.mnt_root;
 			p->mnt_parent = p;
 			p->mnt_mp = NULL;
+			hlist_del_init_rcu(&p->mnt_hash);
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
diff --git a/fs/pnode.c b/fs/pnode.c
index ac3aa0d43b90..c27ae38ee250 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -383,7 +383,6 @@ static void __propagate_umount(struct mount *mnt)
 		 */
 		if (child && list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
-			hlist_del_init_rcu(&child->mnt_hash);
 			child->mnt.mnt_flags |= MNT_UMOUNT;
 			list_move_tail(&child->mnt_list, &mnt->mnt_list);
 		}
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 07/19] mnt: On an unmount propagate clearing of MNT_LOCKED
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (5 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 06/19] mnt: Delay removal from the mount hash Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 08/19] mnt: Don't propagate unmounts to locked mounts Eric W. Biederman
                           ` (13 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel, Jann Horn, Omar Sandoval,
	Willy Tarreau

A prerequisite of calling umount_tree is that the point where the tree
is mounted is valid to unmount.

If we are propagating the effect of the unmount, clear MNT_LOCKED in
every instance where the same filesystem is mounted on the same
mountpoint in the mount tree, as we know (by virtue of the fact
that umount_tree was called) that it is safe to reveal what
is at that mountpoint.
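
A toy userspace model of that propagation step is sketched below.  It is
an illustration under simplifying assumptions, not the code in the diff:
mounts and mountpoints are integer ids, the set of mounts the parent
propagates to is passed in as a plain array, TOY_MNT_LOCKED is a
placeholder bit rather than the kernel's value, and struct toy_mount and
toy_propagate_unlock() are made-up names.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MNT_LOCKED 0x1	/* placeholder bit, not the kernel's value */

struct toy_mount {
	int id;
	int parent;	/* id of the parent mount */
	int mountpoint;	/* id of the mountpoint dentry */
	int mnt_flags;
};

/*
 * For every mount in 'targets' (the mounts the parent of 'mnt' propagates
 * to), find the last child sitting on the same mountpoint as 'mnt' and
 * clear its lock bit.  Returns how many lock bits were cleared.
 */
static int toy_propagate_unlock(struct toy_mount *mounts, size_t n,
				const int *targets, size_t ntargets,
				const struct toy_mount *mnt)
{
	int cleared = 0;
	size_t t, i;

	assert(mnt->parent != mnt->id);	/* mirrors BUG_ON(parent == mnt) */

	for (t = 0; t < ntargets; t++) {
		struct toy_mount *child = NULL;

		/* Keep the last match, like a "lookup last" would. */
		for (i = 0; i < n; i++)
			if (mounts[i].parent == targets[t] &&
			    mounts[i].mountpoint == mnt->mountpoint)
				child = &mounts[i];

		if (child && (child->mnt_flags & TOY_MNT_LOCKED)) {
			child->mnt_flags &= ~TOY_MNT_LOCKED;
			cleared++;
		}
	}
	return cleared;
}
```

Only children on the same mountpoint under a propagation target are
touched; mounts elsewhere keep their lock bit.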

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c |  3 +++
 fs/pnode.c     | 20 ++++++++++++++++++++
 fs/pnode.h     |  1 +
 3 files changed, 24 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 083e3401a808..2b12b7a9455d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1333,6 +1333,9 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	LIST_HEAD(tmp_list);
 	struct mount *p;
 
+	if (how & UMOUNT_PROPAGATE)
+		propagate_mount_unlock(mnt);
+
 	/* Gather the mounts to umount */
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
 		p->mnt.mnt_flags |= MNT_UMOUNT;
diff --git a/fs/pnode.c b/fs/pnode.c
index c27ae38ee250..89890293dd0a 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -362,6 +362,26 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
 }
 
 /*
+ * Clear MNT_LOCKED when it can be shown to be safe.
+ *
+ * mount_lock lock must be held for write
+ */
+void propagate_mount_unlock(struct mount *mnt)
+{
+	struct mount *parent = mnt->mnt_parent;
+	struct mount *m, *child;
+
+	BUG_ON(parent == mnt);
+
+	for (m = propagation_next(parent, parent); m;
+			m = propagation_next(m, parent)) {
+		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
+		if (child)
+			child->mnt.mnt_flags &= ~MNT_LOCKED;
+	}
+}
+
+/*
  * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
  * parent propagates to.
  */
diff --git a/fs/pnode.h b/fs/pnode.h
index aa6d65df7204..af47d4bd7b31 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -42,6 +42,7 @@ int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
 		struct hlist_head *);
 int propagate_umount(struct list_head *);
 int propagate_mount_busy(struct mount *, int);
+void propagate_mount_unlock(struct mount *);
 void mnt_release_group_id(struct mount *);
 int get_dominating_id(struct mount *mnt, const struct path *root);
 unsigned int mnt_get_count(struct mount *mnt);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  07/19] mnt: On an unmount propagate clearing of MNT_LOCKED
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
                         ` (5 preceding siblings ...)
  2015-04-03  1:56       ` [PATCH review 06/19] mnt: Delay removal from the mount hash Eric W. Biederman
@ 2015-04-03  1:56       ` Eric W. Biederman
  2015-04-03  1:56       ` [PATCH review 08/19] mnt: Don't propagate unmounts to locked mounts Eric W. Biederman
                         ` (14 subsequent siblings)
  21 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Andrey Vagin, Al Viro, Jann Horn,
	Willy Tarreau, Omar Sandoval

A prerequisite of calling umount_tree is that the point where the tree
is mounted is valid to unmount.

If we are propagating the effect of the unmount, clear MNT_LOCKED in
every instance where the same filesystem is mounted on the same
mountpoint in the mount tree, as we know (by virtue of the fact
that umount_tree was called) that it is safe to reveal what
is at that mountpoint.
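
As a rough userspace illustration of the idea (not the kernel code), the
sketch below clears a lock bit on every peer that sits on the same
mountpoint.  The toy_ names, the flag values and the integer mountpoint id
are assumptions made for the example; the real code walks the propagation
tree with propagation_next and looks mounts up with __lookup_mnt_last.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MNT_LOCKED 0x1u	/* illustrative bit, not the kernel's value */
#define TOY_MNT_NODEV  0x2u	/* unrelated flag that must survive */

/* Hypothetical stand-in for struct mount. */
struct toy_mount {
	unsigned int flags;
	int mountpoint;		/* toy id of the dentry this is mounted on */
};

/*
 * Clear TOY_MNT_LOCKED on every peer mounted on 'mountpoint'.
 * Returns how many peers actually had the lock bit set.
 */
static size_t toy_propagate_unlock(struct toy_mount *peers, size_t n,
				   int mountpoint)
{
	size_t i, cleared = 0;

	assert(peers != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (peers[i].mountpoint != mountpoint)
			continue;
		if (peers[i].flags & TOY_MNT_LOCKED)
			cleared++;
		/* Only the lock bit is dropped; other flags are kept. */
		peers[i].flags &= ~TOY_MNT_LOCKED;
	}
	return cleared;
}
```

Peers on other mountpoints keep their lock bit, which mirrors the
restriction to "the same mountpoint" in the description above.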

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c |  3 +++
 fs/pnode.c     | 20 ++++++++++++++++++++
 fs/pnode.h     |  1 +
 3 files changed, 24 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 083e3401a808..2b12b7a9455d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1333,6 +1333,9 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	LIST_HEAD(tmp_list);
 	struct mount *p;
 
+	if (how & UMOUNT_PROPAGATE)
+		propagate_mount_unlock(mnt);
+
 	/* Gather the mounts to umount */
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
 		p->mnt.mnt_flags |= MNT_UMOUNT;
diff --git a/fs/pnode.c b/fs/pnode.c
index c27ae38ee250..89890293dd0a 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -362,6 +362,26 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
 }
 
 /*
+ * Clear MNT_LOCKED when it can be shown to be safe.
+ *
+ * mount_lock lock must be held for write
+ */
+void propagate_mount_unlock(struct mount *mnt)
+{
+	struct mount *parent = mnt->mnt_parent;
+	struct mount *m, *child;
+
+	BUG_ON(parent == mnt);
+
+	for (m = propagation_next(parent, parent); m;
+			m = propagation_next(m, parent)) {
+		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
+		if (child)
+			child->mnt.mnt_flags &= ~MNT_LOCKED;
+	}
+}
+
+/*
  * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
  * parent propagates to.
  */
diff --git a/fs/pnode.h b/fs/pnode.h
index aa6d65df7204..af47d4bd7b31 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -42,6 +42,7 @@ int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
 		struct hlist_head *);
 int propagate_umount(struct list_head *);
 int propagate_mount_busy(struct mount *, int);
+void propagate_mount_unlock(struct mount *);
 void mnt_release_group_id(struct mount *);
 int get_dominating_id(struct mount *mnt, const struct path *root);
 unsigned int mnt_get_count(struct mount *mnt);
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  08/19] mnt: Don't propagate unmounts to locked mounts
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (6 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 07/19] mnt: On an unmount propagate clearing of MNT_LOCKED Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 09/19] mnt: Fail collect_mounts when applied to unmounted mounts Eric W. Biederman
                           ` (12 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Omar Sandoval,
	Willy Tarreau

If the first mount in a shared subtree is locked, don't unmount the
shared subtree.

This is ensured by walking through the mounts parents before children
and marking a mount as safe to unmount if it is not locked, or if it is
locked but its parent is marked.

This allows recursive mount detach to propagate through a set of
mounts when unmounting them would not reveal what is under any locked
mount.
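
The marking rule can be sketched in userspace roughly as follows.  This is
a toy model under stated assumptions, not the patch itself: the toy_ names
and the flat parent-index array are hypothetical, and the array is assumed
to be ordered parents before children, which is what the reverse list walk
in propagate_umount is meant to provide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mount; not the kernel's layout. */
struct toy_mount {
	int parent;	/* index of the parent in the array, -1 for none */
	bool locked;	/* models MNT_LOCKED */
	bool marked;	/* models MNT_MARKED */
};

/*
 * Mark every mount that the lock rule allows to be unmounted: a mount
 * is marked if it is not locked, or if it is locked but its parent has
 * already been marked.  Returns the number of marked mounts.
 */
static size_t toy_mark_candidates(struct toy_mount *m, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++) {
		bool parent_marked;

		/* Parents must come before their children. */
		assert(m[i].parent < (int)i);

		parent_marked = m[i].parent >= 0 && m[m[i].parent].marked;
		m[i].marked = !m[i].locked || parent_marked;
		if (m[i].marked)
			count++;
	}
	return count;
}
```

A locked mount under an unmarked parent stays unmarked, so nothing
underneath it is revealed; a locked mount whose parent goes away as well
can go with it.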

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/pnode.c | 32 +++++++++++++++++++++++++++++---
 fs/pnode.h |  1 +
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/fs/pnode.c b/fs/pnode.c
index 89890293dd0a..6367e1e435c6 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -382,6 +382,26 @@ void propagate_mount_unlock(struct mount *mnt)
 }
 
 /*
+ * Mark all mounts that the MNT_LOCKED logic will allow to be unmounted.
+ */
+static void mark_umount_candidates(struct mount *mnt)
+{
+	struct mount *parent = mnt->mnt_parent;
+	struct mount *m;
+
+	BUG_ON(parent == mnt);
+
+	for (m = propagation_next(parent, parent); m;
+			m = propagation_next(m, parent)) {
+		struct mount *child = __lookup_mnt_last(&m->mnt,
+						mnt->mnt_mountpoint);
+		if (child && (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m))) {
+			SET_MNT_MARK(child);
+		}
+	}
+}
+
+/*
  * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
  * parent propagates to.
  */
@@ -398,10 +418,13 @@ static void __propagate_umount(struct mount *mnt)
 		struct mount *child = __lookup_mnt_last(&m->mnt,
 						mnt->mnt_mountpoint);
 		/*
-		 * umount the child only if the child has no
-		 * other children
+		 * umount the child only if the child has no children
+		 * and the child is marked safe to unmount.
 		 */
-		if (child && list_empty(&child->mnt_mounts)) {
+		if (!child || !IS_MNT_MARKED(child))
+			continue;
+		CLEAR_MNT_MARK(child);
+		if (list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
 			child->mnt.mnt_flags |= MNT_UMOUNT;
 			list_move_tail(&child->mnt_list, &mnt->mnt_list);
@@ -420,6 +443,9 @@ int propagate_umount(struct list_head *list)
 {
 	struct mount *mnt;
 
+	list_for_each_entry_reverse(mnt, list, mnt_list)
+		mark_umount_candidates(mnt);
+
 	list_for_each_entry(mnt, list, mnt_list)
 		__propagate_umount(mnt);
 	return 0;
diff --git a/fs/pnode.h b/fs/pnode.h
index af47d4bd7b31..0fcdbe7ca648 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -19,6 +19,7 @@
 #define IS_MNT_MARKED(m) ((m)->mnt.mnt_flags & MNT_MARKED)
 #define SET_MNT_MARK(m) ((m)->mnt.mnt_flags |= MNT_MARKED)
 #define CLEAR_MNT_MARK(m) ((m)->mnt.mnt_flags &= ~MNT_MARKED)
+#define IS_MNT_LOCKED(m) ((m)->mnt.mnt_flags & MNT_LOCKED)
 
 #define CL_EXPIRE    		0x01
 #define CL_SLAVE     		0x02
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  08/19] mnt: Don't propagate unmounts to locked mounts
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
                         ` (6 preceding siblings ...)
  2015-04-03  1:56       ` [PATCH review 07/19] mnt: On an unmount propagate clearing of MNT_LOCKED Eric W. Biederman
@ 2015-04-03  1:56       ` Eric W. Biederman
  2015-04-03  1:56       ` [PATCH review 09/19] mnt: Fail collect_mounts when applied to unmounted mounts Eric W. Biederman
                         ` (13 subsequent siblings)
  21 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Andrey Vagin, Al Viro, Jann Horn,
	Willy Tarreau, Omar Sandoval

If the first mount in a shared subtree is locked, don't unmount the
shared subtree.

This is ensured by walking through the mounts parents before children
and marking a mount as safe to unmount if it is not locked, or if it is
locked but its parent is marked.

This allows recursive mount detach to propagate through a set of
mounts when unmounting them would not reveal what is under any locked
mount.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/pnode.c | 32 +++++++++++++++++++++++++++++---
 fs/pnode.h |  1 +
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/fs/pnode.c b/fs/pnode.c
index 89890293dd0a..6367e1e435c6 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -382,6 +382,26 @@ void propagate_mount_unlock(struct mount *mnt)
 }
 
 /*
+ * Mark all mounts that the MNT_LOCKED logic will allow to be unmounted.
+ */
+static void mark_umount_candidates(struct mount *mnt)
+{
+	struct mount *parent = mnt->mnt_parent;
+	struct mount *m;
+
+	BUG_ON(parent == mnt);
+
+	for (m = propagation_next(parent, parent); m;
+			m = propagation_next(m, parent)) {
+		struct mount *child = __lookup_mnt_last(&m->mnt,
+						mnt->mnt_mountpoint);
+		if (child && (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m))) {
+			SET_MNT_MARK(child);
+		}
+	}
+}
+
+/*
  * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
  * parent propagates to.
  */
@@ -398,10 +418,13 @@ static void __propagate_umount(struct mount *mnt)
 		struct mount *child = __lookup_mnt_last(&m->mnt,
 						mnt->mnt_mountpoint);
 		/*
-		 * umount the child only if the child has no
-		 * other children
+		 * umount the child only if the child has no children
+		 * and the child is marked safe to unmount.
 		 */
-		if (child && list_empty(&child->mnt_mounts)) {
+		if (!child || !IS_MNT_MARKED(child))
+			continue;
+		CLEAR_MNT_MARK(child);
+		if (list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
 			child->mnt.mnt_flags |= MNT_UMOUNT;
 			list_move_tail(&child->mnt_list, &mnt->mnt_list);
@@ -420,6 +443,9 @@ int propagate_umount(struct list_head *list)
 {
 	struct mount *mnt;
 
+	list_for_each_entry_reverse(mnt, list, mnt_list)
+		mark_umount_candidates(mnt);
+
 	list_for_each_entry(mnt, list, mnt_list)
 		__propagate_umount(mnt);
 	return 0;
diff --git a/fs/pnode.h b/fs/pnode.h
index af47d4bd7b31..0fcdbe7ca648 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -19,6 +19,7 @@
 #define IS_MNT_MARKED(m) ((m)->mnt.mnt_flags & MNT_MARKED)
 #define SET_MNT_MARK(m) ((m)->mnt.mnt_flags |= MNT_MARKED)
 #define CLEAR_MNT_MARK(m) ((m)->mnt.mnt_flags &= ~MNT_MARKED)
+#define IS_MNT_LOCKED(m) ((m)->mnt.mnt_flags & MNT_LOCKED)
 
 #define CL_EXPIRE    		0x01
 #define CL_SLAVE     		0x02
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 09/19] mnt: Fail collect_mounts when applied to unmounted mounts
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (7 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 08/19] mnt: Don't propagate unmounts to locked mounts Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 10/19] mnt: Factor out unhash_mnt from detach_mnt and umount_tree Eric W. Biederman
                           ` (11 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Omar Sandoval,
	Willy Tarreau

The only users of collect_mounts are in audit_tree.c.

In audit_trim_trees and audit_add_tree_rule the path passed into
collect_mounts is generated from kern_path passed an audit_tree
pathname which is guaranteed to be an absolute path.   In those cases
collect_mounts is obviously intended to work on mounted paths and
if a race results in paths that are unmounted when collect_mounts
is called it is reasonable to fail early.

The paths passed into audit_tag_tree don't have the absolute path
check, but are used to play with fsnotify and otherwise interact with
the audit_trees, so again operating only on mounted paths appears
reasonable.

Avoid having to worry about what happens when we try to audit
unmounted filesystems by restricting collect_mounts to mounts
that appear in the mount tree.
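
For readers unfamiliar with the ERR_PTR convention that the early failure
relies on, here is a minimal userspace sketch.  The toy_ helpers only
imitate the kernel's ERR_PTR, PTR_ERR and IS_ERR; the in_namespace field
stands in for the check_mnt test and the static copy stands in for the
copy_tree result.  All of these names are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, as ERR_PTR does. */
static inline void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno value, as PTR_ERR does. */
static inline long toy_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if the pointer lies in the top TOY_MAX_ERRNO addresses. */
static inline bool toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Hypothetical stand-in for struct mount. */
struct toy_mount {
	bool in_namespace;	/* models the result of check_mnt() */
};

static struct toy_mount toy_copy;	/* stands in for the copied tree */

/* Fail early with -EINVAL when the mount is not in the mount tree. */
static struct toy_mount *toy_collect_mounts(struct toy_mount *mnt)
{
	assert(mnt != NULL);

	if (!mnt->in_namespace)
		return toy_err_ptr(-EINVAL);
	toy_copy = *mnt;
	return &toy_copy;
}
```

The caller checks the result with toy_is_err before using it, in the same
way collect_mounts tests IS_ERR(tree) after dropping the namespace lock.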

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 2b12b7a9455d..acc5583764dc 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1669,8 +1669,11 @@ struct vfsmount *collect_mounts(struct path *path)
 {
 	struct mount *tree;
 	namespace_lock();
-	tree = copy_tree(real_mount(path->mnt), path->dentry,
-			 CL_COPY_ALL | CL_PRIVATE);
+	if (!check_mnt(real_mount(path->mnt)))
+		tree = ERR_PTR(-EINVAL);
+	else
+		tree = copy_tree(real_mount(path->mnt), path->dentry,
+				 CL_COPY_ALL | CL_PRIVATE);
 	namespace_unlock();
 	if (IS_ERR(tree))
 		return ERR_CAST(tree);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  09/19] mnt: Fail collect_mounts when applied to unmounted mounts
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
                         ` (7 preceding siblings ...)
  2015-04-03  1:56       ` [PATCH review 08/19] mnt: Don't propagate unmounts to locked mounts Eric W. Biederman
@ 2015-04-03  1:56       ` Eric W. Biederman
  2015-04-03  8:55         ` Lukasz Pawelczyk
       [not found]         ` <1428026183-14879-9-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                         ` (12 subsequent siblings)
  21 siblings, 2 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Andrey Vagin, Al Viro, Jann Horn,
	Willy Tarreau, Omar Sandoval

The only users of collect_mounts are in audit_tree.c.

In audit_trim_trees and audit_add_tree_rule the path passed into
collect_mounts is generated from kern_path passed an audit_tree
pathname which is guaranteed to be an absolute path.   In those cases
collect_mounts is obviously intended to work on mounted paths and
if a race results in paths that are unmounted when collect_mounts
is called it is reasonable to fail early.

The paths passed into audit_tag_tree don't have the absolute path
check, but are used to play with fsnotify and otherwise interact with
the audit_trees, so again operating only on mounted paths appears
reasonable.

Avoid having to worry about what happens when we try to audit
unmounted filesystems by restricting collect_mounts to mounts
that appear in the mount tree.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 2b12b7a9455d..acc5583764dc 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1669,8 +1669,11 @@ struct vfsmount *collect_mounts(struct path *path)
 {
 	struct mount *tree;
 	namespace_lock();
-	tree = copy_tree(real_mount(path->mnt), path->dentry,
-			 CL_COPY_ALL | CL_PRIVATE);
+	if (!check_mnt(real_mount(path->mnt)))
+		tree = ERR_PTR(-EINVAL);
+	else
+		tree = copy_tree(real_mount(path->mnt), path->dentry,
+				 CL_COPY_ALL | CL_PRIVATE);
 	namespace_unlock();
 	if (IS_ERR(tree))
 		return ERR_CAST(tree);
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 10/19] mnt: Factor out unhash_mnt from detach_mnt and umount_tree
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (8 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 09/19] mnt: Fail collect_mounts when applied to unmounted mounts Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 11/19] mnt: Factor umount_mnt from umount_tree Eric W. Biederman
                           ` (10 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Omar Sandoval,
	Willy Tarreau

Create a function unhash_mnt that contains the common code between
detach_mnt and umount_tree, and use unhash_mnt in place of the common
code.  This adds an unnecessary list_del_init(&mnt->mnt_child) into
umount_tree, but given that mnt_child is already empty this extra
line is a noop.
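
Why the extra list_del_init is harmless can be seen from a small userspace
re-implementation of the circular list.  This is a sketch with hypothetical
toy_ names that mimics, but is not, the kernel's list.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on struct list_head. */
struct toy_list_head {
	struct toy_list_head *next, *prev;
};

static void toy_init_list_head(struct toy_list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool toy_list_empty(const struct toy_list_head *h)
{
	return h->next == h;
}

static void toy_list_add(struct toy_list_head *entry,
			 struct toy_list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink the entry and leave it pointing at itself. */
static void toy_list_del_init(struct toy_list_head *entry)
{
	/* The entry must have been initialised or linked before. */
	assert(entry->next != NULL && entry->prev != NULL);

	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	toy_init_list_head(entry);
}
```

On an entry that already points at itself, both pointer updates write the
entry's own address back into the entry, so a second toy_list_del_init
changes nothing.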

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index acc5583764dc..e669a3bf86e7 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -798,10 +798,8 @@ static void __touch_mnt_namespace(struct mnt_namespace *ns)
 /*
  * vfsmount lock must be held for write
  */
-static void detach_mnt(struct mount *mnt, struct path *old_path)
+static void unhash_mnt(struct mount *mnt)
 {
-	old_path->dentry = mnt->mnt_mountpoint;
-	old_path->mnt = &mnt->mnt_parent->mnt;
 	mnt->mnt_parent = mnt;
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	list_del_init(&mnt->mnt_child);
@@ -814,6 +812,16 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
 /*
  * vfsmount lock must be held for write
  */
+static void detach_mnt(struct mount *mnt, struct path *old_path)
+{
+	old_path->dentry = mnt->mnt_mountpoint;
+	old_path->mnt = &mnt->mnt_parent->mnt;
+	unhash_mnt(mnt);
+}
+
+/*
+ * vfsmount lock must be held for write
+ */
 void mnt_set_mountpoint(struct mount *mnt,
 			struct mountpoint *mp,
 			struct mount *child_mnt)
@@ -1362,15 +1370,10 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 
 		pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt, &unmounted);
 		if (mnt_has_parent(p)) {
-			hlist_del_init(&p->mnt_mp_list);
-			put_mountpoint(p->mnt_mp);
 			mnt_add_count(p->mnt_parent, -1);
 			/* old mountpoint will be dropped when we can do that */
 			p->mnt_ex_mountpoint = p->mnt_mountpoint;
-			p->mnt_mountpoint = p->mnt.mnt_root;
-			p->mnt_parent = p;
-			p->mnt_mp = NULL;
-			hlist_del_init_rcu(&p->mnt_hash);
+			unhash_mnt(p);
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  10/19] mnt: Factor out unhash_mnt from detach_mnt and umount_tree
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
                         ` (9 preceding siblings ...)
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-04-03  1:56       ` Eric W. Biederman
  2015-04-03  1:56       ` [PATCH review 11/19] mnt: Factor umount_mnt from umount_tree Eric W. Biederman
                         ` (10 subsequent siblings)
  21 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Andrey Vagin, Al Viro, Jann Horn,
	Willy Tarreau, Omar Sandoval

Create a function unhash_mnt that contains the common code between
detach_mnt and umount_tree, and use unhash_mnt in place of the common
code.  This adds an unnecessary list_del_init(&mnt->mnt_child) into
umount_tree, but given that mnt_child is already empty this extra
line is a noop.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index acc5583764dc..e669a3bf86e7 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -798,10 +798,8 @@ static void __touch_mnt_namespace(struct mnt_namespace *ns)
 /*
  * vfsmount lock must be held for write
  */
-static void detach_mnt(struct mount *mnt, struct path *old_path)
+static void unhash_mnt(struct mount *mnt)
 {
-	old_path->dentry = mnt->mnt_mountpoint;
-	old_path->mnt = &mnt->mnt_parent->mnt;
 	mnt->mnt_parent = mnt;
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	list_del_init(&mnt->mnt_child);
@@ -814,6 +812,16 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
 /*
  * vfsmount lock must be held for write
  */
+static void detach_mnt(struct mount *mnt, struct path *old_path)
+{
+	old_path->dentry = mnt->mnt_mountpoint;
+	old_path->mnt = &mnt->mnt_parent->mnt;
+	unhash_mnt(mnt);
+}
+
+/*
+ * vfsmount lock must be held for write
+ */
 void mnt_set_mountpoint(struct mount *mnt,
 			struct mountpoint *mp,
 			struct mount *child_mnt)
@@ -1362,15 +1370,10 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 
 		pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt, &unmounted);
 		if (mnt_has_parent(p)) {
-			hlist_del_init(&p->mnt_mp_list);
-			put_mountpoint(p->mnt_mp);
 			mnt_add_count(p->mnt_parent, -1);
 			/* old mountpoint will be dropped when we can do that */
 			p->mnt_ex_mountpoint = p->mnt_mountpoint;
-			p->mnt_mountpoint = p->mnt.mnt_root;
-			p->mnt_parent = p;
-			p->mnt_mp = NULL;
-			hlist_del_init_rcu(&p->mnt_hash);
+			unhash_mnt(p);
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  11/19] mnt: Factor umount_mnt from umount_tree
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (9 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 10/19] mnt: Factor out unhash_mnt from detach_mnt and umount_tree Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 12/19] fs_pin: Allow for the possibility that m_list or s_list go unused Eric W. Biederman
                           ` (9 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Omar Sandoval,
	Willy Tarreau

For future use factor out a function umount_mnt from umount_tree.
This function unhashes a mount and remembers where the mount
was mounted so that eventually when the code makes it to a
sleeping context the mountpoint can be dput.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index e669a3bf86e7..010d5bebcb7e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -822,6 +822,16 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
 /*
  * vfsmount lock must be held for write
  */
+static void umount_mnt(struct mount *mnt)
+{
+	/* old mountpoint will be dropped when we can do that */
+	mnt->mnt_ex_mountpoint = mnt->mnt_mountpoint;
+	unhash_mnt(mnt);
+}
+
+/*
+ * vfsmount lock must be held for write
+ */
 void mnt_set_mountpoint(struct mount *mnt,
 			struct mountpoint *mp,
 			struct mount *child_mnt)
@@ -1371,9 +1381,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt, &unmounted);
 		if (mnt_has_parent(p)) {
 			mnt_add_count(p->mnt_parent, -1);
-			/* old mountpoint will be dropped when we can do that */
-			p->mnt_ex_mountpoint = p->mnt_mountpoint;
-			unhash_mnt(p);
+			umount_mnt(p);
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  11/19] mnt: Factor umount_mnt from umount_tree
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
                         ` (10 preceding siblings ...)
  2015-04-03  1:56       ` [PATCH review 10/19] mnt: Factor out unhash_mnt from detach_mnt and umount_tree Eric W. Biederman
@ 2015-04-03  1:56       ` Eric W. Biederman
  2015-04-03  1:56       ` [PATCH review 12/19] fs_pin: Allow for the possibility that m_list or s_list go unused Eric W. Biederman
                         ` (9 subsequent siblings)
  21 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Andrey Vagin, Al Viro, Jann Horn,
	Willy Tarreau, Omar Sandoval

For future use factor out a function umount_mnt from umount_tree.
This function unhashes a mount and remembers where the mount
was mounted so that eventually when the code makes it to a
sleeping context the mountpoint can be dput.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index e669a3bf86e7..010d5bebcb7e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -822,6 +822,16 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
 /*
  * vfsmount lock must be held for write
  */
+static void umount_mnt(struct mount *mnt)
+{
+	/* old mountpoint will be dropped when we can do that */
+	mnt->mnt_ex_mountpoint = mnt->mnt_mountpoint;
+	unhash_mnt(mnt);
+}
+
+/*
+ * vfsmount lock must be held for write
+ */
 void mnt_set_mountpoint(struct mount *mnt,
 			struct mountpoint *mp,
 			struct mount *child_mnt)
@@ -1371,9 +1381,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt, &unmounted);
 		if (mnt_has_parent(p)) {
 			mnt_add_count(p->mnt_parent, -1);
-			/* old mountpoint will be dropped when we can do that */
-			p->mnt_ex_mountpoint = p->mnt_mountpoint;
-			unhash_mnt(p);
+			umount_mnt(p);
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 12/19] fs_pin: Allow for the possibility that m_list or s_list go unused.
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (10 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 11/19] mnt: Factor umount_mnt from umount_tree Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 13/19] mnt: Honor MNT_LOCKED when detaching mounts Eric W. Biederman
                           ` (8 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Omar Sandoval,
	Willy Tarreau

This is needed to support lazily umounting locked mounts, because the
entire unmounted subtree needs to stay together until there are no
users with references to any part of the subtree.

To support this, guarantee that the fs_pin m_list and s_list nodes
are initialized by initializing them in init_fs_pin, allowing
for the possibility that pin_insert_group does not touch them.

Further, use hlist_del_init in pin_remove so that there is
an hlist_unhashed test before we attempt to update
the previous list item.
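
The difference this makes can be sketched in userspace with a small hlist
look-alike.  The toy_ names are hypothetical and the code only imitates the
kernel's hlist helpers; the point is that a node which was initialised but
never added can be passed to the delete routine safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Modelled on struct hlist_node and struct hlist_head. */
struct toy_hlist_node {
	struct toy_hlist_node *next, **pprev;
};

struct toy_hlist_head {
	struct toy_hlist_node *first;
};

static void toy_init_hlist_node(struct toy_hlist_node *n)
{
	n->next = NULL;
	n->pprev = NULL;
}

static bool toy_hlist_unhashed(const struct toy_hlist_node *n)
{
	return !n->pprev;
}

static void toy_hlist_add_head(struct toy_hlist_node *n,
			       struct toy_hlist_head *h)
{
	/* A node may only be on one list at a time. */
	assert(toy_hlist_unhashed(n));

	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Unlink only if the node is on a list, then reinitialise it. */
static void toy_hlist_del_init(struct toy_hlist_node *n)
{
	if (toy_hlist_unhashed(n))
		return;
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	toy_init_hlist_node(n);
}
```

A plain delete without the unhashed test would dereference the NULL pprev
of a node that was never inserted, which is the case the patch has to
allow for once pin_insert_group may skip one of the lists.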

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/fs_pin.c            | 4 ++--
 include/linux/fs_pin.h | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/fs_pin.c b/fs/fs_pin.c
index b06c98796afb..611b5408f6ec 100644
--- a/fs/fs_pin.c
+++ b/fs/fs_pin.c
@@ -9,8 +9,8 @@ static DEFINE_SPINLOCK(pin_lock);
 void pin_remove(struct fs_pin *pin)
 {
 	spin_lock(&pin_lock);
-	hlist_del(&pin->m_list);
-	hlist_del(&pin->s_list);
+	hlist_del_init(&pin->m_list);
+	hlist_del_init(&pin->s_list);
 	spin_unlock(&pin_lock);
 	spin_lock_irq(&pin->wait.lock);
 	pin->done = 1;
diff --git a/include/linux/fs_pin.h b/include/linux/fs_pin.h
index 9dc4e0384bfb..3886b3bffd7f 100644
--- a/include/linux/fs_pin.h
+++ b/include/linux/fs_pin.h
@@ -13,6 +13,8 @@ struct vfsmount;
 static inline void init_fs_pin(struct fs_pin *p, void (*kill)(struct fs_pin *))
 {
 	init_waitqueue_head(&p->wait);
+	INIT_HLIST_NODE(&p->s_list);
+	INIT_HLIST_NODE(&p->m_list);
 	p->kill = kill;
 }
 
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  12/19] fs_pin: Allow for the possibility that m_list or s_list go unused.
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
                         ` (11 preceding siblings ...)
  2015-04-03  1:56       ` [PATCH review 11/19] mnt: Factor umount_mnt from umount_tree Eric W. Biederman
@ 2015-04-03  1:56       ` Eric W. Biederman
       [not found]         ` <1428026183-14879-12-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2015-04-03  1:56       ` [PATCH review 13/19] mnt: Honor MNT_LOCKED when detaching mounts Eric W. Biederman
                         ` (8 subsequent siblings)
  21 siblings, 1 reply; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Andrey Vagin, Al Viro, Jann Horn,
	Willy Tarreau, Omar Sandoval

This is needed to support lazily umounting locked mounts, because the
entire unmounted subtree needs to stay together until there are no
users with references to any part of the subtree.

To support this, guarantee that the fs_pin m_list and s_list nodes
are initialized by initializing them in init_fs_pin, allowing
for the possibility that pin_insert_group does not touch them.

Further, use hlist_del_init in pin_remove so that there is
an hlist_unhashed test before we attempt to update
the previous list item.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fs_pin.c            | 4 ++--
 include/linux/fs_pin.h | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/fs_pin.c b/fs/fs_pin.c
index b06c98796afb..611b5408f6ec 100644
--- a/fs/fs_pin.c
+++ b/fs/fs_pin.c
@@ -9,8 +9,8 @@ static DEFINE_SPINLOCK(pin_lock);
 void pin_remove(struct fs_pin *pin)
 {
 	spin_lock(&pin_lock);
-	hlist_del(&pin->m_list);
-	hlist_del(&pin->s_list);
+	hlist_del_init(&pin->m_list);
+	hlist_del_init(&pin->s_list);
 	spin_unlock(&pin_lock);
 	spin_lock_irq(&pin->wait.lock);
 	pin->done = 1;
diff --git a/include/linux/fs_pin.h b/include/linux/fs_pin.h
index 9dc4e0384bfb..3886b3bffd7f 100644
--- a/include/linux/fs_pin.h
+++ b/include/linux/fs_pin.h
@@ -13,6 +13,8 @@ struct vfsmount;
 static inline void init_fs_pin(struct fs_pin *p, void (*kill)(struct fs_pin *))
 {
 	init_waitqueue_head(&p->wait);
+	INIT_HLIST_NODE(&p->s_list);
+	INIT_HLIST_NODE(&p->m_list);
 	p->kill = kill;
 }
 
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  13/19] mnt: Honor MNT_LOCKED when detaching mounts
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (11 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 12/19] fs_pin: Allow for the possibility that m_list or s_list go unused Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 14/19] mnt: Fix the error check in __detach_mounts Eric W. Biederman
                           ` (7 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Omar Sandoval,
	Willy Tarreau

Modify umount(MNT_DETACH) to keep mounts in the hash table that are
locked to their parent mounts, when the parent is lazily unmounted.

In mntput_no_expire detach the children from the hash table, depending
on mnt_pin_kill in cleanup_mnt to decrement the mnt_count of the children.

In __detach_mounts if there are any mounts that have been unmounted
but still are on the list of mounts of a mountpoint, remove their
children from the mount hash table and add those children to the
unmounted list so they won't linger potentially indefinitely waiting
for their final mntput, now that the mounts serve no purpose.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
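A minimal userspace sketch of the flag test that decides whether
umount_tree disconnects a mount, for readers who want to poke at the
truth table. The DEMO_* names and their numeric values are assumptions
made for illustration only; the real flags are defined in
include/linux/mount.h and the real macro is IS_MNT_LOCKED_AND_LAZY in
fs/pnode.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only; see include/linux/mount.h for the real ones. */
#define DEMO_MNT_LOCKED      0x0800000u
#define DEMO_MNT_SYNC_UMOUNT 0x2000000u

/* Same shape as IS_MNT_LOCKED_AND_LAZY: locked, and not a synchronous umount. */
bool demo_locked_and_lazy(unsigned int flags)
{
	return (flags & (DEMO_MNT_LOCKED | DEMO_MNT_SYNC_UMOUNT)) == DEMO_MNT_LOCKED;
}

/* umount_tree disconnects a mount only when it is not locked-and-lazy. */
bool demo_disconnect(unsigned int flags)
{
	return !demo_locked_and_lazy(flags);
}

/* Walk the three interesting cases. */
void demo_selfcheck(void)
{
	assert(demo_disconnect(0));                      /* ordinary mount */
	assert(!demo_disconnect(DEMO_MNT_LOCKED));       /* locked + lazy: stays attached */
	assert(demo_disconnect(DEMO_MNT_LOCKED | DEMO_MNT_SYNC_UMOUNT)); /* locked + sync */
}
```

Masking with both bits and comparing against MNT_LOCKED alone is what
lets a single expression say "locked and not sync".
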
 fs/namespace.c | 29 ++++++++++++++++++++++++++---
 fs/pnode.h     |  2 ++
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 010d5bebcb7e..1894d1878dbc 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1099,6 +1099,13 @@ static void mntput_no_expire(struct mount *mnt)
 	rcu_read_unlock();
 
 	list_del(&mnt->mnt_instance);
+
+	if (unlikely(!list_empty(&mnt->mnt_mounts))) {
+		struct mount *p, *tmp;
+		list_for_each_entry_safe(p, tmp, &mnt->mnt_mounts,  mnt_child) {
+			umount_mnt(p);
+		}
+	}
 	unlock_mount_hash();
 
 	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
@@ -1370,6 +1377,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		propagate_umount(&tmp_list);
 
 	while (!list_empty(&tmp_list)) {
+		bool disconnect;
 		p = list_first_entry(&tmp_list, struct mount, mnt_list);
 		list_del_init(&p->mnt_expire);
 		list_del_init(&p->mnt_list);
@@ -1378,10 +1386,18 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		if (how & UMOUNT_SYNC)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 
-		pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt, &unmounted);
+		disconnect = !IS_MNT_LOCKED_AND_LAZY(p);
+
+		pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt,
+				 disconnect ? &unmounted : NULL);
 		if (mnt_has_parent(p)) {
 			mnt_add_count(p->mnt_parent, -1);
-			umount_mnt(p);
+			if (!disconnect) {
+				/* Don't forget about p */
+				list_add_tail(&p->mnt_child, &p->mnt_parent->mnt_mounts);
+			} else {
+				umount_mnt(p);
+			}
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
@@ -1506,7 +1522,14 @@ void __detach_mounts(struct dentry *dentry)
 	lock_mount_hash();
 	while (!hlist_empty(&mp->m_list)) {
 		mnt = hlist_entry(mp->m_list.first, struct mount, mnt_mp_list);
-		umount_tree(mnt, 0);
+		if (mnt->mnt.mnt_flags & MNT_UMOUNT) {
+			struct mount *p, *tmp;
+			list_for_each_entry_safe(p, tmp, &mnt->mnt_mounts,  mnt_child) {
+				hlist_add_head(&p->mnt_umount.s_list, &unmounted);
+				umount_mnt(p);
+			}
+		}
+		else umount_tree(mnt, 0);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
diff --git a/fs/pnode.h b/fs/pnode.h
index 0fcdbe7ca648..7114ce6e6b9e 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -20,6 +20,8 @@
 #define SET_MNT_MARK(m) ((m)->mnt.mnt_flags |= MNT_MARKED)
 #define CLEAR_MNT_MARK(m) ((m)->mnt.mnt_flags &= ~MNT_MARKED)
 #define IS_MNT_LOCKED(m) ((m)->mnt.mnt_flags & MNT_LOCKED)
+#define IS_MNT_LOCKED_AND_LAZY(m) \
+	(((m)->mnt.mnt_flags & (MNT_LOCKED|MNT_SYNC_UMOUNT)) == MNT_LOCKED)
 
 #define CL_EXPIRE    		0x01
 #define CL_SLAVE     		0x02
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 13/19] mnt: Honor MNT_LOCKED when detaching mounts
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
                         ` (12 preceding siblings ...)
  2015-04-03  1:56       ` [PATCH review 12/19] fs_pin: Allow for the possibility that m_list or s_list go unused Eric W. Biederman
@ 2015-04-03  1:56       ` Eric W. Biederman
  2015-04-03  1:56       ` [PATCH review 14/19] mnt: Fix the error check in __detach_mounts Eric W. Biederman
                         ` (7 subsequent siblings)
  21 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Andrey Vagin, Al Viro, Jann Horn,
	Willy Tarreau, Omar Sandoval

Modify umount(MNT_DETACH) to keep mounts in the hash table that are
locked to their parent mounts, when the parent is lazily unmounted.

In mntput_no_expire detach the children from the hash table, depending
on mnt_pin_kill in cleanup_mnt to decrement the mnt_count of the children.

In __detach_mounts if there are any mounts that have been unmounted
but still are on the list of mounts of a mountpoint, remove their
children from the mount hash table and add those children to the
unmounted list so they won't linger potentially indefinitely waiting
for their final mntput, now that the mounts serve no purpose.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 29 ++++++++++++++++++++++++++---
 fs/pnode.h     |  2 ++
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 010d5bebcb7e..1894d1878dbc 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1099,6 +1099,13 @@ static void mntput_no_expire(struct mount *mnt)
 	rcu_read_unlock();
 
 	list_del(&mnt->mnt_instance);
+
+	if (unlikely(!list_empty(&mnt->mnt_mounts))) {
+		struct mount *p, *tmp;
+		list_for_each_entry_safe(p, tmp, &mnt->mnt_mounts,  mnt_child) {
+			umount_mnt(p);
+		}
+	}
 	unlock_mount_hash();
 
 	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
@@ -1370,6 +1377,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		propagate_umount(&tmp_list);
 
 	while (!list_empty(&tmp_list)) {
+		bool disconnect;
 		p = list_first_entry(&tmp_list, struct mount, mnt_list);
 		list_del_init(&p->mnt_expire);
 		list_del_init(&p->mnt_list);
@@ -1378,10 +1386,18 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		if (how & UMOUNT_SYNC)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 
-		pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt, &unmounted);
+		disconnect = !IS_MNT_LOCKED_AND_LAZY(p);
+
+		pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt,
+				 disconnect ? &unmounted : NULL);
 		if (mnt_has_parent(p)) {
 			mnt_add_count(p->mnt_parent, -1);
-			umount_mnt(p);
+			if (!disconnect) {
+				/* Don't forget about p */
+				list_add_tail(&p->mnt_child, &p->mnt_parent->mnt_mounts);
+			} else {
+				umount_mnt(p);
+			}
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
@@ -1506,7 +1522,14 @@ void __detach_mounts(struct dentry *dentry)
 	lock_mount_hash();
 	while (!hlist_empty(&mp->m_list)) {
 		mnt = hlist_entry(mp->m_list.first, struct mount, mnt_mp_list);
-		umount_tree(mnt, 0);
+		if (mnt->mnt.mnt_flags & MNT_UMOUNT) {
+			struct mount *p, *tmp;
+			list_for_each_entry_safe(p, tmp, &mnt->mnt_mounts,  mnt_child) {
+				hlist_add_head(&p->mnt_umount.s_list, &unmounted);
+				umount_mnt(p);
+			}
+		}
+		else umount_tree(mnt, 0);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
diff --git a/fs/pnode.h b/fs/pnode.h
index 0fcdbe7ca648..7114ce6e6b9e 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -20,6 +20,8 @@
 #define SET_MNT_MARK(m) ((m)->mnt.mnt_flags |= MNT_MARKED)
 #define CLEAR_MNT_MARK(m) ((m)->mnt.mnt_flags &= ~MNT_MARKED)
 #define IS_MNT_LOCKED(m) ((m)->mnt.mnt_flags & MNT_LOCKED)
+#define IS_MNT_LOCKED_AND_LAZY(m) \
+	(((m)->mnt.mnt_flags & (MNT_LOCKED|MNT_SYNC_UMOUNT)) == MNT_LOCKED)
 
 #define CL_EXPIRE    		0x01
 #define CL_SLAVE     		0x02
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 14/19] mnt: Fix the error check in __detach_mounts
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (12 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 13/19] mnt: Honor MNT_LOCKED when detaching mounts Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 15/19] mnt: Update detach_mounts to leave mounts connected Eric W. Biederman
                           ` (6 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Omar Sandoval,
	Willy Tarreau

lookup_mountpoint can return either NULL or an error value.
Update the test in __detach_mounts to test for an error value
to avoid pathological cases causing a NULL pointer dereference.

The callers of __detach_mounts should prevent it from ever being
called on an unlinked dentry but don't take any chances.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
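A small userspace sketch of why a plain NULL test is not enough when a
function may also hand back an encoded error pointer. The demo_*
helpers below are hypothetical stand-ins written for this note; they
imitate the general idea behind the kernel's ERR_PTR and
IS_ERR_OR_NULL helpers but are not the kernel's implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

int demo_target; /* stands in for a valid object */

/* Encode a negative errno as a pointer into the last page of the address space. */
void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* True when the pointer value falls in the reserved error range. */
bool demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* The combined test: neither NULL nor an encoded error is usable. */
bool demo_is_err_or_null(const void *p)
{
	return p == NULL || demo_is_err(p);
}

void demo_selfcheck(void)
{
	void *err = demo_err_ptr(-ENOENT);

	assert(err != NULL);                      /* a "!mp" test would let this through */
	assert(demo_is_err_or_null(err));         /* the combined test catches it */
	assert(demo_is_err_or_null(NULL));
	assert(!demo_is_err_or_null(&demo_target));
}
```

The first assertion is the case the patch guards against: an error
value is non-NULL, so the old test would fall through and use it.
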
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 1894d1878dbc..e8f7f8c58c3c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1516,7 +1516,7 @@ void __detach_mounts(struct dentry *dentry)
 
 	namespace_lock();
 	mp = lookup_mountpoint(dentry);
-	if (!mp)
+	if (IS_ERR_OR_NULL(mp))
 		goto out_unlock;
 
 	lock_mount_hash();
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 14/19] mnt: Fix the error check in __detach_mounts
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
                         ` (13 preceding siblings ...)
  2015-04-03  1:56       ` [PATCH review 13/19] mnt: Honor MNT_LOCKED when detaching mounts Eric W. Biederman
@ 2015-04-03  1:56       ` Eric W. Biederman
  2015-04-03  1:56       ` [PATCH review 15/19] mnt: Update detach_mounts to leave mounts connected Eric W. Biederman
                         ` (6 subsequent siblings)
  21 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Andrey Vagin, Al Viro, Jann Horn,
	Willy Tarreau, Omar Sandoval

lookup_mountpoint can return either NULL or an error value.
Update the test in __detach_mounts to test for an error value
to avoid pathological cases causing a NULL pointer dereference.

The callers of __detach_mounts should prevent it from ever being
called on an unlinked dentry but don't take any chances.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 1894d1878dbc..e8f7f8c58c3c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1516,7 +1516,7 @@ void __detach_mounts(struct dentry *dentry)
 
 	namespace_lock();
 	mp = lookup_mountpoint(dentry);
-	if (!mp)
+	if (IS_ERR_OR_NULL(mp))
 		goto out_unlock;
 
 	lock_mount_hash();
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 15/19] mnt: Update detach_mounts to leave mounts connected
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (13 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 14/19] mnt: Fix the error check in __detach_mounts Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 16/19] mnt: Track which mounts use a dentry as root Eric W. Biederman
                           ` (5 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Omar Sandoval,
	Willy Tarreau

Now that it is possible to lazily unmount an entire mount tree and
leave the individual mounts connected to each other add a new flag
UMOUNT_CONNECTED to umount_tree to force this behavior and use
this flag in detach_mounts.

This closes a bug where the deletion of a file or directory could
trigger an unmount and reveal data under a mount point.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
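The new disconnect expression is dense, so here is a userspace sketch
that spells it out as a function. UMOUNT_CONNECTED = 4 comes from this
patch; the DEMO_MNT_* values and the function name are assumptions for
illustration, and has_parent stands in for mnt_has_parent(p).

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_UMOUNT_CONNECTED 4u
/* Illustrative values only; see include/linux/mount.h for the real ones. */
#define DEMO_MNT_LOCKED       0x0800000u
#define DEMO_MNT_SYNC_UMOUNT  0x2000000u
#define DEMO_MNT_UMOUNT       0x8000000u

bool demo_should_disconnect(unsigned int how, bool has_parent,
			    unsigned int parent_flags, unsigned int flags)
{
	/* Caller asked for a connected tree and the parent is itself unmounted. */
	bool stay_with_parent = (how & DEMO_UMOUNT_CONNECTED) && has_parent &&
				(parent_flags & DEMO_MNT_UMOUNT);
	/* Locked to the parent and being lazily unmounted. */
	bool locked_and_lazy =
		(flags & (DEMO_MNT_LOCKED | DEMO_MNT_SYNC_UMOUNT)) == DEMO_MNT_LOCKED;

	return !(stay_with_parent || locked_and_lazy);
}

void demo_selfcheck(void)
{
	/* Top of the tree: its parent is still mounted, so it is disconnected. */
	assert(demo_should_disconnect(DEMO_UMOUNT_CONNECTED, true, 0, 0));
	/* Child of an unmounted parent stays connected under UMOUNT_CONNECTED. */
	assert(!demo_should_disconnect(DEMO_UMOUNT_CONNECTED, true,
				       DEMO_MNT_UMOUNT, 0));
	/* Without the flag the same child is disconnected. */
	assert(demo_should_disconnect(0, true, DEMO_MNT_UMOUNT, 0));
}
```

Only the mount at the top of the detached tree is cut loose; everything
below it keeps covering whatever it was mounted on.
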
 fs/namespace.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index e8f7f8c58c3c..1f4f9dac6e5a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1348,6 +1348,7 @@ static inline void namespace_lock(void)
 enum umount_tree_flags {
 	UMOUNT_SYNC = 1,
 	UMOUNT_PROPAGATE = 2,
+	UMOUNT_CONNECTED = 4,
 };
 /*
  * mount_lock must be held
@@ -1386,7 +1387,10 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		if (how & UMOUNT_SYNC)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 
-		disconnect = !IS_MNT_LOCKED_AND_LAZY(p);
+		disconnect = !(((how & UMOUNT_CONNECTED) &&
+				mnt_has_parent(p) &&
+				(p->mnt_parent->mnt.mnt_flags & MNT_UMOUNT)) ||
+			       IS_MNT_LOCKED_AND_LAZY(p));
 
 		pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt,
 				 disconnect ? &unmounted : NULL);
@@ -1529,7 +1533,7 @@ void __detach_mounts(struct dentry *dentry)
 				umount_mnt(p);
 			}
 		}
-		else umount_tree(mnt, 0);
+		else umount_tree(mnt, UMOUNT_CONNECTED);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 15/19] mnt: Update detach_mounts to leave mounts connected
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
                         ` (14 preceding siblings ...)
  2015-04-03  1:56       ` [PATCH review 14/19] mnt: Fix the error check in __detach_mounts Eric W. Biederman
@ 2015-04-03  1:56       ` Eric W. Biederman
  2015-04-03  1:56       ` [PATCH review 16/19] mnt: Track which mounts use a dentry as root Eric W. Biederman
                         ` (5 subsequent siblings)
  21 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Andrey Vagin, Al Viro, Jann Horn,
	Willy Tarreau, Omar Sandoval

Now that it is possible to lazily unmount an entire mount tree and
leave the individual mounts connected to each other add a new flag
UMOUNT_CONNECTED to umount_tree to force this behavior and use
this flag in detach_mounts.

This closes a bug where the deletion of a file or directory could
trigger an unmount and reveal data under a mount point.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index e8f7f8c58c3c..1f4f9dac6e5a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1348,6 +1348,7 @@ static inline void namespace_lock(void)
 enum umount_tree_flags {
 	UMOUNT_SYNC = 1,
 	UMOUNT_PROPAGATE = 2,
+	UMOUNT_CONNECTED = 4,
 };
 /*
  * mount_lock must be held
@@ -1386,7 +1387,10 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		if (how & UMOUNT_SYNC)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 
-		disconnect = !IS_MNT_LOCKED_AND_LAZY(p);
+		disconnect = !(((how & UMOUNT_CONNECTED) &&
+				mnt_has_parent(p) &&
+				(p->mnt_parent->mnt.mnt_flags & MNT_UMOUNT)) ||
+			       IS_MNT_LOCKED_AND_LAZY(p));
 
 		pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt,
 				 disconnect ? &unmounted : NULL);
@@ -1529,7 +1533,7 @@ void __detach_mounts(struct dentry *dentry)
 				umount_mnt(p);
 			}
 		}
-		else umount_tree(mnt, 0);
+		else umount_tree(mnt, UMOUNT_CONNECTED);
 	}
 	unlock_mount_hash();
 	put_mountpoint(mp);
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 16/19] mnt: Track which mounts use a dentry as root.
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (14 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 15/19] mnt: Update detach_mounts to leave mounts connected Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 17/19] vfs: Test for and handle paths that are unreachable from their mnt_root Eric W. Biederman
                           ` (4 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Omar Sandoval,
	Willy Tarreau

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
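A userspace sketch of the bucket computation that mr_hash performs,
for anyone checking how dentry addresses spread over the table. The
cache line size of 64 and the function name are assumptions for
illustration; in the kernel the divisor is L1_CACHE_BYTES and the
shift and mask are filled in by alloc_large_system_hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_L1_CACHE_BYTES 64 /* assumed; architecture dependent in the kernel */

/* Map an object address to a bucket index in a table of (mask + 1) heads. */
size_t demo_mr_bucket(const void *dentry, unsigned int shift, unsigned int mask)
{
	/* Drop the low bits, which are the same for every cache-aligned object. */
	uintptr_t tmp = (uintptr_t)dentry / DEMO_L1_CACHE_BYTES;

	/* Fold higher address bits into the low ones before masking. */
	tmp = tmp + (tmp >> shift);
	return (size_t)(tmp & mask);
}

void demo_selfcheck(void)
{
	/* 0x1000 / 64 = 64; 64 + (64 >> 4) = 68; 68 & 15 = 4 */
	assert(demo_mr_bucket((const void *)(uintptr_t)0x1000, 4, 15) == 4);
	/* The next cache line lands in the next bucket. */
	assert(demo_mr_bucket((const void *)(uintptr_t)0x1040, 4, 15) == 5);
}
```
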
 fs/mount.h             |   7 +++
 fs/namespace.c         | 122 +++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dcache.h |   7 +++
 3 files changed, 131 insertions(+), 5 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 6a61c2b3e385..a8be3033e022 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -27,6 +27,12 @@ struct mountpoint {
 	int m_count;
 };
 
+struct mountroot {
+	struct hlist_node r_hash;
+	struct dentry *r_dentry;
+	struct hlist_head r_list;
+};
+
 struct mount {
 	struct hlist_node mnt_hash;
 	struct mount *mnt_parent;
@@ -55,6 +61,7 @@ struct mount {
 	struct mnt_namespace *mnt_ns;	/* containing namespace */
 	struct mountpoint *mnt_mp;	/* where is it mounted */
 	struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
+	struct hlist_node mnt_mr_list;	/* list mounts with the same mountroot */
 #ifdef CONFIG_FSNOTIFY
 	struct hlist_head mnt_fsnotify_marks;
 	__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index 1f4f9dac6e5a..5b1b666439ac 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -31,6 +31,8 @@ static unsigned int m_hash_mask __read_mostly;
 static unsigned int m_hash_shift __read_mostly;
 static unsigned int mp_hash_mask __read_mostly;
 static unsigned int mp_hash_shift __read_mostly;
+static unsigned int mr_hash_mask __read_mostly;
+static unsigned int mr_hash_shift __read_mostly;
 
 static __initdata unsigned long mhash_entries;
 static int __init set_mhash_entries(char *str)
@@ -52,6 +54,16 @@ static int __init set_mphash_entries(char *str)
 }
 __setup("mphash_entries=", set_mphash_entries);
 
+static __initdata unsigned long mrhash_entries;
+static int __init set_mrhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	mrhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("mrhash_entries=", set_mrhash_entries);
+
 static u64 event;
 static DEFINE_IDA(mnt_id_ida);
 static DEFINE_IDA(mnt_group_ida);
@@ -61,6 +73,7 @@ static int mnt_group_start = 1;
 
 static struct hlist_head *mount_hashtable __read_mostly;
 static struct hlist_head *mountpoint_hashtable __read_mostly;
+static struct hlist_head *mountroot_hashtable __read_mostly;
 static struct kmem_cache *mnt_cache __read_mostly;
 static DECLARE_RWSEM(namespace_sem);
 
@@ -93,6 +106,13 @@ static inline struct hlist_head *mp_hash(struct dentry *dentry)
 	return &mountpoint_hashtable[tmp & mp_hash_mask];
 }
 
+static inline struct hlist_head *mr_hash(struct dentry *dentry)
+{
+	unsigned long tmp = ((unsigned long)dentry / L1_CACHE_BYTES);
+	tmp = tmp + (tmp >> mr_hash_shift);
+	return &mountroot_hashtable[tmp & mr_hash_mask];
+}
+
 /*
  * allocation is serialized by namespace_sem, but we need the spinlock to
  * serialize with freeing.
@@ -234,6 +254,7 @@ static struct mount *alloc_vfsmnt(const char *name)
 		INIT_LIST_HEAD(&mnt->mnt_slave_list);
 		INIT_LIST_HEAD(&mnt->mnt_slave);
 		INIT_HLIST_NODE(&mnt->mnt_mp_list);
+		INIT_HLIST_NODE(&mnt->mnt_mr_list);
 #ifdef CONFIG_FSNOTIFY
 		INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
 #endif
@@ -768,6 +789,77 @@ static void put_mountpoint(struct mountpoint *mp)
 	}
 }
 
+static struct mountroot *lookup_mountroot(struct dentry *dentry)
+{
+	struct hlist_head *chain = mr_hash(dentry);
+	struct mountroot *mr;
+
+	hlist_for_each_entry(mr, chain, r_hash) {
+		if (mr->r_dentry == dentry)
+			return mr;
+	}
+	return NULL;
+}
+
+static int mnt_set_root(struct mount *mnt, struct dentry *root)
+{
+	struct mountroot *mr = NULL;
+
+	lock_mount_hash();
+	if (unlikely(d_mountroot(root)))
+		mr = lookup_mountroot(root);
+	if (!mr) {
+		struct mountroot *new;
+		unlock_mount_hash();
+
+		new = kmalloc(sizeof(struct mountroot), GFP_KERNEL);
+		if (!new)
+			return -ENOMEM;
+
+		lock_mount_hash();
+		mr = lookup_mountroot(root);
+		if (mr) {
+			kfree(new);
+		} else {
+			struct hlist_head *chain = mr_hash(root);
+
+			mr = new;
+			mr->r_dentry = root;
+			INIT_HLIST_HEAD(&mr->r_list);
+			hlist_add_head(&mr->r_hash, chain);
+
+			spin_lock(&root->d_lock);
+			root->d_flags |= DCACHE_MOUNTROOT;
+			spin_unlock(&root->d_lock);
+		}
+	}
+	mnt->mnt.mnt_root = root;
+	hlist_add_head(&mnt->mnt_mr_list, &mr->r_list);
+	unlock_mount_hash();
+
+	return 0;
+}
+
+static void mnt_put_root(struct mount *mnt)
+{
+	struct dentry *root = mnt->mnt.mnt_root;
+	struct mountroot *mr;
+
+	lock_mount_hash();
+	mr = lookup_mountroot(root);
+	BUG_ON(!mr);
+	hlist_del(&mnt->mnt_mr_list);
+	if (hlist_empty(&mr->r_list)) {
+		hlist_del(&mr->r_hash);
+		spin_lock(&root->d_lock);
+		root->d_flags &= ~DCACHE_MOUNTROOT;
+		spin_unlock(&root->d_lock);
+		kfree(mr);
+	}
+	unlock_mount_hash();
+	dput(root);
+}
+
 static inline int check_mnt(struct mount *mnt)
 {
 	return mnt->mnt_ns == current->nsproxy->mnt_ns;
@@ -923,6 +1015,7 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 {
 	struct mount *mnt;
 	struct dentry *root;
+	int err;
 
 	if (!type)
 		return ERR_PTR(-ENODEV);
@@ -941,7 +1034,15 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 		return ERR_CAST(root);
 	}
 
-	mnt->mnt.mnt_root = root;
+	err = mnt_set_root(mnt, root);
+	if (err) {
+		dput(root);
+		deactivate_super(root->d_sb);
+		mnt_free_id(mnt);
+		free_vfsmnt(mnt);
+		return ERR_PTR(err);
+	}
+
 	mnt->mnt.mnt_sb = root->d_sb;
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
@@ -974,6 +1075,10 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 			goto out_free;
 	}
 
+	err = mnt_set_root(mnt, root);
+	if (err)
+		goto out_free;
+
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED);
 	/* Don't allow unprivileged users to change mount flags */
 	if (flag & CL_UNPRIVILEGED) {
@@ -997,9 +1102,9 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	    (!(flag & CL_EXPIRE) || list_empty(&old->mnt_expire)))
 		mnt->mnt.mnt_flags |= MNT_LOCKED;
 
-	atomic_inc(&sb->s_active);
 	mnt->mnt.mnt_sb = sb;
-	mnt->mnt.mnt_root = dget(root);
+	atomic_inc(&sb->s_active);
+	dget(root);
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
 	lock_mount_hash();
@@ -1052,7 +1157,7 @@ static void cleanup_mnt(struct mount *mnt)
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	fsnotify_vfsmount_delete(&mnt->mnt);
-	dput(mnt->mnt.mnt_root);
+	mnt_put_root(mnt);
 	deactivate_super(mnt->mnt.mnt_sb);
 	mnt_free_id(mnt);
 	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
@@ -3079,14 +3184,21 @@ void __init mnt_init(void)
 				mphash_entries, 19,
 				0,
 				&mp_hash_shift, &mp_hash_mask, 0, 0);
+	mountroot_hashtable = alloc_large_system_hash("Mountroot-cache",
+				sizeof(struct hlist_head),
+				mrhash_entries, 19,
+				0,
+				&mr_hash_shift, &mr_hash_mask, 0, 0);
 
-	if (!mount_hashtable || !mountpoint_hashtable)
+	if (!mount_hashtable || !mountpoint_hashtable || !mountroot_hashtable)
 		panic("Failed to allocate mount hash table\n");
 
 	for (u = 0; u <= m_hash_mask; u++)
 		INIT_HLIST_HEAD(&mount_hashtable[u]);
 	for (u = 0; u <= mp_hash_mask; u++)
 		INIT_HLIST_HEAD(&mountpoint_hashtable[u]);
+	for (u = 0; u <= mr_hash_mask; u++)
+		INIT_HLIST_HEAD(&mountroot_hashtable[u]);
 
 	kernfs_init();
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index d8358799c594..dd987fb9e1f7 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -226,6 +226,8 @@ struct dentry_operations {
 #define DCACHE_MAY_FREE			0x00800000
 #define DCACHE_FALLTHRU			0x01000000 /* Fall through to lower layer */
 
+#define DCACHE_MOUNTROOT		0x02000000 /* is root of a vfsmount */
+
 extern seqlock_t rename_lock;
 
 /*
@@ -401,6 +403,11 @@ static inline bool d_mountpoint(const struct dentry *dentry)
 	return dentry->d_flags & DCACHE_MOUNTED;
 }
 
+static inline bool d_mountroot(const struct dentry *dentry)
+{
+	return dentry->d_flags & DCACHE_MOUNTROOT;
+}
+
 /*
  * Directory cache entry type accessor functions.
  */
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 16/19] mnt: Track which mounts use a dentry as root.
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
                         ` (15 preceding siblings ...)
  2015-04-03  1:56       ` [PATCH review 15/19] mnt: Update detach_mounts to leave mounts connected Eric W. Biederman
@ 2015-04-03  1:56       ` Eric W. Biederman
       [not found]         ` <1428026183-14879-16-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2015-04-03  1:56       ` [PATCH review 17/19] vfs: Test for and handle paths that are unreachable from their mnt_root Eric W. Biederman
                         ` (4 subsequent siblings)
  21 siblings, 1 reply; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Andrey Vagin, Al Viro, Jann Horn,
	Willy Tarreau, Omar Sandoval

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
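A single-threaded userspace sketch of the bookkeeping that
mnt_set_root and mnt_put_root add: a tracking record is created when
the first mount uses a dentry as its root and freed when the last one
goes away, with a flag on the dentry mirroring whether a record
exists. All demo_* names are hypothetical, a plain linked list and a
counter stand in for the hash table and the per-record mount list, and
the locking and the drop-lock-then-recheck step of the real code are
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct demo_dentry {
	bool is_mountroot; /* plays the role of DCACHE_MOUNTROOT */
};

struct demo_root {
	struct demo_root *next;
	struct demo_dentry *dentry;
	unsigned int users; /* stands in for the list of mounts */
};

static struct demo_root *demo_roots;

struct demo_root *demo_lookup(struct demo_dentry *d)
{
	for (struct demo_root *r = demo_roots; r; r = r->next)
		if (r->dentry == d)
			return r;
	return NULL;
}

/* Record one more mount rooted at d; returns 0 or -1 on allocation failure. */
int demo_set_root(struct demo_dentry *d)
{
	/* The flag is a cheap hint that lets the common case skip the lookup. */
	struct demo_root *r = d->is_mountroot ? demo_lookup(d) : NULL;

	if (!r) {
		r = malloc(sizeof(*r));
		if (!r)
			return -1;
		r->dentry = d;
		r->users = 0;
		r->next = demo_roots;
		demo_roots = r;
		d->is_mountroot = true;
	}
	r->users++;
	return 0;
}

/* Drop one mount rooted at d; free the record when nobody is left. */
void demo_put_root(struct demo_dentry *d)
{
	struct demo_root **pp = &demo_roots;

	while (*pp && (*pp)->dentry != d)
		pp = &(*pp)->next;
	assert(*pp); /* the patch uses BUG_ON for the same condition */

	struct demo_root *r = *pp;
	if (--r->users == 0) {
		*pp = r->next;
		d->is_mountroot = false;
		free(r);
	}
}
```

Two calls to demo_set_root on the same dentry share one record, and
the flag clears only after the second demo_put_root.
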
 fs/mount.h             |   7 +++
 fs/namespace.c         | 122 +++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dcache.h |   7 +++
 3 files changed, 131 insertions(+), 5 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 6a61c2b3e385..a8be3033e022 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -27,6 +27,12 @@ struct mountpoint {
 	int m_count;
 };
 
+struct mountroot {
+	struct hlist_node r_hash;
+	struct dentry *r_dentry;
+	struct hlist_head r_list;
+};
+
 struct mount {
 	struct hlist_node mnt_hash;
 	struct mount *mnt_parent;
@@ -55,6 +61,7 @@ struct mount {
 	struct mnt_namespace *mnt_ns;	/* containing namespace */
 	struct mountpoint *mnt_mp;	/* where is it mounted */
 	struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
+	struct hlist_node mnt_mr_list;	/* list mounts with the same mountroot */
 #ifdef CONFIG_FSNOTIFY
 	struct hlist_head mnt_fsnotify_marks;
 	__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index 1f4f9dac6e5a..5b1b666439ac 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -31,6 +31,8 @@ static unsigned int m_hash_mask __read_mostly;
 static unsigned int m_hash_shift __read_mostly;
 static unsigned int mp_hash_mask __read_mostly;
 static unsigned int mp_hash_shift __read_mostly;
+static unsigned int mr_hash_mask __read_mostly;
+static unsigned int mr_hash_shift __read_mostly;
 
 static __initdata unsigned long mhash_entries;
 static int __init set_mhash_entries(char *str)
@@ -52,6 +54,16 @@ static int __init set_mphash_entries(char *str)
 }
 __setup("mphash_entries=", set_mphash_entries);
 
+static __initdata unsigned long mrhash_entries;
+static int __init set_mrhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	mrhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("mrhash_entries=", set_mrhash_entries);
+
 static u64 event;
 static DEFINE_IDA(mnt_id_ida);
 static DEFINE_IDA(mnt_group_ida);
@@ -61,6 +73,7 @@ static int mnt_group_start = 1;
 
 static struct hlist_head *mount_hashtable __read_mostly;
 static struct hlist_head *mountpoint_hashtable __read_mostly;
+static struct hlist_head *mountroot_hashtable __read_mostly;
 static struct kmem_cache *mnt_cache __read_mostly;
 static DECLARE_RWSEM(namespace_sem);
 
@@ -93,6 +106,13 @@ static inline struct hlist_head *mp_hash(struct dentry *dentry)
 	return &mountpoint_hashtable[tmp & mp_hash_mask];
 }
 
+static inline struct hlist_head *mr_hash(struct dentry *dentry)
+{
+	unsigned long tmp = ((unsigned long)dentry / L1_CACHE_BYTES);
+	tmp = tmp + (tmp >> mr_hash_shift);
+	return &mountroot_hashtable[tmp & mr_hash_mask];
+}
+
 /*
  * allocation is serialized by namespace_sem, but we need the spinlock to
  * serialize with freeing.
@@ -234,6 +254,7 @@ static struct mount *alloc_vfsmnt(const char *name)
 		INIT_LIST_HEAD(&mnt->mnt_slave_list);
 		INIT_LIST_HEAD(&mnt->mnt_slave);
 		INIT_HLIST_NODE(&mnt->mnt_mp_list);
+		INIT_HLIST_NODE(&mnt->mnt_mr_list);
 #ifdef CONFIG_FSNOTIFY
 		INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
 #endif
@@ -768,6 +789,77 @@ static void put_mountpoint(struct mountpoint *mp)
 	}
 }
 
+static struct mountroot *lookup_mountroot(struct dentry *dentry)
+{
+	struct hlist_head *chain = mr_hash(dentry);
+	struct mountroot *mr;
+
+	hlist_for_each_entry(mr, chain, r_hash) {
+		if (mr->r_dentry == dentry)
+			return mr;
+	}
+	return NULL;
+}
+
+static int mnt_set_root(struct mount *mnt, struct dentry *root)
+{
+	struct mountroot *mr = NULL;
+
+	lock_mount_hash();
+	if (unlikely(d_mountroot(root)))
+		mr = lookup_mountroot(root);
+	if (!mr) {
+		struct mountroot *new;
+		unlock_mount_hash();
+
+		new = kmalloc(sizeof(struct mountroot), GFP_KERNEL);
+		if (!new)
+			return -ENOMEM;
+
+		lock_mount_hash();
+		mr = lookup_mountroot(root);
+		if (mr) {
+			kfree(new);
+		} else {
+			struct hlist_head *chain = mr_hash(root);
+
+			mr = new;
+			mr->r_dentry = root;
+			INIT_HLIST_HEAD(&mr->r_list);
+			hlist_add_head(&mr->r_hash, chain);
+
+			spin_lock(&root->d_lock);
+			root->d_flags |= DCACHE_MOUNTROOT;
+			spin_unlock(&root->d_lock);
+		}
+	}
+	mnt->mnt.mnt_root = root;
+	hlist_add_head(&mnt->mnt_mr_list, &mr->r_list);
+	unlock_mount_hash();
+
+	return 0;
+}
+
+static void mnt_put_root(struct mount *mnt)
+{
+	struct dentry *root = mnt->mnt.mnt_root;
+	struct mountroot *mr;
+
+	lock_mount_hash();
+	mr = lookup_mountroot(root);
+	BUG_ON(!mr);
+	hlist_del(&mnt->mnt_mr_list);
+	if (hlist_empty(&mr->r_list)) {
+		hlist_del(&mr->r_hash);
+		spin_lock(&root->d_lock);
+		root->d_flags &= ~DCACHE_MOUNTROOT;
+		spin_unlock(&root->d_lock);
+		kfree(mr);
+	}
+	unlock_mount_hash();
+	dput(root);
+}
+
 static inline int check_mnt(struct mount *mnt)
 {
 	return mnt->mnt_ns == current->nsproxy->mnt_ns;
@@ -923,6 +1015,7 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 {
 	struct mount *mnt;
 	struct dentry *root;
+	int err;
 
 	if (!type)
 		return ERR_PTR(-ENODEV);
@@ -941,7 +1034,15 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 		return ERR_CAST(root);
 	}
 
-	mnt->mnt.mnt_root = root;
+	err = mnt_set_root(mnt, root);
+	if (err) {
+		dput(root);
+		deactivate_super(root->d_sb);
+		mnt_free_id(mnt);
+		free_vfsmnt(mnt);
+		return ERR_PTR(err);
+	}
+
 	mnt->mnt.mnt_sb = root->d_sb;
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
@@ -974,6 +1075,10 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 			goto out_free;
 	}
 
+	err = mnt_set_root(mnt, root);
+	if (err)
+		goto out_free;
+
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED);
 	/* Don't allow unprivileged users to change mount flags */
 	if (flag & CL_UNPRIVILEGED) {
@@ -997,9 +1102,9 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	    (!(flag & CL_EXPIRE) || list_empty(&old->mnt_expire)))
 		mnt->mnt.mnt_flags |= MNT_LOCKED;
 
-	atomic_inc(&sb->s_active);
 	mnt->mnt.mnt_sb = sb;
-	mnt->mnt.mnt_root = dget(root);
+	atomic_inc(&sb->s_active);
+	dget(root);
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
 	lock_mount_hash();
@@ -1052,7 +1157,7 @@ static void cleanup_mnt(struct mount *mnt)
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	fsnotify_vfsmount_delete(&mnt->mnt);
-	dput(mnt->mnt.mnt_root);
+	mnt_put_root(mnt);
 	deactivate_super(mnt->mnt.mnt_sb);
 	mnt_free_id(mnt);
 	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
@@ -3079,14 +3184,21 @@ void __init mnt_init(void)
 				mphash_entries, 19,
 				0,
 				&mp_hash_shift, &mp_hash_mask, 0, 0);
+	mountroot_hashtable = alloc_large_system_hash("Mountroot-cache",
+				sizeof(struct hlist_head),
+				mrhash_entries, 19,
+				0,
+				&mr_hash_shift, &mr_hash_mask, 0, 0);
 
-	if (!mount_hashtable || !mountpoint_hashtable)
+	if (!mount_hashtable || !mountpoint_hashtable || !mountroot_hashtable)
 		panic("Failed to allocate mount hash table\n");
 
 	for (u = 0; u <= m_hash_mask; u++)
 		INIT_HLIST_HEAD(&mount_hashtable[u]);
 	for (u = 0; u <= mp_hash_mask; u++)
 		INIT_HLIST_HEAD(&mountpoint_hashtable[u]);
+	for (u = 0; u <= mr_hash_mask; u++)
+		INIT_HLIST_HEAD(&mountroot_hashtable[u]);
 
 	kernfs_init();
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index d8358799c594..dd987fb9e1f7 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -226,6 +226,8 @@ struct dentry_operations {
 #define DCACHE_MAY_FREE			0x00800000
 #define DCACHE_FALLTHRU			0x01000000 /* Fall through to lower layer */
 
+#define DCACHE_MOUNTROOT		0x02000000 /* is root of a vfsmount */
+
 extern seqlock_t rename_lock;
 
 /*
@@ -401,6 +403,11 @@ static inline bool d_mountpoint(const struct dentry *dentry)
 	return dentry->d_flags & DCACHE_MOUNTED;
 }
 
+static inline bool d_mountroot(const struct dentry *dentry)
+{
+	return dentry->d_flags & DCACHE_MOUNTROOT;
+}
+
 /*
  * Directory cache entry type accessor functions.
  */
-- 
2.2.1



* [PATCH review 17/19] vfs: Test for and handle paths that are unreachable from their mnt_root
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (15 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 16/19] mnt: Track which mounts use a dentry as root Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 18/19] vfs: Handle mounts whose parents are unreachable from their mountpoint Eric W. Biederman
                           ` (3 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Omar Sandoval,
	Willy Tarreau

- Add a mount flag MNT_VIOLATED to mark loopback mounts that have had
  a dentry moved into a directory that does not descend from the mount
  root dentry.

- Add a function path_connected to verify that path.dentry is reachable
  from path.mnt.mnt_root, i.e. that rename did not do something nasty
  to the bind mount.

- Disable ".." when a path is not connected during lookup.
  (Maybe we want to stop ".." at this path instead?)

  Following ".." is not disabled after a transition to / and is never
  disabled when / is the directory we start with, because ".." is
  already limited to go no higher than /.

- In prepend_path and its callers in the d_path family, don't attempt
  to find parent directories for a path that is not connected.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
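For reviewers who want to poke at the connectivity test outside the
kernel, here is a minimal userspace sketch of the idea.  It is only a
model: every toy_* name is hypothetical, the "violated" field stands in
for MNT_VIOLATED, and toy_is_subdir only imitates what is_subdir
answers.  Nothing here is the code in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; all names are hypothetical. */
struct toy_dentry {
	struct toy_dentry *parent;	/* a filesystem root is its own parent */
};

struct toy_mount {
	struct toy_dentry *root;	/* models vfsmount.mnt_root */
	bool violated;			/* models MNT_VIOLATED */
};

struct toy_path {
	struct toy_mount *mnt;
	struct toy_dentry *dentry;
};

/* True if d is base itself or lies somewhere below base. */
static bool toy_is_subdir(const struct toy_dentry *d,
			  const struct toy_dentry *base)
{
	for (;;) {
		if (d == base)
			return true;
		if (d->parent == d)	/* hit the filesystem root */
			return false;
		d = d->parent;
	}
}

/* Cheap flag test first; only suspect mounts pay for the ancestry walk. */
static bool toy_path_connected(const struct toy_path *path)
{
	assert(path != NULL && path->mnt != NULL);
	if (!path->mnt->violated)
		return true;
	return toy_is_subdir(path->dentry, path->mnt->root);
}

/*
 * Bind mount rooted at /a with a path at /a/b.  When "renamed" is true,
 * b is moved to /b, outside the mount root, and the mount is flagged.
 */
static bool toy_demo_connected(bool renamed)
{
	struct toy_dentry fsroot = { .parent = &fsroot };
	struct toy_dentry a = { .parent = &fsroot };
	struct toy_dentry b = { .parent = &a };
	struct toy_mount bind = { .root = &a, .violated = false };
	struct toy_path path = { .mnt = &bind, .dentry = &b };

	if (renamed) {
		b.parent = &fsroot;
		bind.violated = true;
	}
	return toy_path_connected(&path);
}
```

The point of the flag is that the common case (no rename ever crossed
the mount root) costs a single bit test, and the walk up the parent
chain is only paid on mounts that have already been marked.
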
 fs/dcache.c           |  3 +++
 fs/internal.h         |  1 +
 fs/namei.c            | 30 ++++++++++++++++++++++++++++++
 include/linux/mount.h |  1 +
 include/linux/namei.h |  2 ++
 5 files changed, 37 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index c71e3732e53b..e07eb03f6de6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2871,6 +2871,9 @@ static int prepend_path(const struct path *path,
 	char *bptr;
 	int blen;
 
+	if (!path_connected(path))
+		root = path;
+
 	rcu_read_lock();
 restart_mnt:
 	read_seqbegin_or_lock(&mount_lock, &m_seq);
diff --git a/fs/internal.h b/fs/internal.h
index 01dce1d1476b..046767f0042e 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -51,6 +51,7 @@ extern void __init chrdev_init(void);
 extern int user_path_mountpoint_at(int, const char __user *, unsigned int, struct path *);
 extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
 			   const char *, unsigned int, struct path *);
+extern bool path_connected(const struct path *);
 
 /*
  * namespace.c
diff --git a/fs/namei.c b/fs/namei.c
index c83145af4bfc..ec09c90089a9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -493,6 +493,22 @@ void path_put(const struct path *path)
 }
 EXPORT_SYMBOL(path_put);
 
+/**
+ * path_connected - Verify that path->dentry is below path->mnt->mnt_root
+ * @path: path to verify
+ *
+ * Rename can sometimes move a file or directory outside of a bind mount;
+ * don't honor paths where this has happened.
+ */
+bool path_connected(const struct path *path)
+{
+	struct vfsmount *mnt = path->mnt;
+	if (!(mnt->mnt_flags & MNT_VIOLATED))
+		return true;
+
+	return is_subdir(path->dentry, mnt->mnt_root);
+}
+
 struct nameidata {
 	struct path	path;
 	struct qstr	last;
@@ -712,6 +728,7 @@ void nd_jump_link(struct nameidata *nd, struct path *path)
 	nd->path = *path;
 	nd->inode = nd->path.dentry->d_inode;
 	nd->flags |= LOOKUP_JUMPED;
+	nd->flags &= ~LOOKUP_NODOTDOT;
 }
 
 void nd_set_link(struct nameidata *nd, char *path)
@@ -897,6 +914,7 @@ follow_link(struct path *link, struct nameidata *nd, void **p)
 			nd->path = nd->root;
 			path_get(&nd->root);
 			nd->flags |= LOOKUP_JUMPED;
+			nd->flags &= ~LOOKUP_NODOTDOT;
 		}
 		nd->inode = nd->path.dentry->d_inode;
 		error = link_path_walk(s, nd);
@@ -1161,6 +1179,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		path->mnt = &mounted->mnt;
 		path->dentry = mounted->mnt.mnt_root;
 		nd->flags |= LOOKUP_JUMPED;
+		nd->flags &= ~LOOKUP_NODOTDOT;
 		nd->seq = read_seqcount_begin(&path->dentry->d_seq);
 		/*
 		 * Update the inode too. We don't need to re-check the
@@ -1176,6 +1195,10 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 static int follow_dotdot_rcu(struct nameidata *nd)
 {
 	struct inode *inode = nd->inode;
+
+	if (nd->flags & LOOKUP_NODOTDOT)
+		return 0;
+
 	if (!nd->root.mnt)
 		set_root_rcu(nd);
 
@@ -1293,6 +1316,9 @@ static void follow_mount(struct path *path)
 
 static void follow_dotdot(struct nameidata *nd)
 {
+	if (nd->flags & LOOKUP_NODOTDOT)
+		return;
+
 	if (!nd->root.mnt)
 		set_root(nd);
 
@@ -1909,6 +1935,8 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 		} else {
 			get_fs_pwd(current->fs, &nd->path);
 		}
+		if (unlikely(!path_connected(&nd->path)))
+			nd->flags |= LOOKUP_NODOTDOT;
 	} else {
 		/* Caller must check execute permissions on the starting path component */
 		struct fd f = fdget_raw(dfd);
@@ -1936,6 +1964,8 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 			path_get(&nd->path);
 			fdput(f);
 		}
+		if (unlikely(!path_connected(&nd->path)))
+			nd->flags |= LOOKUP_NODOTDOT;
 	}
 
 	nd->inode = nd->path.dentry->d_inode;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 564beeec5d83..778d7cf65b9a 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -62,6 +62,7 @@ struct mnt_namespace;
 #define MNT_SYNC_UMOUNT		0x2000000
 #define MNT_MARKED		0x4000000
 #define MNT_UMOUNT		0x8000000
+#define MNT_VIOLATED		0x10000000
 
 struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
diff --git a/include/linux/namei.h b/include/linux/namei.h
index c8990779f0c3..55c8aaec7b03 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -45,6 +45,8 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_ROOT		0x2000
 #define LOOKUP_EMPTY		0x4000
 
+#define LOOKUP_NODOTDOT		0x10000
+
 extern int user_path_at(int, const char __user *, unsigned, struct path *);
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
 
-- 
2.2.1


* [PATCH review  17/19] vfs: Test for and handle paths that are unreachable from their mnt_root
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
                         ` (16 preceding siblings ...)
  2015-04-03  1:56       ` [PATCH review 16/19] mnt: Track which mounts use a dentry as root Eric W. Biederman
@ 2015-04-03  1:56       ` Eric W. Biederman
  2015-04-03  1:56       ` [PATCH review 18/19] vfs: Handle mounts whose parents are unreachable from their mountpoint Eric W. Biederman
                         ` (3 subsequent siblings)
  21 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Andrey Vagin, Al Viro, Jann Horn,
	Willy Tarreau, Omar Sandoval

- Add a mount flag MNT_VIOLATED to mark loopback mounts that have had
  a dentry moved into a directory that does not descend from the mount
  root dentry.

- Add a function path_connected to verify that path.dentry is reachable
  from path.mnt.mnt_root, i.e. that rename did not do something nasty
  to the bind mount.

- Disable ".." when a path is not connected during lookup.
  (Maybe we want to stop ".." at this path instead?)

  Following ".." is not disabled after a transition to / and is never
  disabled when / is the directory we start with, because ".." is
  already limited to go no higher than /.

- In prepend_path and its callers in the d_path family, don't attempt
  to find parent directories for a path that is not connected.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/dcache.c           |  3 +++
 fs/internal.h         |  1 +
 fs/namei.c            | 30 ++++++++++++++++++++++++++++++
 include/linux/mount.h |  1 +
 include/linux/namei.h |  2 ++
 5 files changed, 37 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index c71e3732e53b..e07eb03f6de6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2871,6 +2871,9 @@ static int prepend_path(const struct path *path,
 	char *bptr;
 	int blen;
 
+	if (!path_connected(path))
+		root = path;
+
 	rcu_read_lock();
 restart_mnt:
 	read_seqbegin_or_lock(&mount_lock, &m_seq);
diff --git a/fs/internal.h b/fs/internal.h
index 01dce1d1476b..046767f0042e 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -51,6 +51,7 @@ extern void __init chrdev_init(void);
 extern int user_path_mountpoint_at(int, const char __user *, unsigned int, struct path *);
 extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
 			   const char *, unsigned int, struct path *);
+extern bool path_connected(const struct path *);
 
 /*
  * namespace.c
diff --git a/fs/namei.c b/fs/namei.c
index c83145af4bfc..ec09c90089a9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -493,6 +493,22 @@ void path_put(const struct path *path)
 }
 EXPORT_SYMBOL(path_put);
 
+/**
+ * path_connected - Verify that path->dentry is below path->mnt->mnt_root
+ * @path: path to verify
+ *
+ * Rename can sometimes move a file or directory outside of a bind mount;
+ * don't honor paths where this has happened.
+ */
+bool path_connected(const struct path *path)
+{
+	struct vfsmount *mnt = path->mnt;
+	if (!(mnt->mnt_flags & MNT_VIOLATED))
+		return true;
+
+	return is_subdir(path->dentry, mnt->mnt_root);
+}
+
 struct nameidata {
 	struct path	path;
 	struct qstr	last;
@@ -712,6 +728,7 @@ void nd_jump_link(struct nameidata *nd, struct path *path)
 	nd->path = *path;
 	nd->inode = nd->path.dentry->d_inode;
 	nd->flags |= LOOKUP_JUMPED;
+	nd->flags &= ~LOOKUP_NODOTDOT;
 }
 
 void nd_set_link(struct nameidata *nd, char *path)
@@ -897,6 +914,7 @@ follow_link(struct path *link, struct nameidata *nd, void **p)
 			nd->path = nd->root;
 			path_get(&nd->root);
 			nd->flags |= LOOKUP_JUMPED;
+			nd->flags &= ~LOOKUP_NODOTDOT;
 		}
 		nd->inode = nd->path.dentry->d_inode;
 		error = link_path_walk(s, nd);
@@ -1161,6 +1179,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		path->mnt = &mounted->mnt;
 		path->dentry = mounted->mnt.mnt_root;
 		nd->flags |= LOOKUP_JUMPED;
+		nd->flags &= ~LOOKUP_NODOTDOT;
 		nd->seq = read_seqcount_begin(&path->dentry->d_seq);
 		/*
 		 * Update the inode too. We don't need to re-check the
@@ -1176,6 +1195,10 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 static int follow_dotdot_rcu(struct nameidata *nd)
 {
 	struct inode *inode = nd->inode;
+
+	if (nd->flags & LOOKUP_NODOTDOT)
+		return 0;
+
 	if (!nd->root.mnt)
 		set_root_rcu(nd);
 
@@ -1293,6 +1316,9 @@ static void follow_mount(struct path *path)
 
 static void follow_dotdot(struct nameidata *nd)
 {
+	if (nd->flags & LOOKUP_NODOTDOT)
+		return;
+
 	if (!nd->root.mnt)
 		set_root(nd);
 
@@ -1909,6 +1935,8 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 		} else {
 			get_fs_pwd(current->fs, &nd->path);
 		}
+		if (unlikely(!path_connected(&nd->path)))
+			nd->flags |= LOOKUP_NODOTDOT;
 	} else {
 		/* Caller must check execute permissions on the starting path component */
 		struct fd f = fdget_raw(dfd);
@@ -1936,6 +1964,8 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 			path_get(&nd->path);
 			fdput(f);
 		}
+		if (unlikely(!path_connected(&nd->path)))
+			nd->flags |= LOOKUP_NODOTDOT;
 	}
 
 	nd->inode = nd->path.dentry->d_inode;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 564beeec5d83..778d7cf65b9a 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -62,6 +62,7 @@ struct mnt_namespace;
 #define MNT_SYNC_UMOUNT		0x2000000
 #define MNT_MARKED		0x4000000
 #define MNT_UMOUNT		0x8000000
+#define MNT_VIOLATED		0x10000000
 
 struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
diff --git a/include/linux/namei.h b/include/linux/namei.h
index c8990779f0c3..55c8aaec7b03 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -45,6 +45,8 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_ROOT		0x2000
 #define LOOKUP_EMPTY		0x4000
 
+#define LOOKUP_NODOTDOT		0x10000
+
 extern int user_path_at(int, const char __user *, unsigned, struct path *);
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
 
-- 
2.2.1



* [PATCH review 18/19] vfs: Handle mounts whose parents are unreachable from their mountpoint
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (16 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 17/19] vfs: Test for and handle paths that are unreachable from their mnt_root Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-03  1:56         ` [PATCH review 19/19] vfs: Do not allow escaping from bind mounts Eric W. Biederman
                           ` (2 subsequent siblings)
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Omar Sandoval,
	Willy Tarreau

- Add a mount flag MNT_UNREACHABLE_PARENT to mark mounts that cannot
  reach their parent mount's root from their mountpoint.

- In follow_up and follow_up_rcu don't follow up if the current
  mount's mountpoint cannot reach the parent mount's root.

- In prepend_path and its callers in the d_path family don't follow
  to the parent mount if the current mount's mountpoint cannot reach
  the parent mount's root.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
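As a rough illustration of the follow_up change, the sketch below
models a three-level mount stack in userspace and counts how many
times a walker can step up.  The toy_* names, the integer "dentry"
ids and TOY_UNREACHABLE_PARENT are all made up for the example; only
the shape of the test (flag set, or mount is its own parent) is taken
from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_UNREACHABLE_PARENT	0x1	/* models MNT_UNREACHABLE_PARENT */

struct toy_mount {
	struct toy_mount *parent;	/* the topmost mount is its own parent */
	int mountpoint;			/* opaque id of the dentry it sits on */
	unsigned int flags;
};

struct toy_path {
	struct toy_mount *mnt;
	int dentry;
};

/*
 * Step from a mount to its mountpoint in the parent mount.
 * Returns 1 if the step was taken, 0 if path was left untouched.
 */
static int toy_follow_up(struct toy_path *path)
{
	struct toy_mount *mnt;

	assert(path != NULL && path->mnt != NULL);
	mnt = path->mnt;
	if ((mnt->flags & TOY_UNREACHABLE_PARENT) || mnt->parent == mnt)
		return 0;
	path->dentry = mnt->mountpoint;
	path->mnt = mnt->parent;
	return 1;
}

/* top <- mid <- leaf; optionally cut mid off from its parent. */
static int toy_count_steps_up(bool cut_off)
{
	struct toy_mount top = { .parent = &top, .mountpoint = 0 };
	struct toy_mount mid = { .parent = &top, .mountpoint = 10 };
	struct toy_mount leaf = { .parent = &mid, .mountpoint = 20 };
	struct toy_path path = { .mnt = &leaf, .dentry = 30 };
	int steps = 0;

	if (cut_off)
		mid.flags |= TOY_UNREACHABLE_PARENT;
	while (toy_follow_up(&path))
		steps++;
	return steps;
}
```

With the flag clear the walker reaches the top mount in two steps;
with the flag set on the middle mount it stops there, which is the
behaviour the flagged mount is meant to have until it is unmounted.
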
 fs/dcache.c           | 3 ++-
 fs/namei.c            | 4 ++--
 include/linux/mount.h | 1 +
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index e07eb03f6de6..cae4a42c1846 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2893,7 +2893,8 @@ restart:
 		if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
 			struct mount *parent = ACCESS_ONCE(mnt->mnt_parent);
 			/* Global root? */
-			if (mnt != parent) {
+			if ((mnt != parent) &&
+			    !(mnt->mnt.mnt_flags & MNT_UNREACHABLE_PARENT)) {
 				dentry = ACCESS_ONCE(mnt->mnt_mountpoint);
 				mnt = parent;
 				vfsmnt = &mnt->mnt;
diff --git a/fs/namei.c b/fs/namei.c
index ec09c90089a9..31a9c9a5787d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -938,7 +938,7 @@ static int follow_up_rcu(struct path *path)
 	struct dentry *mountpoint;
 
 	parent = mnt->mnt_parent;
-	if (&parent->mnt == path->mnt)
+	if ((mnt->mnt.mnt_flags & MNT_UNREACHABLE_PARENT) || (parent == mnt))
 		return 0;
 	mountpoint = mnt->mnt_mountpoint;
 	path->dentry = mountpoint;
@@ -964,7 +964,7 @@ int follow_up(struct path *path)
 
 	read_seqlock_excl(&mount_lock);
 	parent = mnt->mnt_parent;
-	if (parent == mnt) {
+	if ((mnt->mnt.mnt_flags & MNT_UNREACHABLE_PARENT) || (parent == mnt)) {
 		read_sequnlock_excl(&mount_lock);
 		return 0;
 	}
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 778d7cf65b9a..3249da1af130 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -63,6 +63,7 @@ struct mnt_namespace;
 #define MNT_MARKED		0x4000000
 #define MNT_UMOUNT		0x8000000
 #define MNT_VIOLATED		0x10000000
+#define MNT_UNREACHABLE_PARENT	0x20000000
 
 struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
-- 
2.2.1


* [PATCH review  18/19] vfs: Handle mounts whose parents are unreachable from their mountpoint
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
                         ` (17 preceding siblings ...)
  2015-04-03  1:56       ` [PATCH review 17/19] vfs: Test for and handle paths that are unreachable from their mnt_root Eric W. Biederman
@ 2015-04-03  1:56       ` Eric W. Biederman
  2015-04-03  1:56       ` [PATCH review 19/19] vfs: Do not allow escaping from bind mounts Eric W. Biederman
                         ` (2 subsequent siblings)
  21 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Andrey Vagin, Al Viro, Jann Horn,
	Willy Tarreau, Omar Sandoval

- Add a mount flag MNT_UNREACHABLE_PARENT to mark mounts that cannot
  reach their parent mount's root from their mountpoint.

- In follow_up and follow_up_rcu don't follow up if the current
  mount's mountpoint cannot reach the parent mount's root.

- In prepend_path and its callers in the d_path family don't follow
  to the parent mount if the current mount's mountpoint cannot reach
  the parent mount's root.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/dcache.c           | 3 ++-
 fs/namei.c            | 4 ++--
 include/linux/mount.h | 1 +
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index e07eb03f6de6..cae4a42c1846 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2893,7 +2893,8 @@ restart:
 		if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
 			struct mount *parent = ACCESS_ONCE(mnt->mnt_parent);
 			/* Global root? */
-			if (mnt != parent) {
+			if ((mnt != parent) &&
+			    !(mnt->mnt.mnt_flags & MNT_UNREACHABLE_PARENT)) {
 				dentry = ACCESS_ONCE(mnt->mnt_mountpoint);
 				mnt = parent;
 				vfsmnt = &mnt->mnt;
diff --git a/fs/namei.c b/fs/namei.c
index ec09c90089a9..31a9c9a5787d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -938,7 +938,7 @@ static int follow_up_rcu(struct path *path)
 	struct dentry *mountpoint;
 
 	parent = mnt->mnt_parent;
-	if (&parent->mnt == path->mnt)
+	if ((mnt->mnt.mnt_flags & MNT_UNREACHABLE_PARENT) || (parent == mnt))
 		return 0;
 	mountpoint = mnt->mnt_mountpoint;
 	path->dentry = mountpoint;
@@ -964,7 +964,7 @@ int follow_up(struct path *path)
 
 	read_seqlock_excl(&mount_lock);
 	parent = mnt->mnt_parent;
-	if (parent == mnt) {
+	if ((mnt->mnt.mnt_flags & MNT_UNREACHABLE_PARENT) || (parent == mnt)) {
 		read_sequnlock_excl(&mount_lock);
 		return 0;
 	}
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 778d7cf65b9a..3249da1af130 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -63,6 +63,7 @@ struct mnt_namespace;
 #define MNT_MARKED		0x4000000
 #define MNT_UMOUNT		0x8000000
 #define MNT_VIOLATED		0x10000000
+#define MNT_UNREACHABLE_PARENT	0x20000000
 
 struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
-- 
2.2.1



* [PATCH review  19/19] vfs: Do not allow escaping from bind mounts.
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (17 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 18/19] vfs: Handle mounts whose parents are unreachable from their mountpoint Eric W. Biederman
@ 2015-04-03  1:56         ` Eric W. Biederman
  2015-04-08 23:31         ` [PATCH review 0/4] Loopback mount escape fixes Eric W. Biederman
  2015-04-16 23:40         ` [GIT PULL] Usernamespace related locked mount fixes Eric W. Biederman
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Omar Sandoval,
	Willy Tarreau

Rename can move a file or directory outside of a bind mount.  This has
allowed programs with paths below the renamed directory to traverse up
their directory tree to the real root of the filesystem instead of
just the root of their bind mount.

In the presence of such renames, limit applications to what the bind
mount intended to reveal by marking mounts that have had dentries
renamed out of them with MNT_VIOLATED, marking mounts that can no
longer walk up to their parent mounts with MNT_UNREACHABLE_PARENT, and
then lazily unmounting such mounts.

All moves go through __d_move, so __d_move has been modified to mark
all mounts that have had dentries moved outside of them.

Once the root dentry of a violated mount has been found, a new function
mnt_set_violated is called to:

- mark all mounts that have that dentry as their root as violated

- mark all children of violated mounts that can no longer reach
  their parents

- schedule for unmounting all children of violated mounts that can
  no longer reach their parents

The children that can't reach their parents are only scheduled for
unmounting because namespace_sem is a sleeping lock and cannot be taken
inside of __d_move, which cannot sleep.

This change adds a field mnt_pending_umount to struct mount that is
used to thread the list of pending unmounts through struct mount.  As
there are small but unavoidable races between scheduling an unmount and
the possibility of userspace calling umount_tree, umount_tree has been
modified to remove all mounts that are being unmounted from the
pending_umount list.

This closes a hole where it was possible in some circumstances
to follow .. past the root of a bind mount.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
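The walk in mark_violated_mounts can be tried out in userspace with
the sketch below.  It is a model under stated assumptions: toy_dentry
and its "mountroot" and "violated" fields are hypothetical stand-ins
for DCACHE_MOUNTROOT and for the call to mnt_set_violated, and
toy_is_ancestor only imitates a proper-ancestor test in the style of
d_ancestor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dentry {
	struct toy_dentry *parent;	/* a filesystem root is its own parent */
	bool mountroot;			/* models DCACHE_MOUNTROOT */
	bool violated;			/* set where mnt_set_violated would run */
};

static bool toy_is_root(const struct toy_dentry *d)
{
	return d->parent == d;
}

/* True if anc is a proper ancestor of d. */
static bool toy_is_ancestor(const struct toy_dentry *anc,
			    const struct toy_dentry *d)
{
	while (!toy_is_root(d)) {
		d = d->parent;
		if (d == anc)
			return true;
	}
	return false;
}

/*
 * Walk up from dentry's old parent.  Every mount root met before the
 * first one that also contains target loses dentry, so mark it.
 * Returns how many mount roots were marked.
 */
static int toy_mark_violated(struct toy_dentry *dentry,
			     const struct toy_dentry *target)
{
	struct toy_dentry *p;
	int marked = 0;

	assert(dentry != NULL && target != NULL);
	for (p = dentry->parent; !toy_is_root(p); p = p->parent) {
		if (!p->mountroot)
			continue;
		if (toy_is_ancestor(p, target))
			break;
		p->violated = true;
		marked++;
	}
	return marked;
}

/*
 * /srv and /srv/jail are mount roots; /srv/jail/sub/dir is moved either
 * to /srv/out/dir (escape) or to /srv/jail/other/dir (stays inside).
 */
static int toy_demo_marked(bool escape)
{
	struct toy_dentry fsroot = { .parent = &fsroot };
	struct toy_dentry srv = { .parent = &fsroot, .mountroot = true };
	struct toy_dentry jail = { .parent = &srv, .mountroot = true };
	struct toy_dentry sub = { .parent = &jail };
	struct toy_dentry other = { .parent = &jail };
	struct toy_dentry out = { .parent = &srv };
	struct toy_dentry dir = { .parent = &sub };
	struct toy_dentry target = { .parent = escape ? &out : &other };

	return toy_mark_violated(&dir, &target);
}
```

In the escape case only /srv/jail is marked: the walk stops at /srv
because the destination is still below it.  A move that stays inside
/srv/jail marks nothing.
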
 fs/dcache.c    | 29 ++++++++++++++++++++++++++++
 fs/internal.h  |  1 +
 fs/mount.h     |  1 +
 fs/namespace.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 92 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index cae4a42c1846..e04e2a23ad00 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2535,6 +2535,26 @@ static void dentry_unlock_for_move(struct dentry *dentry, struct dentry *target)
 	spin_unlock(&dentry->d_lock);
 }
 
+static void mark_violated_mounts(struct dentry *dentry, struct dentry *target)
+{
+	/* Mark all mountroots that are ancestors of dentry
+	 * that do not share a common ancestor with target
+	 *
+	 * This function assumes both dentries are part of a DAG.
+	 */
+	struct dentry *p;
+
+	for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
+		if (!d_mountroot(p))
+			continue;
+
+		if (d_ancestor(p, target))
+			break;
+
+		mnt_set_violated(p, dentry);
+	}
+}
+
 /*
  * When switching names, the actual string doesn't strictly have to
  * be preserved in the target - because we're dropping the target
@@ -2569,6 +2589,15 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	BUG_ON(d_ancestor(dentry, target));
 	BUG_ON(d_ancestor(target, dentry));
 
+	/* If we are not splicing a dentry, mark mounts which may have
+	 * paths that are no longer able to follow d_parent up to
+	 * mnt_root after this move.
+	 */
+	if (!IS_ROOT(dentry) && !IS_ROOT(target)) {
+		mark_violated_mounts(dentry, target);
+		mark_violated_mounts(target, dentry);
+	}
+
 	dentry_lock_for_move(dentry, target);
 
 	write_seqcount_begin(&dentry->d_seq);
diff --git a/fs/internal.h b/fs/internal.h
index 046767f0042e..2f04050ab32f 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -71,6 +71,7 @@ extern int __mnt_want_write_file(struct file *);
 extern void __mnt_drop_write(struct vfsmount *);
 extern void __mnt_drop_write_file(struct file *);
 
+extern void mnt_set_violated(struct dentry *root, struct dentry *moving);
 /*
  * fs_struct.c
  */
diff --git a/fs/mount.h b/fs/mount.h
index a8be3033e022..0697b23fe417 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -41,6 +41,7 @@ struct mount {
 	union {
 		struct rcu_head mnt_rcu;
 		struct llist_node mnt_llist;
+		struct hlist_node mnt_pending_umount;
 	};
 #ifdef CONFIG_SMP
 	struct mnt_pcp __percpu *mnt_pcp;
diff --git a/fs/namespace.c b/fs/namespace.c
index 5b1b666439ac..c38d299ff26f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -245,6 +245,7 @@ static struct mount *alloc_vfsmnt(const char *name)
 		mnt->mnt_writers = 0;
 #endif
 
+		INIT_HLIST_NODE(&mnt->mnt_pending_umount);
 		INIT_HLIST_NODE(&mnt->mnt_hash);
 		INIT_LIST_HEAD(&mnt->mnt_child);
 		INIT_LIST_HEAD(&mnt->mnt_mounts);
@@ -1485,6 +1486,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	while (!list_empty(&tmp_list)) {
 		bool disconnect;
 		p = list_first_entry(&tmp_list, struct mount, mnt_list);
+		hlist_del_init(&p->mnt_pending_umount);
 		list_del_init(&p->mnt_expire);
 		list_del_init(&p->mnt_list);
 		__touch_mnt_namespace(p->mnt_ns);
@@ -1646,6 +1648,65 @@ out_unlock:
 	namespace_unlock();
 }
 
+static HLIST_HEAD(pending_umount);
+static void umount_pending_umounts(struct work_struct *unused)
+{
+	HLIST_HEAD(head);
+
+	namespace_lock();
+	lock_mount_hash();
+
+	hlist_move_list(&pending_umount, &head);
+
+	while (!hlist_empty(&head)) {
+		struct mount *mnt =
+			hlist_entry(head.first, struct mount, mnt_pending_umount);
+		umount_tree(mnt, UMOUNT_CONNECTED);
+	}
+
+	unlock_mount_hash();
+	namespace_unlock();
+}
+
+static DECLARE_WORK(pending_umount_work, umount_pending_umounts);
+
+void mnt_set_violated(struct dentry *root, struct dentry *moving)
+{
+	struct mountroot *mr;
+	struct mount *mnt;
+
+	lock_mount_hash();
+	mr = lookup_mountroot(root);
+	if (!mr)
+		goto out;
+
+	hlist_for_each_entry(mnt, &mr->r_list, mnt_mr_list) {
+		struct mount *child;
+		/* Be wary of this mount */
+		mnt->mnt.mnt_flags |= MNT_VIOLATED;
+
+		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+			/* Ignore children that will continue to be connected */
+			if ((child->mnt_mountpoint != moving) &&
+			    !d_ancestor(moving, child->mnt_mountpoint))
+				continue;
+
+			/* Deal with mounts losing the connection to
+			 * their parents
+			 */
+			if (!(child->mnt.mnt_flags & MNT_UMOUNT)) {
+				child->mnt.mnt_flags |= MNT_UNREACHABLE_PARENT;
+				hlist_add_head(&child->mnt_pending_umount, &pending_umount);
+				schedule_work(&pending_umount_work);
+			} else {
+				umount_mnt(child);
+			}
+		}
+	}
+out:
+	unlock_mount_hash();
+}
+
 /* 
  * Is the caller allowed to modify his namespace?
  */
-- 
2.2.1


* [PATCH review  19/19] vfs: Do not allow escaping from bind mounts.
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
                         ` (18 preceding siblings ...)
  2015-04-03  1:56       ` [PATCH review 18/19] vfs: Handle mounts whose parents are unreachable from their mountpoint Eric W. Biederman
@ 2015-04-03  1:56       ` Eric W. Biederman
  2015-04-03  6:20         ` Al Viro
       [not found]         ` <1428026183-14879-19-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2015-04-08 23:31       ` [PATCH review 0/4] Loopback mount escape fixes Eric W. Biederman
  2015-04-16 23:40       ` [GIT PULL] Usernamespace related locked mount fixes Eric W. Biederman
  21 siblings, 2 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03  1:56 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Andrey Vagin, Al Viro, Jann Horn,
	Willy Tarreau, Omar Sandoval

Rename can move a file or directory outside of a bind mount.  This has
allowed programs with paths below the renamed directory to traverse up
their directory tree to the real root of the filesystem instead of
just the root of their bind mount.

In the presence of such renames, limit applications to what the bind
mount intended to reveal by marking mounts that have had dentries
renamed out of them with MNT_VIOLATED, marking mounts that can no
longer walk up to their parent mounts with MNT_UNREACHABLE_PARENT, and
then lazily unmounting such mounts.

All moves go through __d_move, so __d_move has been modified to mark
all mounts that have had dentries moved outside of them.

Once the root dentry of a violated mount has been found, a new function
mnt_set_violated is called to:

- mark all mounts that have that dentry as their root as violated

- mark all children of violated mounts that can no longer reach
  their parents

- schedule for unmounting all children of violated mounts that can
  no longer reach their parents

The children that can't reach their parents are only scheduled for
unmounting because namespace_sem is a sleeping lock and cannot be taken
inside of __d_move, which cannot sleep.

This change adds a field mnt_pending_umount to struct mount that is
used to thread the list of pending unmounts through struct mount.  As
there are small but unavoidable races between scheduling an unmount and
the possibility of userspace calling umount_tree, umount_tree has been
modified to remove all mounts that are being unmounted from the
pending_umount list.

This closes a hole where it was possible in some circumstances
to follow .. past the root of a bind mount.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
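The deferred unmount described above boils down to "link the mount
onto a list where sleeping is forbidden, drain the list later from a
context that may sleep, and let an earlier direct unmount unlink the
entry".  The userspace sketch below shows just that pattern.  All
toy_* names are hypothetical, a plain singly linked list stands in
for the hlist, and no locking or workqueue is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mount {
	struct toy_mount *next_pending;	/* models mnt_pending_umount */
	bool queued;
	bool unmounted;
};

static struct toy_mount *toy_pending;	/* models the pending_umount list */

/* Called where sleeping is not allowed: only link the mount in. */
static void toy_schedule_umount(struct toy_mount *m)
{
	assert(m != NULL);
	if (m->queued || m->unmounted)
		return;
	m->queued = true;
	m->next_pending = toy_pending;
	toy_pending = m;
}

/* Unmount right away; unlink the mount if it was still queued. */
static void toy_umount(struct toy_mount *m)
{
	struct toy_mount **pp;

	assert(m != NULL);
	for (pp = &toy_pending; *pp; pp = &(*pp)->next_pending) {
		if (*pp == m) {
			*pp = m->next_pending;
			break;
		}
	}
	m->next_pending = NULL;
	m->queued = false;
	m->unmounted = true;
}

/* Worker context: take the whole list, then unmount each entry. */
static int toy_drain_pending(void)
{
	struct toy_mount *head = toy_pending;
	int done = 0;

	toy_pending = NULL;
	while (head) {
		struct toy_mount *m = head;

		head = m->next_pending;
		m->next_pending = NULL;
		m->queued = false;
		m->unmounted = true;
		done++;
	}
	return done;
}

/* Queue two mounts, unmount one directly, then let the worker run. */
static int toy_demo_drain(void)
{
	struct toy_mount a = { 0 }, b = { 0 };

	toy_pending = NULL;
	toy_schedule_umount(&a);
	toy_schedule_umount(&b);
	toy_schedule_umount(&a);	/* already queued, ignored */
	toy_umount(&a);			/* userspace got there first */
	return toy_drain_pending();	/* only b is left to unmount */
}
```

The unlink in toy_umount is the part that corresponds to the
umount_tree change: without it the worker would later trip over a
mount that has already gone away.
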
 fs/dcache.c    | 29 ++++++++++++++++++++++++++++
 fs/internal.h  |  1 +
 fs/mount.h     |  1 +
 fs/namespace.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 92 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index cae4a42c1846..e04e2a23ad00 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2535,6 +2535,26 @@ static void dentry_unlock_for_move(struct dentry *dentry, struct dentry *target)
 	spin_unlock(&dentry->d_lock);
 }
 
+static void mark_violated_mounts(struct dentry *dentry, struct dentry *target)
+{
+	/* Mark all mountroots that are ancestors of dentry
+	 * that do not share a common ancestor with target
+	 *
+	 * This function assumes both dentries are part of a DAG.
+	 */
+	struct dentry *p;
+
+	for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
+		if (!d_mountroot(p))
+			continue;
+
+		if (d_ancestor(p, target))
+			break;
+
+		mnt_set_violated(p, dentry);
+	}
+}
+
 /*
  * When switching names, the actual string doesn't strictly have to
  * be preserved in the target - because we're dropping the target
@@ -2569,6 +2589,15 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	BUG_ON(d_ancestor(dentry, target));
 	BUG_ON(d_ancestor(target, dentry));
 
+	/* If we are not splicing a dentry, mark mounts which may have
+	 * paths that are no longer able to follow d_parent up to
+	 * mnt_root after this move.
+	 */
+	if (!IS_ROOT(dentry) && !IS_ROOT(target)) {
+		mark_violated_mounts(dentry, target);
+		mark_violated_mounts(target, dentry);
+	}
+
 	dentry_lock_for_move(dentry, target);
 
 	write_seqcount_begin(&dentry->d_seq);
diff --git a/fs/internal.h b/fs/internal.h
index 046767f0042e..2f04050ab32f 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -71,6 +71,7 @@ extern int __mnt_want_write_file(struct file *);
 extern void __mnt_drop_write(struct vfsmount *);
 extern void __mnt_drop_write_file(struct file *);
 
+extern void mnt_set_violated(struct dentry *root, struct dentry *moving);
 /*
  * fs_struct.c
  */
diff --git a/fs/mount.h b/fs/mount.h
index a8be3033e022..0697b23fe417 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -41,6 +41,7 @@ struct mount {
 	union {
 		struct rcu_head mnt_rcu;
 		struct llist_node mnt_llist;
+		struct hlist_node mnt_pending_umount;
 	};
 #ifdef CONFIG_SMP
 	struct mnt_pcp __percpu *mnt_pcp;
diff --git a/fs/namespace.c b/fs/namespace.c
index 5b1b666439ac..c38d299ff26f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -245,6 +245,7 @@ static struct mount *alloc_vfsmnt(const char *name)
 		mnt->mnt_writers = 0;
 #endif
 
+		INIT_HLIST_NODE(&mnt->mnt_pending_umount);
 		INIT_HLIST_NODE(&mnt->mnt_hash);
 		INIT_LIST_HEAD(&mnt->mnt_child);
 		INIT_LIST_HEAD(&mnt->mnt_mounts);
@@ -1485,6 +1486,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	while (!list_empty(&tmp_list)) {
 		bool disconnect;
 		p = list_first_entry(&tmp_list, struct mount, mnt_list);
+		hlist_del_init(&mnt->mnt_pending_umount);
 		list_del_init(&p->mnt_expire);
 		list_del_init(&p->mnt_list);
 		__touch_mnt_namespace(p->mnt_ns);
@@ -1646,6 +1648,65 @@ out_unlock:
 	namespace_unlock();
 }
 
+static HLIST_HEAD(pending_umount);
+static void umount_pending_umounts(struct work_struct *unused)
+{
+	HLIST_HEAD(head);
+
+	namespace_lock();
+	lock_mount_hash();
+
+	hlist_move_list(&pending_umount, &head);
+
+	while (!hlist_empty(&head)) {
+		struct mount *mnt =
+			hlist_entry(head.first, struct mount, mnt_pending_umount);
+		umount_tree(mnt, UMOUNT_CONNECTED);
+	}
+
+	unlock_mount_hash();
+	namespace_unlock();
+}
+
+static DECLARE_WORK(pending_umount_work, umount_pending_umounts);
+
+void mnt_set_violated(struct dentry *root, struct dentry *moving)
+{
+	struct mountroot *mr;
+	struct mount *mnt;
+
+	lock_mount_hash();
+	mr = lookup_mountroot(root);
+	if (!mr)
+		goto out;
+
+	hlist_for_each_entry(mnt, &mr->r_list, mnt_mr_list) {
+		struct mount *child;
+		/* Be wary of this mount */
+		mnt->mnt.mnt_flags |= MNT_VIOLATED;
+
+		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+			/* Ignore children that will continue to be connected */
+			if ((child->mnt_mountpoint != moving) &&
+			    !d_ancestor(moving, child->mnt_mountpoint))
+				continue;
+
+			/* Deal with mounts loosing the connection to
+			 * their parents
+			 */
+			if (!(child->mnt.mnt_flags & MNT_UMOUNT)) {
+				child->mnt.mnt_flags |= MNT_UNREACHABLE_PARENT;
+				hlist_add_head(&child->mnt_pending_umount, &pending_umount);
+				schedule_work(&pending_umount_work);
+			} else {
+				umount_mnt(child);
+			}
+		}
+	}
+out:
+	unlock_mount_hash();
+}
+
 /* 
  * Is the caller allowed to modify his namespace?
  */
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* Re: [PATCH review  16/19] mnt: Track which mounts use a dentry as root.
       [not found]         ` <1428026183-14879-16-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2015-04-03  5:54           ` Al Viro
       [not found]             ` <20150403055449.GE889-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-04-07 20:22             ` Eric W. Biederman
  0 siblings, 2 replies; 240+ messages in thread
From: Al Viro @ 2015-04-03  5:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn,
	Willy Tarreau

On Thu, Apr 02, 2015 at 08:56:20PM -0500, Eric W. Biederman wrote:

One general note - I'd probably put a pointer to that sucker into struct
mount.  For one thing, root-preserving clone_mnt() is a fairly common
case.  For another, searching for that thing in mnt_put_root() looks
wrong.  Matter of taste, but...

Another thing is that IMO it's better to preallocate that thing in
vfs_kern_mount() and free if it turns out to be unused.  Simpler cleanup
path that way...


> -	mnt->mnt.mnt_root = root;
> +	err = mnt_set_root(mnt, root);
> +	if (err) {
> +		dput(mnt->mnt.mnt_root);

	Unless I'm misreading your code, mnt_set_root() does *not* set it
on failure, so what's going on here?

>  #define DCACHE_FALLTHRU			0x01000000 /* Fall through to lower layer */
						^^^^^^^^^^
> +#define DCACHE_MOUNTROOT		0x01000000 /* is root of a vfsmount */
					^^^^^^^^^^

	Er...

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review  19/19] vfs: Do not allow escaping from bind mounts.
       [not found]         ` <1428026183-14879-19-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2015-04-03  6:20           ` Al Viro
  0 siblings, 0 replies; 240+ messages in thread
From: Al Viro @ 2015-04-03  6:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn,
	Willy Tarreau

On Thu, Apr 02, 2015 at 08:56:23PM -0500, Eric W. Biederman wrote:
> +static void mark_violated_mounts(struct dentry *dentry, struct dentry *target)
> +{
> +	/* Mark all mountroots that are ancestors of dentry
> +	 * that do not share a common ancestor with target
> +	 *
> +	 * This function assumes both dentries are part of a DAG.
> +	 */
> +	struct dentry *p;
> +
> +	for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
> +		if (!d_mountroot(p))
> +			continue;
> +
> +		if (d_ancestor(p, target))
> +			break;

Egads...  You do realize that you'll keep walking the path from target to
root again and again?  It's a tree and we have already walked up to the root.
So we know the depths and tree topology can't change due to the rename_lock
being held by caller.
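
To make the point about repeated walks concrete, here is a userspace
sketch, not kernel code.  The toy_node type and the toy_* helpers are
invented for this note; they assume a plain tree whose root is its
own parent and which does not change during the walk.  It finds the
deepest node shared by both paths with one walk up each side, rather
than re-walking one path for every candidate on the other:

```c
#include <assert.h>
#include <stddef.h>

/* Toy tree node; the root points at itself, much like an IS_ROOT() dentry. */
struct toy_node {
	struct toy_node *parent;
};

/* Number of parent steps from n up to its root. */
static int toy_depth(const struct toy_node *n)
{
	int d = 0;

	while (n->parent != n) {
		n = n->parent;
		d++;
	}
	return d;
}

/*
 * Return the deepest node lying on both a's and b's path to the root
 * (possibly a or b itself), or NULL if they live in different trees.
 */
static struct toy_node *toy_common_ancestor(struct toy_node *a,
					    struct toy_node *b)
{
	int da = toy_depth(a), db = toy_depth(b);

	/* Bring the deeper node up to the depth of the shallower one. */
	for (; da > db; da--)
		a = a->parent;
	for (; db > da; db--)
		b = b->parent;

	/* Step both up in lockstep until they meet. */
	while (a != b) {
		if (a->parent == a)
			return NULL;	/* reached two distinct roots */
		a = a->parent;
		b = b->parent;
	}
	return a;
}
```

With that node in hand, the mount roots of interest are the ones
strictly between the moved dentry and the common ancestor, so each
path is walked a bounded number of times.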

> +	if (!IS_ROOT(dentry) && !IS_ROOT(target)) {
> +		mark_violated_mounts(dentry, target);
> +		mark_violated_mounts(target, dentry);
> +	}
> +
>  	dentry_lock_for_move(dentry, target);

> +void mnt_set_violated(struct dentry *root, struct dentry *moving)
> +{
> +	struct mountroot *mr;
> +	struct mount *mnt;
> +
> +	lock_mount_hash();
> +	mr = lookup_mountroot(root);
> +	if (!mr)
> +		goto out;
> +
> +	hlist_for_each_entry(mnt, &mr->r_list, mnt_mr_list) {
> +		struct mount *child;
> +		/* Be wary of this mount */
> +		mnt->mnt.mnt_flags |= MNT_VIOLATED;
> +
> +		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
> +			/* Ignore children that will continue to be connected */
> +			if ((child->mnt_mountpoint != moving) &&
> +			    !d_ancestor(moving, child->mnt_mountpoint))
> +				continue;
> +
> +			/* Deal with mounts loosing the connection to
> +			 * their parents
> +			 */
> +			if (!(child->mnt.mnt_flags & MNT_UMOUNT)) {
> +				child->mnt.mnt_flags |= MNT_UNREACHABLE_PARENT;
> +				hlist_add_head(&child->mnt_pending_umount, &pending_umount);
> +				schedule_work(&pending_umount_work);
> +			} else {
> +				umount_mnt(child);
> +			}
> +		}
> +	}
> +out:
> +	unlock_mount_hash();

And that can have non-trivial security implications - ability to expose
something that has been overmounted is potentially very nasty.  I might
be missing something in the rest of the series (I'm half-asleep right now,
so that's certainly possible), but that doesn't look obviously safe.
Note that it's very different from the situation with umount-on-invalidation -
there the thing we are uncovering is dead.  In this one it might be very
much alive and deliberately covered...

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review  19/19] vfs: Do not allow escaping from bind mounts.
  2015-04-03  1:56       ` [PATCH review 19/19] vfs: Do not allow escaping from bind mounts Eric W. Biederman
@ 2015-04-03  6:20         ` Al Viro
       [not found]           ` <20150403062035.GF889-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
       [not found]         ` <1428026183-14879-19-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 240+ messages in thread
From: Al Viro @ 2015-04-03  6:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-fsdevel, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Andrey Vagin, Jann Horn,
	Willy Tarreau, Omar Sandoval

On Thu, Apr 02, 2015 at 08:56:23PM -0500, Eric W. Biederman wrote:
> +static void mark_violated_mounts(struct dentry *dentry, struct dentry *target)
> +{
> +	/* Mark all mountroots that are ancestors of dentry
> +	 * that do not share a common ancestor with target
> +	 *
> +	 * This function assumes both dentries are part of a DAG.
> +	 */
> +	struct dentry *p;
> +
> +	for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
> +		if (!d_mountroot(p))
> +			continue;
> +
> +		if (d_ancestor(p, target))
> +			break;

Egads...  You do realize that you'll keep walking the path from target to
root again and again?  It's a tree and we have already walked up to the root.
So we know the depths and tree topology can't change due to the rename_lock
being held by caller.

> +	if (!IS_ROOT(dentry) && !IS_ROOT(target)) {
> +		mark_violated_mounts(dentry, target);
> +		mark_violated_mounts(target, dentry);
> +	}
> +
>  	dentry_lock_for_move(dentry, target);

> +void mnt_set_violated(struct dentry *root, struct dentry *moving)
> +{
> +	struct mountroot *mr;
> +	struct mount *mnt;
> +
> +	lock_mount_hash();
> +	mr = lookup_mountroot(root);
> +	if (!mr)
> +		goto out;
> +
> +	hlist_for_each_entry(mnt, &mr->r_list, mnt_mr_list) {
> +		struct mount *child;
> +		/* Be wary of this mount */
> +		mnt->mnt.mnt_flags |= MNT_VIOLATED;
> +
> +		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
> +			/* Ignore children that will continue to be connected */
> +			if ((child->mnt_mountpoint != moving) &&
> +			    !d_ancestor(moving, child->mnt_mountpoint))
> +				continue;
> +
> +			/* Deal with mounts loosing the connection to
> +			 * their parents
> +			 */
> +			if (!(child->mnt.mnt_flags & MNT_UMOUNT)) {
> +				child->mnt.mnt_flags |= MNT_UNREACHABLE_PARENT;
> +				hlist_add_head(&child->mnt_pending_umount, &pending_umount);
> +				schedule_work(&pending_umount_work);
> +			} else {
> +				umount_mnt(child);
> +			}
> +		}
> +	}
> +out:
> +	unlock_mount_hash();

And that can have non-trivial security implications - ability to expose
something that has been overmounted is potentially very nasty.  I might
be missing something in the rest of the series (I'm half-asleep right now,
so that's certainly possible), but that doesn't look obviously safe.
Note that it's very different from the situation with umount-on-invalidation -
there the thing we are uncovering is dead.  In this one it might be very
much alive and deliberately covered...

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 09/19] mnt: Fail collect_mounts when applied to unmounted mounts
       [not found]         ` <1428026183-14879-9-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2015-04-03  8:55           ` Lukasz Pawelczyk
  0 siblings, 0 replies; 240+ messages in thread
From: Lukasz Pawelczyk @ 2015-04-03  8:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Jann Horn, Willy Tarreau

On Thu, 2015-04-02 at 20:56 -0500, Eric W. Biederman wrote:
> The only users of collect_mounts are in audit_tree.c
> 
> In audit_tree_trees and audit_add_tree rule the path passed into

I think you meant audit_trim_trees.

Also you missed a _ in audit_add_tree_rule.


> collect_mounts is generated from kern_path passed an audit_tree
> pathname which is guaranteed to be an absolute path.   In those cases
> collect_mounts is obviously intended to work on mounted paths and
> if a race results in paths that are unmounted when collect_mounts
> it is reasonable to fail early.
> 
> The paths passed into audit_tag_tree don't have the absolute path
> check.  But are used to play with fsnotify and otherwise interact with
> the audit_trees, so again operating only on mounted paths appears
> reasonable.
> 
> Avoid having to worry about what happens when we try and audit
> unmounted filesystems by restricting collect_mounts to mounts
> that appear in the mount tree.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  fs/namespace.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 2b12b7a9455d..acc5583764dc 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1669,8 +1669,11 @@ struct vfsmount *collect_mounts(struct path *path)
>  {
>  	struct mount *tree;
>  	namespace_lock();
> -	tree = copy_tree(real_mount(path->mnt), path->dentry,
> -			 CL_COPY_ALL | CL_PRIVATE);
> +	if (!check_mnt(real_mount(path->mnt)))
> +		tree = ERR_PTR(-EINVAL);
> +	else
> +		tree = copy_tree(real_mount(path->mnt), path->dentry,
> +				 CL_COPY_ALL | CL_PRIVATE);
>  	namespace_unlock();
>  	if (IS_ERR(tree))
>  		return ERR_CAST(tree);

-- 
Lukasz Pawelczyk
Samsung R&D Institute Poland
Samsung Electronics

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 09/19] mnt: Fail collect_mounts when applied to unmounted mounts
  2015-04-03  1:56       ` [PATCH review 09/19] mnt: Fail collect_mounts when applied to unmounted mounts Eric W. Biederman
@ 2015-04-03  8:55         ` Lukasz Pawelczyk
       [not found]           ` <1428051353.1924.2.camel-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>
       [not found]         ` <1428026183-14879-9-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 240+ messages in thread
From: Lukasz Pawelczyk @ 2015-04-03  8:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, Andrey Vagin, Richard Weinberger,
	Andy Lutomirski, Al Viro, linux-fsdevel, Jann Horn,
	Omar Sandoval, Willy Tarreau

On Thu, 2015-04-02 at 20:56 -0500, Eric W. Biederman wrote:
> The only users of collect_mounts are in audit_tree.c
> 
> In audit_tree_trees and audit_add_tree rule the path passed into

I think you meant audit_trim_trees.

Also you missed a _ in audit_add_tree_rule.


> collect_mounts is generated from kern_path passed an audit_tree
> pathname which is guaranteed to be an absolute path.   In those cases
> collect_mounts is obviously intended to work on mounted paths and
> if a race results in paths that are unmounted when collect_mounts
> it is reasonable to fail early.
> 
> The paths passed into audit_tag_tree don't have the absolute path
> check.  But are used to play with fsnotify and otherwise interact with
> the audit_trees, so again operating only on mounted paths appears
> reasonable.
> 
> Avoid having to worry about what happens when we try and audit
> unmounted filesystems by restricting collect_mounts to mounts
> that appear in the mount tree.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/namespace.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 2b12b7a9455d..acc5583764dc 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1669,8 +1669,11 @@ struct vfsmount *collect_mounts(struct path *path)
>  {
>  	struct mount *tree;
>  	namespace_lock();
> -	tree = copy_tree(real_mount(path->mnt), path->dentry,
> -			 CL_COPY_ALL | CL_PRIVATE);
> +	if (!check_mnt(real_mount(path->mnt)))
> +		tree = ERR_PTR(-EINVAL);
> +	else
> +		tree = copy_tree(real_mount(path->mnt), path->dentry,
> +				 CL_COPY_ALL | CL_PRIVATE);
>  	namespace_unlock();
>  	if (IS_ERR(tree))
>  		return ERR_CAST(tree);

-- 
Lukasz Pawelczyk
Samsung R&D Institute Poland
Samsung Electronics




^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 19/19] vfs: Do not allow escaping from bind mounts.
       [not found]           ` <20150403062035.GF889-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-04-03 10:22             ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03 10:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn,
	Willy Tarreau



On April 3, 2015 1:20:35 AM CDT, Al Viro <viro@ZenIV.linux.org.uk> wrote:
>On Thu, Apr 02, 2015 at 08:56:23PM -0500, Eric W. Biederman wrote:
>> +static void mark_violated_mounts(struct dentry *dentry, struct
>dentry *target)
>> +{
>> +	/* Mark all mountroots that are ancestors of dentry
>> +	 * that do not share a common ancestor with target
>> +	 *
>> +	 * This function assumes both dentries are part of a DAG.
>> +	 */
>> +	struct dentry *p;
>> +
>> +	for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
>> +		if (!d_mountroot(p))
>> +			continue;
>> +
>> +		if (d_ancestor(p, target))
>> +			break;
>
>Egads...  You do realize that you'll keep walking the path from target
>to
>root again and again?  It's a tree and we have already walked up to the
>root.
>So we know the depths and tree topology can't change due to the
>rename_lock
>being held by caller.

In the common case, when a mount is not violated, we will either not encounter a mount root, or the mount root will be a common ancestor and so we will break out of the loop.

If we can make the pathological cases perform better I am all for it.   But I did not see anything immediately obvious.

>> +	if (!IS_ROOT(dentry) && !IS_ROOT(target)) {
>> +		mark_violated_mounts(dentry, target);
>> +		mark_violated_mounts(target, dentry);
>> +	}
>> +
>>  	dentry_lock_for_move(dentry, target);
>
>> +void mnt_set_violated(struct dentry *root, struct dentry *moving)
>> +{
>> +	struct mountroot *mr;
>> +	struct mount *mnt;
>> +
>> +	lock_mount_hash();
>> +	mr = lookup_mountroot(root);
>> +	if (!mr)
>> +		goto out;
>> +
>> +	hlist_for_each_entry(mnt, &mr->r_list, mnt_mr_list) {
>> +		struct mount *child;
>> +		/* Be wary of this mount */
>> +		mnt->mnt.mnt_flags |= MNT_VIOLATED;
>> +
>> +		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
>> +			/* Ignore children that will continue to be connected */
>> +			if ((child->mnt_mountpoint != moving) &&
>> +			    !d_ancestor(moving, child->mnt_mountpoint))
>> +				continue;
>> +
>> +			/* Deal with mounts loosing the connection to
>> +			 * their parents
>> +			 */
>> +			if (!(child->mnt.mnt_flags & MNT_UMOUNT)) {
>> +				child->mnt.mnt_flags |= MNT_UNREACHABLE_PARENT;
>> +				hlist_add_head(&child->mnt_pending_umount, &pending_umount);
>> +				schedule_work(&pending_umount_work);
>> +			} else {
>> +				umount_mnt(child);
>> +			}
>> +		}
>> +	}
>> +out:
>> +	unlock_mount_hash();
>
>And that can have non-trivial security implications - ability to expose
>something that has been overmounted is potentially very nasty.  I might
>be missing something in the rest of the series (I'm half-asleep right
>now,
>so that's certainly possible), but that doesn't look obviously safe.
>Note that it's very different from the situation with
>umount-on-invalidation -
>there the thing we are uncovering is dead.  In this one it might be
>very
>much alive and deliberately covered...

You are correct.    I overlooked that corner case.  It is hard to reach but not impossible.

It is not fundamental to the patchset that the unmount happen.  So just taking the unmount out should work.

I am wondering if we could perform some kind of weird half unmount like I do for submounts of what I am unmounting, but I do not think so.

I will have to take a look when I get back from the long weekend.   At this point I expect the patches are close enough it should not be hard to iterate to a workable fix.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 16/19] mnt: Track which mounts use a dentry as root.
       [not found]             ` <20150403055449.GE889-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-04-03 10:31               ` Eric W. Biederman
  2015-04-07 20:22               ` Eric W. Biederman
  1 sibling, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-03 10:31 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn,
	Willy Tarreau



On April 3, 2015 12:54:50 AM CDT, Al Viro <viro@ZenIV.linux.org.uk> wrote:
>On Thu, Apr 02, 2015 at 08:56:20PM -0500, Eric W. Biederman wrote:
>
>One general note - I'd probably put a pointer to that sucker into
>struct
>mount.  For one thing, root-preserving clone_mnt() is a fairly common
>case.  For another, searching for that thing in mnt_put_root() looks
>wrong.  Matter of taste, but...
>
>Another thing is that IMO it's better to preallocate that thing in
>vfs_kern_mount() and free if it turns out to be unused.  Simpler
>cleanup
>path that way...

Those do sound like reasonable simplifications.

>> -	mnt->mnt.mnt_root = root;
>> +	err = mnt_set_root(mnt, root);
>> +	if (err) {
>> +		dput(mnt->mnt.mnt_root);
>
>	Unless I'm misreading your code, mnt_set_root() does *not* set it
>on failure, so what's going on here?

I will have to look when I get the code in front of me again.

>>  #define DCACHE_FALLTHRU			0x01000000 /* Fall through to lower layer
>*/
>						^^^^^^^^^^
>> +#define DCACHE_MOUNTROOT		0x01000000 /* is root of a vfsmount */
>					^^^^^^^^^^
>
>	Er...

Good point.  I don't think DCACHE_FALLTHRU existed when I wrote the patch and I missed this detail during the rebase.  Sigh.

I will fix it for the next round.   Hopefully DCACHE_FALLTHRU does not have implications for the rest of my changes.
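
A collision like this can be caught at build time.  The following is
only a sketch: the TOY_* values are made up and are not the real
include/linux/dcache.h layout, and it uses C11 _Static_assert where
kernel code would more likely use BUILD_BUG_ON.

```c
#include <assert.h>

/* Hypothetical flag values for illustration only. */
#define TOY_FLAG_FALLTHRU	0x01000000u
#define TOY_FLAG_MOUNTROOT	0x02000000u

/* True when x has exactly one bit set. */
#define TOY_IS_SINGLE_BIT(x)	((x) != 0 && (((x) & ((x) - 1)) == 0))

_Static_assert(TOY_IS_SINGLE_BIT(TOY_FLAG_FALLTHRU), "flag must be one bit");
_Static_assert(TOY_IS_SINGLE_BIT(TOY_FLAG_MOUNTROOT), "flag must be one bit");
_Static_assert((TOY_FLAG_FALLTHRU & TOY_FLAG_MOUNTROOT) == 0,
	       "flags must not share a bit");

/* Runtime form of the same check: nonzero when two masks overlap. */
static int toy_flags_collide(unsigned int a, unsigned int b)
{
	return (a & b) != 0;
}
```

Had both defines carried 0x01000000, as in the hunk quoted above, the
third assertion would have failed the build.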

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 16/19] mnt: Track which mounts use a dentry as root.
       [not found]             ` <20150403055449.GE889-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-04-03 10:31               ` Eric W. Biederman
@ 2015-04-07 20:22               ` Eric W. Biederman
  1 sibling, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-07 20:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn,
	Willy Tarreau

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Thu, Apr 02, 2015 at 08:56:20PM -0500, Eric W. Biederman wrote:
>
> One general note - I'd probably put a pointer to that sucker into struct
> mount.  For one thing, root-preserving clone_mnt() is a fairly common
> case.  For another, searching for that thing in mnt_put_root() looks
> wrong.  Matter of taste, but...

So I just played with the possibilities and adding a field in struct
mount makes the code more complicated not less.  So I have developed a
distaste for the idea of having a struct mountroot pointer in struct mount.

It especially complicates clone_mnt where I always have to have the code
look at the dentry to find the associated struct mountroot.  Reusing a
current struct mountroot and/or preallocating one is just a complicated mess.
The current implementation has a much more localized (and thus
understandable and maintainable) implementation.

> Another thing is that IMO it's better to preallocate that thing in
> vfs_kern_mount() and free if it turns out to be unused.  Simpler cleanup
> path that way...

It is a touch cleaner in vfs_kern_mount (not as many things need to be
freed) and much uglier in clone_mnt.

>> -	mnt->mnt.mnt_root = root;
>> +	err = mnt_set_root(mnt, root);
>> +	if (err) {
>> +		dput(mnt->mnt.mnt_root);
>
> 	Unless I'm misreading your code, mnt_set_root() does *not* set it
> on failure, so what's going on here?

Typo.  That should simply have been dput(root);

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review  16/19] mnt: Track which mounts use a dentry as root.
  2015-04-03  5:54           ` Al Viro
       [not found]             ` <20150403055449.GE889-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-04-07 20:22             ` Eric W. Biederman
  1 sibling, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-07 20:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Linux Containers, linux-fsdevel, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Andrey Vagin, Jann Horn,
	Willy Tarreau, Omar Sandoval

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Thu, Apr 02, 2015 at 08:56:20PM -0500, Eric W. Biederman wrote:
>
> One general note - I'd probably put a pointer to that sucker into struct
> mount.  For one thing, root-preserving clone_mnt() is a fairly common
> case.  For another, searching for that thing in mnt_put_root() looks
> wrong.  Matter of taste, but...

So I just played with the possibilities and adding a field in struct
mount makes the code more complicated not less.  So I have developed a
distaste for the idea of having a struct mountroot pointer in struct mount.

It especially complicates clone_mnt where I always have to have the code
look at the dentry to find the associated struct mountroot.  Reusing a
current struct mountroot and/or preallocating one is just a complicated mess.
The current implementation has a much more localized (and thus
understandable and maintainable) implementation.

> Another thing is that IMO it's better to preallocate that thing in
> vfs_kern_mount() and free if it turns out to be unused.  Simpler cleanup
> path that way...

It is a touch cleaner in vfs_kern_mount (not as many things need to be
freed) and much uglier in clone_mnt.

>> -	mnt->mnt.mnt_root = root;
>> +	err = mnt_set_root(mnt, root);
>> +	if (err) {
>> +		dput(mnt->mnt.mnt_root);
>
> 	Unless I'm misreading your code, mnt_set_root() does *not* set it
> on failure, so what's going on here?

Typo.  That should simply have been dput(root);

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH review 0/4] Loopback mount escape fixes
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (18 preceding siblings ...)
  2015-04-03  1:56         ` [PATCH review 19/19] vfs: Do not allow escaping from bind mounts Eric W. Biederman
@ 2015-04-08 23:31         ` Eric W. Biederman
  2015-04-16 23:40         ` [GIT PULL] Usernamespace related locked mount fixes Eric W. Biederman
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-08 23:31 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Willy Tarreau


After the last round of feedback I sat down and played with my fix
for the fact that, after a strategically placed rename, ".." on a bind
mount can go up past the root of the bind mount.

The code better handles the escaped directory returning into its bind
mount, and now costs roughly a constant factor more in all cases than
the code costs without the fix.

So I think I have found a better tradeoff between fixing this bug and
not slowing down path name lookups in the common case.

These fixes are against v4.0-rc6.

For those who like to see everything in a single tree the code is at:

    git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

Eric W. Biederman (4):
      mnt: Track which mounts use a dentry as root.
      vfs: Test for and handle paths that are unreachable from their mnt_root
      vfs: Handle mounts whose parents are unreachable from their mountpoint
      vfs: Do not allow escaping from bind mounts.

 fs/dcache.c            |  82 +++++++++++++++++++++++++++---
 fs/internal.h          |   2 +
 fs/mount.h             |   6 +++
 fs/namei.c             |  57 +++++++++++++++++----
 fs/namespace.c         | 135 +++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dcache.h |  13 +++++
 include/linux/namei.h  |   2 +
 7 files changed, 277 insertions(+), 20 deletions(-)

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH review 0/4] Loopback mount escape fixes
  2015-04-03  1:53     ` [PATCH review 0/19] Locked mount and loopback mount fixes Eric W. Biederman
                         ` (19 preceding siblings ...)
  2015-04-03  1:56       ` [PATCH review 19/19] vfs: Do not allow escaping from bind mounts Eric W. Biederman
@ 2015-04-08 23:31       ` Eric W. Biederman
  2015-04-08 23:33         ` [PATCH review 3/4] vfs: Handle mounts whose parents are unreachable from their mountpoint Eric W. Biederman
                           ` (3 more replies)
  2015-04-16 23:40       ` [GIT PULL] Usernamespace related locked mount fixes Eric W. Biederman
  21 siblings, 4 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-08 23:31 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Al Viro, Andy Lutomirski, Serge E. Hallyn,
	Richard Weinberger, Andrey Vagin, Jann Horn, Willy Tarreau,
	Omar Sandoval


After the last round of feedback I sat down and played with my fix
for the fact that, after a strategically placed rename, ".." on a bind
mount can go up past the root of the bind mount.

The code better handles the escaped directory returning into its bind
mount, and now costs roughly a constant factor more in all cases than
the code costs without the fix.

So I think I have found a better tradeoff between fixing this bug and
not slowing down path name lookups in the common case.

These fixes are against v4.0-rc6.

For those who like to see everything in a single tree the code is at:

    git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

Eric W. Biederman (4):
      mnt: Track which mounts use a dentry as root.
      vfs: Test for and handle paths that are unreachable from their mnt_root
      vfs: Handle mounts whose parents are unreachable from their mountpoint
      vfs: Do not allow escaping from bind mounts.

 fs/dcache.c            |  82 +++++++++++++++++++++++++++---
 fs/internal.h          |   2 +
 fs/mount.h             |   6 +++
 fs/namei.c             |  57 +++++++++++++++++----
 fs/namespace.c         | 135 +++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dcache.h |  13 +++++
 include/linux/namei.h  |   2 +
 7 files changed, 277 insertions(+), 20 deletions(-)

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH review  1/4] mnt: Track which mounts use a dentry as root.
       [not found]         ` <874moq9oyb.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-04-08 23:32           ` Eric W. Biederman
  2015-04-08 23:32           ` [PATCH review 2/4] vfs: Test for and handle paths that are unreachable from their mnt_root Eric W. Biederman
                             ` (5 subsequent siblings)
  6 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-08 23:32 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Willy Tarreau


Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/mount.h             |   6 +++
 fs/namespace.c         | 118 +++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dcache.h |   7 +++
 3 files changed, 127 insertions(+), 4 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 6a61c2b3e385..0dbad16ab7b2 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -27,6 +27,12 @@ struct mountpoint {
 	int m_count;
 };
 
+struct mountroot {
+	struct hlist_node r_hash;
+	struct dentry *r_dentry;
+	long r_count;
+};
+
 struct mount {
 	struct hlist_node mnt_hash;
 	struct mount *mnt_parent;
diff --git a/fs/namespace.c b/fs/namespace.c
index 1f4f9dac6e5a..0b517f1e898a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -31,6 +31,8 @@ static unsigned int m_hash_mask __read_mostly;
 static unsigned int m_hash_shift __read_mostly;
 static unsigned int mp_hash_mask __read_mostly;
 static unsigned int mp_hash_shift __read_mostly;
+static unsigned int mr_hash_mask __read_mostly;
+static unsigned int mr_hash_shift __read_mostly;
 
 static __initdata unsigned long mhash_entries;
 static int __init set_mhash_entries(char *str)
@@ -52,6 +54,16 @@ static int __init set_mphash_entries(char *str)
 }
 __setup("mphash_entries=", set_mphash_entries);
 
+static __initdata unsigned long mrhash_entries;
+static int __init set_mrhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	mrhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("mrhash_entries=", set_mrhash_entries);
+
 static u64 event;
 static DEFINE_IDA(mnt_id_ida);
 static DEFINE_IDA(mnt_group_ida);
@@ -61,6 +73,7 @@ static int mnt_group_start = 1;
 
 static struct hlist_head *mount_hashtable __read_mostly;
 static struct hlist_head *mountpoint_hashtable __read_mostly;
+static struct hlist_head *mountroot_hashtable __read_mostly;
 static struct kmem_cache *mnt_cache __read_mostly;
 static DECLARE_RWSEM(namespace_sem);
 
@@ -93,6 +106,13 @@ static inline struct hlist_head *mp_hash(struct dentry *dentry)
 	return &mountpoint_hashtable[tmp & mp_hash_mask];
 }
 
+static inline struct hlist_head *mr_hash(struct dentry *dentry)
+{
+	unsigned long tmp = ((unsigned long)dentry / L1_CACHE_BYTES);
+	tmp = tmp + (tmp >> mr_hash_shift);
+	return &mountroot_hashtable[tmp & mr_hash_mask];
+}
+
 /*
  * allocation is serialized by namespace_sem, but we need the spinlock to
  * serialize with freeing.
@@ -768,6 +788,76 @@ static void put_mountpoint(struct mountpoint *mp)
 	}
 }
 
+static struct mountroot *lookup_mountroot(struct dentry *dentry)
+{
+	struct hlist_head *chain = mr_hash(dentry);
+	struct mountroot *mr;
+
+	hlist_for_each_entry(mr, chain, r_hash) {
+		if (mr->r_dentry == dentry)
+			return mr;
+	}
+	return NULL;
+}
+
+static int mnt_set_root(struct mount *mnt, struct dentry *root)
+{
+	struct mountroot *mr = NULL;
+
+	lock_mount_hash();
+	if (d_mountroot(root))
+		mr = lookup_mountroot(root);
+	if (!mr) {
+		struct mountroot *new;
+		unlock_mount_hash();
+
+		new = kmalloc(sizeof(struct mountroot), GFP_KERNEL);
+		if (!new)
+			return -ENOMEM;
+
+		lock_mount_hash();
+		mr = lookup_mountroot(root);
+		if (mr) {
+			kfree(new);
+		} else {
+			struct hlist_head *chain = mr_hash(root);
+
+			mr = new;
+			mr->r_dentry = root;
+			mr->r_count = 0;
+			hlist_add_head(&mr->r_hash, chain);
+
+			spin_lock(&root->d_lock);
+			root->d_flags |= DCACHE_MOUNTROOT;
+			spin_unlock(&root->d_lock);
+		}
+	}
+	mnt->mnt.mnt_root = root;
+	mr->r_count++;
+	unlock_mount_hash();
+
+	return 0;
+}
+
+static void mnt_put_root(struct mount *mnt)
+{
+	struct dentry *root = mnt->mnt.mnt_root;
+	struct mountroot *mr;
+
+	lock_mount_hash();
+	mr = lookup_mountroot(root);
+	BUG_ON(!mr);
+	if (!--mr->r_count) {
+		hlist_del(&mr->r_hash);
+		spin_lock(&root->d_lock);
+		root->d_flags &= ~DCACHE_MOUNTROOT;
+		spin_unlock(&root->d_lock);
+		kfree(mr);
+	}
+	unlock_mount_hash();
+	dput(root);
+}
+
 static inline int check_mnt(struct mount *mnt)
 {
 	return mnt->mnt_ns == current->nsproxy->mnt_ns;
@@ -923,6 +1013,7 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 {
 	struct mount *mnt;
 	struct dentry *root;
+	int err;
 
 	if (!type)
 		return ERR_PTR(-ENODEV);
@@ -941,8 +1032,16 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 		return ERR_CAST(root);
 	}
 
-	mnt->mnt.mnt_root = root;
 	mnt->mnt.mnt_sb = root->d_sb;
+	err = mnt_set_root(mnt, root);
+	if (err) {
+		dput(root);
+		deactivate_super(mnt->mnt.mnt_sb);
+		mnt_free_id(mnt);
+		free_vfsmnt(mnt);
+		return ERR_PTR(err);
+	}
+
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
 	lock_mount_hash();
@@ -974,6 +1073,10 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 			goto out_free;
 	}
 
+	err = mnt_set_root(mnt, root);
+	if (err)
+		goto out_free;
+
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED);
 	/* Don't allow unprivileged users to change mount flags */
 	if (flag & CL_UNPRIVILEGED) {
@@ -999,7 +1102,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 
 	atomic_inc(&sb->s_active);
 	mnt->mnt.mnt_sb = sb;
-	mnt->mnt.mnt_root = dget(root);
+	dget(root);
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
 	lock_mount_hash();
@@ -1052,7 +1155,7 @@ static void cleanup_mnt(struct mount *mnt)
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	fsnotify_vfsmount_delete(&mnt->mnt);
-	dput(mnt->mnt.mnt_root);
+	mnt_put_root(mnt);
 	deactivate_super(mnt->mnt.mnt_sb);
 	mnt_free_id(mnt);
 	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
@@ -3079,14 +3182,21 @@ void __init mnt_init(void)
 				mphash_entries, 19,
 				0,
 				&mp_hash_shift, &mp_hash_mask, 0, 0);
+	mountroot_hashtable = alloc_large_system_hash("Mountroot-cache",
+				sizeof(struct hlist_head),
+				mrhash_entries, 19,
+				0,
+				&mr_hash_shift, &mr_hash_mask, 0, 0);
 
-	if (!mount_hashtable || !mountpoint_hashtable)
+	if (!mount_hashtable || !mountpoint_hashtable || !mountroot_hashtable)
 		panic("Failed to allocate mount hash table\n");
 
 	for (u = 0; u <= m_hash_mask; u++)
 		INIT_HLIST_HEAD(&mount_hashtable[u]);
 	for (u = 0; u <= mp_hash_mask; u++)
 		INIT_HLIST_HEAD(&mountpoint_hashtable[u]);
+	for (u = 0; u <= mr_hash_mask; u++)
+		INIT_HLIST_HEAD(&mountroot_hashtable[u]);
 
 	kernfs_init();
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index d8358799c594..01cd930bf9d8 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -226,6 +226,8 @@ struct dentry_operations {
 #define DCACHE_MAY_FREE			0x00800000
 #define DCACHE_FALLTHRU			0x01000000 /* Fall through to lower layer */
 
+#define DCACHE_MOUNTROOT		0x02000000 /* Root of a vfsmount */
+
 extern seqlock_t rename_lock;
 
 /*
@@ -401,6 +403,11 @@ static inline bool d_mountpoint(const struct dentry *dentry)
 	return dentry->d_flags & DCACHE_MOUNTED;
 }
 
+static inline bool d_mountroot(const struct dentry *dentry)
+{
+	return dentry->d_flags & DCACHE_MOUNTROOT;
+}
+
 /*
  * Directory cache entry type accessor functions.
  */
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 2/4] vfs: Test for and handle paths that are unreachable from their mnt_root
       [not found]         ` <874moq9oyb.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-04-08 23:32           ` [PATCH review 1/4] mnt: Track which mounts use a dentry as root Eric W. Biederman
@ 2015-04-08 23:32           ` Eric W. Biederman
       [not found]             ` <87sica8ac5.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-04-08 23:33           ` [PATCH review 3/4] vfs: Handle mounts whose parents are unreachable from their mountpoint Eric W. Biederman
                             ` (4 subsequent siblings)
  6 siblings, 1 reply; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-08 23:32 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Willy Tarreau


- Add a dentry flag DCACHE_MOUNT_VIOLATED to mark loopback mounts that
  have had a dentry moved into a directory that does not descend from
  the mount root dentry.

- In mnt_put_root clear DCACHE_MOUNT_VIOLATED.

- Add a function path_connected to verify a path.dentry is reachable from
  path.mnt.mnt_root.  AKA rename did not do something nasty to the bind mount.

- Disable ".." when a path is not connected during lookup.
  (Maybe we want to stop ".." at this path instead?)

  Following .. is not disabled after a transition to /
  and is never disabled when / is the directory we start
  with, because we already limit .. to go no higher than /.

- In prepend_path and its callers in the d_path family,
  for a path that is not connected, don't attempt to find
  parent directories.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
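For readers who want to poke at the idea outside the kernel, here is a
rough userspace sketch of the connectivity test described above.  It is
not the kernel code: the toy_* types and functions are hypothetical
stand-ins, with a bool playing the role of DCACHE_MOUNT_VIOLATED and a
parent-pointer walk playing the role of is_subdir.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; these are not kernel structures. */
struct toy_dentry {
	struct toy_dentry *parent;	/* a filesystem root points at itself */
	bool mount_violated;		/* models DCACHE_MOUNT_VIOLATED */
};

struct toy_path {
	struct toy_dentry *mnt_root;	/* models path->mnt->mnt_root */
	struct toy_dentry *dentry;	/* models path->dentry */
};

/* Walk parent pointers from d; true if d is ancestor or lies below it. */
static bool toy_is_subdir(const struct toy_dentry *d,
			  const struct toy_dentry *ancestor)
{
	for (;;) {
		if (d == ancestor)
			return true;
		if (d->parent == d)	/* hit the root without a match */
			return false;
		d = d->parent;
	}
}

/* Cheap flag test first; only walk the tree when the root is marked. */
static bool toy_path_connected(const struct toy_path *path)
{
	assert(path && path->mnt_root && path->dentry);

	if (!path->mnt_root->mount_violated)
		return true;
	return toy_is_subdir(path->dentry, path->mnt_root);
}
```

The point of the flag is that the common case never pays for the walk:
as long as nothing has been renamed out of the mount, the check is a
single flag test.  The sketch also shows that the result is only as good
as the marking, since an unmarked root is trusted without walking.
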
 fs/dcache.c            |  3 +++
 fs/internal.h          |  1 +
 fs/namei.c             | 30 ++++++++++++++++++++++++++++++
 fs/namespace.c         |  2 +-
 include/linux/dcache.h |  6 ++++++
 include/linux/namei.h  |  2 ++
 6 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c71e3732e53b..e07eb03f6de6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2871,6 +2871,9 @@ static int prepend_path(const struct path *path,
 	char *bptr;
 	int blen;
 
+	if (!path_connected(path))
+		root = path;
+
 	rcu_read_lock();
 restart_mnt:
 	read_seqbegin_or_lock(&mount_lock, &m_seq);
diff --git a/fs/internal.h b/fs/internal.h
index 01dce1d1476b..046767f0042e 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -51,6 +51,7 @@ extern void __init chrdev_init(void);
 extern int user_path_mountpoint_at(int, const char __user *, unsigned int, struct path *);
 extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
 			   const char *, unsigned int, struct path *);
+extern bool path_connected(const struct path *);
 
 /*
  * namespace.c
diff --git a/fs/namei.c b/fs/namei.c
index c83145af4bfc..83cdcdf36eed 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -493,6 +493,22 @@ void path_put(const struct path *path)
 }
 EXPORT_SYMBOL(path_put);
 
+/**
+ * path_connected - Verify that a path->dentry is below path->mnt->mnt.mnt_root
+ * @path: path to verify
+ *
+ * Rename can sometimes move a file or directory outside of a bind mount;
+ * don't honor paths where this has happened.
+ */
+bool path_connected(const struct path *path)
+{
+	struct dentry *root = path->mnt->mnt_root;
+	if (!d_mount_violated(root))
+		return true;
+
+	return is_subdir(path->dentry, root);
+}
+
 struct nameidata {
 	struct path	path;
 	struct qstr	last;
@@ -712,6 +728,7 @@ void nd_jump_link(struct nameidata *nd, struct path *path)
 	nd->path = *path;
 	nd->inode = nd->path.dentry->d_inode;
 	nd->flags |= LOOKUP_JUMPED;
+	nd->flags &= ~LOOKUP_NODOTDOT;
 }
 
 void nd_set_link(struct nameidata *nd, char *path)
@@ -897,6 +914,7 @@ follow_link(struct path *link, struct nameidata *nd, void **p)
 			nd->path = nd->root;
 			path_get(&nd->root);
 			nd->flags |= LOOKUP_JUMPED;
+			nd->flags &= ~LOOKUP_NODOTDOT;
 		}
 		nd->inode = nd->path.dentry->d_inode;
 		error = link_path_walk(s, nd);
@@ -1161,6 +1179,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		path->mnt = &mounted->mnt;
 		path->dentry = mounted->mnt.mnt_root;
 		nd->flags |= LOOKUP_JUMPED;
+		nd->flags &= ~LOOKUP_NODOTDOT;
 		nd->seq = read_seqcount_begin(&path->dentry->d_seq);
 		/*
 		 * Update the inode too. We don't need to re-check the
@@ -1176,6 +1195,10 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 static int follow_dotdot_rcu(struct nameidata *nd)
 {
 	struct inode *inode = nd->inode;
+
+	if (nd->flags & LOOKUP_NODOTDOT)
+		return 0;
+
 	if (!nd->root.mnt)
 		set_root_rcu(nd);
 
@@ -1293,6 +1316,9 @@ static void follow_mount(struct path *path)
 
 static void follow_dotdot(struct nameidata *nd)
 {
+	if (nd->flags & LOOKUP_NODOTDOT)
+		return;
+
 	if (!nd->root.mnt)
 		set_root(nd);
 
@@ -1909,6 +1935,8 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 		} else {
 			get_fs_pwd(current->fs, &nd->path);
 		}
+		if (unlikely(!path_connected(&nd->path)))
+			nd->flags |= LOOKUP_NODOTDOT;
 	} else {
 		/* Caller must check execute permissions on the starting path component */
 		struct fd f = fdget_raw(dfd);
@@ -1936,6 +1964,8 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 			path_get(&nd->path);
 			fdput(f);
 		}
+		if (unlikely(!path_connected(&nd->path)))
+			nd->flags |= LOOKUP_NODOTDOT;
 	}
 
 	nd->inode = nd->path.dentry->d_inode;
diff --git a/fs/namespace.c b/fs/namespace.c
index 0b517f1e898a..75abc9fcaafa 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -850,7 +850,7 @@ static void mnt_put_root(struct mount *mnt)
 	if (!--mr->r_count) {
 		hlist_del(&mr->r_hash);
 		spin_lock(&root->d_lock);
-		root->d_flags &= ~DCACHE_MOUNTROOT;
+		root->d_flags &= ~(DCACHE_MOUNTROOT | DCACHE_MOUNT_VIOLATED);
 		spin_unlock(&root->d_lock);
 		kfree(mr);
 	}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 01cd930bf9d8..18e36d974168 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -227,6 +227,7 @@ struct dentry_operations {
 #define DCACHE_FALLTHRU			0x01000000 /* Fall through to lower layer */
 
 #define DCACHE_MOUNTROOT		0x02000000 /* Root of a vfsmount */
+#define DCACHE_MOUNT_VIOLATED		0x04000000 /* Vfsmount with dentries moved out */
 
 extern seqlock_t rename_lock;
 
@@ -408,6 +409,11 @@ static inline bool d_mountroot(const struct dentry *dentry)
 	return dentry->d_flags & DCACHE_MOUNTROOT;
 }
 
+static inline bool d_mount_violated(const struct dentry *dentry)
+{
+	return dentry->d_flags & DCACHE_MOUNT_VIOLATED;
+}
+
 /*
  * Directory cache entry type accessor functions.
  */
diff --git a/include/linux/namei.h b/include/linux/namei.h
index c8990779f0c3..55c8aaec7b03 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -45,6 +45,8 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_ROOT		0x2000
 #define LOOKUP_EMPTY		0x4000
 
+#define LOOKUP_NODOTDOT		0x10000
+
 extern int user_path_at(int, const char __user *, unsigned, struct path *);
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
 
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 3/4] vfs: Handle mounts whose parents are unreachable from their mountpoint
       [not found]         ` <874moq9oyb.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-04-08 23:32           ` [PATCH review 1/4] mnt: Track which mounts use a dentry as root Eric W. Biederman
  2015-04-08 23:32           ` [PATCH review 2/4] vfs: Test for and handle paths that are unreachable from their mnt_root Eric W. Biederman
@ 2015-04-08 23:33           ` Eric W. Biederman
  2015-04-08 23:34           ` [PATCH review 4/4] vfs: Do not allow escaping from bind mounts Eric W. Biederman
                             ` (3 subsequent siblings)
  6 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-08 23:33 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Willy Tarreau


- In follow_up and follow_up_rcu don't follow up if the current
  mount's mountpoint cannot reach the parent mount's root.

- In prepend_path and its callers in the d_path family don't follow
  to the parent mount if the current mount's mountpoint cannot reach
  the parent mount's root.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
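The guarded step to the parent mount can be modelled in userspace roughly
as follows.  This is only a sketch under simplifying assumptions: the
toy_* names are hypothetical, there is no locking or reference counting,
and the reachability walk is done unconditionally rather than behind a
flag test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; these are not kernel structures. */
struct toy_dentry {
	struct toy_dentry *parent;	/* a filesystem root points at itself */
};

struct toy_mount {
	struct toy_mount *parent;	/* the global root mount points at itself */
	struct toy_dentry *mountpoint;	/* dentry in the parent we are attached to */
	struct toy_dentry *root;	/* root dentry this mount exposes */
};

struct toy_path {
	struct toy_mount *mnt;
	struct toy_dentry *dentry;
};

/* True if following parent pointers from d arrives at root. */
static bool toy_reaches(const struct toy_dentry *d,
			const struct toy_dentry *root)
{
	while (d != root) {
		if (d->parent == d)
			return false;
		d = d->parent;
	}
	return true;
}

/*
 * Step from a mount to its parent mount.  Returns 1 and updates *path on
 * success; returns 0 and leaves *path untouched when there is no parent
 * or when the mountpoint can no longer reach the parent mount's root.
 */
static int toy_follow_up(struct toy_path *path)
{
	struct toy_mount *mnt = path->mnt;
	struct toy_mount *parent = mnt->parent;
	struct toy_path up;

	assert(parent);
	if (parent == mnt)
		return 0;

	up.mnt = parent;
	up.dentry = mnt->mountpoint;
	if (!toy_reaches(up.dentry, parent->root))
		return 0;

	*path = up;
	return 1;
}
```

Building the candidate path in a local and assigning it only after the
check passes means the caller's path is never left half updated.
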
 fs/dcache.c | 14 ++++++++++----
 fs/namei.c  | 27 +++++++++++++++++----------
 2 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index e07eb03f6de6..6e68312494ed 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2894,10 +2894,16 @@ restart:
 			struct mount *parent = ACCESS_ONCE(mnt->mnt_parent);
 			/* Global root? */
 			if (mnt != parent) {
-				dentry = ACCESS_ONCE(mnt->mnt_mountpoint);
-				mnt = parent;
-				vfsmnt = &mnt->mnt;
-				continue;
+				struct path new = {
+					.dentry = ACCESS_ONCE(mnt->mnt_mountpoint),
+					.mnt = &parent->mnt,
+				};
+				if (path_connected(&new)) {
+					mnt = parent;
+					dentry = new.dentry;
+					vfsmnt = new.mnt;
+					continue;
+				}
 			}
 			/*
 			 * Filesystems needing to implement special "root names"
diff --git a/fs/namei.c b/fs/namei.c
index 83cdcdf36eed..40e56d76df34 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -935,14 +935,16 @@ static int follow_up_rcu(struct path *path)
 {
 	struct mount *mnt = real_mount(path->mnt);
 	struct mount *parent;
-	struct dentry *mountpoint;
+	struct path new;
 
 	parent = mnt->mnt_parent;
-	if (&parent->mnt == path->mnt)
+	if (parent == mnt)
 		return 0;
-	mountpoint = mnt->mnt_mountpoint;
-	path->dentry = mountpoint;
-	path->mnt = &parent->mnt;
+	new.dentry = mnt->mnt_mountpoint;
+	new.mnt = &parent->mnt;
+	if (!path_connected(&new))
+		return 0;
+	*path = new;
 	return 1;
 }
 
@@ -960,7 +962,7 @@ int follow_up(struct path *path)
 {
 	struct mount *mnt = real_mount(path->mnt);
 	struct mount *parent;
-	struct dentry *mountpoint;
+	struct path new;
 
 	read_seqlock_excl(&mount_lock);
 	parent = mnt->mnt_parent;
@@ -968,13 +970,18 @@ int follow_up(struct path *path)
 		read_sequnlock_excl(&mount_lock);
 		return 0;
 	}
-	mntget(&parent->mnt);
-	mountpoint = dget(mnt->mnt_mountpoint);
+	new.dentry = mnt->mnt_mountpoint;
+	new.mnt = &parent->mnt;
+	if (!path_connected(&new)) {
+		read_sequnlock_excl(&mount_lock);
+		return 0;
+	}
+	mntget(new.mnt);
+	dget(new.dentry);
 	read_sequnlock_excl(&mount_lock);
 	dput(path->dentry);
-	path->dentry = mountpoint;
 	mntput(path->mnt);
-	path->mnt = &parent->mnt;
+	*path = new;
 	return 1;
 }
 EXPORT_SYMBOL(follow_up);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  3/4] vfs: Handle mounts whose parents are unreachable from their mountpoint
  2015-04-08 23:31       ` [PATCH review 0/4] Loopback mount escape fixes Eric W. Biederman
@ 2015-04-08 23:33         ` Eric W. Biederman
  2015-04-08 23:34         ` [PATCH review 4/4] vfs: Do not allow escaping from bind mounts Eric W. Biederman
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-08 23:33 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Al Viro, Andy Lutomirski, Serge E. Hallyn,
	Richard Weinberger, Andrey Vagin, Jann Horn, Willy Tarreau,
	Omar Sandoval


- In follow_up and follow_up_rcu don't follow up if the current
  mount's mountpoint cannot reach the parent mount's root.

- In prepend_path and its callers in the d_path family don't follow
  to the parent mount if the current mount's mountpoint cannot reach
  the parent mount's root.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/dcache.c | 14 ++++++++++----
 fs/namei.c  | 27 +++++++++++++++++----------
 2 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index e07eb03f6de6..6e68312494ed 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2894,10 +2894,16 @@ restart:
 			struct mount *parent = ACCESS_ONCE(mnt->mnt_parent);
 			/* Global root? */
 			if (mnt != parent) {
-				dentry = ACCESS_ONCE(mnt->mnt_mountpoint);
-				mnt = parent;
-				vfsmnt = &mnt->mnt;
-				continue;
+				struct path new = {
+					.dentry = ACCESS_ONCE(mnt->mnt_mountpoint),
+					.mnt = &parent->mnt,
+				};
+				if (path_connected(&new)) {
+					mnt = parent;
+					dentry = new.dentry;
+					vfsmnt = new.mnt;
+					continue;
+				}
 			}
 			/*
 			 * Filesystems needing to implement special "root names"
diff --git a/fs/namei.c b/fs/namei.c
index 83cdcdf36eed..40e56d76df34 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -935,14 +935,16 @@ static int follow_up_rcu(struct path *path)
 {
 	struct mount *mnt = real_mount(path->mnt);
 	struct mount *parent;
-	struct dentry *mountpoint;
+	struct path new;
 
 	parent = mnt->mnt_parent;
-	if (&parent->mnt == path->mnt)
+	if (parent == mnt)
 		return 0;
-	mountpoint = mnt->mnt_mountpoint;
-	path->dentry = mountpoint;
-	path->mnt = &parent->mnt;
+	new.dentry = mnt->mnt_mountpoint;
+	new.mnt = &parent->mnt;
+	if (!path_connected(&new))
+		return 0;
+	*path = new;
 	return 1;
 }
 
@@ -960,7 +962,7 @@ int follow_up(struct path *path)
 {
 	struct mount *mnt = real_mount(path->mnt);
 	struct mount *parent;
-	struct dentry *mountpoint;
+	struct path new;
 
 	read_seqlock_excl(&mount_lock);
 	parent = mnt->mnt_parent;
@@ -968,13 +970,18 @@ int follow_up(struct path *path)
 		read_sequnlock_excl(&mount_lock);
 		return 0;
 	}
-	mntget(&parent->mnt);
-	mountpoint = dget(mnt->mnt_mountpoint);
+	new.dentry = mnt->mnt_mountpoint;
+	new.mnt = &parent->mnt;
+	if (!path_connected(&new)) {
+		read_sequnlock_excl(&mount_lock);
+		return 0;
+	}
+	mntget(new.mnt);
+	dget(new.dentry);
 	read_sequnlock_excl(&mount_lock);
 	dput(path->dentry);
-	path->dentry = mountpoint;
 	mntput(path->mnt);
-	path->mnt = &parent->mnt;
+	*path = new;
 	return 1;
 }
 EXPORT_SYMBOL(follow_up);
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  4/4] vfs: Do not allow escaping from bind mounts.
       [not found]         ` <874moq9oyb.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                             ` (2 preceding siblings ...)
  2015-04-08 23:33           ` [PATCH review 3/4] vfs: Handle mounts whose parents are unreachable from their mountpoint Eric W. Biederman
@ 2015-04-08 23:34           ` Eric W. Biederman
  2015-04-09 19:01           ` [PATCH review 0/4] Loopback mount escape fixes Eric W. Biederman
                             ` (2 subsequent siblings)
  6 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-08 23:34 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Willy Tarreau

Rename can move a file or directory outside of a bind mount.  This has
allowed programs with paths below the renamed directory to traverse up
their directory tree to the real root of the filesystem instead of
just the root of their bind mount.

In the presence of such renames limit applications to what the bind
mount intended to reveal by marking mounts that have had dentries
renamed out of them with MNT_VIOLATED, marking mounts that can no
longer walk up to their parent mounts with MNT_UMOUNT_PENDING and then
lazily unmounting such mounts.

All moves go through __d_move so __d_move has been modified to mark
all mounts whose dentries have been moved outside of them.

Once the root dentry of a violated mount has been found a new function
mnt_set_violated is called to mark all mounts that have that dentry as
their root as violated.

The worst case performance of the changes to __d_move is two extra
trips from dentry and target to the root of their respective dentry
trees.

This closes a hole where it was possible in some circumstances to
follow .. past the root of a bind mount.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
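The cost claim above is easiest to see on a toy tree.  The following is a
userspace sketch of the depth-equalizing common ancestor search, not the
kernel code; toy_node and the toy_* functions are hypothetical names, and
a root is modelled as a node that is its own parent.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for illustration only; this is not struct dentry. */
struct toy_node {
	struct toy_node *parent;	/* a root points at itself */
};

/* Number of parent hops from n up to its root. */
static unsigned toy_depth(const struct toy_node *n)
{
	unsigned depth = 0;

	while (n->parent != n) {
		n = n->parent;
		depth++;
	}
	return depth;
}

/*
 * Deepest node that is an ancestor of (or equal to) both inputs,
 * or NULL when the two nodes live in different trees.
 */
static const struct toy_node *toy_common_ancestor(const struct toy_node *left,
						  const struct toy_node *right)
{
	unsigned ldepth, rdepth;

	assert(left && right);
	ldepth = toy_depth(left);
	rdepth = toy_depth(right);

	/* Make right the deeper (or equally deep) node. */
	if (ldepth > rdepth) {
		const struct toy_node *tmp_node = left;
		unsigned tmp_depth = ldepth;

		left = right;
		right = tmp_node;
		ldepth = rdepth;
		rdepth = tmp_depth;
	}

	/* Lift the deeper node until both sit at the same depth. */
	while (rdepth > ldepth) {
		right = right->parent;
		rdepth--;
	}

	/* Climb in lockstep until the two walks meet. */
	while (right != left) {
		if (right->parent == right)
			return NULL;	/* reached two distinct roots */
		right = right->parent;
		left = left->parent;
	}
	return right;
}
```

Each input is walked to its root once to measure its depth, and then the
two climb together until they meet, so the work is bounded by the depth
of the two nodes rather than by the size of the tree.
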
 fs/dcache.c    | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/internal.h  |  1 +
 fs/namespace.c | 17 +++++++++++++++
 3 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 6e68312494ed..7baecba354dd 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2535,6 +2535,56 @@ static void dentry_unlock_for_move(struct dentry *dentry, struct dentry *target)
 	spin_unlock(&dentry->d_lock);
 }
 
+static unsigned d_depth(const struct dentry *dentry)
+{
+	unsigned depth = 0;
+
+	while (!IS_ROOT(dentry)) {
+		dentry = dentry->d_parent;
+		depth++;
+	}
+	return depth;
+}
+
+static const struct dentry *d_common_ancestor(const struct dentry *left,
+					      const struct dentry *right)
+{
+	unsigned ldepth = d_depth(left);
+	unsigned rdepth = d_depth(right);
+
+	if (ldepth > rdepth) {
+		swap(left, right);
+		swap(ldepth, rdepth);
+	}
+
+	while (rdepth > ldepth) {
+		right = right->d_parent;
+		rdepth--;
+	}
+
+	while (right != left) {
+		if (IS_ROOT(right))
+			return NULL;
+		right = right->d_parent;
+		left = left->d_parent;
+	}
+
+	return right;
+}
+
+static void mark_violated_mounts(struct dentry *dentry,
+				 const struct dentry *ancestor)
+{
+	/* Mark all mountroots that are children of the common
+	 * ancestor and ancestors of dentry.
+	 */
+	struct dentry *p;
+	for (p = dentry->d_parent; p != ancestor; p = p->d_parent) {
+		if (d_mountroot(p))
+			mnt_set_violated(p);
+	}
+}
+
 /*
  * When switching names, the actual string doesn't strictly have to
  * be preserved in the target - because we're dropping the target
@@ -2563,11 +2613,22 @@ static void dentry_unlock_for_move(struct dentry *dentry, struct dentry *target)
 static void __d_move(struct dentry *dentry, struct dentry *target,
 		     bool exchange)
 {
+	const struct dentry *ancestor = d_common_ancestor(dentry, target);
+
 	if (!dentry->d_inode)
 		printk(KERN_WARNING "VFS: moving negative dcache entry\n");
 
-	BUG_ON(d_ancestor(dentry, target));
-	BUG_ON(d_ancestor(target, dentry));
+	BUG_ON(dentry == ancestor);
+	BUG_ON(target == ancestor);
+
+	/* If there is a common ancestor, mark mounts which may have
+	 * paths that are no longer able to follow d_parent up to
+	 * mnt_root after this move.
+	 */
+	if (ancestor) {
+		mark_violated_mounts(dentry, ancestor);
+		mark_violated_mounts(target, ancestor);
+	}
 
 	dentry_lock_for_move(dentry, target);
 
diff --git a/fs/internal.h b/fs/internal.h
index 046767f0042e..d6a6cbd1e7a1 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -71,6 +71,7 @@ extern int __mnt_want_write_file(struct file *);
 extern void __mnt_drop_write(struct vfsmount *);
 extern void __mnt_drop_write_file(struct file *);
 
+extern void mnt_set_violated(struct dentry *root);
 /*
  * fs_struct.c
  */
diff --git a/fs/namespace.c b/fs/namespace.c
index 75abc9fcaafa..083d96bdbd60 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1644,6 +1644,23 @@ out_unlock:
 	namespace_unlock();
 }
 
+void mnt_set_violated(struct dentry *root)
+{
+	struct mountroot *mr;
+
+	/* Locking the mount guarantees that root is a mountpoint and
+	 * pushes rcu path walkers onto the slow path.
+	 */
+	lock_mount_hash();
+	mr = lookup_mountroot(root);
+	if (mr) {
+		spin_lock(&root->d_lock);
+		root->d_flags |= DCACHE_MOUNT_VIOLATED;
+		spin_unlock(&root->d_lock);
+	}
+	unlock_mount_hash();
+}
+
 /* 
  * Is the caller allowed to modify his namespace?
  */
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review  4/4] vfs: Do not allow escaping from bind mounts.
  2015-04-08 23:31       ` [PATCH review 0/4] Loopback mount escape fixes Eric W. Biederman
  2015-04-08 23:33         ` [PATCH review 3/4] vfs: Handle mounts whose parents are unreachable from their mountpoint Eric W. Biederman
@ 2015-04-08 23:34         ` Eric W. Biederman
       [not found]           ` <87iod68aa3.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                             ` (2 more replies)
  2015-04-13 12:18         ` [PATCH review 0/4] Loopback mount escape fixes Miklos Szeredi
       [not found]         ` <874moq9oyb.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  3 siblings, 3 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-08 23:34 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Al Viro, Andy Lutomirski, Serge E. Hallyn,
	Richard Weinberger, Andrey Vagin, Jann Horn, Willy Tarreau,
	Omar Sandoval

Rename can move a file or directory outside of a bind mount.  This has
allowed programs with paths below the renamed directory to traverse up
their directory tree to the real root of the filesystem instead of
just the root of their bind mount.

In the presence of such renames limit applications to what the bind
mount intended to reveal by marking mounts that have had dentries
renamed out of them with MNT_VIOLATED, marking mounts that can no
longer walk up to their parent mounts with MNT_UMOUNT_PENDING and then
lazily unmounting such mounts.

All moves go through __d_move so __d_move has been modified to mark
all mounts whose dentries have been moved outside of them.

Once the root dentry of a violated mount has been found a new function
mnt_set_violated is called to mark all mounts that have that dentry as
their root as violated.

The worst case performance of the changes to __d_move is two extra
trips from dentry and target to the root of their respective dentry
trees.

This closes a hole where it was possible in some circumstances to
follow .. past the root of a bind mount.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/dcache.c    | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/internal.h  |  1 +
 fs/namespace.c | 17 +++++++++++++++
 3 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 6e68312494ed..7baecba354dd 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2535,6 +2535,56 @@ static void dentry_unlock_for_move(struct dentry *dentry, struct dentry *target)
 	spin_unlock(&dentry->d_lock);
 }
 
+static unsigned d_depth(const struct dentry *dentry)
+{
+	unsigned depth = 0;
+
+	while (!IS_ROOT(dentry)) {
+		dentry = dentry->d_parent;
+		depth++;
+	}
+	return depth;
+}
+
+static const struct dentry *d_common_ancestor(const struct dentry *left,
+					      const struct dentry *right)
+{
+	unsigned ldepth = d_depth(left);
+	unsigned rdepth = d_depth(right);
+
+	if (ldepth > rdepth) {
+		swap(left, right);
+		swap(ldepth, rdepth);
+	}
+
+	while (rdepth > ldepth) {
+		right = right->d_parent;
+		rdepth--;
+	}
+
+	while (right != left) {
+		if (IS_ROOT(right))
+			return NULL;
+		right = right->d_parent;
+		left = left->d_parent;
+	}
+
+	return right;
+}
+
+static void mark_violated_mounts(struct dentry *dentry,
+				 const struct dentry *ancestor)
+{
+	/* Mark all mountroots that are children of the common
+	 * ancestor and ancestors of dentry.
+	 */
+	struct dentry *p;
+	for (p = dentry->d_parent; p != ancestor; p = p->d_parent) {
+		if (d_mountroot(p))
+			mnt_set_violated(p);
+	}
+}
+
 /*
  * When switching names, the actual string doesn't strictly have to
  * be preserved in the target - because we're dropping the target
@@ -2563,11 +2613,22 @@ static void dentry_unlock_for_move(struct dentry *dentry, struct dentry *target)
 static void __d_move(struct dentry *dentry, struct dentry *target,
 		     bool exchange)
 {
+	const struct dentry *ancestor = d_common_ancestor(dentry, target);
+
 	if (!dentry->d_inode)
 		printk(KERN_WARNING "VFS: moving negative dcache entry\n");
 
-	BUG_ON(d_ancestor(dentry, target));
-	BUG_ON(d_ancestor(target, dentry));
+	BUG_ON(dentry == ancestor);
+	BUG_ON(target == ancestor);
+
+	/* If there is a common ancestor, mark mounts which may have
+	 * paths that are no longer able to follow d_parent up to
+	 * mnt_root after this move.
+	 */
+	if (ancestor) {
+		mark_violated_mounts(dentry, ancestor);
+		mark_violated_mounts(target, ancestor);
+	}
 
 	dentry_lock_for_move(dentry, target);
 
diff --git a/fs/internal.h b/fs/internal.h
index 046767f0042e..d6a6cbd1e7a1 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -71,6 +71,7 @@ extern int __mnt_want_write_file(struct file *);
 extern void __mnt_drop_write(struct vfsmount *);
 extern void __mnt_drop_write_file(struct file *);
 
+extern void mnt_set_violated(struct dentry *root);
 /*
  * fs_struct.c
  */
diff --git a/fs/namespace.c b/fs/namespace.c
index 75abc9fcaafa..083d96bdbd60 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1644,6 +1644,23 @@ out_unlock:
 	namespace_unlock();
 }
 
+void mnt_set_violated(struct dentry *root)
+{
+	struct mountroot *mr;
+
+	/* Locking the mount guarantees that root is a mountpoint and
+	 * pushes rcu path walkers onto the slow path.
+	 */
+	lock_mount_hash();
+	mr = lookup_mountroot(root);
+	if (mr) {
+		spin_lock(&root->d_lock);
+		root->d_flags |= DCACHE_MOUNT_VIOLATED;
+		spin_unlock(&root->d_lock);
+	}
+	unlock_mount_hash();
+}
+
 /* 
  * Is the caller allowed to modify his namespace?
  */
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* Re: [PATCH review  4/4] vfs: Do not allow escaping from bind mounts.
       [not found]           ` <87iod68aa3.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-04-09 13:06             ` Jann Horn
  2015-04-09 23:22             ` Al Viro
  1 sibling, 0 replies; 240+ messages in thread
From: Jann Horn @ 2015-04-09 13:06 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Willy Tarreau


[-- Attachment #1.1: Type: text/plain, Size: 1416 bytes --]

On Wed, Apr 08, 2015 at 06:34:12PM -0500, Eric W. Biederman wrote:
> +static unsigned d_depth(const struct dentry *dentry)
> +{
> +	unsigned depth = 0;
> +
> +	while (!IS_ROOT(dentry)) {
> +		dentry = dentry->d_parent;
> +		depth++;
> +	}
> +	return depth;
> +}

This relies on a depth of 2^32 being impossible, right? Which is guaranteed
somewhat because you would need something like a terabyte of RAM to have
that many dentries in RAM? I can't find any explicit check. Maybe it would
make sense to let the depth be 64 bits or add some kind of overflow check?
Or did I just miss some kind of check on allocation?

<https://access.redhat.com/articles/rhel-limits> claims that redhat has
tested RHEL on a machine with 6TB of physical RAM. I think that 2^32
dentries would fit in there.


> +static const struct dentry *d_common_ancestor(const struct dentry *left,
> +					      const struct dentry *right)
> +{
> +	unsigned ldepth = d_depth(left);
> +	unsigned rdepth = d_depth(right);
> +
> +	if (ldepth > rdepth) {
> +		swap(left, right);
> +		swap(ldepth, rdepth);
> +	}
> +
> +	while (rdepth > ldepth) {
> +		right = right->d_parent;
> +		rdepth--;
> +	}

At this point, the actual depths could differ by 2^32,
right?


> +	while (right != left) {
> +		if (IS_ROOT(right))
> +			return NULL;
> +		right = right->d_parent;
> +		left = left->d_parent;

And then one of these could crash with a NULL pointer deref?

[-- Attachment #1.2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review  4/4] vfs: Do not allow escaping from bind mounts.
  2015-04-08 23:34         ` [PATCH review 4/4] vfs: Do not allow escaping from bind mounts Eric W. Biederman
       [not found]           ` <87iod68aa3.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-04-09 13:06           ` Jann Horn
       [not found]             ` <20150409130601.GA22250-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
  2015-04-09 23:22           ` Al Viro
  2 siblings, 1 reply; 240+ messages in thread
From: Jann Horn @ 2015-04-09 13:06 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-fsdevel, Al Viro, Andy Lutomirski,
	Serge E. Hallyn, Richard Weinberger, Andrey Vagin, Willy Tarreau,
	Omar Sandoval

[-- Attachment #1: Type: text/plain, Size: 1416 bytes --]

On Wed, Apr 08, 2015 at 06:34:12PM -0500, Eric W. Biederman wrote:
> +static unsigned d_depth(const struct dentry *dentry)
> +{
> +	unsigned depth = 0;
> +
> +	while (!IS_ROOT(dentry)) {
> +		dentry = dentry->d_parent;
> +		depth++;
> +	}
> +	return depth;
> +}

This relies on a depth of 2^32 being impossible, right? Which is guaranteed
somewhat because you would need something like a terabyte of RAM to have
that many dentries in RAM? I can't find any explicit check. Maybe it would
make sense to let the depth be 64 bits or add some kind of overflow check?
Or did I just miss some kind of check on allocation?

<https://access.redhat.com/articles/rhel-limits> claims that redhat has
tested RHEL on a machine with 6TB of physical RAM. I think that 2^32
dentries would fit in there.


> +static const struct dentry *d_common_ancestor(const struct dentry *left,
> +					      const struct dentry *right)
> +{
> +	unsigned ldepth = d_depth(left);
> +	unsigned rdepth = d_depth(right);
> +
> +	if (ldepth > rdepth) {
> +		swap(left, right);
> +		swap(ldepth, rdepth);
> +	}
> +
> +	while (rdepth > ldepth) {
> +		right = right->d_parent;
> +		rdepth--;
> +	}

At this point, the actual depths could differ by 2^32,
right?


> +	while (right != left) {
> +		if (IS_ROOT(right))
> +			return NULL;
> +		right = right->d_parent;
> +		left = left->d_parent;

And then one of these could crash with a NULL pointer deref?

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 09/19] mnt: Fail collect_mounts when applied to unmounted mounts
       [not found]           ` <1428051353.1924.2.camel-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>
@ 2015-04-09 16:39             ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-09 16:39 UTC (permalink / raw)
  To: Lukasz Pawelczyk
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Jann Horn, Willy Tarreau

Lukasz Pawelczyk <l.pawelczyk-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org> writes:

> On czw, 2015-04-02 at 20:56 -0500, Eric W. Biederman wrote:
>> The only users of collect_mounts are in audit_tree.c
>> 
>> In audit_tree_trees and audit_add_tree rule the path passed into
>
> I think you meant audit_trim_trees.
>
> Also you missed a _ in audit_add_tree_rule.

I did, thank you.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review  4/4] vfs: Do not allow escaping from bind mounts.
       [not found]             ` <20150409130601.GA22250-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
@ 2015-04-09 16:52               ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-09 16:52 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Willy Tarreau

Jann Horn <jann-XZ1E9jl8jIdeoWH0uzbU5w@public.gmane.org> writes:

> On Wed, Apr 08, 2015 at 06:34:12PM -0500, Eric W. Biederman wrote:
>> +static unsigned d_depth(const struct dentry *dentry)
>> +{
>> +	unsigned depth = 0;
>> +
>> +	while (!IS_ROOT(dentry)) {
>> +		dentry = dentry->d_parent;
>> +		depth++;
>> +	}
>> +	return depth;
>> +}
>
> This relies on a depth of 2^32 being impossible, right? Which is guaranteed
> somewhat because you would need something like a terabyte of RAM to have
> that many dentries in RAM? I can't find any explicit check. Maybe it would
> make sense to let the depth be 64 bits or add some kind of overflow check?
> Or did I just miss some kind of check on allocation?

Well there is the 4K PATH_MAX.

If nothing else your performance will grind to a halt if you attempt to
use a path that deeply nested.

That said it doesn't cost us anything to make the variables
unsigned long and it avoids having to worry about this piece of code.

I will respin this patch.
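
For reference, a minimal userspace sketch of what the widened helpers could look like. This is not the respun patch: `struct node`, `is_root`, `depth_of` and `common_ancestor` are hypothetical stand-ins for `struct dentry`, `IS_ROOT()`, `d_depth()` and `d_common_ancestor()`, and no locking is modelled.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct dentry: only the parent link. */
struct node {
	struct node *parent;	/* a root points at itself */
};

/* Mirrors the IS_ROOT() convention: a root is its own parent. */
static int is_root(const struct node *n)
{
	return n == n->parent;
}

/* Count the steps up to the root in an unsigned long. */
static unsigned long depth_of(const struct node *n)
{
	unsigned long depth = 0;

	while (!is_root(n)) {
		n = n->parent;
		depth++;
	}
	return depth;
}

/*
 * Deepest node found on both parent chains (a node counts as its own
 * ancestor here), or NULL when the two nodes sit under different roots.
 */
static const struct node *common_ancestor(const struct node *left,
					  const struct node *right)
{
	unsigned long ldepth = depth_of(left);
	unsigned long rdepth = depth_of(right);

	/* Bring the deeper side up to the level of the shallower one. */
	while (ldepth > rdepth) {
		left = left->parent;
		ldepth--;
	}
	while (rdepth > ldepth) {
		right = right->parent;
		rdepth--;
	}

	/* Walk both sides up in lockstep until they meet. */
	while (left != right) {
		if (is_root(right))
			return NULL;
		left = left->parent;
		right = right->parent;
	}
	return right;
}
```

With both counters as wide as a pointer, the two depths cannot silently differ by a multiple of 2^32, which is the case Jann was worried about.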

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/4] Loopback mount escape fixes
       [not found]         ` <874moq9oyb.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                             ` (3 preceding siblings ...)
  2015-04-08 23:34           ` [PATCH review 4/4] vfs: Do not allow escaping from bind mounts Eric W. Biederman
@ 2015-04-09 19:01           ` Eric W. Biederman
  2015-04-09 19:12             ` Al Viro
       [not found]             ` <87egnt5dok.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-04-13 12:18           ` Miklos Szeredi
  2015-08-03 21:25           ` [PATCH review 0/6] Bind " Eric W. Biederman
  6 siblings, 2 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-09 19:01 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Willy Tarreau


Al.  Do you want involvement in any of these patches?

If not I will move them in the direction of linux-next and Linus.  I
expect they are just interesting enough that I don't want to send them
as bug fixes during rc-late.

The feedback I have received from the review has been incorporated into:
     git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

While I have energy I would like to push these things and get these issues fixed.

Eric

ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:

> After the last round of feedback I sat down and played with my fix
> for the fact that, after a strategically placed rename, ".." on bind
> mounts can go up past the root of the bind mount.
>
> The code better handles the escaped directory returning into its bind
> mount, and its cost is now roughly a constant factor over what the code
> costs without the fix, in all cases.
>
> So I think I have found a better tradeoff between fixing this bug and
> not slowing down path name lookups in the common case.
>
> These fixes are against v4.0-rc6.
>
> For those who like to see everything in a single tree the code is at:
>
>     git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>
> Eric W. Biederman (4):
>       mnt: Track which mounts use a dentry as root.
>       vfs: Test for and handle paths that are unreachable from their mnt_root
>       vfs: Handle mounts whose parents are unreachable from their mountpoint
>       vfs: Do not allow escaping from bind mounts.
>
>  fs/dcache.c            |  82 +++++++++++++++++++++++++++---
>  fs/internal.h          |   2 +
>  fs/mount.h             |   6 +++
>  fs/namei.c             |  57 +++++++++++++++++----
>  fs/namespace.c         | 135 +++++++++++++++++++++++++++++++++++++++++++++++--
>  include/linux/dcache.h |  13 +++++
>  include/linux/namei.h  |   2 +
>  7 files changed, 277 insertions(+), 20 deletions(-)

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/4] Loopback mount escape fixes
       [not found]             ` <87egnt5dok.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-04-09 19:12               ` Al Viro
  0 siblings, 0 replies; 240+ messages in thread
From: Al Viro @ 2015-04-09 19:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn,
	Willy Tarreau

On Thu, Apr 09, 2015 at 02:01:15PM -0500, Eric W. Biederman wrote:
> 
> Al.  Do you want involvement in any of these patches?
> 
> If not I will move them in the direction of linux-next and Linus.  I
> expect they are just interesting enough that I don't want to send them
> as bug fixes during rc-late.

I'll post review in a few hours (in the middle of nasty bisect right now)

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/4] Loopback mount escape fixes
  2015-04-09 19:01           ` [PATCH review 0/4] Loopback mount escape fixes Eric W. Biederman
@ 2015-04-09 19:12             ` Al Viro
  2015-04-09 19:14               ` Eric W. Biederman
       [not found]               ` <20150409191232.GV889-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
       [not found]             ` <87egnt5dok.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  1 sibling, 2 replies; 240+ messages in thread
From: Al Viro @ 2015-04-09 19:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-fsdevel, Andy Lutomirski,
	Serge E. Hallyn, Richard Weinberger, Andrey Vagin, Jann Horn,
	Willy Tarreau, Omar Sandoval

On Thu, Apr 09, 2015 at 02:01:15PM -0500, Eric W. Biederman wrote:
> 
> Al.  Do you want involvement in any of these patches?
> 
> If not I will move them in the direction of linux-next and Linus.  I
> expect they are just interesting enough that I don't want to send them
> as bug fixes during rc-late.

I'll post review in a few hours (in the middle of nasty bisect right now)

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/4] Loopback mount escape fixes
       [not found]               ` <20150409191232.GV889-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-04-09 19:14                 ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-09 19:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn,
	Willy Tarreau

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Thu, Apr 09, 2015 at 02:01:15PM -0500, Eric W. Biederman wrote:
>> 
>> Al.  Do you want involvement in any of these patches?
>> 
>> If not I will move them in the direction of linux-next and Linus.  I
>> expect they are just interesting enough that I don't want to send them
>> as bug fixes during rc-late.
>
> I'll post review in a few hours (in the middle of nasty bisect right now)

Sounds good. 

Thank you very much.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/4] Loopback mount escape fixes
  2015-04-09 19:12             ` Al Viro
@ 2015-04-09 19:14               ` Eric W. Biederman
       [not found]               ` <20150409191232.GV889-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  1 sibling, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-09 19:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Linux Containers, linux-fsdevel, Andy Lutomirski,
	Serge E. Hallyn, Richard Weinberger, Andrey Vagin, Jann Horn,
	Willy Tarreau, Omar Sandoval

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Thu, Apr 09, 2015 at 02:01:15PM -0500, Eric W. Biederman wrote:
>> 
>> Al.  Do you want involvement in any of these patches?
>> 
>> If not I will move them in the direction of linux-next and Linus.  I
>> expect they are just interesting enough that I don't want to send them
>> as bug fixes during rc-late.
>
> I'll post review in a few hours (in the middle of nasty bisect right now)

Sounds good. 

Thank you very much.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review  2/4] vfs: Test for and handle paths that are unreachable from their mnt_root
       [not found]             ` <87sica8ac5.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-04-09 23:16               ` Al Viro
       [not found]                 ` <20150409231636.GW889-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  0 siblings, 1 reply; 240+ messages in thread
From: Al Viro @ 2015-04-09 23:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn,
	Willy Tarreau

On Wed, Apr 08, 2015 at 06:32:58PM -0500, Eric W. Biederman wrote:
> 
> - Add a dentry flag DCACHE_MOUNT_VIOLATED to mark loopback mounts that
>   have had a dentry moved into a directory that does not descend from
>   the mount root dentry.
> 
> - In mnt_put_root clear DCACHE_MOUNT_VIOLATED.
> 
> - Add a function path_connected to verify a path.dentry is reachable from
>   path.mnt.mnt_root.  AKA rename did not do something nasty to the bind mount.
> 
> - Disable ".." when a path is not connected during lookup.
>   (Maybe we want to stop ".." at this path instead?)
> 
>   Following .. is not disabled after a transition to /
>   and is never disabled when / is the directory we start
>   with.   Because we already limit .. no higher than /

IDGI.  Am I missing something, or you really only set that flag in the
beginning of the pathwalk?  At the bare minimum, you want to treat
nd_jump_link() the same way, or your protection is trivially defeated by
using /proc/self/cwd/$PATHNAME instead of $PATHNAME...

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review  4/4] vfs: Do not allow escaping from bind mounts.
       [not found]           ` <87iod68aa3.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-04-09 13:06             ` Jann Horn
@ 2015-04-09 23:22             ` Al Viro
  1 sibling, 0 replies; 240+ messages in thread
From: Al Viro @ 2015-04-09 23:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn,
	Willy Tarreau

On Wed, Apr 08, 2015 at 06:34:12PM -0500, Eric W. Biederman wrote:
> +	if (ancestor) {
> +		mark_violated_mounts(dentry, ancestor);
> +		mark_violated_mounts(target, ancestor);
> +	}

Umm...  Both sides the same way, regardless of whether it's exchange or
move?  Looks wrong...

Look:

mkdir /tmp/a
mkdir /tmp/b
mkdir /tmp/c
mkdir /tmp/b/c
touch /tmp/a/x
mount --bind /tmp/b /tmp/c
mv /tmp/a/x /tmp/b/c/x

should that make the vfsmount on /tmp/c violated?  And if so, why?
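
To make the question concrete, here is a small userspace model of the walk in mark_violated_mounts() as posted, applied to the commands above. It is only a sketch: `struct node`, its `mountroot` and `violated` fields, and `example_marks_bind_root` are hypothetical stand-ins for the dentry, d_mountroot() and DCACHE_MOUNT_VIOLATED, and it assumes /tmp is the common ancestor of source and target.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a dentry. */
struct node {
	struct node *parent;
	bool mountroot;	/* stands in for d_mountroot() */
	bool violated;	/* stands in for DCACHE_MOUNT_VIOLATED */
};

/* Same loop shape as mark_violated_mounts() in the patch. */
static void mark_violated(struct node *n, const struct node *ancestor)
{
	struct node *p;

	for (p = n->parent; p != ancestor; p = p->parent) {
		if (p->mountroot)
			p->violated = true;
	}
}

/* Models "mv /tmp/a/x /tmp/b/c/x" with /tmp/b bind mounted on /tmp/c. */
static bool example_marks_bind_root(void)
{
	struct node tmp = { &tmp, false, false };	/* /tmp, common ancestor */
	struct node a   = { &tmp, false, false };	/* /tmp/a */
	struct node b   = { &tmp, true,  false };	/* /tmp/b, root of the bind mount */
	struct node bc  = { &b,   false, false };	/* /tmp/b/c */
	struct node src = { &a,   false, false };	/* /tmp/a/x */
	struct node dst = { &bc,  false, false };	/* /tmp/b/c/x */

	mark_violated(&src, &tmp);	/* visits /tmp/a only */
	mark_violated(&dst, &tmp);	/* visits /tmp/b/c, then /tmp/b */

	return b.violated;
}
```

In this model the second call walks through /tmp/b and flags it, even though nothing was moved out from under the bind mount.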

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review  4/4] vfs: Do not allow escaping from bind mounts.
  2015-04-08 23:34         ` [PATCH review 4/4] vfs: Do not allow escaping from bind mounts Eric W. Biederman
       [not found]           ` <87iod68aa3.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-04-09 13:06           ` Jann Horn
@ 2015-04-09 23:22           ` Al Viro
  2015-04-10  2:51             ` Eric W. Biederman
       [not found]             ` <20150409232212.GX889-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2 siblings, 2 replies; 240+ messages in thread
From: Al Viro @ 2015-04-09 23:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-fsdevel, Andy Lutomirski,
	Serge E. Hallyn, Richard Weinberger, Andrey Vagin, Jann Horn,
	Willy Tarreau, Omar Sandoval

On Wed, Apr 08, 2015 at 06:34:12PM -0500, Eric W. Biederman wrote:
> +	if (ancestor) {
> +		mark_violated_mounts(dentry, ancestor);
> +		mark_violated_mounts(target, ancestor);
> +	}

Umm...  Both sides the same way, regardless of whether it's exchange or
move?  Looks wrong...

Look:

mkdir /tmp/a
mkdir /tmp/b
mkdir /tmp/c
mkdir /tmp/b/c
touch /tmp/a/x
mount --bind /tmp/b /tmp/c
mv /tmp/a/x /tmp/b/c/x

should that make the vfsmount on /tmp/c violated?  And if so, why?

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 2/4] vfs: Test for and handle paths that are unreachable from their mnt_root
       [not found]                 ` <20150409231636.GW889-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-04-10  2:24                   ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-10  2:24 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn,
	Willy Tarreau

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Wed, Apr 08, 2015 at 06:32:58PM -0500, Eric W. Biederman wrote:
>> 
>> - Add a dentry flag DCACHE_MOUNT_VIOLATED to mark loopback mounts that
>>   have had a dentry moved into a directory that does not descend from
>>   the mount root dentry.
>> 
>> - In mnt_put_root clear DCACHE_MOUNT_VIOLATED.
>> 
>> - Add a function path_connected to verify a path.dentry is reachable from
>>   path.mnt.mnt_root.  AKA rename did not do something nasty to the bind mount.
>> 
>> - Disable ".." when a path is not connected during lookup.
>>   (Maybe we want to stop ".." at this path instead?)
>> 
>>   Following .. is not disabled after a transition to /
>>   and is never disabled when / is the directory we start
>>   with.   Because we already limit .. no higher than /
>
> IDGI.  Am I missing something, or you really only set that flag in the
> beginning of the pathwalk?  At the bare minimum, you want to treat
> nd_jump_link() the same way, or your protection is trivially defeated by
> using /proc/self/cwd/$PATHNAME instead of $PATHNAME...

nd_jump_link() is definitely an oversight.  Doh!

When starting at the root, or at the mnt_root of a mount, that flag is
not necessary, as we can obviously walk up as far as it is possible to
go on that mount.

Furthermore legitimize_mnt will fail if a problematic rename happens
during the mount.

The next patch limits what follow_up and follow_up_rcu can do.

So I have all of the normal operations covered, but I definitely need to
take a second look to see if there are any additional locations like
nd_jump_link where we can jump onto a path in the middle of a mount and
need to test to see if it is connected.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review  4/4] vfs: Do not allow escaping from bind mounts.
       [not found]             ` <20150409232212.GX889-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-04-10  2:51               ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-10  2:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn,
	Willy Tarreau

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Wed, Apr 08, 2015 at 06:34:12PM -0500, Eric W. Biederman wrote:
>> +	if (ancestor) {
>> +		mark_violated_mounts(dentry, ancestor);
>> +		mark_violated_mounts(target, ancestor);
>> +	}
>
> Umm...  Both sides the same way, regardless of whether it's exchange or
> move?  Looks wrong...

I am pretty certain it can cause d_path to become an information leak
if we do not.

> Look:
>
> mkdir /tmp/a
> mkdir /tmp/b
> mkdir /tmp/c
> mkdir /tmp/b/c
> touch /tmp/a/x
> mount --bind /tmp/b /tmp/c
> mv /tmp/a/x /tmp/b/c/x
>
> should that make the vfsmount on /tmp/c violated?  And if so, why?

If /tmp is a mount point and before the move there was a:
touch /tmp/b/c/x

And a process opened /tmp/c/c/x.
d_path on that file descriptor before __d_move would say:

/tmp/c/c/x

after the __d_move d_path would say:

/tmp/c/a/x

Which is bizarrely weird in this example, and could potentially be
an exploitable information leak in the hands of someone who knew
what they were doing.

I am not clever enough to take that deleted directory and walk up the
tree, so the damage may be limited to seeing the true path on the
filesystem.  But it just may be that I am dense today.

Furthermore all of the relevant changes to the dentry that happen 
when exchange is true also happen when exchange is false, so I am very
reluctant to believe that the non-exchange case is not exploitable by a
sufficiently clever individual.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review  4/4] vfs: Do not allow escaping from bind mounts.
  2015-04-09 23:22           ` Al Viro
@ 2015-04-10  2:51             ` Eric W. Biederman
       [not found]               ` <874moo1ysg.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-04-10  3:14               ` Al Viro
       [not found]             ` <20150409232212.GX889-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  1 sibling, 2 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-10  2:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Linux Containers, linux-fsdevel, Andy Lutomirski,
	Serge E. Hallyn, Richard Weinberger, Andrey Vagin, Jann Horn,
	Willy Tarreau, Omar Sandoval

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Wed, Apr 08, 2015 at 06:34:12PM -0500, Eric W. Biederman wrote:
>> +	if (ancestor) {
>> +		mark_violated_mounts(dentry, ancestor);
>> +		mark_violated_mounts(target, ancestor);
>> +	}
>
> Umm...  Both sides the same way, regardless of whether it's exchange or
> move?  Looks wrong...

I am pretty certain it can cause d_path to become an information leak
if we do not.

> Look:
>
> mkdir /tmp/a
> mkdir /tmp/b
> mkdir /tmp/c
> mkdir /tmp/b/c
> touch /tmp/a/x
> mount --bind /tmp/b /tmp/c
> mv /tmp/a/x /tmp/b/c/x
>
> should that make the vfsmount on /tmp/c violated?  And if so, why?

If /tmp is a mount point and before the move there was a:
touch /tmp/b/c/x

And a process opened /tmp/c/c/x.
d_path on that file descriptor before __d_move would say:

/tmp/c/c/x

after the __d_move d_path would say:

/tmp/c/a/x

Which is bizarrely weird in this example, and could potentially be
an exploitable information leak in the hands of someone who knew
what they were doing.

I am not clever enough to take that deleted directory and walk up the
tree, so the damage may be limited to seeing the true path on the
filesystem.  But it just may be that I am dense today.

Furthermore all of the relevant changes to the dentry that happen 
when exchange is true also happen when exchange is false, so I am very
reluctant to believe that the non-exchange case is not exploitable by a
sufficiently clever individual.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review  4/4] vfs: Do not allow escaping from bind mounts.
       [not found]               ` <874moo1ysg.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-04-10  3:14                 ` Al Viro
  0 siblings, 0 replies; 240+ messages in thread
From: Al Viro @ 2015-04-10  3:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn,
	Willy Tarreau

On Thu, Apr 09, 2015 at 09:51:11PM -0500, Eric W. Biederman wrote:
> And a process opened /tmp/c/c/x.
> d_path on that file descriptor before __d_move would say:
> 
> /tmp/c/c/x
> 
> after the __d_move d_path would say:
> 
> /tmp/c/a/x

So what?

> Which is bizarrely weird in this example, and could potentially be
> an exploitable information leak in the hands of someone who knew
> what they were doing.
> 
> I am not clever enough to take that deleted directory and walk up the
> tree, so the damage may be limited to seeing the true path on the
> filesystem.  But it just may be that I am dense today.
> 
> Furthermore all of the relevant changes to the dentry that happen 
> when exchange is true also happen when exchange is false, so I am very
> reluctant to believe that the non-exchange case is not exploitable by a
> sufficiently clever individual.

	Exploited how?  The same assistant might very well have done
echo "/tmp/c/a/x or whatever else I might want to pass to you" >/tmp/c/c/x
and pass whatever information they wanted _that_ way.

	As it is, you've created one hell of a DoS - *anyone* can poison
any vfsmount covering a subtree if they have access to a containing subtree
somewhere and write permissions on a directory inside and directory outside
of the victim one.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review  4/4] vfs: Do not allow escaping from bind mounts.
  2015-04-10  2:51             ` Eric W. Biederman
       [not found]               ` <874moo1ysg.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-04-10  3:14               ` Al Viro
  1 sibling, 0 replies; 240+ messages in thread
From: Al Viro @ 2015-04-10  3:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-fsdevel, Andy Lutomirski,
	Serge E. Hallyn, Richard Weinberger, Andrey Vagin, Jann Horn,
	Willy Tarreau, Omar Sandoval

On Thu, Apr 09, 2015 at 09:51:11PM -0500, Eric W. Biederman wrote:
> And a process opened /tmp/c/c/x.
> d_path on that file descriptor before __d_move would say:
> 
> /tmp/c/c/x
> 
> after the __d_move d_path would say:
> 
> /tmp/c/a/x

So what?

> Which is bizarrely weird in this example, and could potentially be
> an exploitable information leak in the hands of someone who knew
> what they were doing.
> 
> I am not clever enough to take that deleted directory and walk up the
> tree, so the damage may be limited to seeing the true path on the
> filesystem.  But it just may be that I am dense today.
> 
> Furthermore all of the relevant changes to the dentry that happen 
> when exchange is true also happen when exchange is false, so I am very
> reluctant to believe that the non-exchange case is not exploitable by a
> sufficiently clever individual.

	Exploited how?  The same assistant might very well have done
echo "/tmp/c/a/x or whatever else I might want to pass to you" >/tmp/c/c/x
and pass whatever information they wanted _that_ way.

	As it is, you've created one hell of a DoS - *anyone* can poison
any vfsmount covering a subtree if they have access to a containing subtree
somewhere and write permissions on a directory inside and directory outside
of the victim one.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/4] Loopback mount escape fixes
       [not found]         ` <874moq9oyb.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                             ` (4 preceding siblings ...)
  2015-04-09 19:01           ` [PATCH review 0/4] Loopback mount escape fixes Eric W. Biederman
@ 2015-04-13 12:18           ` Miklos Szeredi
  2015-08-03 21:25           ` [PATCH review 0/6] Bind " Eric W. Biederman
  6 siblings, 0 replies; 240+ messages in thread
From: Miklos Szeredi @ 2015-04-13 12:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, Al Viro, linux-fsdevel, Jann Horn,
	Willy Tarreau

On Thu, Apr 9, 2015 at 1:31 AM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> After the last round of feedback I sat down and played with my fix
> for the fact that, after a strategically placed rename, ".." on bind
> mounts can go up past the root of the bind mount.
>
> The code better handles the escaped directory returning into its bind
> mount, and its cost is now roughly a constant factor over what the code
> costs without the fix, in all cases.
>
> So I think I have found a better tradeoff between fixing this bug and
> not slowing down path name lookups in the common case.

Maybe I'm missing something, but I see a much simpler fix:

 - When following ".." first just check against the dentry being equal
to the root dentry.

 - If so, then check mount being equal to root mount.

 - If so, then we are fine, found the root.

 - If mount is not root mount, then we either have a bind mount or the
escape scenario. So have a peek at the mount tree to see if we have a
chance of reaching root or not.

  - If yes, then we are fine, continue upward.

  - Otherwise stop here and act like we found root.

This doesn't have to hook into d_move() and will only trigger the
"violated" mode in a very specific and rare case.

I haven't thought about this very hard, but I don't see how the root
dentry could be avoided without first having access to something
outside the root.
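
The steps above can be sketched in userspace C roughly as follows. This is one reading of the proposal, not kernel code: `struct mnt`, `struct dent`, `mnt_below` and `check_dotdot` are hypothetical names, the top of the mount tree is modelled with a NULL parent, and "a chance of reaching root" is taken to mean that the root mount is an ancestor of the current mount.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; only identity and parent links matter here. */
struct mnt {
	struct mnt *parent;	/* NULL at the top of the mount tree (assumption) */
};

struct dent {
	struct dent *parent;
};

enum dotdot {
	DOTDOT_GO_UP,	/* keep following ".." */
	DOTDOT_AT_ROOT,	/* stop and act as if this is the root */
};

/* Is root_mnt a strict ancestor of m in the mount tree? */
static bool mnt_below(const struct mnt *m, const struct mnt *root_mnt)
{
	for (m = m->parent; m; m = m->parent) {
		if (m == root_mnt)
			return true;
	}
	return false;
}

/* Decide what ".." should do at (d, m), given the root (root_d, root_m). */
static enum dotdot check_dotdot(const struct dent *d, const struct mnt *m,
				const struct dent *root_d,
				const struct mnt *root_m)
{
	/* Not at the root dentry: nothing special to check. */
	if (d != root_d)
		return DOTDOT_GO_UP;

	/* Root dentry on the root mount: this is the real root. */
	if (m == root_m)
		return DOTDOT_AT_ROOT;

	/* Bind mount or escape: continue only if root is still above us. */
	return mnt_below(m, root_m) ? DOTDOT_GO_UP : DOTDOT_AT_ROOT;
}
```

The expensive mount-tree walk only runs when the dentry matches the root dentry on a different mount, which is why the common case stays cheap.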

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [GIT PULL] Usernamespace related locked mount fixes
       [not found]       ` <87a8yqou41.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                           ` (19 preceding siblings ...)
  2015-04-08 23:31         ` [PATCH review 0/4] Loopback mount escape fixes Eric W. Biederman
@ 2015-04-16 23:40         ` Eric W. Biederman
  20 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-16 23:40 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Richard Weinberger, Andy Lutomirski, Al Viro,
	linux-fsdevel, Jann Horn, Willy Tarreau

Linus,

Please pull the for-linus branch from the git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus

   HEAD: e0c9c0afd2fc958ffa34b697972721d81df8a56f mnt: Update detach_mounts to leave mounts connected

Way back in October Andrey Vagin reported that umount(MNT_DETACH) could
be used to defeat MNT_LOCKED.  As I worked to fix this I discovered
that combined with mount propagation and an appropriate selection of
shared subtrees a reference to a directory on an unmounted filesystem is
not necessary.

That MNT_DETACH is allowed in user namespace in a form that can break
MNT_LOCKED comes from my early misunderstanding of what MNT_DETACH does.

To avoid breaking existing userspace the conflict between MNT_DETACH and
MNT_LOCKED is fixed by leaving mounts that are locked to their parents
in the mount hash table until the last reference goes away.

While investigating this issue I also found an issue with
__detach_mounts.  The code was unnecessarily and incorrectly triggering
mount propagation, resulting in too many mounts going away when a
directory is deleted, and too many cpu cycles being burned while doing
that.

Looking some more I realized that __detach_mounts, by only keeping
connected the mounts that were MNT_LOCKED, still had the potential to
leak information, so I tweaked the code to keep everything locked
together that possibly could be.

This code was almost ready last cycle but Al invented fs_pin which
slightly simplifies this code but required rewrites and retesting,
and I have not been in top form for a while so it took me a while to get
all of that done.  Similarly this pull request is late because I have
been feeling absolutely miserable all week.

The issue of being able to escape a bind mount has not yet been
addressed, as the fixes are not yet mature.

Eric W. Biederman (15):
      mnt: Use hlist_move_list in namespace_unlock
      mnt: Improve the umount_tree flags
      mnt: Don't propagate umounts in __detach_mounts
      mnt: In umount_tree reuse mnt_list instead of mnt_hash
      mnt: Add MNT_UMOUNT flag
      mnt: Delay removal from the mount hash.
      mnt: On an unmount propagate clearing of MNT_LOCKED
      mnt: Don't propagate unmounts to locked mounts
      mnt: Fail collect_mounts when applied to unmounted mounts
      mnt: Factor out unhash_mnt from detach_mnt and umount_tree
      mnt: Factor umount_mnt from umount_tree
      fs_pin: Allow for the possibility that m_list or s_list go unused.
      mnt: Honor MNT_LOCKED when detaching mounts
      mnt: Fix the error check in __detach_mounts
      mnt: Update detach_mounts to leave mounts connected

 fs/fs_pin.c            |   4 +-
 fs/namespace.c         | 142 +++++++++++++++++++++++++++++++++----------------
 fs/pnode.c             |  60 ++++++++++++++++++---
 fs/pnode.h             |   7 ++-
 include/linux/fs_pin.h |   2 +
 include/linux/mount.h  |   1 +
 6 files changed, 159 insertions(+), 57 deletions(-)

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [GIT PULL] Usernamespace related locked mount fixes
       [not found]         ` <87383z1w1v.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-04-16 23:42           ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-04-16 23:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, Al Viro, linux-fsdevel,
	Jann Horn, Willy Tarreau

Linus,

Please pull the for-linus branch from the git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus

   HEAD: e0c9c0afd2fc958ffa34b697972721d81df8a56f mnt: Update detach_mounts to leave mounts connected

Way back in October Andrey Vagin reported that umount(MNT_DETACH) could
be used to defeat MNT_LOCKED.  As I worked to fix this I discovered
that combined with mount propagation and an appropriate selection of
shared subtrees a reference to a directory on an unmounted filesystem is
not necessary.

That MNT_DETACH is allowed in user namespace in a form that can break
MNT_LOCKED comes from my early misunderstanding of what MNT_DETACH does.

To avoid breaking existing userspace the conflict between MNT_DETACH and
MNT_LOCKED is fixed by leaving mounts that are locked to their parents
in the mount hash table until the last reference goes away.

While investigating this issue I also found an issue with
__detach_mounts.  The code was unnecessarily and incorrectly triggering
mount propagation, resulting in too many mounts going away when a
directory is deleted, and too many cpu cycles being burned while doing
that.

Looking some more I realized that __detach_mounts, by only keeping
connected the mounts that were MNT_LOCKED, still had the potential to
leak information, so I tweaked the code to keep everything locked
together that possibly could be.

This code was almost ready last cycle but Al invented fs_pin which
slightly simplifies this code but required rewrites and retesting,
and I have not been in top form for a while so it took me a while to get
all of that done.  Similarly this pull request is late because I have
been feeling absolutely miserable all week.

The issue of being able to escape a bind mount has not yet been
addressed, as the fixes are not yet mature.

Eric W. Biederman (15):
      mnt: Use hlist_move_list in namespace_unlock
      mnt: Improve the umount_tree flags
      mnt: Don't propagate umounts in __detach_mounts
      mnt: In umount_tree reuse mnt_list instead of mnt_hash
      mnt: Add MNT_UMOUNT flag
      mnt: Delay removal from the mount hash.
      mnt: On an unmount propagate clearing of MNT_LOCKED
      mnt: Don't propagate unmounts to locked mounts
      mnt: Fail collect_mounts when applied to unmounted mounts
      mnt: Factor out unhash_mnt from detach_mnt and umount_tree
      mnt: Factor umount_mnt from umount_tree
      fs_pin: Allow for the possibility that m_list or s_list go unused.
      mnt: Honor MNT_LOCKED when detaching mounts
      mnt: Fix the error check in __detach_mounts
      mnt: Update detach_mounts to leave mounts connected

 fs/fs_pin.c            |   4 +-
 fs/namespace.c         | 142 +++++++++++++++++++++++++++++++++----------------
 fs/pnode.c             |  60 ++++++++++++++++++---
 fs/pnode.h             |   7 ++-
 include/linux/fs_pin.h |   2 +
 include/linux/mount.h  |   1 +
 6 files changed, 159 insertions(+), 57 deletions(-)

p.s. My apologies to everyone who is seeing this twice; I failed to
send this to Linus...

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 12/19] fs_pin: Allow for the possibility that m_list or s_list go unused.
       [not found]         ` <1428026183-14879-12-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2015-05-11 13:36           ` Konstantin Khlebnikov
  0 siblings, 0 replies; 240+ messages in thread
From: Konstantin Khlebnikov @ 2015-05-11 13:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, Al Viro, linux-fsdevel, Jann Horn,
	Willy Tarreau

I've seen a crash in 4.0.2 while playing with namespaces. This patch helped.
So, it should be queued into stable@ for sure, but I don't know how
many kernel versions are affected.


[29221.493301] BUG: unable to handle kernel NULL pointer dereference
at           (null)
[29221.493396] IP: [<ffffffff811d4ad8>] pin_remove+0x58/0xc0
[29221.493456] PGD 0
[29221.493481] Oops: 0002 [#1] SMP
[29221.493521] Modules linked in: iwldvm iwlwifi nfsd auth_rpcgss
oid_registry nfs_acl nfs lockd grace sunrpc ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat bridge stp llc vfat fat fuse
snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic
iTCO_wdt intel_powerclamp coretemp kvm_intel uvcvideo kvm
videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common
snd_hda_intel videodev snd_hda_controller snd_hda_codec snd_hwdep
snd_pcm i915 lpc_ich mfd_core thinkpad_acpi snd_timer wmi snd
soundcore drm_kms_helper sdhci_pci sdhci e1000e
[29221.494197] CPU: 2 PID: 30219 Comm: ct.sh Not tainted 4.0.2-zurg+ #167
[29221.494291] Hardware name: LENOVO 4291QY6/4291QY6, BIOS 8DET51WW
(1.21 ) 08/02/2011
[29221.494392] task: ffff8803cb3e3bf0 ti: ffff8803ebf00000 task.ti:
ffff8803ebf00000
[29221.494514] RIP: 0010:[<ffffffff811d4ad8>]  [<ffffffff811d4ad8>]
pin_remove+0x58/0xc0
[29221.494655] RSP: 0018:ffff8803ebf03da8  EFLAGS: 00010246
[29221.494746] RAX: 0000000000000000 RBX: ffff88040ad33620 RCX: 000000000000000d
[29221.494854] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff81ed92f0
[29221.494969] RBP: ffff8803ebf03db8 R08: ffffffff81cf9fa8 R09: 0000000000000246
[29221.495081] R10: 000000000001e001 R11: 0000000000000000 R12: ffff8803ebf03e08
[29221.495178] R13: ffff8803cb3e42e0 R14: ffff8803cb3e3bf0 R15: ffff88040a37cd68
[29221.495260] FS:  00007f02da4f8700(0000) GS:ffff88041e280000(0000)
knlGS:0000000000000000
[29221.495343] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[29221.495395] CR2: 0000000000000000 CR3: 0000000002c0d000 CR4: 00000000000407e0
[29221.495456] Stack:
[29221.495477]  ffff8803cb3e3bf0 ffff88040ad33620 ffff8803ebf03dd8
ffffffff811c2f82
[29221.495554]  ffff8803ebf03df8 ffff88040ad33620 ffff8803ebf03e38
ffffffff811d4c54
[29221.495630]  ffffffff81c65b40 ffff880300000000 ffff8803cb3e3bf0
ffffffff810be600
[29221.495707] Call Trace:
[29221.495739]  [<ffffffff811c2f82>] drop_mountpoint+0x22/0x40
[29221.495773]  [<ffffffff811d4c54>] pin_kill+0x64/0xf0
[29221.495790]  [<ffffffff810be600>] ? wait_woken+0x90/0x90
[29221.495806]  [<ffffffff811d4d09>] mnt_pin_kill+0x29/0x40
[29221.495822]  [<ffffffff811c2410>] cleanup_mnt+0x90/0xa0
[29221.495838]  [<ffffffff811c2472>] __cleanup_mnt+0x12/0x20
[29221.495855]  [<ffffffff810a2d67>] task_work_run+0xb7/0xf0
[29221.495873]  [<ffffffff81089332>] do_exit+0x2d2/0xac0
[29221.495890]  [<ffffffff811a39d8>] ? __vfs_read+0x18/0x50
[29221.495905]  [<ffffffff811a3a9a>] ? vfs_read+0x8a/0x120
[29221.495921]  [<ffffffff8108a887>] do_group_exit+0x47/0xc0
[29221.495937]  [<ffffffff8108a914>] SyS_exit_group+0x14/0x20
[29221.495961]  [<ffffffff816da20d>] system_call_fastpath+0x16/0x1b
[29221.495978] Code: 48 89 50 08 48 b8 00 01 10 00 00 00 ad de 48 8b
53 28 48 89 43 30 48 b8 00 02 20 00 00 00 ad de 48 89 43 38 48 8b 43
20 48 85 c0 <48> 89 02 74 04 48 89 50 08 48 b8 00 01 10 00 00 00 ad de
48 89
[29221.496108] RIP  [<ffffffff811d4ad8>] pin_remove+0x58/0xc0
[29221.496126]  RSP <ffff8803ebf03da8>
[29221.496136] CR2: 0000000000000000

On Fri, Apr 3, 2015 at 4:56 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> This is needed to support lazily umounting locked mounts, because the
> entire unmounted subtree needs to stay together until there are no
> users with references to any part of the subtree.
>
> To support this, guarantee that the fs_pin m_list and s_list nodes
> are initialized by initializing them in init_fs_pin, allowing
> for the possibility that pin_insert_group does not touch them.
>
> Further, use hlist_del_init in pin_remove so that there is
> a hlist_unhashed test before we attempt to update
> the previous list item.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/fs_pin.c            | 4 ++--
>  include/linux/fs_pin.h | 2 ++
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fs_pin.c b/fs/fs_pin.c
> index b06c98796afb..611b5408f6ec 100644
> --- a/fs/fs_pin.c
> +++ b/fs/fs_pin.c
> @@ -9,8 +9,8 @@ static DEFINE_SPINLOCK(pin_lock);
>  void pin_remove(struct fs_pin *pin)
>  {
>         spin_lock(&pin_lock);
> -       hlist_del(&pin->m_list);
> -       hlist_del(&pin->s_list);
> +       hlist_del_init(&pin->m_list);
> +       hlist_del_init(&pin->s_list);
>         spin_unlock(&pin_lock);
>         spin_lock_irq(&pin->wait.lock);
>         pin->done = 1;
> diff --git a/include/linux/fs_pin.h b/include/linux/fs_pin.h
> index 9dc4e0384bfb..3886b3bffd7f 100644
> --- a/include/linux/fs_pin.h
> +++ b/include/linux/fs_pin.h
> @@ -13,6 +13,8 @@ struct vfsmount;
>  static inline void init_fs_pin(struct fs_pin *p, void (*kill)(struct fs_pin *))
>  {
>         init_waitqueue_head(&p->wait);
> +       INIT_HLIST_NODE(&p->s_list);
> +       INIT_HLIST_NODE(&p->m_list);
>         p->kill = kill;
>  }
>
> --
> 2.2.1
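
To illustrate why the quoted change matters, here is a small userspace
sketch of the hlist helpers involved. The functions below are simplified
re-implementations modelled on the kernel's include/linux/list.h, written
for this example only (no poisoning, no locking). The point is that
hlist_del_init() first checks hlist_unhashed(), so a node that was
initialized but never inserted can be "removed" safely, whereas an
unconditional delete would write through a NULL pprev. That reading of
the oops above (a fault at address 0 inside pin_remove) is an inference,
not something the report states.

```c
#include <stddef.h>

/* Userspace copies of the kernel hlist types, for illustration only. */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

static void INIT_HLIST_NODE(struct hlist_node *h)
{
	h->next = NULL;
	h->pprev = NULL;
}

/* A node with a NULL pprev is not on any list. */
static int hlist_unhashed(const struct hlist_node *h)
{
	return !h->pprev;
}

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Unconditional unlink: dereferences n->pprev, so n must be on a list. */
static void __hlist_del(struct hlist_node *n)
{
	struct hlist_node *next = n->next;
	struct hlist_node **pprev = n->pprev;

	*pprev = next;
	if (next)
		next->pprev = pprev;
}

/* Safe on nodes that were initialized but never inserted. */
static void hlist_del_init(struct hlist_node *n)
{
	if (!hlist_unhashed(n)) {
		__hlist_del(n);
		INIT_HLIST_NODE(n);
	}
}
```

With this shape, calling hlist_del_init() twice on the same node, or on
a node that pin_insert_group never touched, is a no-op.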

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/4] Loopback mount escape fixes
       [not found]           ` <CAELBmZBCCC1dspo4rPkFfh3c6RZBUYAZpz0tbUSukcf9att7Cw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-24 20:39             ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-07-24 20:39 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Andrey Vagin, Richard Weinberger, Linux Containers,
	Andy Lutomirski, Al Viro, linux-fsdevel, Jann Horn,
	Willy Tarreau

Miklos Szeredi <miklos@szeredi.hu> writes:

> On Thu, Apr 9, 2015 at 1:31 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> After the last round of feedback I sat down and played with my fix
>> for the fact that, after a strategically placed rename, ".." on bind
>> mounts can go up past the root of the bind mount.
>>
>> The code better handles the escaped directory returning into its bind
>> mount, and now costs roughly a constant factor more in all cases than
>> the code without the fix.
>>
>> So I think I have found a better tradeoff between fixing this bug and
>> not slowing down path name lookups in the common case.
>
> Maybe I'm missing something, but I see a much simpler fix:
>
>  - When following ".." first just check against the dentry being equal
> to the root dentry.
>
>  - If so, then check mount being equal to root mount.
>
>  - If so, then we are fine, found the root.
>
>  - If mount is not root mount, then we either have a bind mount or the
> escape scenario. So have a peek at the mount tree to see if we have a
> chance of reaching root or not.
>
>   - If yes, then we are fine, continue upward.
>
>   - Otherwise stop here and act like we found root.

In concrete terms I think you are suggesting something like this patch
to follow_dotdot.

diff --git a/fs/namei.c b/fs/namei.c
index ae4e4c18b2ac..56a8562899a1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1409,6 +1409,11 @@ static void follow_dotdot(struct nameidata *nd)
                        break;
                }
                if (nd->path.dentry != nd->path.mnt->mnt_root) {
+                       /* Escaped path? */
+                       if ((nd->path.mnt->mnt_root != nd->path.mnt->mnt_sb->s_root) &&
+                           !d_ancestor(nd->path.mnt->mnt_root, nd->path.dentry)) {
+                               break;
+                       }
                        /* rare case of legitimate dget_parent()... */
                        nd->path.dentry = dget_parent(nd->path.dentry);
                        dput(old);

> This doesn't have to hook into d_move() and will only trigger the
> "violated" mode on a very specific and rare case.

Am I misunderstanding you?  I don't think .. on a bind mount is a very
specific rare case.

Operations such as following ../../../../../../../../../.. would go from
a cost of O(10) to a cost of O((10*(10 + P + 1))/2), aka
from O(N) to O(N^2+N*P), where P is the depth of the path that remains
below the mount root after going 10 directories up.

Given that in cases like containers bind mounts are frequently the root
mount point of a filesystem, I don't think we want that expense if we
can possibly avoid it.  It opens the door to a DoS attack and hurts
performance in cases that are not afflicted by an escape.
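
A back-of-the-envelope sketch of that cost argument, in userspace C. The
function names are hypothetical and the model is deliberately crude: it
assumes every ".." step performs one d_ancestor-style walk from the
current directory up to the root of the bind mount, and counts parent
hops. Under that assumption the total is N*(N+1)/2 + N*P, which differs
from the expression above by a constant factor on the N*P term but has
the same O(N^2+N*P) growth.

```c
/* N plain ".." steps: one parent hop each. */
static unsigned long plain_steps(unsigned long n)
{
	return n;
}

/*
 * N ".." steps where each step first walks from the current directory
 * up to the bind mount root.  The walk starts n + p levels below that
 * root and gets one level shorter after every step.
 */
static unsigned long checked_steps(unsigned long n, unsigned long p)
{
	unsigned long depth = n + p;
	unsigned long total = 0;

	for (unsigned long i = 0; i < n; i++) {
		total += depth;	/* ancestor walk from here to the mount root */
		depth--;	/* the ".." itself moves one level up */
	}
	return total;
}
```

For N = 10 and P = 0 this counts 55 hops for the checked walk against 10
for the plain one; with P = 5 the checked walk costs 105.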

Eric

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 0/6] Bind mount escape fixes
       [not found]         ` <874moq9oyb.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                             ` (5 preceding siblings ...)
  2015-04-13 12:18           ` Miklos Szeredi
@ 2015-08-03 21:25           ` Eric W. Biederman
  2015-08-03 21:26             ` [PATCH review 1/6] mnt: Track which mounts use a dentry as root Eric W. Biederman
                               ` (4 more replies)
  6 siblings, 5 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-03 21:25 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel, Jann Horn, Linus Torvalds,
	Willy Tarreau

It is possible in some situations to rename a file or directory through
one mount point such that it can start out inside of a bind mount and
after the rename wind up outside of the bind mount.  Unfortunately with
user namespaces these conditions can be trivially created by creating a
bind mount under an existing bind mount.

I have identified four situations in which this may be a problem.
- __d_path and d_absolute_path need to error on disconnected paths
  that can not reach some root directory or lsm path based security
  checks can incorrectly succeed.

- Normal path name resolution following .. can lead to a directory
  that is outside of the original loopback mount.

- file handle reconstitution aka exportfs_decode_fh can yield a dentry
  from which d_parent can be followed up to mnt->sb->s_root, but
  d_parent can not be followed up to mnt->mnt_root.

- Mounts on a path that has been renamed outside of a loopback mount
  become unreachable, as there is no possible path that can be passed
  to umount to unmount them.

My strategy:

o File handle reconstitution problems can be prevented by enabling
  the nfsd subtree checks for nfs exports, and open_by_handle_at
  requires capable(CAP_DAC_READ_SEARCH) so is only usable by the global
  root.  This makes any problems difficult if not impossible to exploit
  in practice so I have not yet written code to address that issue.

o The functions __d_path and d_absolute_path are augmented so that the
  security modules will not be fed a problematic path to work with.

o Following of .. has been augmented to test that after d_parent has
  been resolved the original directory is connected, and if not
  an error of -ENOENT is returned.

o I do not worry about mounts that are disconnected from their bind
  mount as these mounts can always be freed by either umount -l on
  the bind mount they have escaped from, or by freeing the mount
  namespace.  So I do not believe there is an actual problem.
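
As an illustration of the .. check described above, here is a rough
userspace model.  It is a sketch only, not the kernel code: struct
toy_dentry, toy_is_connected and toy_follow_dotdot are hypothetical
names, and the real check works on struct path under the appropriate
locks.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a dentry: only the parent link matters here. */
struct toy_dentry {
	struct toy_dentry *parent;	/* the filesystem root points at itself */
};

/* Return 1 if walking parent links from dir reaches root, else 0. */
int toy_is_connected(const struct toy_dentry *dir,
		     const struct toy_dentry *root)
{
	assert(dir != NULL && root != NULL);
	while (dir != root) {
		if (dir->parent == dir)	/* reached the filesystem root first */
			return 0;
		dir = dir->parent;
	}
	return 1;
}

/*
 * Model of following "..": resolve the parent, then check that the
 * directory we started from is still below mnt_root.  On success *dir
 * is advanced; on failure it is left alone and -ENOENT is returned.
 */
int toy_follow_dotdot(const struct toy_dentry **dir,
		      const struct toy_dentry *mnt_root)
{
	const struct toy_dentry *orig = *dir;
	const struct toy_dentry *parent;

	if (orig == mnt_root)
		return 0;	/* this model simply stays at the mount root */
	parent = orig->parent;
	if (!toy_is_connected(orig, mnt_root))
		return -ENOENT;
	*dir = parent;
	return 0;
}
```

In this model a directory that has been renamed to somewhere outside
the bind mount root fails the connectivity walk, so .. from it yields
-ENOENT instead of a dentry outside the mount.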

Path name resolution is a common fast path, and most of the code in
this patchset exists to keep following .. from becoming quadratic, as
far as is humanly possible.

For the implementation I went back to the drawing board and carefully
read through the affected code, so I could be certain I knew what was
going on, and this wound up with some very significant implementation
changes from a correctness point of view.

On each mount I keep an escape count which is almost but not quite a
seqcount that is bumped each time a directory escapes a mount point.
This allows marking the mounts that do have directories escape and
allows caching of when a path has been verified to have no escapes, so
in the common case even a mount that has had a directory escape will see
only a single call to d_ancestor during path name resolution the first
time .. is encountered.
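
The caching idea can be modelled in userspace roughly as follows.
This is a sketch under assumptions, not the patch itself: the names
toy_mount, toy_walk and walk_connected are hypothetical, a count of
zero is assumed to mean "nothing has ever escaped", and the callback
stands in for the expensive d_ancestor style walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mount {
	unsigned escape_count;	/* bumped whenever a directory escapes */
};

struct toy_walk {
	const struct toy_mount *mnt;	/* mount whose count was sampled */
	unsigned seen_count;		/* escape_count at last verification */
	bool verified;			/* has anything been verified yet? */
	unsigned slow_checks;		/* number of expensive checks run */
};

/* Placeholder for the expensive connectivity walk; always succeeds. */
bool always_connected(void)
{
	return true;
}

/*
 * Return true if the walk may proceed.  The expensive check runs only
 * when the mount has seen an escape and no verification done at the
 * current escape_count is cached in the walk.
 */
bool walk_connected(struct toy_walk *w, const struct toy_mount *mnt,
		    bool (*expensive_check)(void))
{
	assert(w != NULL && mnt != NULL && expensive_check != NULL);

	if (mnt->escape_count == 0)
		return true;		/* common case: no escapes at all */
	if (w->verified && w->mnt == mnt &&
	    w->seen_count == mnt->escape_count)
		return true;		/* cached verification still valid */

	w->slow_checks++;
	if (!expensive_check())
		return false;
	w->mnt = mnt;
	w->seen_count = mnt->escape_count;
	w->verified = true;
	return true;
}
```

With this shape a mount that never had an escape pays nothing, and a
mount that did pays for one expensive check per walk until the count
changes again.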

I have not benchmarked the code, but I don't see any reason to expect
that anything except rename will see a performance impact, and then
only in cases where a rename potentially allows a directory to escape
lots of mounts.

Do I have something that is good enough this time, or am I blind and
missing something?

These changes are all against v4.2-rc4.

For those who like to see everything in a single tree the code is at:

     git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

Eric W. Biederman (6):
      mnt: Track which mounts use a dentry as root.
      dcache: Handle escaped paths in prepend_path
      dcache: Implement d_common_ancestor
      mnt: Track when a directory escapes a bind mount
      vfs: Test for and handle paths that are unreachable from their mnt_root
      vfs: Cache the results of path_connected

 fs/dcache.c            |  90 ++++++++++++++++--
 fs/mount.h             |  25 +++++
 fs/namei.c             |  59 +++++++++++-
 fs/namespace.c         | 243 ++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/dcache.h |   8 ++
 5 files changed, 409 insertions(+), 16 deletions(-)

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH review 1/6] mnt: Track which mounts use a dentry as root.
       [not found]             ` <871tfkawu9.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-03 21:26               ` Eric W. Biederman
  2015-08-03 21:26               ` [PATCH review 2/6] dcache: Handle escaped paths in prepend_path Eric W. Biederman
                                 ` (5 subsequent siblings)
  6 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-03 21:26 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau


This is needed infrastructure for better handling of when files
or directories are moved out from under the root of a bind mount.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/mount.h             |   7 +++
 fs/namespace.c         | 120 +++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dcache.h |   7 +++
 3 files changed, 130 insertions(+), 4 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 14db05d424f7..e8f22970fe59 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -27,6 +27,12 @@ struct mountpoint {
 	int m_count;
 };
 
+struct mountroot {
+	struct hlist_node r_hash;
+	struct dentry *r_dentry;
+	struct hlist_head r_list;
+};
+
 struct mount {
 	struct hlist_node mnt_hash;
 	struct mount *mnt_parent;
@@ -55,6 +61,7 @@ struct mount {
 	struct mnt_namespace *mnt_ns;	/* containing namespace */
 	struct mountpoint *mnt_mp;	/* where is it mounted */
 	struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
+	struct hlist_node mnt_mr_list;	/* list mounts with the same mountroot */
 #ifdef CONFIG_FSNOTIFY
 	struct hlist_head mnt_fsnotify_marks;
 	__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index 2b8aa15fd6df..2ce987af9afa 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -31,6 +31,8 @@ static unsigned int m_hash_mask __read_mostly;
 static unsigned int m_hash_shift __read_mostly;
 static unsigned int mp_hash_mask __read_mostly;
 static unsigned int mp_hash_shift __read_mostly;
+static unsigned int mr_hash_mask __read_mostly;
+static unsigned int mr_hash_shift __read_mostly;
 
 static __initdata unsigned long mhash_entries;
 static int __init set_mhash_entries(char *str)
@@ -52,6 +54,16 @@ static int __init set_mphash_entries(char *str)
 }
 __setup("mphash_entries=", set_mphash_entries);
 
+static __initdata unsigned long mrhash_entries;
+static int __init set_mrhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	mrhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("mrhash_entries=", set_mrhash_entries);
+
 static u64 event;
 static DEFINE_IDA(mnt_id_ida);
 static DEFINE_IDA(mnt_group_ida);
@@ -61,6 +73,7 @@ static int mnt_group_start = 1;
 
 static struct hlist_head *mount_hashtable __read_mostly;
 static struct hlist_head *mountpoint_hashtable __read_mostly;
+static struct hlist_head *mountroot_hashtable __read_mostly;
 static struct kmem_cache *mnt_cache __read_mostly;
 static DECLARE_RWSEM(namespace_sem);
 
@@ -93,6 +106,13 @@ static inline struct hlist_head *mp_hash(struct dentry *dentry)
 	return &mountpoint_hashtable[tmp & mp_hash_mask];
 }
 
+static inline struct hlist_head *mr_hash(struct dentry *dentry)
+{
+	unsigned long tmp = ((unsigned long)dentry / L1_CACHE_BYTES);
+	tmp = tmp + (tmp >> mr_hash_shift);
+	return &mountroot_hashtable[tmp & mr_hash_mask];
+}
+
 /*
  * allocation is serialized by namespace_sem, but we need the spinlock to
  * serialize with freeing.
@@ -234,6 +254,7 @@ static struct mount *alloc_vfsmnt(const char *name)
 		INIT_LIST_HEAD(&mnt->mnt_slave_list);
 		INIT_LIST_HEAD(&mnt->mnt_slave);
 		INIT_HLIST_NODE(&mnt->mnt_mp_list);
+		INIT_HLIST_NODE(&mnt->mnt_mr_list);
 #ifdef CONFIG_FSNOTIFY
 		INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
 #endif
@@ -779,6 +800,77 @@ static void put_mountpoint(struct mountpoint *mp)
 	}
 }
 
+static struct mountroot *lookup_mountroot(struct dentry *dentry)
+{
+	struct hlist_head *chain = mr_hash(dentry);
+	struct mountroot *mr;
+
+	hlist_for_each_entry(mr, chain, r_hash) {
+		if (mr->r_dentry == dentry)
+			return mr;
+	}
+	return NULL;
+}
+
+static int mnt_set_root(struct mount *mnt, struct dentry *root)
+{
+	struct mountroot *mr = NULL;
+
+	read_seqlock_excl(&mount_lock);
+	if (d_mountroot(root))
+		mr = lookup_mountroot(root);
+	if (!mr) {
+		struct mountroot *new;
+		read_sequnlock_excl(&mount_lock);
+
+		new = kmalloc(sizeof(struct mountroot), GFP_KERNEL);
+		if (!new)
+			return -ENOMEM;
+
+		read_seqlock_excl(&mount_lock);
+		mr = lookup_mountroot(root);
+		if (mr) {
+			kfree(new);
+		} else {
+			struct hlist_head *chain = mr_hash(root);
+
+			mr = new;
+			mr->r_dentry = root;
+			INIT_HLIST_HEAD(&mr->r_list);
+			hlist_add_head(&mr->r_hash, chain);
+
+			spin_lock(&root->d_lock);
+			root->d_flags |= DCACHE_MOUNTROOT;
+			spin_unlock(&root->d_lock);
+		}
+	}
+	mnt->mnt.mnt_root = root;
+	hlist_add_head(&mnt->mnt_mr_list, &mr->r_list);
+	read_sequnlock_excl(&mount_lock);
+
+	return 0;
+}
+
+static void mnt_put_root(struct mount *mnt)
+{
+	struct dentry *root = mnt->mnt.mnt_root;
+	struct mountroot *mr;
+
+	read_seqlock_excl(&mount_lock);
+	mr = lookup_mountroot(root);
+	BUG_ON(!mr);
+	hlist_del(&mnt->mnt_mr_list);
+	if (hlist_empty(&mr->r_list)) {
+		hlist_del(&mr->r_hash);
+		spin_lock(&root->d_lock);
+		root->d_flags &= ~DCACHE_MOUNTROOT;
+		spin_unlock(&root->d_lock);
+		kfree(mr);
+	}
+	read_sequnlock_excl(&mount_lock);
+	dput(root);
+}
+
 static inline int check_mnt(struct mount *mnt)
 {
 	return mnt->mnt_ns == current->nsproxy->mnt_ns;
@@ -934,6 +1026,7 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 {
 	struct mount *mnt;
 	struct dentry *root;
+	int err;
 
 	if (!type)
 		return ERR_PTR(-ENODEV);
@@ -952,8 +1045,16 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 		return ERR_CAST(root);
 	}
 
-	mnt->mnt.mnt_root = root;
 	mnt->mnt.mnt_sb = root->d_sb;
+	err = mnt_set_root(mnt, root);
+	if (err) {
+		dput(root);
+		deactivate_super(mnt->mnt.mnt_sb);
+		mnt_free_id(mnt);
+		free_vfsmnt(mnt);
+		return ERR_PTR(err);
+	}
+
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
 	lock_mount_hash();
@@ -985,6 +1086,10 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 			goto out_free;
 	}
 
+	err = mnt_set_root(mnt, root);
+	if (err)
+		goto out_free;
+
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED);
 	/* Don't allow unprivileged users to change mount flags */
 	if (flag & CL_UNPRIVILEGED) {
@@ -1010,7 +1115,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 
 	atomic_inc(&sb->s_active);
 	mnt->mnt.mnt_sb = sb;
-	mnt->mnt.mnt_root = dget(root);
+	dget(root);
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
 	lock_mount_hash();
@@ -1063,7 +1168,7 @@ static void cleanup_mnt(struct mount *mnt)
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	fsnotify_vfsmount_delete(&mnt->mnt);
-	dput(mnt->mnt.mnt_root);
+	mnt_put_root(mnt);
 	deactivate_super(mnt->mnt.mnt_sb);
 	mnt_free_id(mnt);
 	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
@@ -3120,14 +3225,21 @@ void __init mnt_init(void)
 				mphash_entries, 19,
 				0,
 				&mp_hash_shift, &mp_hash_mask, 0, 0);
+	mountroot_hashtable = alloc_large_system_hash("Mountroot-cache",
+				sizeof(struct hlist_head),
+				mrhash_entries, 19,
+				0,
+				&mr_hash_shift, &mr_hash_mask, 0, 0);
 
-	if (!mount_hashtable || !mountpoint_hashtable)
+	if (!mount_hashtable || !mountpoint_hashtable || !mountroot_hashtable)
 		panic("Failed to allocate mount hash table\n");
 
 	for (u = 0; u <= m_hash_mask; u++)
 		INIT_HLIST_HEAD(&mount_hashtable[u]);
 	for (u = 0; u <= mp_hash_mask; u++)
 		INIT_HLIST_HEAD(&mountpoint_hashtable[u]);
+	for (u = 0; u <= mr_hash_mask; u++)
+		INIT_HLIST_HEAD(&mountroot_hashtable[u]);
 
 	kernfs_init();
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index d67ae119cf4e..52a5e6915f58 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -228,6 +228,8 @@ struct dentry_operations {
 #define DCACHE_FALLTHRU			0x01000000 /* Fall through to lower layer */
 #define DCACHE_OP_SELECT_INODE		0x02000000 /* Unioned entry: dcache op selects inode */
 
+#define DCACHE_MOUNTROOT		0x04000000 /* Root of a vfsmount */
+
 extern seqlock_t rename_lock;
 
 /*
@@ -404,6 +406,11 @@ static inline bool d_mountpoint(const struct dentry *dentry)
 	return dentry->d_flags & DCACHE_MOUNTED;
 }
 
+static inline bool d_mountroot(const struct dentry *dentry)
+{
+	return dentry->d_flags & DCACHE_MOUNTROOT;
+}
+
 /*
  * Directory cache entry type accessor functions.
  */
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 1/6] mnt: Track which mounts use a dentry as root.
  2015-08-03 21:25           ` [PATCH review 0/6] Bind " Eric W. Biederman
@ 2015-08-03 21:26             ` Eric W. Biederman
  2015-08-07 10:46               ` Nikolay Borisov
       [not found]               ` <87vbcw9i8g.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
       [not found]             ` <871tfkawu9.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                               ` (3 subsequent siblings)
  4 siblings, 2 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-03 21:26 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Al Viro, Andy Lutomirski, Serge E. Hallyn,
	Richard Weinberger, Andrey Vagin, Jann Horn, Willy Tarreau,
	Omar Sandoval, Miklos Szeredi, Linus Torvalds, J. Bruce Fields


This is needed infrastructure for better handling of when files
or directories are moved out from under the root of a bind mount.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/mount.h             |   7 +++
 fs/namespace.c         | 120 +++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dcache.h |   7 +++
 3 files changed, 130 insertions(+), 4 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 14db05d424f7..e8f22970fe59 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -27,6 +27,12 @@ struct mountpoint {
 	int m_count;
 };
 
+struct mountroot {
+	struct hlist_node r_hash;
+	struct dentry *r_dentry;
+	struct hlist_head r_list;
+};
+
 struct mount {
 	struct hlist_node mnt_hash;
 	struct mount *mnt_parent;
@@ -55,6 +61,7 @@ struct mount {
 	struct mnt_namespace *mnt_ns;	/* containing namespace */
 	struct mountpoint *mnt_mp;	/* where is it mounted */
 	struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
+	struct hlist_node mnt_mr_list;	/* list mounts with the same mountroot */
 #ifdef CONFIG_FSNOTIFY
 	struct hlist_head mnt_fsnotify_marks;
 	__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index 2b8aa15fd6df..2ce987af9afa 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -31,6 +31,8 @@ static unsigned int m_hash_mask __read_mostly;
 static unsigned int m_hash_shift __read_mostly;
 static unsigned int mp_hash_mask __read_mostly;
 static unsigned int mp_hash_shift __read_mostly;
+static unsigned int mr_hash_mask __read_mostly;
+static unsigned int mr_hash_shift __read_mostly;
 
 static __initdata unsigned long mhash_entries;
 static int __init set_mhash_entries(char *str)
@@ -52,6 +54,16 @@ static int __init set_mphash_entries(char *str)
 }
 __setup("mphash_entries=", set_mphash_entries);
 
+static __initdata unsigned long mrhash_entries;
+static int __init set_mrhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	mrhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("mrhash_entries=", set_mrhash_entries);
+
 static u64 event;
 static DEFINE_IDA(mnt_id_ida);
 static DEFINE_IDA(mnt_group_ida);
@@ -61,6 +73,7 @@ static int mnt_group_start = 1;
 
 static struct hlist_head *mount_hashtable __read_mostly;
 static struct hlist_head *mountpoint_hashtable __read_mostly;
+static struct hlist_head *mountroot_hashtable __read_mostly;
 static struct kmem_cache *mnt_cache __read_mostly;
 static DECLARE_RWSEM(namespace_sem);
 
@@ -93,6 +106,13 @@ static inline struct hlist_head *mp_hash(struct dentry *dentry)
 	return &mountpoint_hashtable[tmp & mp_hash_mask];
 }
 
+static inline struct hlist_head *mr_hash(struct dentry *dentry)
+{
+	unsigned long tmp = ((unsigned long)dentry / L1_CACHE_BYTES);
+	tmp = tmp + (tmp >> mr_hash_shift);
+	return &mountroot_hashtable[tmp & mr_hash_mask];
+}
+
 /*
  * allocation is serialized by namespace_sem, but we need the spinlock to
  * serialize with freeing.
@@ -234,6 +254,7 @@ static struct mount *alloc_vfsmnt(const char *name)
 		INIT_LIST_HEAD(&mnt->mnt_slave_list);
 		INIT_LIST_HEAD(&mnt->mnt_slave);
 		INIT_HLIST_NODE(&mnt->mnt_mp_list);
+		INIT_HLIST_NODE(&mnt->mnt_mr_list);
 #ifdef CONFIG_FSNOTIFY
 		INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
 #endif
@@ -779,6 +800,77 @@ static void put_mountpoint(struct mountpoint *mp)
 	}
 }
 
+static struct mountroot *lookup_mountroot(struct dentry *dentry)
+{
+	struct hlist_head *chain = mr_hash(dentry);
+	struct mountroot *mr;
+
+	hlist_for_each_entry(mr, chain, r_hash) {
+		if (mr->r_dentry == dentry)
+			return mr;
+	}
+	return NULL;
+}
+
+static int mnt_set_root(struct mount *mnt, struct dentry *root)
+{
+	struct mountroot *mr = NULL;
+
+	read_seqlock_excl(&mount_lock);
+	if (d_mountroot(root))
+		mr = lookup_mountroot(root);
+	if (!mr) {
+		struct mountroot *new;
+		read_sequnlock_excl(&mount_lock);
+
+		new = kmalloc(sizeof(struct mountroot), GFP_KERNEL);
+		if (!new)
+			return -ENOMEM;
+
+		read_seqlock_excl(&mount_lock);
+		mr = lookup_mountroot(root);
+		if (mr) {
+			kfree(new);
+		} else {
+			struct hlist_head *chain = mr_hash(root);
+
+			mr = new;
+			mr->r_dentry = root;
+			INIT_HLIST_HEAD(&mr->r_list);
+			hlist_add_head(&mr->r_hash, chain);
+
+			spin_lock(&root->d_lock);
+			root->d_flags |= DCACHE_MOUNTROOT;
+			spin_unlock(&root->d_lock);
+		}
+	}
+	mnt->mnt.mnt_root = root;
+	hlist_add_head(&mnt->mnt_mr_list, &mr->r_list);
+	read_sequnlock_excl(&mount_lock);
+
+	return 0;
+}
+
+static void mnt_put_root(struct mount *mnt)
+{
+	struct dentry *root = mnt->mnt.mnt_root;
+	struct mountroot *mr;
+
+	read_seqlock_excl(&mount_lock);
+	mr = lookup_mountroot(root);
+	BUG_ON(!mr);
+	hlist_del(&mnt->mnt_mr_list);
+	if (hlist_empty(&mr->r_list)) {
+		hlist_del(&mr->r_hash);
+		spin_lock(&root->d_lock);
+		root->d_flags &= ~DCACHE_MOUNTROOT;
+		spin_unlock(&root->d_lock);
+		kfree(mr);
+	}
+	read_sequnlock_excl(&mount_lock);
+	dput(root);
+}
+
 static inline int check_mnt(struct mount *mnt)
 {
 	return mnt->mnt_ns == current->nsproxy->mnt_ns;
@@ -934,6 +1026,7 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 {
 	struct mount *mnt;
 	struct dentry *root;
+	int err;
 
 	if (!type)
 		return ERR_PTR(-ENODEV);
@@ -952,8 +1045,16 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 		return ERR_CAST(root);
 	}
 
-	mnt->mnt.mnt_root = root;
 	mnt->mnt.mnt_sb = root->d_sb;
+	err = mnt_set_root(mnt, root);
+	if (err) {
+		dput(root);
+		deactivate_super(mnt->mnt.mnt_sb);
+		mnt_free_id(mnt);
+		free_vfsmnt(mnt);
+		return ERR_PTR(err);
+	}
+
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
 	lock_mount_hash();
@@ -985,6 +1086,10 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 			goto out_free;
 	}
 
+	err = mnt_set_root(mnt, root);
+	if (err)
+		goto out_free;
+
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED);
 	/* Don't allow unprivileged users to change mount flags */
 	if (flag & CL_UNPRIVILEGED) {
@@ -1010,7 +1115,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 
 	atomic_inc(&sb->s_active);
 	mnt->mnt.mnt_sb = sb;
-	mnt->mnt.mnt_root = dget(root);
+	dget(root);
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
 	lock_mount_hash();
@@ -1063,7 +1168,7 @@ static void cleanup_mnt(struct mount *mnt)
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	fsnotify_vfsmount_delete(&mnt->mnt);
-	dput(mnt->mnt.mnt_root);
+	mnt_put_root(mnt);
 	deactivate_super(mnt->mnt.mnt_sb);
 	mnt_free_id(mnt);
 	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
@@ -3120,14 +3225,21 @@ void __init mnt_init(void)
 				mphash_entries, 19,
 				0,
 				&mp_hash_shift, &mp_hash_mask, 0, 0);
+	mountroot_hashtable = alloc_large_system_hash("Mountroot-cache",
+				sizeof(struct hlist_head),
+				mrhash_entries, 19,
+				0,
+				&mr_hash_shift, &mr_hash_mask, 0, 0);
 
-	if (!mount_hashtable || !mountpoint_hashtable)
+	if (!mount_hashtable || !mountpoint_hashtable || !mountroot_hashtable)
 		panic("Failed to allocate mount hash table\n");
 
 	for (u = 0; u <= m_hash_mask; u++)
 		INIT_HLIST_HEAD(&mount_hashtable[u]);
 	for (u = 0; u <= mp_hash_mask; u++)
 		INIT_HLIST_HEAD(&mountpoint_hashtable[u]);
+	for (u = 0; u <= mr_hash_mask; u++)
+		INIT_HLIST_HEAD(&mountroot_hashtable[u]);
 
 	kernfs_init();
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index d67ae119cf4e..52a5e6915f58 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -228,6 +228,8 @@ struct dentry_operations {
 #define DCACHE_FALLTHRU			0x01000000 /* Fall through to lower layer */
 #define DCACHE_OP_SELECT_INODE		0x02000000 /* Unioned entry: dcache op selects inode */
 
+#define DCACHE_MOUNTROOT		0x04000000 /* Root of a vfsmount */
+
 extern seqlock_t rename_lock;
 
 /*
@@ -404,6 +406,11 @@ static inline bool d_mountpoint(const struct dentry *dentry)
 	return dentry->d_flags & DCACHE_MOUNTED;
 }
 
+static inline bool d_mountroot(const struct dentry *dentry)
+{
+	return dentry->d_flags & DCACHE_MOUNTROOT;
+}
+
 /*
  * Directory cache entry type accessor functions.
  */
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 2/6] dcache: Handle escaped paths in prepend_path
       [not found]             ` <871tfkawu9.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-08-03 21:26               ` Eric W. Biederman
@ 2015-08-03 21:26               ` Eric W. Biederman
  2015-08-03 21:27               ` [PATCH review 3/6] dcache: Implement d_common_ancestor Eric W. Biederman
                                 ` (4 subsequent siblings)
  6 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-03 21:26 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

A rename can result in a dentry that by walking up d_parent
will never reach its mnt_root.  For lack of a better term
I call this an escaped path.

prepend_path is called by four different functions: __d_path,
d_absolute_path, d_path, and getcwd.

__d_path only wants to see paths that are connected to the root it
passes in.  So __d_path needs prepend_path to return an error.

d_absolute_path similarly wants to see paths that are connected to
some root.  Escaped paths are not connected to any mnt_root so
d_absolute_path needs prepend_path to return an error greater
than 1.  So escaped paths will be treated like paths on lazily
unmounted mounts.

getcwd needs to prepend "(unreachable)" so getcwd also needs
prepend_path to return an error.

d_path is the interesting holdout.  d_path just wants to print
something, and does not care about the weird cases.  Which raises
the question what should be printed?

Given that <escaped_path>/<anything> should result in -ENOENT I
believe it is desirable for escaped paths to be printed as empty
paths, as there are not really any meaningful path components when
considered from the perspective of a mount tree.

So tweak prepend_path to return an empty path with a new error
code of 3 when it encounters an escaped path.
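
How the four callers would react to the return code can be modelled
roughly as below.  This is a userspace sketch with hypothetical names;
only the value 3 for an escaped path comes from this patch, and the
meanings attached to 1 and 2 are assumptions about the surrounding
code that should be checked against fs/dcache.c.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return codes of prepend_path as modelled here.  Only PP_ESCAPED (3)
 * comes from the patch; the meanings of 1 and 2 are assumptions.
 */
enum pp_code {
	PP_OK		= 0,	/* reached the root that was passed in */
	PP_OTHER_ROOT	= 1,	/* assumed: reached some other root */
	PP_DETACHED	= 2,	/* assumed: lazily unmounted mount */
	PP_ESCAPED	= 3,	/* new: dentry escaped its mnt_root */
};

/* __d_path: only paths connected to the passed-in root are acceptable. */
int toy_d_path_accepts(enum pp_code code)
{
	return code == PP_OK;
}

/* d_absolute_path: any root is fine, codes greater than 1 are errors. */
int toy_d_absolute_path_accepts(enum pp_code code)
{
	return code <= PP_OTHER_ROOT;
}

/* getcwd: any nonzero code gets the "(unreachable)" prefix. */
const char *toy_getcwd_prefix(enum pp_code code)
{
	return code != PP_OK ? "(unreachable)" : "";
}

/* d_path: prints whatever it got; an escaped path comes back empty. */
const char *toy_d_path_text(enum pp_code code, const char *path)
{
	assert(path != NULL);
	return code == PP_ESCAPED ? "" : path;
}
```

The point of picking a value above 2 is that the existing "greater
than 0" and "greater than 1" tests in the callers already treat it as
a failure without further changes.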

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/dcache.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 5c8ea15e73a5..d7fe995dd32d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2926,6 +2926,13 @@ restart:

 		if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
 			struct mount *parent = ACCESS_ONCE(mnt->mnt_parent);
+			/* Escaped? */
+			if (dentry != vfsmnt->mnt_root) {
+				bptr = *buffer;
+				blen = *buflen;
+				error = 3;
+				break;
+			}
 			/* Global root? */
 			if (mnt != parent) {
 				dentry = ACCESS_ONCE(mnt->mnt_mountpoint);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 3/6] dcache: Implement d_common_ancestor
       [not found]             ` <871tfkawu9.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-08-03 21:26               ` Eric W. Biederman
  2015-08-03 21:26               ` [PATCH review 2/6] dcache: Handle escaped paths in prepend_path Eric W. Biederman
@ 2015-08-03 21:27               ` Eric W. Biederman
  2015-08-03 21:27               ` [PATCH review 4/6] mnt: Track when a directory escapes a bind mount Eric W. Biederman
                                 ` (3 subsequent siblings)
  6 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-03 21:27 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau


If possible find the common ancestor of two dentries.

This is necessary infrastructure for better handling the case
when a dentry is moved out from under the root of a bind mount.
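
The approach (equalize depths, then walk both sides up in lockstep)
can be tried out in userspace with a toy tree.  This is a sketch only:
struct tnode, tnode_depth and tnode_common_ancestor are hypothetical
stand-ins for struct dentry and the functions added below, and no
locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a dentry; a root node points at itself. */
struct tnode {
	struct tnode *parent;
};

/* Number of parent links between n and the root of its tree. */
unsigned long tnode_depth(const struct tnode *n)
{
	unsigned long depth = 0;

	assert(n != NULL);
	while (n->parent != n) {
		n = n->parent;
		depth++;
	}
	return depth;
}

/* Deepest node that is an ancestor of both, or NULL if trees differ. */
const struct tnode *tnode_common_ancestor(const struct tnode *left,
					  const struct tnode *right)
{
	unsigned long ldepth = tnode_depth(left);
	unsigned long rdepth = tnode_depth(right);

	/* Bring the deeper side up to the depth of the shallower one. */
	for (; ldepth > rdepth; ldepth--)
		left = left->parent;
	for (; rdepth > ldepth; rdepth--)
		right = right->parent;

	/* Both sides are at equal depth; climb until they meet. */
	while (left != right) {
		if (left->parent == left)
			return NULL;	/* two different roots */
		left = left->parent;
		right = right->parent;
	}
	return left;
}
```

A node counts as its own ancestor here, so asking for the common
ancestor of a directory and one of its descendants returns the
directory itself.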

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/dcache.c            | 37 +++++++++++++++++++++++++++++++++++++
 include/linux/dcache.h |  1 +
 2 files changed, 38 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index d7fe995dd32d..9f4de1007a8d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2472,6 +2472,43 @@ void dentry_update_name_case(struct dentry *dentry, struct qstr *name)
 }
 EXPORT_SYMBOL(dentry_update_name_case);
 
+static unsigned long d_depth(const struct dentry *dentry)
+{
+	unsigned long depth = 0;
+
+	while (!IS_ROOT(dentry)) {
+		dentry = dentry->d_parent;
+		depth++;
+	}
+	return depth;
+}
+
+const struct dentry *d_common_ancestor(const struct dentry *left,
+				       const struct dentry *right)
+{
+	unsigned long ldepth = d_depth(left);
+	unsigned long rdepth = d_depth(right);
+
+	while (ldepth > rdepth) {
+		left = left->d_parent;
+		ldepth--;
+	}
+
+	while (rdepth > ldepth) {
+		right = right->d_parent;
+		rdepth--;
+	}
+
+	while (left != right) {
+		if (IS_ROOT(left))
+			return NULL;
+		left = left->d_parent;
+		right = right->d_parent;
+	}
+
+	return left;
+}
+
 static void swap_names(struct dentry *dentry, struct dentry *target)
 {
 	if (unlikely(dname_external(target))) {
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 52a5e6915f58..56de8288cdee 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -313,6 +313,7 @@ extern void dentry_update_name_case(struct dentry *, struct qstr *);
 extern void d_move(struct dentry *, struct dentry *);
 extern void d_exchange(struct dentry *, struct dentry *);
 extern struct dentry *d_ancestor(struct dentry *, struct dentry *);
+extern const struct dentry *d_common_ancestor(const struct dentry *, const struct dentry *);
 
 /* appendix may either be NULL or be used for transname suffixes */
 extern struct dentry *d_lookup(const struct dentry *, const struct qstr *);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 4/6] mnt: Track when a directory escapes a bind mount
       [not found]             ` <871tfkawu9.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                                 ` (2 preceding siblings ...)
  2015-08-03 21:27               ` [PATCH review 3/6] dcache: Implement d_common_ancestor Eric W. Biederman
@ 2015-08-03 21:27               ` Eric W. Biederman
  2015-08-03 21:30               ` [PATCH review 5/6] vfs: Test for and handle paths that are unreachable from their mnt_root Eric W. Biederman
                                 ` (2 subsequent siblings)
  6 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-03 21:27 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau


When bind mounts are in use, and there is another path to the
filesystem it is possible to rename files or directories from a path
underneath the root of the bind mount to a path that is not underneath
the root of the bind mount.

When a directory is moved out from under the root of a bind mount, path
name lookups that go up the directory tree potentially allow accessing
the entire dentry tree of the filesystem.  This is not expected, not
what is desired, and winds up being a security problem for userspace.

Augment d_move, d_exchange, and __d_unalias with matching calls to
lock_namespace_rename and unlock_namespace_rename, to mark mounts
that directories have escaped from.

A few notes on the implementation:

- The escape count on struct mount must be incremented both before the
  rename and after.  If the count is not incremented before the rename
  it is possible to hit a scenario where the rename happens and the
  code walks up the directory tree to somewhere outside of the bind
  mount before the count is touched.  Similarly without an increment
  after the rename it is possible for the code to look at the escape
  count, validate that a path is connected before the rename, and
  cache the escape count, leading to not retesting that the path is ok.

- A lock, either namespace_sem or mount_lock, needs to be held across
  the duration of renames where a directory could be escaping to
  guarantee pairing of the escape_count increments and to ensure
  that a mount is not added, escaped, and missed during the rename.

- The locking order must be mount_lock outside of rename_lock
  as prepend_path already takes the locks in this order.

- I have audited all callers of d_move and d_exchange and in every
  instance it appears safe for d_move and d_exchange to start
  sleeping.  d_splice_alias already sleeps in security_d_instantiate
  so no audit was needed for it to begin sleeping.

  As I can just take mount_lock I don't use that freedom in this
  change, but it can be relevant to small changes to the locking in
  this code.

- The largest change is in __d_unalias, where the two cases are split
  apart so they can be handled separately.  In the easy case of a
  rename within the same directory all that is needed is __d_move
  (escaping a mount is impossible in that case).  In the more involved
  case mutexes need to be acquired, and now the spin locks need to be
  dropped so that proper lock acquisition order around __d_move can be
  arranged.

  As I read the code inode->i_lock needs to be held until
  alias->d_parent->d_inode->i_mutex is taken.  The only case I can
  see that removes an inode from a dentry is d_delete called from a
  path like vfs_unlink.  Those paths all take the parent directory's
  inode mutex.  Thus once the parent directory's inode mutex is held
  it becomes unnecessary to hold inode->i_lock to ensure the alias
  remains an alias.

  Similarly the rename_lock does not need to be held once the
  s_vfs_rename_mutex is taken.
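
The reason for bumping the escape count on both sides of the rename
can be seen in a small single-threaded model.  This is a sketch with
hypothetical names (emount, ecache, rename_begin, rename_end); the
real code does this under mount_lock and none of that locking is
modelled here.

```c
#include <assert.h>
#include <stddef.h>

struct emount {
	unsigned escape_count;	/* bumped before and after a rename */
};

struct ecache {
	unsigned seen;		/* escape_count sampled at verification */
};

/* Bump before the rename so lookups stop trusting the fast path. */
void rename_begin(struct emount *m)
{
	assert(m != NULL);
	m->escape_count++;
}

/* Bump again afterwards so results cached mid-rename are dropped. */
void rename_end(struct emount *m)
{
	assert(m != NULL);
	m->escape_count++;
}

/* Record that a path was verified at the current escape_count. */
void cache_verify(struct ecache *c, const struct emount *m)
{
	assert(c != NULL && m != NULL);
	c->seen = m->escape_count;
}

/* A cached verification holds only while the count is unchanged. */
int cache_valid(const struct ecache *c, const struct emount *m)
{
	assert(c != NULL && m != NULL);
	return c->seen == m->escape_count;
}
```

A verification cached before rename_begin is invalidated by the first
bump, and one cached between the two bumps (possibly against the old
tree) is invalidated by the second.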

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/dcache.c    |  46 +++++++++++++++++----
 fs/mount.h     |  18 +++++++++
 fs/namespace.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 180 insertions(+), 7 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9f4de1007a8d..c25ef7ef8e7f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2707,9 +2707,17 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
  */
 void d_move(struct dentry *dentry, struct dentry *target)
 {
+	const struct dentry *unlock;
+
+	unlock = lock_namespace_rename(dentry, target, false);
+
 	write_seqlock(&rename_lock);
 	__d_move(dentry, target, false);
 	write_sequnlock(&rename_lock);
+
+	if (unlock)
+		unlock_namespace_rename(unlock, dentry, target, false);
+
 }
 EXPORT_SYMBOL(d_move);
 
@@ -2720,6 +2728,10 @@ EXPORT_SYMBOL(d_move);
  */
 void d_exchange(struct dentry *dentry1, struct dentry *dentry2)
 {
+	const struct dentry *unlock;
+
+	unlock = lock_namespace_rename(dentry1, dentry2, true);
+
 	write_seqlock(&rename_lock);
 
 	WARN_ON(!dentry1->d_inode);
@@ -2730,6 +2742,9 @@ void d_exchange(struct dentry *dentry1, struct dentry *dentry2)
 	__d_move(dentry1, dentry2, true);
 
 	write_sequnlock(&rename_lock);
+
+	if (unlock)
+		unlock_namespace_rename(unlock, dentry1, dentry2, true);
 }
 
 /**
@@ -2764,11 +2779,15 @@ static int __d_unalias(struct inode *inode,
 		struct dentry *dentry, struct dentry *alias)
 {
 	struct mutex *m1 = NULL, *m2 = NULL;
-	int ret = -ESTALE;
+	const struct dentry *unlock;
 
 	/* If alias and dentry share a parent, then no extra locks required */
-	if (alias->d_parent == dentry->d_parent)
-		goto out_unalias;
+	if (alias->d_parent == dentry->d_parent) {
+		__d_move(alias, dentry, false);
+		spin_unlock(&inode->i_lock);
+		write_sequnlock(&rename_lock);
+		return 0;
+	}
 
 	/* See lock_rename() */
 	if (!mutex_trylock(&dentry->d_sb->s_vfs_rename_mutex))
@@ -2777,16 +2796,30 @@ static int __d_unalias(struct inode *inode,
 	if (!mutex_trylock(&alias->d_parent->d_inode->i_mutex))
 		goto out_err;
 	m2 = &alias->d_parent->d_inode->i_mutex;
-out_unalias:
+
+	spin_unlock(&inode->i_lock);
+	write_sequnlock(&rename_lock);
+
+	unlock = lock_namespace_rename(alias, dentry, false);
+
+	write_seqlock(&rename_lock);
 	__d_move(alias, dentry, false);
-	ret = 0;
+	write_sequnlock(&rename_lock);
+
+	if (unlock)
+		unlock_namespace_rename(unlock, alias, dentry, false);
+
+	mutex_unlock(m2);
+	mutex_unlock(m1);
+	return 0;
 out_err:
 	spin_unlock(&inode->i_lock);
+	write_sequnlock(&rename_lock);
 	if (m2)
 		mutex_unlock(m2);
 	if (m1)
 		mutex_unlock(m1);
-	return ret;
+	return -ESTALE;
 }
 
 /**
@@ -2841,7 +2874,6 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 					inode->i_sb->s_id);
 			} else if (!IS_ROOT(new)) {
 				int err = __d_unalias(inode, dentry, new);
-				write_sequnlock(&rename_lock);
 				if (err) {
 					dput(new);
 					new = ERR_PTR(err);
diff --git a/fs/mount.h b/fs/mount.h
index e8f22970fe59..d32d074cc0d4 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -38,6 +38,7 @@ struct mount {
 	struct mount *mnt_parent;
 	struct dentry *mnt_mountpoint;
 	struct vfsmount mnt;
+	unsigned mnt_escape_count;
 	union {
 		struct rcu_head mnt_rcu;
 		struct llist_node mnt_llist;
@@ -107,6 +108,23 @@ static inline void detach_mounts(struct dentry *dentry)
 	__detach_mounts(dentry);
 }
 
+extern const struct dentry *lock_namespace_rename(struct dentry *, struct dentry *, bool);
+extern void unlock_namespace_rename(const struct dentry *, struct dentry *, struct dentry *, bool);
+
+static inline unsigned read_mnt_escape_count(struct vfsmount *vfsmount)
+{
+	struct mount *mnt = real_mount(vfsmount);
+	unsigned ret = READ_ONCE(mnt->mnt_escape_count);
+	smp_rmb();
+	return ret;
+}
+
+static inline void cache_mnt_escape_count(unsigned *cache, unsigned escape_count)
+{
+	if (likely(escape_count & 1) == 0)
+		*cache = escape_count;
+}
+
 static inline void get_mnt_ns(struct mnt_namespace *ns)
 {
 	atomic_inc(&ns->count);
diff --git a/fs/namespace.c b/fs/namespace.c
index 2ce987af9afa..9faec24f3f23 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1681,6 +1681,129 @@ out_unlock:
 	namespace_unlock();
 }
 
+static void lock_escaped_mounts_begin(struct dentry *root)
+{
+	struct mountroot *mr;
+	struct mount *mnt;
+
+	mr = lookup_mountroot(root);
+	if (mr) {
+		/* Mark each mount from which a directory is escaping.
+		 */
+		hlist_for_each_entry(mnt, &mr->r_list, mnt_mr_list) {
+			/* Don't return to 0 if the count wraps */
+			if (unlikely(mnt->mnt_escape_count == (0U - 2)))
+				mnt->mnt_escape_count = 1;
+			else
+				mnt->mnt_escape_count++;
+			smp_wmb();
+		}
+	}
+}
+
+static void lock_escaped_mounts_end(struct dentry *root)
+{
+	struct mountroot *mr;
+	struct mount *mnt;
+
+	mr = lookup_mountroot(root);
+	if (mr) {
+		/* Mark each mount from which a directory is escaping.
+		 */
+		hlist_for_each_entry(mnt, &mr->r_list, mnt_mr_list) {
+			smp_wmb();
+			mnt->mnt_escape_count++;
+		}
+	}
+}
+
+static void handle_mount_escapes_begin(const struct dentry *ancestor,
+				       struct dentry *escapee)
+{
+	struct dentry *dentry;
+
+	/* Don't look for non-directory escapes */
+	if (!d_is_dir(escapee))
+		return;
+
+	for (dentry = escapee->d_parent; dentry != ancestor;
+	     dentry = dentry->d_parent) {
+
+		if (d_mountroot(dentry))
+			lock_escaped_mounts_begin(dentry);
+
+		/* In case there is no common ancestor */
+		if (IS_ROOT(dentry))
+			break;
+	}
+}
+
+static void handle_mount_escapes_end(const struct dentry *ancestor, struct dentry *escapee)
+{
+	struct dentry *dentry;
+
+	/* Don't look for non-directory escapes */
+	if (!d_is_dir(escapee))
+		return;
+
+	for (dentry = escapee->d_parent; dentry != ancestor;
+	     dentry = dentry->d_parent) {
+
+		if (d_mountroot(dentry))
+			lock_escaped_mounts_end(dentry);
+
+		/* In case there is no common ancestor */
+		if (IS_ROOT(dentry))
+			break;
+	}
+	return;
+}
+
+const struct dentry *lock_namespace_rename(struct dentry *dentry1,
+					   struct dentry *dentry2, bool exchange)
+{
+	const struct dentry *ancestor;
+
+	if (dentry1->d_parent == dentry2->d_parent)
+		return NULL;
+
+	if (!d_is_dir(dentry1) && (!exchange || !d_is_dir(dentry2)))
+		return NULL;
+
+	ancestor = d_common_ancestor(dentry1, dentry2);
+
+	read_seqlock_excl(&mount_lock);
+	if (!exchange) {
+		handle_mount_escapes_begin(ancestor, dentry1);
+	} else {
+		handle_mount_escapes_begin(ancestor, dentry1);
+		handle_mount_escapes_begin(ancestor, dentry2);
+	}
+
+	if (ancestor == NULL)
+		ancestor = ERR_PTR(-ENOENT);
+
+	return ancestor;
+}
+
+void unlock_namespace_rename(const struct dentry *ancestor, struct dentry *dentry1,
+			     struct dentry *dentry2, bool exchange)
+{
+	if (!ancestor)
+		return;
+
+	if (ancestor == ERR_PTR(-ENOENT))
+		ancestor = NULL;
+
+	if (!exchange) {
+		handle_mount_escapes_end(ancestor, dentry2);
+	} else {
+		handle_mount_escapes_end(ancestor, dentry2);
+		handle_mount_escapes_end(ancestor, dentry1);
+	}
+	read_sequnlock_excl(&mount_lock);
+}
+
 /* 
  * Is the caller allowed to modify his namespace?
  */
-- 
2.2.1


* [PATCH review 4/6] mnt: Track when a directory escapes a bind mount
  2015-08-03 21:25           ` [PATCH review 0/6] Bind " Eric W. Biederman
  2015-08-03 21:26             ` [PATCH review 1/6] mnt: Track which mounts use a dentry as root Eric W. Biederman
       [not found]             ` <871tfkawu9.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-03 21:27             ` Eric W. Biederman
       [not found]               ` <87egjk9i61.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-08-03 21:30             ` [PATCH review 5/6] " Eric W. Biederman
  2015-08-03 21:30             ` [PATCH review 6/6] vfs: Cache the results of path_connected Eric W. Biederman
  4 siblings, 1 reply; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-03 21:27 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Al Viro, Andy Lutomirski, Serge E. Hallyn,
	Richard Weinberger, Andrey Vagin, Jann Horn, Willy Tarreau,
	Omar Sandoval, Miklos Szeredi, Linus Torvalds, J. Bruce Fields


When bind mounts are in use and there is another path to the
filesystem, it is possible to rename files or directories from a path
underneath the root of the bind mount to a path that is not underneath
the root of the bind mount.

When a directory is moved out from under the root of a bind mount, path
name lookups that go up the directory tree potentially allow accessing
the entire dentry tree of the filesystem.  This is not expected, not
what is desired, and winds up being a security problem for userspace.

Augment d_move, d_exchange, and __d_unalias with matching calls to
lock_namespace_rename and unlock_namespace_rename, to mark mounts
that directories have escaped from.

A few notes on the implementation:

- The escape count on struct mount must be incremented both before the
  rename and after.  If the count is not incremented before the rename,
  it is possible to hit a scenario where the rename happens and the code
  walks up the directory tree to somewhere outside of the bind mount
  before the count is touched.  Similarly, without an increment after
  the rename it is possible for the code to look at the escape count,
  validate that a path is connected before the rename, and cache the
  escape count, leading to the path not being retested after the rename.

- A lock, either namespace_sem or mount_lock, needs to be held across
  the duration of renames where a directory could be escaping to
  guarantee pairing of the escape_count increments and to ensure
  that a mount is not added, escaped, and missed during the rename.

- The locking order must be mount_lock outside of rename_lock
  as prepend_path already takes the locks in this order.

- I have audited all callers of d_move and d_exchange and in every
  instance it appears safe for d_move and d_exchange to start
  sleeping.  d_splice_alias already sleeps in security_d_instantiate
  so no audit was needed for it to begin sleeping.

  As I can just take mount_lock I don't use that freedom in this
  change, but it can be relevant to small changes to the locking in
  this code.

- The largest change is in __d_unalias, where the two cases are split
  apart so they can be handled separately.  In the easy case of a
  rename within the same directory all that is needed is __d_move
  (escaping a mount is impossible in that case).  In the more involved
  case mutexes need to be acquired, and now the spin locks need to be
  dropped so that proper lock acquisition order around __d_move can be
  arranged.

  As I read the code, inode->i_lock needs to be held until
  alias->d_parent->d_inode->i_mutex is taken.  The only case I can
  see that removes an inode from a dentry is d_delete called from a
  path like vfs_unlink.  Those paths all take the parent directory's
  inode mutex.  Thus once the parent directory's inode mutex is held
  it becomes unnecessary to hold inode->i_lock to ensure the alias
  remains an alias.

  Similarly the rename_lock does not need to be held once the
  s_vfs_rename_mutex is taken.
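
The two-increment scheme from the first note behaves much like a
sequence counter: the count is odd while a rename is in flight, even
once it has finished, and only even values are safe to cache.  A minimal
user-space sketch of that protocol follows.  The names struct toy_mount,
escape_begin, escape_end and cache_escape_count are hypothetical and
exist only for illustration; the memory barriers and the mount_lock
held by the real patch are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the mnt_escape_count field of struct mount. */
struct toy_mount {
	unsigned escape_count;
};

/* Before the rename: make the count odd, never landing back on 0. */
static void escape_begin(struct toy_mount *mnt)
{
	assert((mnt->escape_count & 1) == 0);	/* no rename in flight yet */
	if (mnt->escape_count == (0U - 2))
		mnt->escape_count = 1;		/* skip 0 when the count wraps */
	else
		mnt->escape_count++;
}

/* After the rename: make the count even again. */
static void escape_end(struct toy_mount *mnt)
{
	assert((mnt->escape_count & 1) != 0);	/* escape_begin ran first */
	mnt->escape_count++;
}

/* Cache only even values; an odd value means a rename is in flight. */
static bool cache_escape_count(unsigned *cache, unsigned escape_count)
{
	if ((escape_count & 1) != 0)
		return false;
	*cache = escape_count;
	return true;
}
```

A reader that samples the count between escape_begin and escape_end
sees an odd value and therefore never caches it, which is what forces
the path to be retested once the rename has completed.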

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/dcache.c    |  46 +++++++++++++++++----
 fs/mount.h     |  18 +++++++++
 fs/namespace.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 180 insertions(+), 7 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9f4de1007a8d..c25ef7ef8e7f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2707,9 +2707,17 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
  */
 void d_move(struct dentry *dentry, struct dentry *target)
 {
+	const struct dentry *unlock;
+
+	unlock = lock_namespace_rename(dentry, target, false);
+
 	write_seqlock(&rename_lock);
 	__d_move(dentry, target, false);
 	write_sequnlock(&rename_lock);
+
+	if (unlock)
+		unlock_namespace_rename(unlock, dentry, target, false);
+
 }
 EXPORT_SYMBOL(d_move);
 
@@ -2720,6 +2728,10 @@ EXPORT_SYMBOL(d_move);
  */
 void d_exchange(struct dentry *dentry1, struct dentry *dentry2)
 {
+	const struct dentry *unlock;
+
+	unlock = lock_namespace_rename(dentry1, dentry2, true);
+
 	write_seqlock(&rename_lock);
 
 	WARN_ON(!dentry1->d_inode);
@@ -2730,6 +2742,9 @@ void d_exchange(struct dentry *dentry1, struct dentry *dentry2)
 	__d_move(dentry1, dentry2, true);
 
 	write_sequnlock(&rename_lock);
+
+	if (unlock)
+		unlock_namespace_rename(unlock, dentry1, dentry2, true);
 }
 
 /**
@@ -2764,11 +2779,15 @@ static int __d_unalias(struct inode *inode,
 		struct dentry *dentry, struct dentry *alias)
 {
 	struct mutex *m1 = NULL, *m2 = NULL;
-	int ret = -ESTALE;
+	const struct dentry *unlock;
 
 	/* If alias and dentry share a parent, then no extra locks required */
-	if (alias->d_parent == dentry->d_parent)
-		goto out_unalias;
+	if (alias->d_parent == dentry->d_parent) {
+		__d_move(alias, dentry, false);
+		spin_unlock(&inode->i_lock);
+		write_sequnlock(&rename_lock);
+		return 0;
+	}
 
 	/* See lock_rename() */
 	if (!mutex_trylock(&dentry->d_sb->s_vfs_rename_mutex))
@@ -2777,16 +2796,30 @@ static int __d_unalias(struct inode *inode,
 	if (!mutex_trylock(&alias->d_parent->d_inode->i_mutex))
 		goto out_err;
 	m2 = &alias->d_parent->d_inode->i_mutex;
-out_unalias:
+
+	spin_unlock(&inode->i_lock);
+	write_sequnlock(&rename_lock);
+
+	unlock = lock_namespace_rename(alias, dentry, false);
+
+	write_seqlock(&rename_lock);
 	__d_move(alias, dentry, false);
-	ret = 0;
+	write_sequnlock(&rename_lock);
+
+	if (unlock)
+		unlock_namespace_rename(unlock, alias, dentry, false);
+
+	mutex_unlock(m2);
+	mutex_unlock(m1);
+	return 0;
 out_err:
 	spin_unlock(&inode->i_lock);
+	write_sequnlock(&rename_lock);
 	if (m2)
 		mutex_unlock(m2);
 	if (m1)
 		mutex_unlock(m1);
-	return ret;
+	return -ESTALE;
 }
 
 /**
@@ -2841,7 +2874,6 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 					inode->i_sb->s_id);
 			} else if (!IS_ROOT(new)) {
 				int err = __d_unalias(inode, dentry, new);
-				write_sequnlock(&rename_lock);
 				if (err) {
 					dput(new);
 					new = ERR_PTR(err);
diff --git a/fs/mount.h b/fs/mount.h
index e8f22970fe59..d32d074cc0d4 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -38,6 +38,7 @@ struct mount {
 	struct mount *mnt_parent;
 	struct dentry *mnt_mountpoint;
 	struct vfsmount mnt;
+	unsigned mnt_escape_count;
 	union {
 		struct rcu_head mnt_rcu;
 		struct llist_node mnt_llist;
@@ -107,6 +108,23 @@ static inline void detach_mounts(struct dentry *dentry)
 	__detach_mounts(dentry);
 }
 
+extern const struct dentry *lock_namespace_rename(struct dentry *, struct dentry *, bool);
+extern void unlock_namespace_rename(const struct dentry *, struct dentry *, struct dentry *, bool);
+
+static inline unsigned read_mnt_escape_count(struct vfsmount *vfsmount)
+{
+	struct mount *mnt = real_mount(vfsmount);
+	unsigned ret = READ_ONCE(mnt->mnt_escape_count);
+	smp_rmb();
+	return ret;
+}
+
+static inline void cache_mnt_escape_count(unsigned *cache, unsigned escape_count)
+{
+	if (likely(escape_count & 1) == 0)
+		*cache = escape_count;
+}
+
 static inline void get_mnt_ns(struct mnt_namespace *ns)
 {
 	atomic_inc(&ns->count);
diff --git a/fs/namespace.c b/fs/namespace.c
index 2ce987af9afa..9faec24f3f23 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1681,6 +1681,129 @@ out_unlock:
 	namespace_unlock();
 }
 
+static void lock_escaped_mounts_begin(struct dentry *root)
+{
+	struct mountroot *mr;
+	struct mount *mnt;
+
+	mr = lookup_mountroot(root);
+	if (mr) {
+		/* Mark each mount from which a directory is escaping.
+		 */
+		hlist_for_each_entry(mnt, &mr->r_list, mnt_mr_list) {
+			/* Don't return to 0 if the count wraps */
+			if (unlikely(mnt->mnt_escape_count == (0U - 2)))
+				mnt->mnt_escape_count = 1;
+			else
+				mnt->mnt_escape_count++;
+			smp_wmb();
+		}
+	}
+}
+
+static void lock_escaped_mounts_end(struct dentry *root)
+{
+	struct mountroot *mr;
+	struct mount *mnt;
+
+	mr = lookup_mountroot(root);
+	if (mr) {
+		/* Mark each mount from which a directory is escaping.
+		 */
+		hlist_for_each_entry(mnt, &mr->r_list, mnt_mr_list) {
+			smp_wmb();
+			mnt->mnt_escape_count++;
+		}
+	}
+}
+
+static void handle_mount_escapes_begin(const struct dentry *ancestor,
+				       struct dentry *escapee)
+{
+	struct dentry *dentry;
+
+	/* Don't look for non-directory escapes */
+	if (!d_is_dir(escapee))
+		return;
+
+	for (dentry = escapee->d_parent; dentry != ancestor;
+	     dentry = dentry->d_parent) {
+
+		if (d_mountroot(dentry))
+			lock_escaped_mounts_begin(dentry);
+
+		/* In case there is no common ancestor */
+		if (IS_ROOT(dentry))
+			break;
+	}
+}
+
+static void handle_mount_escapes_end(const struct dentry *ancestor, struct dentry *escapee)
+{
+	struct dentry *dentry;
+
+	/* Don't look for non-directory escapes */
+	if (!d_is_dir(escapee))
+		return;
+
+	for (dentry = escapee->d_parent; dentry != ancestor;
+	     dentry = dentry->d_parent) {
+
+		if (d_mountroot(dentry))
+			lock_escaped_mounts_end(dentry);
+
+		/* In case there is no common ancestor */
+		if (IS_ROOT(dentry))
+			break;
+	}
+	return;
+}
+
+const struct dentry *lock_namespace_rename(struct dentry *dentry1,
+					   struct dentry *dentry2, bool exchange)
+{
+	const struct dentry *ancestor;
+
+	if (dentry1->d_parent == dentry2->d_parent)
+		return NULL;
+
+	if (!d_is_dir(dentry1) && (!exchange || !d_is_dir(dentry2)))
+		return NULL;
+
+	ancestor = d_common_ancestor(dentry1, dentry2);
+
+	read_seqlock_excl(&mount_lock);
+	if (!exchange) {
+		handle_mount_escapes_begin(ancestor, dentry1);
+	} else {
+		handle_mount_escapes_begin(ancestor, dentry1);
+		handle_mount_escapes_begin(ancestor, dentry2);
+	}
+
+	if (ancestor == NULL)
+		ancestor = ERR_PTR(-ENOENT);
+
+	return ancestor;
+}
+
+void unlock_namespace_rename(const struct dentry *ancestor, struct dentry *dentry1,
+			     struct dentry *dentry2, bool exchange)
+{
+	if (!ancestor)
+		return;
+
+	if (ancestor == ERR_PTR(-ENOENT))
+		ancestor = NULL;
+
+	if (!exchange) {
+		handle_mount_escapes_end(ancestor, dentry2);
+	} else {
+		handle_mount_escapes_end(ancestor, dentry2);
+		handle_mount_escapes_end(ancestor, dentry1);
+	}
+	read_sequnlock_excl(&mount_lock);
+}
+
 /* 
  * Is the caller allowed to modify his namespace?
  */
-- 
2.2.1



* [PATCH review 5/6] vfs: Test for and handle paths that are unreachable from their mnt_root
       [not found]             ` <871tfkawu9.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                                 ` (3 preceding siblings ...)
  2015-08-03 21:27               ` [PATCH review 4/6] mnt: Track when a directory escapes a bind mount Eric W. Biederman
@ 2015-08-03 21:30               ` Eric W. Biederman
  2015-08-03 21:30               ` [PATCH review 6/6] vfs: Cache the results of path_connected Eric W. Biederman
  2015-08-05  3:14               ` [PATCH review 7/6] vfs: Make mnt_escape_count 64bit Eric W. Biederman
  6 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-03 21:30 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau


In rare cases a directory can be renamed out from under a bind mount.
In those cases without special handling it becomes possible to walk up
the directory tree to the root dentry of the filesystem and down
from the root dentry to every other file or directory on the filesystem.

Like division by zero, ".." from an unconnected path cannot be given
a useful semantic, as there is no predicting at which path component
the code will realize it is unconnected.  We certainly cannot match
the current behavior, as the current behavior is a security hole.

Therefore, when encountering ".." while following an unconnected path,
return -ENOENT.

- Add a function path_connected to verify nd->path.dentry is reachable
  from nd->path.mnt->mnt_root, i.e. to validate that rename did not do
  something nasty to the bind mount.

  To avoid races path_connected must be called after following a path
  component to its next path component.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namei.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index ae4e4c18b2ac..bccd3810ff60 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -560,6 +560,27 @@ static int __nd_alloc_stack(struct nameidata *nd)
 	return 0;
 }
 
+/**
+ * path_connected - Verify that a nd->path.dentry is below nd->path.mnt->mnt.mnt_root
+ * @nd: nameidata to verify
+ *
+ * Rename can sometimes move a file or directory outside of a bind
+ * mount, path_connected allows those cases to be detected.
+ */
+static bool path_connected(struct nameidata *nd)
+{
+	struct vfsmount *mnt = nd->path.mnt;
+	unsigned escape_count = read_mnt_escape_count(mnt);
+
+	if (likely(escape_count == 0))
+		return true;
+
+	if (!is_subdir(nd->path.dentry, mnt->mnt_root))
+		return false;
+
+	return true;
+}
+
 static inline int nd_alloc_stack(struct nameidata *nd)
 {
 	if (likely(nd->depth != EMBEDDED_LEVELS))
@@ -1294,6 +1315,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 			seq = read_seqcount_begin(&parent->d_seq);
 			if (unlikely(read_seqcount_retry(&old->d_seq, nd->seq)))
 				return -ECHILD;
+			if (unlikely(!path_connected(nd)))
+				return -ENOENT;
 			nd->path.dentry = parent;
 			nd->seq = seq;
 			break;
@@ -1396,7 +1419,7 @@ static void follow_mount(struct path *path)
 	}
 }
 
-static void follow_dotdot(struct nameidata *nd)
+static int follow_dotdot(struct nameidata *nd)
 {
 	if (!nd->root.mnt)
 		set_root(nd);
@@ -1410,7 +1433,12 @@ static void follow_dotdot(struct nameidata *nd)
 		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			/* rare case of legitimate dget_parent()... */
-			nd->path.dentry = dget_parent(nd->path.dentry);
+			struct dentry *parent = dget_parent(nd->path.dentry);
+			if (unlikely(!path_connected(nd))) {
+				dput(parent);
+				return -ENOENT;
+			}
+			nd->path.dentry = parent;
 			dput(old);
 			break;
 		}
@@ -1419,6 +1447,7 @@ static void follow_dotdot(struct nameidata *nd)
 	}
 	follow_mount(&nd->path);
 	nd->inode = nd->path.dentry->d_inode;
+	return 0;
 }
 
 /*
@@ -1634,7 +1663,7 @@ static inline int handle_dots(struct nameidata *nd, int type)
 		if (nd->flags & LOOKUP_RCU) {
 			return follow_dotdot_rcu(nd);
 		} else
-			follow_dotdot(nd);
+			return follow_dotdot(nd);
 	}
 	return 0;
 }
-- 
2.2.1


* [PATCH review 5/6] vfs: Test for and handle paths that are unreachable from their mnt_root
  2015-08-03 21:25           ` [PATCH review 0/6] Bind " Eric W. Biederman
                               ` (2 preceding siblings ...)
  2015-08-03 21:27             ` [PATCH review 4/6] mnt: Track when a directory escapes a bind mount Eric W. Biederman
@ 2015-08-03 21:30             ` Eric W. Biederman
       [not found]               ` <878u9s9i1d.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-08-03 21:30             ` [PATCH review 6/6] vfs: Cache the results of path_connected Eric W. Biederman
  4 siblings, 1 reply; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-03 21:30 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Al Viro, Andy Lutomirski, Serge E. Hallyn,
	Richard Weinberger, Andrey Vagin, Jann Horn, Willy Tarreau,
	Omar Sandoval, Miklos Szeredi, Linus Torvalds, J. Bruce Fields


In rare cases a directory can be renamed out from under a bind mount.
In those cases without special handling it becomes possible to walk up
the directory tree to the root dentry of the filesystem and down
from the root dentry to every other file or directory on the filesystem.

Like division by zero, ".." from an unconnected path cannot be given
a useful semantic, as there is no predicting at which path component
the code will realize it is unconnected.  We certainly cannot match
the current behavior, as the current behavior is a security hole.

Therefore, when encountering ".." while following an unconnected path,
return -ENOENT.

- Add a function path_connected to verify nd->path.dentry is reachable
  from nd->path.mnt->mnt_root, i.e. to validate that rename did not do
  something nasty to the bind mount.

  To avoid races path_connected must be called after following a path
  component to its next path component.
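
The check described above amounts to walking parent pointers from the
current dentry and confirming that the mount root is reached before the
filesystem root.  A rough user-space sketch, assuming a hypothetical
struct toy_dentry that carries only a parent pointer; toy_is_subdir and
toy_follow_dotdot are illustrative names, and the kernel's is_subdir
additionally deals with RCU and rename_lock retries that are not
modelled here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical dentry: the filesystem root is its own parent. */
struct toy_dentry {
	struct toy_dentry *parent;
};

/* True if dentry is root itself or a descendant of root. */
static bool toy_is_subdir(const struct toy_dentry *dentry,
			  const struct toy_dentry *root)
{
	assert(dentry && root);
	for (;;) {
		if (dentry == root)
			return true;
		if (dentry->parent == dentry)	/* hit the filesystem root */
			return false;
		dentry = dentry->parent;
	}
}

/*
 * Follow ".." from *pos inside a mount rooted at mnt_root.  This sketch
 * models a single mount, so ".." at the mount root simply stays put.
 */
static int toy_follow_dotdot(struct toy_dentry **pos,
			     struct toy_dentry *mnt_root)
{
	if (*pos == mnt_root)
		return 0;
	if (!toy_is_subdir(*pos, mnt_root))
		return -ENOENT;		/* the directory escaped the mount */
	*pos = (*pos)->parent;
	return 0;
}
```

A directory that was renamed to a location outside mnt_root fails the
walk, so ".." is refused with -ENOENT instead of stepping toward the
filesystem root.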

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namei.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index ae4e4c18b2ac..bccd3810ff60 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -560,6 +560,27 @@ static int __nd_alloc_stack(struct nameidata *nd)
 	return 0;
 }
 
+/**
+ * path_connected - Verify that a nd->path.dentry is below nd->path.mnt->mnt.mnt_root
+ * @nd: nameidata to verify
+ *
+ * Rename can sometimes move a file or directory outside of a bind
+ * mount, path_connected allows those cases to be detected.
+ */
+static bool path_connected(struct nameidata *nd)
+{
+	struct vfsmount *mnt = nd->path.mnt;
+	unsigned escape_count = read_mnt_escape_count(mnt);
+
+	if (likely(escape_count == 0))
+		return true;
+
+	if (!is_subdir(nd->path.dentry, mnt->mnt_root))
+		return false;
+
+	return true;
+}
+
 static inline int nd_alloc_stack(struct nameidata *nd)
 {
 	if (likely(nd->depth != EMBEDDED_LEVELS))
@@ -1294,6 +1315,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 			seq = read_seqcount_begin(&parent->d_seq);
 			if (unlikely(read_seqcount_retry(&old->d_seq, nd->seq)))
 				return -ECHILD;
+			if (unlikely(!path_connected(nd)))
+				return -ENOENT;
 			nd->path.dentry = parent;
 			nd->seq = seq;
 			break;
@@ -1396,7 +1419,7 @@ static void follow_mount(struct path *path)
 	}
 }
 
-static void follow_dotdot(struct nameidata *nd)
+static int follow_dotdot(struct nameidata *nd)
 {
 	if (!nd->root.mnt)
 		set_root(nd);
@@ -1410,7 +1433,12 @@ static void follow_dotdot(struct nameidata *nd)
 		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			/* rare case of legitimate dget_parent()... */
-			nd->path.dentry = dget_parent(nd->path.dentry);
+			struct dentry *parent = dget_parent(nd->path.dentry);
+			if (unlikely(!path_connected(nd))) {
+				dput(parent);
+				return -ENOENT;
+			}
+			nd->path.dentry = parent;
 			dput(old);
 			break;
 		}
@@ -1419,6 +1447,7 @@ static void follow_dotdot(struct nameidata *nd)
 	}
 	follow_mount(&nd->path);
 	nd->inode = nd->path.dentry->d_inode;
+	return 0;
 }
 
 /*
@@ -1634,7 +1663,7 @@ static inline int handle_dots(struct nameidata *nd, int type)
 		if (nd->flags & LOOKUP_RCU) {
 			return follow_dotdot_rcu(nd);
 		} else
-			follow_dotdot(nd);
+			return follow_dotdot(nd);
 	}
 	return 0;
 }
-- 
2.2.1



* [PATCH review 6/6] vfs: Cache the results of path_connected
       [not found]             ` <871tfkawu9.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                                 ` (4 preceding siblings ...)
  2015-08-03 21:30               ` [PATCH review 5/6] vfs: Test for and handle paths that are unreachable from their mnt_root Eric W. Biederman
@ 2015-08-03 21:30               ` Eric W. Biederman
  2015-08-05  3:14               ` [PATCH review 7/6] vfs: Make mnt_escape_count 64bit Eric W. Biederman
  6 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-03 21:30 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau


Add a new field mnt_escape_count in nameidata, initialize it to 0 and
cache the value of read_mnt_escape_count in nd->mnt_escape_count.

This allows a single check in path_connected in the common case where
either the mount has had no escapes (mnt_escape_count == 0) or there
has been an escape and it has been validated that the current path
does not escape.

To keep the cache valid nd->mnt_escape_count must be set to 0 whenever
the nd->path.mnt changes or when nd->path.dentry changes such that
the connectedness of the previous value of nd->path.dentry does
not imply the connectedness of the new value of nd->path.dentry.

Various locations in fs/namei.c are updated to set
nd->mnt_escape_count to 0 as necessary.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namei.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index bccd3810ff60..79a5dca073f5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -514,6 +514,7 @@ struct nameidata {
 	struct nameidata *saved;
 	unsigned	root_seq;
 	int		dfd;
+	unsigned	mnt_escape_count;
 };
 
 static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
@@ -572,12 +573,13 @@ static bool path_connected(struct nameidata *nd)
 	struct vfsmount *mnt = nd->path.mnt;
 	unsigned escape_count = read_mnt_escape_count(mnt);
 
-	if (likely(escape_count == 0))
+	if (likely(escape_count == nd->mnt_escape_count))
 		return true;
 
 	if (!is_subdir(nd->path.dentry, mnt->mnt_root))
 		return false;
 
+	cache_mnt_escape_count(&nd->mnt_escape_count, escape_count);
 	return true;
 }
 
@@ -840,6 +842,9 @@ static inline void path_to_nameidata(const struct path *path,
 		if (nd->path.mnt != path->mnt)
 			mntput(nd->path.mnt);
 	}
+	if (unlikely((nd->path.mnt != path->mnt) ||
+		     (nd->path.dentry != path->dentry->d_parent)))
+		nd->mnt_escape_count = 0;
 	nd->path.mnt = path->mnt;
 	nd->path.dentry = path->dentry;
 }
@@ -856,6 +861,7 @@ void nd_jump_link(struct path *path)
 	nd->path = *path;
 	nd->inode = nd->path.dentry->d_inode;
 	nd->flags |= LOOKUP_JUMPED;
+	nd->mnt_escape_count = 0;
 }
 
 static inline void put_link(struct nameidata *nd)
@@ -1040,6 +1046,7 @@ const char *get_link(struct nameidata *nd)
 			nd->inode = nd->path.dentry->d_inode;
 		}
 		nd->flags |= LOOKUP_JUMPED;
+		nd->mnt_escape_count = 0;
 		while (unlikely(*++res == '/'))
 			;
 	}
@@ -1335,6 +1342,7 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 			nd->path.mnt = &mparent->mnt;
 			inode = inode2;
 			nd->seq = seq;
+			nd->mnt_escape_count = 0;
 		}
 	}
 	while (unlikely(d_mountpoint(nd->path.dentry))) {
@@ -1348,6 +1356,7 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 		nd->path.dentry = mounted->mnt.mnt_root;
 		inode = nd->path.dentry->d_inode;
 		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		nd->mnt_escape_count = 0;
 	}
 	nd->inode = inode;
 	return 0;
@@ -1406,8 +1415,9 @@ EXPORT_SYMBOL(follow_down);
 /*
  * Skip to top of mountpoint pile in refwalk mode for follow_dotdot()
  */
-static void follow_mount(struct path *path)
+static bool follow_mount(struct path *path)
 {
+	bool followed = false;
 	while (d_mountpoint(path->dentry)) {
 		struct vfsmount *mounted = lookup_mnt(path);
 		if (!mounted)
@@ -1416,7 +1426,9 @@ static void follow_mount(struct path *path)
 		mntput(path->mnt);
 		path->mnt = mounted;
 		path->dentry = dget(mounted->mnt_root);
+		followed = true;
 	}
+	return followed;
 }
 
 static int follow_dotdot(struct nameidata *nd)
@@ -1444,8 +1456,10 @@ static int follow_dotdot(struct nameidata *nd)
 		}
 		if (!follow_up(&nd->path))
 			break;
+		nd->mnt_escape_count = 0;
 	}
-	follow_mount(&nd->path);
+	if (follow_mount(&nd->path))
+		nd->mnt_escape_count = 0;
 	nd->inode = nd->path.dentry->d_inode;
 	return 0;
 }
@@ -1997,6 +2011,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
 	nd->depth = 0;
 	nd->total_link_count = 0;
+	nd->mnt_escape_count = 0;
 	if (flags & LOOKUP_ROOT) {
 		struct dentry *root = nd->root.dentry;
 		struct inode *inode = root->d_inode;
@@ -3026,6 +3041,7 @@ static int do_last(struct nameidata *nd,
 	unsigned seq;
 	struct inode *inode;
 	struct path save_parent = { .dentry = NULL, .mnt = NULL };
+	unsigned save_parent_escape_count = 0;
 	struct path path;
 	bool retried = false;
 	int error;
@@ -3155,6 +3171,9 @@ finish_lookup:
 	} else {
 		save_parent.dentry = nd->path.dentry;
 		save_parent.mnt = mntget(path.mnt);
+		save_parent_escape_count = nd->mnt_escape_count;
+		if (nd->path.dentry != path.dentry->d_parent)
+			nd->mnt_escape_count = 0;
 		nd->path.dentry = path.dentry;
 
 	}
@@ -3227,6 +3246,7 @@ stale_open:
 
 	BUG_ON(save_parent.dentry != dir);
 	path_put(&nd->path);
+	nd->mnt_escape_count = save_parent_escape_count;
 	nd->path = save_parent;
 	nd->inode = dir->d_inode;
 	save_parent.mnt = NULL;
-- 
2.2.1


* [PATCH review 6/6] vfs: Cache the results of path_connected
  2015-08-03 21:25           ` [PATCH review 0/6] Bind " Eric W. Biederman
                               ` (3 preceding siblings ...)
  2015-08-03 21:30             ` [PATCH review 5/6] " Eric W. Biederman
@ 2015-08-03 21:30             ` Eric W. Biederman
       [not found]               ` <8738009i0h.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-08-04 11:52               ` Andrew Vagin
  4 siblings, 2 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-03 21:30 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Al Viro, Andy Lutomirski, Serge E. Hallyn,
	Richard Weinberger, Andrey Vagin, Jann Horn, Willy Tarreau,
	Omar Sandoval, Miklos Szeredi, Linus Torvalds, J. Bruce Fields


Add a new field mnt_escape_count in nameidata, initialize it to 0 and
cache the value of read_mnt_escape_count in nd->mnt_escape_count.

This allows a single check in path_connected in the common case where
either the mount has had no escapes (mnt_escape_count == 0) or there
has been an escape and it has been validated that the current path
does not escape.

To keep the cache valid nd->mnt_escape_count must be set to 0 whenever
the nd->path.mnt changes or when nd->path.dentry changes such that
the connectedness of the previous value of nd->path.dentry does
not imply the connectedness of the new value of nd->path.dentry.

Various locations in fs/namei.c are updated to set
nd->mnt_escape_count to 0 as necessary.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namei.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index bccd3810ff60..79a5dca073f5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -514,6 +514,7 @@ struct nameidata {
 	struct nameidata *saved;
 	unsigned	root_seq;
 	int		dfd;
+	unsigned	mnt_escape_count;
 };
 
 static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
@@ -572,12 +573,13 @@ static bool path_connected(struct nameidata *nd)
 	struct vfsmount *mnt = nd->path.mnt;
 	unsigned escape_count = read_mnt_escape_count(mnt);
 
-	if (likely(escape_count == 0))
+	if (likely(escape_count == nd->mnt_escape_count))
 		return true;
 
 	if (!is_subdir(nd->path.dentry, mnt->mnt_root))
 		return false;
 
+	cache_mnt_escape_count(&nd->mnt_escape_count, escape_count);
 	return true;
 }
 
@@ -840,6 +842,9 @@ static inline void path_to_nameidata(const struct path *path,
 		if (nd->path.mnt != path->mnt)
 			mntput(nd->path.mnt);
 	}
+	if (unlikely((nd->path.mnt != path->mnt) ||
+		     (nd->path.dentry != path->dentry->d_parent)))
+		nd->mnt_escape_count = 0;
 	nd->path.mnt = path->mnt;
 	nd->path.dentry = path->dentry;
 }
@@ -856,6 +861,7 @@ void nd_jump_link(struct path *path)
 	nd->path = *path;
 	nd->inode = nd->path.dentry->d_inode;
 	nd->flags |= LOOKUP_JUMPED;
+	nd->mnt_escape_count = 0;
 }
 
 static inline void put_link(struct nameidata *nd)
@@ -1040,6 +1046,7 @@ const char *get_link(struct nameidata *nd)
 			nd->inode = nd->path.dentry->d_inode;
 		}
 		nd->flags |= LOOKUP_JUMPED;
+		nd->mnt_escape_count = 0;
 		while (unlikely(*++res == '/'))
 			;
 	}
@@ -1335,6 +1342,7 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 			nd->path.mnt = &mparent->mnt;
 			inode = inode2;
 			nd->seq = seq;
+			nd->mnt_escape_count = 0;
 		}
 	}
 	while (unlikely(d_mountpoint(nd->path.dentry))) {
@@ -1348,6 +1356,7 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 		nd->path.dentry = mounted->mnt.mnt_root;
 		inode = nd->path.dentry->d_inode;
 		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		nd->mnt_escape_count = 0;
 	}
 	nd->inode = inode;
 	return 0;
@@ -1406,8 +1415,9 @@ EXPORT_SYMBOL(follow_down);
 /*
  * Skip to top of mountpoint pile in refwalk mode for follow_dotdot()
  */
-static void follow_mount(struct path *path)
+static bool follow_mount(struct path *path)
 {
+	bool followed = false;
 	while (d_mountpoint(path->dentry)) {
 		struct vfsmount *mounted = lookup_mnt(path);
 		if (!mounted)
@@ -1416,7 +1426,9 @@ static void follow_mount(struct path *path)
 		mntput(path->mnt);
 		path->mnt = mounted;
 		path->dentry = dget(mounted->mnt_root);
+		followed = true;
 	}
+	return followed;
 }
 
 static int follow_dotdot(struct nameidata *nd)
@@ -1444,8 +1456,10 @@ static int follow_dotdot(struct nameidata *nd)
 		}
 		if (!follow_up(&nd->path))
 			break;
+		nd->mnt_escape_count = 0;
 	}
-	follow_mount(&nd->path);
+	if (follow_mount(&nd->path))
+		nd->mnt_escape_count = 0;
 	nd->inode = nd->path.dentry->d_inode;
 	return 0;
 }
@@ -1997,6 +2011,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
 	nd->depth = 0;
 	nd->total_link_count = 0;
+	nd->mnt_escape_count = 0;
 	if (flags & LOOKUP_ROOT) {
 		struct dentry *root = nd->root.dentry;
 		struct inode *inode = root->d_inode;
@@ -3026,6 +3041,7 @@ static int do_last(struct nameidata *nd,
 	unsigned seq;
 	struct inode *inode;
 	struct path save_parent = { .dentry = NULL, .mnt = NULL };
+	unsigned save_parent_escape_count = 0;
 	struct path path;
 	bool retried = false;
 	int error;
@@ -3155,6 +3171,9 @@ finish_lookup:
 	} else {
 		save_parent.dentry = nd->path.dentry;
 		save_parent.mnt = mntget(path.mnt);
+		save_parent_escape_count = nd->mnt_escape_count;
+		if (nd->path.dentry != path.dentry->d_parent)
+			nd->mnt_escape_count = 0;
 		nd->path.dentry = path.dentry;
 
 	}
@@ -3227,6 +3246,7 @@ stale_open:
 
 	BUG_ON(save_parent.dentry != dir);
 	path_put(&nd->path);
+	nd->mnt_escape_count = save_parent_escape_count;
 	nd->path = save_parent;
 	nd->inode = dir->d_inode;
 	save_parent.mnt = NULL;
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* Re: [PATCH review 6/6] vfs: Cache the results of path_connected
       [not found]               ` <8738009i0h.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-04 11:52                 ` Andrew Vagin
  0 siblings, 0 replies; 240+ messages in thread
From: Andrew Vagin @ 2015-08-04 11:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

On Mon, Aug 03, 2015 at 04:30:54PM -0500, Eric W. Biederman wrote:
> 
> Add a new field mnt_escape_count in nameidata, initialize it to 0 and
> cache the value of read_mnt_escape_count in nd->mnt_escape_count.
> 
> This allows a single check in path_connected in the common case where
> either the mount has had no escapes (mnt_escape_count == 0) or there
> has been an escape and it has been validated that the current path
> does not escape.
> 
> To keep the cache valid nd->mnt_escape_count must be set to 0 whenever
> the nd->path.mnt changes or when nd->path.dentry changes such that
> the connectedness of the previous value of nd->path.dentry does
> not imply the connectedness of the new value of nd->path.dentry.
> 
> Various locations in fs/namei.c are updated to set
> nd->mnt_escape_count to 0 as necessary.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  fs/namei.c | 26 +++++++++++++++++++++++---
>  1 file changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index bccd3810ff60..79a5dca073f5 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -514,6 +514,7 @@ struct nameidata {
>  	struct nameidata *saved;
>  	unsigned	root_seq;
>  	int		dfd;
> +	unsigned	mnt_escape_count;
>  };
>  
>  static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
> @@ -572,12 +573,13 @@ static bool path_connected(struct nameidata *nd)
>  	struct vfsmount *mnt = nd->path.mnt;
>  	unsigned escape_count = read_mnt_escape_count(mnt);
>  
> -	if (likely(escape_count == 0))
> +	if (likely(escape_count == nd->mnt_escape_count))
>  		return true;

The size of mnt_escape_count is only 4 bytes. It looks like it is possible to
perform UINT_MAX / 2 operations in a reasonable time and get the same
value of mnt_escape_count again; path_connected() will then return true even
though the path may already be detached. What do you think about this?

Thanks,
Andrew

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 6/6] vfs: Cache the results of path_connected
       [not found]                 ` <20150804115215.GA317-wo1vFcy6AUs@public.gmane.org>
@ 2015-08-04 17:41                   ` Eric W. Biederman
  2015-08-04 19:44                     ` J. Bruce Fields
       [not found]                     ` <871tfj0x4j.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 2 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-04 17:41 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

Andrew Vagin <avagin-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> On Mon, Aug 03, 2015 at 04:30:54PM -0500, Eric W. Biederman wrote:
>> 
>> Add a new field mnt_escape_count in nameidata, initialize it to 0 and
>> cache the value of read_mnt_escape_count in nd->mnt_escape_count.
>> 
>> This allows a single check in path_connected in the common case where
>> either the mount has had no escapes (mnt_escape_count == 0) or there
>> has been an escape and it has been validated that the current path
>> does not escape.
>> 
>> To keep the cache valid nd->mnt_escape_count must be set to 0 whenever
>> the nd->path.mnt changes or when nd->path.dentry changes such that
>> the connectedness of the previous value of nd->path.dentry does
>> not imply the connectedness of the new value of nd->path.dentry.
>> 
>> Various locations in fs/namei.c are updated to set
>> nd->mnt_escape_count to 0 as necessary.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>> ---
>>  fs/namei.c | 26 +++++++++++++++++++++++---
>>  1 file changed, 23 insertions(+), 3 deletions(-)
>> 
>> diff --git a/fs/namei.c b/fs/namei.c
>> index bccd3810ff60..79a5dca073f5 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -514,6 +514,7 @@ struct nameidata {
>>  	struct nameidata *saved;
>>  	unsigned	root_seq;
>>  	int		dfd;
>> +	unsigned	mnt_escape_count;
>>  };
>>  
>>  static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
>> @@ -572,12 +573,13 @@ static bool path_connected(struct nameidata *nd)
>>  	struct vfsmount *mnt = nd->path.mnt;
>>  	unsigned escape_count = read_mnt_escape_count(mnt);
>>  
>> -	if (likely(escape_count == 0))
>> +	if (likely(escape_count == nd->mnt_escape_count))
>>  		return true;
>
> The size of mnt_escape_count is only 4 bytes. It looks like it is possible to
> perform UINT_MAX / 2 operations in a reasonable time and get the same
> value of mnt_escape_count again; path_connected() will then return true even
> though the path may already be detached. What do you think about this?

It is an interesting question.  

For the locking on rename we have something that looks like:
mutex_lock(&...s_vfs_mutex);
mutex_lock(&p2->d_inode->i_mutex);
mutex_lock(&p1->d_inode->i_mutex);
read_seqlock_excl(&mount_lock);
escape_count++;
write_seqlock(&rename_lock);
write_seqcount_begin(&dentry->d_seq);
write_seqcount_begin_nested(&target->d_seq);
write_seqcount_end_nested(&target->d_seq);
write_seqcount_end(&dentry->d_seq);
write_sequnlock(&rename_lock);
escape_count++;
read_sequnlock_excl(&mount_lock);
mutex_unlock(&p1->d_inode->i_mutex);
mutex_unlock(&p2->d_inode->i_mutex);
mutex_unlock(&...s_vfs_mutex);

Which is at least 16 serialized cpu operations.  To reach overflow then
it would take at least 16 * 2**32 cpu operations.  On a 4GHz cpu
16 * 2**32 operations would take roughly 16 seconds.  In practice I
think it would take noticeably more than 16 seconds to perform that many
renames as there is a lot more going on than just the locking I singled
out.

A pathname lookup taking 16 seconds seems absurd.  But perhaps in the
worst case.

The maximum length of a path that can be passed into path_lookup is
4096.  For a lookup to be problematic there must be at least as many
instances of .. as there are of any other path component.  So each pair
of a minimum length path element and a .. element must take at least 5
bytes. Which in 4096 bytes leaves room for 819 path elements.  If every
one of those 819 path components triggered a disk seek at 100 seeks per
second I could see a path name lookup potentially taking 8 seconds.

MAXSYMLINKS is 40.  So with a truly pathological set of file names you
could perhaps make that 320 seconds. Ick.

To get some real figures I have performed a few renames on my 2.5GHz
laptop with ext4 on an ssd.  Performing a simple rename in the same
directory (which involves much less than is required to rename a file
out of a mount point) I get the following timings:

   renames    time
   200,000      1.2s
 2,000,000     17.2s
20,000,000    205.6s
40,000,000    418.5s

At that kind of speed I would expect 4,000,000,000 renames to take 41850
seconds or 11.625 hours.

Going the other way on an old system with spinning rust for a hard drive
I created 1000 nested directories all named a and timed how long it
would take to stat a pathological pathname.  With a simple pathname
without symlinks involved the worst case I have been able to trigger
is a 0.3 second path name lookup time.   Not being satisfied with
that I managed to create a file about 83,968 directories deep and a
set of 40 symlinks set up to get me there in one stat call, the first
symlink being 8192 directories deep.  The worst case time I have been
able to measure from stat -L on the symlink that leads me to that file
is 14.5 seconds. 

If my numbers are at all representative I don't see a realistic
possibility of being able to perform enough renames to roll over a 32bit
mnt_escape_count during a path name lookup.  Even my most optimistic
estimate required 16 seconds to perform the renames and I have not been
able to get any pathname lookup to run that long.

At the same time I do think it is probably worth it to make escape count
an unsigned long, because that is practically free and it removes
any theoretical concern on 64bit.

Am I missing anything?

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 6/6] vfs: Cache the results of path_connected
       [not found]                     ` <871tfj0x4j.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-04 19:44                       ` J. Bruce Fields
  0 siblings, 0 replies; 240+ messages in thread
From: J. Bruce Fields @ 2015-08-04 19:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

On Tue, Aug 04, 2015 at 12:41:32PM -0500, Eric W. Biederman wrote:
> A pathname lookup taking 16 seconds seems absurd.  But perhaps in the
> worst case.
> 
> The maximum length of a path that can be passed into path_lookup is
> 4096.  For a lookup to be problematic there must be at least as many
> instances of .. as there are of any other path component.  So each pair
> of a minimum length path element and a .. element must take at least 5
> bytes. Which in 4096 bytes leaves room for 819 path elements.  If every
> one of those 819 path components triggered a disk seek at 100 seeks per
> second I could see a path name lookup potentially taking 8 seconds.

A lookup on NFS while a server's rebooting or the network's flaky could
take arbitrarily long.  Other network filesystems and fuse can have
similar problems.  Depending on threat model an attacker might have
quite precise control over that timing.  Disk filesystems could have all
the same problems since there's no guarantee the underlying block device
is really local.  Even ignoring that, hardware can be slow or flaky.
And couldn't an allocation in theory block for an arbitrarily long time?

Apologies for just dropping into the middle here!  I haven't read the
rest and don't have the context to know whether any of that's relevant.

--b.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 6/6] vfs: Cache the results of path_connected
       [not found]                       ` <20150804194447.GB6664-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
@ 2015-08-04 22:58                         ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-04 22:58 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

"J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> writes:

> On Tue, Aug 04, 2015 at 12:41:32PM -0500, Eric W. Biederman wrote:
>> A pathname lookup taking 16 seconds seems absurd.  But perhaps in the
>> worst case.
>> 
>> The maximum length of a path that can be passed into path_lookup is
>> 4096.  For a lookup to be problematic there must be at least as many
>> instances of .. as there are of any other path component.  So each pair
>> of a minimum length path element and a .. element must take at least 5
>> bytes. Which in 4096 bytes leaves room for 819 path elements.  If every
>> one of those 819 path components triggered a disk seek at 100 seeks per
>> second I could see a path name lookup potentially taking 8 seconds.
>
> A lookup on NFS while a server's rebooting or the network's flaky could
> take arbitrarily long.  Other network filesystems and fuse can have
> similar problems.  Depending on threat model an attacker might have
> quite precise control over that timing.  Disk filesystems could have all
> the same problems since there's no guarantee the underlying block device
> is really local.  Even ignoring that, hardware can be slow or flaky.
> And couldn't an allocation in theory block for an arbitrarily long time?
>
> Apologies for just dropping into the middle here!  I haven't read the
> rest and don't have the context to know whether any of that's relevant.

No problem.  The basic question is: can 2 billion renames be performed on
the same filesystem in less time than a single path lookup?  If not, a
32bit counter can be used.

Most of the issues that slow up lookup also slow up rename so I have
not been focusing on them.

If you could look up thread and tell me what you think of the issue with
file handle to dentry conversion and bind mounts I would appreciate it.

I have been testing a little more on my system and it appears that it
takes 60 minutes give or take to perform 2 billion renames on ramfs.
A faster cpu (5GHz?) could perhaps get that down to 30 minutes.

With no slow downs and no weirdness I have been able to get a single
pathname lookup to take just over 2 minutes, and I expect I could get
that to take another minute more.

Those numbers are within a factor of 10 of each other, and I expect
someone clever could finagle something to overcome the rest.  So sigh.

There just is not enough margin in there to be certain of things.  Now
with the small change to make that counter 64bit, that 30 minutes to
wrap the counter becomes 240,000+ years.  I think I can safely not worry
about the issue.  I just need to come up with a good 32bit implementation.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH review 7/6] vfs: Make mnt_escape_count 64bit
       [not found]             ` <871tfkawu9.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                                 ` (5 preceding siblings ...)
  2015-08-03 21:30               ` [PATCH review 6/6] vfs: Cache the results of path_connected Eric W. Biederman
@ 2015-08-05  3:14               ` Eric W. Biederman
  6 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-05  3:14 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau


The primary way that mnt_escape_count differs from a seqcount is that
its value is cached across operations that sleep.  In empirical
testing mnt_escape_count can be made to rollover in 64 minutes on an
Intel 2.5GHz Core i5 processor on ramfs.  Meanwhile a single pathname
lookup on an otherwise idle system has been measured at 2 minutes 9
seconds.  Those numbers are entirely too close for comfort, especially
given that nfs lookups can take indefinitely long.

Extend mnt_escape_count to 64bit to increase the expected time to
rollover from 1 hour to 489,957 years.  Even if the efficiency of
rename is increased to be able to rename 2^31 entries in 1 second
(instead of the 1 hour that I measured) it will still take 136 years
before the escape count rolls over, making it essentially never.

On 32bit the low 32bit word of the 64bit count is treated as a
sequence count such that if you read the low 32bit value, read the
high 32bit value, and then read the low 32bit value again and the low
32bit value remains unchanged the high 32bit value is guaranteed to be
stable when the low 32bit value does not equal -1UL.  Thankfully in
the unlikely event that the low 32bit value is -1UL the code does not
care about the high 32bit value so the code does not need to reread
the values in that case.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/mount.h     | 32 ++++++++++++++++++++++++++++----
 fs/namei.c     |  6 +++---
 fs/namespace.c | 13 ++++++++++++-
 3 files changed, 43 insertions(+), 8 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index d32d074cc0d4..cd89e786efa7 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -38,7 +38,10 @@ struct mount {
 	struct mount *mnt_parent;
 	struct dentry *mnt_mountpoint;
 	struct vfsmount mnt;
-	unsigned mnt_escape_count;
+	unsigned long mnt_escape_count;
+#if BITS_PER_LONG < 64
+	unsigned long mnt_escape_count_high;
+#endif
 	union {
 		struct rcu_head mnt_rcu;
 		struct llist_node mnt_llist;
@@ -111,15 +114,36 @@ static inline void detach_mounts(struct dentry *dentry)
 extern const struct dentry *lock_namespace_rename(struct dentry *, struct dentry *, bool);
 extern void unlock_namespace_rename(const struct dentry *, struct dentry *, struct dentry *, bool);
 
-static inline unsigned read_mnt_escape_count(struct vfsmount *vfsmount)
+static inline u64 read_mnt_escape_count(struct vfsmount *vfsmount)
 {
 	struct mount *mnt = real_mount(vfsmount);
-	unsigned ret = READ_ONCE(mnt->mnt_escape_count);
+#if BITS_PER_LONG >= 64
+	u64 ret = READ_ONCE(mnt->mnt_escape_count);
+#else
+	u64 ret;
+	unsigned long low0, low, high;
+	/* In the unlikely event that low0 == low and low == -1
+	 * mnt_escape_count_high may or may not be incremented yet.  In
+	 * that event the odd value of low will not match anything
+	 * cached, will signal that the validity of is_subdir is in
+	 * flux and will not be cached.  Therefore when low == -1 the
+	 * value of high does not matter.
+	 */
+	low0 = READ_ONCE(mnt->mnt_escape_count);
+	do {
+		low = low0;
+		smp_rmb();
+		high = READ_ONCE(mnt->mnt_escape_count_high);
+		smp_rmb();
+		low0 = READ_ONCE(mnt->mnt_escape_count);
+	} while (low != low0);
+	ret = (((u64)high) << 32) | low;
+#endif
 	smp_rmb();
 	return ret;
 }
 
-static inline void cache_mnt_escape_count(unsigned *cache, unsigned escape_count)
+static inline void cache_mnt_escape_count(u64 *cache, u64 escape_count)
 {
 	if (likely(escape_count & 1) == 0)
 		*cache = escape_count;
diff --git a/fs/namei.c b/fs/namei.c
index 79a5dca073f5..ef1463c0b96a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -514,7 +514,7 @@ struct nameidata {
 	struct nameidata *saved;
 	unsigned	root_seq;
 	int		dfd;
-	unsigned	mnt_escape_count;
+	u64		mnt_escape_count;
 };
 
 static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
@@ -571,7 +571,7 @@ static int __nd_alloc_stack(struct nameidata *nd)
 static bool path_connected(struct nameidata *nd)
 {
 	struct vfsmount *mnt = nd->path.mnt;
-	unsigned escape_count = read_mnt_escape_count(mnt);
+	u64 escape_count = read_mnt_escape_count(mnt);
 
 	if (likely(escape_count == nd->mnt_escape_count))
 		return true;
@@ -3041,7 +3041,7 @@ static int do_last(struct nameidata *nd,
 	unsigned seq;
 	struct inode *inode;
 	struct path save_parent = { .dentry = NULL, .mnt = NULL };
-	unsigned save_parent_escape_count = 0;
+	u64 save_parent_escape_count = 0;
 	struct path path;
 	bool retried = false;
 	int error;
diff --git a/fs/namespace.c b/fs/namespace.c
index 9faec24f3f23..98596c4b992a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1692,10 +1692,21 @@ static void lock_escaped_mounts_begin(struct dentry *root)
 		 */
 		hlist_for_each_entry(mnt, &mr->r_list, mnt_mr_list) {
 			/* Don't return to 0 if the couunt wraps */
-			if (unlikely(mnt->mnt_escape_count == (0U - 2)))
+#if BITS_PER_LONG >= 64
+			if (unlikely(mnt->mnt_escape_count == (0UL - 2)))
 				mnt->mnt_escape_count = 1;
 			else
 				mnt->mnt_escape_count++;
+#else
+			if (unlikely(mnt->mnt_escape_count == (0UL - 2))) {
+				WRITE_ONCE(mnt->mnt_escape_count, (0UL - 1));
+				smp_wmb();
+				mnt->mnt_escape_count_high++;
+				smp_wmb();
+				WRITE_ONCE(mnt->mnt_escape_count, 1);
+			} else
+				mnt->mnt_escape_count++;
+#endif
 			smp_wmb();
 		}
 	}
-- 
2.2.1
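
Editorial sketch, not part of the patch: the 32-bit branch above splits
the count into two words and has the reader re-read the low word until
it is stable around the read of the high word.  The userspace C below
mimics that protocol under stated assumptions: struct split_count and
the split_count_* helpers are hypothetical names, uint32_t stands in for
a 32-bit unsigned long, C11 fences only approximate smp_wmb()/smp_rmb(),
and a single writer at a time is assumed.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the two 32-bit halves of the escape count. */
struct split_count {
	_Atomic uint32_t low;
	_Atomic uint32_t high;
};

/* Writer side.  Assumes one writer at a time (an assumption of this sketch). */
static void split_count_bump(struct split_count *c)
{
	uint32_t low = atomic_load_explicit(&c->low, memory_order_relaxed);

	if (low == UINT32_MAX - 1) {
		uint32_t high = atomic_load_explicit(&c->high, memory_order_relaxed);

		/* Park low at the odd marker value while high changes. */
		atomic_store_explicit(&c->low, UINT32_MAX, memory_order_relaxed);
		atomic_thread_fence(memory_order_release);
		atomic_store_explicit(&c->high, high + 1, memory_order_relaxed);
		atomic_thread_fence(memory_order_release);
		/* Restart the low half at 1, skipping 0, as the patch does. */
		atomic_store_explicit(&c->low, 1, memory_order_relaxed);
	} else {
		atomic_store_explicit(&c->low, low + 1, memory_order_relaxed);
	}
	atomic_thread_fence(memory_order_release);
}

/* Reader side: retry until low is unchanged across the read of high. */
static uint64_t split_count_read(struct split_count *c)
{
	uint32_t low0, low, high;

	low0 = atomic_load_explicit(&c->low, memory_order_relaxed);
	do {
		low = low0;
		atomic_thread_fence(memory_order_acquire);
		high = atomic_load_explicit(&c->high, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		low0 = atomic_load_explicit(&c->low, memory_order_relaxed);
	} while (low != low0);
	atomic_thread_fence(memory_order_acquire);
	return ((uint64_t)high << 32) | low;
}

/* Only even values are worth caching; odd means a rename is in flight. */
static int split_count_is_cacheable(uint64_t value)
{
	return (value & 1) == 0;
}
```

As the comment in the patch notes, a reader that sees the marker value
in the low word gets an odd result, which is never cached, so it does
not matter whether the high word had already been incremented.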

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* Re: [PATCH review 6/6] vfs: Cache the results of path_connected
       [not found]                         ` <874mkey824.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-05 15:59                           ` J. Bruce Fields
       [not found]                             ` <20150805155948.GD17797-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
  0 siblings, 1 reply; 240+ messages in thread
From: J. Bruce Fields @ 2015-08-05 15:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

On Tue, Aug 04, 2015 at 05:58:59PM -0500, Eric W. Biederman wrote:
> "J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> writes:
> 
> > On Tue, Aug 04, 2015 at 12:41:32PM -0500, Eric W. Biederman wrote:
> >> A pathname lookup taking 16 seconds seems absurd.  But perhaps in the
> >> worst case.
> >> 
> >> The maximum length of a path that can be passed into path_lookup is
> >> 4096.  For a lookup to be problematic there must be at least as many
> >> instances of .. as there are of any other path component.  So each pair
> >> of a minimum-length path element and a .. element must take at least 5
> >> bytes. Which in 4096 bytes leaves room for 819 path elements.  If every
> >> one of those 819 path components triggered a disk seek at 100 seeks per
> >> second I could see a path name lookup potentially taking 8 seconds.
> >
> > A lookup on NFS while a server's rebooting or the network's flaky could
> > take arbitrarily long.  Other network filesystems and fuse can have
> > similar problems.  Depending on threat model an attacker might have
> > quite precise control over that timing.  Disk filesystems could have all
> > the same problems since there's no guarantee the underlying block device
> > is really local.  Even ignoring that, hardware can be slow or flaky.
> > And couldn't an allocation in theory block for an arbitrarily long time?
> >
> > Apologies for just dropping into the middle here!  I haven't read the
> > rest and don't have the context to know whether any of that's relevant.
> 
> No problem.  The basic question is: can 2 billion renames be performed on
> the same filesystem in less time than a single path lookup?  If not, that
> allows the use of a 32-bit counter.

Certainly if you have control over an NFS or FUSE server then you can
arrange for that to happen--just delay the lookup until you've processed
enough renames.  I don't know if that's interesting....

> If you could look up thread and tell me what you think of the issue with
> file handle to dentry conversion and bind mounts I would appreciate it.

OK, I see your comments in "[PATCH review 0/6] Bind mount escape fixes",
I'm not sure I understand yet, I'll take a closer look.

--b.
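
Editorial sketch, not part of the original mail: the worst-case estimate
quoted above (a 4096-byte path, 5 bytes per name/".." pair, 100 seeks
per second) can be checked with a few lines of C.  The names
MIN_PAIR_BYTES, max_dotdot_pairs and worst_case_lookup_ms are
hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Bytes used by one "a/../" pair: a one-character name, "..", two slashes. */
#define MIN_PAIR_BYTES 5

/* Number of name/".." pairs that fit in a path of path_max bytes. */
static size_t max_dotdot_pairs(size_t path_max)
{
	return path_max / MIN_PAIR_BYTES;
}

/* Worst case in milliseconds if every pair costs exactly one disk seek. */
static unsigned long worst_case_lookup_ms(size_t path_max,
					  unsigned long seeks_per_sec)
{
	return (unsigned long)(max_dotdot_pairs(path_max) * 1000 / seeks_per_sec);
}
```

With path_max = 4096 and 100 seeks per second this gives 819 pairs and
8190 ms, matching the "819 path elements" and "8 seconds" figures in the
quoted text.  It says nothing about network filesystems or FUSE, where
the delay per component is not bounded by a seek time.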

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 6/6] vfs: Cache the results of path_connected
       [not found]                             ` <20150805155948.GD17797-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
@ 2015-08-05 16:28                               ` Eric W. Biederman
       [not found]                                 ` <878u9pwvg8.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-05 16:28 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

"J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> writes:

> On Tue, Aug 04, 2015 at 05:58:59PM -0500, Eric W. Biederman wrote:
>>
>> No problem.  The basic question is: can 2 billion renames be performed on
>> the same filesystem in less time than a single path lookup?  If not, that
>> allows the use of a 32-bit counter.
>
> Certainly if you have control over an NFS or FUSE server then you can
> arrange for that to happen--just delay the lookup until you've processed
> enough renames.  I don't know if that's interesting....

Not particularly when the whole point is to start with a bind mount, do
something tricky and then have access to the whole filesystem instead of
just the part of the filesystem exposed by the bind mount.

If you control the filesystem you already have access to the entire
filesystem, so you don't need to do something tricky.

That something tricky is a well-placed rename that borks the tree
structure and causes .. to never see the subdirectory that is the root
of the bind mount.

>> If you could look up thread and tell me what you think of the issue with
>> file handle to dentry conversion and bind mounts I would appreciate it.
>
> OK, I see your comments in "[PATCH review 0/6] Bind mount escape fixes",
> I'm not sure I understand yet, I'll take a closer look.

Thanks.

The file handle reconstitution code can certainly be affected by all of
this.  Given that it is a failure if reconnect_path can't reconnect the
path of a file handle, I think it can reasonably be considered an error in
all cases if that path is outside of an exported bind mount, but I don't
know that area particularly well.  The solution might just be: don't
export file handles from bind mounts.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 1/6] mnt: Track which mounts use a dentry as root.
       [not found]               ` <87vbcw9i8g.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-07 10:46                 ` Nikolay Borisov
  0 siblings, 0 replies; 240+ messages in thread
From: Nikolay Borisov @ 2015-08-07 10:46 UTC (permalink / raw)
  To: Eric W. Biederman, Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau



On 08/04/2015 12:26 AM, Eric W. Biederman wrote:
> 
> This is needed infrastructure for better handling of when files
> or directories are moved out from under the root of a bind mount.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  fs/mount.h             |   7 +++
>  fs/namespace.c         | 120 +++++++++++++++++++++++++++++++++++++++++++++++--
>  include/linux/dcache.h |   7 +++
>  3 files changed, 130 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/mount.h b/fs/mount.h
> index 14db05d424f7..e8f22970fe59 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -27,6 +27,12 @@ struct mountpoint {
>  	int m_count;
>  };
>  
> +struct mountroot {
> +	struct hlist_node r_hash;
> +	struct dentry *r_dentry;
> +	struct hlist_head r_list;
> +};
> +
>  struct mount {
>  	struct hlist_node mnt_hash;
>  	struct mount *mnt_parent;
> @@ -55,6 +61,7 @@ struct mount {
>  	struct mnt_namespace *mnt_ns;	/* containing namespace */
>  	struct mountpoint *mnt_mp;	/* where is it mounted */
>  	struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
> +	struct hlist_node mnt_mr_list;	/* list mounts with the same mountroot */
>  #ifdef CONFIG_FSNOTIFY
>  	struct hlist_head mnt_fsnotify_marks;
>  	__u32 mnt_fsnotify_mask;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 2b8aa15fd6df..2ce987af9afa 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -31,6 +31,8 @@ static unsigned int m_hash_mask __read_mostly;
>  static unsigned int m_hash_shift __read_mostly;
>  static unsigned int mp_hash_mask __read_mostly;
>  static unsigned int mp_hash_shift __read_mostly;
> +static unsigned int mr_hash_mask __read_mostly;
> +static unsigned int mr_hash_shift __read_mostly;
>  
>  static __initdata unsigned long mhash_entries;
>  static int __init set_mhash_entries(char *str)
> @@ -52,6 +54,16 @@ static int __init set_mphash_entries(char *str)
>  }
>  __setup("mphash_entries=", set_mphash_entries);
>  
> +static __initdata unsigned long mrhash_entries;
> +static int __init set_mrhash_entries(char *str)
> +{
> +	if (!str)
> +		return 0;
> +	mrhash_entries = simple_strtoul(str, &str, 0);

Nit: Any particular reason for using simple_* rather than kstrto* family
of functions?

> +	return 1;
> +}
> +__setup("mrhash_entries=", set_mrhash_entries);
> +
>  static u64 event;
>  static DEFINE_IDA(mnt_id_ida);
>  static DEFINE_IDA(mnt_group_ida);
> @@ -61,6 +73,7 @@ static int mnt_group_start = 1;
>  
>  static struct hlist_head *mount_hashtable __read_mostly;
>  static struct hlist_head *mountpoint_hashtable __read_mostly;
> +static struct hlist_head *mountroot_hashtable __read_mostly;
>  static struct kmem_cache *mnt_cache __read_mostly;
>  static DECLARE_RWSEM(namespace_sem);
>  
> @@ -93,6 +106,13 @@ static inline struct hlist_head *mp_hash(struct dentry *dentry)
>  	return &mountpoint_hashtable[tmp & mp_hash_mask];
>  }
>  
> +static inline struct hlist_head *mr_hash(struct dentry *dentry)
> +{
> +	unsigned long tmp = ((unsigned long)dentry / L1_CACHE_BYTES);
> +	tmp = tmp + (tmp >> mr_hash_shift);
> +	return &mountroot_hashtable[tmp & mr_hash_mask];
> +}
> +
>  /*
>   * allocation is serialized by namespace_sem, but we need the spinlock to
>   * serialize with freeing.
> @@ -234,6 +254,7 @@ static struct mount *alloc_vfsmnt(const char *name)
>  		INIT_LIST_HEAD(&mnt->mnt_slave_list);
>  		INIT_LIST_HEAD(&mnt->mnt_slave);
>  		INIT_HLIST_NODE(&mnt->mnt_mp_list);
> +		INIT_HLIST_NODE(&mnt->mnt_mr_list);
>  #ifdef CONFIG_FSNOTIFY
>  		INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
>  #endif
> @@ -779,6 +800,77 @@ static void put_mountpoint(struct mountpoint *mp)
>  	}
>  }
>  
> +static struct mountroot *lookup_mountroot(struct dentry *dentry)
> +{
> +	struct hlist_head *chain = mr_hash(dentry);
> +	struct mountroot *mr;
> +
> +	hlist_for_each_entry(mr, chain, r_hash) {
> +		if (mr->r_dentry == dentry)
> +			return mr;
> +	}
> +	return NULL;
> +}
> +
> +static int mnt_set_root(struct mount *mnt, struct dentry *root)
> +{
> +	struct mountroot *mr = NULL;
> +
> +	read_seqlock_excl(&mount_lock);
> +	if (d_mountroot(root))
> +		mr = lookup_mountroot(root);
> +	if (!mr) {
> +		struct mountroot *new;
> +		read_sequnlock_excl(&mount_lock);
> +
> +		new = kmalloc(sizeof(struct mountroot), GFP_KERNEL);
> +		if (!new)
> +			return -ENOMEM;
> +
> +		read_seqlock_excl(&mount_lock);
> +		mr = lookup_mountroot(root);
> +		if (mr) {
> +			kfree(new);
> +		} else {
> +			struct hlist_head *chain = mr_hash(root);
> +
> +			mr = new;
> +			mr->r_dentry = root;
> +			INIT_HLIST_HEAD(&mr->r_list);
> +			hlist_add_head(&mr->r_hash, chain);
> +
> +			spin_lock(&root->d_lock);
> +			root->d_flags |= DCACHE_MOUNTROOT;
> +			spin_unlock(&root->d_lock);
> +		}
> +	}
> +	mnt->mnt.mnt_root = root;
> +	hlist_add_head(&mnt->mnt_mr_list, &mr->r_list);
> +	read_sequnlock_excl(&mount_lock);
> +
> +	return 0;
> +}
> +
> +static void mnt_put_root(struct mount *mnt)
> +{
> +	struct dentry *root = mnt->mnt.mnt_root;
> +	struct mountroot *mr;
> +
> +	read_seqlock_excl(&mount_lock);
> +	mr = lookup_mountroot(root);
> +	BUG_ON(!mr);
> +	hlist_del(&mnt->mnt_mr_list);
> +	if (hlist_empty(&mr->r_list)) {
> +		hlist_del(&mr->r_hash);
> +		spin_lock(&root->d_lock);
> +		root->d_flags &= ~DCACHE_MOUNTROOT;
> +		spin_unlock(&root->d_lock);
> +		kfree(mr);
> +	}
> +	read_sequnlock_excl(&mount_lock);
> +	dput(root);
> +}
> +
>  static inline int check_mnt(struct mount *mnt)
>  {
>  	return mnt->mnt_ns == current->nsproxy->mnt_ns;
> @@ -934,6 +1026,7 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
>  {
>  	struct mount *mnt;
>  	struct dentry *root;
> +	int err;
>  
>  	if (!type)
>  		return ERR_PTR(-ENODEV);
> @@ -952,8 +1045,16 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
>  		return ERR_CAST(root);
>  	}
>  
> -	mnt->mnt.mnt_root = root;
>  	mnt->mnt.mnt_sb = root->d_sb;
> +	err = mnt_set_root(mnt, root);
> +	if (err) {
> +		dput(root);
> +		deactivate_super(mnt->mnt.mnt_sb);
> +		mnt_free_id(mnt);
> +		free_vfsmnt(mnt);
> +		return ERR_PTR(err);
> +	}
> +
>  	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
>  	mnt->mnt_parent = mnt;
>  	lock_mount_hash();
> @@ -985,6 +1086,10 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
>  			goto out_free;
>  	}
>  
> +	err = mnt_set_root(mnt, root);
> +	if (err)
> +		goto out_free;
> +
>  	mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED);
>  	/* Don't allow unprivileged users to change mount flags */
>  	if (flag & CL_UNPRIVILEGED) {
> @@ -1010,7 +1115,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
>  
>  	atomic_inc(&sb->s_active);
>  	mnt->mnt.mnt_sb = sb;
> -	mnt->mnt.mnt_root = dget(root);
> +	dget(root);
>  	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
>  	mnt->mnt_parent = mnt;
>  	lock_mount_hash();
> @@ -1063,7 +1168,7 @@ static void cleanup_mnt(struct mount *mnt)
>  	if (unlikely(mnt->mnt_pins.first))
>  		mnt_pin_kill(mnt);
>  	fsnotify_vfsmount_delete(&mnt->mnt);
> -	dput(mnt->mnt.mnt_root);
> +	mnt_put_root(mnt);
>  	deactivate_super(mnt->mnt.mnt_sb);
>  	mnt_free_id(mnt);
>  	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
> @@ -3120,14 +3225,21 @@ void __init mnt_init(void)
>  				mphash_entries, 19,
>  				0,
>  				&mp_hash_shift, &mp_hash_mask, 0, 0);
> +	mountroot_hashtable = alloc_large_system_hash("Mountroot-cache",
> +				sizeof(struct hlist_head),
> +				mrhash_entries, 19,
> +				0,
> +				&mr_hash_shift, &mr_hash_mask, 0, 0);
>  
> -	if (!mount_hashtable || !mountpoint_hashtable)
> +	if (!mount_hashtable || !mountpoint_hashtable || !mountroot_hashtable)
>  		panic("Failed to allocate mount hash table\n");
>  
>  	for (u = 0; u <= m_hash_mask; u++)
>  		INIT_HLIST_HEAD(&mount_hashtable[u]);
>  	for (u = 0; u <= mp_hash_mask; u++)
>  		INIT_HLIST_HEAD(&mountpoint_hashtable[u]);
> +	for (u = 0; u <= mr_hash_mask; u++)
> +		INIT_HLIST_HEAD(&mountroot_hashtable[u]);
>  
>  	kernfs_init();
>  
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index d67ae119cf4e..52a5e6915f58 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -228,6 +228,8 @@ struct dentry_operations {
>  #define DCACHE_FALLTHRU			0x01000000 /* Fall through to lower layer */
>  #define DCACHE_OP_SELECT_INODE		0x02000000 /* Unioned entry: dcache op selects inode */
>  
> +#define DCACHE_MOUNTROOT		0x04000000 /* Root of a vfsmount */
> +
>  extern seqlock_t rename_lock;
>  
>  /*
> @@ -404,6 +406,11 @@ static inline bool d_mountpoint(const struct dentry *dentry)
>  	return dentry->d_flags & DCACHE_MOUNTED;
>  }
>  
> +static inline bool d_mountroot(const struct dentry *dentry)
> +{
> +	return dentry->d_flags & DCACHE_MOUNTROOT;
> +}
> +
>  /*
>   * Directory cache entry type accessor functions.
>   */
> 

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 1/6] mnt: Track which mounts use a dentry as root.
       [not found]                 ` <55C48C94.6050804-6AxghH7DbtA@public.gmane.org>
@ 2015-08-07 15:43                   ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-07 15:43 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org> writes:

>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index 2b8aa15fd6df..2ce987af9afa 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -31,6 +31,8 @@ static unsigned int m_hash_mask __read_mostly;
>>  static unsigned int m_hash_shift __read_mostly;
>>  static unsigned int mp_hash_mask __read_mostly;
>>  static unsigned int mp_hash_shift __read_mostly;
>> +static unsigned int mr_hash_mask __read_mostly;
>> +static unsigned int mr_hash_shift __read_mostly;
>>  
>>  static __initdata unsigned long mhash_entries;
>>  static int __init set_mhash_entries(char *str)
>> @@ -52,6 +54,16 @@ static int __init set_mphash_entries(char *str)
>>  }
>>  __setup("mphash_entries=", set_mphash_entries);
>>  
>> +static __initdata unsigned long mrhash_entries;
>> +static int __init set_mrhash_entries(char *str)
>> +{
>> +	if (!str)
>> +		return 0;
>> +	mrhash_entries = simple_strtoul(str, &str, 0);
>
> Nit: Any particular reason for using simple_* rather than kstrto* family
> of functions?

That is what set_mhash_entries, and set_mphash_entries do, and I
maintained the existing style in the code.

It does look like a followup change to add error handling in the
pathological cases might be worthwhile.

Although it would probably be even better to convert these hash tables
into rcu resizeable hash tables that can automatically grow to the size
needed.

Eric
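
Editorial sketch, not part of the original mail: the error handling
mentioned above is the main difference between the two parser families.
simple_strtoul() stops at the first character it cannot parse and
reports nothing, while the kernel's kstrtoul() returns an error code for
malformed or out-of-range input.  The userspace C below imitates the
stricter behaviour with strtoul(); parse_entries_strict is a
hypothetical name, and this is an analogy rather than the kernel
implementation (for instance, it does not accept a trailing newline).

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the whole string as an unsigned long.
 * Returns 0 on success, -EINVAL for malformed input, -ERANGE on overflow.
 */
static int parse_entries_strict(const char *s, unsigned long *res)
{
	char *end;
	unsigned long val;

	/* strtoul() would accept leading spaces and a sign; refuse them. */
	if (!s || !isdigit((unsigned char)s[0]))
		return -EINVAL;

	errno = 0;
	val = strtoul(s, &end, 0);	/* base 0: decimal, octal or 0x hex */
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s || *end != '\0')	/* reject trailing garbage */
		return -EINVAL;

	*res = val;
	return 0;
}
```

A boot parameter handler written in this style could reject a value
such as "12abc" instead of silently using 12, which is the pathological
case a follow-up change would need to cover.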

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 4/6] mnt: Track when a directory escapes a bind mount
       [not found]               ` <87egjk9i61.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-10  4:36                 ` Al Viro
       [not found]                   ` <20150810043637.GC14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-08-14  4:10                   ` Eric W. Biederman
  0 siblings, 2 replies; 240+ messages in thread
From: Al Viro @ 2015-08-10  4:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

On Mon, Aug 03, 2015 at 04:27:34PM -0500, Eric W. Biederman wrote:

> - The escape count on struct mount must be incremented both before the
>   rename and after.  If the count is not incremented before the rename
>   it is possible to hit a scenario where the rename happens and the code
>   walks up the directory tree to somewhere outside of the bind mount
>   before the count is touched.  Similarly, without an increment after the
>   rename it is possible for the code to look at the escape count,
>   validate that a path is connected before the rename, and cache the
>   escape count, leading to not retesting whether the path is ok.

Umm...  I wonder if you are overcomplicating the things here.  Sure,
I understand wanting to reduce the checks on "..", but...  It costs you
considerable complexity (especially when it comes to 64bit counts),
it's really brittle (you need to be very careful about the places where
you zero the cached values in fs/namei.c and missing one will lead to
really unpleasant effects there) _and_ it's all for the benefit of 
a very rare case.  With check you are optimizing away not being all that
costly anyway.
 
> - The largest change is in d_unalias, where the two cases are split
>   apart so they can be handled separately.  In the easy case of a
>   rename within the same directory all that is needed is __d_move
>   (escaping a mount is impossible in that case).  In the more involved
>   case mutexes need to be acquired, and now the spin locks need to be
>   dropped so that proper lock aquisition order around __d_move can be
>   arranged.

I _really_ hate that part.  Could you explain WTF is wrong with simply
taking mount_lock in that case of __d_splice_alias() just outside of
rename_lock?

> +	unlock = lock_namespace_rename(dentry, target, false);
> +
>  	write_seqlock(&rename_lock);
>  	__d_move(dentry, target, false);
>  	write_sequnlock(&rename_lock);
> +
> +	if (unlock)
> +		unlock_namespace_rename(unlock, dentry, target, false);
> +

Your unlock_namespace_rename() should've been a static inline.
With the check of unlock != NULL done in there.  Two such inlines,
actually, and to hell with the boolean argument.  Same split for the
lock counterpart, of course.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 5/6] vfs: Test for and handle paths that are unreachable from their mnt_root
       [not found]               ` <878u9s9i1d.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-10  4:38                 ` Al Viro
       [not found]                   ` <20150810043814.GD14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  0 siblings, 1 reply; 240+ messages in thread
From: Al Viro @ 2015-08-10  4:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

On Mon, Aug 03, 2015 at 04:30:22PM -0500, Eric W. Biederman wrote:

> +	if (!is_subdir(nd->path.dentry, mnt->mnt_root))
> +		return false;

Umm...  What's to protect us from racing with d_move() right here?

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 4/6] mnt: Track when a directory escapes a bind mount
       [not found]                   ` <20150810043637.GC14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-08-10  4:43                     ` Al Viro
  2015-08-14  4:10                     ` Eric W. Biederman
  1 sibling, 0 replies; 240+ messages in thread
From: Al Viro @ 2015-08-10  4:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

On Mon, Aug 10, 2015 at 05:36:37AM +0100, Al Viro wrote:

> Your unlock_namespace_rename() should've been a static inline.
> With the check of unlock != NULL done in there.  Two such inlines,
> actually, and to hell with the boolean argument.  Same split for the
> lock counterpart, of course.

PS: that thing should be in fs/dcache.c, at least in the part that
deals with finding the common ancestor, etc.  And __d_move() (and
dentry_lock_for_move()) games with d_ancestor() should be redundant now.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 5/6] vfs: Test for and handle paths that are unreachable from their mnt_root
       [not found]                   ` <20150810043814.GD14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-08-10 19:34                     ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-10 19:34 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Mon, Aug 03, 2015 at 04:30:22PM -0500, Eric W. Biederman wrote:
>
>> +	if (!is_subdir(nd->path.dentry, mnt->mnt_root))
>> +		return false;
>
> Umm...  What's to protect us from racing with d_move() right here?

is_subdir does the read_seqretry on rename_lock.  Which is enough
to ensure connectivity exists at a single moment in time.
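
As a rough illustration of that retry pattern, here is a single-threaded
userspace toy.  It is not the kernel's is_subdir or seqlock code: every
toy_* name is invented for the sketch, and the toy read side does not
spin waiting for a writer the way the real read_seqbegin does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a dentry; a filesystem root points at itself. */
struct toy_dentry {
	struct toy_dentry *parent;
};

/* Toy sequence counter: even means stable, odd means rename in progress. */
static unsigned toy_rename_seq;

static unsigned toy_read_seqbegin(void)
{
	return toy_rename_seq;
}

/* True when the walk may have overlapped a rename and must be redone. */
static bool toy_read_seqretry(unsigned start)
{
	return (start & 1) || toy_rename_seq != start;
}

/* Writer side: bump the counter around the d_parent update. */
static void toy_rename(struct toy_dentry *d, struct toy_dentry *new_parent)
{
	toy_rename_seq++;		/* like write_seqlock: now odd */
	d->parent = new_parent;
	toy_rename_seq++;		/* like write_sequnlock: even again */
}

/* Walk d_parent pointers looking for root. */
static bool toy_walk(const struct toy_dentry *d, const struct toy_dentry *root)
{
	for (;;) {
		if (d == root)
			return true;
		if (d->parent == d)	/* reached the top without meeting root */
			return false;
		d = d->parent;
	}
}

/* Repeat the walk until it ran with no rename in between. */
static bool toy_is_subdir(const struct toy_dentry *d,
			  const struct toy_dentry *root)
{
	bool result;
	unsigned seq;

	do {
		seq = toy_read_seqbegin();
		result = toy_walk(d, root);
	} while (toy_read_seqretry(seq));
	return result;
}
```

The answer is only guaranteed to describe a single moment in time: a
rename may happen immediately after toy_is_subdir returns, which is why
the caller still has to reason about when the check is made.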

Beyond that the entire path lookup races with d_move, and the code
calls path_connected just after finding the parent directory.  That
ensures that at the moment follow_dotdot sets nd->dentry the original
nd->dentry is connected, and by extension so is the new one, as the
new one is an ancestor.

Or are you thinking of a different race?

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 4/6] mnt: Track when a directory escapes a bind mount
       [not found]                   ` <20150810043637.GC14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-08-10  4:43                     ` Al Viro
@ 2015-08-14  4:10                     ` Eric W. Biederman
  1 sibling, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-14  4:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Mon, Aug 03, 2015 at 04:27:34PM -0500, Eric W. Biederman wrote:
>
>> - The escape count on struct mount must be incremented both before the
>>   rename and after.  If the count is not incremented before the rename
>>   it is possible to hit a scenario where the rename happens the code
>>   walks up the directory tree to somewhere outside of the bind mount
>>   before the count is touched.  Similary without a count after the
>>   rename it is possible for the code to look at the escape count
>>   validate a path is connected before the rename and assume cache the
>>   escape count, leading to not retesting the path is ok.
>
> Umm...  I wonder if you are overcomplicating the things here.  Sure,
> I understand wanting to reduce the checks on "..", but...  It costs you
> considerable complexity (especially when it comes to 64bit counts),
> it's really brittle (you need to be very careful about the places where
> you zero the cached values in fs/namei.c and missing one will lead to
> really unpleasant effects there) _and_ it's all for the benefit of 
> a very rare case.  With check you are optimizing away not being all that
> costly anyway.

I had to give this a long hard think.  Algorithms going to O(N^2) when
it is unnecessary really bother me.  I ran some numbers for really deep
directory trees, slow memory, etc and I could not come up with a
scenario where even in its worst case d_ancestor would take anywhere
near as long as one disk seek, and most of the d_ancestor calls would
be much quicker.
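
One way to read the quadratic worry: if every .. step first re-walks
d_parent up to the mount root, a lookup that climbs N levels performs
1 + 2 + ... + N = N(N+1)/2 parent hops.  The toy below just counts
those hops.  Its types and helper names are invented for illustration,
and it models pointer chasing only, not the real d_ancestor or any
cache or memory effects.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a dentry; only the parent link matters here. */
struct toy_dentry {
	struct toy_dentry *parent;
};

/* Parent hops needed to reach root from d (root must be d or an ancestor). */
static unsigned long toy_hops_to(const struct toy_dentry *d,
				 const struct toy_dentry *root)
{
	unsigned long hops = 0;

	while (d != root) {
		d = d->parent;
		hops++;
	}
	return hops;
}

/*
 * Total hops when each ".." step from d up to root is preceded by a
 * full ancestor walk from the current directory to root.
 */
static unsigned long toy_dotdot_cost(const struct toy_dentry *d,
				     const struct toy_dentry *root)
{
	unsigned long total = 0;

	while (d != root) {
		total += toy_hops_to(d, root);	/* the connectivity check */
		d = d->parent;			/* the ".." step itself */
	}
	return total;
}
```

For a chain 1000 directories deep this comes to 500500 hops, all of
them in-memory pointer dereferences.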

So it appears to me that in the worst case a pathname lookup consisting
of a ridiculous number of .. components starting with a cold cache, on a
mount where a directory has escaped is likely to be faster than a
similar lookup going down the tree with many disk seeks.

I don't think the 64bit counts and the zeroing of the cached values are
quite as bad as you make out.  There are much trickier things already in
path name lookup code.  But I do agree that it is easy to get wrong
because nothing will show up in testing, and getting it wrong will have
really unpleasant effects.

I also can't see a scenario where a directory would escape a subtree
that is mounted somewhere without it being a misconfiguration.

So I agree it is not worth it to optimize the code so that there
are an absolute minimum number of d_ancestor calls during pathname
lookup.

Further replacing mnt_escape_count with a mnt_flag makes the code much
simpler.  Which I very much appreciate.

>> - The largest change is in d_unalias, where the two cases are split
>>   apart so they can be handled separately.  In the easy case of a
>>   rename within the same directory all that is needed is __d_move
>>   (escaping a mount is impossible in that case).  In the more involved
>>   case mutexes need to be acquired, and now the spin locks need to be
>>   dropped so that proper lock aquisition order around __d_move can be
>>   arranged.
>
> I _really_ hate that part.  Could you explain WTF is wrong with simply
> taking mount_lock in that case of __d_splice_alias() just outside of
> rename_lock?

Me too.  So upon realizing that inode->i_lock is held longer than
necessary in d_splice_alias I reworked the locking in d_splice_alias.

Updated patches to follow in a little bit.

> PS: that thing should be in fs/dcache.c, at least in the part that
> deals with finding the common ancestor, etc.  And __d_move() (and
> dentry_lock_for_move()) games with d_ancestor() should be redundant
> now.

It does seem reasonable that the BUG_ONs in __d_move that call
d_ancestor can be removed, or simplified by passing the common ancestor
into __d_move.

I don't know the code in dentry_lock_for_move well enough to say
anything except that the d_ancestor call in dentry_lock_for_move looks
reasonable.

Doing anything inside of __d_move or dentry_lock_for_move appears
to be a detour from the cause of preventing escaping from bind mounts.
So while I have no problems with the kinds of changes I hear you
suggesting, unless I encounter something that makes changing
__d_move or dentry_lock_for_move relevant to the work of preventing
escaping from bind mounts I don't plan on touching them while that
is my focus.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH review 0/8] Bind mount escape fixes
       [not found]                     ` <877foymrwt.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-14  4:29                       ` Eric W. Biederman
  2015-08-14  4:30                         ` [PATCH review 1/8] dcache: Handle escaped paths in prepend_path Eric W. Biederman
                                           ` (3 more replies)
  0 siblings, 4 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-14  4:29 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

It is possible in some situations to rename a file or directory through
one mount point such that it can start out inside of a bind mount and
after the rename wind up outside of the bind mount.  Unfortunately with
user namespaces these conditions can be trivially created by creating a
bind mount under an existing bind mount.

I have identified four situations in which this may be a problem.
- __d_path and d_absolute_path need to error on disconnected paths
  that can not reach some root directory or lsm path based security
  checks can incorrectly succeed.

- Normal path name resolution following .. can lead to a directory
  that is outside of the original loopback mount.

- file handle reconstitution aka exportfs_decode_fh can yield a dentry
  from which d_parent can be followed up to mnt->sb->s_root, but
  d_parent can not be followed up to mnt->mnt_root.

- Mounts on a path that has been renamed outside of a loopback mount
  become unreachable, as there is no possible path that can be passed
  to umount to unmount them.

My strategy:

o File handle reconstitution problems can be prevented by enabling
  the nfsd subtree checks for nfs exports, and open_by_handle_at
  requires capable(CAP_DAC_READ_SEARCH) so is only usable by the global
  root.  This makes any problems difficult if not impossible to exploit
  in practice so I have not yet written code to address that issue.

o The functions __d_path and d_absolute_path are augmented so that the
  security modules will not be fed a problematic path to work with.

o Following of .. has been augmented to test that after d_parent has
  been resolved the original directory is connected, and if not
  an error of -ENOENT is returned.

o I do not worry about mounts that are disconnected from their bind
  mount as these mounts can always be freed by either umount -l on
  the bind mount they have escaped from, or by freeing the mount
  namespace.  So I do not believe there is an actual problem.

Pathname resolution is a common fast path and most of the code in this
patchset exists to keep .. from becoming expensive in the common
case.

After hearing Al's feedback and running some numbers I have given
up attempting to keep the number of d_ancestor calls during pathname
resolution to an absolute minimum.  It appears that simply preventing
calls to d_ancestor unless a directory has escaped is good enough.  This
change in approach has significantly simplified the code.
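
The .. handling described above (resolve d_parent, then confirm the
original directory is still connected, and only bother when something
has escaped) can be modeled in userspace roughly as follows.  This is a
sketch under stated assumptions, not the code in the patches: the
toy_* types, the escaped field and toy_follow_dotdot are invented
names, and mount crossing is left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a dentry; a filesystem root points at itself. */
struct toy_dentry {
	struct toy_dentry *parent;
};

/* Toy stand-in for a mount. */
struct toy_mount {
	struct toy_dentry *mnt_root;
	bool escaped;	/* hypothetical flag: a directory left this mount */
};

/* Can mnt_root be reached from d by following parent pointers? */
static bool toy_connected(const struct toy_mount *mnt,
			  const struct toy_dentry *d)
{
	for (;;) {
		if (d == mnt->mnt_root)
			return true;
		if (d->parent == d)	/* hit the fs root first */
			return false;
		d = d->parent;
	}
}

/*
 * Resolve ".." from *dp inside mnt.  Returns 0 on success, or -ENOENT
 * and leaves *dp untouched when the starting directory has escaped.
 */
static int toy_follow_dotdot(const struct toy_mount *mnt,
			     struct toy_dentry **dp)
{
	struct toy_dentry *old = *dp;
	struct toy_dentry *parent;

	if (old == mnt->mnt_root)
		return 0;	/* toy: stay put instead of crossing mounts */

	parent = old->parent;
	/* The expensive walk only runs once an escape has been seen. */
	if (mnt->escaped && !toy_connected(mnt, old))
		return -ENOENT;

	*dp = parent;
	return 0;
}
```

In the common case, where nothing has escaped, the only extra cost per
.. component in this model is the test of the flag.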

The big implementation change to note is that I have rewritten
d_splice_alias and made some significant progress in cleaning up how the
locks are dealt with.  The only limitation now is that
dentry->d_parent->d_inode->i_mutex is taken in lookup and held when
d_splice_alias is called.  If that ever goes away my new
d_splice_alias can easily take all of the locks it needs to rename
a directory in the proper order.

Does anyone see anything significant that I have missed?

These changes are all against v4.2-rc1. 

For those who like to see everything in a single tree the code is at:

     git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

Eric W. Biederman (8):
      dcache: Handle escaped paths in prepend_path
      dcache: Reduce the scope of i_lock in d_splice_alias
      dcache: Clearly separate the two directory rename cases in d_splice_alias
      mnt: Track which mounts use a dentry as root.
      dcache: Implement d_common_ancestor
      dcache: Only read d_flags once is d_is_dir
      mnt: Track when a directory escapes a bind mount
      vfs: Test for and handle paths that are unreachable from their mnt_root

 fs/dcache.c            | 193 +++++++++++++++++++++++++++++++++++++------------
 fs/mount.h             |   9 +++
 fs/namei.c             |  26 ++++++-
 fs/namespace.c         | 152 +++++++++++++++++++++++++++++++++++++-
 include/linux/dcache.h |  11 ++-
 include/linux/mount.h  |   1 +
 6 files changed, 338 insertions(+), 54 deletions(-)

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH review 1/8] dcache: Handle escaped paths in prepend_path
       [not found]                         ` <87wpwyjxwc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-14  4:30                           ` Eric W. Biederman
  2015-08-14  4:30                           ` [PATCH review 2/8] dcache: Reduce the scope of i_lock in d_splice_alias Eric W. Biederman
                                             ` (6 subsequent siblings)
  7 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-14  4:30 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

A rename can result in a dentry that by walking up d_parent
will never reach its mnt_root.  For lack of a better term
I call this an escaped path.

prepend_path is called by four different functions: __d_path,
d_absolute_path, d_path, and getcwd.

__d_path only wants to see paths that are connected to the root it
passes in.  So __d_path needs prepend_path to return an error.

d_absolute_path similarly wants to see paths that are connected to
some root.  Escaped paths are not connected to any mnt_root so
d_absolute_path needs prepend_path to return an error greater
than 1.  So escaped paths will be treated like paths on lazily
unmounted mounts.

getcwd needs to prepend "(unreachable)" so getcwd also needs
prepend_path to return an error.

d_path is the interesting hold out.  d_path just wants to print
something, and does not care about the weird cases.  Which raises
the question what should be printed?

Given that <escaped_path>/<anything> should result in -ENOENT I
believe it is desirable for escaped paths to be printed as empty
paths, as there are not really any meaningful path components when
considered from the perspective of a mount tree.

So tweak prepend_path to return an empty path with a new error
code of 3 when it encounters an escaped path.
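
The intended behaviour can be sketched with a userspace toy.  It is
not the kernel's prepend_path: toy_dentry and toy_prepend_path are
made-up names, the buffer handling is simplified, and crossing from one
mount to its parent is not modeled.  Only the "reached a filesystem
root that is not mnt_root" test mirrors the check added below.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a dentry; a filesystem root points at itself. */
struct toy_dentry {
	struct toy_dentry *parent;
	const char *name;
};

/*
 * Build the path of d relative to mnt_root at the tail of buf.
 * Returns 0 with *out pointing at the path, 3 with *out pointing at
 * an empty string when d has escaped mnt_root, or -1 if buf is too
 * small.
 */
static int toy_prepend_path(const struct toy_dentry *d,
			    const struct toy_dentry *mnt_root,
			    char *buf, size_t len, const char **out)
{
	char *end, *p;

	if (len == 0)
		return -1;
	end = p = buf + len - 1;
	*p = '\0';

	while (d != mnt_root) {
		size_t n;

		if (d->parent == d) {
			/* Escaped: drop the components gathered so far. */
			*out = end;
			return 3;
		}
		n = strlen(d->name);
		if ((size_t)(p - buf) < n + 1)
			return -1;
		p -= n;
		memcpy(p, d->name, n);
		*--p = '/';
		d = d->parent;
	}
	if (p == end) {		/* d was mnt_root itself: print "/" */
		if (p == buf)
			return -1;
		*--p = '/';
	}
	*out = p;
	return 0;
}
```

With a bind mount rooted at a directory a, a child b prints as "/b"
in this model.  Once b is renamed under a sibling of a, the same call
returns 3 and an empty path.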

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/dcache.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 7a3f3e5f9cea..f762e76e85cc 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2923,6 +2923,13 @@ restart:

 		if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
 			struct mount *parent = ACCESS_ONCE(mnt->mnt_parent);
+			/* Escaped? */
+			if (dentry != vfsmnt->mnt_root) {
+				bptr = *buffer;
+				blen = *buflen;
+				error = 3;
+				break;
+			}
 			/* Global root? */
 			if (mnt != parent) {
 				dentry = ACCESS_ONCE(mnt->mnt_mountpoint);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 2/8] dcache: Reduce the scope of i_lock in d_splice_alias
       [not found]                         ` <87wpwyjxwc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-08-14  4:30                           ` [PATCH review 1/8] dcache: Handle escaped paths in prepend_path Eric W. Biederman
@ 2015-08-14  4:30                           ` Eric W. Biederman
  2015-08-14  4:31                           ` [PATCH review 3/8] dcache: Clearly separate the two directory rename cases " Eric W. Biederman
                                             ` (5 subsequent siblings)
  7 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-14  4:30 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau


i_lock is only needed until __d_find_any_alias calls dget on the alias
dentry.  After that the reference to new ensures that dentry_kill and
d_delete will not remove the inode from the dentry, and remove the
dentry from the inode->i_dentry list.

The inode i_lock came to be held over the __d_move calls in
d_splice_alias through the successive introduction of locks with
increasingly smaller scope.  First it was the dcache_lock, then
it was the dcache_inode_lock, and finally inode->i_lock.

Furthermore inode->i_lock is not held over any other calls
to d_move or __d_move so it can not provide any meaningful
rename protection.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/dcache.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f762e76e85cc..53b7f1e63beb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2715,7 +2715,7 @@ struct dentry *d_ancestor(struct dentry *p1, struct dentry *p2)
  * This helper attempts to cope with remotely renamed directories
  *
  * It assumes that the caller is already holding
- * dentry->d_parent->d_inode->i_mutex, inode->i_lock and rename_lock
+ * dentry->d_parent->d_inode->i_mutex, and rename_lock
  *
  * Note: If ever the locking in lock_rename() changes, then please
  * remember to update this too...
@@ -2741,7 +2741,6 @@ out_unalias:
 	__d_move(alias, dentry, false);
 	ret = 0;
 out_err:
-	spin_unlock(&inode->i_lock);
 	if (m2)
 		mutex_unlock(m2);
 	if (m1)
@@ -2787,10 +2786,11 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 	if (S_ISDIR(inode->i_mode)) {
 		struct dentry *new = __d_find_any_alias(inode);
 		if (unlikely(new)) {
+			/* The reference to new ensures it remains an alias */
+			spin_unlock(&inode->i_lock);
 			write_seqlock(&rename_lock);
 			if (unlikely(d_ancestor(new, dentry))) {
 				write_sequnlock(&rename_lock);
-				spin_unlock(&inode->i_lock);
 				dput(new);
 				new = ERR_PTR(-ELOOP);
 				pr_warn_ratelimited(
@@ -2809,7 +2809,6 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 			} else {
 				__d_move(new, dentry, false);
 				write_sequnlock(&rename_lock);
-				spin_unlock(&inode->i_lock);
 				security_d_instantiate(new, inode);
 			}
 			iput(inode);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 3/8] dcache: Clearly separate the two directory rename cases in d_splice_alias
       [not found]                         ` <87wpwyjxwc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-08-14  4:30                           ` [PATCH review 1/8] dcache: Handle escaped paths in prepend_path Eric W. Biederman
  2015-08-14  4:30                           ` [PATCH review 2/8] dcache: Reduce the scope of i_lock in d_splice_alias Eric W. Biederman
@ 2015-08-14  4:31                           ` Eric W. Biederman
  2015-08-14  4:32                           ` [PATCH review 4/8] mnt: Track which mounts use a dentry as root Eric W. Biederman
                                             ` (4 subsequent siblings)
  7 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-14  4:31 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau


There are two scenarios that can result in an alias being found in
d_splice_alias.  The first scenario is a disconnected dentry being
looked up and reconnected (common with nfs file handles, and
open_by_handle_at).  The second scenario is a lookup on a directory
lazily discovering that it was renamed on the remote filesystem.

The locking challenge for handling these two scenarios is that
the instance of lookup calling d_splice_alias has already
taken the dentry->d_parent->d_inode->i_mutex.

If the mutex were not held we could simply do:
	rename_mutex = &inode->i_sb->s_vfs_rename_mutex;
	mutex_lock(rename_mutex);
	if (d_ancestor(alias, dentry)) {
		mutex_unlock(rename_mutex);
		pr_warn_ratelimited(...);
		return -ELOOP;
	}
	m1 = &dentry->d_parent->d_inode->i_mutex;
	mutex_lock_nested(m1, I_MUTEX_PARENT);
	if (!IS_ROOT(alias) && (alias->d_parent != dentry->d_parent)) {
		m2 = &alias->d_parent->d_inode->i_mutex;
		mutex_lock_nested(m2, I_MUTEX_PARENT2);
	}
	d_move(alias, dentry);
	if (m2)
		mutex_unlock(m2);
	mutex_unlock(m1);
	mutex_unlock(rename_mutex);

Which is essentially lock_rename and unlock_rename with added
handling of the IS_ROOT case.

The problem is that, as the lookup locking stands today, grabbing the
s_vfs_rename_mutex must use mutex_trylock, which can fail, so for
reliability reasons we need to avoid using the rename_mutex as much as
possible.

For the case where a disconnected dentry is being connected (aka the
IS_ROOT case) the d_ancestor test can be placed under the rename_lock,
removing any chance that taking locks will cause a failure.

For the lazy rename case things are trickier because when
dentry->d_parent and alias->d_parent are not equal the code needs to
take the s_vfs_rename_mutex, or risk the d_ancestor call in
lock_rename being inaccurate.  As s_vfs_rename_mutex is taken first
there are no lock ordering issues with taking
alias->d_parent->d_inode->i_mutex.  Furthermore, as games with
rename_lock are unnecessary, d_move can be called instead of __d_move.

Sleeping in d_splice_alias is not something new as
security_d_instantiate is a sleeping function.

Compared to the current implementation this change introduces a
function for each case, __d_rename_alias and __d_connect_alias, and
moves the acquisition of rename_lock down into those functions.

In the case of __d_rename_alias, which was __d_unalias, this allows the
existing lock ordering rules to be followed much more closely, removing
the need for a trylock when acquiring
alias->d_parent->d_inode->i_mutex, and allowing the use of d_move.

A common helper that prints the warning message when a loop is
detected is factored out into d_alias_is_ancestor so that code does not
need to be duplicated in both cases.
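
As a rough user-space illustration of the rename case described above
(not the kernel code itself), the sketch below shows the lock shape: the
filesystem-wide rename mutex can only be tried, so contention is
reported as -ESTALE, while the second parent's mutex is taken with a
blocking lock because the ordering is already correct.  The toy_* names
and the atomic_flag based mutex are assumptions made for this example.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel mutex. */
struct toy_mutex {
	atomic_flag held;
};

static bool toy_trylock(struct toy_mutex *m)
{
	/* test_and_set returns the old value: false means we got the lock */
	return !atomic_flag_test_and_set(&m->held);
}

static void toy_lock(struct toy_mutex *m)
{
	while (atomic_flag_test_and_set(&m->held))
		;	/* spin; a real mutex would sleep here */
}

static void toy_unlock(struct toy_mutex *m)
{
	atomic_flag_clear(&m->held);
}

/*
 * Same shape as the rename case: try the rename mutex, then block on
 * the other parent's mutex.  *moves stands in for calling d_move().
 */
static int toy_rename_alias(struct toy_mutex *rename_mutex,
			    struct toy_mutex *parent_mutex,
			    bool same_parent, int *moves)
{
	if (!same_parent) {
		if (!toy_trylock(rename_mutex))
			return -ESTALE;
		toy_lock(parent_mutex);
	}
	(*moves)++;
	if (!same_parent) {
		toy_unlock(parent_mutex);
		toy_unlock(rename_mutex);
	}
	return 0;
}
```

When both dentries share a parent the sketch takes no extra locks at
all, which is why a contended rename mutex only matters for the
cross-directory case.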

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/dcache.c | 111 ++++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 67 insertions(+), 44 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 53b7f1e63beb..c1eece74621f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2711,41 +2711,77 @@ struct dentry *d_ancestor(struct dentry *p1, struct dentry *p2)
 	return NULL;
 }
 
+static struct dentry *d_alias_is_ancestor(struct dentry *alias,
+					  struct dentry *dentry)
+{
+	struct dentry *ancestor = d_ancestor(alias, dentry);
+	if (ancestor) {
+		pr_warn_ratelimited(
+			"VFS: Lookup of '%s' in %s %s would have caused loop\n",
+			dentry->d_name.name,
+			dentry->d_sb->s_type->name,
+			dentry->d_sb->s_id);
+	}
+	return ancestor;
+}
+
 /*
  * This helper attempts to cope with remotely renamed directories
  *
  * It assumes that the caller is already holding
- * dentry->d_parent->d_inode->i_mutex, and rename_lock
+ * dentry->d_parent->d_inode->i_mutex
  *
  * Note: If ever the locking in lock_rename() changes, then please
  * remember to update this too...
  */
-static int __d_unalias(struct inode *inode,
-		struct dentry *dentry, struct dentry *alias)
+static int __d_rename_alias(struct dentry *alias, struct dentry *dentry)
 {
-	struct mutex *m1 = NULL, *m2 = NULL;
-	int ret = -ESTALE;
+	struct mutex *rename_mutex = NULL, *parent_mutex = NULL;
 
-	/* If alias and dentry share a parent, then no extra locks required */
-	if (alias->d_parent == dentry->d_parent)
-		goto out_unalias;
+	/* Holding dentry->d_parent->d_inode->i_mutex guarantees this
+	 * that the equality or inequality alias->d_parent and
+	 * dentry->d_parent remains stable.
+	 */
+	if (alias->d_parent != dentry->d_parent) {
+		/* See lock_rename() */
+		rename_mutex = &dentry->d_sb->s_vfs_rename_mutex;
+		if (!mutex_trylock(rename_mutex))
+			return -ESTALE;
+
+		if (d_alias_is_ancestor(alias, dentry)) {
+			mutex_unlock(rename_mutex);
+			return -ELOOP;
+		}
+		parent_mutex = &alias->d_parent->d_inode->i_mutex;
+		mutex_lock_nested(parent_mutex, I_MUTEX_PARENT2);
+	}
+	d_move(alias, dentry);
+	if (parent_mutex) {
+		mutex_unlock(parent_mutex);
+		mutex_unlock(rename_mutex);
+	}
+	return 0;
+}
 
-	/* See lock_rename() */
-	if (!mutex_trylock(&dentry->d_sb->s_vfs_rename_mutex))
-		goto out_err;
-	m1 = &dentry->d_sb->s_vfs_rename_mutex;
-	if (!mutex_trylock(&alias->d_parent->d_inode->i_mutex))
-		goto out_err;
-	m2 = &alias->d_parent->d_inode->i_mutex;
-out_unalias:
+/*
+ * This helper connects disconnected dentries.
+ *
+ * It assumes that the caller is already holding
+ * dentry->d_parent->d_inode->i_mutex
+ *
+ */
+static int __d_connect_alias(struct inode *inode,
+			     struct dentry *alias, struct dentry *dentry)
+{
+	write_seqlock(&rename_lock);
+	if (unlikely(d_alias_is_ancestor(alias, dentry))) {
+		write_sequnlock(&rename_lock);
+		return -ELOOP;
+	}
 	__d_move(alias, dentry, false);
-	ret = 0;
-out_err:
-	if (m2)
-		mutex_unlock(m2);
-	if (m1)
-		mutex_unlock(m1);
-	return ret;
+	write_sequnlock(&rename_lock);
+	security_d_instantiate(alias, inode);
+	return 0;
 }
 
 /**
@@ -2786,30 +2822,17 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 	if (S_ISDIR(inode->i_mode)) {
 		struct dentry *new = __d_find_any_alias(inode);
 		if (unlikely(new)) {
+			int err;
 			/* The reference to new ensures it remains an alias */
 			spin_unlock(&inode->i_lock);
-			write_seqlock(&rename_lock);
-			if (unlikely(d_ancestor(new, dentry))) {
-				write_sequnlock(&rename_lock);
+
+			if (!IS_ROOT(new))
+				err = __d_rename_alias(new, dentry);
+			else
+				err = __d_connect_alias(inode, new, dentry);
+			if (err) {
 				dput(new);
-				new = ERR_PTR(-ELOOP);
-				pr_warn_ratelimited(
-					"VFS: Lookup of '%s' in %s %s"
-					" would have caused loop\n",
-					dentry->d_name.name,
-					inode->i_sb->s_type->name,
-					inode->i_sb->s_id);
-			} else if (!IS_ROOT(new)) {
-				int err = __d_unalias(inode, dentry, new);
-				write_sequnlock(&rename_lock);
-				if (err) {
-					dput(new);
-					new = ERR_PTR(err);
-				}
-			} else {
-				__d_move(new, dentry, false);
-				write_sequnlock(&rename_lock);
-				security_d_instantiate(new, inode);
+				new = ERR_PTR(err);
 			}
 			iput(inode);
 			return new;
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 4/8] mnt: Track which mounts use a dentry as root.
       [not found]                         ` <87wpwyjxwc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                                             ` (2 preceding siblings ...)
  2015-08-14  4:31                           ` [PATCH review 3/8] dcache: Clearly separate the two directory rename cases " Eric W. Biederman
@ 2015-08-14  4:32                           ` Eric W. Biederman
  2015-08-14  4:33                           ` [PATCH review 5/8] dcache: Implement d_common_ancestor Eric W. Biederman
                                             ` (3 subsequent siblings)
  7 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-14  4:32 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau


This is needed infrastructure for better handling of when files
or directories are moved out from under the root of a bind mount.
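
The bookkeeping can be pictured with a small user-space sketch: a hash
table keyed by the root pointer, with one entry per dentry that is the
root of at least one mount.  This is only an illustration under
assumptions: the toy_* names, the fixed table size, the 64-byte
cache-line constant, and the use of a plain user count instead of a
list of mounts are all invented for the example, and no locking is
shown.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_HASH_SHIFT	4
#define TOY_HASH_SIZE	(1u << TOY_HASH_SHIFT)
#define TOY_CACHE_BYTES	64	/* assumed cache line size */

/* One entry per dentry that is currently the root of some mount. */
struct toy_root {
	struct toy_root *next;
	const void *dentry;
	unsigned int users;	/* number of mounts rooted here */
};

static struct toy_root *toy_table[TOY_HASH_SIZE];

/* Pointer hash in the same style as mr_hash() in the patch. */
static unsigned int toy_hash(const void *dentry)
{
	uintptr_t tmp = (uintptr_t)dentry / TOY_CACHE_BYTES;

	tmp += tmp >> TOY_HASH_SHIFT;
	return (unsigned int)(tmp & (TOY_HASH_SIZE - 1));
}

static struct toy_root *toy_lookup(const void *dentry)
{
	struct toy_root *r;

	for (r = toy_table[toy_hash(dentry)]; r; r = r->next)
		if (r->dentry == dentry)
			return r;
	return NULL;
}

/* Record one more mount using @dentry as its root. */
static int toy_set_root(const void *dentry)
{
	struct toy_root *r = toy_lookup(dentry);

	if (!r) {
		unsigned int h = toy_hash(dentry);

		r = malloc(sizeof(*r));
		if (!r)
			return -ENOMEM;
		r->dentry = dentry;
		r->users = 0;
		r->next = toy_table[h];
		toy_table[h] = r;
	}
	r->users++;
	return 0;
}

/* Drop one mount; the entry disappears with its last user. */
static void toy_put_root(const void *dentry)
{
	struct toy_root **pp = &toy_table[toy_hash(dentry)];

	while (*pp && (*pp)->dentry != dentry)
		pp = &(*pp)->next;
	if (*pp && --(*pp)->users == 0) {
		struct toy_root *dead = *pp;

		*pp = dead->next;
		free(dead);
	}
}

static int toy_is_mountroot(const void *dentry)
{
	return toy_lookup(dentry) != NULL;
}
```

In the patch itself the equivalent of toy_is_mountroot() is a flag test
(DCACHE_MOUNTROOT), so the hash lookup is only needed when the list of
mounts is actually wanted.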

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/mount.h             |   7 +++
 fs/namespace.c         | 120 +++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dcache.h |   7 +++
 3 files changed, 130 insertions(+), 4 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 14db05d424f7..e8f22970fe59 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -27,6 +27,12 @@ struct mountpoint {
 	int m_count;
 };
 
+struct mountroot {
+	struct hlist_node r_hash;
+	struct dentry *r_dentry;
+	struct hlist_head r_list;
+};
+
 struct mount {
 	struct hlist_node mnt_hash;
 	struct mount *mnt_parent;
@@ -55,6 +61,7 @@ struct mount {
 	struct mnt_namespace *mnt_ns;	/* containing namespace */
 	struct mountpoint *mnt_mp;	/* where is it mounted */
 	struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
+	struct hlist_node mnt_mr_list;	/* list mounts with the same mountroot */
 #ifdef CONFIG_FSNOTIFY
 	struct hlist_head mnt_fsnotify_marks;
 	__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index c7cb8a526c05..af6abf476394 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -31,6 +31,8 @@ static unsigned int m_hash_mask __read_mostly;
 static unsigned int m_hash_shift __read_mostly;
 static unsigned int mp_hash_mask __read_mostly;
 static unsigned int mp_hash_shift __read_mostly;
+static unsigned int mr_hash_mask __read_mostly;
+static unsigned int mr_hash_shift __read_mostly;
 
 static __initdata unsigned long mhash_entries;
 static int __init set_mhash_entries(char *str)
@@ -52,6 +54,16 @@ static int __init set_mphash_entries(char *str)
 }
 __setup("mphash_entries=", set_mphash_entries);
 
+static __initdata unsigned long mrhash_entries;
+static int __init set_mrhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	mrhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("mrhash_entries=", set_mrhash_entries);
+
 static u64 event;
 static DEFINE_IDA(mnt_id_ida);
 static DEFINE_IDA(mnt_group_ida);
@@ -61,6 +73,7 @@ static int mnt_group_start = 1;
 
 static struct hlist_head *mount_hashtable __read_mostly;
 static struct hlist_head *mountpoint_hashtable __read_mostly;
+static struct hlist_head *mountroot_hashtable __read_mostly;
 static struct kmem_cache *mnt_cache __read_mostly;
 static DECLARE_RWSEM(namespace_sem);
 
@@ -93,6 +106,13 @@ static inline struct hlist_head *mp_hash(struct dentry *dentry)
 	return &mountpoint_hashtable[tmp & mp_hash_mask];
 }
 
+static inline struct hlist_head *mr_hash(struct dentry *dentry)
+{
+	unsigned long tmp = ((unsigned long)dentry / L1_CACHE_BYTES);
+	tmp = tmp + (tmp >> mr_hash_shift);
+	return &mountroot_hashtable[tmp & mr_hash_mask];
+}
+
 /*
  * allocation is serialized by namespace_sem, but we need the spinlock to
  * serialize with freeing.
@@ -234,6 +254,7 @@ static struct mount *alloc_vfsmnt(const char *name)
 		INIT_LIST_HEAD(&mnt->mnt_slave_list);
 		INIT_LIST_HEAD(&mnt->mnt_slave);
 		INIT_HLIST_NODE(&mnt->mnt_mp_list);
+		INIT_HLIST_NODE(&mnt->mnt_mr_list);
 #ifdef CONFIG_FSNOTIFY
 		INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
 #endif
@@ -779,6 +800,77 @@ static void put_mountpoint(struct mountpoint *mp)
 	}
 }
 
+static struct mountroot *lookup_mountroot(struct dentry *dentry)
+{
+	struct hlist_head *chain = mr_hash(dentry);
+	struct mountroot *mr;
+
+	hlist_for_each_entry(mr, chain, r_hash) {
+		if (mr->r_dentry == dentry)
+			return mr;
+	}
+	return NULL;
+}
+
+static int mnt_set_root(struct mount *mnt, struct dentry *root)
+{
+	struct mountroot *mr = NULL;
+
+	read_seqlock_excl(&mount_lock);
+	if (d_mountroot(root))
+		mr = lookup_mountroot(root);
+	if (!mr) {
+		struct mountroot *new;
+		read_sequnlock_excl(&mount_lock);
+
+		new = kmalloc(sizeof(struct mountroot), GFP_KERNEL);
+		if (!new)
+			return -ENOMEM;
+
+		read_seqlock_excl(&mount_lock);
+		mr = lookup_mountroot(root);
+		if (mr) {
+			kfree(new);
+		} else {
+			struct hlist_head *chain = mr_hash(root);
+
+			mr = new;
+			mr->r_dentry = root;
+			INIT_HLIST_HEAD(&mr->r_list);
+			hlist_add_head(&mr->r_hash, chain);
+
+			spin_lock(&root->d_lock);
+			root->d_flags |= DCACHE_MOUNTROOT;
+			spin_unlock(&root->d_lock);
+		}
+	}
+	mnt->mnt.mnt_root = root;
+	hlist_add_head(&mnt->mnt_mr_list, &mr->r_list);
+	read_sequnlock_excl(&mount_lock);
+
+	return 0;
+}
+
+static void mnt_put_root(struct mount *mnt)
+{
+	struct dentry *root = mnt->mnt.mnt_root;
+	struct mountroot *mr;
+
+	read_seqlock_excl(&mount_lock);
+	mr = lookup_mountroot(root);
+	BUG_ON(!mr);
+	hlist_del(&mnt->mnt_mr_list);
+	if (hlist_empty(&mr->r_list)) {
+		hlist_del(&mr->r_hash);
+		spin_lock(&root->d_lock);
+		root->d_flags &= ~DCACHE_MOUNTROOT;
+		spin_unlock(&root->d_lock);
+		kfree(mr);
+	}
+	read_sequnlock_excl(&mount_lock);
+	dput(root);
+}
+
 static inline int check_mnt(struct mount *mnt)
 {
 	return mnt->mnt_ns == current->nsproxy->mnt_ns;
@@ -934,6 +1026,7 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 {
 	struct mount *mnt;
 	struct dentry *root;
+	int err;
 
 	if (!type)
 		return ERR_PTR(-ENODEV);
@@ -952,8 +1045,16 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 		return ERR_CAST(root);
 	}
 
-	mnt->mnt.mnt_root = root;
 	mnt->mnt.mnt_sb = root->d_sb;
+	err = mnt_set_root(mnt, root);
+	if (err) {
+		dput(root);
+		deactivate_super(mnt->mnt.mnt_sb);
+		mnt_free_id(mnt);
+		free_vfsmnt(mnt);
+		return ERR_PTR(err);
+	}
+
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
 	lock_mount_hash();
@@ -985,6 +1086,10 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 			goto out_free;
 	}
 
+	err = mnt_set_root(mnt, root);
+	if (err)
+		goto out_free;
+
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED);
 	/* Don't allow unprivileged users to change mount flags */
 	if (flag & CL_UNPRIVILEGED) {
@@ -1010,7 +1115,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 
 	atomic_inc(&sb->s_active);
 	mnt->mnt.mnt_sb = sb;
-	mnt->mnt.mnt_root = dget(root);
+	dget(root);
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
 	lock_mount_hash();
@@ -1063,7 +1168,7 @@ static void cleanup_mnt(struct mount *mnt)
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	fsnotify_vfsmount_delete(&mnt->mnt);
-	dput(mnt->mnt.mnt_root);
+	mnt_put_root(mnt);
 	deactivate_super(mnt->mnt.mnt_sb);
 	mnt_free_id(mnt);
 	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
@@ -3096,14 +3201,21 @@ void __init mnt_init(void)
 				mphash_entries, 19,
 				0,
 				&mp_hash_shift, &mp_hash_mask, 0, 0);
+	mountroot_hashtable = alloc_large_system_hash("Mountroot-cache",
+				sizeof(struct hlist_head),
+				mrhash_entries, 19,
+				0,
+				&mr_hash_shift, &mr_hash_mask, 0, 0);
 
-	if (!mount_hashtable || !mountpoint_hashtable)
+	if (!mount_hashtable || !mountpoint_hashtable || !mountroot_hashtable)
 		panic("Failed to allocate mount hash table\n");
 
 	for (u = 0; u <= m_hash_mask; u++)
 		INIT_HLIST_HEAD(&mount_hashtable[u]);
 	for (u = 0; u <= mp_hash_mask; u++)
 		INIT_HLIST_HEAD(&mountpoint_hashtable[u]);
+	for (u = 0; u <= mr_hash_mask; u++)
+		INIT_HLIST_HEAD(&mountroot_hashtable[u]);
 
 	kernfs_init();
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index d2d50249b7b2..06bed2a1053c 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -228,6 +228,8 @@ struct dentry_operations {
 #define DCACHE_FALLTHRU			0x01000000 /* Fall through to lower layer */
 #define DCACHE_OP_SELECT_INODE		0x02000000 /* Unioned entry: dcache op selects inode */
 
+#define DCACHE_MOUNTROOT		0x04000000 /* Root of a vfsmount */
+
 extern seqlock_t rename_lock;
 
 /*
@@ -403,6 +405,11 @@ static inline bool d_mountpoint(const struct dentry *dentry)
 	return dentry->d_flags & DCACHE_MOUNTED;
 }
 
+static inline bool d_mountroot(const struct dentry *dentry)
+{
+	return dentry->d_flags & DCACHE_MOUNTROOT;
+}
+
 /*
  * Directory cache entry type accessor functions.
  */
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 5/8] dcache: Implement d_common_ancestor
       [not found]                         ` <87wpwyjxwc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                                             ` (3 preceding siblings ...)
  2015-08-14  4:32                           ` [PATCH review 4/8] mnt: Track which mounts use a dentry as root Eric W. Biederman
@ 2015-08-14  4:33                           ` Eric W. Biederman
  2015-08-14  4:34                           ` [PATCH review 6/8] dcache: Only read d_flags once is d_is_dir Eric W. Biederman
                                             ` (2 subsequent siblings)
  7 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-14  4:33 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau


If possible find the common ancestor of two dentries.

This is necessary infrastructure for better handling the case
when a dentry is moved out from under the root of a bind mount.
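
For readers who want to try the algorithm outside the kernel, here is a
minimal user-space sketch of the same idea: bring both nodes to the
same depth, then walk up in lockstep until they meet.  The struct node
type and the function names are assumptions for the example; as with
dentries, a root is modelled as a node that is its own parent.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct dentry: a root is its own parent. */
struct node {
	struct node *parent;
};

static int node_is_root(const struct node *n)
{
	return n->parent == n;
}

static unsigned long node_depth(const struct node *n)
{
	unsigned long depth = 0;

	while (!node_is_root(n)) {
		n = n->parent;
		depth++;
	}
	return depth;
}

/* Returns the deepest common ancestor, or NULL for unrelated trees. */
static const struct node *common_ancestor(const struct node *left,
					  const struct node *right)
{
	unsigned long ldepth = node_depth(left);
	unsigned long rdepth = node_depth(right);

	/* Lift the deeper node until both are at the same depth. */
	while (ldepth > rdepth) {
		left = left->parent;
		ldepth--;
	}
	while (rdepth > ldepth) {
		right = right->parent;
		rdepth--;
	}

	/* Walk up together; two distinct roots mean no common ancestor. */
	while (left != right) {
		if (node_is_root(left))
			return NULL;
		left = left->parent;
		right = right->parent;
	}
	return left;
}
```

The NULL result covers the case of two dentries that live in different
disconnected trees, which is why the description says "if possible".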

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/dcache.c            | 37 +++++++++++++++++++++++++++++++++++++
 include/linux/dcache.h |  1 +
 2 files changed, 38 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index c1eece74621f..1f2f51055515 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2469,6 +2469,43 @@ void dentry_update_name_case(struct dentry *dentry, struct qstr *name)
 }
 EXPORT_SYMBOL(dentry_update_name_case);
 
+static unsigned long d_depth(const struct dentry *dentry)
+{
+	unsigned long depth = 0;
+
+	while (!IS_ROOT(dentry)) {
+		dentry = dentry->d_parent;
+		depth++;
+	}
+	return depth;
+}
+
+const struct dentry *d_common_ancestor(const struct dentry *left,
+				       const struct dentry *right)
+{
+	unsigned long ldepth = d_depth(left);
+	unsigned long rdepth = d_depth(right);
+
+	while (ldepth > rdepth) {
+		left = left->d_parent;
+		ldepth--;
+	}
+
+	while (rdepth > ldepth) {
+		right = right->d_parent;
+		rdepth--;
+	}
+
+	while (left != right) {
+		if (IS_ROOT(left))
+			return NULL;
+		left = left->d_parent;
+		right = right->d_parent;
+	}
+
+	return left;
+}
+
 static void swap_names(struct dentry *dentry, struct dentry *target)
 {
 	if (unlikely(dname_external(target))) {
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 06bed2a1053c..5b69856b45a2 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -313,6 +313,7 @@ extern void dentry_update_name_case(struct dentry *, struct qstr *);
 extern void d_move(struct dentry *, struct dentry *);
 extern void d_exchange(struct dentry *, struct dentry *);
 extern struct dentry *d_ancestor(struct dentry *, struct dentry *);
+extern const struct dentry *d_common_ancestor(const struct dentry *, const struct dentry *);
 
 /* appendix may either be NULL or be used for transname suffixes */
 extern struct dentry *d_lookup(const struct dentry *, const struct qstr *);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 6/8] dcache: Only read d_flags once is d_is_dir
       [not found]                         ` <87wpwyjxwc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                                             ` (4 preceding siblings ...)
  2015-08-14  4:33                           ` [PATCH review 5/8] dcache: Implement d_common_ancestor Eric W. Biederman
@ 2015-08-14  4:34                           ` Eric W. Biederman
  2015-08-14  4:35                           ` [PATCH review 7/8] mnt: Track when a directory escapes a bind mount Eric W. Biederman
  2015-08-14  4:36                           ` [PATCH review 8/8] vfs: Test for and handle paths that are unreachable from their mnt_root Eric W. Biederman
  7 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-14  4:34 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau


Cache the value of __d_entry_type in d_is_dir and test whether it is
equal to DCACHE_DIRECTORY_TYPE or DCACHE_AUTODIR_TYPE.

The generated assembly goes from:
	movl	(%rdi), %eax  # MEM[(volatile __u32 *)dentry_3(D)], tmp73
	andl	$7340032, %eax  #, tmp73
	cmpl	$2097152, %eax  #, tmp73
	je	.L1091	#,
	movl	(%rdi), %eax	# MEM[(volatile __u32 *)dentry_3(D)], tmp74
	andl	$7340032, %eax	#, tmp74
	cmpl	$3145728, %eax	#, tmp74
	je	.L1091	#,
to:
	movl	(%rdi), %eax	# MEM[(volatile __u32 *)dentry_3(D)], tmp71
	andl	$6291456, %eax	#, tmp71
	cmpl	$2097152, %eax	#, tmp71
	jne	.L1091	  #,

Which, with only one read of d_flags, one comparison and one jump, is
dramatically better code.

As __d_entry_type is not written in a way that allows the compiler to
optimize away anything that it does, the optimization needs to be
performed manually wherever it is possible and reasonable.
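
The effect can be reproduced in user space with a small sketch.  The
constants below are taken from the immediates in the assembly listings
above (7340032 = 0x00700000, 2097152 = 0x00200000, 3145728 =
0x00300000, 6291456 = 0x00600000); the toy_* names and the struct are
assumptions for the example, and the volatile cast stands in for
whatever forces a real read in __d_entry_type.

```c
#include <stdbool.h>

#define TOY_ENTRY_TYPE		0x00700000u	/* type field mask */
#define TOY_DIRECTORY_TYPE	0x00200000u
#define TOY_AUTODIR_TYPE	0x00300000u

struct toy_dentry {
	unsigned int d_flags;
};

/* Each call performs a real load because of the volatile access. */
static unsigned int toy_entry_type(const struct toy_dentry *d)
{
	return *(const volatile unsigned int *)&d->d_flags & TOY_ENTRY_TYPE;
}

/* Old shape: the type is fetched once per comparison. */
static bool toy_is_dir_two_reads(const struct toy_dentry *d)
{
	return toy_entry_type(d) == TOY_DIRECTORY_TYPE ||
	       toy_entry_type(d) == TOY_AUTODIR_TYPE;
}

/* New shape: fetch once, compare the cached value twice. */
static bool toy_is_dir_one_read(const struct toy_dentry *d)
{
	unsigned int type = toy_entry_type(d);

	return type == TOY_DIRECTORY_TYPE || type == TOY_AUTODIR_TYPE;
}

/*
 * What the compiler made of the new shape: the two type values differ
 * only in bit 20, so masking that bit out leaves a single comparison.
 */
static bool toy_is_dir_folded(const struct toy_dentry *d)
{
	return (d->d_flags & 0x00600000u) == TOY_DIRECTORY_TYPE;
}
```

All three functions agree for every value of the type field; only the
number of loads and branches differs.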

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/dcache.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 5b69856b45a2..82eb50aaf446 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -443,7 +443,8 @@ static inline bool d_is_autodir(const struct dentry *dentry)
 
 static inline bool d_is_dir(const struct dentry *dentry)
 {
-	return d_can_lookup(dentry) || d_is_autodir(dentry);
+	unsigned type = __d_entry_type(dentry);
+	return (type == DCACHE_DIRECTORY_TYPE) || (type == DCACHE_AUTODIR_TYPE);
 }
 
 static inline bool d_is_symlink(const struct dentry *dentry)
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 7/8] mnt: Track when a directory escapes a bind mount
       [not found]                         ` <87wpwyjxwc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                                             ` (5 preceding siblings ...)
  2015-08-14  4:34                           ` [PATCH review 6/8] dcache: Only read d_flags once is d_is_dir Eric W. Biederman
@ 2015-08-14  4:35                           ` Eric W. Biederman
  2015-08-14  4:36                           ` [PATCH review 8/8] vfs: Test for and handle paths that are unreachable from their mnt_root Eric W. Biederman
  7 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-14  4:35 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau


When bind mounts are in use, and there is another path to the
filesystem it is possible to rename files or directories from a path
underneath the root of the bind mount to a path that is not underneath
the root of the bind mount.

When a directory is moved out from under the root of a bind mount, path
name lookups that go up the directory tree potentially allow accessing
the entire dentry tree of the filesystem.  This is not expected, not
what is desired, and winds up being a security problem for userspace.

Augment d_move and d_exchange to call d_common_ancestor and
handle_possible_mount_escapee to mark any mount points that
directories escape from.

A few notes on the implementation:

- d_splice_alias does not need to be touched as the only case that
  can result in a directory escaping calls d_move.

- Only directory escapes are recorded as only those are relevant to
  new pathname lookup.  Escaped files are handled in prepend_path.

- A lock, either namespace_sem or mount_lock, needs to be held across
  the duration of renames where a directory could be escaping to
  ensure that a mount is not added, escaped, and missed during the
  rename.

- The mount_lock is used as it does not sleep.  I have audited all of
  the callers of d_move and d_exchange and in every instance it appears
  safe for d_move and d_exchange to start sleeping.  But there is
  no point in adding sleeping behavior if that is unnecessary.

- The locking order must be mount_lock outside of rename_lock
  as prepend_path already takes the locks in this order.
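
The walk described in the notes above can be sketched in userspace as
follows.  This is only a model under stated assumptions: the model_*
names are hypothetical, a per-dentry boolean stands in for setting
MNT_DIR_ESCAPED on every mount rooted at that dentry, and all locking
is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dentry {
	struct model_dentry *parent;  /* points to itself at the filesystem root */
	bool is_mount_root;           /* some mount uses this dentry as its root */
	bool escaped;                 /* stand-in for MNT_DIR_ESCAPED on those mounts */
};

/*
 * Walk from the parent of the directory being moved up to, but not
 * including, the common ancestor of the old and new locations.  Every
 * mount root passed on the way is a mount the directory is leaving.
 */
static void model_mark_escapes(const struct model_dentry *ancestor,
			       struct model_dentry *escapee)
{
	struct model_dentry *d;

	assert(escapee != NULL);
	for (d = escapee->parent; d != ancestor; d = d->parent) {
		if (d->is_mount_root)
			d->escaped = true;

		/* Reached the filesystem root without meeting the ancestor. */
		if (d->parent == d)
			break;
	}
}
```

Mount roots at or above the common ancestor are never marked, because
the directory is still underneath them after the rename.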

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/dcache.c           | 33 +++++++++++++++++++++++++++++++++
 fs/mount.h            |  2 ++
 fs/namespace.c        | 32 ++++++++++++++++++++++++++++++++
 include/linux/mount.h |  1 +
 4 files changed, 68 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 1f2f51055515..7927c1fbdb93 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2704,9 +2704,23 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
  */
 void d_move(struct dentry *dentry, struct dentry *target)
 {
+	bool unlock = false;
+
+	if (d_is_dir(dentry) && (dentry->d_parent != target->d_parent)) {
+		const struct dentry *ancestor;
+
+		ancestor = d_common_ancestor(dentry, target);
+		read_seqlock_excl(&mount_lock);
+		unlock = true;
+		handle_possible_mount_escapee(ancestor, dentry);
+	}
+
 	write_seqlock(&rename_lock);
 	__d_move(dentry, target, false);
 	write_sequnlock(&rename_lock);
+	if (unlock)
+		read_sequnlock_excl(&mount_lock);
+
 }
 EXPORT_SYMBOL(d_move);
 
@@ -2717,6 +2731,23 @@ EXPORT_SYMBOL(d_move);
  */
 void d_exchange(struct dentry *dentry1, struct dentry *dentry2)
 {
+	bool d1_is_dir = d_is_dir(dentry1);
+	bool d2_is_dir = d_is_dir(dentry2);
+	bool unlock = false;
+
+	if ((d1_is_dir || d2_is_dir) &&
+	    (dentry1->d_parent != dentry2->d_parent)) {
+		const struct dentry *ancestor;
+
+		ancestor = d_common_ancestor(dentry1, dentry2);
+		read_seqlock_excl(&mount_lock);
+		unlock = true;
+		if (d1_is_dir)
+			handle_possible_mount_escapee(ancestor, dentry1);
+		if (d2_is_dir)
+			handle_possible_mount_escapee(ancestor, dentry2);
+	}
+
 	write_seqlock(&rename_lock);
 
 	WARN_ON(!dentry1->d_inode);
@@ -2727,6 +2758,8 @@ void d_exchange(struct dentry *dentry1, struct dentry *dentry2)
 	__d_move(dentry1, dentry2, true);
 
 	write_sequnlock(&rename_lock);
+	if (unlock)
+		read_sequnlock_excl(&mount_lock);
 }
 
 /**
diff --git a/fs/mount.h b/fs/mount.h
index e8f22970fe59..ad91963c83ac 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -107,6 +107,8 @@ static inline void detach_mounts(struct dentry *dentry)
 	__detach_mounts(dentry);
 }
 
+extern void handle_possible_mount_escapee(const struct dentry *, struct dentry *);
+
 static inline void get_mnt_ns(struct mnt_namespace *ns)
 {
 	atomic_inc(&ns->count);
diff --git a/fs/namespace.c b/fs/namespace.c
index af6abf476394..ddcd0b61a448 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1657,6 +1657,38 @@ out_unlock:
 	namespace_unlock();
 }
 
+static void mark_escaped_mounts(struct dentry *root)
+{
+	/* Must be called with mount_lock held */
+	struct mountroot *mr;
+	struct mount *mnt;
+
+	mr = lookup_mountroot(root);
+	if (mr) {
+		/* Mark each mount from which a directory is escaping.
+		 */
+		hlist_for_each_entry(mnt, &mr->r_list, mnt_mr_list)
+			mnt->mnt.mnt_flags |= MNT_DIR_ESCAPED;
+	}
+}
+
+void handle_possible_mount_escapee(const struct dentry *ancestor,
+				   struct dentry *escapee)
+{
+	struct dentry *dentry;
+
+	for (dentry = escapee->d_parent; dentry != ancestor;
+	     dentry = dentry->d_parent) {
+
+		if (d_mountroot(dentry))
+			mark_escaped_mounts(dentry);
+
+		/* In case there is no common ancestor */
+		if (IS_ROOT(dentry))
+			break;
+	}
+}
+
 /* 
  * Is the caller allowed to modify his namespace?
  */
diff --git a/include/linux/mount.h b/include/linux/mount.h
index f822c3c11377..e58bc12b19aa 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -62,6 +62,7 @@ struct mnt_namespace;
 #define MNT_SYNC_UMOUNT		0x2000000
 #define MNT_MARKED		0x4000000
 #define MNT_UMOUNT		0x8000000
+#define MNT_DIR_ESCAPED		0x10000000
 
 struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 8/8] vfs: Test for and handle paths that are unreachable from their mnt_root
       [not found]                         ` <87wpwyjxwc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                                             ` (6 preceding siblings ...)
  2015-08-14  4:35                           ` [PATCH review 7/8] mnt: Track when a directory escapes a bind mount Eric W. Biederman
@ 2015-08-14  4:36                           ` Eric W. Biederman
  7 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-14  4:36 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau


In rare cases a directory can be renamed out from under a bind mount.
In those cases without special handling it becomes possible to walk up
the directory tree to the root dentry of the filesystem and down
from the root dentry to every other file or directory on the filesystem.

Like division by zero, .. from an unconnected path can not be given
a useful semantic as there is no predicting at which path component
the code will realize it is unconnected.  We certainly can not match
the current behavior as the current behavior is a security hole.

Therefore when encountering .. while following an unconnected path
return -ENOENT.

- Add a function path_connected to verify nd->path.dentry is reachable
  from nd->path.mnt.mnt_root.  AKA to validate that rename did not do
  something nasty to the bind mount.

  To avoid races path_connected must be called after following a path
  component to its next path component.
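
A userspace sketch of the check described above, assuming hypothetical
model_* types in place of struct path and struct vfsmount.  It leaves
out locking, RCU, and crossing into the parent mount, and only shows
the ordering: step to the parent first, then verify connectivity.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dentry {
	struct model_dentry *parent;  /* points to itself at the filesystem root */
};

struct model_mount {
	const struct model_dentry *root;  /* dentry the mount is rooted at */
	bool dir_escaped;                 /* stand-in for MNT_DIR_ESCAPED */
};

/* True if d is root or lies somewhere below it. */
static bool model_is_subdir(const struct model_dentry *d,
			    const struct model_dentry *root)
{
	for (;;) {
		if (d == root)
			return true;
		if (d->parent == d)
			return false;
		d = d->parent;
	}
}

static bool model_path_connected(const struct model_mount *mnt,
				 const struct model_dentry *d)
{
	/* Fast path: nothing has ever escaped from this mount. */
	if (!mnt->dir_escaped)
		return true;
	return model_is_subdir(d, mnt->root);
}

/* Returns 0 on success or -ENOENT when the new position is unconnected. */
static int model_follow_dotdot(const struct model_mount *mnt,
			       const struct model_dentry **pos)
{
	assert(pos != NULL && *pos != NULL);
	if (*pos == mnt->root)
		return 0;  /* simplified: the real code crosses to the parent mount */
	*pos = (*pos)->parent;
	if (!model_path_connected(mnt, *pos))
		return -ENOENT;
	return 0;
}
```

The expensive ancestor walk only runs for mounts that have been
flagged, so the common case pays for a single flag test.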

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namei.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index ae4e4c18b2ac..de3549a6c696 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -560,6 +560,23 @@ static int __nd_alloc_stack(struct nameidata *nd)
 	return 0;
 }
 
+/**
+ * path_connected - Verify that a nd->path.dentry is below nd->path.mnt->mnt.mnt_root
+ * @nd: nameidate to verify
+ *
+ * Rename can sometimes move a file or directory outside of a bind
+ * mount, path_connected allows those cases to be detected.
+ */
+static bool path_connected(const struct path *path)
+{
+	struct vfsmount *mnt = path->mnt;
+
+	if (likely(!(mnt->mnt_flags & MNT_DIR_ESCAPED)))
+		return true;
+
+	return is_subdir(path->dentry, mnt->mnt_root);
+}
+
 static inline int nd_alloc_stack(struct nameidata *nd)
 {
 	if (likely(nd->depth != EMBEDDED_LEVELS))
@@ -1296,6 +1313,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 				return -ECHILD;
 			nd->path.dentry = parent;
 			nd->seq = seq;
+			if (unlikely(!path_connected(&nd->path)))
+				return -ENOENT;
 			break;
 		} else {
 			struct mount *mnt = real_mount(nd->path.mnt);
@@ -1396,7 +1415,7 @@ static void follow_mount(struct path *path)
 	}
 }
 
-static void follow_dotdot(struct nameidata *nd)
+static int follow_dotdot(struct nameidata *nd)
 {
 	if (!nd->root.mnt)
 		set_root(nd);
@@ -1412,6 +1431,8 @@ static void follow_dotdot(struct nameidata *nd)
 			/* rare case of legitimate dget_parent()... */
 			nd->path.dentry = dget_parent(nd->path.dentry);
 			dput(old);
+			if (unlikely(!path_connected(&nd->path)))
+				return -ENOENT;
 			break;
 		}
 		if (!follow_up(&nd->path))
@@ -1419,6 +1440,7 @@ static void follow_dotdot(struct nameidata *nd)
 	}
 	follow_mount(&nd->path);
 	nd->inode = nd->path.dentry->d_inode;
+	return 0;
 }
 
 /*
@@ -1634,7 +1656,7 @@ static inline int handle_dots(struct nameidata *nd, int type)
 		if (nd->flags & LOOKUP_RCU) {
 			return follow_dotdot_rcu(nd);
 		} else
-			follow_dotdot(nd);
+			return follow_dotdot(nd);
 	}
 	return 0;
 }
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* Re: [PATCH review 3/8] dcache: Clearly separate the two directory rename cases in d_splice_alias
       [not found]                           ` <87fv3mjxsc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-15  6:16                             ` Al Viro
       [not found]                               ` <20150815061617.GG14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  0 siblings, 1 reply; 240+ messages in thread
From: Al Viro @ 2015-08-15  6:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

On Thu, Aug 13, 2015 at 11:31:47PM -0500, Eric W. Biederman wrote:

> The problem is that as the lookup locking stands today grabbing the
> s_vfs_rename_mutex must use mutex_trylock which can fail, so for reliability
> reasons we need to avoid using the rename_mutex as much as possible.

I really don't like it.  For one thing, you *still* are taking it - have to,
really.  So this argument is moot anyway.  ESTALE can happen here.  For
another, I'm not convinced that this "we don't need no stinkin'
extra locks for attaching a detached subtree" is correct.  E.g. what's
to protect that IS_ROOT(new) from changing right under you?

Consider a corrupted filesystem (or a bogus server, or a sufficiently
unpleasant race with another client, etc.)

You have a detached subtree with lookups from *TWO* directories trying to
attach its root.  Now what?  ->i_mutex on either prospective parent won't
give a damn thing - different inodes.  The current tree would've serialized
those on rename_lock and treated that as attach a detached followed by
relocate a misplaced.  With your change we get a race in there.

This place is subtle and nasty, and we had rather nasty races there.
Repeatedly.  Any non-trivial locking changes in that area should go
separately from everything else and only with an accurate analysis of
those changes.  It's one of the easiest places to fuck up in.  Been
there, done that, and so had many other people.

I'm not saying that it wouldn't benefit from cleaner locking - it sure
as hell would.  But it's in a really incestuous relationship with a lot
of other pieces, both in fs/dcache.c and elsewhere.  Let's not mix that
into anything else - driveby cleanups in that place are very likely to
cause serious trouble.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 3/8] dcache: Clearly separate the two directory rename cases in d_splice_alias
       [not found]                               ` <20150815061617.GG14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-08-15 18:25                                 ` Eric W. Biederman
       [not found]                                   ` <874mk08l3g.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-08-15 18:39                                   ` Eric W. Biederman
  0 siblings, 2 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-15 18:25 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Thu, Aug 13, 2015 at 11:31:47PM -0500, Eric W. Biederman wrote:
>
>> The problem is that as the lookup locking stands today grabbing the
>> s_vfs_rename_mutex must use mutex_trylock which can fail, so for reliability
>> reasons we need to avoid using the rename_mutex as much as possible.
>
> I really don't like it.  For one thing, you *still* are taking it - have to,
> really.  So this argument is moot anyway.  

Not moot.  We don't take s_vfs_rename_mutex when we are connecting a
disconnected dentry alias.  If d_splice_alias could sleep and take
s_vfs_rename_mutex then we could have a single path through the code.

Unfortunately the code can't sleep when taking s_vfs_rename_mutex so
attempting to take s_vfs_rename_mutex for both paths will introduce
unnecessary -ESTALE failures into d_splice_alias.

> ESTALE can happen here.  For
> another, I'm not convinced that this "we don't need no stinkin'
> extra locks for attaching a detached subtree" is correct.  E.g. what's
> to protect that IS_ROOT(new) from changing right under you?

You are quite correct that I missed that nothing protects the result of
IS_ROOT(new).   So my change does introduce a case where we don't
hold the appropriate inode mutexes when renaming a dentry and that
introduces races elsewhere in the code so it is not acceptable.

But it is true that we only take rename_lock and don't take any
additional mutex when connecting a disconnected dentry.  Aka "we don't
need no stinkin' extra locks".  We clearly can not take the
new->d_parent->d_inode->i_mutex when IS_ROOT(new) as that is
meaningless.  Further I do not see a point in taking s_vfs_rename_mutex
in that case.

Not for this round, but if you can see any reason why our not taking
s_vfs_rename_mutex when connecting disconnected dentries is wrong,
and why we need to take it and risk -ESTALE, I would love to know
because I would love to clean up that mess.

> I'm not saying that it wouldn't benefit from cleaner locking - it sure
> as hell would.  But it's in a really incestuous relationship with a lot
> of other pieces, both in fs/dcache.c and elsewhere.  Let's not mix that
> into anything else - driveby cleanups in that place are very likely to
> cause serious trouble.

Fair enough. 

I do have to touch d_splice_alias and change the locking, so
unfortunately I have to do some of the nasty locking analysis anyway.

I am keeping the i_lock cleanup because it is trivial and removes the
need to figure out if there is any existing ordering between i_lock and
mount_lock, and if taking mount_lock could induce a deadlock.  That
audit would not be trivial.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH review 0/7] Bind mount escape fixes
       [not found]                                   ` <874mk08l3g.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-15 18:35                                     ` Eric W. Biederman
  2015-08-15 18:36                                       ` [PATCH review 1/7] dcache: Handle escaped paths in prepend_path Eric W. Biederman
                                                         ` (4 more replies)
  2015-08-15 18:39                                     ` [PATCH review 6/7] mnt: Track when a directory escapes a bind mount Eric W. Biederman
  2015-08-15 18:39                                     ` [PATCH review 7/7] vfs: Test for and handle paths that are unreachable from their mnt_root Eric W. Biederman
  2 siblings, 5 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-15 18:35 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

It is possible in some situations to rename a file or directory through
one mount point such that it can start out inside of a bind mount and
after the rename wind up outside of the bind mount.  Unfortunately with
user namespaces these conditions can be trivially created by creating a
bind mount under an existing bind mount.

I have identified four situations in which this may be a problem.
- __d_path and d_absolute_path need to error on disconnected paths
  that can not reach some root directory or lsm path based security
  checks can incorrectly succeed.

- Normal path name resolution following .. can lead to a directory
  that is outside of the original loopback mount.

- file handle reconstitution aka exportfs_decode_fh can yield a dentry
  from which d_parent can be followed up to mnt->sb->s_root, but
  d_parent can not be followed up to mnt->mnt_root.

- Mounts on a path that has been renamed outside of a loopback mount
  become unreachable, as there is no possible path that can be passed
  to umount to unmount them.

My strategy:

o File handle reconstitution problems can be prevented by enabling
  the nfsd subtree checks for nfs exports, and open_by_handle_at
  requires capable(CAP_DAC_READ_SEARCH) so is only usable by the global
  root.  This makes any problems difficult if not impossible to exploit
  in practice so I have not yet written code to address that issue.

o The functions __d_path and d_absolute_path are augmented so that the
  security modules will not be fed a problematic path to work with.

o Following of .. has been augmented to test that after d_parent has
  been resolved the original directory is connected, and if not
  an error of -ENOENT is returned.

o I do not worry about mounts that are disconnected from their bind
  mount as these mounts can always be freed by either umount -l on
  the bind mount they have escaped from, or by freeing the mount
  namespace.  So I do not believe there is an actual problem.

Pathname resolution is a common fast path, and most of the code in this
patchset is there to keep .. from becoming expensive in the common
case.

After hearing Al's feedback and running some numbers I have given
up attempting to keep the number of d_ancestor calls during pathname
resolution to an absolute minimum.  It appears that simply preventing
calls to d_ancestor unless a directory has escaped is good enough.  This
change in approach has significantly simplified the code.

The implementation change this round is that I have dropped my patch
cleaning up d_splice_alias.  Al Viro found a race that makes the
technique I was using fundamentally racy.  I now have d_splice_alias
taking mount_lock around rename_lock.  Since I don't have to sleep in
d_splice_alias the change is minimal and sufficient for this purpose.

Barring some other idiocy I think this will be the final version of this
patchset.

These changes are all against v4.2-rc1. 

For those who like to see everything in a single tree the code is at:

     git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

Eric W. Biederman (7):
      dcache: Handle escaped paths in prepend_path
      dcache: Reduce the scope of i_lock in d_splice_alias
      mnt: Track which mounts use a dentry as root.
      dcache: Implement d_common_ancestor
      dcache: Only read d_flags once in d_is_dir
      mnt: Track when a directory escapes a bind mount
      vfs: Test for and handle paths that are unreachable from their mnt_root

 fs/dcache.c            |  91 +++++++++++++++++++++++++++--
 fs/mount.h             |   9 +++
 fs/namei.c             |  26 ++++++++-
 fs/namespace.c         | 152 +++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dcache.h |  11 +++-
 include/linux/mount.h  |   1 +
 6 files changed, 279 insertions(+), 11 deletions(-)

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH review 1/7] dcache: Handle escaped paths in prepend_path
       [not found]                                       ` <87a8ts763c.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-15 18:36                                         ` Eric W. Biederman
  2015-08-15 18:36                                         ` [PATCH review 2/7] dcache: Reduce the scope of i_lock in d_splice_alias Eric W. Biederman
                                                           ` (4 subsequent siblings)
  5 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-15 18:36 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

A rename can result in a dentry that by walking up d_parent
will never reach its mnt_root.  For lack of a better term
I call this an escaped path.

prepend_path is called by four different functions __d_path,
d_absolute_path, d_path, and getcwd.

__d_path only wants to see paths that are connected to the root it
passes in.  So __d_path needs prepend_path to return an error.

d_absolute_path similarly wants to see paths that are connected to
some root.  Escaped paths are not connected to any mnt_root so
d_absolute_path needs prepend_path to return an error greater
than 1.  So escaped paths will be treated like paths on lazily
unmounted mounts.

getcwd needs to prepend "(unreachable)" so getcwd also needs
prepend_path to return an error.

d_path is the interesting hold out.  d_path just wants to print
something, and does not care about the weird cases.  Which raises
the question what should be printed?

Given that <escaped_path>/<anything> should result in -ENOENT I
believe it is desirable for escaped paths to be printed as empty
paths, as there are not really any meaningful path components when
considered from the perspective of a mount tree.

So tweak prepend_path to return an empty path with a new error
code of 3 when it encounters an escaped path.
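
How the callers described above would see the new code can be modeled
as below.  This is a sketch, not the kernel functions: the model_*
names are hypothetical, and the meanings attached to codes 1 and 2 are
assumptions inferred from the description (only "greater than 1 is
treated like a lazily unmounted mount" comes from the text).

```c
#include <assert.h>
#include <stdbool.h>

enum model_walk_result {
	MODEL_WALK_OK = 0,            /* reached the requested root */
	MODEL_WALK_OUTSIDE_ROOT = 1,  /* assumed: connected, but to a different root */
	MODEL_WALK_DETACHED = 2,      /* assumed: on a lazily unmounted mount */
	MODEL_WALK_ESCAPED = 3,       /* new: renamed out from under its mnt_root */
};

static bool model_valid_code(int code)
{
	return code >= MODEL_WALK_OK && code <= MODEL_WALK_ESCAPED;
}

/* __d_path: only paths connected to the root passed in are acceptable. */
static bool model_d_path_internal_fails(int code)
{
	assert(model_valid_code(code));
	return code > 0;
}

/* d_absolute_path: any root will do, so only codes above 1 are errors. */
static bool model_d_absolute_path_fails(int code)
{
	assert(model_valid_code(code));
	return code > 1;
}

/* getcwd: any nonzero code means "(unreachable)" gets prepended. */
static bool model_getcwd_marks_unreachable(int code)
{
	assert(model_valid_code(code));
	return code > 0;
}
```

Because 3 is larger than every existing code, the three callers that
already test the return value pick up the escaped case without any
change on their side.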

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/dcache.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 7a3f3e5f9cea..f762e76e85cc 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2923,6 +2923,13 @@ restart:

 		if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
 			struct mount *parent = ACCESS_ONCE(mnt->mnt_parent);
+			/* Escaped? */
+			if (dentry != vfsmnt->mnt_root) {
+				bptr = *buffer;
+				blen = *buflen;
+				error = 3;
+				break;
+			}
 			/* Global root? */
 			if (mnt != parent) {
 				dentry = ACCESS_ONCE(mnt->mnt_mountpoint);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 1/7] dcache: Handle escaped paths in prepend_path
  2015-08-15 18:35                                     ` [PATCH review 0/7] Bind mount escape fixes Eric W. Biederman
@ 2015-08-15 18:36                                       ` Eric W. Biederman
  2015-08-15 18:36                                       ` [PATCH review 2/7] dcache: Reduce the scope of i_lock in d_splice_alias Eric W. Biederman
                                                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-15 18:36 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Al Viro, Andy Lutomirski, Serge E. Hallyn,
	Richard Weinberger, Andrey Vagin, Jann Horn, Willy Tarreau,
	Omar Sandoval, Miklos Szeredi, Linus Torvalds, J. Bruce Fields

A rename can result in a dentry that by walking up d_parent
will never reach its mnt_root.  For lack of a better term
I call this an escaped path.

prepend_path is called by four different functions __d_path,
d_absolute_path, d_path, and getcwd.

__d_path only wants to see paths that are connected to the root it
passes in.  So __d_path needs prepend_path to return an error.

d_absolute_path similarly wants to see paths that are connected to
some root.  Escaped paths are not connected to any mnt_root so
d_absolute_path needs prepend_path to return an error greater
than 1.  So escaped paths will be treated like paths on lazily
unmounted mounts.

getcwd needs to prepend "(unreachable)" so getcwd also needs
prepend_path to return an error.

d_path is the interesting hold out.  d_path just wants to print
something, and does not care about the weird cases.  Which raises
the question what should be printed?

Given that <escaped_path>/<anything> should result in -ENOENT I
believe it is desirable for escaped paths to be printed as empty
paths, as there are not really any meaningful path components when
considered from the perspective of a mount tree.

So tweak prepend_path to return an empty path with a new error
code of 3 when it encounters an escaped path.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/dcache.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 7a3f3e5f9cea..f762e76e85cc 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2923,6 +2923,13 @@ restart:

 		if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
 			struct mount *parent = ACCESS_ONCE(mnt->mnt_parent);
+			/* Escaped? */
+			if (dentry != vfsmnt->mnt_root) {
+				bptr = *buffer;
+				blen = *buflen;
+				error = 3;
+				break;
+			}
 			/* Global root? */
 			if (mnt != parent) {
 				dentry = ACCESS_ONCE(mnt->mnt_mountpoint);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 2/7] dcache: Reduce the scope of i_lock in d_splice_alias
       [not found]                                       ` <87a8ts763c.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-08-15 18:36                                         ` [PATCH review 1/7] dcache: Handle escaped paths in prepend_path Eric W. Biederman
@ 2015-08-15 18:36                                         ` Eric W. Biederman
  2015-08-15 18:37                                         ` [PATCH review 3/7] mnt: Track which mounts use a dentry as root Eric W. Biederman
                                                           ` (3 subsequent siblings)
  5 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-15 18:36 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau


i_lock is only needed until __d_find_any_alias calls dget on the alias
dentry.  After that the reference to new ensures that dentry_kill and
d_delete will not remove the inode from the dentry, or remove the
dentry from the inode->i_dentry list.

The inode i_lock came to be held over the __d_move calls in
d_splice_alias through a series of introductions of locks with
increasingly smaller scope.  First it was the dcache_lock, then
it was the dcache_inode_lock, and finally inode->i_lock.

Furthermore inode->i_lock is not held over any other calls
to d_move or __d_move so it can not provide any meaningful
rename protection.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/dcache.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f762e76e85cc..53b7f1e63beb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2715,7 +2715,7 @@ struct dentry *d_ancestor(struct dentry *p1, struct dentry *p2)
  * This helper attempts to cope with remotely renamed directories
  *
  * It assumes that the caller is already holding
- * dentry->d_parent->d_inode->i_mutex, inode->i_lock and rename_lock
+ * dentry->d_parent->d_inode->i_mutex, and rename_lock
  *
  * Note: If ever the locking in lock_rename() changes, then please
  * remember to update this too...
@@ -2741,7 +2741,6 @@ out_unalias:
 	__d_move(alias, dentry, false);
 	ret = 0;
 out_err:
-	spin_unlock(&inode->i_lock);
 	if (m2)
 		mutex_unlock(m2);
 	if (m1)
@@ -2787,10 +2786,11 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 	if (S_ISDIR(inode->i_mode)) {
 		struct dentry *new = __d_find_any_alias(inode);
 		if (unlikely(new)) {
+			/* The reference to new ensures it remains an alias */
+			spin_unlock(&inode->i_lock);
 			write_seqlock(&rename_lock);
 			if (unlikely(d_ancestor(new, dentry))) {
 				write_sequnlock(&rename_lock);
-				spin_unlock(&inode->i_lock);
 				dput(new);
 				new = ERR_PTR(-ELOOP);
 				pr_warn_ratelimited(
@@ -2809,7 +2809,6 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 			} else {
 				__d_move(new, dentry, false);
 				write_sequnlock(&rename_lock);
-				spin_unlock(&inode->i_lock);
 				security_d_instantiate(new, inode);
 			}
 			iput(inode);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 2/7] dcache: Reduce the scope of i_lock in d_splice_alias
  2015-08-15 18:35                                     ` [PATCH review 0/7] Bind mount escape fixes Eric W. Biederman
  2015-08-15 18:36                                       ` [PATCH review 1/7] dcache: Handle escaped paths in prepend_path Eric W. Biederman
@ 2015-08-15 18:36                                       ` Eric W. Biederman
       [not found]                                       ` <87a8ts763c.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                                                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-15 18:36 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Al Viro, Andy Lutomirski, Serge E. Hallyn,
	Richard Weinberger, Andrey Vagin, Jann Horn, Willy Tarreau,
	Omar Sandoval, Miklos Szeredi, Linus Torvalds, J. Bruce Fields


i_lock is only needed until __d_find_any_alias calls dget on the alias
dentry.  After that the reference to new ensures that dentry_kill and
d_delete will not remove the inode from the dentry, or remove the
dentry from the inode->i_dentry list.

The inode i_lock came to be held over the __d_move calls in
d_splice_alias through a series of introductions of locks with
increasingly smaller scope.  First it was the dcache_lock, then
it was the dcache_inode_lock, and finally inode->i_lock.

Furthermore inode->i_lock is not held over any other calls
to d_move or __d_move so it can not provide any meaningful
rename protection.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/dcache.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f762e76e85cc..53b7f1e63beb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2715,7 +2715,7 @@ struct dentry *d_ancestor(struct dentry *p1, struct dentry *p2)
  * This helper attempts to cope with remotely renamed directories
  *
  * It assumes that the caller is already holding
- * dentry->d_parent->d_inode->i_mutex, inode->i_lock and rename_lock
+ * dentry->d_parent->d_inode->i_mutex, and rename_lock
  *
  * Note: If ever the locking in lock_rename() changes, then please
  * remember to update this too...
@@ -2741,7 +2741,6 @@ out_unalias:
 	__d_move(alias, dentry, false);
 	ret = 0;
 out_err:
-	spin_unlock(&inode->i_lock);
 	if (m2)
 		mutex_unlock(m2);
 	if (m1)
@@ -2787,10 +2786,11 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 	if (S_ISDIR(inode->i_mode)) {
 		struct dentry *new = __d_find_any_alias(inode);
 		if (unlikely(new)) {
+			/* The reference to new ensures it remains an alias */
+			spin_unlock(&inode->i_lock);
 			write_seqlock(&rename_lock);
 			if (unlikely(d_ancestor(new, dentry))) {
 				write_sequnlock(&rename_lock);
-				spin_unlock(&inode->i_lock);
 				dput(new);
 				new = ERR_PTR(-ELOOP);
 				pr_warn_ratelimited(
@@ -2809,7 +2809,6 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 			} else {
 				__d_move(new, dentry, false);
 				write_sequnlock(&rename_lock);
-				spin_unlock(&inode->i_lock);
 				security_d_instantiate(new, inode);
 			}
 			iput(inode);
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 3/7] mnt: Track which mounts use a dentry as root.
       [not found]                                       ` <87a8ts763c.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-08-15 18:36                                         ` [PATCH review 1/7] dcache: Handle escaped paths in prepend_path Eric W. Biederman
  2015-08-15 18:36                                         ` [PATCH review 2/7] dcache: Reduce the scope of i_lock in d_splice_alias Eric W. Biederman
@ 2015-08-15 18:37                                         ` Eric W. Biederman
  2015-08-15 18:37                                         ` [PATCH review 4/7] dcache: Implement d_common_ancestor Eric W. Biederman
                                                           ` (2 subsequent siblings)
  5 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-15 18:37 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel, Jann Horn, Linus Torvalds,
	Willy Tarreau


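Add a mountroot hash table, keyed by dentry, that lists every mount
whose mnt_root is that dentry, and flag such dentries with
DCACHE_MOUNTROOT.

This is infrastructure needed to better handle files or directories
that are moved out from under the root of a bind mount.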

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
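The bookkeeping this patch adds can be pictured with a small userspace
sketch.  It is not the kernel code: toy_dentry, toy_mount, toy_set_root
and toy_put_root are hypothetical names, the hash table is collapsed
into a list head stored in the dentry itself, and no locking is shown.
It only illustrates that the flag is set when the first mount starts
using a dentry as its root and cleared when the last one goes away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mount;

/* Hypothetical dentry: mountroot mirrors DCACHE_MOUNTROOT, users
 * stands in for the per-dentry mountroot record. */
struct toy_dentry {
	bool mountroot;
	struct toy_mount *users;	/* mounts whose root is this dentry */
};

struct toy_mount {
	struct toy_dentry *root;
	struct toy_mount *next;		/* next mount sharing the same root */
};

/* Register mnt as a user of root and flag the dentry. */
static void toy_set_root(struct toy_mount *mnt, struct toy_dentry *root)
{
	mnt->root = root;
	mnt->next = root->users;
	root->users = mnt;
	root->mountroot = true;		/* at least one mount now uses it */
}

/* Unregister mnt; clear the flag when the last user is gone. */
static void toy_put_root(struct toy_mount *mnt)
{
	struct toy_dentry *root = mnt->root;
	struct toy_mount **p = &root->users;

	while (*p && *p != mnt)
		p = &(*p)->next;
	assert(*p == mnt);		/* the mount must have been registered */
	*p = mnt->next;
	mnt->next = NULL;
	mnt->root = NULL;
	if (!root->users)
		root->mountroot = false;
}
```

A cheap flag test on the dentry is then enough to decide whether the
more expensive per-dentry lookup is worth doing at all.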
 fs/mount.h             |   7 +++
 fs/namespace.c         | 120 +++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dcache.h |   7 +++
 3 files changed, 130 insertions(+), 4 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 14db05d424f7..e8f22970fe59 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -27,6 +27,12 @@ struct mountpoint {
 	int m_count;
 };
 
+struct mountroot {
+	struct hlist_node r_hash;
+	struct dentry *r_dentry;
+	struct hlist_head r_list;
+};
+
 struct mount {
 	struct hlist_node mnt_hash;
 	struct mount *mnt_parent;
@@ -55,6 +61,7 @@ struct mount {
 	struct mnt_namespace *mnt_ns;	/* containing namespace */
 	struct mountpoint *mnt_mp;	/* where is it mounted */
 	struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
+	struct hlist_node mnt_mr_list;	/* list mounts with the same mountroot */
 #ifdef CONFIG_FSNOTIFY
 	struct hlist_head mnt_fsnotify_marks;
 	__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index c7cb8a526c05..af6abf476394 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -31,6 +31,8 @@ static unsigned int m_hash_mask __read_mostly;
 static unsigned int m_hash_shift __read_mostly;
 static unsigned int mp_hash_mask __read_mostly;
 static unsigned int mp_hash_shift __read_mostly;
+static unsigned int mr_hash_mask __read_mostly;
+static unsigned int mr_hash_shift __read_mostly;
 
 static __initdata unsigned long mhash_entries;
 static int __init set_mhash_entries(char *str)
@@ -52,6 +54,16 @@ static int __init set_mphash_entries(char *str)
 }
 __setup("mphash_entries=", set_mphash_entries);
 
+static __initdata unsigned long mrhash_entries;
+static int __init set_mrhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	mrhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("mrhash_entries=", set_mrhash_entries);
+
 static u64 event;
 static DEFINE_IDA(mnt_id_ida);
 static DEFINE_IDA(mnt_group_ida);
@@ -61,6 +73,7 @@ static int mnt_group_start = 1;
 
 static struct hlist_head *mount_hashtable __read_mostly;
 static struct hlist_head *mountpoint_hashtable __read_mostly;
+static struct hlist_head *mountroot_hashtable __read_mostly;
 static struct kmem_cache *mnt_cache __read_mostly;
 static DECLARE_RWSEM(namespace_sem);
 
@@ -93,6 +106,13 @@ static inline struct hlist_head *mp_hash(struct dentry *dentry)
 	return &mountpoint_hashtable[tmp & mp_hash_mask];
 }
 
+static inline struct hlist_head *mr_hash(struct dentry *dentry)
+{
+	unsigned long tmp = ((unsigned long)dentry / L1_CACHE_BYTES);
+	tmp = tmp + (tmp >> mr_hash_shift);
+	return &mountroot_hashtable[tmp & mr_hash_mask];
+}
+
 /*
  * allocation is serialized by namespace_sem, but we need the spinlock to
  * serialize with freeing.
@@ -234,6 +254,7 @@ static struct mount *alloc_vfsmnt(const char *name)
 		INIT_LIST_HEAD(&mnt->mnt_slave_list);
 		INIT_LIST_HEAD(&mnt->mnt_slave);
 		INIT_HLIST_NODE(&mnt->mnt_mp_list);
+		INIT_HLIST_NODE(&mnt->mnt_mr_list);
 #ifdef CONFIG_FSNOTIFY
 		INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
 #endif
@@ -779,6 +800,77 @@ static void put_mountpoint(struct mountpoint *mp)
 	}
 }
 
+static struct mountroot *lookup_mountroot(struct dentry *dentry)
+{
+	struct hlist_head *chain = mr_hash(dentry);
+	struct mountroot *mr;
+
+	hlist_for_each_entry(mr, chain, r_hash) {
+		if (mr->r_dentry == dentry)
+			return mr;
+	}
+	return NULL;
+}
+
+static int mnt_set_root(struct mount *mnt, struct dentry *root)
+{
+	struct mountroot *mr = NULL;
+
+	read_seqlock_excl(&mount_lock);
+	if (d_mountroot(root))
+		mr = lookup_mountroot(root);
+	if (!mr) {
+		struct mountroot *new;
+		read_sequnlock_excl(&mount_lock);
+
+		new = kmalloc(sizeof(struct mountroot), GFP_KERNEL);
+		if (!new)
+			return -ENOMEM;
+
+		read_seqlock_excl(&mount_lock);
+		mr = lookup_mountroot(root);
+		if (mr) {
+			kfree(new);
+		} else {
+			struct hlist_head *chain = mr_hash(root);
+
+			mr = new;
+			mr->r_dentry = root;
+			INIT_HLIST_HEAD(&mr->r_list);
+			hlist_add_head(&mr->r_hash, chain);
+
+			spin_lock(&root->d_lock);
+			root->d_flags |= DCACHE_MOUNTROOT;
+			spin_unlock(&root->d_lock);
+		}
+	}
+	mnt->mnt.mnt_root = root;
+	hlist_add_head(&mnt->mnt_mr_list, &mr->r_list);
+	read_sequnlock_excl(&mount_lock);
+
+	return 0;
+}
+
+static void mnt_put_root(struct mount *mnt)
+{
+	struct dentry *root = mnt->mnt.mnt_root;
+	struct mountroot *mr;
+
+	read_seqlock_excl(&mount_lock);
+	mr = lookup_mountroot(root);
+	BUG_ON(!mr);
+	hlist_del(&mnt->mnt_mr_list);
+	if (hlist_empty(&mr->r_list)) {
+		hlist_del(&mr->r_hash);
+		spin_lock(&root->d_lock);
+		root->d_flags &= ~DCACHE_MOUNTROOT;
+		spin_unlock(&root->d_lock);
+		kfree(mr);
+	}
+	read_sequnlock_excl(&mount_lock);
+	dput(root);
+}
+
 static inline int check_mnt(struct mount *mnt)
 {
 	return mnt->mnt_ns == current->nsproxy->mnt_ns;
@@ -934,6 +1026,7 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 {
 	struct mount *mnt;
 	struct dentry *root;
+	int err;
 
 	if (!type)
 		return ERR_PTR(-ENODEV);
@@ -952,8 +1045,16 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 		return ERR_CAST(root);
 	}
 
-	mnt->mnt.mnt_root = root;
 	mnt->mnt.mnt_sb = root->d_sb;
+	err = mnt_set_root(mnt, root);
+	if (err) {
+		dput(root);
+		deactivate_super(mnt->mnt.mnt_sb);
+		mnt_free_id(mnt);
+		free_vfsmnt(mnt);
+		return ERR_PTR(err);
+	}
+
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
 	lock_mount_hash();
@@ -985,6 +1086,10 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 			goto out_free;
 	}
 
+	err = mnt_set_root(mnt, root);
+	if (err)
+		goto out_free;
+
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED);
 	/* Don't allow unprivileged users to change mount flags */
 	if (flag & CL_UNPRIVILEGED) {
@@ -1010,7 +1115,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 
 	atomic_inc(&sb->s_active);
 	mnt->mnt.mnt_sb = sb;
-	mnt->mnt.mnt_root = dget(root);
+	dget(root);
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
 	lock_mount_hash();
@@ -1063,7 +1168,7 @@ static void cleanup_mnt(struct mount *mnt)
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	fsnotify_vfsmount_delete(&mnt->mnt);
-	dput(mnt->mnt.mnt_root);
+	mnt_put_root(mnt);
 	deactivate_super(mnt->mnt.mnt_sb);
 	mnt_free_id(mnt);
 	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
@@ -3096,14 +3201,21 @@ void __init mnt_init(void)
 				mphash_entries, 19,
 				0,
 				&mp_hash_shift, &mp_hash_mask, 0, 0);
+	mountroot_hashtable = alloc_large_system_hash("Mountroot-cache",
+				sizeof(struct hlist_head),
+				mrhash_entries, 19,
+				0,
+				&mr_hash_shift, &mr_hash_mask, 0, 0);
 
-	if (!mount_hashtable || !mountpoint_hashtable)
+	if (!mount_hashtable || !mountpoint_hashtable || !mountroot_hashtable)
 		panic("Failed to allocate mount hash table\n");
 
 	for (u = 0; u <= m_hash_mask; u++)
 		INIT_HLIST_HEAD(&mount_hashtable[u]);
 	for (u = 0; u <= mp_hash_mask; u++)
 		INIT_HLIST_HEAD(&mountpoint_hashtable[u]);
+	for (u = 0; u <= mr_hash_mask; u++)
+		INIT_HLIST_HEAD(&mountroot_hashtable[u]);
 
 	kernfs_init();
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index d2d50249b7b2..06bed2a1053c 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -228,6 +228,8 @@ struct dentry_operations {
 #define DCACHE_FALLTHRU			0x01000000 /* Fall through to lower layer */
 #define DCACHE_OP_SELECT_INODE		0x02000000 /* Unioned entry: dcache op selects inode */
 
+#define DCACHE_MOUNTROOT		0x04000000 /* Root of a vfsmount */
+
 extern seqlock_t rename_lock;
 
 /*
@@ -403,6 +405,11 @@ static inline bool d_mountpoint(const struct dentry *dentry)
 	return dentry->d_flags & DCACHE_MOUNTED;
 }
 
+static inline bool d_mountroot(const struct dentry *dentry)
+{
+	return dentry->d_flags & DCACHE_MOUNTROOT;
+}
+
 /*
  * Directory cache entry type accessor functions.
  */
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 4/7] dcache: Implement d_common_ancestor
       [not found]                                       ` <87a8ts763c.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                                                           ` (2 preceding siblings ...)
  2015-08-15 18:37                                         ` [PATCH review 3/7] mnt: Track which mounts use a dentry as root Eric W. Biederman
@ 2015-08-15 18:37                                         ` Eric W. Biederman
  2015-08-15 18:38                                         ` [PATCH review 5/7] dcache: Only read d_flags once in d_is_dir Eric W. Biederman
  2015-08-15 19:36                                         ` [PATCH review 0/7] Bind mount escape fixes Linus Torvalds
  5 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-15 18:37 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel, Jann Horn, Linus Torvalds,
	Willy Tarreau


If possible find the common ancestor of two dentries.

This is necessary infrastructure for better handling the case
when a dentry is moved out from under the root of a bind mount.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
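For readers who want to try the walk outside the kernel, here is a
minimal userspace sketch of the same idea: equalize the depths, then
climb both sides in lock step.  struct node, depth and common_ancestor
are hypothetical stand-ins for struct dentry, d_depth and
d_common_ancestor; the only assumption carried over is that a root is
its own parent, which is what IS_ROOT tests.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct dentry: a root is its own parent. */
struct node {
	struct node *parent;
};

/* Number of parent hops from n up to the root of its tree. */
static unsigned long depth(const struct node *n)
{
	unsigned long d = 0;

	while (n->parent != n) {
		n = n->parent;
		d++;
	}
	return d;
}

/* Deepest node that is an ancestor of (or equal to) both arguments,
 * or NULL when the two nodes live in different trees. */
static const struct node *common_ancestor(const struct node *left,
					  const struct node *right)
{
	unsigned long ld, rd;

	assert(left != NULL && right != NULL);
	ld = depth(left);
	rd = depth(right);

	/* Lift the deeper node until both sit at the same depth. */
	for (; ld > rd; ld--)
		left = left->parent;
	for (; rd > ld; rd--)
		right = right->parent;

	/* Climb in lock step; reaching two distinct roots means there
	 * is no shared tree. */
	while (left != right) {
		if (left->parent == left)
			return NULL;
		left = left->parent;
		right = right->parent;
	}
	return left;
}
```

The cost is proportional to the depth of the deeper node, and no extra
storage is needed beyond the two depth counters.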
 fs/dcache.c            | 37 +++++++++++++++++++++++++++++++++++++
 include/linux/dcache.h |  1 +
 2 files changed, 38 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 53b7f1e63beb..4e66bf92a481 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2469,6 +2469,43 @@ void dentry_update_name_case(struct dentry *dentry, struct qstr *name)
 }
 EXPORT_SYMBOL(dentry_update_name_case);
 
+static unsigned long d_depth(const struct dentry *dentry)
+{
+	unsigned long depth = 0;
+
+	while (!IS_ROOT(dentry)) {
+		dentry = dentry->d_parent;
+		depth++;
+	}
+	return depth;
+}
+
+const struct dentry *d_common_ancestor(const struct dentry *left,
+				       const struct dentry *right)
+{
+	unsigned long ldepth = d_depth(left);
+	unsigned long rdepth = d_depth(right);
+
+	while (ldepth > rdepth) {
+		left = left->d_parent;
+		ldepth--;
+	}
+
+	while (rdepth > ldepth) {
+		right = right->d_parent;
+		rdepth--;
+	}
+
+	while (left != right) {
+		if (IS_ROOT(left))
+			return NULL;
+		left = left->d_parent;
+		right = right->d_parent;
+	}
+
+	return left;
+}
+
 static void swap_names(struct dentry *dentry, struct dentry *target)
 {
 	if (unlikely(dname_external(target))) {
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 06bed2a1053c..5b69856b45a2 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -313,6 +313,7 @@ extern void dentry_update_name_case(struct dentry *, struct qstr *);
 extern void d_move(struct dentry *, struct dentry *);
 extern void d_exchange(struct dentry *, struct dentry *);
 extern struct dentry *d_ancestor(struct dentry *, struct dentry *);
+extern const struct dentry *d_common_ancestor(const struct dentry *, const struct dentry *);
 
 /* appendix may either be NULL or be used for transname suffixes */
 extern struct dentry *d_lookup(const struct dentry *, const struct qstr *);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 5/7] dcache: Only read d_flags once in d_is_dir
       [not found]                                       ` <87a8ts763c.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                                                           ` (3 preceding siblings ...)
  2015-08-15 18:37                                         ` [PATCH review 4/7] dcache: Implement d_common_ancestor Eric W. Biederman
@ 2015-08-15 18:38                                         ` Eric W. Biederman
  2015-08-15 19:36                                         ` [PATCH review 0/7] Bind mount escape fixes Linus Torvalds
  5 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-15 18:38 UTC (permalink / raw)
  To: Linux Containers
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel, Jann Horn, Linus Torvalds,
	Willy Tarreau


Cache the value of __d_entry_type in d_is_dir and test if it is equal to
DCACHE_DIRECTORY_TYPE or DCACHE_AUTODIR_TYPE.

The generated assembly goes from:
	movl	(%rdi), %eax  # MEM[(volatile __u32 *)dentry_3(D)], tmp73
	andl	$7340032, %eax  #, tmp73
	cmpl	$2097152, %eax  #, tmp73
	je	.L1091	#,
	movl	(%rdi), %eax	# MEM[(volatile __u32 *)dentry_3(D)], tmp74
	andl	$7340032, %eax	#, tmp74
	cmpl	$3145728, %eax	#, tmp74
	je	.L1091	#,
to:
	movl	(%rdi), %eax	# MEM[(volatile __u32 *)dentry_3(D)], tmp71
	andl	$6291456, %eax	#, tmp71
	cmpl	$2097152, %eax	#, tmp71
	jne	.L1091	  #,

With only one read of d_flags, one comparison, and one jump, this is
dramatically better code.

__d_entry_type is written so that the compiler cannot optimize away
anything it does (the read of d_flags is volatile), so where an
optimization is possible and reasonable it has to be performed manually.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
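The effect described above can be reproduced in userspace with a small
sketch.  The three constants are taken from the masks visible in the
assembly listing; toy_dentry, entry_type and the is_dir_* helpers are
hypothetical names, not the kernel's.  The last helper spells out what
the compiler is able to do once the flags are read only once: the two
type values differ only in bit 20, so masking that bit away leaves a
single comparison.

```c
#include <assert.h>
#include <stdbool.h>

/* Values taken from the masks in the assembly listing above. */
#define ENTRY_TYPE_MASK	0x00700000u	/* 7340032 */
#define DIRECTORY_TYPE	0x00200000u	/* 2097152 */
#define AUTODIR_TYPE	0x00300000u	/* 3145728 */

/* Hypothetical stand-in for a dentry: only the flags word matters. */
struct toy_dentry {
	unsigned int d_flags;
};

/* Volatile read, so every call is a load the compiler may not merge. */
static unsigned int entry_type(const struct toy_dentry *d)
{
	return *(const volatile unsigned int *)&d->d_flags & ENTRY_TYPE_MASK;
}

/* Old shape: two helper calls, hence up to two loads of d_flags. */
static bool is_dir_two_reads(const struct toy_dentry *d)
{
	return entry_type(d) == DIRECTORY_TYPE ||
	       entry_type(d) == AUTODIR_TYPE;
}

/* New shape: load once, compare the cached value twice. */
static bool is_dir_one_read(const struct toy_dentry *d)
{
	unsigned int type = entry_type(d);

	return type == DIRECTORY_TYPE || type == AUTODIR_TYPE;
}

/* What the cached form folds to: mask 0x00600000 (6291456) ignores
 * bit 20, the only bit in which the two directory types differ. */
static bool is_dir_folded(const struct toy_dentry *d)
{
	return (d->d_flags & 0x00600000u) == DIRECTORY_TYPE;
}
```

All three helpers agree on every flags value as long as the flags do
not change between the two reads of the first one.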
 include/linux/dcache.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 5b69856b45a2..82eb50aaf446 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -443,7 +443,8 @@ static inline bool d_is_autodir(const struct dentry *dentry)
 
 static inline bool d_is_dir(const struct dentry *dentry)
 {
-	return d_can_lookup(dentry) || d_is_autodir(dentry);
+	unsigned type = __d_entry_type(dentry);
+	return (type == DCACHE_DIRECTORY_TYPE) || (type == DCACHE_AUTODIR_TYPE);
 }
 
 static inline bool d_is_symlink(const struct dentry *dentry)
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 6/7] mnt: Track when a directory escapes a bind mount
       [not found]                                   ` <874mk08l3g.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-08-15 18:35                                     ` [PATCH review 0/7] Bind mount escape fixes Eric W. Biederman
@ 2015-08-15 18:39                                     ` Eric W. Biederman
  2015-08-15 18:39                                     ` [PATCH review 7/7] vfs: Test for and handle paths that are unreachable from their mnt_root Eric W. Biederman
  2 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-15 18:39 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	linux-fsdevel, Jann Horn, Linus Torvalds,
	Willy Tarreau


When bind mounts are in use and there is another path to the
filesystem, it is possible to rename files or directories from a path
underneath the root of the bind mount to a path that is not underneath
the root of the bind mount.

When a directory is moved out from under the root of a bind mount, path
name lookups that go up the directory tree potentially allow accessing
the entire dentry tree of the filesystem.  This is not expected, not
what is desired, and winds up being a security problem for userspace.

Augment d_move, d_exchange, and __d_unalias to call d_common_ancestor
and handle_possible_mount_escapee to mark any mounts that
directories escape from.

A few notes on the implementation:

- Only directory escapes are recorded as only those are relevant to
  new pathname lookup.  Escaped files are handled in prepend_path.

- A lock, either namespace_sem or mount_lock, needs to be held across
  the duration of renames where a directory could be escaping to
  ensure that a mount is not added, escaped, and missed during the
  rename.

- The mount_lock is used as it does not sleep.  I have audited all of
  the callers of d_move and d_exchange and in every instance it appears
  safe for d_move and d_exchange to start sleeping.  But there is
  no point in adding sleeping behavior if that is unnecessary.

- The locking order must be mount_lock outside of rename_lock
  as prepend_path already takes the locks in this order.

- d_splice_alias (which calls __d_unalias) is painful when it comes
  to this kind of locking, as it mostly takes the spinlocks before the
  sleeping locks.  So I have implemented the suboptimal but stupid and
  correct version of the locking and always take mount_lock.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
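The marking walk described above can be sketched in userspace as
follows.  This is only an illustration under simplifying assumptions:
toy_dentry and toy_handle_escapee are hypothetical names, the
mountroot lookup and the per-mount MNT_DIR_ESCAPED flag are collapsed
into two booleans on the node, and no locks are taken.  It shows which
dentries get visited: every parent of the escaping directory strictly
below the common ancestor of the old and new locations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical dentry: a root is its own parent.  is_mount_root and
 * escaped stand in for the mountroot lookup and MNT_DIR_ESCAPED. */
struct toy_dentry {
	struct toy_dentry *parent;
	bool is_mount_root;
	bool escaped;
};

/* Mark every mount root the escapee is leaving: walk from its old
 * parent up to, but not including, the common ancestor. */
static void toy_handle_escapee(const struct toy_dentry *ancestor,
			       struct toy_dentry *escapee)
{
	struct toy_dentry *d;

	assert(escapee != NULL);
	for (d = escapee->parent; d != ancestor; d = d->parent) {
		if (d->is_mount_root)
			d->escaped = true;

		/* No common ancestor: stop at the filesystem root. */
		if (d->parent == d)
			break;
	}
}
```

Mount roots at or above the common ancestor are left alone, since the
directory is still underneath them after the rename.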
 fs/dcache.c           | 40 ++++++++++++++++++++++++++++++++++++++++
 fs/mount.h            |  2 ++
 fs/namespace.c        | 32 ++++++++++++++++++++++++++++++++
 include/linux/mount.h |  1 +
 4 files changed, 75 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 4e66bf92a481..ccc7daa0ae71 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2704,9 +2704,23 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
  */
 void d_move(struct dentry *dentry, struct dentry *target)
 {
+	bool unlock = false;
+
+	if (d_is_dir(dentry) && (dentry->d_parent != target->d_parent)) {
+		const struct dentry *ancestor;
+
+		ancestor = d_common_ancestor(dentry, target);
+		read_seqlock_excl(&mount_lock);
+		unlock = true;
+		handle_possible_mount_escapee(ancestor, dentry);
+	}
+
 	write_seqlock(&rename_lock);
 	__d_move(dentry, target, false);
 	write_sequnlock(&rename_lock);
+	if (unlock)
+		read_sequnlock_excl(&mount_lock);
+
 }
 EXPORT_SYMBOL(d_move);
 
@@ -2717,6 +2731,23 @@ EXPORT_SYMBOL(d_move);
  */
 void d_exchange(struct dentry *dentry1, struct dentry *dentry2)
 {
+	bool d1_is_dir = d_is_dir(dentry1);
+	bool d2_is_dir = d_is_dir(dentry2);
+	bool unlock = false;
+
+	if ((d1_is_dir || d2_is_dir) &&
+	    (dentry1->d_parent != dentry2->d_parent)) {
+		const struct dentry *ancestor;
+
+		ancestor = d_common_ancestor(dentry1, dentry2);
+		read_seqlock_excl(&mount_lock);
+		unlock = true;
+		if (d1_is_dir)
+			handle_possible_mount_escapee(ancestor, dentry1);
+		if (d2_is_dir)
+			handle_possible_mount_escapee(ancestor, dentry2);
+	}
+
 	write_seqlock(&rename_lock);
 
 	WARN_ON(!dentry1->d_inode);
@@ -2727,6 +2758,8 @@ void d_exchange(struct dentry *dentry1, struct dentry *dentry2)
 	__d_move(dentry1, dentry2, true);
 
 	write_sequnlock(&rename_lock);
+	if (unlock)
+		read_sequnlock_excl(&mount_lock);
 }
 
 /**
@@ -2761,6 +2794,7 @@ static int __d_unalias(struct inode *inode,
 		struct dentry *dentry, struct dentry *alias)
 {
 	struct mutex *m1 = NULL, *m2 = NULL;
+	const struct dentry *ancestor;
 	int ret = -ESTALE;
 
 	/* If alias and dentry share a parent, then no extra locks required */
@@ -2774,6 +2808,8 @@ static int __d_unalias(struct inode *inode,
 	if (!mutex_trylock(&alias->d_parent->d_inode->i_mutex))
 		goto out_err;
 	m2 = &alias->d_parent->d_inode->i_mutex;
+	ancestor = d_common_ancestor(alias, dentry);
+	handle_possible_mount_escapee(ancestor, alias);
 out_unalias:
 	__d_move(alias, dentry, false);
 	ret = 0;
@@ -2825,9 +2861,11 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 		if (unlikely(new)) {
 			/* The reference to new ensures it remains an alias */
 			spin_unlock(&inode->i_lock);
+			read_seqlock_excl(&mount_lock);
 			write_seqlock(&rename_lock);
 			if (unlikely(d_ancestor(new, dentry))) {
 				write_sequnlock(&rename_lock);
+				read_sequnlock_excl(&mount_lock);
 				dput(new);
 				new = ERR_PTR(-ELOOP);
 				pr_warn_ratelimited(
@@ -2839,6 +2877,7 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 			} else if (!IS_ROOT(new)) {
 				int err = __d_unalias(inode, dentry, new);
 				write_sequnlock(&rename_lock);
+				read_sequnlock_excl(&mount_lock);
 				if (err) {
 					dput(new);
 					new = ERR_PTR(err);
@@ -2846,6 +2885,7 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 			} else {
 				__d_move(new, dentry, false);
 				write_sequnlock(&rename_lock);
+				read_sequnlock_excl(&mount_lock);
 				security_d_instantiate(new, inode);
 			}
 			iput(inode);
diff --git a/fs/mount.h b/fs/mount.h
index e8f22970fe59..ad91963c83ac 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -107,6 +107,8 @@ static inline void detach_mounts(struct dentry *dentry)
 	__detach_mounts(dentry);
 }
 
+extern void handle_possible_mount_escapee(const struct dentry *, struct dentry *);
+
 static inline void get_mnt_ns(struct mnt_namespace *ns)
 {
 	atomic_inc(&ns->count);
diff --git a/fs/namespace.c b/fs/namespace.c
index af6abf476394..ddcd0b61a448 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1657,6 +1657,38 @@ out_unlock:
 	namespace_unlock();
 }
 
+static void mark_escaped_mounts(struct dentry *root)
+{
+	/* Must be called with mount_lock held */
+	struct mountroot *mr;
+	struct mount *mnt;
+
+	mr = lookup_mountroot(root);
+	if (mr) {
+		/* Mark each mount from which a directory is escaping.
+		 */
+		hlist_for_each_entry(mnt, &mr->r_list, mnt_mr_list)
+			mnt->mnt.mnt_flags |= MNT_DIR_ESCAPED;
+	}
+}
+
+void handle_possible_mount_escapee(const struct dentry *ancestor,
+				   struct dentry *escapee)
+{
+	struct dentry *dentry;
+
+	for (dentry = escapee->d_parent; dentry != ancestor;
+	     dentry = dentry->d_parent) {
+
+		if (d_mountroot(dentry))
+			mark_escaped_mounts(dentry);
+
+		/* In case there is no common ancestor */
+		if (IS_ROOT(dentry))
+			break;
+	}
+}
+
 /* 
  * Is the caller allowed to modify his namespace?
  */
diff --git a/include/linux/mount.h b/include/linux/mount.h
index f822c3c11377..e58bc12b19aa 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -62,6 +62,7 @@ struct mnt_namespace;
 #define MNT_SYNC_UMOUNT		0x2000000
 #define MNT_MARKED		0x4000000
 #define MNT_UMOUNT		0x8000000
+#define MNT_DIR_ESCAPED		0x10000000
 
 struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 7/7] vfs: Test for and handle paths that are unreachable from their mnt_root
       [not found]                                   ` <874mk08l3g.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-08-15 18:35                                     ` [PATCH review 0/7] Bind mount escape fixes Eric W. Biederman
  2015-08-15 18:39                                     ` [PATCH review 6/7] mnt: Track when a directory escapes a bind mount Eric W. Biederman
@ 2015-08-15 18:39                                     ` Eric W. Biederman
  2 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-15 18:39 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	linux-fsdevel, Jann Horn, Linus Torvalds,
	Willy Tarreau


In rare cases a directory can be renamed out from under a bind mount.
In those cases without special handling it becomes possible to walk up
the directory tree to the root dentry of the filesystem and down
from the root dentry to every other file or directory on the filesystem.

Like division by zero, ".." from an unconnected path cannot be given
useful semantics, as there is no predicting at which path component
the code will realize the path is unconnected.  We certainly cannot
match the current behavior, as the current behavior is a security hole.

Therefore, when ".." is encountered while following an unconnected path,
return -ENOENT.

- Add a function path_connected to verify that path->dentry is reachable
  from path->mnt->mnt_root, i.e. to validate that rename did not do
  something nasty to the bind mount.

  To avoid races, path_connected must be called after following a path
  component to its next path component.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
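A userspace sketch of the check, under simplifying assumptions, may
help when reviewing the hunks below.  The toy_* names are hypothetical;
dir_escaped stands in for MNT_DIR_ESCAPED, toy_is_subdir for is_subdir,
and toy_dotdot simply stays put at the mount root instead of crossing
into the parent mount the way the real lookup code does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dentry {
	struct toy_dentry *parent;	/* a root is its own parent */
};

struct toy_mount {
	struct toy_dentry *root;
	bool dir_escaped;		/* stands in for MNT_DIR_ESCAPED */
};

/* True if d is root itself or lies somewhere below it. */
static bool toy_is_subdir(const struct toy_dentry *d,
			  const struct toy_dentry *root)
{
	for (;;) {
		if (d == root)
			return true;
		if (d->parent == d)
			return false;
		d = d->parent;
	}
}

static bool toy_path_connected(const struct toy_mount *mnt,
			       const struct toy_dentry *d)
{
	if (!mnt->dir_escaped)		/* fast path: nothing ever escaped */
		return true;
	return toy_is_subdir(d, mnt->root);
}

/* Follow ".." once.  Returns 0 on success, or -ENOENT when the step
 * lands outside the subtree the mount exposes. */
static int toy_dotdot(const struct toy_mount *mnt, struct toy_dentry **pos)
{
	assert(mnt && pos && *pos);
	if (*pos == mnt->root)
		return 0;		/* simplification: stay at the root */
	*pos = (*pos)->parent;
	return toy_path_connected(mnt, *pos) ? 0 : -ENOENT;
}
```

The check runs after the step to the parent, matching the ordering
requirement stated in the description, and the common case costs a
single flag test.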
 fs/namei.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index ae4e4c18b2ac..18d4884c7e85 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -560,6 +560,23 @@ static int __nd_alloc_stack(struct nameidata *nd)
 	return 0;
 }
 
+/**
+ * path_connected - Verify that a path->dentry is below path->mnt.mnt_root
+ * @path: nameidate to verify
+ *
+ * Rename can sometimes move a file or directory outside of a bind
+ * mount, path_connected allows those cases to be detected.
+ */
+static bool path_connected(const struct path *path)
+{
+	struct vfsmount *mnt = path->mnt;
+
+	if (likely(!(mnt->mnt_flags & MNT_DIR_ESCAPED)))
+		return true;
+
+	return is_subdir(path->dentry, mnt->mnt_root);
+}
+
 static inline int nd_alloc_stack(struct nameidata *nd)
 {
 	if (likely(nd->depth != EMBEDDED_LEVELS))
@@ -1296,6 +1313,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 				return -ECHILD;
 			nd->path.dentry = parent;
 			nd->seq = seq;
+			if (unlikely(!path_connected(&nd->path)))
+				return -ENOENT;
 			break;
 		} else {
 			struct mount *mnt = real_mount(nd->path.mnt);
@@ -1396,7 +1415,7 @@ static void follow_mount(struct path *path)
 	}
 }
 
-static void follow_dotdot(struct nameidata *nd)
+static int follow_dotdot(struct nameidata *nd)
 {
 	if (!nd->root.mnt)
 		set_root(nd);
@@ -1412,6 +1431,8 @@ static void follow_dotdot(struct nameidata *nd)
 			/* rare case of legitimate dget_parent()... */
 			nd->path.dentry = dget_parent(nd->path.dentry);
 			dput(old);
+			if (unlikely(!path_connected(&nd->path)))
+				return -ENOENT;
 			break;
 		}
 		if (!follow_up(&nd->path))
@@ -1419,6 +1440,7 @@ static void follow_dotdot(struct nameidata *nd)
 	}
 	follow_mount(&nd->path);
 	nd->inode = nd->path.dentry->d_inode;
+	return 0;
 }
 
 /*
@@ -1634,7 +1656,7 @@ static inline int handle_dots(struct nameidata *nd, int type)
 		if (nd->flags & LOOKUP_RCU) {
 			return follow_dotdot_rcu(nd);
 		} else
-			follow_dotdot(nd);
+			return follow_dotdot(nd);
 	}
 	return 0;
 }
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH review 7/7] vfs: Test for and handle paths that are unreachable from their mnt_root
  2015-08-15 18:25                                 ` Eric W. Biederman
       [not found]                                   ` <874mk08l3g.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-15 18:39                                   ` Eric W. Biederman
  1 sibling, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-15 18:39 UTC (permalink / raw)
  To: Al Viro
  Cc: Linux Containers, linux-fsdevel, Andy Lutomirski,
	Serge E. Hallyn, Richard Weinberger, Andrey Vagin, Jann Horn,
	Willy Tarreau, Omar Sandoval, Miklos Szeredi, Linus Torvalds,
	J. Bruce Fields


In rare cases a directory can be renamed out from under a bind mount.
In those cases without special handling it becomes possible to walk up
the directory tree to the root dentry of the filesystem and down
from the root dentry to every other file or directory on the filesystem.

Like division by zero, ".." from an unconnected path cannot be given
useful semantics, as there is no predicting at which path component
the code will realize the path is unconnected.  We certainly cannot
match the current behavior, as the current behavior is a security hole.

Therefore, when ".." is encountered while following an unconnected path,
return -ENOENT.

- Add a function path_connected to verify that path->dentry is reachable
  from path->mnt->mnt_root, i.e. to validate that rename did not do
  something nasty to the bind mount.

  To avoid races, path_connected must be called after following a path
  component to its next path component.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namei.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index ae4e4c18b2ac..18d4884c7e85 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -560,6 +560,23 @@ static int __nd_alloc_stack(struct nameidata *nd)
 	return 0;
 }
 
+/**
+ * path_connected - Verify that a path->dentry is below path->mnt.mnt_root
+ * @path: nameidate to verify
+ *
+ * Rename can sometimes move a file or directory outside of a bind
+ * mount, path_connected allows those cases to be detected.
+ */
+static bool path_connected(const struct path *path)
+{
+	struct vfsmount *mnt = path->mnt;
+
+	if (likely(!(mnt->mnt_flags & MNT_DIR_ESCAPED)))
+		return true;
+
+	return is_subdir(path->dentry, mnt->mnt_root);
+}
+
 static inline int nd_alloc_stack(struct nameidata *nd)
 {
 	if (likely(nd->depth != EMBEDDED_LEVELS))
@@ -1296,6 +1313,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 				return -ECHILD;
 			nd->path.dentry = parent;
 			nd->seq = seq;
+			if (unlikely(!path_connected(&nd->path)))
+				return -ENOENT;
 			break;
 		} else {
 			struct mount *mnt = real_mount(nd->path.mnt);
@@ -1396,7 +1415,7 @@ static void follow_mount(struct path *path)
 	}
 }
 
-static void follow_dotdot(struct nameidata *nd)
+static int follow_dotdot(struct nameidata *nd)
 {
 	if (!nd->root.mnt)
 		set_root(nd);
@@ -1412,6 +1431,8 @@ static void follow_dotdot(struct nameidata *nd)
 			/* rare case of legitimate dget_parent()... */
 			nd->path.dentry = dget_parent(nd->path.dentry);
 			dput(old);
+			if (unlikely(!path_connected(&nd->path)))
+				return -ENOENT;
 			break;
 		}
 		if (!follow_up(&nd->path))
@@ -1419,6 +1440,7 @@ static void follow_dotdot(struct nameidata *nd)
 	}
 	follow_mount(&nd->path);
 	nd->inode = nd->path.dentry->d_inode;
+	return 0;
 }
 
 /*
@@ -1634,7 +1656,7 @@ static inline int handle_dots(struct nameidata *nd, int type)
 		if (nd->flags & LOOKUP_RCU) {
 			return follow_dotdot_rcu(nd);
 		} else
-			follow_dotdot(nd);
+			return follow_dotdot(nd);
 	}
 	return 0;
 }
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                       ` <87a8ts763c.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                                                           ` (4 preceding siblings ...)
  2015-08-15 18:38                                         ` [PATCH review 5/7] dcache: Only read d_flags once in d_is_dir Eric W. Biederman
@ 2015-08-15 19:36                                         ` Linus Torvalds
  5 siblings, 0 replies; 240+ messages in thread
From: Linus Torvalds @ 2015-08-15 19:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel, Jann Horn, Willy Tarreau

On Sat, Aug 15, 2015 at 11:35 AM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> The implementation change this round is I have dropped my patch cleaning
> up d_splice_alias.  Al Viro found a race that makes the technique I was
> using fundamentally racy.  I now have d_splice_alias taking mount_lock
> around rename_lock.  Since I don't have to sleep in d_splice_alias
> change is minimal and sufficient for this purpose.

Quite frankly, I hate this series.

Not all of it. Patches 1,2 and 5 look fine to me. But 3,4,6 and 7 I'm
not happy with. Particularly patch 6.

It just smells like a bad hack to me. My gut feel says that it's all
wrong. It doesn't feel clean or right.

I'd much rather make ".." handling more expensive than add and
maintain that MNT_DIR_ESCAPED flag.  My gut feel is that yes, we
should look seriously at making ".." much smarter (so I don't object
to the concept of patch 7/7 at all), but I think we should strive to
look at that ".." handling *without* adding the crufty odd special
bind mount crud.

Put another way: the whole "escape bind mount" thing is not at all a
new issue, it smells very much like the very traditional Unix "escape
chroot" thing. And I detest how this adds magic rules for bind mounts,
when it feels like a much more generic issue.

Ok, so I haven't really thought deeply about this, this literally is
just a "gut feel" kind of thing.

Can we really not validate ".." some clever way _without_ adding all
those "mount escape" flags? And by "clever" I potentially mean "not
clever" and in fact just fairly brute force. I'd almost prefer to just
walk the parent chains all the way to the root and validate the ".."
that way..

                 Linus

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                         ` <CA+55aFzMuCn33yK71HoKnj1hr8=ac_Y-vfE5mM8h4f3YJeGKvg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-08-15 19:48                                           ` Linus Torvalds
       [not found]                                             ` <CA+55aFyeu-p_3eJQCLM0TDuLYvo10mx379FaCFq7Z103RgKvVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 240+ messages in thread
From: Linus Torvalds @ 2015-08-15 19:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel, Jann Horn, Willy Tarreau

On Sat, Aug 15, 2015 at 12:36 PM, Linus Torvalds
<torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
>
> Can we really not validate ".." some clever way _without_ adding all
> those "mount escape" flags? And by "clever" I potentially mean "not
> clever" and in fact just fairly brute force. I'd almost prefer to just
> walk the parent chains all the way to the root and validate the ".."
> that way..

For example: while it's true that walking a long chain of parents (to
validate that we hit root etc) would be expensive, I don't think we'd
necessarily need to do it for the common case.

For example, if our current "mnt->mnt_root" is a _real_ root (so
IS_ROOT() is true), then we know we're not in some possibly partial
bind mount, so we don't need to check anything else, and we can
happily move to the parent dentry *without* having to be particularly
careful.

Otherwise we might need to walk the dentry parent chain to check that
yes, we will hit that "mnt->mnt_root" entry, and that we're not
possibly escaping the bind mount. But even that walk is "just"
following a chain of pointers. It's not *that* expensive.

I'd much rather make ".." more expensive, if it means that we don't
have to track the status of whether a mount has a potentially escaped
directory in it or not.  Because I think we can avoid the costs for
traditional non-bind mounts.

No?

                      Linus

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                             ` <CA+55aFyeu-p_3eJQCLM0TDuLYvo10mx379FaCFq7Z103RgKvVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-08-15 21:07                                               ` Eric W. Biederman
       [not found]                                                 ` <E2AECA7F-ED57-4FCD-A4C0-8C7C4B860FB6-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-15 21:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel, Jann Horn, Willy Tarreau



On August 15, 2015 2:48:34 PM CDT, Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
>On Sat, Aug 15, 2015 at 12:36 PM, Linus Torvalds
><torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
>>
>> Can we really not validate ".." some clever way _without_ adding all
>> those "mount escape" flags? And by "clever" I potentially mean "not
>> clever" and in fact just fairly brute force. I'd almost prefer to just
>> walk the parent chains all the way to the root and validate the ".."
>> that way..
>
>For example: while it's true that walking a long chain of parents (to
>validate that we hit root etc) would be expensive, I don't think we'd
>necessarily need to do it for the common case.
>
>For example, if our current "mnt->mnt_root" is a _real_ root (so
>IS_ROOT() is true), then we know we're not in some possibly partial
>bind mount, so we don't need to check anything else, and we can
>happily move to the parent dentry *without* having to be particularly
>careful.
>
>Otherwise we might need to walk the dentry parent chain to check that
>yes, we will hit that "mnt->mnt_root" entry, and that we're not
>possibly escaping the bind mount. But even that walk is "just"
>following a chain of pointers. It's not *that* expensive.
>
>I'd much rather make ".." more expensive, if it means that we don't
>have to track the status of whether a mount has a potentially escaped
>directory in it or not.  Because I think we can avoid the costs for
>traditional non-bind mounts.
>
>No?

Yes we can compare s_root and mnt_root and only call is_subdir if they don't match.

At this point it is a matter of trade offs.

If there is not an escape I do not expect my current implementation will have a measurable cost. And I don't expect there will be any escapes.

That said, if you and Al would be happy with what you are proposing I can easily implement it.

My only concern at this point is that I know some containers run with a bind mount for their root directory, so it might be a change with a measurable cost. At the same time shallow directory paths are the norm, so I don't expect there to be much of a cost.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                                 ` <E2AECA7F-ED57-4FCD-A4C0-8C7C4B860FB6-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2015-08-15 22:47                                                   ` Linus Torvalds
       [not found]                                                     ` <CA+55aFx2s7TrmPKviKnFL0nGRZDHuCajW_UO02EnF+CsJY2-4w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 240+ messages in thread
From: Linus Torvalds @ 2015-08-15 22:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel, Jann Horn, Willy Tarreau

On Sat, Aug 15, 2015 at 2:07 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> Yes we can compare s_root and mnt_root and only call is_subdir if they don't match.

Not even "is_subdir()" - for the RCU traversal case, just d_ancestor()
should be sufficient since we'd already be in an RCU read-locked
region and the RCU lookup checks the rename sequence number around it
all.

And d_ancestor() should really be pretty low-cost - even *if* we have
to call it, which wouldn't even be the case for the normal situation.

> At this point it is a matter of trade offs.
>
> If there is not an escape I do not expect my current implementation will have a measurable cost.
> And I don't expect there will be any escapes.

So the cost I worry about is not the CPU cost, but the complexity and
correctness. If anything goes subtly wrong, the end result is going to
be some very very subtle bugs.

And personally, I'd be much happier with something that is a bit more
straightforward, even if it makes ".." lookup slower. Especially since
I think we can limit the costs to fairly obvious cases (ie only for
partial bind mounts). Keep the code more straightforward, and *if* we
ever see the cost of dentry traversal

But it's up to Al, I think.

Al, comments?

                Linus

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                                     ` <CA+55aFx2s7TrmPKviKnFL0nGRZDHuCajW_UO02EnF+CsJY2-4w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-08-16  0:59                                                       ` Eric W. Biederman
       [not found]                                                         ` <87bne82glg.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-08-16  2:12                                                       ` [PATCH review 0/7] Bind mount escape fixes Al Viro
  1 sibling, 1 reply; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-16  0:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel, Jann Horn, Willy Tarreau

Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> writes:

> On Sat, Aug 15, 2015 at 2:07 PM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>
>> Yes we can compare s_root and mnt_root and only call is_subdir if they don't match.
>
> Not even "is_subdir()" - for the RCU traversal case, just d_ancestor()
> should be sufficient since we'd already be in an RCU read-locked
> region and the RCU lookup checks the rename sequence number around it
> all.

We check the dentry sequence number and the mount sequence number, which
may be enough to catch a local rename but is certainly not enough to
catch what d_ancestor cares about.

Further we have the partial rcu to non-rcu walk case represented by
unlazy_walk that means we can't blithely do something that might be
wrong and only check the sequence numbers at each step.

> And d_ancestor() should really be pretty low-cost - even *if* we have
> to call it, which wouldn't even be the case for the normal situation.

>
>> At this point it is a matter of trade offs.
>>
>> If there is not an escape I do not expect my current implementation will have a measurable cost.
>> And I don't expect there will be any escapes.
>
> So the cost I worry about is not the CPU cost, but the complexity and
> correctness. If anything goes subtly wrong, the end result is going to
> be some very very subtle bugs.

Fair enough.  I like simple low complexity code, but I don't want to
mess up the pathname lookup fastpath.

> And personally, I'd be much happier with something that is a bit more
> straightforward, even if it makes ".." lookup slower. Especially since
> I think we can limit the costs to fairly obvious cases (ie only for
> partial bind mounts). Keep the code more straightforward, and *if* we
> ever see the cost of dentry traversal
>
> But it's up to Al, I think.
>
> Al, comments?

At the very beginning of this I got shot down by Al Viro for a simple
implementation that essentially had everything except the check for
being a bind mount.  Knowing what I know now I realize it was a bit
buggy, calling d_ancestor in the rcu walk instead of is_subdir, but it
was shot down for the cpu cost.  Then Al suggested the basic approach I
have taken in these patches.

As soon as I am done testing I am going to post the revised version of
my final patch that only performs is_subdir checks on bind mounts.

Then we can decide to merge whichever version of the code you and Al are
happy with.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* [PATCH] vfs: Test for and handle paths that are unreachable from their mnt_root
       [not found]                                                         ` <87bne82glg.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-16  1:27                                                           ` Eric W. Biederman
  2015-08-17  3:56                                                             ` NeilBrown
       [not found]                                                             ` <87tws010r2.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 2 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-16  1:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel, Jann Horn, Willy Tarreau


In rare cases a directory can be renamed out from under a bind mount.
In those cases without special handling it becomes possible to walk up
the directory tree to the root dentry of the filesystem and down
from the root dentry to every other file or directory on the filesystem.

Like division by zero, ".." from an unconnected path cannot be given
useful semantics, as there is no predicting at which path component
the code will realize it is unconnected.  We certainly cannot match
the current behavior, as the current behavior is a security hole.

Therefore, when encountering ".." while following an unconnected path,
return -ENOENT.

- Add a function path_connected to verify path->dentry is reachable
  from path->mnt.mnt_root.  AKA to validate that rename did not do
  something nasty to the bind mount.

  To avoid races path_connected must be called after following a path
  component to its next path component.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---

This is the simple version that needs no extra vfs support.

My availability is likely to be a bit spotty for the next while
as I am travelling to and then attending Linux Plumbers Conference.

 fs/namei.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index ae4e4c18b2ac..5303e994f8d6 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -560,6 +560,24 @@ static int __nd_alloc_stack(struct nameidata *nd)
 	return 0;
 }
 
+/**
+ * path_connected - Verify that a path->dentry is below path->mnt.mnt_root
+ * @path: path to verify
+ *
+ * Rename can sometimes move a file or directory outside of a bind
+ * mount, path_connected allows those cases to be detected.
+ */
+static bool path_connected(const struct path *path)
+{
+	struct vfsmount *mnt = path->mnt;
+
+	/* Only bind mounts can have disconnected paths */
+	if (mnt->mnt_root == mnt->mnt_sb->s_root)
+		return true;
+
+	return is_subdir(path->dentry, mnt->mnt_root);
+}
+
 static inline int nd_alloc_stack(struct nameidata *nd)
 {
 	if (likely(nd->depth != EMBEDDED_LEVELS))
@@ -1296,6 +1314,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 				return -ECHILD;
 			nd->path.dentry = parent;
 			nd->seq = seq;
+			if (unlikely(!path_connected(&nd->path)))
+				return -ENOENT;
 			break;
 		} else {
 			struct mount *mnt = real_mount(nd->path.mnt);
@@ -1396,7 +1416,7 @@ static void follow_mount(struct path *path)
 	}
 }
 
-static void follow_dotdot(struct nameidata *nd)
+static int follow_dotdot(struct nameidata *nd)
 {
 	if (!nd->root.mnt)
 		set_root(nd);
@@ -1412,6 +1432,8 @@ static void follow_dotdot(struct nameidata *nd)
 			/* rare case of legitimate dget_parent()... */
 			nd->path.dentry = dget_parent(nd->path.dentry);
 			dput(old);
+			if (unlikely(!path_connected(&nd->path)))
+				return -ENOENT;
 			break;
 		}
 		if (!follow_up(&nd->path))
@@ -1419,6 +1441,7 @@ static void follow_dotdot(struct nameidata *nd)
 	}
 	follow_mount(&nd->path);
 	nd->inode = nd->path.dentry->d_inode;
+	return 0;
 }
 
 /*
@@ -1634,7 +1657,7 @@ static inline int handle_dots(struct nameidata *nd, int type)
 		if (nd->flags & LOOKUP_RCU) {
 			return follow_dotdot_rcu(nd);
 		} else
-			follow_dotdot(nd);
+			return follow_dotdot(nd);
 	}
 	return 0;
 }
-- 
2.2.1
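
For illustration only, the check in the patch above can be modeled in
userspace. This is a minimal sketch, assuming toy structures in place of
the kernel's dentry and vfsmount: every toy_ name is hypothetical, a
filesystem root is modeled as a node that is its own parent, and no
locking or RCU is modeled at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for kernel structures; all names are hypothetical. */
struct toy_dentry {
	struct toy_dentry *parent;	/* a filesystem root points at itself */
};

struct toy_mount {
	struct toy_dentry *mnt_root;	/* dentry the mount exposes as its root */
	struct toy_dentry *sb_root;	/* real root of the underlying filesystem */
};

/* Walk the parent chain from d; true if root is met on the way up. */
static bool toy_is_subdir(const struct toy_dentry *d,
			  const struct toy_dentry *root)
{
	for (;;) {
		if (d == root)
			return true;
		if (d->parent == d)	/* hit the filesystem root first */
			return false;
		d = d->parent;
	}
}

/* Same shape as path_connected() in the patch: cheap test, then the walk. */
static bool toy_path_connected(const struct toy_mount *mnt,
			       const struct toy_dentry *d)
{
	/* A whole-filesystem mount cannot have a disconnected path. */
	if (mnt->mnt_root == mnt->sb_root)
		return true;
	return toy_is_subdir(d, mnt->mnt_root);
}

/* Usage: /a is bind mounted, then /a/b is renamed to /c/b. */
static void toy_selfcheck(void)
{
	struct toy_dentry root = { &root };
	struct toy_dentry a = { &root }, c = { &root };
	struct toy_dentry b = { &a };
	struct toy_mount bind = { &a, &root };
	struct toy_mount full = { &root, &root };

	assert(toy_path_connected(&bind, &b));	/* b is below the bind root */
	b.parent = &c;				/* rename moves b out of a */
	assert(!toy_path_connected(&bind, &b));	/* now unreachable from a */
	assert(toy_path_connected(&full, &b));	/* full mount is unaffected */
}
```

The first comparison is what keeps the common case cheap: only mounts
whose root is not the filesystem root pay for the parent-chain walk.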

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                                     ` <CA+55aFx2s7TrmPKviKnFL0nGRZDHuCajW_UO02EnF+CsJY2-4w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-08-16  0:59                                                       ` Eric W. Biederman
@ 2015-08-16  2:12                                                       ` Al Viro
       [not found]                                                         ` <20150816021209.GI14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-08-16  2:25                                                         ` Linus Torvalds
  1 sibling, 2 replies; 240+ messages in thread
From: Al Viro @ 2015-08-16  2:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	Eric W. Biederman, linux-fsdevel, Jann Horn, Willy Tarreau

On Sat, Aug 15, 2015 at 03:47:50PM -0700, Linus Torvalds wrote:
> So the cost I worry about is not the CPU cost, but the complexity and
> correctness. If anything goes subtly wrong, the end result is going to
> be some very very subtle bugs.
> 
> And personally, I'd be much happier with something that is a bit more
> straightforward, even if it makes ".." lookup slower. Especially since
> I think we can limit the costs to fairly obvious cases (ie only for
> partial bind mounts). Keep the code more straightforward, and *if* we
> ever see the cost of dentry traversal
> 
> But it's up to Al, I think.
> 
> Al, comments?

I think you are underestimating the frequency of .. traversals.  Any build
process that creates relative symlinks will be hitting it all the time,
for one thing.  And less-than-entire-fs mounts are not something pathological -
I've got quite a few here, things like container setups often create such
beasts, etc.  Not to mention that things like NFSv4 will often look like such
partial mounts, BTW.

I really don't understand why the hell do we need *anything* complicated
around __d_move() callers - just take sodding spinlock of mount_lock in
the (few) callers around rename_lock, and that's it.  No need for anything
subtle and brittle.

Redoing the locking in d_splice_alias() simply doesn't belong anywhere near
that work.

Basically, all places where we change tree topology go through __d_move()
(and take rename_lock around it).  There are very few of those.  And we
can find and taint affected mounts quite easily - I think we all agree that
beginning of that series looks sane (locating mounts by mnt_root), right?

Let's just add "mount_lock should be held read-exclusive by all callers of
__d_move()" and do the find-and-taint logic from
dentry_lock_for_move().  Move these
        BUG_ON(d_ancestor(dentry, target));
        BUG_ON(d_ancestor(target, dentry));
into dentry_lock_for_move(), while we are at it.  And have it begin with
finding the last common ancestor of dentry and target.  Which turns those
checks into ancestor == dentry and ancestor == target...  Then we need to
taint everything with root being an ancestor of dentry, but not an ancestor
of target.  Which is trivial with LCA already found.  For exchange case we
need to do that both for dentry and target, of course.  That's it.  After that
we just use that taint instead of "is it a partial?" in .. handling.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                                         ` <20150816021209.GI14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-08-16  2:25                                                           ` Linus Torvalds
  0 siblings, 0 replies; 240+ messages in thread
From: Linus Torvalds @ 2015-08-16  2:25 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	Eric W. Biederman, linux-fsdevel, Jann Horn, Willy Tarreau

On Sat, Aug 15, 2015 at 7:12 PM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
>
> I think you are underestimating the frequency of .. traversals.  Any build
> process that creates relative symlinks will be hitting it all the time,
> for one thing.

I suspect you're over-estimating how expensive it is to just walk down
to the mount-point. It's just a few pointer traversals.

Realistically, we probably do more than that for a *regular* path
component lookup, when we follow the hash chains. Following a d_parent
chain for ".." isn't that different.

Just looking at the last patch Eric sent, that one looks _trivial_. It
didn't need *any* preparation or new rules. Compared to the mess with
marking things MNT_DIR_ESCAPED etc, I know which approach I'd prefer.

But hey, if you think you can simplify it... I just don't think that
even totally ignoring the d_splice_alias() things, and totally
ignoring any locking around __d_move(), the whole "mark things
MNT_DIR_ESCAPED" is a lot more complex.

              Linus

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                                           ` <CA+55aFy3pzEY=4dfd_PX-Og_b7fqrG1rDniOqehBfQhXb=Cg9A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-08-16  4:53                                                             ` Al Viro
       [not found]                                                               ` <20150816045322.GJ14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-08-16 11:51                                                             ` Eric W. Biederman
  1 sibling, 1 reply; 240+ messages in thread
From: Al Viro @ 2015-08-16  4:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	Eric W. Biederman, linux-fsdevel, Jann Horn, Willy Tarreau

On Sat, Aug 15, 2015 at 07:25:41PM -0700, Linus Torvalds wrote:
> On Sat, Aug 15, 2015 at 7:12 PM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
> >
> > I think you are underestimating the frequency of .. traversals.  Any build
> > process that creates relative symlinks will be hitting it all the time,
> > for one thing.
> 
> I suspect you're over-estimating how expensive it is to just walk down
> to the mount-point. It's just a few pointer traversals.
> 
> Realistically, we probably do more than that for a *regular* path
> component lookup, when we follow the hash chains. Following a d_parent
> chain for ".." isn't that different.

Point, but...  Keep in mind that there's another PITA in there - unreachable
submounts are not as harmless as Eric hopes.  umount -l of the entire tainted
mount is a very large hammer _and_ userland needs to know when to use
it in the first place; otherwise we'll end up with dirty reboots.
So slightly longer term I want to have something done to them when they become
unreachable.  Namely, detach and leave in their place a trap that would
give EINVAL on attempt to cross.  Details depend on another pile of patches
to review and serialize (nfsd-related fs_pin stuff), but catching the moments
when they become unreachable is going to be useful (IOW, I don't see how to do
it without catching those; there might be an elegant solution I'm missing, of
course).

> Just looking at the last patch Eric sent, that one looks _trivial_. It
> didn't need *any* preparation or new rules. Compared to the mess with
> marking things MNT_DIR_ESCAPED etc, I know which approach I'd prefer.
> 
> But hey, if you think you can simplify it... I just don't think that
> even totally ignoring the d_splice_alias() things, and totally
> ignoring any locking around __d_move(), the whole "mark things
> MNT_DIR_ESCAPED" is a lot more complex.

	Basically what I have in mind is a few helpers called from 
dentry_lock_for_move() with d_move(), d_exchange() and d_splice_alias()
doing read_seqlock_excl(&mount_lock); just before grabbing rename_lock and
dropping it right after dropping rename_lock.

find_and_taint(dentry, ancestor)
{
	if (!is_dir(dentry))
		return;
	for (p = dentry->d_parent; p != ancestor; p = next) {
		if (unlikely(d_is_someones_root(p)))
			taint_mounts_of_this_subtree(p);
			// ... with dentry passed there as well when we
			// start handling unreachable submounts.
		next = p->d_parent;
		if (p == next)
			break;
	}
}

depth(d)
{
	int n;
	for (n = 0; !IS_ROOT(d); d = d->d_parent, n++)
		;
	return n;
}

/* find the last common ancestor of d1 and d2; NULL if there isn't one */
LCA(d1, d2)
{
	int n1 = depth(d1), n2 = depth(d2);
	if (n1 > n2)
		do d1 = d1->d_parent; while (--n1 != n2);
	else if (n1 < n2)
		do d2 = d2->d_parent; while (--n2 != n1);
	while (d1 != d2) {
		if (unlikely(IS_ROOT(d1)))
			return NULL;
		d1 = d1->d_parent;
		d2 = d2->d_parent;
	}
	return d1;
}

dentry_lock_for_move(dentry, target, exchange)
{
	ancestor = LCA(dentry, target);
	BUG_ON(ancestor == dentry);	// these BUG_ON are antisocial, BTW
	BUG_ON(ancestor == target);
	find_and_taint(dentry, ancestor);
	if (exchange)
		find_and_taint(target, ancestor);
	// the rest - as we do now
}
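
The depth()/LCA() helpers sketched above can be tried out in userspace.
What follows is only a minimal model, assuming a hypothetical struct node
with nothing but a parent pointer standing in for struct dentry, and a
root that points at itself the way an IS_ROOT() dentry does.  It is not
kernel code and takes no locks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dentry: only the parent link matters. */
struct node {
	struct node *parent;	/* a root points at itself, like IS_ROOT() */
};

static int is_root(const struct node *d)
{
	return d->parent == d;
}

/* Number of parent links between d and the root of its tree. */
static int depth(const struct node *d)
{
	int n = 0;

	assert(d != NULL);
	while (!is_root(d)) {
		d = d->parent;
		n++;
	}
	return n;
}

/* Last common ancestor of d1 and d2; NULL if they live in different trees. */
static struct node *lca(struct node *d1, struct node *d2)
{
	int n1 = depth(d1), n2 = depth(d2);

	/* Bring the deeper node up to the level of the shallower one. */
	for (; n1 > n2; n1--)
		d1 = d1->parent;
	for (; n2 > n1; n2--)
		d2 = d2->parent;

	/* Climb in lockstep until the walks meet or hit separate roots. */
	while (d1 != d2) {
		if (is_root(d1))
			return NULL;
		d1 = d1->parent;
		d2 = d2->parent;
	}
	return d1;
}
```

Two nodes in the same tree always meet at or before the shared root; two
nodes in different trees reach their separate roots at the same step, and
that is the NULL case.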

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                                               ` <20150816045322.GJ14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-08-16  6:22                                                                 ` Eric W. Biederman
  2015-08-16  6:55                                                                   ` Al Viro
       [not found]                                                                   ` <87fv3ju4zy.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 2 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-16  6:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	linux-fsdevel, Jann Horn, Linus Torvalds, Willy Tarreau

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Sat, Aug 15, 2015 at 07:25:41PM -0700, Linus Torvalds wrote:
>> On Sat, Aug 15, 2015 at 7:12 PM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
>> >
>> > I think you are underestimating the frequency of .. traversals.  Any build
>> > process that creates relative symlinks will be hitting it all the time,
>> > for one thing.
>> 
>> I suspect you're over-estimating how expensive it is to just walk down
>> to the mount-point. It's just a few pointer traversals.
>> 
>> Realistically, we probably do more than that for a *regular* path
>> component lookup, when we follow the hash chains. Following a d_parent
>> chain for ".." isn't that different.
>
> Point, but...  Keep in mind that there's another PITA in there - unreachable
> submounts are not as harmless as Eric hopes.  umount -l of the entire tainted
> mount is a very large hammer _and_ userland needs to know when to use
> it in the first place; otherwise we'll end up with dirty reboots.
> So slightly longer term I want to have something done to them when they become
> unreachable.  Namely, detach and leave in their place a trap that would
> give EINVAL on attempt to cross.  Details depend on another pile of patches
> to review and serialize (nfsd-related fs_pin stuff), but catching the moments
> when they become unreachable is going to be useful (IOW, I don't see how to do
> it without catching those; there might be an elegant solution I'm missing, of
> course).

*Rolls my eyes*

This has to be one of the most inconsistent lines of reasoning I have
ever heard.  Either escaping from a bind mount is as rare as hen's teeth
and our handling of the case is to keep it that way.  Or if they are
common we really need to handle people performing lookups on bind mounts
people have escaped from.

If escaping from bind mounts is as rare as hen's teeth (which it had
better be), then we don't have to be clever, we don't have to be nice,
and umount -l is perfectly sufficient.  And most of the time it really
will be an exit of a mount namespace that blows things away anyway.

Certainly all of the easy to create cases require creating a user
namespace which owns a mount namespace that a less privileged user
can manipulate.

If you gave someone you don't trust permission to move a directory
under which you have a mount point in the initial mount namespace
and that causes problems I am pretty certain that is a PBKAC error.

But even if I accept that the unmount code has to be done you have been
objecting to the infrastructure that is needed to make it happen.

That is to unmount something we must take namespace_sem from
d_splice_alias.  But more specifically we must call namespace_lock()
and namespace_unlock() from d_splice_alias.  And there are no hacks
like trylock for running synchronize_rcu.  That is we must sleep in
d_splice_alias to unmount things.  Arranging the locks so that it can
happen is what you have been rather strenuously objecting to.

Further if you want to unmount things you can't do it from __d_move
inside the rename_lock.  The logic must be outside the rename_lock.

>> Just looking at the last patch Eric sent, that one looks _trivial_. It
>> didn't need *any* preparation or new rules. Compared to the mess with
>> marking things MNT_DIR_ESCAPED etc, I know which approach I'd prefer.
>> 
>> But hey, if you think you can simplify it... I just don't think that
>> even totally ignoring the d_splice_alias() things, and totally
>> ignoring any locking around __d_move(), the whole "mark things
>> MNT_DIR_ESCAPED" is a lot more complex.
>
> 	Basically what I have in mind is a few helpers called from 
> dentry_lock_for_move() with d_move(), d_exchange() and d_splice_alias()
> doing read_seqlock_excl(&mount_lock); just before grabbing rename_lock and
> dropping it right after dropping rename_lock.

So you are arguing for code that has heavier locking than what I have
done, and you are conveniently ignoring the cost of reviewing all of
the code to see if i_lock is ever taken under mount_lock.

> find_and_taint(dentry, ancestor)
> {
> 	if (!is_dir(dentry))
> 		return;
> 	for (p = dentry->d_parent; p != ancestor; p = next) {
> 		if (unlikely(d_is_someones_root(p)))
> 			taint_mounts_of_this_subtree(p);
> 			// ... with dentry passed there as well when we
> 			// start handling unreachable submounts.
> 		next = p->d_parent;
> 		if (p == next)
> 			break;
> 	}
> }
>
> depth(d)
> {
> 	int n;
> 	for (n = 0; !IS_ROOT(d); d = d->d_parent, n++)
> 		;
> 	return n;
> }
>
> /* find the last common ancestor of d1 and d2; NULL if there isn't one */
> LCA(d1, d2)
> {
You are of course missing the common case and the obvious optimization
here:
	if (d1->d_parent == d2->d_parent)
		return d1->d_parent;

> 	int n1 = depth(d1), n2 = depth(d2);
> 	if (n1 > n2)
> 		do d1 = d1->d_parent; while (--n1 != n2);
> 	else if (n1 < n2)
> 		do d2 = d2->d_parent; while (--n2 != n1);
> 	while (d1 != d2) {
> 		if (unlikely(IS_ROOT(d1)))
> 			return NULL;
> 		d1 = d1->d_parent;
> 		d2 = d2->d_parent;
> 	}
> 	return d1;
> }
>
> dentry_lock_for_move(dentry, target, exchange)
> {
> 	ancestor = LCA(dentry, target);
> 	BUG_ON(ancestor == dentry);	// these BUG_ON are antisocial, BTW
> 	BUG_ON(ancestor == target);

> 	find_and_taint(dentry, ancestor);
> 	if (exchange)
> 		find_and_taint(target, ancestor);
> 	// the rest - as we do now

Which begs the question: what is the point of doing this in
dentry_lock_for_move?  None of that code has any logical connection to
the function of dentry_lock_for_move.  And if the rest of the logic is
untouched it is just stupid to do the work in dentry_lock_for_move.

> }

Al, I am sorry.  I am having a very hard time taking your comments on
code direction seriously at this point.  I have tested working code. 

You have some sort of half thought out proof of concept code that you
have never published.  And you are complaining that my working code is
not your half thought through code.

In my last round of patches that I sent out today, I did put mount_lock
just outside of rename_lock, in d_splice_alias.  But apparently you
haven't noticed.

Now at this point I have hit the limit of my time available for rewrites
before the merge window.  We can go with my 7 patch variant I posted
today (whose only sin appears to be that it is not your implementation),
its trivial reduction that Linus likes because it is simple, someone else
can write one, or this can all wait until the next development cycle.

Given that I have working code with no known defects I was really hoping
I could plug this security hole this development cycle.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                                                   ` <87fv3ju4zy.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-16  6:55                                                                     ` Al Viro
  0 siblings, 0 replies; 240+ messages in thread
From: Al Viro @ 2015-08-16  6:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	linux-fsdevel, Jann Horn, Linus Torvalds, Willy Tarreau

On Sun, Aug 16, 2015 at 01:22:41AM -0500, Eric W. Biederman wrote:

> In my last round of patches that I sent out today.  I did put mount_lock
> just outside of rename_lock, in d_splice_alias.  But apparently you
> haven't noticed.

I have.  The problem I have with that one is that you end up with duplicated
logic rather than taking it to one place.

> Now at this point I have hit the limit of my time available for rewrites
> before the merge window.  We can go with my 7 patch variant I posted
> today (whose only sin appears not to be your implemenation), it's
> trivial reduction that Linus likes because it is simple, someone else
> can write one, or this can all wait until the next development cycle.

... or either of us can do merging those checks into a single place,
be it as a followup to your 7-patch series, or folded with the
fs/dcache.c-affecting patches in there.  If you have no time left, I can
certainly do that followup myself - not a problem[1]

And umount-related followups are just that - I'm not asking you to do those,
especially since as I said this stuff is sensitive to fs_pin details (so far
it appears to fold nicely with the __detach_mounts()/umount_tree() stuff,
BTW).

[1] with credits for your patches preserved - normally I would assume that
this goes without saying, but your reply seems to imply that I'm playing some
kind of political BS games, so I'd rather spell that out.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                                                     ` <20150816065503.GL14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-08-16  7:04                                                                       ` Al Viro
  2015-08-16 11:33                                                                       ` Eric W. Biederman
  1 sibling, 0 replies; 240+ messages in thread
From: Al Viro @ 2015-08-16  7:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	linux-fsdevel, Jann Horn, Linus Torvalds, Willy Tarreau

On Sun, Aug 16, 2015 at 07:55:03AM +0100, Al Viro wrote:

> And umount-related followups are just that - I'm not asking you to do those,
> especially since as I said this stuff is sensitive to fs_pin details (so far
> it appears to fold nicely with the __detach_mounts()/umount_tree() stuff,
> BTW).

PS: it doesn't need namespace_sem taken inside __d_move() - actual
detaching does, of course, but that part gets done via task_work_add(),
in a reasonably sane locking environment.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                                                     ` <20150816065503.GL14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-08-16  7:04                                                                       ` Al Viro
@ 2015-08-16 11:33                                                                       ` Eric W. Biederman
       [not found]                                                                         ` <87bne7piwu.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  1 sibling, 1 reply; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-16 11:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	linux-fsdevel, Jann Horn, Linus Torvalds, Willy Tarreau

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Sun, Aug 16, 2015 at 01:22:41AM -0500, Eric W. Biederman wrote:
>
>> In my last round of patches that I sent out today.  I did put mount_lock
>> just outside of rename_lock, in d_splice_alias.  But apparently you
>> haven't noticed.
>
> I have.  The problem I have with that one is that you end up with duplicated
> logics rather than taking it to one place.

We are talking about a very small duplication of code.  But you are also
talking about combining logic that can get you in trouble pretty easily.

As I read the code sketch you posted it had the issue that if a
disconnected dentry was also a mount point it would mark that mount
point as escaped (because there was no common ancestor).

In the split out logic I don't even bother because you trivially can't
have escaped from anywhere when IS_ROOT(dentry).

>> Now at this point I have hit the limit of my time available for rewrites
>> before the merge window.  We can go with my 7 patch variant I posted
>> today (whose only sin appears not to be your implemenation), it's
>> trivial reduction that Linus likes because it is simple, someone else
>> can write one, or this can all wait until the next development cycle.
>
> ... or either of us can do merging those checks into a single place,
> be it as a followup to your 7-patch series, or folded with the
> fs/dcache.c-affecting patches in there.  If you have no time left, I can
> certainly do that followup myself - not a problem[1]

I don't have time.  Every time I have worked with this it has taken pretty
much full days of staring at the code, and I don't have any more full
days left before the merge window.

> And umount-related followups are just that - I'm not asking you to do those,
> especially since as I said this stuff is sensitive to fs_pin details (so far
> it appears to fold nicely with the __detach_mounts()/umount_tree() stuff,
> BTW).
>
> PS: it doesn't need namespace_sem taken inside __d_move() - actual
> detaching does, of course, but that part gets done via task_work_add(),
> in a reasonably sane locking environment.

The part that boggles my mind about that whole approach is that just
outside of rename_lock there already is a sane locking environment.

Admittedly it will take a touch of work in d_splice_alias to get it
sane, but that isn't particularly hard.

	write_seqlock(&rename_lock);
	if (!IS_ROOT(new))
		__d_unalias(...);

__d_unalias(...)
{
	/* Parents don't go away so !IS_ROOT(new) will stay valid */
	write_sequnlock(&rename_lock);
	/* Hooray! The code can be normal and sleep if it needs to,
	 * before calling into __d_move.
	 */
	...
	d_move();
}

Similarly for duplicate logic removal.  Teaching d_splice_alias, when
it is performing an ordinary rename, to be able to just call d_move
gets you far more from a maintenance perspective than cramming
everything into __d_move.

> [1] with credits for your patches preserved - normally I would assume that
> this goes without saying, but your reply seems to imply that I'm playing some
> kind of political BS games, so I'd rather spell that out.

My issue wasn't political games, but rather holding code to an
apparently BS standard.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                                           ` <CA+55aFy3pzEY=4dfd_PX-Og_b7fqrG1rDniOqehBfQhXb=Cg9A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-08-16  4:53                                                             ` Al Viro
@ 2015-08-16 11:51                                                             ` Eric W. Biederman
  2015-08-16 22:29                                                               ` Willy Tarreau
       [not found]                                                               ` <87egj3moxm.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  1 sibling, 2 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-16 11:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel, Jann Horn, Willy Tarreau

Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> writes:

> On Sat, Aug 15, 2015 at 7:12 PM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
>>
>> I think you are underestimating the frequency of .. traversals.  Any build
>> process that creates relative symlinks will be hitting it all the time,
>> for one thing.
>
> I suspect you're over-estimating how expensive it is to just walk down
> to the mount-point. It's just a few pointer traversals.
>
> Realistically, we probably do more than that for a *regular* path
> component lookup, when we follow the hash chains. Following a d_parent
> chain for ".." isn't that different.
>
> Just looking at the last patch Eric sent, that one looks _trivial_. It
> didn't need *any* preparation or new rules. Compared to the mess with
> marking things MNT_DIR_ESCAPED etc, I know which approach I'd prefer.
>
> But hey, if you think you can simplify it... I just don't think that
> even totally ignoring the d_splice_alias() things, and totally
> ignoring any locking around __d_move(), the whole "mark things
> MNT_DIR_ESCAPED" is a lot more complex.

It occurs to me that there is a fairly simple way we can empirically
test to see how expensive calling is_subdir for every .. on a bind mount
is in practice.

- Take my last patch
- run a benchmark outside of a bind mount (perhaps a kernel compile).
- run the same benchmark inside of a bind mount.

See if the performance differs.
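
A kernel compile measures the whole workload.  A narrower alternative,
which is not what is proposed above, is a userspace microbenchmark that
does nothing but repeat one lookup.  The sketch below assumes a
hypothetical helper, avg_stat_ns(), that times stat() on a caller-supplied
path; pointing it at a path full of ".." components, once inside and once
outside a bind mount, would isolate the cost of the extra check.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Average cost in nanoseconds of one stat() on path, over the given
 * number of iterations.  Returns -1.0 if any stat() call fails.
 */
static double avg_stat_ns(const char *path, long iterations)
{
	struct timespec start, end;
	struct stat st;
	long i;

	assert(path != NULL && iterations > 0);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iterations; i++) {
		if (stat(path, &st) != 0)
			return -1.0;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	/* Elapsed wall-clock time divided by the number of lookups. */
	return ((end.tv_sec - start.tv_sec) * 1e9 +
		(end.tv_nsec - start.tv_nsec)) / iterations;
}
```

The numbers only mean something when both runs use the same path depth
and a warm dcache, so the first call on each path is best thrown away.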

I am going to try to find time to do this, but I am travelling for the
next couple of days.

If someone who has a bit more time wants to try it and beats me to it,
that would be great.

I think having some empirical numbers would be nice in this part of the
conversation.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                                               ` <87egj3moxm.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-16 22:29                                                                 ` Willy Tarreau
  0 siblings, 0 replies; 240+ messages in thread
From: Willy Tarreau @ 2015-08-16 22:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel, Jann Horn, Linus Torvalds

On Sun, Aug 16, 2015 at 06:51:33AM -0500, Eric W. Biederman wrote:
> It occurs to me that there is a fairly simple way we can emperically
> test to see how expensive calling is_subdir for every .. on a bind mount
> is in practice.
> 
> - Take my last patch
> - run a benchmark outside of a bind mount (perhaps a kernel compile).
> - run the same benchmark inside of a bind mount.
> 
> See if the performance differs.
> 
> I am going to try to find time to do this, but I am travelling for the
> next couple of days.
> 
> If someone who has a bit more time wants to try it and beats me to that
> would be great.
> 
> I think having some emperical numbers would be nice in this part of the
> conversation.

I took a bit of time to do something simpler though less scientific: I
just ran a kernel build under strace -c -f to get a rough idea of the
number of syscalls. It takes 2m23 without strace and 3m00 under strace.
7.2M syscalls were seen out of which only 3.2M were related to FS access
(mostly open(), fstat() and stat()), the rest is minor. That's 22k
syscalls per second, or 45 us between two syscalls. From this point I'm
having a hard time imagining that any extra tests in the code path would
have a significant impact to cause a measurable difference compared to
these 45us, especially since I'm counting all FS accesses (worst case)
and not just those referencing "..".
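
The 22k and 45 us figures can be reproduced with the arithmetic below.
It assumes they come from dividing the 3.2M FS-related syscalls by the
un-straced build time of 2m23 (143 seconds); that is a reading of the
numbers above, not something stated explicitly.  The function names are
made up for the example.

```c
#include <assert.h>

/* Average syscall rate, in calls per second, over a run of `seconds`. */
static double syscalls_per_second(double syscalls, double seconds)
{
	assert(seconds > 0.0);
	return syscalls / seconds;
}

/* Average gap between two consecutive syscalls, in microseconds. */
static double usec_between_syscalls(double syscalls, double seconds)
{
	assert(syscalls > 0.0);
	return seconds * 1e6 / syscalls;
}
```

With 3.2e6 syscalls over 143 seconds this gives a bit over 22 thousand
calls per second and a bit under 45 microseconds between calls.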

Just my two cents,
Willy

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH] vfs: Test for and handle paths that are unreachable from their mnt_root
       [not found]                                                             ` <87tws010r2.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-17  3:56                                                               ` NeilBrown
  0 siblings, 0 replies; 240+ messages in thread
From: NeilBrown @ 2015-08-17  3:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields, Al Viro,
	linux-fsdevel, Jann Horn, Linus Torvalds, Willy Tarreau

On Sat, 15 Aug 2015 20:27:13 -0500 ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W.
Biederman) wrote:

> 
> In rare cases a directory can be renamed out from under a bind mount.
> In those cases without special handling it becomes possible to walk up
> the directory tree to the root dentry of the filesystem and down
> from the root dentry to every other file or directory on the filesystem.
> 
> Like division by zero .. from an unconnected path can not be given
> a useful semantic as there is no predicting at which path component
> the code will realize it is unconnected.  We certainly can not match
> the current behavior as the current behavior is a security hole.
> 
> Therefore when encounting .. when following an unconnected path
> return -ENOENT.
> 
> - Add a function path_connected to verify path->dentry is reachable
>   from path->mnt.mnt_root.  AKA to validate that rename did not do
>   something nasty to the bind mount.
> 
>   To avoid races path_connected must be called after following a path
>   component to it's next path component.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
> 
> This is the simple version that needs no extra vfs support.
> 
> My availability is likely to be a bit spotty for the next while
> as I am travelling to and then attending Linux Plumbers Conference.
> 
>  fs/namei.c | 27 +++++++++++++++++++++++++--
>  1 file changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index ae4e4c18b2ac..5303e994f8d6 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -560,6 +560,24 @@ static int __nd_alloc_stack(struct nameidata *nd)
>  	return 0;
>  }
>  
> +/**
> + * path_connected - Verify that a path->dentry is below path->mnt.mnt_root
> + * @path: nameidate to verify
> + *
> + * Rename can sometimes move a file or directory outside of a bind
> + * mount, path_connected allows those cases to be detected.

While it is obviously true that a rename can move a file outside of a
bind mount, it doesn't seem relevant and so could be confusing.
This is only ever used for directories, and a file could already be
outside a bind mount even while it is inside.

I would stick with "Rename can sometimes move a directory outside..."



> + */
> +static bool path_connected(const struct path *path)
> +{
> +	struct vfsmount *mnt = path->mnt;
> +
> +	/* Only bind mounts can have disconnected paths */
> +	if (mnt->mnt_root == mnt->mnt_sb->s_root)
> +		return true;
> +
> +	return is_subdir(path->dentry, mnt->mnt_root);
> +}
> +
>  static inline int nd_alloc_stack(struct nameidata *nd)
>  {
>  	if (likely(nd->depth != EMBEDDED_LEVELS))
> @@ -1296,6 +1314,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
>  				return -ECHILD;
>  			nd->path.dentry = parent;
>  			nd->seq = seq;
> +			if (unlikely(!path_connected(&nd->path)))
> +				return -ENOENT;
>  			break;
>  		} else {
>  			struct mount *mnt = real_mount(nd->path.mnt);
> @@ -1396,7 +1416,7 @@ static void follow_mount(struct path *path)
>  	}
>  }
>  
> -static void follow_dotdot(struct nameidata *nd)
> +static int follow_dotdot(struct nameidata *nd)
>  {
>  	if (!nd->root.mnt)
>  		set_root(nd);
> @@ -1412,6 +1432,8 @@ static void follow_dotdot(struct nameidata *nd)
>  			/* rare case of legitimate dget_parent()... */
>  			nd->path.dentry = dget_parent(nd->path.dentry);
>  			dput(old);
> +			if (unlikely(!path_connected(&nd->path)))
> +				return -ENOENT;
>  			break;
>  		}
>  		if (!follow_up(&nd->path))
> @@ -1419,6 +1441,7 @@ static void follow_dotdot(struct nameidata *nd)
>  	}
>  	follow_mount(&nd->path);
>  	nd->inode = nd->path.dentry->d_inode;
> +	return 0;
>  }
>  
>  /*
> @@ -1634,7 +1657,7 @@ static inline int handle_dots(struct nameidata *nd, int type)
>  		if (nd->flags & LOOKUP_RCU) {
>  			return follow_dotdot_rcu(nd);
>  		} else
> -			follow_dotdot(nd);
> +			return follow_dotdot(nd);
>  	}
>  	return 0;
>  }


I really like this patch, particularly from the standpoint of
backporting to -stable and enterprise kernels.  I suspect that all the
tracking of which mounts might have been escaped from should be
classified as premature optimisation, until measurements show otherwise.

path_connected() adds no locks or even atomics.  The path that it walks
will be well-trodden and so very likely most of it will be in the CPU
cache.
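
To make the check concrete, here is a small userspace model of what
path_connected() tests.  The types and the _model names are hypothetical
stand-ins, not the kernel structures, and the walk ignores the locking
and RCU concerns that the real is_subdir() has to handle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel structures. */
struct node {
	struct node *parent;	/* the filesystem root points at itself */
};

struct mount_model {
	struct node *mnt_root;	/* directory the mount exposes */
	struct node *sb_root;	/* root of the underlying filesystem */
};

/* True if d is root itself or lies somewhere below it. */
static bool is_subdir_model(const struct node *d, const struct node *root)
{
	assert(d != NULL && root != NULL);
	for (;;) {
		if (d == root)
			return true;
		if (d->parent == d)	/* reached the filesystem root */
			return false;
		d = d->parent;
	}
}

static bool path_connected_model(const struct mount_model *mnt,
				 const struct node *d)
{
	/* A mount of the whole filesystem cannot be escaped by rename. */
	if (mnt->mnt_root == mnt->sb_root)
		return true;
	return is_subdir_model(d, mnt->mnt_root);
}
```

Reparenting a directory from below mnt_root to somewhere else in the
tree, which is what the problematic rename does, flips the result from
true to false for a bind mount and leaves it true for a mount of the
filesystem root.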

And I particularly like that follow_dotdot() and follow_dotdot_rcu()
now both have the same signature :-)  What's not to like?

Reviewed-by: NeilBrown <neilb-IBi9RG/b67k@public.gmane.org>

in case it helps.


NeilBrown

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH] vfs: Test for and handle paths that are unreachable from their mnt_root
  2015-08-16  1:27                                                           ` [PATCH] vfs: Test for and handle paths that are unreachable from their mnt_root Eric W. Biederman
@ 2015-08-17  3:56                                                             ` NeilBrown
       [not found]                                                             ` <87tws010r2.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  1 sibling, 0 replies; 240+ messages in thread
From: NeilBrown @ 2015-08-17  3:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Linux Containers, linux-fsdevel, Al Viro,
	Andy Lutomirski, Serge E. Hallyn, Richard Weinberger,
	Andrey Vagin, Jann Horn, Willy Tarreau, Omar Sandoval,
	Miklos Szeredi, J. Bruce Fields

On Sat, 15 Aug 2015 20:27:13 -0500 ebiederm@xmission.com (Eric W.
Biederman) wrote:

> 
> In rare cases a directory can be renamed out from under a bind mount.
> In those cases without special handling it becomes possible to walk up
> the directory tree to the root dentry of the filesystem and down
> from the root dentry to every other file or directory on the filesystem.
> 
> Like division by zero .. from an unconnected path can not be given
> a useful semantic as there is no predicting at which path component
> the code will realize it is unconnected.  We certainly can not match
> the current behavior as the current behavior is a security hole.
> 
> Therefore when encounting .. when following an unconnected path
> return -ENOENT.
> 
> - Add a function path_connected to verify path->dentry is reachable
>   from path->mnt.mnt_root.  AKA to validate that rename did not do
>   something nasty to the bind mount.
> 
>   To avoid races path_connected must be called after following a path
>   component to it's next path component.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> 
> This is the simple version that needs no extra vfs support.
> 
> My availability is likely to be a bit spotty for the next while
> as I am travelling to and then attending Linux Plumbers Conference.
> 
>  fs/namei.c | 27 +++++++++++++++++++++++++--
>  1 file changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index ae4e4c18b2ac..5303e994f8d6 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -560,6 +560,24 @@ static int __nd_alloc_stack(struct nameidata *nd)
>  	return 0;
>  }
>  
> +/**
> + * path_connected - Verify that a path->dentry is below path->mnt.mnt_root
> + * @path: nameidate to verify
> + *
> + * Rename can sometimes move a file or directory outside of a bind
> + * mount, path_connected allows those cases to be detected.

While it is obviously true that a rename can move a file outside of a
bind mount, it doesn't seem relevant and so could be confusing.
This is only ever used for directories, and a file could already be
outside a bind mount even while it is inside.

I would stick with "Rename can sometimes move a directory outside..."



> + */
> +static bool path_connected(const struct path *path)
> +{
> +	struct vfsmount *mnt = path->mnt;
> +
> +	/* Only bind mounts can have disconnected paths */
> +	if (mnt->mnt_root == mnt->mnt_sb->s_root)
> +		return true;
> +
> +	return is_subdir(path->dentry, mnt->mnt_root);
> +}
> +
>  static inline int nd_alloc_stack(struct nameidata *nd)
>  {
>  	if (likely(nd->depth != EMBEDDED_LEVELS))
> @@ -1296,6 +1314,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
>  				return -ECHILD;
>  			nd->path.dentry = parent;
>  			nd->seq = seq;
> +			if (unlikely(!path_connected(&nd->path)))
> +				return -ENOENT;
>  			break;
>  		} else {
>  			struct mount *mnt = real_mount(nd->path.mnt);
> @@ -1396,7 +1416,7 @@ static void follow_mount(struct path *path)
>  	}
>  }
>  
> -static void follow_dotdot(struct nameidata *nd)
> +static int follow_dotdot(struct nameidata *nd)
>  {
>  	if (!nd->root.mnt)
>  		set_root(nd);
> @@ -1412,6 +1432,8 @@ static void follow_dotdot(struct nameidata *nd)
>  			/* rare case of legitimate dget_parent()... */
>  			nd->path.dentry = dget_parent(nd->path.dentry);
>  			dput(old);
> +			if (unlikely(!path_connected(&nd->path)))
> +				return -ENOENT;
>  			break;
>  		}
>  		if (!follow_up(&nd->path))
> @@ -1419,6 +1441,7 @@ static void follow_dotdot(struct nameidata *nd)
>  	}
>  	follow_mount(&nd->path);
>  	nd->inode = nd->path.dentry->d_inode;
> +	return 0;
>  }
>  
>  /*
> @@ -1634,7 +1657,7 @@ static inline int handle_dots(struct nameidata *nd, int type)
>  		if (nd->flags & LOOKUP_RCU) {
>  			return follow_dotdot_rcu(nd);
>  		} else
> -			follow_dotdot(nd);
> +			return follow_dotdot(nd);
>  	}
>  	return 0;
>  }


I really like this patch, particularly from the standpoint of
backporting to -stable and enterprise kernels.  I suspect that all the
tracking of which mounts might have been escaped from should be
classified as premature optimisation, until measurements show otherwise.

path_connected() adds no locks or even atomics.  The path that it walks
will be well-trodden and so very likely most of it will be in the CPU
cache.
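
For anyone following along, the check amounts to roughly the following.
This is only a userspace sketch under simplifying assumptions: the toy_*
types and functions are made up for illustration, there is no RCU or
rename_lock handling, and it is not the kernel's is_subdir().  It just
shows the parent walk and why a rename out of a bind mount is detected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct dentry: only the parent link matters here. */
struct toy_dentry {
	struct toy_dentry *parent;	/* the filesystem root points at itself */
};

/* Toy stand-in for struct vfsmount. */
struct toy_mount {
	struct toy_dentry *mnt_root;	/* directory this mount exposes */
	struct toy_dentry *sb_root;	/* root of the whole filesystem */
};

/* Walk parent links upwards from d; true if root is met on the way. */
static bool toy_is_subdir(const struct toy_dentry *d,
			  const struct toy_dentry *root)
{
	assert(d && root);
	for (;;) {
		if (d == root)
			return true;
		if (d->parent == d)	/* hit the filesystem root first */
			return false;
		d = d->parent;
	}
}

static bool toy_path_connected(const struct toy_mount *mnt,
			       const struct toy_dentry *d)
{
	/* Only bind mounts of a subdirectory can have disconnected paths. */
	if (mnt->mnt_root == mnt->sb_root)
		return true;
	return toy_is_subdir(d, mnt->mnt_root);
}

/*
 * Hypothetical scenario: /a is bind mounted, then /a/b is renamed to /c/b.
 * Returns 0 if the check behaves as described, non-zero otherwise.
 */
static int toy_demo(void)
{
	struct toy_dentry root = { &root };
	struct toy_dentry a = { &root };
	struct toy_dentry c = { &root };
	struct toy_dentry b = { &a };
	struct toy_mount bind = { &a, &root };

	if (!toy_path_connected(&bind, &b))
		return 1;	/* /a/b starts out below the bind mount root */
	b.parent = &c;		/* models rename("/a/b", "/c/b") */
	if (toy_path_connected(&bind, &b))
		return 2;	/* must now be reported as disconnected */
	return 0;
}
```

The walk is bounded by the depth of the directory, which is the cost
being discussed above.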

And I particularly like that follow_dotdot() and follow_dotdot_rcu()
now both have the same signature :-)  What's not to like?

Reviewed-by: NeilBrown <neilb@suse.com>

in case it helps.


NeilBrown

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                                                         ` <87bne7piwu.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-21  7:51                                                                           ` Al Viro
  2015-08-21 15:27                                                                             ` Eric W. Biederman
       [not found]                                                                             ` <20150821075105.GF18890-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  0 siblings, 2 replies; 240+ messages in thread
From: Al Viro @ 2015-08-21  7:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	linux-fsdevel, Jann Horn, Linus Torvalds, Willy Tarreau

On Sun, Aug 16, 2015 at 06:33:21AM -0500, Eric W. Biederman wrote:

> > ... or either of us can do merging those checks into a single place,
> > be it as a followup to your 7-patch series, or folded with the
> > fs/dcache.c-affecting patches in there.  If you have no time left, I can
> > certainly do that followup myself - not a problem[1]
> 
> I don't have time.  Everytime I have worked with this it has take pretty
> much full days of staring at the code, and I don't have any more full
> days left before the merge window.

OK, at that point I've pretty much given up on fs_pin for this cycle.
And testing your variant with unconditional checks on .. appears to have
fairly low overhead.  I still want to deal with catching and unmounting the
unreachable suckers, so fs/dcache.c side of things will get used when we get
to that stuff, but for now I've taken your 1/7, 2/7 plus the variant of
"vfs: Test for and handle paths that are unreachable from their mnt_root"
that doesn't care whether anything escaped or not.

3--6 are held in a local branch for now; I *am* going to use them
come next cycle.  And there's another pile of fun around that area, also
for the next cycle - kernel-initiated subtree removals on things like
sysfs et al.; handling of the locking in those is inconsistent and tied
with the fun we have for d_move()/__d_unalias().  Sigh...

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 0/7] Bind mount escape fixes
       [not found]                                                                             ` <20150821075105.GF18890-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-08-21 15:27                                                                               ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-21 15:27 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, J. Bruce Fields,
	linux-fsdevel, Jann Horn, Linus Torvalds, Willy Tarreau



On August 21, 2015 12:51:05 AM PDT, Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
>On Sun, Aug 16, 2015 at 06:33:21AM -0500, Eric W. Biederman wrote:
>
>> > ... or either of us can do merging those checks into a single
>place,
>> > be it as a followup to your 7-patch series, or folded with the
>> > fs/dcache.c-affecting patches in there.  If you have no time left,
>I can
>> > certainly do that followup myself - not a problem[1]
>> 
>> I don't have time.  Everytime I have worked with this it has take
>pretty
>> much full days of staring at the code, and I don't have any more full
>> days left before the merge window.
>
>OK, at that point I've pretty much given up on fs_pin for this cycle.
>And testing your variant with unconditional checks on .. appears to
>have
>fairly low overhead.  I still want to deal with catching and unmounting
>the
>unreachable suckers, so fs/dcache.c side of things will get used when
>we get
>to that stuff, but for now I've taken your 1/7, 2/7 plus the variant of
>"vfs: Test for and handle paths that are unreachable from their
>mnt_root"
>that doesn't care whether anything escaped or not.
>
>3--6 are held in a local branch for now; I *am* going to use them
>come next cycle.  And there's another pile of fun around that area,
>also
>for the next cycle - kernel-initiated subtree removals on things like
>sysfs et.al.; handling of the locking in those is inconsistent and tied
>with the fun we have for d_move()/__d_unalias().  Sigh...


I am sorry to hear about a mess in sysfs.

I am glad to hear that we do not appear to need MNT_DIR_ESCAPED to avoid performance regressions.  That should make backporters' lives much easier.


As for the future, I have a suspicion that we want to look at the RCU-readable, dynamically sized hash tables the networking guys cooked up.

Al, thank you very much for performance testing the simple version.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 6/6] vfs: Cache the results of path_connected
       [not found]                                 ` <878u9pwvg8.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-28 19:43                                   ` J. Bruce Fields
  2015-08-28 19:45                                     ` J. Bruce Fields
       [not found]                                     ` <20150828194302.GE10468-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
  0 siblings, 2 replies; 240+ messages in thread
From: J. Bruce Fields @ 2015-08-28 19:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

This response is kind of ridiculously delayed; vacation and a couple of
other things interfered!

On Wed, Aug 05, 2015 at 11:28:55AM -0500, Eric W. Biederman wrote:
> "J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> writes:
> 
> > On Tue, Aug 04, 2015 at 05:58:59PM -0500, Eric W. Biederman wrote:
> >>
> >> No problem.  The basic question is: can 2Billion renames be performed on
> >> the same filesystem in less time than a single path lookup?  Allowing
> >> the use of a 32bit counter.
> >
> > Certainly if you have control over an NFS or FUSE server then you can
> > arrange for that to happen--just delay the lookup until you've processed
> > enough renames.  I don't know if that's interesting....
> 
> Not particularly when the whole point is to start with a bind mount, do
> something trick and then have access to the whole filesystem instead of
> just the part of the filesystem exposed by the bind mount.
> 
> If you control the filesystem you already have access to the entire
> filesystem, so you don't need to do something trick.

I thought there was also a concern about impact on the sanity of the
system as a whole beyond just the contents of one filesystem: e.g. you
don't want an unprivileged user to be able to create an unmountable
mountpoint.

> That something tricky is a well placed rename that borks the tree
> structure and causes .. to never see the subdirectory that is the root
> of the bind mount.
> 
> >> If you could look up thread and tell me what you think of the issue with
> >> file handle to dentry conversion and bind mounts I would be appreciate.
> >
> > OK, I see your comments in "[PATCH review 0/6] Bind mount escape fixes",
> > I'm not sure I understand yet, I'll take a closer look.
> 
> Thanks.
> 
> The file handle reconstitution code can certainly be affected by all of
> this.  Given that it is an failure if reconnect_path can't reconnect the
> path of a file handle.  I think it can reasonably considered an error in
> all cases if that path is outside of an exported bind mount, but I don't
> know that area particularly well.  The solution might just be don't
> export file handles from bind mounts.

I don't think there's any new cause for concern here.

I'd quibble with the language "don't export filehandles", *if* by that
you mean "don't allow anyone to know filehandles".  They're
guessable, so keeping them secret doesn't guarantee much security.

The dangerous operation is open_by_handle, and people need to understand
that if you allow someone to call that then you're effectively giving
access to the whole filesystem.  That's always been true.  (We don't
really have an efficient way to determine if a non-directory is in a
given subtree anyway.)

Such filehandle-guessing attacks on NFS have long been well-understood.
Subtree checking (the subtree_check export option) can prevent them but
causes other problems, so isn't the default.

So the basic rule I think is "don't allow lookup-by-filehandle (or NFS
export) on part of a filesystem unless you'd be willing to allow it on
the whole thing".

--b.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 6/6] vfs: Cache the results of path_connected
       [not found]                                     ` <20150828194302.GE10468-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
@ 2015-08-28 19:45                                       ` J. Bruce Fields
  0 siblings, 0 replies; 240+ messages in thread
From: J. Bruce Fields @ 2015-08-28 19:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

On Fri, Aug 28, 2015 at 03:43:02PM -0400, J. Bruce Fields wrote:
> On Wed, Aug 05, 2015 at 11:28:55AM -0500, Eric W. Biederman wrote:
> > The file handle reconstitution code can certainly be affected by all of
> > this.  Given that it is an failure if reconnect_path can't reconnect the
> > path of a file handle.  I think it can reasonably considered an error in
> > all cases if that path is outside of an exported bind mount, but I don't
> > know that area particularly well.  The solution might just be don't
> > export file handles from bind mounts.
> 
> I don't think there's any new cause for concern here.
> 
> I'd quibble with the language "don't export filehandles", *if* by that
> you mean "don't tell allow anyone to know filehandles".  They're
> guessable, so keeping them secret doesn't guarantee much security.
> 
> The dangerous operation is open_by_handle, and people need to understand
> that if you allow someone to call that then you're effectively giving
> access to the whole filesystem.  That's always been true.  (We don't
> really have an efficient way to determine if a non-directory is in a
> given subtree anyway.)
> 
> Such filehandle-guessing attacks on NFS have long been well-understood.
> NOSUBTREECHECK can prevent them but causes other problems, so isn't the
> default.
> 
> So the basic rule I think is "don't allow lookup-by-filehandle (or NFS
> export) on part of a filesystem unless you'd be willing to allow it on
> the whole thing".

(So in case it wasn't clear: ACK to just ignoring this, I don't think
your (otherwise interesting) observations point to anything that needs
fixing in the lookup-by-filehandle case.)

--b.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 6/6] vfs: Cache the results of path_connected
       [not found]                                       ` <20150828194540.GF10468-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
@ 2015-08-31 21:17                                         ` Eric W. Biederman
       [not found]                                           ` <87k2sb88ev.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 240+ messages in thread
From: Eric W. Biederman @ 2015-08-31 21:17 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

"J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> writes:

> On Fri, Aug 28, 2015 at 03:43:02PM -0400, J. Bruce Fields wrote:
>> On Wed, Aug 05, 2015 at 11:28:55AM -0500, Eric W. Biederman wrote:
>> > The file handle reconstitution code can certainly be affected by all of
>> > this.  Given that it is an failure if reconnect_path can't reconnect the
>> > path of a file handle.  I think it can reasonably considered an error in
>> > all cases if that path is outside of an exported bind mount, but I don't
>> > know that area particularly well.  The solution might just be don't
>> > export file handles from bind mounts.
>> 
>> I don't think there's any new cause for concern here.
>> 
>> I'd quibble with the language "don't export filehandles", *if* by that
>> you mean "don't tell allow anyone to know filehandles".  They're
>> guessable, so keeping them secret doesn't guarantee much security.
>> 
>> The dangerous operation is open_by_handle, and people need to understand
>> that if you allow someone to call that then you're effectively giving
>> access to the whole filesystem.  That's always been true.  (We don't
>> really have an efficient way to determine if a non-directory is in a
>> given subtree anyway.)
>>
>>
>> Such filehandle-guessing attacks on NFS have long been well-understood.
>> NOSUBTREECHECK can prevent them but causes other problems, so isn't the
>> default.

Interesting.  I guess it makes sense that filehandles can be guessed.  I
wonder if a crypto variant that is resistant to plaintext attacks would
be a practical defense.

We do have d_ancestor/is_subdir, which is a comparatively efficient way to
see if a dentry is in a given subtree.  As it does not need to perform
the permission checks I believe it would be somewhat cheaper than the
current nfs subtree check code.  I don't know if that would avoid the
known problem with the subtree check code.  Nor do I know if it would be
cheap enough to use for every nfsd operation when a file handle is
received.

>> So the basic rule I think is "don't allow lookup-by-filehandle (or NFS
>> export) on part of a filesystem unless you'd be willing to allow it on
>> the whole thing".
>
> (So in case it wasn't clear: ACK to just ignoring this, I don't think
> your (otherwise interesting) observations point to anything that needs
> fixing in the lookup-by-filehandle case.)

Thanks for looking into this.  It helps to know that someone who knows
the history of what happens with filehandles has looked at this.

Eric

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 6/6] vfs: Cache the results of path_connected
       [not found]                                           ` <87k2sb88ev.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-09-01 14:46                                             ` J. Bruce Fields
  2015-09-01 18:00                                               ` Eric W. Biederman
       [not found]                                               ` <20150901144632.GA32692-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
  0 siblings, 2 replies; 240+ messages in thread
From: J. Bruce Fields @ 2015-09-01 14:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Linus Torvalds,
	Willy Tarreau

On Mon, Aug 31, 2015 at 04:17:28PM -0500, Eric W. Biederman wrote:
> "J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> writes:
> 
> > On Fri, Aug 28, 2015 at 03:43:02PM -0400, J. Bruce Fields wrote:
> >> On Wed, Aug 05, 2015 at 11:28:55AM -0500, Eric W. Biederman wrote:
> >> > The file handle reconstitution code can certainly be affected by all of
> >> > this.  Given that it is an failure if reconnect_path can't reconnect the
> >> > path of a file handle.  I think it can reasonably considered an error in
> >> > all cases if that path is outside of an exported bind mount, but I don't
> >> > know that area particularly well.  The solution might just be don't
> >> > export file handles from bind mounts.
> >> 
> >> I don't think there's any new cause for concern here.
> >> 
> >> I'd quibble with the language "don't export filehandles", *if* by that
> >> you mean "don't tell allow anyone to know filehandles".  They're
> >> guessable, so keeping them secret doesn't guarantee much security.
> >> 
> >> The dangerous operation is open_by_handle, and people need to understand
> >> that if you allow someone to call that then you're effectively giving
> >> access to the whole filesystem.  That's always been true.  (We don't
> >> really have an efficient way to determine if a non-directory is in a
> >> given subtree anyway.)
> >>
> >>
> >> Such filehandle-guessing attacks on NFS have long been well-understood.
> >> NOSUBTREECHECK can prevent them but causes other problems, so isn't the
> >> default.
> 
> Interesting.  I guess it makes sense that filehandles can be guessed.  I
> wonder if a crypto variant that is resistant to plain text attacks would
> be a practical defense.

People have considered it.  I don't think it would be hard: just
generate some permanent server secret and use it to encrypt all
filehandles (and decrypt them again when looking them up).

Some of the reasons I don't think it's been done:

	- Well, it's work, and nobody's really felt that strongly about
	  it.

	- It's usually not that hard to create another filesystem when
	  you need a real security boundary.

	- Filehandles are forever.  But it's hard to keep secrets
	  forever, especially when they have to be transmitted over the
	  network a lot.  (In more detail: client applications can hold
	  long-lived references to filesystem objects through open file
	  descriptors or current working directories.  They expect those
	  to keep working even over server reboots.  We don't even know
	  about those references.  So any filehandle we give out could
	  be looked up at any time in the future.  The only case where
	  we consider it acceptable to return ESTALE is when the
	  object's actually gone.)

> We do have d_ancestor/is_subdir that is a compartively efficient way to
> see if a dentry is in a given subtree.  As it does not need to perform
> the permission checks I believe it would be some cheaper than the
> current nfs subtree check code.  I don't know if that would avoid the
> known problem with the subtree check code.  Nor do I know if it would be
> cheap enough to use for every nfsd operation when a file handle is
> received.

That would solve the problem for directories, but not for files.  For
non-directories we'd need special support from the filesystem (since at
the time we look up the filehandle the parent(s) may not be in the
dcache yet).  Steps to check subtree membership would be roughly (number
of hardlinks) * (average depth).  I think it's probably not worth it.
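
To make that estimate concrete, here is a rough userspace sketch.  The
toy_* names are invented for illustration, and it assumes every parent
directory is already in memory, which is exactly what we can't assume
when a filehandle arrives.  It only counts the parent hops needed to
decide that a multiply-linked file is not under a given subtree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dir {
	struct toy_dir *parent;		/* NULL at the filesystem root */
};

/* A non-directory may be linked into several directories at once. */
struct toy_file {
	struct toy_dir **link_parents;	/* one entry per hard link */
	size_t nlinks;
};

/* True if any hard link lies below subtree; *steps counts parent hops. */
static bool toy_file_in_subtree(const struct toy_file *f,
				const struct toy_dir *subtree,
				size_t *steps)
{
	assert(f && subtree && steps);
	*steps = 0;
	for (size_t i = 0; i < f->nlinks; i++) {
		const struct toy_dir *d;

		for (d = f->link_parents[i]; d; d = d->parent) {
			if (d == subtree)
				return true;
			(*steps)++;
		}
	}
	return false;
}

/*
 * Hypothetical worst case: three links, all in /x/y, checked against an
 * exported subtree that contains none of them.  Returns the hop count.
 */
static size_t toy_worst_case_steps(void)
{
	struct toy_dir root = { NULL };
	struct toy_dir exported = { &root };
	struct toy_dir x = { &root };
	struct toy_dir y = { &x };
	struct toy_dir *links[] = { &y, &y, &y };
	struct toy_file f = { links, 3 };
	size_t steps;

	if (toy_file_in_subtree(&f, &exported, &steps))
		return 0;
	return steps;	/* 3 links * 3 directories walked per link */
}
```

With three links and three directories to walk per link, the negative
answer costs nine hops, and in practice each hop may mean asking the
filesystem for a parent that is not cached.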

Anyway, forgive the digressions....

--b.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH review 6/6] vfs: Cache the results of path_connected
       [not found]                                               ` <20150901144632.GA32692-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
@ 2015-09-01 18:00                                                 ` Eric W. Biederman
  0 siblings, 0 replies; 240+ messages in thread
From: Eric W. Biederman @ 2015-09-01 18:00 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, Al Viro, linux-fsdevel,
	Jann Horn, Linus Torvalds, Willy Tarreau

"J. Bruce Fields" <bfields@fieldses.org> writes:

> On Mon, Aug 31, 2015 at 04:17:28PM -0500, Eric W. Biederman wrote:
>> "J. Bruce Fields" <bfields@fieldses.org> writes:
>> 
>> > On Fri, Aug 28, 2015 at 03:43:02PM -0400, J. Bruce Fields wrote:
>> >> On Wed, Aug 05, 2015 at 11:28:55AM -0500, Eric W. Biederman wrote:
>> >> > The file handle reconstitution code can certainly be affected by all of
>> >> > this, given that it is a failure if reconnect_path can't reconnect the
>> >> > path of a file handle.  I think it can reasonably be considered an error
>> >> > in all cases if that path is outside of an exported bind mount, but I
>> >> > don't know that area particularly well.  The solution might just be
>> >> > don't export file handles from bind mounts.
>> >> 
>> >> I don't think there's any new cause for concern here.
>> >> 
>> >> I'd quibble with the language "don't export filehandles", *if* by that
>> >> you mean "don't allow anyone to know filehandles".  They're
>> >> guessable, so keeping them secret doesn't guarantee much security.
>> >> 
>> >> The dangerous operation is open_by_handle, and people need to understand
>> >> that if you allow someone to call that then you're effectively giving
>> >> access to the whole filesystem.  That's always been true.  (We don't
>> >> really have an efficient way to determine if a non-directory is in a
>> >> given subtree anyway.)
>> >>
>> >> Such filehandle-guessing attacks on NFS have long been well-understood.
>> >> Subtree checking (the subtree_check export option) can prevent them but
>> >> causes other problems, so isn't the default.
>> 
>> Interesting.  I guess it makes sense that filehandles can be guessed.  I
>> wonder if a crypto variant that is resistant to plaintext attacks would
>> be a practical defense.
>
> People have considered it.  I don't think it would be hard: just
> generate some permanent server secret and use it to encrypt all
> filehandles (and decrypt them again when looking them up).
>
> Some of the reasons I don't think it's been done:
>
> 	- Well, it's work, and nobody's really felt that strongly about
> 	  it.
>
> 	- It's usually not that hard to create another filesystem when
> 	  you need a real security boundary.
>
> 	- Filehandles are forever.  But it's hard to keep secrets
> 	  forever, especially when they have to be transmitted over the
> 	  network a lot.  (In more detail: client applications can hold
> 	  long-lived references to filesystem objects through open file
> 	  descriptors or current working directories.  They expect those
> 	  to keep working even over server reboots.  We don't even know
> 	  about those references.  So any filehandle we give out could
> 	  be looked up at any time in the future.  The only case where
> 	  we consider it acceptable to return ESTALE is when the
> 	  object's actually gone.)
>
>> We do have d_ancestor/is_subdir that is a comparatively efficient way to
>> see if a dentry is in a given subtree.  As it does not need to perform
>> the permission checks I believe it would be somewhat cheaper than the
>> current nfs subtree check code.  I don't know if that would avoid the
>> known problem with the subtree check code.  Nor do I know if it would be
>> cheap enough to use for every nfsd operation when a file handle is
>> received.
>
> That would solve the problem for directories, but not for files.  For
> non-directories we'd need special support from the filesystem (since at
> the time we look up the filehandle the parent(s) may not be in the
> dcache yet).  Steps to check subtree membership would be roughly (number
> of hardlinks) * (average depth).  I think it's probably not worth it.

If viewed that way you are probably right.

I was simply thinking about using the existing subtree check without the
inode_permission test, which works at the cost of a readdir to reconnect
an inode to a directory.  I had not noticed the readdir before, so I
admit it does not sound like a particularly speedy way to go.
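
For illustration only, here is a rough userspace model of the kind of
ancestor walk being discussed.  It is a sketch, not the kernel's
d_ancestor/is_subdir code and not what nfsd does: struct dir, struct file,
dir_in_subtree and file_in_subtree are names made up for this example, and
it assumes every parent is already in memory, which is exactly what cannot
be assumed for a non-directory looked up from a file handle.  The step
counter shows where the (number of hardlinks) * (average depth) estimate
comes from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a directory dentry: each node only knows its
 * parent; the filesystem root has parent == NULL. */
struct dir {
	const struct dir *parent;
};

/* Hypothetical stand-in for a non-directory: one entry in parents[] per
 * hard link. */
struct file {
	const struct dir *const *parents;
	size_t nlink;
};

/* Walk parent pointers from d towards the filesystem root.  Returns true
 * if root is d itself or one of its ancestors.  If steps is non-NULL, it
 * is incremented once per directory visited. */
static bool dir_in_subtree(const struct dir *d, const struct dir *root,
			   size_t *steps)
{
	assert(root != NULL);
	for (; d != NULL; d = d->parent) {
		if (steps)
			(*steps)++;
		if (d == root)
			return true;
	}
	return false;
}

/* A file counts as inside the subtree if the parent directory of any of
 * its hard links is inside it.  In the worst case every link is walked
 * all the way up, i.e. roughly nlink * depth steps. */
static bool file_in_subtree(const struct file *f, const struct dir *root,
			    size_t *steps)
{
	for (size_t i = 0; i < f->nlink; i++)
		if (dir_in_subtree(f->parents[i], root, steps))
			return true;
	return false;
}
```

A directory has a single parent, so one walk is enough.  A non-directory
needs up to one walk per hard link, and in the real case the parents first
have to be found and reconnected, which is where the extra cost comes from.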

> Anyway, forgive the digressions....

No problem.  Thank you for the discussion.  If nothing else, this has
allowed me to understand the problem from a real-world perspective, and
in particular to understand which permission checks would be necessary
to safely allow file handles in a user namespace (if we ever decide it
is safe to allow that).

In short, if you did not mount the filesystem you had better not be NFS
exporting the filesystem, or parts of the filesystem, or be allowed to
use file handle access to the filesystem.

Eric

* Re: [PATCH review 6/6] vfs: Cache the results of path_connected
       [not found]                                                 ` <877foavx3f.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-09-01 18:11                                                   ` J. Bruce Fields
  0 siblings, 0 replies; 240+ messages in thread
From: J. Bruce Fields @ 2015-09-01 18:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, Miklos Szeredi, Richard Weinberger,
	Linux Containers, Andy Lutomirski, Al Viro, linux-fsdevel,
	Jann Horn, Linus Torvalds, Willy Tarreau

On Tue, Sep 01, 2015 at 01:00:20PM -0500, Eric W. Biederman wrote:
> No problem.  Thank you for the discussion.  If nothing else, this has
> allowed me to understand the problem from a real-world perspective, and
> in particular to understand which permission checks would be necessary
> to safely allow file handles in a user namespace (if we ever decide it
> is safe to allow that).
> 
> In short, if you did not mount the filesystem you had better not be NFS
> exporting the filesystem, or parts of the filesystem, or be allowed to
> use file handle access to the filesystem.

Agreed.

--b.

2015-08-17  3:56                                                               ` NeilBrown
2015-08-16  2:12                                                       ` [PATCH review 0/7] Bind mount escape fixes Al Viro
     [not found]                                                         ` <20150816021209.GI14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
2015-08-16  2:25                                                           ` Linus Torvalds
2015-08-16  2:25                                                         ` Linus Torvalds
     [not found]                                                           ` <CA+55aFy3pzEY=4dfd_PX-Og_b7fqrG1rDniOqehBfQhXb=Cg9A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-08-16  4:53                                                             ` Al Viro
     [not found]                                                               ` <20150816045322.GJ14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
2015-08-16  6:22                                                                 ` Eric W. Biederman
2015-08-16  6:55                                                                   ` Al Viro
     [not found]                                                                     ` <20150816065503.GL14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
2015-08-16  7:04                                                                       ` Al Viro
2015-08-16 11:33                                                                       ` Eric W. Biederman
     [not found]                                                                         ` <87bne7piwu.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-21  7:51                                                                           ` Al Viro
2015-08-21 15:27                                                                             ` Eric W. Biederman
     [not found]                                                                             ` <20150821075105.GF18890-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
2015-08-21 15:27                                                                               ` Eric W. Biederman
     [not found]                                                                   ` <87fv3ju4zy.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-16  6:55                                                                     ` Al Viro
2015-08-16 11:51                                                             ` Eric W. Biederman
2015-08-16 22:29                                                               ` Willy Tarreau
     [not found]                                                               ` <87egj3moxm.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-16 22:29                                                                 ` Willy Tarreau
2015-08-15 18:39                                     ` [PATCH review 6/7] mnt: Track when a directory escapes a bind mount Eric W. Biederman
2015-08-15 18:39                                     ` [PATCH review 7/7] vfs: Test for and handle paths that are unreachable from their mnt_root Eric W. Biederman
2015-08-15 18:39                                   ` Eric W. Biederman
2015-08-03 21:30             ` [PATCH review 5/6] " Eric W. Biederman
     [not found]               ` <878u9s9i1d.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-10  4:38                 ` Al Viro
     [not found]                   ` <20150810043814.GD14139-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
2015-08-10 19:34                     ` Eric W. Biederman
2015-08-03 21:30             ` [PATCH review 6/6] vfs: Cache the results of path_connected Eric W. Biederman
     [not found]               ` <8738009i0h.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-04 11:52                 ` Andrew Vagin
2015-08-04 11:52               ` Andrew Vagin
     [not found]                 ` <20150804115215.GA317-wo1vFcy6AUs@public.gmane.org>
2015-08-04 17:41                   ` Eric W. Biederman
2015-08-04 19:44                     ` J. Bruce Fields
2015-08-04 22:58                       ` Eric W. Biederman
     [not found]                         ` <874mkey824.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-05 15:59                           ` J. Bruce Fields
     [not found]                             ` <20150805155948.GD17797-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2015-08-05 16:28                               ` Eric W. Biederman
     [not found]                                 ` <878u9pwvg8.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-28 19:43                                   ` J. Bruce Fields
2015-08-28 19:45                                     ` J. Bruce Fields
     [not found]                                       ` <20150828194540.GF10468-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2015-08-31 21:17                                         ` Eric W. Biederman
     [not found]                                           ` <87k2sb88ev.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-09-01 14:46                                             ` J. Bruce Fields
2015-09-01 18:00                                               ` Eric W. Biederman
     [not found]                                                 ` <877foavx3f.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-09-01 18:11                                                   ` J. Bruce Fields
2015-09-01 18:11                                                 ` J. Bruce Fields
     [not found]                                               ` <20150901144632.GA32692-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2015-09-01 18:00                                                 ` Eric W. Biederman
     [not found]                                     ` <20150828194302.GE10468-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2015-08-28 19:45                                       ` J. Bruce Fields
     [not found]                       ` <20150804194447.GB6664-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2015-08-04 22:58                         ` Eric W. Biederman
     [not found]                     ` <871tfj0x4j.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-04 19:44                       ` J. Bruce Fields
2015-04-16 23:40       ` [GIT PULL] Usernamespace related locked mount fixes Eric W. Biederman
     [not found]         ` <87383z1w1v.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-04-16 23:42           ` Eric W. Biederman
2015-04-16 23:42         ` Eric W. Biederman
2015-01-02 21:52 ` [PATCH review 4/9] mnt: Add MNT_UMOUNT flag Eric W. Biederman
2015-01-02 21:52 ` [PATCH review 5/9] mnt: Delay removal from the mount hash Eric W. Biederman
2015-01-02 21:52 ` [PATCH review 6/9] mnt: Factor out __detach_mnt from detach_mnt Eric W. Biederman
2015-01-02 21:52 ` [PATCH review 7/9] mnt: Simplify umount_tree Eric W. Biederman
2015-01-02 21:52 ` [PATCH review 8/9] mnt: Remove redundant NULL tests in namespace_unlock Eric W. Biederman
2015-01-02 21:52 ` [PATCH review 9/9] mnt: Honor MNT_LOCKED when detaching mounts Eric W. Biederman
     [not found]   ` <1420235574-15177-9-git-send-email-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2015-01-03  2:27     ` Eric W. Biederman
