All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH linux-next] mqueue: fix IPC namespace use-after-free
@ 2017-12-19 10:14 Giuseppe Scrivano
  2017-12-19 11:48 ` Al Viro
  0 siblings, 1 reply; 13+ messages in thread
From: Giuseppe Scrivano @ 2017-12-19 10:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, gscrivan, alexander.deucher, broonie, chris, David Miller,
	deepa.kernel, Greg KH, luc.vanoostenryck, lucien xin,
	Ingo Molnar, Neil Horman, syzkaller-bugs, Al Viro,
	Vladislav Yasevich

mqueue_evict_inode() doesn't access the ipc namespace if it was
already freed.  It can happen if in a new IPC namespace the inode was
created without a prior mq_open() which creates the vfsmount used to
access the superblock from mq_clear_sbinfo().

Keep a direct pointer to the superblock used by the inodes so we can
correctly reset the reference to the IPC namespace being destroyed.

Bug introduced with 9c583773d03633 ("ipc, mqueue: lazy call
kern_mount_data in new namespaces")

==================================================================
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:183
[inline]
BUG: KASAN: use-after-free in atomic_read arch/x86/include/asm/atomic.h:27
[inline]
BUG: KASAN: use-after-free in refcount_inc_not_zero+0x16e/0x180
lib/refcount.c:120
Read of size 4 at addr ffff8801c51bb200 by task syzkaller711981/3156

CPU: 1 PID: 3156 Comm: syzkaller711981 Not tainted 4.15.0-rc2-mm1+ #39
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x194/0x257 lib/dump_stack.c:53
  print_address_description+0x73/0x250 mm/kasan/report.c:252
  kasan_report_error mm/kasan/report.c:351 [inline]
  kasan_report+0x25b/0x340 mm/kasan/report.c:409
  __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
  __read_once_size include/linux/compiler.h:183 [inline]
  atomic_read arch/x86/include/asm/atomic.h:27 [inline]
  refcount_inc_not_zero+0x16e/0x180 lib/refcount.c:120
  refcount_inc+0x15/0x50 lib/refcount.c:153
  get_ipc_ns include/linux/ipc_namespace.h:129 [inline]
  __get_ns_from_inode ipc/mqueue.c:110 [inline]
  get_ns_from_inode ipc/mqueue.c:118 [inline]
  mqueue_evict_inode+0x137/0x9c0 ipc/mqueue.c:402
  evict+0x481/0x920 fs/inode.c:552
  iput_final fs/inode.c:1514 [inline]
  iput+0x7b9/0xaf0 fs/inode.c:1541
  dentry_unlink_inode+0x4b0/0x5e0 fs/dcache.c:376
  __dentry_kill+0x3b7/0x6d0 fs/dcache.c:573
  shrink_dentry_list+0x3c5/0xcf0 fs/dcache.c:1020
  shrink_dcache_parent+0xba/0x230 fs/dcache.c:1454
  do_one_tree+0x15/0x50 fs/dcache.c:1485
  shrink_dcache_for_umount+0xbb/0x290 fs/dcache.c:1502
  generic_shutdown_super+0xcd/0x540 fs/super.c:424
  kill_anon_super fs/super.c:987 [inline]
  kill_litter_super+0x72/0x90 fs/super.c:997
  deactivate_locked_super+0x88/0xd0 fs/super.c:312
  deactivate_super+0x141/0x1b0 fs/super.c:343
  cleanup_mnt+0xb2/0x150 fs/namespace.c:1173
  __cleanup_mnt+0x16/0x20 fs/namespace.c:1180
  task_work_run+0x199/0x270 kernel/task_work.c:113
  exit_task_work include/linux/task_work.h:22 [inline]
  do_exit+0x9bb/0x1ae0 kernel/exit.c:869
  do_group_exit+0x149/0x400 kernel/exit.c:972
  SYSC_exit_group kernel/exit.c:983 [inline]
  SyS_exit_group+0x1d/0x20 kernel/exit.c:981
  entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x440729
RSP: 002b:00007ffd090ef228 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0030656c69662f2e RCX: 0000000000440729
RDX: 0000000000440729 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 00000000006cb018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000401bf0
R13: 0000000000401c80 R14: 0000000000000000 R15: 0000000000000000

Allocated by task 3156:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
  set_track mm/kasan/kasan.c:459 [inline]
  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
  kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3614
  kmalloc include/linux/slab.h:516 [inline]
  create_ipc_ns ipc/namespace.c:45 [inline]
  copy_ipcs+0x1b3/0x520 ipc/namespace.c:96
  create_new_namespaces+0x278/0x880 kernel/nsproxy.c:87
  unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:206
  SYSC_unshare kernel/fork.c:2421 [inline]
  SyS_unshare+0x653/0xfa0 kernel/fork.c:2371
  entry_SYSCALL_64_fastpath+0x1f/0x96

Freed by task 3156:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
  set_track mm/kasan/kasan.c:459 [inline]
  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
  __cache_free mm/slab.c:3492 [inline]
  kfree+0xca/0x250 mm/slab.c:3807
  free_ipc_ns ipc/namespace.c:139 [inline]
  put_ipc_ns+0x112/0x150 ipc/namespace.c:164
  free_nsproxy+0xc0/0x1f0 kernel/nsproxy.c:180
  switch_task_namespaces+0x9d/0xc0 kernel/nsproxy.c:229
  exit_task_namespaces+0x17/0x20 kernel/nsproxy.c:234
  do_exit+0x9b6/0x1ae0 kernel/exit.c:868
  do_group_exit+0x149/0x400 kernel/exit.c:972
  SYSC_exit_group kernel/exit.c:983 [inline]
  SyS_exit_group+0x1d/0x20 kernel/exit.c:981
  entry_SYSCALL_64_fastpath+0x1f/0x96

The buggy address belongs to the object at ffff8801c51bb200
  which belongs to the cache kmalloc-2048 of size 2048
The buggy address is located 0 bytes inside of
  2048-byte region [ffff8801c51bb200, ffff8801c51bba00)
The buggy address belongs to the page:
page:000000007764ba6d count:1 mapcount:0 mapping:000000002c36623f index:0x0
compound_mapcount: 0
flags: 0x2fffc0000008100(slab|head)
raw: 02fffc0000008100 ffff8801c51ba100 0000000000000000 0000000100000003
raw: ffffea000715d320 ffff8801dac01950 ffff8801dac00c40 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff8801c51bb100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff8801c51bb180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ffff8801c51bb200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                    ^
  ffff8801c51bb280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8801c51bb300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
---
 include/linux/ipc_namespace.h | 3 ++-
 ipc/mqueue.c                  | 6 ++++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 554e31494f69..29ae2ede7602 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -52,7 +52,8 @@ struct ipc_namespace {
 	struct notifier_block ipcns_nb;
 
 	/* The kern_mount of the mqueuefs sb.  We take a ref on it */
-	struct vfsmount	*mq_mnt;
+	struct vfsmount		*mq_mnt;
+	struct super_block	*mq_sb;
 
 	/* # queues in this ns, protected by mq_lock */
 	unsigned int    mq_queues_count;
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 36f177dcb39a..d664c0b0f075 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -341,6 +341,7 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_root = d_make_root(inode);
 	if (!sb->s_root)
 		return -ENOMEM;
+	ns->mq_sb = sb;
 	return 0;
 }
 
@@ -1554,6 +1555,7 @@ int mq_init_ns(struct ipc_namespace *ns, bool mount)
 	ns->mq_msg_max       = DFLT_MSGMAX;
 	ns->mq_msgsize_max   = DFLT_MSGSIZEMAX;
 	ns->mq_msg_default   = DFLT_MSG;
+	ns->mq_sb            = NULL;
 	ns->mq_msgsize_default  = DFLT_MSGSIZE;
 
 	if (!mount)
@@ -1573,8 +1575,8 @@ int mq_init_ns(struct ipc_namespace *ns, bool mount)
 
 void mq_clear_sbinfo(struct ipc_namespace *ns)
 {
-	if (ns->mq_mnt)
-		ns->mq_mnt->mnt_sb->s_fs_info = NULL;
+	if (ns->mq_sb)
+		ns->mq_sb->s_fs_info = NULL;
 }
 
 void mq_put_mnt(struct ipc_namespace *ns)
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free
  2017-12-19 10:14 [PATCH linux-next] mqueue: fix IPC namespace use-after-free Giuseppe Scrivano
@ 2017-12-19 11:48 ` Al Viro
  2017-12-19 15:32   ` Al Viro
  0 siblings, 1 reply; 13+ messages in thread
From: Al Viro @ 2017-12-19 11:48 UTC (permalink / raw)
  To: Giuseppe Scrivano
  Cc: Andrew Morton, LKML, alexander.deucher, broonie, chris,
	David Miller, deepa.kernel, Greg KH, luc.vanoostenryck,
	lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs,
	Vladislav Yasevich

On Tue, Dec 19, 2017 at 11:14:40AM +0100, Giuseppe Scrivano wrote:
> mqueue_evict_inode() doesn't access the ipc namespace if it was
> already freed.  It can happen if in a new IPC namespace the inode was
> created without a prior mq_open() which creates the vfsmount used to
> access the superblock from mq_clear_sbinfo().
> 
> Keep a direct pointer to the superblock used by the inodes so we can
> correctly reset the reference to the IPC namespace being destroyed.
> 
> Bug introduced with 9c583773d03633 ("ipc, mqueue: lazy call
> kern_mount_data in new namespaces")

And just what will happen in the same scenario if you mount the damn
thing in userland without ever calling mq_open(), touch a file there,
then unmount and then leave the ipc namespace?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free
  2017-12-19 11:48 ` Al Viro
@ 2017-12-19 15:32   ` Al Viro
  2017-12-19 15:44     ` Al Viro
  2017-12-19 16:59     ` Giuseppe Scrivano
  0 siblings, 2 replies; 13+ messages in thread
From: Al Viro @ 2017-12-19 15:32 UTC (permalink / raw)
  To: Giuseppe Scrivano
  Cc: Andrew Morton, LKML, alexander.deucher, broonie, chris,
	David Miller, deepa.kernel, Greg KH, luc.vanoostenryck,
	lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs,
	Vladislav Yasevich

On Tue, Dec 19, 2017 at 11:48:19AM +0000, Al Viro wrote:
> On Tue, Dec 19, 2017 at 11:14:40AM +0100, Giuseppe Scrivano wrote:
> > mqueue_evict_inode() doesn't access the ipc namespace if it was
> > already freed.  It can happen if in a new IPC namespace the inode was
> > created without a prior mq_open() which creates the vfsmount used to
> > access the superblock from mq_clear_sbinfo().
> > 
> > Keep a direct pointer to the superblock used by the inodes so we can
> > correctly reset the reference to the IPC namespace being destroyed.
> > 
> > Bug introduced with 9c583773d03633 ("ipc, mqueue: lazy call
> > kern_mount_data in new namespaces")
> 
> And just what will happen in the same scenario if you mount the damn
> thing in userland without ever calling mq_open(), touch a file there,
> then unmount and then leave the ipc namespace?

FWIW, the real solution would be to have userland mounts trigger the creation
of internal one, same as mq_open() would.  Something along these lines
(completely untested, on top of vfs.git#for-next).  Care to give it some
beating?

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 10b82338415b..30327e201571 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -343,18 +343,46 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
 	return 0;
 }
 
+static struct file_system_type mqueue_fs_type;
+/*
+ * Return value is pinned only by reference in ->mq_mnt; it will
+ * live until ipcns dies.  Caller does not need to drop it.
+ */
+static struct vfsmount *mq_internal_mount(void)
+{
+	struct ipc_namespace *ns = current->nsproxy->ipc_ns;
+	struct vfsmount *m = ns->mq_mnt;
+	if (m)
+		return m;
+	m = kern_mount_data(&mqueue_fs_type, ns);
+	spin_lock(&mq_lock);
+	if (unlikely(ns->mq_mnt)) {
+		spin_unlock(&mq_lock);
+		if (!IS_ERR(m))
+			kern_unmount(m);
+			return ns->mq_mnt;
+	}
+	if (!IS_ERR(m))
+		ns->mq_mnt = m;
+	spin_unlock(&mq_lock);
+	return m;
+}
+
 static struct dentry *mqueue_mount(struct file_system_type *fs_type,
 			 int flags, const char *dev_name,
 			 void *data)
 {
-	struct ipc_namespace *ns;
-	if (flags & MS_KERNMOUNT) {
-		ns = data;
-		data = NULL;
-	} else {
-		ns = current->nsproxy->ipc_ns;
-	}
-	return mount_ns(fs_type, flags, data, ns, ns->user_ns, mqueue_fill_super);
+	struct ipc_namespace *ns = data;
+	struct vfsmount *m;
+	if (flags & MS_KERNMOUNT)
+		return mount_ns(fs_type, flags, NULL, ns, ns->user_ns,
+				mqueue_fill_super);
+	m = mq_internal_mount();
+	if (IS_ERR(m))
+		return ERR_CAST(m);
+	atomic_inc(&m->mnt_sb->s_active);
+	down_write(&m->mnt_sb->s_umount);
+	return dget(m->mnt_root);
 }
 
 static void init_once(void *foo)
@@ -743,13 +771,16 @@ static int prepare_open(struct dentry *dentry, int oflag, int ro,
 static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
 		      struct mq_attr *attr)
 {
-	struct vfsmount *mnt = current->nsproxy->ipc_ns->mq_mnt;
-	struct dentry *root = mnt->mnt_root;
+	struct vfsmount *mnt = mq_internal_mount();
+	struct dentry *root;
 	struct filename *name;
 	struct path path;
 	int fd, error;
 	int ro;
 
+	if (IS_ERR(mnt))
+		return PTR_ERR(mnt);
+
 	audit_mq_open(oflag, mode, attr);
 
 	if (IS_ERR(name = getname(u_name)))
@@ -760,6 +791,7 @@ static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
 		goto out_putname;
 
 	ro = mnt_want_write(mnt);	/* we'll drop it in any case */
+	root = mnt->mnt_root;
 	inode_lock(d_inode(root));
 	path.dentry = lookup_one_len(name->name, root, strlen(name->name));
 	if (IS_ERR(path.dentry)) {
@@ -1535,27 +1567,24 @@ int mq_init_ns(struct ipc_namespace *ns)
 	ns->mq_msg_default   = DFLT_MSG;
 	ns->mq_msgsize_default  = DFLT_MSGSIZE;
 
-	ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns);
-	if (IS_ERR(ns->mq_mnt)) {
-		int err = PTR_ERR(ns->mq_mnt);
-		ns->mq_mnt = NULL;
-		return err;
-	}
 	return 0;
 }
 
 void mq_clear_sbinfo(struct ipc_namespace *ns)
 {
-	ns->mq_mnt->mnt_sb->s_fs_info = NULL;
+	if (ns->mq_mnt)
+		ns->mq_mnt->mnt_sb->s_fs_info = NULL;
 }
 
 void mq_put_mnt(struct ipc_namespace *ns)
 {
-	kern_unmount(ns->mq_mnt);
+	if (ns->mq_mnt)
+		kern_unmount(ns->mq_mnt);
 }
 
 static int __init init_mqueue_fs(void)
 {
+	struct vfsmount *m;
 	int error;
 
 	mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
@@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void)
 	if (error)
 		goto out_filesystem;
 
+	m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns);
+	if (IS_ERR(m))
+		goto out_filesystem;
+	init_ipc_ns.mq_mnt = m;
 	return 0;
 
 out_filesystem:

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free
  2017-12-19 15:32   ` Al Viro
@ 2017-12-19 15:44     ` Al Viro
  2017-12-19 16:31       ` Dmitry Vyukov
  2017-12-19 16:59     ` Giuseppe Scrivano
  1 sibling, 1 reply; 13+ messages in thread
From: Al Viro @ 2017-12-19 15:44 UTC (permalink / raw)
  To: Giuseppe Scrivano
  Cc: Andrew Morton, LKML, alexander.deucher, broonie, chris,
	David Miller, deepa.kernel, Greg KH, luc.vanoostenryck,
	lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs,
	Vladislav Yasevich

On Tue, Dec 19, 2017 at 03:32:25PM +0000, Al Viro wrote:
> +	m = mq_internal_mount();
> +	if (IS_ERR(m))
> +		return ERR_CAST(m);
> +	atomic_inc(&m->mnt_sb->s_active);
> +	down_write(&m->mnt_sb->s_umount);
> +	return dget(m->mnt_root);

Note: this is stripped down mount_subtree(m, ""), of course;
it might make sense to recognize that case and bypass the
create_mnt_ns/vfs_path_lookup/put_mnt_ns business in
mount_subtree() when the relative pathname is empty, replacing
it with path.mnt = mntget(mnt); path.dentry = dget(mnt->mnt_root);
in such case.  That'd allow to simply call mount_subtree() here.
It would work as-is, but it's ridiculously heavy for such use
right now.

>  static int __init init_mqueue_fs(void)
>  {
> +	struct vfsmount *m;
>  	int error;
>  
>  	mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
> @@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void)
>  	if (error)
>  		goto out_filesystem;
>  
> +	m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns);
> +	if (IS_ERR(m))
> +		goto out_filesystem;
> +	init_ipc_ns.mq_mnt = m;
>  	return 0;
>  
>  out_filesystem:

Unrelated issue, but register_filesystem() should be the last thing
module_init() of a filesystem driver does.  It's a separate story,
in any case...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free
  2017-12-19 15:44     ` Al Viro
@ 2017-12-19 16:31       ` Dmitry Vyukov
  2017-12-19 17:02         ` Giuseppe Scrivano
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Vyukov @ 2017-12-19 16:31 UTC (permalink / raw)
  To: Al Viro
  Cc: Giuseppe Scrivano, Andrew Morton, LKML, alexander.deucher,
	Mark Brown, Chris Wilson, David Miller, deepa.kernel, Greg KH,
	luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman,
	syzkaller-bugs, Vladislav Yasevich

On Tue, Dec 19, 2017 at 4:44 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Dec 19, 2017 at 03:32:25PM +0000, Al Viro wrote:
>> +     m = mq_internal_mount();
>> +     if (IS_ERR(m))
>> +             return ERR_CAST(m);
>> +     atomic_inc(&m->mnt_sb->s_active);
>> +     down_write(&m->mnt_sb->s_umount);
>> +     return dget(m->mnt_root);
>
> Note: this is stripped down mount_subtree(m, ""), of course;
> it might make sense to recognize that case and bypass the
> create_mnt_ns/vfs_path_lookup/put_mnt_ns business in
> mount_subtree() when the relative pathname is empty, replacing
> it with path.mnt = mntget(mnt); path.dentry = dget(mnt->mnt_root);
> in such case.  That'd allow to simply call mount_subtree() here.
> It would work as-is, but it's ridiculously heavy for such use
> right now.
>
>>  static int __init init_mqueue_fs(void)
>>  {
>> +     struct vfsmount *m;
>>       int error;
>>
>>       mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
>> @@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void)
>>       if (error)
>>               goto out_filesystem;
>>
>> +     m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns);
>> +     if (IS_ERR(m))
>> +             goto out_filesystem;
>> +     init_ipc_ns.mq_mnt = m;
>>       return 0;
>>
>>  out_filesystem:
>
> Unrelated issue, but register_filesystem() should be the last thing
> module_init() of a filesystem driver does.  It's a separate story,
> in any case...

Giuseppe, what report is this?
If there is a reproducer, you can ask syzbot to test a patch.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free
  2017-12-19 15:32   ` Al Viro
  2017-12-19 15:44     ` Al Viro
@ 2017-12-19 16:59     ` Giuseppe Scrivano
  2017-12-19 18:40       ` Giuseppe Scrivano
  1 sibling, 1 reply; 13+ messages in thread
From: Giuseppe Scrivano @ 2017-12-19 16:59 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, LKML, alexander.deucher, broonie, chris,
	David Miller, deepa.kernel, Greg KH, luc.vanoostenryck,
	lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs,
	Vladislav Yasevich

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Tue, Dec 19, 2017 at 11:48:19AM +0000, Al Viro wrote:
>> On Tue, Dec 19, 2017 at 11:14:40AM +0100, Giuseppe Scrivano wrote:
>> > mqueue_evict_inode() doesn't access the ipc namespace if it was
>> > already freed.  It can happen if in a new IPC namespace the inode was
>> > created without a prior mq_open() which creates the vfsmount used to
>> > access the superblock from mq_clear_sbinfo().
>> > 
>> > Keep a direct pointer to the superblock used by the inodes so we can
>> > correctly reset the reference to the IPC namespace being destroyed.
>> > 
>> > Bug introduced with 9c583773d03633 ("ipc, mqueue: lazy call
>> > kern_mount_data in new namespaces")
>> 
>> And just what will happen in the same scenario if you mount the damn
>> thing in userland without ever calling mq_open(), touch a file there,
>> then unmount and then leave the ipc namespace?
>
> FWIW, the real solution would be to have userland mounts trigger the creation
> of internal one, same as mq_open() would.  Something along these lines
> (completely untested, on top of vfs.git#for-next).  Care to give it some
> beating?

thanks for the patch.  It seems to work after this minor fixup on top of
it:

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 30327e201571..636989a44fae 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -360,7 +360,7 @@ static struct vfsmount *mq_internal_mount(void)
 		spin_unlock(&mq_lock);
 		if (!IS_ERR(m))
 			kern_unmount(m);
-			return ns->mq_mnt;
+		return ns->mq_mnt;
 	}
 	if (!IS_ERR(m))
 		ns->mq_mnt = m;
@@ -1560,6 +1560,7 @@ static struct file_system_type mqueue_fs_type = {
 
 int mq_init_ns(struct ipc_namespace *ns)
 {
+	ns->mq_mnt           = NULL;
 	ns->mq_queues_count  = 0;
 	ns->mq_queues_max    = DFLT_QUEUESMAX;
 	ns->mq_msg_max       = DFLT_MSGMAX;


The only issue I've seen with my version is that if I do:

# unshare -im /bin/sh
# mount -t mqueue mqueue /dev/mqueue
# touch /dev/mqueue/foo
# umount /dev/mqueue
# mount -t mqueue mqueue /dev/mqueue

then /dev/mqueue/foo doesn't exist at this point.  Your patch does not
have this problem and /dev/mqueue/foo is again accessible after the
second mount.

Regards,
Giuseppe

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free
  2017-12-19 16:31       ` Dmitry Vyukov
@ 2017-12-19 17:02         ` Giuseppe Scrivano
  0 siblings, 0 replies; 13+ messages in thread
From: Giuseppe Scrivano @ 2017-12-19 17:02 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Al Viro, Andrew Morton, LKML, alexander.deucher, Mark Brown,
	Chris Wilson, David Miller, deepa.kernel, Greg KH,
	luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman,
	syzkaller-bugs, Vladislav Yasevich

Dmitry Vyukov <dvyukov@google.com> writes:

>> Unrelated issue, but register_filesystem() should be the last thing
>> module_init() of a filesystem driver does.  It's a separate story,
>> in any case...
>
> Giuseppe, what report is this?
> If there is a reproducer, you can ask syzbot to test a patch.

I have tried locally the reproducer and the issue seems fixed both in
Al's patch and in my version.

In any case, the original issue was:

https://groups.google.com/forum/#!msg/syzkaller-bugs/1XBaqnPSXzs/VF-eCSPuCQAJ

Thanks,
Giuseppe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free
  2017-12-19 16:59     ` Giuseppe Scrivano
@ 2017-12-19 18:40       ` Giuseppe Scrivano
  2017-12-19 20:14         ` Al Viro
  0 siblings, 1 reply; 13+ messages in thread
From: Giuseppe Scrivano @ 2017-12-19 18:40 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, LKML, alexander.deucher, broonie, chris,
	David Miller, deepa.kernel, Greg KH, luc.vanoostenryck,
	lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs,
	Vladislav Yasevich

Giuseppe Scrivano <gscrivan@redhat.com> writes:

> The only issue I've seen with my version is that if I do:
>
> # unshare -im /bin/sh
> # mount -t mqueue mqueue /dev/mqueue
> # touch /dev/mqueue/foo
> # umount /dev/mqueue
> # mount -t mqueue mqueue /dev/mqueue
>
> then /dev/mqueue/foo doesn't exist at this point.  Your patch does not
> have this problem and /dev/mqueue/foo is again accessible after the
> second mount.

although, how much is that of an issue?  Is there any other way to delay
the cost of kern_mount_data()?  Most containers have /dev/mqueue mounted
but it is not really going to be used.

Would it be possible somehow to postpone it until the first inode is
created?

Thanks,
Giuseppe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free
  2017-12-19 18:40       ` Giuseppe Scrivano
@ 2017-12-19 20:14         ` Al Viro
  2017-12-19 21:49           ` Eric W. Biederman
  2017-12-21 19:19           ` Giuseppe Scrivano
  0 siblings, 2 replies; 13+ messages in thread
From: Al Viro @ 2017-12-19 20:14 UTC (permalink / raw)
  To: Giuseppe Scrivano
  Cc: Andrew Morton, LKML, alexander.deucher, broonie, chris,
	David Miller, deepa.kernel, Greg KH, luc.vanoostenryck,
	lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs,
	Vladislav Yasevich

On Tue, Dec 19, 2017 at 07:40:43PM +0100, Giuseppe Scrivano wrote:
> Giuseppe Scrivano <gscrivan@redhat.com> writes:
> 
> > The only issue I've seen with my version is that if I do:
> >
> > # unshare -im /bin/sh
> > # mount -t mqueue mqueue /dev/mqueue
> > # touch /dev/mqueue/foo
> > # umount /dev/mqueue
> > # mount -t mqueue mqueue /dev/mqueue
> >
> > then /dev/mqueue/foo doesn't exist at this point.  Your patch does not
> > have this problem and /dev/mqueue/foo is again accessible after the
> > second mount.
> 
> although, how much is that of an issue?  Is there any other way to delay

You do realize that with your patch you would end up with worse than that -
mount/touch/umount and now you've got ->mq_sb non-NULL and pointing to freed
super_block.  With really unpleasant effects when you quit that ipcns...

> the cost of kern_mount_data()?  Most containers have /dev/mqueue mounted
> but it is not really going to be used.

_What_ cost?  At mount(2) time you are setting the superblock up anyway, so
what would you be delaying?  kmem_cache_alloc() for struct mount and assignments
to its fields?  That's noise; if anything, I would expect the main cost with
a plenty of containers to be in sget() scanning the list of mqueue superblocks.
And we can get rid of that, while we are at it - to hell with mount_ns(), with
that approach we can just use mount_nodev() instead.  The logics in
mq_internal_mount() will deal with multiple instances - if somebody has already
triggered creation of internal mount, all subsequent calls in that ipcns will
end up avoiding kern_mount_data() entirely.  And if you have two callers
racing - sure, you will get two superblocks.  Not for long, though - the first
one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses
and prompty destroys his vfsmount and superblock.  I seriously suspect that
variant below would cut down on the cost a whole lot more - as it is, we have
the total of O(N^2) spent in the loop inside of sget_userns() when we create
N ipcns and mount in each of those; this patch should cut that to O(N)...

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 10b82338415b..e9870b0cda29 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -325,8 +325,9 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
 static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
 {
 	struct inode *inode;
-	struct ipc_namespace *ns = sb->s_fs_info;
+	struct ipc_namespace *ns = data;
 
+	sb->s_fs_info = ns;
 	sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
 	sb->s_blocksize = PAGE_SIZE;
 	sb->s_blocksize_bits = PAGE_SHIFT;
@@ -343,18 +344,44 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
 	return 0;
 }
 
+static struct file_system_type mqueue_fs_type;
+/*
+ * Return value is pinned only by reference in ->mq_mnt; it will
+ * live until ipcns dies.  Caller does not need to drop it.
+ */
+static struct vfsmount *mq_internal_mount(void)
+{
+	struct ipc_namespace *ns = current->nsproxy->ipc_ns;
+	struct vfsmount *m = ns->mq_mnt;
+	if (m)
+		return m;
+	m = kern_mount_data(&mqueue_fs_type, ns);
+	spin_lock(&mq_lock);
+	if (unlikely(ns->mq_mnt)) {
+		spin_unlock(&mq_lock);
+		if (!IS_ERR(m))
+			kern_unmount(m);
+		return ns->mq_mnt;
+	}
+	if (!IS_ERR(m))
+		ns->mq_mnt = m;
+	spin_unlock(&mq_lock);
+	return m;
+}
+
 static struct dentry *mqueue_mount(struct file_system_type *fs_type,
 			 int flags, const char *dev_name,
 			 void *data)
 {
-	struct ipc_namespace *ns;
-	if (flags & MS_KERNMOUNT) {
-		ns = data;
-		data = NULL;
-	} else {
-		ns = current->nsproxy->ipc_ns;
-	}
-	return mount_ns(fs_type, flags, data, ns, ns->user_ns, mqueue_fill_super);
+	struct vfsmount *m;
+	if (flags & MS_KERNMOUNT)
+		return mount_nodev(fs_type, flags, data, mqueue_fill_super);
+	m = mq_internal_mount();
+	if (IS_ERR(m))
+		return ERR_CAST(m);
+	atomic_inc(&m->mnt_sb->s_active);
+	down_write(&m->mnt_sb->s_umount);
+	return dget(m->mnt_root);
 }
 
 static void init_once(void *foo)
@@ -743,13 +770,16 @@ static int prepare_open(struct dentry *dentry, int oflag, int ro,
 static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
 		      struct mq_attr *attr)
 {
-	struct vfsmount *mnt = current->nsproxy->ipc_ns->mq_mnt;
-	struct dentry *root = mnt->mnt_root;
+	struct vfsmount *mnt = mq_internal_mount();
+	struct dentry *root;
 	struct filename *name;
 	struct path path;
 	int fd, error;
 	int ro;
 
+	if (IS_ERR(mnt))
+		return PTR_ERR(mnt);
+
 	audit_mq_open(oflag, mode, attr);
 
 	if (IS_ERR(name = getname(u_name)))
@@ -760,6 +790,7 @@ static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
 		goto out_putname;
 
 	ro = mnt_want_write(mnt);	/* we'll drop it in any case */
+	root = mnt->mnt_root;
 	inode_lock(d_inode(root));
 	path.dentry = lookup_one_len(name->name, root, strlen(name->name));
 	if (IS_ERR(path.dentry)) {
@@ -1534,28 +1565,26 @@ int mq_init_ns(struct ipc_namespace *ns)
 	ns->mq_msgsize_max   = DFLT_MSGSIZEMAX;
 	ns->mq_msg_default   = DFLT_MSG;
 	ns->mq_msgsize_default  = DFLT_MSGSIZE;
+	ns->mq_mnt = NULL;
 
-	ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns);
-	if (IS_ERR(ns->mq_mnt)) {
-		int err = PTR_ERR(ns->mq_mnt);
-		ns->mq_mnt = NULL;
-		return err;
-	}
 	return 0;
 }
 
 void mq_clear_sbinfo(struct ipc_namespace *ns)
 {
-	ns->mq_mnt->mnt_sb->s_fs_info = NULL;
+	if (ns->mq_mnt)
+		ns->mq_mnt->mnt_sb->s_fs_info = NULL;
 }
 
 void mq_put_mnt(struct ipc_namespace *ns)
 {
-	kern_unmount(ns->mq_mnt);
+	if (ns->mq_mnt)
+		kern_unmount(ns->mq_mnt);
 }
 
 static int __init init_mqueue_fs(void)
 {
+	struct vfsmount *m;
 	int error;
 
 	mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
@@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void)
 	if (error)
 		goto out_filesystem;
 
+	m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns);
+	if (IS_ERR(m))
+		goto out_filesystem;
+	init_ipc_ns.mq_mnt = m;
 	return 0;
 
 out_filesystem:

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free
  2017-12-19 20:14         ` Al Viro
@ 2017-12-19 21:49           ` Eric W. Biederman
  2017-12-19 22:40             ` Al Viro
  2017-12-21 19:19           ` Giuseppe Scrivano
  1 sibling, 1 reply; 13+ messages in thread
From: Eric W. Biederman @ 2017-12-19 21:49 UTC (permalink / raw)
  To: Al Viro
  Cc: Giuseppe Scrivano, Andrew Morton, LKML, alexander.deucher,
	broonie, chris, David Miller, deepa.kernel, Greg KH,
	luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman,
	syzkaller-bugs, Vladislav Yasevich

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Tue, Dec 19, 2017 at 07:40:43PM +0100, Giuseppe Scrivano wrote:
>> Giuseppe Scrivano <gscrivan@redhat.com> writes:
>> 
>> > The only issue I've seen with my version is that if I do:
>> >
>> > # unshare -im /bin/sh
>> > # mount -t mqueue mqueue /dev/mqueue
>> > # touch /dev/mqueue/foo
>> > # umount /dev/mqueue
>> > # mount -t mqueue mqueue /dev/mqueue
>> >
>> > then /dev/mqueue/foo doesn't exist at this point.  Your patch does not
>> > have this problem and /dev/mqueue/foo is again accessible after the
>> > second mount.
>> 
>> although, how much is that of an issue?  Is there any other way to delay
>
> You do realize that with your patch you would end up with worse than that -
> mount/touch/umount and now you've got ->mq_sb non-NULL and pointing to freed
> super_block.  With really unpleasant effects when you quit that ipcns...
>
>> the cost of kern_mount_data()?  Most containers have /dev/mqueue mounted
>> but it is not really going to be used.
>
> _What_ cost?  At mount(2) time you are setting the superblock up anyway, so
> what would you be delaying?  kmem_cache_alloc() for struct mount and assignments
> to its fields?  That's noise; if anything, I would expect the main cost with
> a plenty of containers to be in sget() scanning the list of mqueue superblocks.
> And we can get rid of that, while we are at it - to hell with mount_ns(), with
> that approach we can just use mount_nodev() instead.  The logics in
> mq_internal_mount() will deal with multiple instances - if somebody has already
> triggered creation of internal mount, all subsequent calls in that ipcns will
> end up avoiding kern_mount_data() entirely.  And if you have two callers
> racing - sure, you will get two superblocks.  Not for long, though - the first
> one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses
> and prompty destroys his vfsmount and superblock.  I seriously suspect that
> variant below would cut down on the cost a whole lot more - as it is, we have
> the total of O(N^2) spent in the loop inside of sget_userns() when we create
> N ipcns and mount in each of those; this patch should cut that to
> O(N)...

If that is where the cost is, is there any point in delaying creating
the internal mount at all?

Eric

>
> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
> index 10b82338415b..e9870b0cda29 100644
> --- a/ipc/mqueue.c
> +++ b/ipc/mqueue.c
> @@ -325,8 +325,9 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
>  static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
>  {
>  	struct inode *inode;
> -	struct ipc_namespace *ns = sb->s_fs_info;
> +	struct ipc_namespace *ns = data;
>  
> +	sb->s_fs_info = ns;
>  	sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
>  	sb->s_blocksize = PAGE_SIZE;
>  	sb->s_blocksize_bits = PAGE_SHIFT;
> @@ -343,18 +344,44 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
>  	return 0;
>  }
>  
> +static struct file_system_type mqueue_fs_type;
> +/*
> + * Return value is pinned only by reference in ->mq_mnt; it will
> + * live until ipcns dies.  Caller does not need to drop it.
> + */
> +static struct vfsmount *mq_internal_mount(void)
> +{
> +	struct ipc_namespace *ns = current->nsproxy->ipc_ns;
> +	struct vfsmount *m = ns->mq_mnt;
> +	if (m)
> +		return m;
> +	m = kern_mount_data(&mqueue_fs_type, ns);
> +	spin_lock(&mq_lock);
> +	if (unlikely(ns->mq_mnt)) {
> +		spin_unlock(&mq_lock);
> +		if (!IS_ERR(m))
> +			kern_unmount(m);
> +		return ns->mq_mnt;
> +	}
> +	if (!IS_ERR(m))
> +		ns->mq_mnt = m;
> +	spin_unlock(&mq_lock);
> +	return m;
> +}
> +
>  static struct dentry *mqueue_mount(struct file_system_type *fs_type,
>  			 int flags, const char *dev_name,
>  			 void *data)
>  {
> -	struct ipc_namespace *ns;
> -	if (flags & MS_KERNMOUNT) {
> -		ns = data;
> -		data = NULL;
> -	} else {
> -		ns = current->nsproxy->ipc_ns;
> -	}
> -	return mount_ns(fs_type, flags, data, ns, ns->user_ns, mqueue_fill_super);
> +	struct vfsmount *m;
> +	if (flags & MS_KERNMOUNT)
> +		return mount_nodev(fs_type, flags, data, mqueue_fill_super);
> +	m = mq_internal_mount();
> +	if (IS_ERR(m))
> +		return ERR_CAST(m);
> +	atomic_inc(&m->mnt_sb->s_active);
> +	down_write(&m->mnt_sb->s_umount);
> +	return dget(m->mnt_root);
>  }
>  
>  static void init_once(void *foo)
> @@ -743,13 +770,16 @@ static int prepare_open(struct dentry *dentry, int oflag, int ro,
>  static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
>  		      struct mq_attr *attr)
>  {
> -	struct vfsmount *mnt = current->nsproxy->ipc_ns->mq_mnt;
> -	struct dentry *root = mnt->mnt_root;
> +	struct vfsmount *mnt = mq_internal_mount();
> +	struct dentry *root;
>  	struct filename *name;
>  	struct path path;
>  	int fd, error;
>  	int ro;
>  
> +	if (IS_ERR(mnt))
> +		return PTR_ERR(mnt);
> +
>  	audit_mq_open(oflag, mode, attr);
>  
>  	if (IS_ERR(name = getname(u_name)))
> @@ -760,6 +790,7 @@ static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
>  		goto out_putname;
>  
>  	ro = mnt_want_write(mnt);	/* we'll drop it in any case */
> +	root = mnt->mnt_root;
>  	inode_lock(d_inode(root));
>  	path.dentry = lookup_one_len(name->name, root, strlen(name->name));
>  	if (IS_ERR(path.dentry)) {
> @@ -1534,28 +1565,26 @@ int mq_init_ns(struct ipc_namespace *ns)
>  	ns->mq_msgsize_max   = DFLT_MSGSIZEMAX;
>  	ns->mq_msg_default   = DFLT_MSG;
>  	ns->mq_msgsize_default  = DFLT_MSGSIZE;
> +	ns->mq_mnt = NULL;
>  
> -	ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns);
> -	if (IS_ERR(ns->mq_mnt)) {
> -		int err = PTR_ERR(ns->mq_mnt);
> -		ns->mq_mnt = NULL;
> -		return err;
> -	}
>  	return 0;
>  }
>  
>  void mq_clear_sbinfo(struct ipc_namespace *ns)
>  {
> -	ns->mq_mnt->mnt_sb->s_fs_info = NULL;
> +	if (ns->mq_mnt)
> +		ns->mq_mnt->mnt_sb->s_fs_info = NULL;
>  }
>  
>  void mq_put_mnt(struct ipc_namespace *ns)
>  {
> -	kern_unmount(ns->mq_mnt);
> +	if (ns->mq_mnt)
> +		kern_unmount(ns->mq_mnt);
>  }
>  
>  static int __init init_mqueue_fs(void)
>  {
> +	struct vfsmount *m;
>  	int error;
>  
>  	mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
> @@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void)
>  	if (error)
>  		goto out_filesystem;
>  
> +	m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns);
> +	if (IS_ERR(m))
> +		goto out_filesystem;
> +	init_ipc_ns.mq_mnt = m;
>  	return 0;
>  
>  out_filesystem:

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free
  2017-12-19 21:49           ` Eric W. Biederman
@ 2017-12-19 22:40             ` Al Viro
  2017-12-19 23:36               ` Eric W. Biederman
  0 siblings, 1 reply; 13+ messages in thread
From: Al Viro @ 2017-12-19 22:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Giuseppe Scrivano, Andrew Morton, LKML, alexander.deucher,
	broonie, chris, David Miller, deepa.kernel, Greg KH,
	luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman,
	syzkaller-bugs, Vladislav Yasevich

On Tue, Dec 19, 2017 at 03:49:24PM -0600, Eric W. Biederman wrote:
> > what would you be delaying?  kmem_cache_alloc() for struct mount and assignments
> > to its fields?  That's noise; if anything, I would expect the main cost with
> > a plenty of containers to be in sget() scanning the list of mqueue superblocks.
> > And we can get rid of that, while we are at it - to hell with mount_ns(), with
> > that approach we can just use mount_nodev() instead.  The logics in
> > mq_internal_mount() will deal with multiple instances - if somebody has already
> > triggered creation of internal mount, all subsequent calls in that ipcns will
> > end up avoiding kern_mount_data() entirely.  And if you have two callers
> > racing - sure, you will get two superblocks.  Not for long, though - the first
> > one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses
> > and prompty destroys his vfsmount and superblock.  I seriously suspect that
> > variant below would cut down on the cost a whole lot more - as it is, we have
> > the total of O(N^2) spent in the loop inside of sget_userns() when we create
> > N ipcns and mount in each of those; this patch should cut that to
> > O(N)...
> 
> If that is where the cost is, is there any point in delaying creating
> the internal mount at all?

We won't know without the profiles...  Incidentally, is there any point in
using mount_ns() for procfs?  Similar scheme (with ->proc_mnt instead of
->mq_mnt, of course) would live with mount_nodev() just fine, and it's
definitely less costly - we don't bother with the loop in sget_userns()
at all that way.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free
  2017-12-19 22:40             ` Al Viro
@ 2017-12-19 23:36               ` Eric W. Biederman
  0 siblings, 0 replies; 13+ messages in thread
From: Eric W. Biederman @ 2017-12-19 23:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Giuseppe Scrivano, Andrew Morton, LKML, alexander.deucher,
	broonie, chris, David Miller, deepa.kernel, Greg KH,
	luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman,
	syzkaller-bugs, Vladislav Yasevich

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Tue, Dec 19, 2017 at 03:49:24PM -0600, Eric W. Biederman wrote:
>> > what would you be delaying?  kmem_cache_alloc() for struct mount and assignments
>> > to its fields?  That's noise; if anything, I would expect the main cost with
>> > a plenty of containers to be in sget() scanning the list of mqueue superblocks.
>> > And we can get rid of that, while we are at it - to hell with mount_ns(), with
>> > that approach we can just use mount_nodev() instead.  The logics in
>> > mq_internal_mount() will deal with multiple instances - if somebody has already
>> > triggered creation of internal mount, all subsequent calls in that ipcns will
>> > end up avoiding kern_mount_data() entirely.  And if you have two callers
>> > racing - sure, you will get two superblocks.  Not for long, though - the first
>> > one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses
>> > and prompty destroys his vfsmount and superblock.  I seriously suspect that
>> > variant below would cut down on the cost a whole lot more - as it is, we have
>> > the total of O(N^2) spent in the loop inside of sget_userns() when we create
>> > N ipcns and mount in each of those; this patch should cut that to
>> > O(N)...
>> 
>> If that is where the cost is, is there any point in delaying creating
>> the internal mount at all?
>
> We won't know without the profiles...  Incidentally, is there any point in
> using mount_ns() for procfs?  Similar scheme (with ->proc_mnt instead of
> ->mq_mnt, of course) would live with mount_nodev() just fine, and it's
> definitely less costly - we don't bother with the loop in sget_userns()
> at all that way.

The mechanism of mqueuefs and proc are different for dealing with a
filesystem that continues to be mounted/referenced after the namespace
exists.

Proc actually takes a reference on the pid namespace so it is easier to
work with.  pid_ns_prepare_proc and and pid_ns_release_proc are the
namespace side of that dependency.

So yes we could look at a local cache in the namespace and all
would be well for proc.  I don't know what we would use for locking when
we start allowing more that one path to set it.
atmoic_cmpxchg(&proc_mnt, NULL)?

That makes me suspect we could have a common helper that does the work.

I do know that the reason I moved proc to mount_ns is that it had simply
been open coding that function.

Eric

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free
  2017-12-19 20:14         ` Al Viro
  2017-12-19 21:49           ` Eric W. Biederman
@ 2017-12-21 19:19           ` Giuseppe Scrivano
  1 sibling, 0 replies; 13+ messages in thread
From: Giuseppe Scrivano @ 2017-12-21 19:19 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, LKML, alexander.deucher, broonie, chris,
	David Miller, deepa.kernel, Greg KH, luc.vanoostenryck,
	lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs,
	Vladislav Yasevich

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Tue, Dec 19, 2017 at 07:40:43PM +0100, Giuseppe Scrivano wrote:
>> Giuseppe Scrivano <gscrivan@redhat.com> writes:
>> 
>> > The only issue I've seen with my version is that if I do:
>> >
>> > # unshare -im /bin/sh
>> > # mount -t mqueue mqueue /dev/mqueue
>> > # touch /dev/mqueue/foo
>> > # umount /dev/mqueue
>> > # mount -t mqueue mqueue /dev/mqueue
>> >
>> > then /dev/mqueue/foo doesn't exist at this point.  Your patch does not
>> > have this problem and /dev/mqueue/foo is again accessible after the
>> > second mount.
>> 
>> although, how much is that of an issue?  Is there any other way to delay
>
> You do realize that with your patch you would end up with worse than that -
> mount/touch/umount and now you've got ->mq_sb non-NULL and pointing to freed
> super_block.  With really unpleasant effects when you quit that ipcns...
>
>> the cost of kern_mount_data()?  Most containers have /dev/mqueue mounted
>> but it is not really going to be used.
>
> _What_ cost?  At mount(2) time you are setting the superblock up anyway, so
> what would you be delaying?  kmem_cache_alloc() for struct mount and assignments
> to its fields?  That's noise; if anything, I would expect the main cost with
> a plenty of containers to be in sget() scanning the list of mqueue superblocks.
> And we can get rid of that, while we are at it - to hell with mount_ns(), with
> that approach we can just use mount_nodev() instead.  The logics in
> mq_internal_mount() will deal with multiple instances - if somebody has already
> triggered creation of internal mount, all subsequent calls in that ipcns will
> end up avoiding kern_mount_data() entirely.  And if you have two callers
> racing - sure, you will get two superblocks.  Not for long, though - the first
> one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses
> and prompty destroys his vfsmount and superblock.  I seriously suspect that
> variant below would cut down on the cost a whole lot more - as it is, we have
> the total of O(N^2) spent in the loop inside of sget_userns() when we create
> N ipcns and mount in each of those; this patch should cut that to O(N)...

Thanks for the explanation.  I see what you mean now and how this cost is
inevitable as anyway we need to setup the superblock.

Giuseppe

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-12-21 19:19 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-19 10:14 [PATCH linux-next] mqueue: fix IPC namespace use-after-free Giuseppe Scrivano
2017-12-19 11:48 ` Al Viro
2017-12-19 15:32   ` Al Viro
2017-12-19 15:44     ` Al Viro
2017-12-19 16:31       ` Dmitry Vyukov
2017-12-19 17:02         ` Giuseppe Scrivano
2017-12-19 16:59     ` Giuseppe Scrivano
2017-12-19 18:40       ` Giuseppe Scrivano
2017-12-19 20:14         ` Al Viro
2017-12-19 21:49           ` Eric W. Biederman
2017-12-19 22:40             ` Al Viro
2017-12-19 23:36               ` Eric W. Biederman
2017-12-21 19:19           ` Giuseppe Scrivano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.