* [PATCH] netns: dangling pointer on netns bind mount point.
@ 2020-04-07  2:35 Levi
  2020-04-07  3:05 ` Al Viro
  0 siblings, 1 reply; 9+ messages in thread
From: Levi @ 2020-04-07  2:35 UTC (permalink / raw)
  To: viro, davem, kuba, gnault, nicolas.dichtel, edumazet, lirongqing,
	tglx, johannes.berg, dhowells, daniel, linux-fsdevel,
	linux-kernel, netdev

When we bind mount a network namespace file (e.g. /proc/{pid}/ns/net),
the inode's private data can end up as a dangling pointer to a net
namespace that has already been freed, as in the following case:

1. Fork the process.
2. [PARENT] Wait for the child to finish.
3. [CHILD]  Call unshare to create a new network namespace.
4. [CHILD]  Bind mount /proc/self/ns/net to some mount point.
5. [CHILD]  Exit.
6. [PARENT] Call setns with the bind mount point.

In step 5, the net namespace created by the child process is freed, but
the bind mount point still holds a pointer to it. When the parent then
calls the "setns" system call on the bind mount point, it accesses
freed memory, which causes memory corruption.

This patch fixes the above scenario by increasing the reference count.

Signed-off-by: Levi Yun <ppbuk5246@gmail.com>
---
 fs/namespace.c              | 31 +++++++++++++++++++++++++++++++
 include/net/net_namespace.h |  7 +++++++
 net/core/net_namespace.c    |  5 -----
 3 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index a28e4db075ed..ed0fbb6a1b52 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -31,6 +31,10 @@
 #include <linux/fs_context.h>
 #include <linux/shmem_fs.h>
 
+#ifdef CONFIG_NET_NS
+#include <net/net_namespace.h>
+#endif
+
 #include "pnode.h"
 #include "internal.h"
 
@@ -1013,12 +1017,25 @@ vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
 }
 EXPORT_SYMBOL_GPL(vfs_submount);
 
+#ifdef CONFIG_NET_NS
+static bool is_net_ns_file(struct dentry *dentry)
+{
+	/* Is this a proxy for a network namespace? */
+	return dentry->d_op == &ns_dentry_operations &&
+	       dentry->d_fsdata == &netns_operations;
+}
+#endif
+
 static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 					int flag)
 {
 	struct super_block *sb = old->mnt.mnt_sb;
 	struct mount *mnt;
 	int err;
+#ifdef CONFIG_NET_NS
+	struct ns_common *ns = NULL;
+	struct net *net = NULL;
+#endif
 
 	mnt = alloc_vfsmnt(old->mnt_devname);
 	if (!mnt)
@@ -1035,6 +1052,20 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 		goto out_free;
 	}
 
+#ifdef CONFIG_NET_NS
+	if (!(flag & CL_COPY_MNT_NS_FILE) && is_net_ns_file(root)) {
+		ns = get_proc_ns(d_inode(root));
+		if (ns == NULL || ns->ops->type != CLONE_NEWNET) {
+			err = -EINVAL;
+
+			goto out_free;
+		}
+
+		net = to_net_ns(ns);
+		net = get_net(net);
+	}
+#endif
+
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags;
 	mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL);
 
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index ab96fb59131c..275258d1dbee 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -474,4 +474,11 @@ static inline void fnhe_genid_bump(struct net *net)
 	atomic_inc(&net->fnhe_genid);
 }
 
+#ifdef CONFIG_NET_NS
+static inline struct net *to_net_ns(struct ns_common *ns)
+{
+	return container_of(ns, struct net, ns);
+}
+#endif
+
 #endif /* __NET_NET_NAMESPACE_H */

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 190ca66a383b..3a6d9155806f 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -1343,11 +1343,6 @@ static struct ns_common *netns_get(struct task_struct *task)
 	return net ? &net->ns : NULL;
 }
 
-static inline struct net *to_net_ns(struct ns_common *ns)
-{
-	return container_of(ns, struct net, ns);
-}
-
 static void netns_put(struct ns_common *ns)
 {
 	put_net(to_net_ns(ns));
-- 
2.26.0

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [PATCH] netns: dangling pointer on netns bind mount point.
  2020-04-07  2:35 [PATCH] netns: dangling pointer on netns bind mount point Levi
@ 2020-04-07  3:05 ` Al Viro
  2020-04-07  3:13   ` Al Viro
  0 siblings, 1 reply; 9+ messages in thread
From: Al Viro @ 2020-04-07  3:05 UTC (permalink / raw)
  To: Levi
  Cc: davem, kuba, gnault, nicolas.dichtel, edumazet, lirongqing, tglx,
	johannes.berg, dhowells, daniel, linux-fsdevel, linux-kernel, netdev

On Tue, Apr 07, 2020 at 11:35:46AM +0900, Levi wrote:
> When we try to bind mount on network namespace (ex) /proc/{pid}/ns/net,
> inode's private data can have dangling pointer to net_namespace that was
> already freed in below case.
>
> 1. Forking the process.
> 2. [PARENT] Waiting the Child till the end.
> 3. [CHILD] call unshare for creating new network namespace
> 4. [CHILD] Bind mount with /proc/self/ns/net to some mount point.
> 5. [CHILD] Exit child.
> 6. [PARENT] Try to setns with binded mount point
>
> In step 5, net_namespace made by child process'll be freed,
> But in bind mount point, it still held the pointer to net_namespace made
> by child process.
> In this situation, when parent try to call "setns" systemcall with the
> bind mount point, parent process try to access to freed memory, That
> makes memory corruption.
>
> This patch fix the above scenario by increaseing reference count.

This can't be the right fix.

> +#ifdef CONFIG_NET_NS
> +	if (!(flag & CL_COPY_MNT_NS_FILE) && is_net_ns_file(root)) {
> +		ns = get_proc_ns(d_inode(root));
> +		if (ns == NULL || ns->ops->type != CLONE_NEWNET) {
> +			err = -EINVAL;
> +
> +			goto out_free;
> +		}
> +
> +		net = to_net_ns(ns);
> +		net = get_net(net);

No.  This is completely wrong.  You have each struct mount pointing to
that sucker to grab an extra reference on an object; you do *NOT* drop
said reference when struct mount is destroyed.  You are papering over a
dangling pointer of some sort by introducing a trivially exploitable
leak that happens to hit your scenario.

Hell, do (your step 4 + umount your mountpoint) in a loop, then watch
what happens to refcounts with that patch.

This is bollocks; the reference is *NOT* in struct mount.  At all.
It's not even in struct dentry.  What it's attached to is struct inode
and it should be pinned as long as that inode is alive - it's dropped
in nsfs_evict().  And if _that_ gets called while dentry is still
pinned (as ->mnt_root of something), you have much worse problems.

Could you post a reproducer, preferably one that would trigger an oops
on mainline?
* Re: [PATCH] netns: dangling pointer on netns bind mount point.
  2020-04-07  3:05 ` Al Viro
@ 2020-04-07  3:13   ` Al Viro
  2020-04-07  3:29     ` Yun Levi
  0 siblings, 1 reply; 9+ messages in thread
From: Al Viro @ 2020-04-07  3:13 UTC (permalink / raw)
  To: Levi
  Cc: davem, kuba, gnault, nicolas.dichtel, edumazet, lirongqing, tglx,
	johannes.berg, dhowells, daniel, linux-fsdevel, linux-kernel, netdev

On Tue, Apr 07, 2020 at 04:05:04AM +0100, Al Viro wrote:

> Could you post a reproducer, preferably one that would trigger an oops
> on mainline?

BTW, just to make sure - are we talking about analysis of existing
oops, or is it "never seen it happen, but looks like it should be
triggerable" kind of situation?
* Re: [PATCH] netns: dangling pointer on netns bind mount point.
  2020-04-07  3:13 ` Al Viro
@ 2020-04-07  3:29   ` Yun Levi
  2020-04-07  4:03     ` Al Viro
  0 siblings, 1 reply; 9+ messages in thread
From: Yun Levi @ 2020-04-07  3:29 UTC (permalink / raw)
  To: Al Viro
  Cc: David S. Miller, Jakub Kicinski, Guillaume Nault, Nicolas Dichtel,
	Eric Dumazet, Li RongQing, Thomas Gleixner, Johannes Berg,
	David Howells, daniel, linux-fsdevel, linux-kernel, netdev

Actually, I've confirmed this kernel panic on 4.19, but I haven't been
able to check mainline yet. Below is one of the panic logs I've
experienced:

Internal error: Oops: 96000004 [#1] SMP
2020-03-14T15:35:44
Modules linked in: hsl_linux_helper(O) saswifmod(O) asrim(O)
 linux_bcm_knet(O) linux_user_bde(O) linux_kernel_bde(O)
 ds_peripheral(O) m_vlog(O) m_scontrol(O)
CPU: 2 PID: 5966 Comm: c.EQMD_ASYNC Tainted: G W O 4.19.26 NOS
Hardware name: LS1046A RDB Board (DT)
pstate: 60000005 (nZCv daif -PAN -UAO)
pc : sock_prot_inuse_add+0x10/0x20
lr : netlink_release+0x30c/0x5a8
sp : ffff0000244c3b20
x29: ffff0000244c3b20 x28: ffff8008774ea000
x27: ffff800865b093d0 x26: ffff800865b09000
x25: ffff800863580c00 x24: ffff00001892f000
x23: ffff80086514ba00 x22: ffff000018acd000
x21: ffff0000189ae000 x20: ffff00001892a000
x19: ffff0000088bfb2c x18: 0000000000000400
x16: 0000000000000000 x15: 0000000000000400
x14: 0000000000000400 x13: 0000000000000000
x12: 0000000000000001 x11: 000000000000a949
x10: 0000000000062d05 x9 : 0000000000000027
x8 : ffff800877120280 x7 : 0000000000000200
x6 : ffff800865b092e8 x5 : ffff0000244c3ae0
x4 : 0000000000000000 x3 : 0000800866e9a000
x2 : 00000000ffffffff x1 : 0000000000000000
x0 : 0000000000000000
Process c.EQMD_ASYNC (pid: 5966, stack limit = 0x0000000088e20a05)
Call trace:
 sock_prot_inuse_add+0x10/0x20
 __sock_release+0x44/0xc0
 sock_close+0x14/0x20
 __fput+0x8c/0x1b8
 ____fput+0xc/0x18
 task_work_run+0xa8/0xd8
 do_exit+0x2e4/0xa50
 do_group_exit+0x34/0xc8
 get_signal+0xd4/0x600
 do_signal+0x174/0x268
 do_notify_resume+0xcc/0x110
 work_pending+0x8/0x10
Code: b940c821 f940c000 d538d083 8b010800 (b8606861)
---[ end trace 0b98c9ccbfd9f6fd ]---
Date/Time : 02-14-0120 06:35:47
Kernel panic - not syncing: Fatal exception in interrupt
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x0,21806000
Memory Limit: none

What I saw is that when I bind mount /proc/{pid}/ns/net (created by a
child process) onto some mount point, the netns refcount is not
increased. And when the child process exits, it frees the netns, but
the bind mount point's inode private data still points to the netns
that the child freed on exit.

Thank you for your review. I believe the problem exists on mainline
too. And sorry for my mistakes.

On Tue, Apr 7, 2020 at 12:13 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Apr 07, 2020 at 04:05:04AM +0100, Al Viro wrote:
>
> > Could you post a reproducer, preferably one that would trigger an oops
> > on mainline?
>
> BTW, just to make sure - are we talking about analysis of existing
> oops, or is it "never seen it happen, but looks like it should be
> triggerable" kind of situation?
* Re: [PATCH] netns: dangling pointer on netns bind mount point.
  2020-04-07  3:29 ` Yun Levi
@ 2020-04-07  4:03   ` Al Viro
  [not found]     ` <CAM7-yPRaQsNgZKjru40nM1N_u8HVLVKmJCAzu20DcPL=jzKjWQ@mail.gmail.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Al Viro @ 2020-04-07  4:03 UTC (permalink / raw)
  To: Yun Levi
  Cc: David S. Miller, Jakub Kicinski, Guillaume Nault, Nicolas Dichtel,
	Eric Dumazet, Li RongQing, Thomas Gleixner, Johannes Berg,
	David Howells, daniel, linux-fsdevel, linux-kernel, netdev

On Tue, Apr 07, 2020 at 12:29:34PM +0900, Yun Levi wrote:

> sock_prot_inuse_add+0x10/0x20
> __sock_release+0x44/0xc0
> sock_close+0x14/0x20
> __fput+0x8c/0x1b8
> ____fput+0xc/0x18
> task_work_run+0xa8/0xd8
> do_exit+0x2e4/0xa50
> do_group_exit+0x34/0xc8
> get_signal+0xd4/0x600
> do_signal+0x174/0x268
> do_notify_resume+0xcc/0x110
> work_pending+0x8/0x10
> Code: b940c821 f940c000 d538d083 8b010800 (b8606861)
> ---[ end trace 0b98c9ccbfd9f6fd ]---
> Date/Time : 02-14-0120 06:35:47
> Kernel panic - not syncing: Fatal exception in interrupt
> SMP: stopping secondary CPUs
> Kernel Offset: disabled
> CPU features: 0x0,21806000
> Memory Limit: none
>
> What I saw is when I try to bind on some mount point to
> /proc/{pid}/ns/net which made by child process, That's doesn't
> increase the netns' refcnt.

Why would it?  Increase of netns refcount should happen when you
follow /proc/*/ns/net and stay for as long as nsfs inode is alive,
not by cloning that mount.  And while we are at it, you don't need
to bind it anywhere in order to call setns() - just open the
sucker and then feed the resulting descriptor to setns(2).  No
mount --bind involved and if child exiting between open() and
setns() would free netns, we would have exact same problem.

> And when the child process's going to exit, it frees the netns But
> Still bind mount point's inode's private data point to netns which was
> freed by child when it exits.

Look: we call ns_get_path(), which calls ns_get_path_cb(), which
calls the callback passed to it (ns_get_path_task(), in this case),
which grabs a reference to ns.  Then we pass that reference to
__ns_get_path().

__ns_get_path() looks for dentry stashed in ns.  If there is one
and it's still alive, we grab a reference to that dentry and drop
the reference to ns - no new inodes had been created, so no new
namespace references have appeared.  Existing inode is pinned
by dentry and dentry is pinned by _dentry_ reference we've got.

If dentry is not there or is already in the middle of being destroyed,
we allocate a new inode, stash our namespace reference into it,
create a dentry referencing that new inode, stash it into namespace
and return that dentry.  Without dropping namespace reference we'd
obtained in ns_get_path_task() - it went into new inode.

If inode or dentry creation fails (out of memory), we drop what
we'd obtained (namespace if inode creation fails, just the inode
if dentry creation fails - namespace reference that went into
inode will be dropped by inode destructor in the latter case) and
return an error.

If somebody else manages to get through the entire thing while
we'd been allocating stuff and we see _their_ dentry already
stashed into namespace, we just drop our dentry (its destructor
will drop inode reference, which will lead to inode destructor,
which will drop namespace reference) and bugger off with -EAGAIN.
The caller (ns_get_path_cb()) will retry the entire thing.

The invariant to be preserved here is "each of those inodes
holds a reference to namespace for as long as the inode exists".
ns_get_path() increments namespace refcount if and only if it
has allocated an nsfs inode.  If it grabs an existing dentry
instead, the namespace refcount remains unchanged.

Anyway, I would really like to see .config and C (or shell)
reproducer.  Ideally - .config that could be run in a KVM.
[parent not found: <CAM7-yPRaQsNgZKjru40nM1N_u8HVLVKmJCAzu20DcPL=jzKjWQ@mail.gmail.com>]
* Fwd: [PATCH] netns: dangling pointer on netns bind mount point.
  [not found] ` <CAM7-yPRaQsNgZKjru40nM1N_u8HVLVKmJCAzu20DcPL=jzKjWQ@mail.gmail.com>
@ 2020-04-07 12:57   ` Yun Levi
  2020-04-07 18:26   ` Al Viro
  1 sibling, 0 replies; 9+ messages in thread
From: Yun Levi @ 2020-04-07 12:57 UTC (permalink / raw)
  To: daniel, linux-fsdevel, linux-kernel, netdev

---------- Forwarded message ---------
From: Yun Levi <ppbuk5246@gmail.com>
Date: Tue, Apr 7, 2020 at 9:53 PM
Subject: Re: [PATCH] netns: dangling pointer on netns bind mount point.
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: David S. Miller <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>,
	Guillaume Nault <gnault@redhat.com>,
	Nicolas Dichtel <nicolas.dichtel@6wind.com>,
	Eric Dumazet <edumazet@google.com>, Li RongQing <lirongqing@baidu.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Johannes Berg <johannes.berg@intel.com>,
	David Howells <dhowells@redhat.com>, <daniel@iogearbox.net>,
	<linux-fsdevel@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<netdev@vger.kernel.org>

BTW, here is my question.

> Look: we call ns_get_path(), which calls ns_get_path_cb(), which
> calls the callback passed to it (ns_get_path_task(), in this case),
> which grabs a reference to ns.  Then we pass that reference to
> __ns_get_path().
>
> __ns_get_path() looks for dentry stashed in ns.  If there is one
> and it's still alive, we grab a reference to that dentry and drop
> the reference to ns - no new inodes had been created, so no new
> namespace references have appeared.  Existing inode is pinned
> by dentry and dentry is pinned by _dentry_ reference we've got.

Actually, ns_get_path() is called from unshare(2), and in that case it
ends up creating a new dentry and inode in __ns_get_path() (because it
creates a new network namespace).

In that case, when I bind mount /proc/self/ns/net, only the dentry
refcount is increased, via do_loopback() -> clone_mnt();
netns_operations->get is not called. That means the reference count of
the previously created network namespace is not increased - only the
reference count of the _dentry_ is.

In that situation, if the child process exits, it frees the net
namespace, and at the same time drops the net namespace's refcount in
exit_task_namespaces(). So it is possible for the bind mount point to
hold a dentry whose inode carries a dangling pointer to the net
namespace in another process.

If the parent, knowing that bind mount point, then calls setns(2), it
sets the net namespace referenced by the inode of the dentry whose
refcount was raised by do_loopback() - a net namespace that was
already freed.

The kernel panic happens NOT when the child is killed, but when the
parent is killed after I call setns(2) with the net namespace made by
the child process.

Thank you for your review, but please forgive me for not being able to
share the .config right now (I am trying to build one for x86). I am
trying to understand your comments; if my explanation is wrong, please
correct me and share your knowledge. Thank you.
* Re: [PATCH] netns: dangling pointer on netns bind mount point.
  [not found] ` <CAM7-yPRaQsNgZKjru40nM1N_u8HVLVKmJCAzu20DcPL=jzKjWQ@mail.gmail.com>
  2020-04-07 12:57   ` Fwd: " Yun Levi
@ 2020-04-07 18:26   ` Al Viro
  2020-04-08  5:59     ` Yun Levi
  1 sibling, 1 reply; 9+ messages in thread
From: Al Viro @ 2020-04-07 18:26 UTC (permalink / raw)
  To: Yun Levi
  Cc: David S. Miller, Jakub Kicinski, Guillaume Nault, Nicolas Dichtel,
	Eric Dumazet, Li RongQing, Thomas Gleixner, Johannes Berg,
	David Howells, daniel, linux-fsdevel, linux-kernel, netdev

On Tue, Apr 07, 2020 at 09:53:29PM +0900, Yun Levi wrote:
> BTW, It's my question.
>
> > Look: we call ns_get_path(), which calls ns_get_path_cb(), which
> > calls the callback passed to it (ns_get_path_task(), in this case),
> > which grabs a reference to ns.  Then we pass that reference to
> > __ns_get_path().
> >
> > __ns_get_path() looks for dentry stashed in ns.  If there is one
> > and it's still alive, we grab a reference to that dentry and drop
> > the reference to ns - no new inodes had been created, so no new
> > namespace references have appeared.  Existing inode is pinned
> > by dentry and dentry is pinned by _dentry_ reference we've got.
>
> actually ns_get_path is called in unshare(2).

Yes, it does.  Via perf_event_namespaces(), which does
	perf_fill_ns_link_info(&ns_link_info[NET_NS_INDEX],
			       task, &netns_operations);
and there we have
	error = ns_get_path(&ns_path, task, ns_ops);
	if (!error) {
		ns_inode = ns_path.dentry->d_inode;
		ns_link_info->dev = new_encode_dev(ns_inode->i_sb->s_dev);
		ns_link_info->ino = ns_inode->i_ino;
		path_put(&ns_path);
	}
See that path_put()?  Dentry reference is dropped by it.

> and it makes new dentry and
> inode in __ns_get_path finally (Cuz it create new network namespace)
>
> In that case, when I mount with --bind option to this proc/self/ns/net, it
> only increase dentry refcount on do_loopback->clone_mnt finally (not call
> netns_operation->get)
> That means it's not increase previous created network namespace reference
> count but only increase reference count of _dentry_
>
> In that situation, If I exit the child process it definitely frees the
> net_namespace previous created at the same time it decrease net_namespace's
> refcnt in exit_task_namespace().
> It means it's possible that bind mount point can hold the dentry with inode
> having net_namespace's dangling pointer in another process.
> In above situation, parent who know that binded mount point calls setns(2)
> then it sets the net_namespace which is refered by the inode of the dentry
> increased by do_loopback.
> That makes set the net_namespace which was freed already.

How?  Netns reference in inode contributes to netns refcount.  And it's held
for as long as the _inode_ has positive refcount - we only drop it from
the inode destructor, *NOT* every time we drop a reference to inode.
In the similar fashion, the inode reference in dentry contributes to inode
refcount.  And again, that inode reference won't be dropped until the _last_
reference to dentry gets dropped.

Incrementing refcount of dentry is enough to pin the inode and thus the
netns the inode refers to.  It's a very common pattern with refcounting;
a useful way of thinking about it is to consider the refcount of e.g.
inode as sum of several components, one of them being "number of struct
dentry instances with ->d_inode pointing to our inode".  And look at e.g.
__ns_get_path() like this:

	rcu_read_lock();
	d = atomic_long_read(&ns->stashed);
	if (!d)
		goto slow;
	dentry = (struct dentry *)d;
	if (!lockref_get_not_dead(&dentry->d_lockref))
		goto slow;
other_count(dentry) got incremented by 1.
	rcu_read_unlock();
	ns->ops->put(ns);
other_count(ns) decremented by 1.
got_it:
	path->mnt = mntget(mnt);
	path->dentry = dentry;
path added to paths(dentry), other_count(dentry) decremented by 1 (getting
it back to the original value).
	return 0;
slow:
	rcu_read_unlock();
	inode = new_inode_pseudo(mnt->mnt_sb);
	if (!inode) {
		ns->ops->put(ns);
subtract 1 from other_count(ns)
		return -ENOMEM;
	}
dentries(inode) = empty
other_count(inode) = 1
	....
	inode->i_private = ns;
add inode to inodes(ns), subtract 1 from other_count(ns); the total
is unchanged.
	dentry = d_alloc_anon(mnt->mnt_sb);
	if (!dentry) {
		iput(inode);
subtract 1 from other_count(inode).  Since now all components of
inode refcount are zero, inode gets destroyed.  Destructor calls
nsfs_evict_inode(), which removes the inode from inodes(ns).
The total effect: inode is destroyed, inodes(ns) is back to what
it used to be and other_count(ns) is left decremented by 1 compared
to what we used to have.  IOW, the balance is the same as if inode
allocation would've failed.
		return -ENOMEM;
	}
other_count(dentry) = 1
	d_instantiate(dentry, inode);
add dentry to dentries(inode), subtract 1 from other_count(inode).
The total is unchanged.  Now other_count(inode) is 0 and dentries(inode)
is {dentry}.
	d = atomic_long_cmpxchg(&ns->stashed, 0, (unsigned long)dentry);
	if (d) {
somebody else has gotten there first
		d_delete(dentry);	/* make sure ->d_prune() does nothing */
		dput(dentry);
subtract 1 from other_count(dentry) (which will drive it to 0).  Since
no other references exist, dentry gets destroyed.  Destructor will
remove dentry from dentries(inode) and since other_count(inode) is already
zero, trigger destruction of inode.  That, in turn, will remove inode
from inodes(ns).  Total effect: dentry is destroyed, inode is destroyed,
inodes(ns) is back to what it used to be, other_count(ns) is left decremented
by 1 compared to what we used to have.
		cpu_relax();
		return -EAGAIN;
	}
	goto got_it;
got_it:
	path->mnt = mntget(mnt);
	path->dentry = dentry;
add path to paths(dentry), subtract 1 from other_count(dentry).  At that
point other_count(dentry) is back to 0, ditto for other_count(inode) and
other_count(ns) is left decremented by 1 compared to what it used to be.
inode is added to inodes(ns), dentry - to dentries(inode) and path - to
paths(dentry).
	return 0;
and we are done.

In all cases the total effect is the same as far as "other" counts go:
other_count(ns) is down by 1 and that's the only change in other_count()
of *ANY* objects.  Of course we do not keep track of the sets explicitly;
it would cost too much and we only interested in the sum of their sizes
anyway.  What we actually store is the sum, so operations like "transfer
the reference from one component to another" are not immediately obvious
to be refcounting ones - the sum is unchanged.  Conceptually, though,
they are refcounting operations.

Up to d_instantiate() we are holding a reference to inode;
after that we are *not* - it has been transferred to dentry.  That's
why on subsequent failure exits we do not call iput() - the inode
reference is not ours to discard anymore.

In the same way, up to inode->i_private = ns; we are holding
a reference to ns.  After that we are not - it's been transferred to
inode.  From that point on it's not ours to discard; it will be
dropped when inode gets destroyed, whenever that happens.
* Re: [PATCH] netns: dangling pointer on netns bind mount point. 2020-04-07 18:26 ` Al Viro @ 2020-04-08 5:59 ` Yun Levi 2020-04-08 13:48 ` Al Viro 0 siblings, 1 reply; 9+ messages in thread From: Yun Levi @ 2020-04-08 5:59 UTC (permalink / raw) To: Al Viro Cc: David S. Miller, Jakub Kicinski, Guillaume Nault, Nicolas Dichtel, Eric Dumazet, Li RongQing, Thomas Gleixner, Johannes Berg, David Howells, daniel, linux-fsdevel, linux-kernel, netdev Thank you for great comments. Thanks to you I understand what i missed. I try to generate problem on mainline But, as you explained that situation isn't happen, Maybe my other things which I made generate some problem (freeing network namespace..) Thanks for great answering and sharing. If I meet the situation, at that time I'll share. Thank you very much! P.S. If I have a question, Could I ask via e-mail like this? On Wed, Apr 8, 2020 at 3:26 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Tue, Apr 07, 2020 at 09:53:29PM +0900, Yun Levi wrote: > > BTW, It's my question. > > > > >Look: we call ns_get_path(), which calls ns_get_path_cb(), which > > >calls the callback passed to it (ns_get_path_task(), in this case), > > >which grabs a reference to ns. Then we pass that reference to > > >__ns_get_path(). > > > > > >__ns_get_path() looks for dentry stashed in ns. If there is one > > >and it's still alive, we grab a reference to that dentry and drop > > >the reference to ns - no new inodes had been created, so no new > > >namespace references have appeared. Existing inode is pinned > > >by dentry and dentry is pinned by _dentry_ reference we've got. > > > > actually ns_get_path is called in unshare(2). > > Yes, it does. 
Via perf_event_namespaces(), which does > perf_fill_ns_link_info(&ns_link_info[NET_NS_INDEX], > task, &netns_operations); > and there we have > error = ns_get_path(&ns_path, task, ns_ops); > if (!error) { > ns_inode = ns_path.dentry->d_inode; > ns_link_info->dev = new_encode_dev(ns_inode->i_sb->s_dev); > ns_link_info->ino = ns_inode->i_ino; > path_put(&ns_path); > } > See that path_put()? Dentry reference is dropped by it. > > > and it makes new dentry and > > inode in __ns_get_path finally (Cuz it create new network namespace) > > > > In that case, when I mount with --bind option to this proc/self/ns/net, it > > only increase dentry refcount on do_loopback->clone_mnt finally (not call > > netns_operation->get) > > That means it's not increase previous created network namespace reference > > count but only increase reference count of _dentry_ > > > > In that situation, If I exit the child process it definitely frees the > > net_namespace previous created at the same time it decrease net_namespace's > > refcnt in exit_task_namespace(). > > It means it's possible that bind mount point can hold the dentry with inode > > having net_namespace's dangling pointer in another process. > > In above situation, parent who know that binded mount point calls setns(2) > > then it sets the net_namespace which is refered by the inode of the dentry > > increased by do_loopback. > > That makes set the net_namespace which was freed already. > > How? Netns reference in inode contributes to netns refcount. And it's held > for as long as the _inode_ has positive refcount - we only drop it from > the inode destructor, *NOT* every time we drop a reference to inode. > In the similar fashion, the inode reference in dentry contributes to inode > refcount. And again, that inode reference won't be dropped until the _last_ > reference to dentry gets dropped. > > Incrementing refcount of dentry is enough to pin the inode and thus the > netns the inode refers to. 
It's a very common pattern with refcounting; > a useful way of thinking about it is to consider the refcount of e.g. > inode as sum of several components, one of them being "number of struct > dentry instances with ->d_inode pointing to our inode". And look at e.g. > __ns_get_path() like this: > rcu_read_lock(); > d = atomic_long_read(&ns->stashed); > if (!d) > goto slow; > dentry = (struct dentry *)d; > if (!lockref_get_not_dead(&dentry->d_lockref)) > goto slow; > other_count(dentry) got incremented by 1. > rcu_read_unlock(); > ns->ops->put(ns); > other_count(ns) decremented by 1. > got_it: > path->mnt = mntget(mnt); > path->dentry = dentry; > path added to paths(dentry), other_count(dentry) decremented by 1 (getting > it back to the original value). > return 0; > slow: > rcu_read_unlock(); > inode = new_inode_pseudo(mnt->mnt_sb); > if (!inode) { > ns->ops->put(ns); > subtract 1 from other_count(ns) > return -ENOMEM; > } > dentries(inode) = empty > other_count(inode) = 1 > .... > inode->i_private = ns; > add inode to inodes(ns), subtract 1 from other_count(ns); the total > is unchanged. > dentry = d_alloc_anon(mnt->mnt_sb); > if (!dentry) { > iput(inode); > subtract 1 from other_count(inode). Since now all components of > inode refcount are zero, inode gets destroyed. Destructor calls > nsfs_evict_inode(), which removes the inode from inodes(ns). > The total effect: inode is destroyed, inodes(ns) is back to what > it used to be and other_count(ns) is left decremented by 1 compared > to what we used to have. IOW, the balance is the same as if inode > allocation would've failed. > return -ENOMEM; > } > other_count(dentry) = 1 > d_instantiate(dentry, inode); > add dentry to dentries(inode), subtract 1 from other_count(inode). > The total is unchanged. Now other_count(inode) is 0 and dentries(inode) > is {dentry}. 
	d = atomic_long_cmpxchg(&ns->stashed, 0, (unsigned long)dentry);
	if (d) {

somebody else has gotten there first

		d_delete(dentry);	/* make sure ->d_prune() does nothing */
		dput(dentry);

subtract 1 from other_count(dentry) (which will drive it to 0). Since
no other references exist, dentry gets destroyed. Destructor will
remove dentry from dentries(inode) and since other_count(inode) is already
zero, trigger destruction of inode. That, in turn, will remove inode
from inodes(ns). Total effect: dentry is destroyed, inode is destroyed,
inodes(ns) is back to what it used to be, other_count(ns) is left
decremented by 1 compared to what we used to have.

		cpu_relax();
		return -EAGAIN;
	}
	goto got_it;
got_it:
	path->mnt = mntget(mnt);
	path->dentry = dentry;

add path to paths(dentry), subtract 1 from other_count(dentry). At that
point other_count(dentry) is back to 0, ditto for other_count(inode), and
other_count(ns) is left decremented by 1 compared to what it used to be.
inode is added to inodes(ns), dentry - to dentries(inode) and path - to
paths(dentry).

	return 0;

and we are done.

In all cases the total effect is the same as far as "other" counts go:
other_count(ns) is down by 1 and that's the only change in other_count()
of *ANY* objects. Of course we do not keep track of the sets explicitly;
it would cost too much and we are only interested in the sum of their
sizes anyway. What we actually store is the sum, so operations like
"transfer the reference from one component to another" are not immediately
obvious to be refcounting ones - the sum is unchanged. Conceptually,
though, they are refcounting operations.

Up to d_instantiate() we are holding a reference to inode; after that
we are *not* - it has been transferred to dentry. That's why on
subsequent failure exits we do not call iput() - the inode reference
is not ours to discard anymore.
In the same way, up to inode->i_private = ns; we are holding
a reference to ns. After that we are not - it's been transferred to
inode. From that point on it's not ours to discard; it will be
dropped when inode gets destroyed, whenever that happens.

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH] netns: dangling pointer on netns bind mount point.
  2020-04-08  5:59       ` Yun Levi
@ 2020-04-08 13:48       ` Al Viro
  0 siblings, 0 replies; 9+ messages in thread

From: Al Viro @ 2020-04-08 13:48 UTC (permalink / raw)
To: Yun Levi
Cc: David S. Miller, Jakub Kicinski, Guillaume Nault, Nicolas Dichtel,
	Eric Dumazet, Li RongQing, Thomas Gleixner, Johannes Berg,
	David Howells, daniel, linux-fsdevel, linux-kernel, netdev

On Wed, Apr 08, 2020 at 02:59:17PM +0900, Yun Levi wrote:
> Thank you for great comments. Thanks to you I understand what i missed.
>
> I try to generate problem on mainline But, as you explained that
> situation isn't happen,
>
> Maybe my other things which I made generate some problem (freeing
> network namespace..)
>
> Thanks for great answering and sharing.
>
> If I meet the situation, at that time I'll share. Thank you very much!
>
> P.S. If I have a question, Could I ask via e-mail like this?

Sure, no problem...

^ permalink raw reply	[flat|nested] 9+ messages in thread
end of thread, other threads:[~2020-04-08 13:49 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-07  2:35 [PATCH] netns: dangling pointer on netns bind mount point Levi
2020-04-07  3:05 ` Al Viro
2020-04-07  3:13   ` Al Viro
2020-04-07  3:29     ` Yun Levi
2020-04-07  4:03       ` Al Viro
     [not found]         ` <CAM7-yPRaQsNgZKjru40nM1N_u8HVLVKmJCAzu20DcPL=jzKjWQ@mail.gmail.com>
2020-04-07 12:57           ` Fwd: " Yun Levi
2020-04-07 18:26             ` Al Viro
2020-04-08  5:59               ` Yun Levi
2020-04-08 13:48                 ` Al Viro