* KASAN: use-after-free Read in path_lookupat
@ 2018-11-28 17:40 syzbot
2019-03-25 0:44 ` syzbot
2019-03-25 1:23 ` Linus Torvalds
0 siblings, 2 replies; 37+ messages in thread
From: syzbot @ 2018-11-28 17:40 UTC (permalink / raw)
To: linux-fsdevel, linux-kernel, syzkaller-bugs, viro
Hello,
syzbot found the following crash on:
HEAD commit: 442b8cea2477 Add linux-next specific files for 20181109
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=129574db400000
kernel config: https://syzkaller.appspot.com/x/.config?x=2f72bdb11df9fbe8
dashboard link: https://syzkaller.appspot.com/bug?extid=7a8ba368b47fdefca61e
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=132815f5400000
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com
IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
8021q: adding VLAN 0 to HW filter on device team0
8021q: adding VLAN 0 to HW filter on device team0
8021q: adding VLAN 0 to HW filter on device team0
==================================================================
BUG: KASAN: use-after-free in lookup_last fs/namei.c:2269 [inline]
BUG: KASAN: use-after-free in path_lookupat.isra.43+0x9f8/0xc00
fs/namei.c:2318
Read of size 1 at addr ffff8801c4865c47 by task syz-executor2/9423
CPU: 0 PID: 9423 Comm: syz-executor2 Not tainted 4.20.0-rc1-next-20181109+
#110
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x244/0x39d lib/dump_stack.c:113
print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
__asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
lookup_last fs/namei.c:2269 [inline]
path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
filename_lookup+0x26a/0x520 fs/namei.c:2348
user_path_at_empty+0x40/0x50 fs/namei.c:2608
user_path include/linux/namei.h:62 [inline]
do_mount+0x180/0x1ff0 fs/namespace.c:2980
ksys_mount+0x12d/0x140 fs/namespace.c:3258
__do_sys_mount fs/namespace.c:3272 [inline]
__se_sys_mount fs/namespace.c:3269 [inline]
__x64_sys_mount+0xbe/0x150 fs/namespace.c:3269
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457569
Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fde6ed96c78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000457569
RDX: 0000000020000040 RSI: 0000000020000000 RDI: 0000000000000000
RBP: 000000000072bf00 R08: 0000000020000340 R09: 0000000000000000
R10: 0000000000200000 R11: 0000000000000246 R12: 00007fde6ed976d4
R13: 00000000004c2c24 R14: 00000000004d4990 R15: 00000000ffffffff
Allocated by task 9424:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
__do_kmalloc mm/slab.c:3722 [inline]
__kmalloc_track_caller+0x157/0x760 mm/slab.c:3737
kstrdup+0x39/0x70 mm/util.c:49
bpf_symlink+0x26/0x140 kernel/bpf/inode.c:356
vfs_symlink+0x37a/0x5d0 fs/namei.c:4127
do_symlinkat+0x242/0x2d0 fs/namei.c:4154
__do_sys_symlink fs/namei.c:4173 [inline]
__se_sys_symlink fs/namei.c:4171 [inline]
__x64_sys_symlink+0x59/0x80 fs/namei.c:4171
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 9425:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
__cache_free mm/slab.c:3498 [inline]
kfree+0xcf/0x230 mm/slab.c:3817
bpf_evict_inode+0x11f/0x150 kernel/bpf/inode.c:565
evict+0x4b9/0x980 fs/inode.c:558
iput_final fs/inode.c:1550 [inline]
iput+0x674/0xa90 fs/inode.c:1576
do_unlinkat+0x733/0xa30 fs/namei.c:4069
__do_sys_unlink fs/namei.c:4110 [inline]
__se_sys_unlink fs/namei.c:4108 [inline]
__x64_sys_unlink+0x42/0x50 fs/namei.c:4108
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The buggy address belongs to the object at ffff8801c4865c40
which belongs to the cache kmalloc-32 of size 32
The buggy address is located 7 bytes inside of
32-byte region [ffff8801c4865c40, ffff8801c4865c60)
The buggy address belongs to the page:
page:ffffea0007121940 count:1 mapcount:0 mapping:ffff8801da8001c0
index:0xffff8801c4865fc1
flags: 0x2fffc0000000200(slab)
raw: 02fffc0000000200 ffffea000763c048 ffffea000762cf08 ffff8801da8001c0
raw: ffff8801c4865fc1 ffff8801c4865000 000000010000003f 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8801c4865b00: fb fb fb fb fc fc fc fc 05 fc fc fc fc fc fc fc
ffff8801c4865b80: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
> ffff8801c4865c00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
^
ffff8801c4865c80: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
ffff8801c4865d00: fb fb fb fb fc fc fc fc 05 fc fc fc fc fc fc fc
==================================================================
---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2018-11-28 17:40 KASAN: use-after-free Read in path_lookupat syzbot @ 2019-03-25 0:44 ` syzbot 2019-03-25 1:25 ` Linus Torvalds 2019-03-25 1:23 ` Linus Torvalds 1 sibling, 1 reply; 37+ messages in thread From: syzbot @ 2019-03-25 0:44 UTC (permalink / raw) To: dledford, hch, leonro, linux-fsdevel, linux-kernel, sagi, syzkaller-bugs, viro syzbot has bisected this bug to: commit a435393acafbf0ecff4deb3e3cb554b34f0d0664 Author: Sagi Grimberg <sagi@grimberg.me> Date: Thu Jul 13 08:09:40 2017 +0000 mlx5: move affinity hints assignments to generic code bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=12df1ba3200000 start commit: 442b8cea Add linux-next specific files for 20181109 git tree: linux-next final crash: https://syzkaller.appspot.com/x/report.txt?x=11df1ba3200000 console output: https://syzkaller.appspot.com/x/log.txt?x=16df1ba3200000 kernel config: https://syzkaller.appspot.com/x/.config?x=2f72bdb11df9fbe8 dashboard link: https://syzkaller.appspot.com/bug?extid=7a8ba368b47fdefca61e syz repro: https://syzkaller.appspot.com/x/repro.syz?x=132815f5400000 Reported-by: syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com Fixes: a435393acafb ("mlx5: move affinity hints assignments to generic code") For information about bisection process see: https://goo.gl/tpsmEJ#bisection ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 0:44 ` syzbot @ 2019-03-25 1:25 ` Linus Torvalds 0 siblings, 0 replies; 37+ messages in thread From: Linus Torvalds @ 2019-03-25 1:25 UTC (permalink / raw) To: syzbot Cc: dledford, Christoph Hellwig, Leon Romanovsky, linux-fsdevel, Linux List Kernel Mailing, Sagi Grimberg, syzkaller-bugs, Al Viro On Sun, Mar 24, 2019 at 5:44 PM syzbot <syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com> wrote: > > syzbot has bisected this bug to: > > commit a435393acafbf0ecff4deb3e3cb554b34f0d0664 That's wrong. It might change some timing or something to make the race hit, but it looks entirely unrelated to the bpf path_lookupat problem. Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2018-11-28 17:40 KASAN: use-after-free Read in path_lookupat syzbot 2019-03-25 0:44 ` syzbot @ 2019-03-25 1:23 ` Linus Torvalds 2019-03-25 4:57 ` Al Viro 1 sibling, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2019-03-25 1:23 UTC (permalink / raw) To: syzbot, Alexei Starovoitov, Daniel Borkmann Cc: linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs, Al Viro Hmm. Al, this one seems real and also seems pretty nasty from a vfs interface standpoint. On Wed, Nov 28, 2018 at 9:40 AM syzbot <syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com> wrote: > > BUG: KASAN: use-after-free in lookup_last fs/namei.c:2269 [inline] > BUG: KASAN: use-after-free in path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318 > ... > lookup_last fs/namei.c:2269 [inline] > path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318 Ok, path lookup using RCU. > Allocated by task 9424: > kstrdup+0x39/0x70 mm/util.c:49 > bpf_symlink+0x26/0x140 kernel/bpf/inode.c:356 It's the symlink data for the bpf filesystem. Fair enough. > Freed by task 9425: > kfree+0xcf/0x230 mm/slab.c:3817 > bpf_evict_inode+0x11f/0x150 kernel/bpf/inode.c:565 > evict+0x4b9/0x980 fs/inode.c:558 > iput_final fs/inode.c:1550 [inline] > iput+0x674/0xa90 fs/inode.c:1576 > do_unlinkat+0x733/0xa30 fs/namei.c:4069 .. and the path lookup is racing with the final unlink. We actually RCU-delay the inode freeing itself, but when we do the final iput(), the "evict()" function is called synchronously. Now, the simple fix would seem to just RCU-delay the kfree() of the symlink data in bpf_evict_inode(). Maybe that's the right thing to do. I'd call it trivial, except you'd need to have that rcu head to attach it to, making it slightly less trivial. I guess the kmalloc could just malloc that too. But it does strike me that this might be a generic issue. I wonder how many other filesystems do this? Alexei, Daniel - the "proper" fix right is probably one of four cases: (a) rcu-delay the freeing, and use simple_symlink_inode_operations(). No set_delayed_call() needed, because it stays around for the inode lifetime, but the RCU-delaying is still needed for lookup. or (b) put the symlink in a page, and use page_symlink_inode_operations for that inode, where we have a "struct delayed_call" and get a page ref or (c) put the symlink in the inode itself, and then use that simple_symlink_inode_operations (and now the RCU-delaying of the inode protects it). of (d) make a special getlink() that does "copy symlink, set_delayed_call(done, kfree_link, copy)" but I do wonder if we should perhaps just make simple_symlink_inode_operations do that (d) case for people. Al, comments? At the very least, if we don't make simple_symlink_inode_operations() do that, we should have a *big* comment that if it's not part of the inode data, it needs to be RCU-delayed. Or maybe we could add a final inode callback function for "rcu_free" that is called as the RCU-delayed freeing of the inode itself happens? And then people could hook into that for freeing the inode->i_link data. So many choices.. But the current situation seems unnecessarily complex for the filesystem, and isn't really documented. Our documentation currently says for get_link(): "If the body won't go away until the inode is gone, nothing else is needed", which is wrong (or at least very misleading, since the last "inode is gone" callback we have is that evict() function). Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 1:23 ` Linus Torvalds @ 2019-03-25 4:57 ` Al Viro 2019-03-25 9:15 ` Daniel Borkmann 2019-03-25 18:36 ` Linus Torvalds 0 siblings, 2 replies; 37+ messages in thread From: Al Viro @ 2019-03-25 4:57 UTC (permalink / raw) To: Linus Torvalds Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Sun, Mar 24, 2019 at 06:23:24PM -0700, Linus Torvalds wrote: > Al, comments? At the very least, if we don't make > simple_symlink_inode_operations() do that, we should have a *big* > comment that if it's not part of the inode data, it needs to be > RCU-delayed. simple_symlink_inode_operations is red herring here - what matters is ->i_link being set; those have ->get_link == simple_get_link, but note that it is *not* called: res = inode->i_link; if (!res) { const char * (*get)(struct dentry *, struct inode *, struct delayed_call *); get = inode->i_op->get_link; if (nd->flags & LOOKUP_RCU) { res = get(NULL, inode, &last->done); if (res == ERR_PTR(-ECHILD)) { if (unlikely(unlazy_walk(nd))) return ERR_PTR(-ECHILD); res = get(dentry, inode, &last->done); } } else { res = get(dentry, inode, &last->done); } if (IS_ERR_OR_NULL(res)) return res; } for traversal and similar for readlink(2). And we certainly don't want to allocate copies in those cases - it would fuck RCU traversals for all fast symlinks (i.e. for the majority of symlinks out there). Actual situation: * shmem, erofs: OK, kfree() from the thing ->destroy_inode() is calling via call_rcu(). * befs, ext2, ext4, freevxfs, jfs, orangefs, ufs: OK, coallocated with inode * debugfs: broken * jffs2: broken, freeing of f->target should be moved to jffs2_i_callback(). * ubifs: broken, ought to move kfree(ui->data); from ubifs_destroy_inode() to ubifs_i_callback() * ceph: broken, needs to move kfree(ci->symlink) from ceph_destroy_inode() to ceph_i_callback(). * bpf: broken So we have 5 broken cases, all with the same kind of fix: move freeing into the RCU-delayed part of ->destroy_inode(); for debugfs and bpf that requires adding ->alloc_inode()/->destroy_inode(), rather than relying upon the defaults from fs/inode.c > Or maybe we could add a final inode callback function for "rcu_free" > that is called as the RCU-delayed freeing of the inode itself happens? > And then people could hook into that for freeing the inode->i_link > data. You mean, split ->destroy_inode() into immediate and RCU-delayed parts? There are filesystems where both parts are non-empty - we can't just switch all ->destroy_inode() work to call_rcu(). > So many choices.. But the current situation seems unnecessarily > complex for the filesystem, and isn't really documented. > > Our documentation currently says for get_link(): "If the body won't go > away until the inode is gone, nothing else is needed", which is wrong > (or at least very misleading, since the last "inode is gone" callback > we have is that evict() function). s/inode is gone/struct inode is freed/, but it's obviously not clear enough. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 4:57 ` Al Viro @ 2019-03-25 9:15 ` Daniel Borkmann 2019-03-25 11:11 ` Al Viro 2019-03-25 18:36 ` Linus Torvalds 1 sibling, 1 reply; 37+ messages in thread From: Daniel Borkmann @ 2019-03-25 9:15 UTC (permalink / raw) To: Al Viro, Linus Torvalds Cc: syzbot, Alexei Starovoitov, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs Hi Al, hi Linus, On 03/25/2019 05:57 AM, Al Viro wrote: > On Sun, Mar 24, 2019 at 06:23:24PM -0700, Linus Torvalds wrote: > >> Al, comments? At the very least, if we don't make >> simple_symlink_inode_operations() do that, we should have a *big* >> comment that if it's not part of the inode data, it needs to be >> RCU-delayed. > > simple_symlink_inode_operations is red herring here - what matters > is ->i_link being set; those have ->get_link == simple_get_link, > but note that it is *not* called: > res = inode->i_link; > if (!res) { > const char * (*get)(struct dentry *, struct inode *, > struct delayed_call *); > get = inode->i_op->get_link; > if (nd->flags & LOOKUP_RCU) { > res = get(NULL, inode, &last->done); > if (res == ERR_PTR(-ECHILD)) { > if (unlikely(unlazy_walk(nd))) > return ERR_PTR(-ECHILD); > res = get(dentry, inode, &last->done); > } > } else { > res = get(dentry, inode, &last->done); > } > if (IS_ERR_OR_NULL(res)) > return res; > } > for traversal and similar for readlink(2). And we certainly don't want > to allocate copies in those cases - it would fuck RCU traversals for > all fast symlinks (i.e. for the majority of symlinks out there). > > Actual situation: > > * shmem, erofs: OK, kfree() from the thing ->destroy_inode() is calling via > call_rcu(). > * befs, ext2, ext4, freevxfs, jfs, orangefs, ufs: OK, coallocated with inode > * debugfs: broken > * jffs2: broken, freeing of f->target should be moved to jffs2_i_callback(). > * ubifs: broken, ought to move kfree(ui->data); from ubifs_destroy_inode() to > ubifs_i_callback() > * ceph: broken, needs to move kfree(ci->symlink) from ceph_destroy_inode() > to ceph_i_callback(). > * bpf: broken > > So we have 5 broken cases, all with the same kind of fix: move freeing > into the RCU-delayed part of ->destroy_inode(); for debugfs and bpf > that requires adding ->alloc_inode()/->destroy_inode(), rather than > relying upon the defaults from fs/inode.c I believe I copied that logic from one of the other fs'es back then, sigh. Thanks for the analysis and hints for fixing. Would the below (only compile tested for now) look reasonable to you? I believe there is no other way around than to add our own inode cache, but so be it. The freeing of the i_link is then RCU-deferred in bpf_destroy_inode_deferred(): kernel/bpf/inode.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 52 insertions(+), 5 deletions(-) diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c index 2ada5e2..69fcaf6 100644 --- a/kernel/bpf/inode.c +++ b/kernel/bpf/inode.c @@ -80,6 +80,47 @@ static const struct inode_operations bpf_dir_iops; static const struct inode_operations bpf_prog_iops = { }; static const struct inode_operations bpf_map_iops = { }; +static struct kmem_cache *bpf_inode_cache __read_mostly; + +struct inode *bpf_alloc_inode(struct super_block *sb) +{ + return kmem_cache_alloc(bpf_inode_cache, GFP_KERNEL); +} + +static void bpf_destroy_inode_deferred(struct rcu_head *head) +{ + struct inode *inode = container_of(head, struct inode, i_rcu); + + if (S_ISLNK(inode->i_mode)) + kfree(inode->i_link); + kmem_cache_free(bpf_inode_cache, inode); +} + +static void bpf_destroy_inode(struct inode *inode) +{ + call_rcu(&inode->i_rcu, bpf_destroy_inode_deferred); +} + +static void bpf_inode_init_once(void *inode) +{ + inode_init_once(inode); +} + +static int bpf_init_inode_cache(void) +{ + bpf_inode_cache = kmem_cache_create("bpf_inode_cache", + sizeof(struct inode), 0, + (SLAB_RECLAIM_ACCOUNT| + SLAB_MEM_SPREAD|SLAB_ACCOUNT), + bpf_inode_init_once); + return !bpf_inode_cache ? -ENOMEM : 0; +} + +static void bpf_destroy_inode_cache(void) +{ + kmem_cache_destroy(bpf_inode_cache); +} + static struct inode *bpf_get_inode(struct super_block *sb, const struct inode *dir, umode_t mode) @@ -561,8 +602,6 @@ static void bpf_evict_inode(struct inode *inode) truncate_inode_pages_final(&inode->i_data); clear_inode(inode); - if (S_ISLNK(inode->i_mode)) - kfree(inode->i_link); if (!bpf_inode_type(inode, &type)) bpf_any_put(inode->i_private, type); } @@ -580,6 +619,8 @@ static int bpf_show_options(struct seq_file *m, struct dentry *root) } static const struct super_operations bpf_super_ops = { + .alloc_inode = bpf_alloc_inode, + .destroy_inode = bpf_destroy_inode, .statfs = simple_statfs, .drop_inode = generic_delete_inode, .show_options = bpf_show_options, @@ -671,13 +712,19 @@ static int __init bpf_init(void) { int ret; - ret = sysfs_create_mount_point(fs_kobj, "bpf"); + ret = bpf_init_inode_cache(); if (ret) return ret; - + ret = sysfs_create_mount_point(fs_kobj, "bpf"); + if (ret) { + bpf_destroy_inode_cache(); + return ret; + } ret = register_filesystem(&bpf_fs_type); - if (ret) + if (ret) { sysfs_remove_mount_point(fs_kobj, "bpf"); + bpf_destroy_inode_cache(); + } return ret; } -- 2.9.5 >> Or maybe we could add a final inode callback function for "rcu_free" >> that is called as the RCU-delayed freeing of the inode itself happens? >> And then people could hook into that for freeing the inode->i_link >> data. > > You mean, split ->destroy_inode() into immediate and RCU-delayed parts? > There are filesystems where both parts are non-empty - we can't just > switch all ->destroy_inode() work to call_rcu(). > >> So many choices.. But the current situation seems unnecessarily >> complex for the filesystem, and isn't really documented. >> >> Our documentation currently says for get_link(): "If the body won't go >> away until the inode is gone, nothing else is needed", which is wrong >> (or at least very misleading, since the last "inode is gone" callback >> we have is that evict() function). > > s/inode is gone/struct inode is freed/, but it's obviously not clear > enough. > ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 9:15 ` Daniel Borkmann @ 2019-03-25 11:11 ` Al Viro 2019-03-25 11:17 ` Al Viro 0 siblings, 1 reply; 37+ messages in thread From: Al Viro @ 2019-03-25 11:11 UTC (permalink / raw) To: Daniel Borkmann Cc: Linus Torvalds, syzbot, Alexei Starovoitov, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Mon, Mar 25, 2019 at 10:15:40AM +0100, Daniel Borkmann wrote: > > So we have 5 broken cases, all with the same kind of fix: move freeing > > into the RCU-delayed part of ->destroy_inode(); for debugfs and bpf > > that requires adding ->alloc_inode()/->destroy_inode(), rather than > > relying upon the defaults from fs/inode.c > > I believe I copied that logic from one of the other fs'es back then, sigh. > Thanks for the analysis and hints for fixing. Would the below (only compile > tested for now) look reasonable to you? I believe there is no other way > around than to add our own inode cache, but so be it. The freeing of the > i_link is then RCU-deferred in bpf_destroy_inode_deferred(): It looks like it would suffice, but it's way too much boilerplate for my taste ;-/ Most of your headache here comes from messing with slab setup and the only reason for that is being unable to free stuff into inode_cachep. So grep for inode_cachep would be a good idea, and it digs up the following sucker: void free_inode_nonrcu(struct inode *inode) { kmem_cache_free(inode_cachep, inode); } EXPORT_SYMBOL(free_inode_nonrcu); IOW, you need nothing on ->alloc_inode() side and for ->destroy_inode() just do call_rcu() of a callback, that would kfree(inode->link) if it was a symlink, then call free_inode_nonrcu(inode). Considerably smaller patch that way... ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 11:11 ` Al Viro @ 2019-03-25 11:17 ` Al Viro 2019-03-25 11:21 ` Daniel Borkmann 0 siblings, 1 reply; 37+ messages in thread From: Al Viro @ 2019-03-25 11:17 UTC (permalink / raw) To: Daniel Borkmann Cc: Linus Torvalds, syzbot, Alexei Starovoitov, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Mon, Mar 25, 2019 at 11:11:23AM +0000, Al Viro wrote: > On Mon, Mar 25, 2019 at 10:15:40AM +0100, Daniel Borkmann wrote: > > > So we have 5 broken cases, all with the same kind of fix: move freeing > > > into the RCU-delayed part of ->destroy_inode(); for debugfs and bpf > > > that requires adding ->alloc_inode()/->destroy_inode(), rather than > > > relying upon the defaults from fs/inode.c > > > > I believe I copied that logic from one of the other fs'es back then, sigh. > > Thanks for the analysis and hints for fixing. Would the below (only compile > > tested for now) look reasonable to you? I believe there is no other way > > around than to add our own inode cache, but so be it. The freeing of the > > i_link is then RCU-deferred in bpf_destroy_inode_deferred(): > > It looks like it would suffice, but it's way too much boilerplate for my > taste ;-/ > > Most of your headache here comes from messing with slab setup and the > only reason for that is being unable to free stuff into inode_cachep. > So grep for inode_cachep would be a good idea, and it digs up the > following sucker: > void free_inode_nonrcu(struct inode *inode) > { > kmem_cache_free(inode_cachep, inode); > } > EXPORT_SYMBOL(free_inode_nonrcu); > > IOW, you need nothing on ->alloc_inode() side and for ->destroy_inode() > just do call_rcu() of a callback, that would kfree(inode->link) if > it was a symlink, then call free_inode_nonrcu(inode). > > Considerably smaller patch that way... AFAICS, it boils down to just this (also only compile-tested): diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c index 2ada5e21dfa6..9437f88b6acf 100644 --- a/kernel/bpf/inode.c +++ b/kernel/bpf/inode.c @@ -80,6 +80,20 @@ static const struct inode_operations bpf_dir_iops; static const struct inode_operations bpf_prog_iops = { }; static const struct inode_operations bpf_map_iops = { }; +static void bpf_destroy_inode_deferred(struct rcu_head *head) +{ + struct inode *inode = container_of(head, struct inode, i_rcu); + + if (S_ISLNK(inode->i_mode)) + kfree(inode->i_link); + free_inode_nonrcu(inode); +} + +static void bpf_destroy_inode(struct inode *inode) +{ + call_rcu(&inode->i_rcu, bpf_destroy_inode_deferred); +} + static struct inode *bpf_get_inode(struct super_block *sb, const struct inode *dir, umode_t mode) @@ -561,8 +575,6 @@ static void bpf_evict_inode(struct inode *inode) truncate_inode_pages_final(&inode->i_data); clear_inode(inode); - if (S_ISLNK(inode->i_mode)) - kfree(inode->i_link); if (!bpf_inode_type(inode, &type)) bpf_any_put(inode->i_private, type); } @@ -584,6 +596,7 @@ static const struct super_operations bpf_super_ops = { .drop_inode = generic_delete_inode, .show_options = bpf_show_options, .evict_inode = bpf_evict_inode, + .destroy_inode = bpf_destroy_inode, }; enum { ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 11:17 ` Al Viro @ 2019-03-25 11:21 ` Daniel Borkmann 0 siblings, 0 replies; 37+ messages in thread From: Daniel Borkmann @ 2019-03-25 11:21 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, syzbot, Alexei Starovoitov, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On 03/25/2019 12:17 PM, Al Viro wrote: > On Mon, Mar 25, 2019 at 11:11:23AM +0000, Al Viro wrote: >> On Mon, Mar 25, 2019 at 10:15:40AM +0100, Daniel Borkmann wrote: >>>> So we have 5 broken cases, all with the same kind of fix: move freeing >>>> into the RCU-delayed part of ->destroy_inode(); for debugfs and bpf >>>> that requires adding ->alloc_inode()/->destroy_inode(), rather than >>>> relying upon the defaults from fs/inode.c >>> >>> I believe I copied that logic from one of the other fs'es back then, sigh. >>> Thanks for the analysis and hints for fixing. Would the below (only compile >>> tested for now) look reasonable to you? I believe there is no other way >>> around than to add our own inode cache, but so be it. The freeing of the >>> i_link is then RCU-deferred in bpf_destroy_inode_deferred(): >> >> It looks like it would suffice, but it's way too much boilerplate for my >> taste ;-/ >> >> Most of your headache here comes from messing with slab setup and the >> only reason for that is being unable to free stuff into inode_cachep. >> So grep for inode_cachep would be a good idea, and it digs up the >> following sucker: >> void free_inode_nonrcu(struct inode *inode) >> { >> kmem_cache_free(inode_cachep, inode); >> } >> EXPORT_SYMBOL(free_inode_nonrcu); >> >> IOW, you need nothing on ->alloc_inode() side and for ->destroy_inode() >> just do call_rcu() of a callback, that would kfree(inode->link) if >> it was a symlink, then call free_inode_nonrcu(inode). >> >> Considerably smaller patch that way... > > AFAICS, it boils down to just this (also only compile-tested): Ah yep, much better, agree. Want to cook proper patch for bpf, or want me to take care of it? Thanks, Daniel > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c > index 2ada5e21dfa6..9437f88b6acf 100644 > --- a/kernel/bpf/inode.c > +++ b/kernel/bpf/inode.c > @@ -80,6 +80,20 @@ static const struct inode_operations bpf_dir_iops; > static const struct inode_operations bpf_prog_iops = { }; > static const struct inode_operations bpf_map_iops = { }; > > +static void bpf_destroy_inode_deferred(struct rcu_head *head) > +{ > + struct inode *inode = container_of(head, struct inode, i_rcu); > + > + if (S_ISLNK(inode->i_mode)) > + kfree(inode->i_link); > + free_inode_nonrcu(inode); > +} > + > +static void bpf_destroy_inode(struct inode *inode) > +{ > + call_rcu(&inode->i_rcu, bpf_destroy_inode_deferred); > +} > + > static struct inode *bpf_get_inode(struct super_block *sb, > const struct inode *dir, > umode_t mode) > @@ -561,8 +575,6 @@ static void bpf_evict_inode(struct inode *inode) > truncate_inode_pages_final(&inode->i_data); > clear_inode(inode); > > - if (S_ISLNK(inode->i_mode)) > - kfree(inode->i_link); > if (!bpf_inode_type(inode, &type)) > bpf_any_put(inode->i_private, type); > } > @@ -584,6 +596,7 @@ static const struct super_operations bpf_super_ops = { > .drop_inode = generic_delete_inode, > .show_options = bpf_show_options, > .evict_inode = bpf_evict_inode, > + .destroy_inode = bpf_destroy_inode, > }; > > enum { > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 4:57 ` Al Viro 2019-03-25 9:15 ` Daniel Borkmann @ 2019-03-25 18:36 ` Linus Torvalds 2019-03-25 19:18 ` Linus Torvalds 2019-03-25 19:43 ` Al Viro 1 sibling, 2 replies; 37+ messages in thread From: Linus Torvalds @ 2019-03-25 18:36 UTC (permalink / raw) To: Al Viro Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Sun, Mar 24, 2019 at 9:57 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > So we have 5 broken cases, all with the same kind of fix: move freeing > into the RCU-delayed part of ->destroy_inode(); for debugfs and bpf > that requires adding ->alloc_inode()/->destroy_inode(), rather than > relying upon the defaults from fs/inode.c The fact that we had five different broken filesystems does imply that we should just make this much easier at the VFS layer instead of forcing filesystems to cope with a bad interface. > > Or maybe we could add a final inode callback function for "rcu_free" > > that is called as the RCU-delayed freeing of the inode itself happens? > > And then people could hook into that for freeing the inode->i_link > > data. > > You mean, split ->destroy_inode() into immediate and RCU-delayed parts? > There are filesystems where both parts are non-empty - we can't just > switch all ->destroy_inode() work to call_rcu(). Right. Not just move the existing destroy_inode() - because as you say, people may not be able to to do that in RCU contect, but split it up, and add a "final_free_inode()" callback or something for the RCU phase. In fact, I suspect that *every* user of destroy_inode() already has to have its own RCU callback to another final freeing function anyway. Because they really shouldn't free the inode itself early. Maybe we can just make that be a generic thing? And no, it's not like the patch to the bpf filesystem looks _horrible_, but I think the fact that the low-level filesystem needed to do its own RCU-deferred thing (even if it wasn't doing any RCU on its own, although bpf obviously has that in other areas) is kind of wrong, and shows that our vfs interfaces are a bit awkward. Plus it's not like it was just bpf that got this wrong. It's also somewhat silly to add _another_ RCU callback when we already have one for the inode. Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 18:36 ` Linus Torvalds @ 2019-03-25 19:18 ` Linus Torvalds 2019-03-25 21:14 ` Al Viro 2019-03-25 19:43 ` Al Viro 1 sibling, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2019-03-25 19:18 UTC (permalink / raw) To: Al Viro Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs [-- Attachment #1: Type: text/plain, Size: 678 bytes --] On Mon, Mar 25, 2019 at 11:36 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > You mean, split ->destroy_inode() into immediate and RCU-delayed parts? > > There are filesystems where both parts are non-empty - we can't just > > switch all ->destroy_inode() work to call_rcu(). > > Right. Not just move the existing destroy_inode() - because as you > say, people may not be able to to do that in RCU contect, but split it > up, and add a "final_free_inode()" callback or something for the RCU > phase. Something like the attached. COMPLETELY UNTESTED. And no filesystems converted to actually use the new rcu_destroy_inode() thing. Hmm? Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 3424 bytes --] Documentation/filesystems/vfs.txt | 6 ++++++ fs/inode.c | 27 +++++++++++++++++++++------ include/linux/fs.h | 1 + 3 files changed, 28 insertions(+), 6 deletions(-) diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 761c6fd24a53..60f7841a12e6 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -210,6 +210,7 @@ filesystem. As of kernel 2.6.22, the following members are defined: struct super_operations { struct inode *(*alloc_inode)(struct super_block *sb); void (*destroy_inode)(struct inode *); + int (*rcu_destroy_inode)(struct inode *); void (*dirty_inode) (struct inode *, int flags); int (*write_inode) (struct inode *, int); @@ -248,6 +249,11 @@ or bottom half). ->alloc_inode was defined and simply undoes anything done by ->alloc_inode. + rcu_destroy_inode: this method is called after the RCU delay by + destroy_inode() to release resources allocated for struct inode. + If it returns a non-zero value, it means that it has free'd the + inode, otherwise the inode layer will free it. + dirty_inode: this method is called by the VFS to mark an inode dirty. write_inode: this method is called when the VFS needs to write an diff --git a/fs/inode.c b/fs/inode.c index e9d97add2b36..b85457baad20 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -215,9 +215,13 @@ static struct inode *alloc_inode(struct super_block *sb) return NULL; if (unlikely(inode_init_always(sb, inode))) { - if (inode->i_sb->s_op->destroy_inode) + if (inode->i_sb->s_op->destroy_inode) { inode->i_sb->s_op->destroy_inode(inode); - else + if (!inode->i_sb->s_op->rcu_destroy_inode) + return NULL; + } + if (!inode->i_sb->s_op->rcu_destroy_inode || + !inode->i_sb->s_op->rcu_destroy_inode(inode)) kmem_cache_free(inode_cachep, inode); return NULL; } @@ -256,17 +260,28 @@ EXPORT_SYMBOL(__destroy_inode); static void i_callback(struct rcu_head *head) { struct inode *inode = container_of(head, struct inode, i_rcu); - kmem_cache_free(inode_cachep, inode); + + if (!inode->i_sb->s_op->rcu_destroy_inode || + !inode->i_sb->s_op->rcu_destroy_inode(inode)) + kmem_cache_free(inode_cachep, inode); } static void destroy_inode(struct inode *inode) { BUG_ON(!list_empty(&inode->i_lru)); __destroy_inode(inode); - if (inode->i_sb->s_op->destroy_inode) + + /* + * If we have a 'destroy_inode' but no 'rcu_destroy_inode' + * then the filesystem handles the RCU-delayed destruction + * on its own, and we don't do any RCU callbacks. + */ + if (inode->i_sb->s_op->destroy_inode) { inode->i_sb->s_op->destroy_inode(inode); - else - call_rcu(&inode->i_rcu, i_callback); + if (!inode->i_sb->s_op->rcu_destroy_inode) + return; + } + call_rcu(&inode->i_rcu, i_callback); } /** diff --git a/include/linux/fs.h b/include/linux/fs.h index 8b42df09b04c..727561ecbc23 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1900,6 +1900,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, struct super_operations { struct inode *(*alloc_inode)(struct super_block *sb); void (*destroy_inode)(struct inode *); + int (*rcu_destroy_inode)(struct inode *); void (*dirty_inode) (struct inode *, int flags); int (*write_inode) (struct inode *, struct writeback_control *wbc); ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 19:18 ` Linus Torvalds @ 2019-03-25 21:14 ` Al Viro 2019-03-25 21:45 ` Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Al Viro @ 2019-03-25 21:14 UTC (permalink / raw) To: Linus Torvalds Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Mon, Mar 25, 2019 at 12:18:02PM -0700, Linus Torvalds wrote: > COMPLETELY UNTESTED. And no filesystems converted to actually use the > new rcu_destroy_inode() thing. > > Hmm? Maybe, but we really need to come up with sane documentation on the entire drop_inode/evict_inode/destroy_inode/rcu_destroy_inode group ;-/ And I want to understand the writeback-related issues in ocfs2 and f2fs - the current kludges in those smell fishy. > if (unlikely(inode_init_always(sb, inode))) { > - if (inode->i_sb->s_op->destroy_inode) > + if (inode->i_sb->s_op->destroy_inode) { > inode->i_sb->s_op->destroy_inode(inode); > - else > + if (!inode->i_sb->s_op->rcu_destroy_inode) > + return NULL; > + } > + if (!inode->i_sb->s_op->rcu_destroy_inode || > + !inode->i_sb->s_op->rcu_destroy_inode(inode)) > kmem_cache_free(inode_cachep, inode); ITYM i_callback(inode); here, possibly suitably renamed. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 21:14 ` Al Viro @ 2019-03-25 21:45 ` Linus Torvalds 2019-03-25 22:04 ` Daniel Borkmann 2019-03-25 23:37 ` Al Viro 0 siblings, 2 replies; 37+ messages in thread From: Linus Torvalds @ 2019-03-25 21:45 UTC (permalink / raw) To: Al Viro Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Mon, Mar 25, 2019 at 2:14 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > Maybe, but we really need to come up with sane documentation on the > entire drop_inode/evict_inode/destroy_inode/rcu_destroy_inode > group ;-/ Yeah. I actually think the "destroy_inode/rcu_destroy_inode" part is the simplest one to understand: it's just tearing down the inode, and the RCU part is (obviously, as shown by this thread) somewhat subtle, but at the same time not really all that hard to explain: "inodes that can be looked up through RCU lookup need to be destroyed with RCU delay". I think drop_inode->evict_inode is arguably the much more subtle stuff. That's the "we're not destroying things, we're making decisions about the state" kind of thing. And we actually don't have any documentation at all for those two. Well, there's the "porting" thing, but.. > And I want to understand the writeback-related issues > in ocfs2 and f2fs - the current kludges in those smell fishy. I'm assuming you're talking about exactly that drop_inode() kind of subtlety. At least for ocfs2, the whole "rcu_destroy_inode()" patch I sent out would simplify things a lot. *All* that the ocfs2_destroy_inode() function does is to schedule the inode freeing for RCU delay. Assuming my patch is fine (as mentioned, it was entirely untested), that patch would actually simplify that a lot. Get rid of ocfs2_destroy_inode() entirely, and just make ocfs2_rcu_destroy_inode() do that kmem_cache_free(). Mucho clean-up (we have that ocfs2_rcu_destroy_inode() currently as ocfs2_i_callback(), but with the ugly rcu-head interface). > > if (unlikely(inode_init_always(sb, inode))) { > > - if (inode->i_sb->s_op->destroy_inode) > > + if (inode->i_sb->s_op->destroy_inode) { > > inode->i_sb->s_op->destroy_inode(inode); > > - else > > + if (!inode->i_sb->s_op->rcu_destroy_inode) > > + return NULL; > > + } > > + if (!inode->i_sb->s_op->rcu_destroy_inode || > > + !inode->i_sb->s_op->rcu_destroy_inode(inode)) > > kmem_cache_free(inode_cachep, inode); > > ITYM i_callback(inode); here, possibly suitably renamed. Yeah, I guess we could just call that i_callback() directly. I wrote it that way because it was how the code was organized (it didn't call i_callback() before either), but yes, it woudl avoid some duplicate code to do it that way. And yes, at that point we'd probably want to rename it too. On the whole, looking at a few filesystems, I really think that 'rcu_destroy_inode()" would simplify several of them. That ocfs2 case I already mentioned, but looking around the exact same is trye in v9fs, vxfs, dlmfs, hostfs, bfs, coda, fatfs, isofs, minix, nfs, openprom, proc, qnx4, qnx6, sysv, befs, adfs, affs, efs, ext2, f2fs, gfs2, hfs, hfsplus, hpfs, jffs2, nilfs, reiserfs, romfs, squashfs) And in other filesystems that actually *do* have a real destroy_inode() that does other things too, they still want the rcu delayed part as well (eg btrfs, ceph, fuse, hugetlbfs, afs, ecryptfs, ext4, jfs, organgefs, ovl, ubifs). In fact, every single filesystem I looked at (and I looked at most) would be simplified by that patch. And for some it would make it easier to fix bugs in the process (ie bpf) because they don't currently have that rcu delay and they need it. So looking at all the existing users of destroy_inode(), I really do think we should have that rcu_destroy_inode() callback. Because that 3 files changed, 28 insertions(+), 6 deletions(-) will allow quite a lot of lines to be removed from low-level filesystems with all the call_rcu() and callback container_of(head, struct inode, i_rcu) noise just going away. But yes, it would be good to have more documentation fo the drop/evict phase. As mentioned, I think the destroy phase is actually the easy one, with the RCU part being the most complex one that this would actually simplify. Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 21:45 ` Linus Torvalds @ 2019-03-25 22:04 ` Daniel Borkmann 2019-03-25 22:13 ` Linus Torvalds 2019-03-25 22:49 ` Al Viro 2019-03-25 23:37 ` Al Viro 1 sibling, 2 replies; 37+ messages in thread From: Daniel Borkmann @ 2019-03-25 22:04 UTC (permalink / raw) To: Linus Torvalds, Al Viro Cc: syzbot, Alexei Starovoitov, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On 03/25/2019 10:45 PM, Linus Torvalds wrote: > On Mon, Mar 25, 2019 at 2:14 PM Al Viro <viro@zeniv.linux.org.uk> wrote: >> >> Maybe, but we really need to come up with sane documentation on the >> entire drop_inode/evict_inode/destroy_inode/rcu_destroy_inode >> group ;-/ > > Yeah. > > I actually think the "destroy_inode/rcu_destroy_inode" part is the > simplest one to understand: it's just tearing down the inode, and the > RCU part is (obviously, as shown by this thread) somewhat subtle, but > at the same time not really all that hard to explain: "inodes that can > be looked up through RCU lookup need to be destroyed with RCU delay". > > I think drop_inode->evict_inode is arguably the much more subtle > stuff. That's the "we're not destroying things, we're making decisions > about the state" kind of thing. > > And we actually don't have any documentation at all for those two. > Well, there's the "porting" thing, but.. > >> And I want to understand the writeback-related issues >> in ocfs2 and f2fs - the current kludges in those smell fishy. > > I'm assuming you're talking about exactly that drop_inode() kind of subtlety. > > At least for ocfs2, the whole "rcu_destroy_inode()" patch I sent out > would simplify things a lot. *All* that the ocfs2_destroy_inode() > function does is to schedule the inode freeing for RCU delay. > > Assuming my patch is fine (as mentioned, it was entirely untested), > that patch would actually simplify that a lot. Get rid of > ocfs2_destroy_inode() entirely, and just make > ocfs2_rcu_destroy_inode() do that kmem_cache_free(). Mucho clean-up > (we have that ocfs2_rcu_destroy_inode() currently as > ocfs2_i_callback(), but with the ugly rcu-head interface). > >>> if (unlikely(inode_init_always(sb, inode))) { >>> - if (inode->i_sb->s_op->destroy_inode) >>> + if (inode->i_sb->s_op->destroy_inode) { >>> inode->i_sb->s_op->destroy_inode(inode); >>> - else >>> + if (!inode->i_sb->s_op->rcu_destroy_inode) >>> + return NULL; >>> + } >>> + if (!inode->i_sb->s_op->rcu_destroy_inode || >>> + !inode->i_sb->s_op->rcu_destroy_inode(inode)) >>> kmem_cache_free(inode_cachep, inode); >> >> ITYM i_callback(inode); here, possibly suitably renamed. > > Yeah, I guess we could just call that i_callback() directly. I wrote > it that way because it was how the code was organized (it didn't call > i_callback() before either), but yes, it woudl avoid some duplicate > code to do it that way. > > And yes, at that point we'd probably want to rename it too. > > On the whole, looking at a few filesystems, I really think that > 'rcu_destroy_inode()" would simplify several of them. That ocfs2 case > I already mentioned, but looking around the exact same is trye in > v9fs, vxfs, dlmfs, hostfs, bfs, coda, fatfs, isofs, minix, nfs, > openprom, proc, qnx4, qnx6, sysv, befs, adfs, affs, efs, ext2, f2fs, > gfs2, hfs, hfsplus, hpfs, jffs2, nilfs, reiserfs, romfs, squashfs) > > And in other filesystems that actually *do* have a real > destroy_inode() that does other things too, they still want the rcu > delayed part as well (eg btrfs, ceph, fuse, hugetlbfs, afs, ecryptfs, > ext4, jfs, organgefs, ovl, ubifs). > > In fact, every single filesystem I looked at (and I looked at most) > would be simplified by that patch. > > And for some it would make it easier to fix bugs in the process (ie > bpf) because they don't currently have that rcu delay and they need > it. I'm fine either way, I think the rcu_destroy_inode would indeed simplify it nicely. In any case fwiw, here's what I'd have ready for standby on bpf side and tested as well. Decided to get rid of bpf_evict_inode() entirely since the only callback we'd really need is on final inode destruction: From fc2111c4ebb487b630a423752c0f23b5f2535a55 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann <daniel@iogearbox.net> Date: Mon, 25 Mar 2019 15:54:43 +0100 Subject: [PATCH bpf] bpf: fix use after free in bpf_evict_inode syzkaller was able to generate the following UAF in bpf: BUG: KASAN: use-after-free in lookup_last fs/namei.c:2269 [inline] BUG: KASAN: use-after-free in path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318 Read of size 1 at addr ffff8801c4865c47 by task syz-executor2/9423 CPU: 0 PID: 9423 Comm: syz-executor2 Not tainted 4.20.0-rc1-next-20181109+ #110 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x244/0x39d lib/dump_stack.c:113 print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430 lookup_last fs/namei.c:2269 [inline] path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318 filename_lookup+0x26a/0x520 fs/namei.c:2348 user_path_at_empty+0x40/0x50 fs/namei.c:2608 user_path include/linux/namei.h:62 [inline] do_mount+0x180/0x1ff0 fs/namespace.c:2980 ksys_mount+0x12d/0x140 fs/namespace.c:3258 __do_sys_mount fs/namespace.c:3272 [inline] __se_sys_mount fs/namespace.c:3269 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3269 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457569 Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fde6ed96c78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000457569 RDX: 0000000020000040 RSI: 0000000020000000 RDI: 0000000000000000 RBP: 000000000072bf00 R08: 0000000020000340 R09: 0000000000000000 R10: 0000000000200000 R11: 0000000000000246 R12: 00007fde6ed976d4 R13: 00000000004c2c24 R14: 00000000004d4990 R15: 00000000ffffffff Allocated by task 9424: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553 __do_kmalloc mm/slab.c:3722 [inline] __kmalloc_track_caller+0x157/0x760 mm/slab.c:3737 kstrdup+0x39/0x70 mm/util.c:49 bpf_symlink+0x26/0x140 kernel/bpf/inode.c:356 vfs_symlink+0x37a/0x5d0 fs/namei.c:4127 do_symlinkat+0x242/0x2d0 fs/namei.c:4154 __do_sys_symlink fs/namei.c:4173 [inline] __se_sys_symlink fs/namei.c:4171 [inline] __x64_sys_symlink+0x59/0x80 fs/namei.c:4171 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 9425: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 __cache_free mm/slab.c:3498 [inline] kfree+0xcf/0x230 mm/slab.c:3817 bpf_evict_inode+0x11f/0x150 kernel/bpf/inode.c:565 evict+0x4b9/0x980 fs/inode.c:558 iput_final fs/inode.c:1550 [inline] iput+0x674/0xa90 fs/inode.c:1576 do_unlinkat+0x733/0xa30 fs/namei.c:4069 __do_sys_unlink fs/namei.c:4110 [inline] __se_sys_unlink fs/namei.c:4108 [inline] __x64_sys_unlink+0x42/0x50 fs/namei.c:4108 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe In this scenario path lookup under RCU is racing with the final unlink in case of symlinks. As Linus puts it in his analysis: [...] We actually RCU-delay the inode freeing itself, but when we do the final iput(), the "evict()" function is called synchronously. Now, the simple fix would seem to just RCU-delay the kfree() of the symlink data in bpf_evict_inode(). Maybe that's the right thing to do. [...] Al suggested to piggy-back on the ->destroy_inode() callback in order to implement RCU deferral there which can then kfree() the inode->i_link eventually right before putting inode back into inode cache. By reusing free_inode_nonrcu() from there we can avoid the need for our own inode cache and just reuse generic one as we currently do. And in-fact on top of all this we should just get rid of the bpf_evict_inode() entirely. This means truncate_inode_pages_final() and clear_inode() will then simply be called by the fs core via evict(). Dropping the reference should really only be done when inode is unhashed and nothing reachable anymore, so it's better also moved into the final ->destroy_inode() callback. Fixes: 0f98621bef5d ("bpf, inode: add support for symlinks and fix mtime/ctime") Reported-by: syzbot+fb731ca573367b7f6564@syzkaller.appspotmail.com Reported-by: syzbot+a13e5ead792d6df37818@syzkaller.appspotmail.com Reported-by: syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/lkml/0000000000006946d2057bbd0eef@google.com/T/ --- kernel/bpf/inode.c | 32 ++++++++++++++++++-------------- 1 file changed, 18 insertions(+), 14 deletions(-) diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c index 2ada5e21dfa6..4a8f390a2b82 100644 --- a/kernel/bpf/inode.c +++ b/kernel/bpf/inode.c @@ -554,19 +554,6 @@ struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type typ } EXPORT_SYMBOL(bpf_prog_get_type_path); -static void bpf_evict_inode(struct inode *inode) -{ - enum bpf_type type; - - truncate_inode_pages_final(&inode->i_data); - clear_inode(inode); - - if (S_ISLNK(inode->i_mode)) - kfree(inode->i_link); - if (!bpf_inode_type(inode, &type)) - bpf_any_put(inode->i_private, type); -} - /* * Display the mount options in /proc/mounts. */ @@ -579,11 +566,28 @@ static int bpf_show_options(struct seq_file *m, struct dentry *root) return 0; } +static void bpf_destroy_inode_deferred(struct rcu_head *head) +{ + struct inode *inode = container_of(head, struct inode, i_rcu); + enum bpf_type type; + + if (S_ISLNK(inode->i_mode)) + kfree(inode->i_link); + if (!bpf_inode_type(inode, &type)) + bpf_any_put(inode->i_private, type); + free_inode_nonrcu(inode); +} + +static void bpf_destroy_inode(struct inode *inode) +{ + call_rcu(&inode->i_rcu, bpf_destroy_inode_deferred); +} + static const struct super_operations bpf_super_ops = { .statfs = simple_statfs, .drop_inode = generic_delete_inode, .show_options = bpf_show_options, - .evict_inode = bpf_evict_inode, + .destroy_inode = bpf_destroy_inode, }; enum { -- 2.17.1 ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 22:04 ` Daniel Borkmann @ 2019-03-25 22:13 ` Linus Torvalds 2019-03-25 22:41 ` Daniel Borkmann 2019-03-25 22:49 ` Al Viro 1 sibling, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2019-03-25 22:13 UTC (permalink / raw) To: Daniel Borkmann Cc: Al Viro, syzbot, Alexei Starovoitov, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Mon, Mar 25, 2019 at 3:04 PM Daniel Borkmann <daniel@iogearbox.net> wrote: > > I'm fine either way, I think the rcu_destroy_inode would indeed simplify > it nicely. In any case fwiw, here's what I'd have ready for standby on bpf > side and tested as well. Decided to get rid of bpf_evict_inode() entirely > since the only callback we'd really need is on final inode destruction: Yes, this looks correct to me. I think this is worth doing regardless. Even if we then make the vfs layer add that rcu_destroy_inode(), that will fit very well with this patch, and getting rid of the special bpf_evict_inode() logic and just letting the normal vfs inode cleanup happen looks like the right thing. So ack from me on your > Subject: [PATCH bpf] bpf: fix use after free in bpf_evict_inode patch regardless of what else we might end up doing in this area to clean things up. Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 22:13 ` Linus Torvalds @ 2019-03-25 22:41 ` Daniel Borkmann 0 siblings, 0 replies; 37+ messages in thread From: Daniel Borkmann @ 2019-03-25 22:41 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, syzbot, Alexei Starovoitov, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On 03/25/2019 11:13 PM, Linus Torvalds wrote: > On Mon, Mar 25, 2019 at 3:04 PM Daniel Borkmann <daniel@iogearbox.net> wrote: >> >> I'm fine either way, I think the rcu_destroy_inode would indeed simplify >> it nicely. In any case fwiw, here's what I'd have ready for standby on bpf >> side and tested as well. Decided to get rid of bpf_evict_inode() entirely >> since the only callback we'd really need is on final inode destruction: > > Yes, this looks correct to me. > > I think this is worth doing regardless. Even if we then make the vfs > layer add that rcu_destroy_inode(), that will fit very well with this > patch, and getting rid of the special bpf_evict_inode() logic and just > letting the normal vfs inode cleanup happen looks like the right > thing. > > So ack from me on your > >> Subject: [PATCH bpf] bpf: fix use after free in bpf_evict_inode > > patch regardless of what else we might end up doing in this area to > clean things up. Ok, thanks a lot. In that case, I just pushed this one to bpf tree, and it will come your way the usual route via David. Thanks, Daniel ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 22:04 ` Daniel Borkmann 2019-03-25 22:13 ` Linus Torvalds @ 2019-03-25 22:49 ` Al Viro 1 sibling, 0 replies; 37+ messages in thread From: Al Viro @ 2019-03-25 22:49 UTC (permalink / raw) To: Daniel Borkmann Cc: Linus Torvalds, syzbot, Alexei Starovoitov, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Mon, Mar 25, 2019 at 11:04:53PM +0100, Daniel Borkmann wrote: > +static void bpf_destroy_inode_deferred(struct rcu_head *head) > +{ > + struct inode *inode = container_of(head, struct inode, i_rcu); > + enum bpf_type type; > + > + if (S_ISLNK(inode->i_mode)) > + kfree(inode->i_link); > + if (!bpf_inode_type(inode, &type)) > + bpf_any_put(inode->i_private, type); > + free_inode_nonrcu(inode); > +} Umm... Can bpf_any_put() can be called in that context? It's going to run in softirq; if that's OK (I'm not familiar enough with BPF guts to tell), sure, that's probably the best approach. Note, BTW, that it's obviously -stable fodder, so any splitting of ->destroy_inode() would live on top of that fix. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 21:45 ` Linus Torvalds 2019-03-25 22:04 ` Daniel Borkmann @ 2019-03-25 23:37 ` Al Viro 2019-03-25 23:44 ` Alexei Starovoitov ` (2 more replies) 1 sibling, 3 replies; 37+ messages in thread From: Al Viro @ 2019-03-25 23:37 UTC (permalink / raw) To: Linus Torvalds Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Mon, Mar 25, 2019 at 02:45:00PM -0700, Linus Torvalds wrote: > On Mon, Mar 25, 2019 at 2:14 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > Maybe, but we really need to come up with sane documentation on the > > entire drop_inode/evict_inode/destroy_inode/rcu_destroy_inode > > group ;-/ > > Yeah. > > I actually think the "destroy_inode/rcu_destroy_inode" part is the > simplest one to understand: it's just tearing down the inode, and the > RCU part is (obviously, as shown by this thread) somewhat subtle, but > at the same time not really all that hard to explain: "inodes that can > be looked up through RCU lookup need to be destroyed with RCU delay". ... with at least "but remember that call_rcu()-scheduled stuff runs in softirq, so there's a limit to what you can defer there; furthermore, there are implications for any spinlocks taken in that callback". > I think drop_inode->evict_inode is arguably the much more subtle > stuff. That's the "we're not destroying things, we're making decisions > about the state" kind of thing. It's worse than that - another bloody subtle question is the location of clear_inode() wrt the other work done in ->evict_inode() ;-/ > And we actually don't have any documentation at all for those two. > Well, there's the "porting" thing, but.. > > > And I want to understand the writeback-related issues > > in ocfs2 and f2fs - the current kludges in those smell fishy. > > I'm assuming you're talking about exactly that drop_inode() kind of subtlety. Yes. > At least for ocfs2, the whole "rcu_destroy_inode()" patch I sent out > would simplify things a lot. *All* that the ocfs2_destroy_inode() > function does is to schedule the inode freeing for RCU delay. > > Assuming my patch is fine (as mentioned, it was entirely untested), > that patch would actually simplify that a lot. Get rid of > ocfs2_destroy_inode() entirely, and just make > ocfs2_rcu_destroy_inode() do that kmem_cache_free(). Mucho clean-up > (we have that ocfs2_rcu_destroy_inode() currently as > ocfs2_i_callback(), but with the ugly rcu-head interface). Sure, but I wonder if cleaner solution (if any) for whatever is going on in their drop_inode/evict_inode area might spill into destroy_inode... I certainly agree that having destroy_inode consist just of call_rcu() is very common. FWIW, non-trivial exceptions from that pattern are: fs/afs/super.c:673:static void afs_destroy_inode(struct inode *inode) fs/btrfs/inode.c:9215:void btrfs_destroy_inode(struct inode *inode) fs/btrfs/inode.c:9202:void btrfs_test_destroy_inode(struct inode *inode) fs/ceph/inode.c:530:void ceph_destroy_inode(struct inode *inode) fs/ecryptfs/super.c:88:static void ecryptfs_destroy_inode(struct inode *inode) fs/ext4/super.c:1106:static void ext4_destroy_inode(struct inode *inode) fs/inode.c:228:void free_inode_nonrcu(struct inode *inode) fs/fuse/inode.c:116:static void fuse_destroy_inode(struct inode *inode) fs/hugetlbfs/inode.c:1052:static void hugetlbfs_destroy_inode(struct inode *inode) fs/jfs/super.c:134:static void jfs_destroy_inode(struct inode *inode) fs/ntfs/inode.c:341:void ntfs_destroy_big_inode(struct inode *inode) fs/overlayfs/super.c:200:static void ovl_destroy_inode(struct inode *inode) mm/shmem.c:3646:static void shmem_destroy_inode(struct inode *inode) net/socket.c:266:static void sock_destroy_inode(struct inode *inode) fs/ubifs/super.c:282:static void ubifs_destroy_inode(struct inode *inode) fs/xfs/xfs_super.c:969:xfs_fs_destroy_inode( Some of those could convert to that pattern, so it's even fewer than that. And getting rid of container_of in those callbacks would be nice too. One obvious observation, though - we want the actual fixes to go before any method changes, for backporting reasons. For debugfs it's clearly "use default ->evict_inode(), have explicit ->destroy_inode() using free_inode_nonrcu()" - there we have nothing else done in ->evict_inode() and kfree is obviously safe in softirq. I'll post that (or push to vfs.git#fixes), along with minimal fixes for other 3. If bpf_any_put() is softirq-safe, we'll have the full set for -stable and the rest could be done on top of that. Won't solve the documetation problem, unfortunately ;-/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 23:37 ` Al Viro @ 2019-03-25 23:44 ` Alexei Starovoitov 2019-03-26 0:21 ` Al Viro 2019-03-26 1:38 ` ceph: fix use-after-free on symlink traversal Al Viro 2019-03-26 1:45 ` KASAN: use-after-free Read in path_lookupat Al Viro 2 siblings, 1 reply; 37+ messages in thread From: Alexei Starovoitov @ 2019-03-25 23:44 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Mon, Mar 25, 2019 at 11:37:32PM +0000, Al Viro wrote: > > For debugfs it's clearly "use default ->evict_inode(), have explicit > ->destroy_inode() using free_inode_nonrcu()" - there we have nothing > else done in ->evict_inode() and kfree is obviously safe in softirq. > I'll post that (or push to vfs.git#fixes), along with minimal fixes > for other 3. If bpf_any_put() is softirq-safe, we'll have the full > set for -stable and the rest could be done on top of that. Yes. bpf_any_put() is safe to be called from softirq. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 23:44 ` Alexei Starovoitov @ 2019-03-26 0:21 ` Al Viro 0 siblings, 0 replies; 37+ messages in thread From: Al Viro @ 2019-03-26 0:21 UTC (permalink / raw) To: Alexei Starovoitov Cc: Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Mon, Mar 25, 2019 at 04:44:40PM -0700, Alexei Starovoitov wrote: > On Mon, Mar 25, 2019 at 11:37:32PM +0000, Al Viro wrote: > > > > For debugfs it's clearly "use default ->evict_inode(), have explicit > > ->destroy_inode() using free_inode_nonrcu()" - there we have nothing > > else done in ->evict_inode() and kfree is obviously safe in softirq. > > I'll post that (or push to vfs.git#fixes), along with minimal fixes > > for other 3. If bpf_any_put() is softirq-safe, we'll have the full > > set for -stable and the rest could be done on top of that. > > Yes. bpf_any_put() is safe to be called from softirq. In that case, Acked-by: Al Viro <viro@zeniv.linux.org.uk> on Daniel's patch. ^ permalink raw reply [flat|nested] 37+ messages in thread
* ceph: fix use-after-free on symlink traversal 2019-03-25 23:37 ` Al Viro 2019-03-25 23:44 ` Alexei Starovoitov @ 2019-03-26 1:38 ` Al Viro 2019-03-26 1:39 ` jffs2: " Al Viro ` (4 more replies) 2019-03-26 1:45 ` KASAN: use-after-free Read in path_lookupat Al Viro 2 siblings, 5 replies; 37+ messages in thread From: Al Viro @ 2019-03-26 1:38 UTC (permalink / raw) To: Linus Torvalds Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs free the symlink body after the same RCU delay we have for freeing the struct inode itself, so that traversal during RCU pathwalk wouldn't step into freed memory. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index e3346628efe2..2d61ddda9bf5 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -524,6 +524,7 @@ static void ceph_i_callback(struct rcu_head *head) struct inode *inode = container_of(head, struct inode, i_rcu); struct ceph_inode_info *ci = ceph_inode(inode); + kfree(ci->i_symlink); kmem_cache_free(ceph_inode_cachep, ci); } @@ -566,7 +567,6 @@ void ceph_destroy_inode(struct inode *inode) } } - kfree(ci->i_symlink); while ((n = rb_first(&ci->i_fragtree)) != NULL) { frag = rb_entry(n, struct ceph_inode_frag, node); rb_erase(n, &ci->i_fragtree); ^ permalink raw reply related [flat|nested] 37+ messages in thread
* jffs2: fix use-after-free on symlink traversal 2019-03-26 1:38 ` ceph: fix use-after-free on symlink traversal Al Viro @ 2019-03-26 1:39 ` Al Viro 2019-03-26 1:40 ` ubifs: " Al Viro ` (3 subsequent siblings) 4 siblings, 0 replies; 37+ messages in thread From: Al Viro @ 2019-03-26 1:39 UTC (permalink / raw) To: Linus Torvalds Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs free the symlink body after the same RCU delay we have for freeing the struct inode itself, so that traversal during RCU pathwalk wouldn't step into freed memory. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- diff --git a/fs/jffs2/readinode.c b/fs/jffs2/readinode.c index 389ea53ea487..bccfc40b3a74 100644 --- a/fs/jffs2/readinode.c +++ b/fs/jffs2/readinode.c @@ -1414,11 +1414,6 @@ void jffs2_do_clear_inode(struct jffs2_sb_info *c, struct jffs2_inode_info *f) jffs2_kill_fragtree(&f->fragtree, deleted?c:NULL); - if (f->target) { - kfree(f->target); - f->target = NULL; - } - fds = f->dents; while(fds) { fd = fds; diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c index bb6ae387469f..05d892c79339 100644 --- a/fs/jffs2/super.c +++ b/fs/jffs2/super.c @@ -47,7 +47,10 @@ static struct inode *jffs2_alloc_inode(struct super_block *sb) static void jffs2_i_callback(struct rcu_head *head) { struct inode *inode = container_of(head, struct inode, i_rcu); - kmem_cache_free(jffs2_inode_cachep, JFFS2_INODE_INFO(inode)); + struct jffs2_inode_info *f = JFFS2_INODE_INFO(inode); + + kfree(f->target); + kmem_cache_free(jffs2_inode_cachep, f); } static void jffs2_destroy_inode(struct inode *inode) ^ permalink raw reply related [flat|nested] 37+ messages in thread
* ubifs: fix use-after-free on symlink traversal 2019-03-26 1:38 ` ceph: fix use-after-free on symlink traversal Al Viro 2019-03-26 1:39 ` jffs2: " Al Viro @ 2019-03-26 1:40 ` Al Viro 2019-03-26 1:43 ` debugfs: " Al Viro ` (2 subsequent siblings) 4 siblings, 0 replies; 37+ messages in thread From: Al Viro @ 2019-03-26 1:40 UTC (permalink / raw) To: Linus Torvalds Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs free the symlink body after the same RCU delay we have for freeing the struct inode itself, so that traversal during RCU pathwalk wouldn't step into freed memory. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c index 8dc2818fdd84..12628184772c 100644 --- a/fs/ubifs/super.c +++ b/fs/ubifs/super.c @@ -276,14 +276,12 @@ static void ubifs_i_callback(struct rcu_head *head) { struct inode *inode = container_of(head, struct inode, i_rcu); struct ubifs_inode *ui = ubifs_inode(inode); + kfree(ui->data); kmem_cache_free(ubifs_inode_slab, ui); } static void ubifs_destroy_inode(struct inode *inode) { - struct ubifs_inode *ui = ubifs_inode(inode); - - kfree(ui->data); call_rcu(&inode->i_rcu, ubifs_i_callback); } ^ permalink raw reply related [flat|nested] 37+ messages in thread
* debugfs: fix use-after-free on symlink traversal 2019-03-26 1:38 ` ceph: fix use-after-free on symlink traversal Al Viro 2019-03-26 1:39 ` jffs2: " Al Viro 2019-03-26 1:40 ` ubifs: " Al Viro @ 2019-03-26 1:43 ` Al Viro 2019-03-26 10:41 ` ceph: " Jeff Layton 2019-03-26 11:38 ` Ilya Dryomov 4 siblings, 0 replies; 37+ messages in thread From: Al Viro @ 2019-03-26 1:43 UTC (permalink / raw) To: Linus Torvalds Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs symlink body shouldn't be freed without an RCU delay. Switch debugfs to ->destroy_inode() and use of call_rcu(); free both the inode and symlink body in the callback. Similar to solution for bpf, only here it's even more obvious that ->evict_inode() can be dropped. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c index 95b5e78c22b1..f25daa207421 100644 --- a/fs/debugfs/inode.c +++ b/fs/debugfs/inode.c @@ -163,19 +163,24 @@ static int debugfs_show_options(struct seq_file *m, struct dentry *root) return 0; } -static void debugfs_evict_inode(struct inode *inode) +static void debugfs_i_callback(struct rcu_head *head) { - truncate_inode_pages_final(&inode->i_data); - clear_inode(inode); + struct inode *inode = container_of(head, struct inode, i_rcu); if (S_ISLNK(inode->i_mode)) kfree(inode->i_link); + free_inode_nonrcu(inode); +} + +static void debugfs_destroy_inode(struct inode *inode) +{ + call_rcu(&inode->i_rcu, debugfs_i_callback); } static const struct super_operations debugfs_super_operations = { .statfs = simple_statfs, .remount_fs = debugfs_remount, .show_options = debugfs_show_options, - .evict_inode = debugfs_evict_inode, + .destroy_inode = debugfs_destroy_inode, }; static void debugfs_release_dentry(struct dentry *dentry) ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: ceph: fix use-after-free on symlink traversal 2019-03-26 1:38 ` ceph: fix use-after-free on symlink traversal Al Viro ` (2 preceding siblings ...) 2019-03-26 1:43 ` debugfs: " Al Viro @ 2019-03-26 10:41 ` Jeff Layton 2019-03-26 11:38 ` Ilya Dryomov 4 siblings, 0 replies; 37+ messages in thread From: Jeff Layton @ 2019-03-26 10:41 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Mon, Mar 25, 2019 at 9:39 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > free the symlink body after the same RCU delay we have for freeing the > struct inode itself, so that traversal during RCU pathwalk wouldn't step > into freed memory. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c > index e3346628efe2..2d61ddda9bf5 100644 > --- a/fs/ceph/inode.c > +++ b/fs/ceph/inode.c > @@ -524,6 +524,7 @@ static void ceph_i_callback(struct rcu_head *head) > struct inode *inode = container_of(head, struct inode, i_rcu); > struct ceph_inode_info *ci = ceph_inode(inode); > > + kfree(ci->i_symlink); > kmem_cache_free(ceph_inode_cachep, ci); > } > > @@ -566,7 +567,6 @@ void ceph_destroy_inode(struct inode *inode) > } > } > > - kfree(ci->i_symlink); > while ((n = rb_first(&ci->i_fragtree)) != NULL) { > frag = rb_entry(n, struct ceph_inode_frag, node); > rb_erase(n, &ci->i_fragtree); Reviewed-by: Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ceph: fix use-after-free on symlink traversal 2019-03-26 1:38 ` ceph: fix use-after-free on symlink traversal Al Viro ` (3 preceding siblings ...) 2019-03-26 10:41 ` ceph: " Jeff Layton @ 2019-03-26 11:38 ` Ilya Dryomov 4 siblings, 0 replies; 37+ messages in thread From: Ilya Dryomov @ 2019-03-26 11:38 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Tue, Mar 26, 2019 at 2:39 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > free the symlink body after the same RCU delay we have for freeing the > struct inode itself, so that traversal during RCU pathwalk wouldn't step > into freed memory. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c > index e3346628efe2..2d61ddda9bf5 100644 > --- a/fs/ceph/inode.c > +++ b/fs/ceph/inode.c > @@ -524,6 +524,7 @@ static void ceph_i_callback(struct rcu_head *head) > struct inode *inode = container_of(head, struct inode, i_rcu); > struct ceph_inode_info *ci = ceph_inode(inode); > > + kfree(ci->i_symlink); > kmem_cache_free(ceph_inode_cachep, ci); > } > > @@ -566,7 +567,6 @@ void ceph_destroy_inode(struct inode *inode) > } > } > > - kfree(ci->i_symlink); > while ((n = rb_first(&ci->i_fragtree)) != NULL) { > frag = rb_entry(n, struct ceph_inode_frag, node); > rb_erase(n, &ci->i_fragtree); Al, I see you directed this patch at Linus instead of ceph-devel. I can pick it up for -rc3 as I have an important libceph fix pending anyway. Let me know if you want me to handle it. Thanks, Ilya ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 23:37 ` Al Viro 2019-03-25 23:44 ` Alexei Starovoitov 2019-03-26 1:38 ` ceph: fix use-after-free on symlink traversal Al Viro @ 2019-03-26 1:45 ` Al Viro 2019-04-10 18:11 ` Al Viro 2 siblings, 1 reply; 37+ messages in thread From: Al Viro @ 2019-03-26 1:45 UTC (permalink / raw) To: Linus Torvalds Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Mon, Mar 25, 2019 at 11:37:32PM +0000, Al Viro wrote: > For debugfs it's clearly "use default ->evict_inode(), have explicit > ->destroy_inode() using free_inode_nonrcu()" - there we have nothing > else done in ->evict_inode() and kfree is obviously safe in softirq. > I'll post that (or push to vfs.git#fixes), along with minimal fixes > for other 3. If bpf_any_put() is softirq-safe, we'll have the full > set for -stable and the rest could be done on top of that. > > Won't solve the documetation problem, unfortunately ;-/ Posted; all of those (as well as Daniel's bpf patch) are Cc:stable fodder. Documentation is still, er, deficient... ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-26 1:45 ` KASAN: use-after-free Read in path_lookupat Al Viro @ 2019-04-10 18:11 ` Al Viro 2019-04-10 19:44 ` Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Al Viro @ 2019-04-10 18:11 UTC (permalink / raw) To: Linus Torvalds Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs, James Morris, linux-security-module On Tue, Mar 26, 2019 at 01:45:52AM +0000, Al Viro wrote: > On Mon, Mar 25, 2019 at 11:37:32PM +0000, Al Viro wrote: > > > For debugfs it's clearly "use default ->evict_inode(), have explicit > > ->destroy_inode() using free_inode_nonrcu()" - there we have nothing > > else done in ->evict_inode() and kfree is obviously safe in softirq. > > I'll post that (or push to vfs.git#fixes), along with minimal fixes > > for other 3. If bpf_any_put() is softirq-safe, we'll have the full > > set for -stable and the rest could be done on top of that. > > > > Won't solve the documetation problem, unfortunately ;-/ > > Posted; all of those (as well as Daniel's bpf patch) are Cc:stable > fodder. Documentation is still, er, deficient... ... and unfortunately there are two more, exactly like debugfs - securityfs and apparmorfs, found while sorting out the series for separate rcu-delayed counterpart of ->destroy_inode(). Both are in vfs.git#fixes. Which way should that go - directly or via linux-security.git? Both are stable fodder, in theory, but much harder to hit than their ubifs/debugfs/bpf counterparts... ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-04-10 18:11 ` Al Viro @ 2019-04-10 19:44 ` Linus Torvalds 0 siblings, 0 replies; 37+ messages in thread From: Linus Torvalds @ 2019-04-10 19:44 UTC (permalink / raw) To: Al Viro Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs, James Morris, linux-security-module On Wed, Apr 10, 2019 at 8:11 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > Both are in vfs.git#fixes. Which way should that go - directly or > via linux-security.git? Just do it directly. I doubt you can trigger them for securityfs and apparmourfs, since normal users have no way to remove any files from them, so the race with final unlink sounds fairly irrelevant in practice, no? Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 18:36 ` Linus Torvalds 2019-03-25 19:18 ` Linus Torvalds @ 2019-03-25 19:43 ` Al Viro 2019-03-25 22:48 ` Dave Chinner 1 sibling, 1 reply; 37+ messages in thread From: Al Viro @ 2019-03-25 19:43 UTC (permalink / raw) To: Linus Torvalds Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Mon, Mar 25, 2019 at 11:36:01AM -0700, Linus Torvalds wrote: > Right. Not just move the existing destroy_inode() - because as you > say, people may not be able to to do that in RCU contect, but split it > up, and add a "final_free_inode()" callback or something for the RCU > phase. > > In fact, I suspect that *every* user of destroy_inode() already has to > have its own RCU callback to another final freeing function anyway. Nope - pipes do not, and for a good reason. > Because they really shouldn't free the inode itself early. Maybe we > can just make that be a generic thing? Maybe... OTOH, we already have more methods on the destruction end than I'm comfortable trying to document. Because we clearly *do* need a clear set of rules along the lines of "this kind of thing belongs in this method, this - in that one". As it is, on the way to inode destruction we have 1) [kinda] ->drop_inode() - decide whether this inode is not worth keeping around in icache once the refcount reaches zero. Predicate, shouldn't change inode state at all. Common instances: * default (encoded as NULL, available as generic_drop_inode()) - "keep it around if it's still hashed and link count is non-zero". * generic_delete_inode(): "don't retain that sucker" However, 3 instances are more or less weird - f2fs, gfs and ocfs2. gfs2 one is the least odd of those, but the other two... What the hell is forced writeback doing in ocfs2_drop_inode()? If they don't want to retain anything, fine, but then why do the inode->i_state |= I_WILL_FREE; spin_unlock(&inode->i_lock); write_inode_now(inode, 1); spin_lock(&inode->i_lock); WARN_ON(inode->i_state & I_NEW); inode->i_state &= ~I_WILL_FREE; dance in ->drop_inode()? It will be immediately followed by ->evict_inode(), and it feels like the damn thing would be a lot more natural there... And what, for pity sake, f2fs is doing with truncation in that predicate, of all places? The comment in there is /* * This is to avoid a deadlock condition like below. * writeback_single_inode(inode) * - f2fs_write_data_page * - f2fs_gc -> iput -> evict * - inode_wait_for_writeback(inode) */ which looks... uninspiring, to put it mildly. 2) ->evict_inode() - called when we kick the inode out. Freeing on-disk inodes, etc. belongs there. inode is still in icache hash chain at that point, so any icache lookups for it will block until that thing is done. If we have something non-trivial done by iget... test() callbacks, we must keep the data structures needed by those. 3) ->destroy_inode() - destructor. By that point all remaining references to inode are (stale) RCU ones. The stuff that might be reached via those has to have an RCU delay between the call of ->destroy_inode() and freeing. Very commonly that's done by call_rcu(), and more often than not it's the only thing in ->destroy_inode(). However, if we know that there'll be no RCU accessors, we can do freeing immediately - pipes do just that. And the above is piss-poor as documentation goes - it doesn't answer the "where should this go?" any better than "try to see what similar filesystems are doing", which is asking for cargo-culting ;-/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 19:43 ` Al Viro @ 2019-03-25 22:48 ` Dave Chinner 2019-03-25 23:02 ` Al Viro 0 siblings, 1 reply; 37+ messages in thread From: Dave Chinner @ 2019-03-25 22:48 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs On Mon, Mar 25, 2019 at 07:43:32PM +0000, Al Viro wrote: > On Mon, Mar 25, 2019 at 11:36:01AM -0700, Linus Torvalds wrote: > > Right. Not just move the existing destroy_inode() - because as you > > say, people may not be able to to do that in RCU contect, but split it > > up, and add a "final_free_inode()" callback or something for the RCU > > phase. > > > > In fact, I suspect that *every* user of destroy_inode() already has to > > have its own RCU callback to another final freeing function anyway. > > Nope - pipes do not, and for a good reason. > > > Because they really shouldn't free the inode itself early. Maybe we > > can just make that be a generic thing? > > Maybe... OTOH, we already have more methods on the destruction end > than I'm comfortable trying to document. Because we clearly *do* > need a clear set of rules along the lines of "this kind of thing belongs > in this method, this - in that one". > > As it is, on the way to inode destruction we have > > 1) [kinda] ->drop_inode() - decide whether this inode is not worth > keeping around in icache once the refcount reaches zero. Predicate, > shouldn't change inode state at all. Common instances: > * default (encoded as NULL, available as generic_drop_inode()) - > "keep it around if it's still hashed and link count is non-zero". > * generic_delete_inode(): "don't retain that sucker" > > However, 3 instances are more or less weird - f2fs, gfs and ocfs2. > gfs2 one is the least odd of those, but the other two... > What the hell is forced writeback doing in ocfs2_drop_inode()? > If they don't want to retain anything, fine, but then why do > the > inode->i_state |= I_WILL_FREE; > spin_unlock(&inode->i_lock); > write_inode_now(inode, 1); > spin_lock(&inode->i_lock); > WARN_ON(inode->i_state & I_NEW); > inode->i_state &= ~I_WILL_FREE; > dance in ->drop_inode()? It will be immediately followed by > ->evict_inode(), and it feels like the damn thing would be > a lot more natural there... And what, for pity sake, f2fs is > doing with truncation in that predicate, of all places? The > comment in there is > /* > * This is to avoid a deadlock condition like below. > * writeback_single_inode(inode) > * - f2fs_write_data_page > * - f2fs_gc -> iput -> evict > * - inode_wait_for_writeback(inode) > */ > which looks... uninspiring, to put it mildly. > > 2) ->evict_inode() - called when we kick the inode out. Freeing > on-disk inodes, etc. belongs there. inode is still in icache > hash chain at that point, so any icache lookups for it will block > until that thing is done. If we have something non-trivial done > by iget... test() callbacks, we must keep the data structures needed > by those. > > 3) ->destroy_inode() - destructor. By that point all remaining > references to inode are (stale) RCU ones. At the VFS layer, perhaps, but the filesystem itself can still have valid references to the struct inode the VFS knows nothing about. i.e. filesystems typically wrap the struct inode in their own extended inode structure and hence have more information in them and a life cycle that the VFS knows nothing about. i.e. the VFS _does not own_ the struct inode that the filesystem allocated for it's use. XFS has an inode cache layer below the VFS that provides an inode life cycle that wraps around the outside of the VFS inode life cycle. We do not use the VFS inode cache at all, and fake the fact the inode is in the "icache hash chain" so that the VFS behaves exactly as it should. Further, we use lockless, RCU protected lookups on the XFS inode cache, so we need to control XFS inode allocation and freeing compeltely internally to XFS because it has RCU requirements the VFS struct inode doesn't have. And when it comes to VFS inode reclaim, XFS does not implement ->evict_inode because there is nothing at the VFS level to do. And ->destroy_inode ends up doing cleanup work (e.g. freeing on-disk inodes) which is non-trivial, blocking work, but then still requires the struct xfs_inode to be written back to disk before it can bei freed. So it just gets marked "reclaimable" and background reclaim then takes care of it from there so we avoid synchronous IO in inode reclaim... This works because don't track dirty inode metadata in the VFS writeback code (it's tracked with much more precision in the XFS log infrastructure) and we don't write back inodes from the VFS infrastructure, either. It's all done based on internal state outside the VFS. And, because of this, the VFS cannot assume that it can free the struct inode after calling ->destroy_inode or even use call_rcu() to run a filesystem destructor because the filesystem may need to do work that needs to block and that's not allowed in an RCU callback... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 22:48 ` Dave Chinner @ 2019-03-25 23:02 ` Al Viro [not found] ` <CAGe7X7mb=gK7zhSwmT_6mmmkcbjhZAOb=wj31BdUcHkNUPsm2Q@mail.gmail.com> 2019-03-27 17:22 ` Jan Kara 0 siblings, 2 replies; 37+ messages in thread From: Al Viro @ 2019-03-25 23:02 UTC (permalink / raw) To: Dave Chinner Cc: Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs, Jan Kara, Jaegeuk Kim, Joel Becker, Mark Fasheh On Tue, Mar 26, 2019 at 09:48:23AM +1100, Dave Chinner wrote: > And when it comes to VFS inode reclaim, XFS does not implement > ->evict_inode because there is nothing at the VFS level to do. > And ->destroy_inode ends up doing cleanup work (e.g. freeing on-disk > inodes) which is non-trivial, blocking work, but then still requires > the struct xfs_inode to be written back to disk before it can bei > freed. So it just gets marked "reclaimable" and background reclaim > then takes care of it from there so we avoid synchronous IO in inode > reclaim... > > This works because don't track dirty inode metadata in the VFS > writeback code (it's tracked with much more precision in the XFS log > infrastructure) and we don't write back inodes from the VFS > infrastructure, either. It's all done based on internal state > outside the VFS. > > And, because of this, the VFS cannot assume that it can free > the struct inode after calling ->destroy_inode or even use > call_rcu() to run a filesystem destructor because the filesystem > may need to do work that needs to block and that's not allowed in an > RCU callback... In Linus' patch that's what you get with non-NULL ->destroy_inode + NULL ->destroy_inode_rcu, so XFS won't be screwed by that. Said that, yes, XFS adds another fun twist there (AFAICS, it's the only in-tree filesystem that pulls that off). I would really like some comments from f2fs and ocfs2 folks, as well as Jan - he's had much more recent contact with writeback code than I have... Could somebody explain what's going on in f2fs and ocfs2 ->drop_inode()? It _should_ be just a predicate; looks like both are playing very odd games to work around writeback problems and I wonder if there's a cleaner solution for that. I can try and dig through maillist(s) archives, but I would really appreciate it if somebody could give a braindump on the issues dealt with in there... ^ permalink raw reply [flat|nested] 37+ messages in thread
[parent not found: <CAGe7X7mb=gK7zhSwmT_6mmmkcbjhZAOb=wj31BdUcHkNUPsm2Q@mail.gmail.com>]
* Re: KASAN: use-after-free Read in path_lookupat [not found] ` <CAGe7X7mb=gK7zhSwmT_6mmmkcbjhZAOb=wj31BdUcHkNUPsm2Q@mail.gmail.com> @ 2019-03-26 4:15 ` Al Viro 2019-03-27 16:58 ` Jan Kara 0 siblings, 1 reply; 37+ messages in thread From: Al Viro @ 2019-03-26 4:15 UTC (permalink / raw) To: Mark Fasheh Cc: Dave Chinner, Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs, Jan Kara, Jaegeuk Kim, Joel Becker On Mon, Mar 25, 2019 at 08:18:25PM -0700, Mark Fasheh wrote: > Hey Al, > > It's been a while since I've looked at that bit of code but it looks like > Ocfs2 is syncing the inode to disk and disposing of it's memory > representation (which would include the cluster locks held) so that other > nodes get a chance to delete the potentially orphaned inode. In Ocfs2 we > won't delete an inode if it exists in another nodes cache. Wait a sec - what's the reason for forcing that write_inode_now(); why doesn't the normal mechanism work? I'm afraid I still don't get it - we do wait for writeback in evict_inode(), or the local filesystems wouldn't work. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-26 4:15 ` Al Viro @ 2019-03-27 16:58 ` Jan Kara 2019-03-27 18:59 ` Al Viro 0 siblings, 1 reply; 37+ messages in thread From: Jan Kara @ 2019-03-27 16:58 UTC (permalink / raw) To: Al Viro Cc: Mark Fasheh, Dave Chinner, Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs, Jan Kara, Jaegeuk Kim, Joel Becker On Tue 26-03-19 04:15:10, Al Viro wrote: > On Mon, Mar 25, 2019 at 08:18:25PM -0700, Mark Fasheh wrote: > > > Hey Al, > > > > It's been a while since I've looked at that bit of code but it looks like > > Ocfs2 is syncing the inode to disk and disposing of it's memory > > representation (which would include the cluster locks held) so that other > > nodes get a chance to delete the potentially orphaned inode. In Ocfs2 we > > won't delete an inode if it exists in another nodes cache. > > Wait a sec - what's the reason for forcing that write_inode_now(); why > doesn't the normal mechanism work? I'm afraid I still don't get it - > we do wait for writeback in evict_inode(), or the local filesystems > wouldn't work. I'm just guessing here but they don't want an inode cached once its last dentry goes away (it makes cluster wide synchronization easier for them and they do play tricks with cluster lock on dentries). There is some info in 513e2dae9422 "ocfs2: flush inode data to disk and free inode when i_count becomes zero" which adds this ocfs2_drop_inode() implementation. So when the last inode reference is dropped, they want to flush any dirty data to disk and evict the inode. But AFAICT they should be fine with flushing the inode from their ->evict_inode method. I_FREEING just stops the flusher thread from touching the inode but explicit writeback through write_inode_now(inode, 1) should go through just fine. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-27 16:58 ` Jan Kara @ 2019-03-27 18:59 ` Al Viro 2019-03-28 9:00 ` Jan Kara 0 siblings, 1 reply; 37+ messages in thread From: Al Viro @ 2019-03-27 18:59 UTC (permalink / raw) To: Jan Kara Cc: Mark Fasheh, Dave Chinner, Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs, Jaegeuk Kim, Joel Becker On Wed, Mar 27, 2019 at 05:58:31PM +0100, Jan Kara wrote: > On Tue 26-03-19 04:15:10, Al Viro wrote: > > On Mon, Mar 25, 2019 at 08:18:25PM -0700, Mark Fasheh wrote: > > > > > Hey Al, > > > > > > It's been a while since I've looked at that bit of code but it looks like > > > Ocfs2 is syncing the inode to disk and disposing of it's memory > > > representation (which would include the cluster locks held) so that other > > > nodes get a chance to delete the potentially orphaned inode. In Ocfs2 we > > > won't delete an inode if it exists in another nodes cache. > > > > Wait a sec - what's the reason for forcing that write_inode_now(); why > > doesn't the normal mechanism work? I'm afraid I still don't get it - > > we do wait for writeback in evict_inode(), or the local filesystems > > wouldn't work. > > I'm just guessing here but they don't want an inode cached once its last > dentry goes away (it makes cluster wide synchronization easier for them and > they do play tricks with cluster lock on dentries). Sure, but that's as simple as "return 1 from ->drop_inode()". > There is some info in > 513e2dae9422 "ocfs2: flush inode data to disk and free inode when i_count > becomes zero" which adds this ocfs2_drop_inode() implementation. So when > the last inode reference is dropped, they want to flush any dirty data to > disk and evict the inode. But AFAICT they should be fine with flushing the > inode from their ->evict_inode method. I_FREEING just stops the flusher > thread from touching the inode but explicit writeback through > write_inode_now(inode, 1) should go through just fine. Umm... Why is that write_inode_now() needed in either place? I agree that moving it to ->evict_inode() ought to be safe, but what makes it necessary in the first place? Put it another way, what dirties the data and/or metadata without marking it dirty? ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-27 18:59 ` Al Viro @ 2019-03-28 9:00 ` Jan Kara 0 siblings, 0 replies; 37+ messages in thread From: Jan Kara @ 2019-03-28 9:00 UTC (permalink / raw) To: Al Viro Cc: Jan Kara, Mark Fasheh, Dave Chinner, Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs, Jaegeuk Kim, Joel Becker On Wed 27-03-19 18:59:48, Al Viro wrote: > On Wed, Mar 27, 2019 at 05:58:31PM +0100, Jan Kara wrote: > > On Tue 26-03-19 04:15:10, Al Viro wrote: > > > On Mon, Mar 25, 2019 at 08:18:25PM -0700, Mark Fasheh wrote: > > > > > > > Hey Al, > > > > > > > > It's been a while since I've looked at that bit of code but it looks like > > > > Ocfs2 is syncing the inode to disk and disposing of it's memory > > > > representation (which would include the cluster locks held) so that other > > > > nodes get a chance to delete the potentially orphaned inode. In Ocfs2 we > > > > won't delete an inode if it exists in another nodes cache. > > > > > > Wait a sec - what's the reason for forcing that write_inode_now(); why > > > doesn't the normal mechanism work? I'm afraid I still don't get it - > > > we do wait for writeback in evict_inode(), or the local filesystems > > > wouldn't work. > > > > I'm just guessing here but they don't want an inode cached once its last > > dentry goes away (it makes cluster wide synchronization easier for them and > > they do play tricks with cluster lock on dentries). > > Sure, but that's as simple as "return 1 from ->drop_inode()". Right. > > There is some info in > > 513e2dae9422 "ocfs2: flush inode data to disk and free inode when i_count > > becomes zero" which adds this ocfs2_drop_inode() implementation. So when > > the last inode reference is dropped, they want to flush any dirty data to > > disk and evict the inode. But AFAICT they should be fine with flushing the > > inode from their ->evict_inode method. I_FREEING just stops the flusher > > thread from touching the inode but explicit writeback through > > write_inode_now(inode, 1) should go through just fine. > > Umm... Why is that write_inode_now() needed in either place? I agree that > moving it to ->evict_inode() ought to be safe, but what makes it necessary > in the first place? Put it another way, what dirties the data and/or > metadata without marking it dirty? Well, the inode & pages are marked dirty and they are dirty when we get to iput_final(). But if ->drop_inode() returns 1 (which normally happens only for unlinked files), we will not write out the inode in iput_final() and the dirty data just gets discarded in ->evict_inode(). OCFS2 doesn't want this so they have to write-out by hand. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: KASAN: use-after-free Read in path_lookupat 2019-03-25 23:02 ` Al Viro [not found] ` <CAGe7X7mb=gK7zhSwmT_6mmmkcbjhZAOb=wj31BdUcHkNUPsm2Q@mail.gmail.com> @ 2019-03-27 17:22 ` Jan Kara 1 sibling, 0 replies; 37+ messages in thread From: Jan Kara @ 2019-03-27 17:22 UTC (permalink / raw) To: Al Viro Cc: Dave Chinner, Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs, Jan Kara, Jaegeuk Kim, Joel Becker, Mark Fasheh On Mon 25-03-19 23:02:11, Al Viro wrote: > On Tue, Mar 26, 2019 at 09:48:23AM +1100, Dave Chinner wrote: > > > And when it comes to VFS inode reclaim, XFS does not implement > > ->evict_inode because there is nothing at the VFS level to do. > > And ->destroy_inode ends up doing cleanup work (e.g. freeing on-disk > > inodes) which is non-trivial, blocking work, but then still requires > > the struct xfs_inode to be written back to disk before it can bei > > freed. So it just gets marked "reclaimable" and background reclaim > > then takes care of it from there so we avoid synchronous IO in inode > > reclaim... > > > > This works because don't track dirty inode metadata in the VFS > > writeback code (it's tracked with much more precision in the XFS log > > infrastructure) and we don't write back inodes from the VFS > > infrastructure, either. It's all done based on internal state > > outside the VFS. > > > > And, because of this, the VFS cannot assume that it can free > > the struct inode after calling ->destroy_inode or even use > > call_rcu() to run a filesystem destructor because the filesystem > > may need to do work that needs to block and that's not allowed in an > > RCU callback... > > In Linus' patch that's what you get with non-NULL ->destroy_inode > + NULL ->destroy_inode_rcu, so XFS won't be screwed by that. > Said that, yes, XFS adds another fun twist there (AFAICS, it's > the only in-tree filesystem that pulls that off). > > I would really like some comments from f2fs and ocfs2 folks, as well > as Jan - he's had much more recent contact with writeback code than > I have... Could somebody explain what's going on in f2fs and ocfs2 > ->drop_inode()? It _should_ be just a predicate; looks like both > are playing very odd games to work around writeback problems and > I wonder if there's a cleaner solution for that. I can try and dig > through maillist(s) archives, but I would really appreciate it > if somebody could give a braindump on the issues dealt with in there... OCFS2 is discussed elsewhere and should be relatively easy to deal with. F2FS seems harder. The problem is that AFAICS they get inode references from their garbage collection code which can get called during page writeback. And then they need to drop these references and they can be the last ones to hold the inode reference for an unlinked inode forcing flush worker into inode cleanup. Which generally causes problems and was the reason why writeback code does not take inode references but relies on I_SYNC to protect it from inode reclaim instead (see commit 169ebd90131b "writeback: Avoid iput() from flusher thread"). And they noticed the problem as well and hacked around it... Now I don't know enough about F2FS and its garbage collection to tell how they can avoid dropping inode references from flush worker context. But that's the right solution for avoiding deadlocks. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, other threads:[~2019-04-10 19:45 UTC | newest] Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-11-28 17:40 KASAN: use-after-free Read in path_lookupat syzbot 2019-03-25 0:44 ` syzbot 2019-03-25 1:25 ` Linus Torvalds 2019-03-25 1:23 ` Linus Torvalds 2019-03-25 4:57 ` Al Viro 2019-03-25 9:15 ` Daniel Borkmann 2019-03-25 11:11 ` Al Viro 2019-03-25 11:17 ` Al Viro 2019-03-25 11:21 ` Daniel Borkmann 2019-03-25 18:36 ` Linus Torvalds 2019-03-25 19:18 ` Linus Torvalds 2019-03-25 21:14 ` Al Viro 2019-03-25 21:45 ` Linus Torvalds 2019-03-25 22:04 ` Daniel Borkmann 2019-03-25 22:13 ` Linus Torvalds 2019-03-25 22:41 ` Daniel Borkmann 2019-03-25 22:49 ` Al Viro 2019-03-25 23:37 ` Al Viro 2019-03-25 23:44 ` Alexei Starovoitov 2019-03-26 0:21 ` Al Viro 2019-03-26 1:38 ` ceph: fix use-after-free on symlink traversal Al Viro 2019-03-26 1:39 ` jffs2: " Al Viro 2019-03-26 1:40 ` ubifs: " Al Viro 2019-03-26 1:43 ` debugfs: " Al Viro 2019-03-26 10:41 ` ceph: " Jeff Layton 2019-03-26 11:38 ` Ilya Dryomov 2019-03-26 1:45 ` KASAN: use-after-free Read in path_lookupat Al Viro 2019-04-10 18:11 ` Al Viro 2019-04-10 19:44 ` Linus Torvalds 2019-03-25 19:43 ` Al Viro 2019-03-25 22:48 ` Dave Chinner 2019-03-25 23:02 ` Al Viro [not found] ` <CAGe7X7mb=gK7zhSwmT_6mmmkcbjhZAOb=wj31BdUcHkNUPsm2Q@mail.gmail.com> 2019-03-26 4:15 ` Al Viro 2019-03-27 16:58 ` Jan Kara 2019-03-27 18:59 ` Al Viro 2019-03-28 9:00 ` Jan Kara 2019-03-27 17:22 ` Jan Kara
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.