linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* KASAN: use-after-free Read in path_lookupat
@ 2018-11-28 17:40 syzbot
  2019-03-25  0:44 ` syzbot
  2019-03-25  1:23 ` Linus Torvalds
  0 siblings, 2 replies; 37+ messages in thread
From: syzbot @ 2018-11-28 17:40 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, syzkaller-bugs, viro

Hello,

syzbot found the following crash on:

HEAD commit:    442b8cea2477 Add linux-next specific files for 20181109
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=129574db400000
kernel config:  https://syzkaller.appspot.com/x/.config?x=2f72bdb11df9fbe8
dashboard link: https://syzkaller.appspot.com/bug?extid=7a8ba368b47fdefca61e
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=132815f5400000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com

IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
8021q: adding VLAN 0 to HW filter on device team0
8021q: adding VLAN 0 to HW filter on device team0
8021q: adding VLAN 0 to HW filter on device team0
==================================================================
BUG: KASAN: use-after-free in lookup_last fs/namei.c:2269 [inline]
BUG: KASAN: use-after-free in path_lookupat.isra.43+0x9f8/0xc00  
fs/namei.c:2318
Read of size 1 at addr ffff8801c4865c47 by task syz-executor2/9423

CPU: 0 PID: 9423 Comm: syz-executor2 Not tainted 4.20.0-rc1-next-20181109+  
#110
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x244/0x39d lib/dump_stack.c:113
  print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
  __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
  lookup_last fs/namei.c:2269 [inline]
  path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
  filename_lookup+0x26a/0x520 fs/namei.c:2348
  user_path_at_empty+0x40/0x50 fs/namei.c:2608
  user_path include/linux/namei.h:62 [inline]
  do_mount+0x180/0x1ff0 fs/namespace.c:2980
  ksys_mount+0x12d/0x140 fs/namespace.c:3258
  __do_sys_mount fs/namespace.c:3272 [inline]
  __se_sys_mount fs/namespace.c:3269 [inline]
  __x64_sys_mount+0xbe/0x150 fs/namespace.c:3269
  do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457569
Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fde6ed96c78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000457569
RDX: 0000000020000040 RSI: 0000000020000000 RDI: 0000000000000000
RBP: 000000000072bf00 R08: 0000000020000340 R09: 0000000000000000
R10: 0000000000200000 R11: 0000000000000246 R12: 00007fde6ed976d4
R13: 00000000004c2c24 R14: 00000000004d4990 R15: 00000000ffffffff

Allocated by task 9424:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
  __do_kmalloc mm/slab.c:3722 [inline]
  __kmalloc_track_caller+0x157/0x760 mm/slab.c:3737
  kstrdup+0x39/0x70 mm/util.c:49
  bpf_symlink+0x26/0x140 kernel/bpf/inode.c:356
  vfs_symlink+0x37a/0x5d0 fs/namei.c:4127
  do_symlinkat+0x242/0x2d0 fs/namei.c:4154
  __do_sys_symlink fs/namei.c:4173 [inline]
  __se_sys_symlink fs/namei.c:4171 [inline]
  __x64_sys_symlink+0x59/0x80 fs/namei.c:4171
  do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 9425:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
  kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
  __cache_free mm/slab.c:3498 [inline]
  kfree+0xcf/0x230 mm/slab.c:3817
  bpf_evict_inode+0x11f/0x150 kernel/bpf/inode.c:565
  evict+0x4b9/0x980 fs/inode.c:558
  iput_final fs/inode.c:1550 [inline]
  iput+0x674/0xa90 fs/inode.c:1576
  do_unlinkat+0x733/0xa30 fs/namei.c:4069
  __do_sys_unlink fs/namei.c:4110 [inline]
  __se_sys_unlink fs/namei.c:4108 [inline]
  __x64_sys_unlink+0x42/0x50 fs/namei.c:4108
  do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8801c4865c40
  which belongs to the cache kmalloc-32 of size 32
The buggy address is located 7 bytes inside of
  32-byte region [ffff8801c4865c40, ffff8801c4865c60)
The buggy address belongs to the page:
page:ffffea0007121940 count:1 mapcount:0 mapping:ffff8801da8001c0  
index:0xffff8801c4865fc1
flags: 0x2fffc0000000200(slab)
raw: 02fffc0000000200 ffffea000763c048 ffffea000762cf08 ffff8801da8001c0
raw: ffff8801c4865fc1 ffff8801c4865000 000000010000003f 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff8801c4865b00: fb fb fb fb fc fc fc fc 05 fc fc fc fc fc fc fc
  ffff8801c4865b80: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
> ffff8801c4865c00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
                                            ^
  ffff8801c4865c80: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
  ffff8801c4865d00: fb fb fb fb fc fc fc fc 05 fc fc fc fc fc fc fc
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2018-11-28 17:40 KASAN: use-after-free Read in path_lookupat syzbot
@ 2019-03-25  0:44 ` syzbot
  2019-03-25  1:25   ` Linus Torvalds
  2019-03-25  1:23 ` Linus Torvalds
  1 sibling, 1 reply; 37+ messages in thread
From: syzbot @ 2019-03-25  0:44 UTC (permalink / raw)
  To: dledford, hch, leonro, linux-fsdevel, linux-kernel, sagi,
	syzkaller-bugs, viro

syzbot has bisected this bug to:

commit a435393acafbf0ecff4deb3e3cb554b34f0d0664
Author: Sagi Grimberg <sagi@grimberg.me>
Date:   Thu Jul 13 08:09:40 2017 +0000

     mlx5: move affinity hints assignments to generic code

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=12df1ba3200000
start commit:   442b8cea Add linux-next specific files for 20181109
git tree:       linux-next
final crash:    https://syzkaller.appspot.com/x/report.txt?x=11df1ba3200000
console output: https://syzkaller.appspot.com/x/log.txt?x=16df1ba3200000
kernel config:  https://syzkaller.appspot.com/x/.config?x=2f72bdb11df9fbe8
dashboard link: https://syzkaller.appspot.com/bug?extid=7a8ba368b47fdefca61e
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=132815f5400000

Reported-by: syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com
Fixes: a435393acafb ("mlx5: move affinity hints assignments to generic  
code")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2018-11-28 17:40 KASAN: use-after-free Read in path_lookupat syzbot
  2019-03-25  0:44 ` syzbot
@ 2019-03-25  1:23 ` Linus Torvalds
  2019-03-25  4:57   ` Al Viro
  1 sibling, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2019-03-25  1:23 UTC (permalink / raw)
  To: syzbot, Alexei Starovoitov, Daniel Borkmann
  Cc: linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs, Al Viro

Hmm.

Al, this one seems real and also seems pretty nasty from a vfs
interface standpoint.

On Wed, Nov 28, 2018 at 9:40 AM syzbot
<syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com> wrote:
>
> BUG: KASAN: use-after-free in lookup_last fs/namei.c:2269 [inline]
> BUG: KASAN: use-after-free in path_lookupat.isra.43+0x9f8/0xc00  fs/namei.c:2318
> ...
>   lookup_last fs/namei.c:2269 [inline]
>   path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318

Ok, path lookup using RCU.

> Allocated by task 9424:
>   kstrdup+0x39/0x70 mm/util.c:49
>   bpf_symlink+0x26/0x140 kernel/bpf/inode.c:356

It's the symlink data for the bpf filesystem. Fair enough.

> Freed by task 9425:
>   kfree+0xcf/0x230 mm/slab.c:3817
>   bpf_evict_inode+0x11f/0x150 kernel/bpf/inode.c:565
>   evict+0x4b9/0x980 fs/inode.c:558
>   iput_final fs/inode.c:1550 [inline]
>   iput+0x674/0xa90 fs/inode.c:1576
>   do_unlinkat+0x733/0xa30 fs/namei.c:4069

.. and the path lookup is racing with the final unlink.

We actually RCU-delay the inode freeing itself, but when we do the
final iput(), the "evict()" function is called synchronously.

Now, the simple fix would seem to just RCU-delay the kfree() of the
symlink data in bpf_evict_inode(). Maybe that's the right thing to do.
I'd call it trivial, except you'd need to have that rcu head to attach
it to, making it slightly less trivial. I guess the kmalloc could just
malloc that too.

But it does strike me that this might be a generic issue. I wonder how
many other filesystems do this?

Alexei, Daniel - the "proper" fix right is probably one of four cases:

 (a) rcu-delay the freeing, and use simple_symlink_inode_operations().
No set_delayed_call() needed, because it stays around for the inode
lifetime, but the RCU-delaying is still needed for lookup.

or

 (b) put the symlink in a page, and use page_symlink_inode_operations
for that inode, where we have a "struct delayed_call" and get a page
ref

or

 (c) put the symlink in the inode itself, and then use that
simple_symlink_inode_operations (and now the RCU-delaying of the inode
protects it).

of

 (d) make a special getlink() that does "copy symlink,
set_delayed_call(done, kfree_link, copy)"

but I do wonder if we should perhaps just make
simple_symlink_inode_operations do that (d) case for people.

Al, comments? At the very least, if we don't make
simple_symlink_inode_operations() do that, we should have a *big*
comment that if it's not part of the inode data, it needs to be
RCU-delayed.

Or maybe we could add a final inode callback function for "rcu_free"
that is called as the RCU-delayed freeing of the inode itself happens?
And then people could hook into that for freeing the inode->i_link
data.

So many choices.. But the current situation seems unnecessarily
complex for the filesystem, and isn't really documented.

Our documentation currently says for get_link(): "If the body won't go
away until the inode is gone, nothing else is needed", which is wrong
(or at least very misleading, since the last "inode is gone" callback
we have is that evict() function).

               Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25  0:44 ` syzbot
@ 2019-03-25  1:25   ` Linus Torvalds
  0 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2019-03-25  1:25 UTC (permalink / raw)
  To: syzbot
  Cc: dledford, Christoph Hellwig, Leon Romanovsky, linux-fsdevel,
	Linux List Kernel Mailing, Sagi Grimberg, syzkaller-bugs,
	Al Viro

On Sun, Mar 24, 2019 at 5:44 PM syzbot
<syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com> wrote:
>
> syzbot has bisected this bug to:
>
> commit a435393acafbf0ecff4deb3e3cb554b34f0d0664

That's wrong. It might change some timing or something to make the
race hit, but it looks entirely unrelated to the bpf path_lookupat
problem.

                 Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25  1:23 ` Linus Torvalds
@ 2019-03-25  4:57   ` Al Viro
  2019-03-25  9:15     ` Daniel Borkmann
  2019-03-25 18:36     ` Linus Torvalds
  0 siblings, 2 replies; 37+ messages in thread
From: Al Viro @ 2019-03-25  4:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

On Sun, Mar 24, 2019 at 06:23:24PM -0700, Linus Torvalds wrote:


> Al, comments? At the very least, if we don't make
> simple_symlink_inode_operations() do that, we should have a *big*
> comment that if it's not part of the inode data, it needs to be
> RCU-delayed.

simple_symlink_inode_operations is red herring here - what matters
is ->i_link being set; those have ->get_link == simple_get_link,
but note that it is *not* called:
        res = inode->i_link;
        if (!res) {
                const char * (*get)(struct dentry *, struct inode *,
                                struct delayed_call *);
                get = inode->i_op->get_link;
                if (nd->flags & LOOKUP_RCU) {
                        res = get(NULL, inode, &last->done);
                        if (res == ERR_PTR(-ECHILD)) {
                                if (unlikely(unlazy_walk(nd)))
                                        return ERR_PTR(-ECHILD);
                                res = get(dentry, inode, &last->done);
                        }
                } else {
                        res = get(dentry, inode, &last->done);
                }
                if (IS_ERR_OR_NULL(res))
                        return res;
	}
for traversal and similar for readlink(2).  And we certainly don't want
to allocate copies in those cases - it would fuck RCU traversals for
all fast symlinks (i.e. for the majority of symlinks out there).

Actual situation:

* shmem, erofs: OK, kfree() from the thing ->destroy_inode() is calling via
call_rcu().
* befs, ext2, ext4, freevxfs, jfs, orangefs, ufs: OK, coallocated with inode
* debugfs: broken
* jffs2: broken, freeing of f->target should be moved to jffs2_i_callback().
* ubifs: broken, ought to move kfree(ui->data); from ubifs_destroy_inode() to
ubifs_i_callback()
* ceph: broken, needs to move kfree(ci->symlink) from ceph_destroy_inode()
to ceph_i_callback().
* bpf: broken

So we have 5 broken cases, all with the same kind of fix: move freeing
into the RCU-delayed part of ->destroy_inode(); for debugfs and bpf
that requires adding ->alloc_inode()/->destroy_inode(), rather than
relying upon the defaults from fs/inode.c

> Or maybe we could add a final inode callback function for "rcu_free"
> that is called as the RCU-delayed freeing of the inode itself happens?
> And then people could hook into that for freeing the inode->i_link
> data.

You mean, split ->destroy_inode() into immediate and RCU-delayed parts?
There are filesystems where both parts are non-empty - we can't just
switch all ->destroy_inode() work to call_rcu().

> So many choices.. But the current situation seems unnecessarily
> complex for the filesystem, and isn't really documented.
> 
> Our documentation currently says for get_link(): "If the body won't go
> away until the inode is gone, nothing else is needed", which is wrong
> (or at least very misleading, since the last "inode is gone" callback
> we have is that evict() function).

s/inode is gone/struct inode is freed/, but it's obviously not clear
enough.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25  4:57   ` Al Viro
@ 2019-03-25  9:15     ` Daniel Borkmann
  2019-03-25 11:11       ` Al Viro
  2019-03-25 18:36     ` Linus Torvalds
  1 sibling, 1 reply; 37+ messages in thread
From: Daniel Borkmann @ 2019-03-25  9:15 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds
  Cc: syzbot, Alexei Starovoitov, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

Hi Al, hi Linus,

On 03/25/2019 05:57 AM, Al Viro wrote:
> On Sun, Mar 24, 2019 at 06:23:24PM -0700, Linus Torvalds wrote:
> 
>> Al, comments? At the very least, if we don't make
>> simple_symlink_inode_operations() do that, we should have a *big*
>> comment that if it's not part of the inode data, it needs to be
>> RCU-delayed.
> 
> simple_symlink_inode_operations is red herring here - what matters
> is ->i_link being set; those have ->get_link == simple_get_link,
> but note that it is *not* called:
>         res = inode->i_link;
>         if (!res) {
>                 const char * (*get)(struct dentry *, struct inode *,
>                                 struct delayed_call *);
>                 get = inode->i_op->get_link;
>                 if (nd->flags & LOOKUP_RCU) {
>                         res = get(NULL, inode, &last->done);
>                         if (res == ERR_PTR(-ECHILD)) {
>                                 if (unlikely(unlazy_walk(nd)))
>                                         return ERR_PTR(-ECHILD);
>                                 res = get(dentry, inode, &last->done);
>                         }
>                 } else {
>                         res = get(dentry, inode, &last->done);
>                 }
>                 if (IS_ERR_OR_NULL(res))
>                         return res;
> 	}
> for traversal and similar for readlink(2).  And we certainly don't want
> to allocate copies in those cases - it would fuck RCU traversals for
> all fast symlinks (i.e. for the majority of symlinks out there).
> 
> Actual situation:
> 
> * shmem, erofs: OK, kfree() from the thing ->destroy_inode() is calling via
> call_rcu().
> * befs, ext2, ext4, freevxfs, jfs, orangefs, ufs: OK, coallocated with inode
> * debugfs: broken
> * jffs2: broken, freeing of f->target should be moved to jffs2_i_callback().
> * ubifs: broken, ought to move kfree(ui->data); from ubifs_destroy_inode() to
> ubifs_i_callback()
> * ceph: broken, needs to move kfree(ci->symlink) from ceph_destroy_inode()
> to ceph_i_callback().
> * bpf: broken
> 
> So we have 5 broken cases, all with the same kind of fix: move freeing
> into the RCU-delayed part of ->destroy_inode(); for debugfs and bpf
> that requires adding ->alloc_inode()/->destroy_inode(), rather than
> relying upon the defaults from fs/inode.c

I believe I copied that logic from one of the other fs'es back then, sigh.
Thanks for the analysis and hints for fixing. Would the below (only compile
tested for now) look reasonable to you? I believe there is no other way
around than to add our own inode cache, but so be it. The freeing of the
i_link is then RCU-deferred in bpf_destroy_inode_deferred():

 kernel/bpf/inode.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 52 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 2ada5e2..69fcaf6 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -80,6 +80,47 @@ static const struct inode_operations bpf_dir_iops;
 static const struct inode_operations bpf_prog_iops = { };
 static const struct inode_operations bpf_map_iops  = { };

+static struct kmem_cache *bpf_inode_cache __read_mostly;
+
+struct inode *bpf_alloc_inode(struct super_block *sb)
+{
+	return kmem_cache_alloc(bpf_inode_cache, GFP_KERNEL);
+}
+
+static void bpf_destroy_inode_deferred(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+
+	if (S_ISLNK(inode->i_mode))
+		kfree(inode->i_link);
+	kmem_cache_free(bpf_inode_cache, inode);
+}
+
+static void bpf_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, bpf_destroy_inode_deferred);
+}
+
+static void bpf_inode_init_once(void *inode)
+{
+	inode_init_once(inode);
+}
+
+static int bpf_init_inode_cache(void)
+{
+	bpf_inode_cache = kmem_cache_create("bpf_inode_cache",
+					    sizeof(struct inode), 0,
+					    (SLAB_RECLAIM_ACCOUNT|
+					     SLAB_MEM_SPREAD|SLAB_ACCOUNT),
+					    bpf_inode_init_once);
+	return !bpf_inode_cache ? -ENOMEM : 0;
+}
+
+static void bpf_destroy_inode_cache(void)
+{
+	kmem_cache_destroy(bpf_inode_cache);
+}
+
 static struct inode *bpf_get_inode(struct super_block *sb,
 				   const struct inode *dir,
 				   umode_t mode)
@@ -561,8 +602,6 @@ static void bpf_evict_inode(struct inode *inode)
 	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);

-	if (S_ISLNK(inode->i_mode))
-		kfree(inode->i_link);
 	if (!bpf_inode_type(inode, &type))
 		bpf_any_put(inode->i_private, type);
 }
@@ -580,6 +619,8 @@ static int bpf_show_options(struct seq_file *m, struct dentry *root)
 }

 static const struct super_operations bpf_super_ops = {
+	.alloc_inode	= bpf_alloc_inode,
+	.destroy_inode	= bpf_destroy_inode,
 	.statfs		= simple_statfs,
 	.drop_inode	= generic_delete_inode,
 	.show_options	= bpf_show_options,
@@ -671,13 +712,19 @@ static int __init bpf_init(void)
 {
 	int ret;

-	ret = sysfs_create_mount_point(fs_kobj, "bpf");
+	ret = bpf_init_inode_cache();
 	if (ret)
 		return ret;
-
+	ret = sysfs_create_mount_point(fs_kobj, "bpf");
+	if (ret) {
+		bpf_destroy_inode_cache();
+		return ret;
+	}
 	ret = register_filesystem(&bpf_fs_type);
-	if (ret)
+	if (ret) {
 		sysfs_remove_mount_point(fs_kobj, "bpf");
+		bpf_destroy_inode_cache();
+	}

 	return ret;
 }
-- 
2.9.5

>> Or maybe we could add a final inode callback function for "rcu_free"
>> that is called as the RCU-delayed freeing of the inode itself happens?
>> And then people could hook into that for freeing the inode->i_link
>> data.
> 
> You mean, split ->destroy_inode() into immediate and RCU-delayed parts?
> There are filesystems where both parts are non-empty - we can't just
> switch all ->destroy_inode() work to call_rcu().
> 
>> So many choices.. But the current situation seems unnecessarily
>> complex for the filesystem, and isn't really documented.
>>
>> Our documentation currently says for get_link(): "If the body won't go
>> away until the inode is gone, nothing else is needed", which is wrong
>> (or at least very misleading, since the last "inode is gone" callback
>> we have is that evict() function).
> 
> s/inode is gone/struct inode is freed/, but it's obviously not clear
> enough.
> 


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25  9:15     ` Daniel Borkmann
@ 2019-03-25 11:11       ` Al Viro
  2019-03-25 11:17         ` Al Viro
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2019-03-25 11:11 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Linus Torvalds, syzbot, Alexei Starovoitov, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

On Mon, Mar 25, 2019 at 10:15:40AM +0100, Daniel Borkmann wrote:
> > So we have 5 broken cases, all with the same kind of fix: move freeing
> > into the RCU-delayed part of ->destroy_inode(); for debugfs and bpf
> > that requires adding ->alloc_inode()/->destroy_inode(), rather than
> > relying upon the defaults from fs/inode.c
> 
> I believe I copied that logic from one of the other fs'es back then, sigh.
> Thanks for the analysis and hints for fixing. Would the below (only compile
> tested for now) look reasonable to you? I believe there is no other way
> around than to add our own inode cache, but so be it. The freeing of the
> i_link is then RCU-deferred in bpf_destroy_inode_deferred():

It looks like it would suffice, but it's way too much boilerplate for my
taste ;-/

Most of your headache here comes from messing with slab setup and the
only reason for that is being unable to free stuff into inode_cachep.
So grep for inode_cachep would be a good idea, and it digs up the
following sucker:
void free_inode_nonrcu(struct inode *inode)  
{
        kmem_cache_free(inode_cachep, inode);
}
EXPORT_SYMBOL(free_inode_nonrcu);

IOW, you need nothing on ->alloc_inode() side and for ->destroy_inode()
just do call_rcu() of a callback, that would kfree(inode->link) if
it was a symlink, then call free_inode_nonrcu(inode).

Considerably smaller patch that way...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 11:11       ` Al Viro
@ 2019-03-25 11:17         ` Al Viro
  2019-03-25 11:21           ` Daniel Borkmann
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2019-03-25 11:17 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Linus Torvalds, syzbot, Alexei Starovoitov, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

On Mon, Mar 25, 2019 at 11:11:23AM +0000, Al Viro wrote:
> On Mon, Mar 25, 2019 at 10:15:40AM +0100, Daniel Borkmann wrote:
> > > So we have 5 broken cases, all with the same kind of fix: move freeing
> > > into the RCU-delayed part of ->destroy_inode(); for debugfs and bpf
> > > that requires adding ->alloc_inode()/->destroy_inode(), rather than
> > > relying upon the defaults from fs/inode.c
> > 
> > I believe I copied that logic from one of the other fs'es back then, sigh.
> > Thanks for the analysis and hints for fixing. Would the below (only compile
> > tested for now) look reasonable to you? I believe there is no other way
> > around than to add our own inode cache, but so be it. The freeing of the
> > i_link is then RCU-deferred in bpf_destroy_inode_deferred():
> 
> It looks like it would suffice, but it's way too much boilerplate for my
> taste ;-/
> 
> Most of your headache here comes from messing with slab setup and the
> only reason for that is being unable to free stuff into inode_cachep.
> So grep for inode_cachep would be a good idea, and it digs up the
> following sucker:
> void free_inode_nonrcu(struct inode *inode)  
> {
>         kmem_cache_free(inode_cachep, inode);
> }
> EXPORT_SYMBOL(free_inode_nonrcu);
> 
> IOW, you need nothing on ->alloc_inode() side and for ->destroy_inode()
> just do call_rcu() of a callback, that would kfree(inode->link) if
> it was a symlink, then call free_inode_nonrcu(inode).
> 
> Considerably smaller patch that way...

AFAICS, it boils down to just this (also only compile-tested):

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 2ada5e21dfa6..9437f88b6acf 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -80,6 +80,20 @@ static const struct inode_operations bpf_dir_iops;
 static const struct inode_operations bpf_prog_iops = { };
 static const struct inode_operations bpf_map_iops  = { };
 
+static void bpf_destroy_inode_deferred(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+
+	if (S_ISLNK(inode->i_mode))
+		kfree(inode->i_link);
+	free_inode_nonrcu(inode);
+}
+
+static void bpf_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, bpf_destroy_inode_deferred);
+}
+
 static struct inode *bpf_get_inode(struct super_block *sb,
 				   const struct inode *dir,
 				   umode_t mode)
@@ -561,8 +575,6 @@ static void bpf_evict_inode(struct inode *inode)
 	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 
-	if (S_ISLNK(inode->i_mode))
-		kfree(inode->i_link);
 	if (!bpf_inode_type(inode, &type))
 		bpf_any_put(inode->i_private, type);
 }
@@ -584,6 +596,7 @@ static const struct super_operations bpf_super_ops = {
 	.drop_inode	= generic_delete_inode,
 	.show_options	= bpf_show_options,
 	.evict_inode	= bpf_evict_inode,
+	.destroy_inode	= bpf_destroy_inode,
 };
 
 enum {

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 11:17         ` Al Viro
@ 2019-03-25 11:21           ` Daniel Borkmann
  0 siblings, 0 replies; 37+ messages in thread
From: Daniel Borkmann @ 2019-03-25 11:21 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, syzbot, Alexei Starovoitov, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

On 03/25/2019 12:17 PM, Al Viro wrote:
> On Mon, Mar 25, 2019 at 11:11:23AM +0000, Al Viro wrote:
>> On Mon, Mar 25, 2019 at 10:15:40AM +0100, Daniel Borkmann wrote:
>>>> So we have 5 broken cases, all with the same kind of fix: move freeing
>>>> into the RCU-delayed part of ->destroy_inode(); for debugfs and bpf
>>>> that requires adding ->alloc_inode()/->destroy_inode(), rather than
>>>> relying upon the defaults from fs/inode.c
>>>
>>> I believe I copied that logic from one of the other fs'es back then, sigh.
>>> Thanks for the analysis and hints for fixing. Would the below (only compile
>>> tested for now) look reasonable to you? I believe there is no other way
>>> around than to add our own inode cache, but so be it. The freeing of the
>>> i_link is then RCU-deferred in bpf_destroy_inode_deferred():
>>
>> It looks like it would suffice, but it's way too much boilerplate for my
>> taste ;-/
>>
>> Most of your headache here comes from messing with slab setup and the
>> only reason for that is being unable to free stuff into inode_cachep.
>> So grep for inode_cachep would be a good idea, and it digs up the
>> following sucker:
>> void free_inode_nonrcu(struct inode *inode)  
>> {
>>         kmem_cache_free(inode_cachep, inode);
>> }
>> EXPORT_SYMBOL(free_inode_nonrcu);
>>
>> IOW, you need nothing on ->alloc_inode() side and for ->destroy_inode()
>> just do call_rcu() of a callback, that would kfree(inode->link) if
>> it was a symlink, then call free_inode_nonrcu(inode).
>>
>> Considerably smaller patch that way...
> 
> AFAICS, it boils down to just this (also only compile-tested):

Ah yep, much better, agree. Want to cook proper patch for bpf, or want
me to take care of it?

Thanks,
Daniel

> diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> index 2ada5e21dfa6..9437f88b6acf 100644
> --- a/kernel/bpf/inode.c
> +++ b/kernel/bpf/inode.c
> @@ -80,6 +80,20 @@ static const struct inode_operations bpf_dir_iops;
>  static const struct inode_operations bpf_prog_iops = { };
>  static const struct inode_operations bpf_map_iops  = { };
>  
> +static void bpf_destroy_inode_deferred(struct rcu_head *head)
> +{
> +	struct inode *inode = container_of(head, struct inode, i_rcu);
> +
> +	if (S_ISLNK(inode->i_mode))
> +		kfree(inode->i_link);
> +	free_inode_nonrcu(inode);
> +}
> +
> +static void bpf_destroy_inode(struct inode *inode)
> +{
> +	call_rcu(&inode->i_rcu, bpf_destroy_inode_deferred);
> +}
> +
>  static struct inode *bpf_get_inode(struct super_block *sb,
>  				   const struct inode *dir,
>  				   umode_t mode)
> @@ -561,8 +575,6 @@ static void bpf_evict_inode(struct inode *inode)
>  	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  
> -	if (S_ISLNK(inode->i_mode))
> -		kfree(inode->i_link);
>  	if (!bpf_inode_type(inode, &type))
>  		bpf_any_put(inode->i_private, type);
>  }
> @@ -584,6 +596,7 @@ static const struct super_operations bpf_super_ops = {
>  	.drop_inode	= generic_delete_inode,
>  	.show_options	= bpf_show_options,
>  	.evict_inode	= bpf_evict_inode,
> +	.destroy_inode	= bpf_destroy_inode,
>  };
>  
>  enum {
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25  4:57   ` Al Viro
  2019-03-25  9:15     ` Daniel Borkmann
@ 2019-03-25 18:36     ` Linus Torvalds
  2019-03-25 19:18       ` Linus Torvalds
  2019-03-25 19:43       ` Al Viro
  1 sibling, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2019-03-25 18:36 UTC (permalink / raw)
  To: Al Viro
  Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

On Sun, Mar 24, 2019 at 9:57 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> So we have 5 broken cases, all with the same kind of fix: move freeing
> into the RCU-delayed part of ->destroy_inode(); for debugfs and bpf
> that requires adding ->alloc_inode()/->destroy_inode(), rather than
> relying upon the defaults from fs/inode.c

The fact that we had five different broken filesystems does imply that
we should just make this much easier at the VFS layer instead of
forcing filesystems to cope with a bad interface.

> > Or maybe we could add a final inode callback function for "rcu_free"
> > that is called as the RCU-delayed freeing of the inode itself happens?
> > And then people could hook into that for freeing the inode->i_link
> > data.
>
> You mean, split ->destroy_inode() into immediate and RCU-delayed parts?
> There are filesystems where both parts are non-empty - we can't just
> switch all ->destroy_inode() work to call_rcu().

Right. Not just move the existing destroy_inode() - because as you
say, people may not be able to to do that in RCU contect, but split it
up, and add a "final_free_inode()" callback or something for the RCU
phase.

In fact, I suspect that *every* user of destroy_inode() already has to
have its own RCU callback to another final freeing function anyway.
Because they really shouldn't free the inode itself early. Maybe we
can just make that be a generic thing?

And no, it's not like the patch to the bpf filesystem looks
_horrible_, but I think the fact that the low-level filesystem needed
to do its own RCU-deferred thing (even if it wasn't doing any RCU on
its own, although bpf obviously has that in other areas) is kind of
wrong, and shows that our vfs interfaces are a bit awkward. Plus it's
not like it was just bpf that got this wrong.

It's also somewhat silly to add _another_ RCU callback when we already
have one for the inode.

                  Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 18:36     ` Linus Torvalds
@ 2019-03-25 19:18       ` Linus Torvalds
  2019-03-25 21:14         ` Al Viro
  2019-03-25 19:43       ` Al Viro
  1 sibling, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2019-03-25 19:18 UTC (permalink / raw)
  To: Al Viro
  Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

[-- Attachment #1: Type: text/plain, Size: 678 bytes --]

On Mon, Mar 25, 2019 at 11:36 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> >
> > You mean, split ->destroy_inode() into immediate and RCU-delayed parts?
> > There are filesystems where both parts are non-empty - we can't just
> > switch all ->destroy_inode() work to call_rcu().
>
> Right. Not just move the existing destroy_inode() - because as you
> say, people may not be able to to do that in RCU contect, but split it
> up, and add a "final_free_inode()" callback or something for the RCU
> phase.

Something like the attached.

COMPLETELY UNTESTED. And no filesystems converted to actually use the
new rcu_destroy_inode() thing.

Hmm?

                Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 3424 bytes --]

 Documentation/filesystems/vfs.txt |  6 ++++++
 fs/inode.c                        | 27 +++++++++++++++++++++------
 include/linux/fs.h                |  1 +
 3 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 761c6fd24a53..60f7841a12e6 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -210,6 +210,7 @@ filesystem. As of kernel 2.6.22, the following members are defined:
 struct super_operations {
         struct inode *(*alloc_inode)(struct super_block *sb);
         void (*destroy_inode)(struct inode *);
+        int (*rcu_destroy_inode)(struct inode *);
 
         void (*dirty_inode) (struct inode *, int flags);
         int (*write_inode) (struct inode *, int);
@@ -248,6 +249,11 @@ or bottom half).
   	->alloc_inode was defined and simply undoes anything done by
 	->alloc_inode.
 
+ rcu_destroy_inode: this method is called after the RCU delay by
+	destroy_inode() to release resources allocated for struct inode.
+	If it returns a non-zero value, it means that it has free'd the
+	inode, otherwise the inode layer will free it.
+
   dirty_inode: this method is called by the VFS to mark an inode dirty.
 
   write_inode: this method is called when the VFS needs to write an
diff --git a/fs/inode.c b/fs/inode.c
index e9d97add2b36..b85457baad20 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -215,9 +215,13 @@ static struct inode *alloc_inode(struct super_block *sb)
 		return NULL;
 
 	if (unlikely(inode_init_always(sb, inode))) {
-		if (inode->i_sb->s_op->destroy_inode)
+		if (inode->i_sb->s_op->destroy_inode) {
 			inode->i_sb->s_op->destroy_inode(inode);
-		else
+			if (!inode->i_sb->s_op->rcu_destroy_inode)
+				return NULL;
+		}
+		if (!inode->i_sb->s_op->rcu_destroy_inode ||
+		    !inode->i_sb->s_op->rcu_destroy_inode(inode))
 			kmem_cache_free(inode_cachep, inode);
 		return NULL;
 	}
@@ -256,17 +260,28 @@ EXPORT_SYMBOL(__destroy_inode);
 static void i_callback(struct rcu_head *head)
 {
 	struct inode *inode = container_of(head, struct inode, i_rcu);
-	kmem_cache_free(inode_cachep, inode);
+
+	if (!inode->i_sb->s_op->rcu_destroy_inode ||
+	    !inode->i_sb->s_op->rcu_destroy_inode(inode))
+		kmem_cache_free(inode_cachep, inode);
 }
 
 static void destroy_inode(struct inode *inode)
 {
 	BUG_ON(!list_empty(&inode->i_lru));
 	__destroy_inode(inode);
-	if (inode->i_sb->s_op->destroy_inode)
+
+	/*
+	 * If we have a 'destroy_inode' but no 'rcu_destroy_inode'
+	 * then the filesystem handles the RCU-delayed destruction
+	 * on its own, and we don't do any RCU callbacks.
+	 */
+	if (inode->i_sb->s_op->destroy_inode) {
 		inode->i_sb->s_op->destroy_inode(inode);
-	else
-		call_rcu(&inode->i_rcu, i_callback);
+		if (!inode->i_sb->s_op->rcu_destroy_inode)
+			return;
+	}
+	call_rcu(&inode->i_rcu, i_callback);
 }
 
 /**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8b42df09b04c..727561ecbc23 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1900,6 +1900,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
 	void (*destroy_inode)(struct inode *);
+	int (*rcu_destroy_inode)(struct inode *);
 
    	void (*dirty_inode) (struct inode *, int flags);
 	int (*write_inode) (struct inode *, struct writeback_control *wbc);

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 18:36     ` Linus Torvalds
  2019-03-25 19:18       ` Linus Torvalds
@ 2019-03-25 19:43       ` Al Viro
  2019-03-25 22:48         ` Dave Chinner
  1 sibling, 1 reply; 37+ messages in thread
From: Al Viro @ 2019-03-25 19:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

On Mon, Mar 25, 2019 at 11:36:01AM -0700, Linus Torvalds wrote:
> Right. Not just move the existing destroy_inode() - because as you
> say, people may not be able to to do that in RCU contect, but split it
> up, and add a "final_free_inode()" callback or something for the RCU
> phase.
> 
> In fact, I suspect that *every* user of destroy_inode() already has to
> have its own RCU callback to another final freeing function anyway.

Nope - pipes do not, and for a good reason.

> Because they really shouldn't free the inode itself early. Maybe we
> can just make that be a generic thing?

Maybe...  OTOH, we already have more methods on the destruction end
than I'm comfortable trying to document.  Because we clearly *do*
need a clear set of rules along the lines of "this kind of thing belongs
in this method, this - in that one".

As it is, on the way to inode destruction we have

1) [kinda] ->drop_inode() - decide whether this inode is not worth
keeping around in icache once the refcount reaches zero.  Predicate,
shouldn't change inode state at all.    Common instances:
	* default (encoded as NULL, available as generic_drop_inode()) -
"keep it around if it's still hashed and link count is non-zero".
	* generic_delete_inode(): "don't retain that sucker"

However, 3 instances are more or less weird - f2fs, gfs and ocfs2.
gfs2 one is the least odd of those, but the other two...
What the hell is forced writeback doing in ocfs2_drop_inode()?
If they don't want to retain anything, fine, but then why do
the
        inode->i_state |= I_WILL_FREE;
        spin_unlock(&inode->i_lock);
        write_inode_now(inode, 1);
        spin_lock(&inode->i_lock);
        WARN_ON(inode->i_state & I_NEW);
        inode->i_state &= ~I_WILL_FREE;
dance in ->drop_inode()?  It will be immediately followed by
->evict_inode(), and it feels like the damn thing would be
a lot more natural there...  And what, for pity sake, f2fs is
doing with truncation in that predicate, of all places?  The
comment in there is
        /*
         * This is to avoid a deadlock condition like below.
         * writeback_single_inode(inode)
         *  - f2fs_write_data_page
         *    - f2fs_gc -> iput -> evict
         *       - inode_wait_for_writeback(inode)
         */
which looks... uninspiring, to put it mildly.

2) ->evict_inode() - called when we kick the inode out.  Freeing
on-disk inodes, etc. belongs there.  inode is still in icache
hash chain at that point, so any icache lookups for it will block
until that thing is done.  If we have something non-trivial done
by iget... test() callbacks, we must keep the data structures needed
by those.

3) ->destroy_inode() - destructor.  By that point all remaining
references to inode are (stale) RCU ones.  The stuff that might
be reached via those has to have an RCU delay between the call
of ->destroy_inode() and freeing.  Very commonly that's done
by call_rcu(), and more often than not it's the only thing in
->destroy_inode().  However, if we know that there'll be no
RCU accessors, we can do freeing immediately - pipes do just
that.

And the above is piss-poor as documentation goes - it doesn't
answer the "where should this go?" any better than "try to
see what similar filesystems are doing", which is asking for
cargo-culting ;-/

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 19:18       ` Linus Torvalds
@ 2019-03-25 21:14         ` Al Viro
  2019-03-25 21:45           ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2019-03-25 21:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

On Mon, Mar 25, 2019 at 12:18:02PM -0700, Linus Torvalds wrote:
> COMPLETELY UNTESTED. And no filesystems converted to actually use the
> new rcu_destroy_inode() thing.
> 
> Hmm?

Maybe, but we really need to come up with sane documentation on the
entire drop_inode/evict_inode/destroy_inode/rcu_destroy_inode
group ;-/  And I want to understand the writeback-related issues
in ocfs2 and f2fs - the current kludges in those smell fishy.

>  	if (unlikely(inode_init_always(sb, inode))) {
> -		if (inode->i_sb->s_op->destroy_inode)
> +		if (inode->i_sb->s_op->destroy_inode) {
>  			inode->i_sb->s_op->destroy_inode(inode);
> -		else
> +			if (!inode->i_sb->s_op->rcu_destroy_inode)
> +				return NULL;
> +		}
> +		if (!inode->i_sb->s_op->rcu_destroy_inode ||
> +		    !inode->i_sb->s_op->rcu_destroy_inode(inode))
>  			kmem_cache_free(inode_cachep, inode);

ITYM  i_callback(inode); here, possibly suitably renamed.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 21:14         ` Al Viro
@ 2019-03-25 21:45           ` Linus Torvalds
  2019-03-25 22:04             ` Daniel Borkmann
  2019-03-25 23:37             ` Al Viro
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2019-03-25 21:45 UTC (permalink / raw)
  To: Al Viro
  Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

On Mon, Mar 25, 2019 at 2:14 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Maybe, but we really need to come up with sane documentation on the
> entire drop_inode/evict_inode/destroy_inode/rcu_destroy_inode
> group ;-/

Yeah.

I actually think the "destroy_inode/rcu_destroy_inode" part is the
simplest one to understand: it's just tearing down the inode, and the
RCU part is (obviously, as shown by this thread) somewhat subtle, but
at the same time not really all that hard to explain: "inodes that can
be looked up through RCU lookup need to be destroyed with RCU delay".

I think drop_inode->evict_inode is arguably the much more subtle
stuff. That's the "we're not destroying things, we're making decisions
about the state" kind of thing.

And we actually don't have any documentation at all for those two.
Well, there's the "porting" thing, but..

> And I want to understand the writeback-related issues
> in ocfs2 and f2fs - the current kludges in those smell fishy.

I'm assuming you're talking about exactly that drop_inode() kind of subtlety.

At least for ocfs2, the whole "rcu_destroy_inode()" patch I sent out
would simplify things a lot. *All* that the ocfs2_destroy_inode()
function does is to schedule the inode freeing for RCU delay.

Assuming my patch is fine (as mentioned, it was entirely untested),
that patch would actually simplify that a lot. Get rid of
ocfs2_destroy_inode() entirely, and just make
ocfs2_rcu_destroy_inode() do that kmem_cache_free(). Mucho clean-up
(we have that ocfs2_rcu_destroy_inode() currently as
ocfs2_i_callback(), but with the ugly rcu-head interface).

> >       if (unlikely(inode_init_always(sb, inode))) {
> > -             if (inode->i_sb->s_op->destroy_inode)
> > +             if (inode->i_sb->s_op->destroy_inode) {
> >                       inode->i_sb->s_op->destroy_inode(inode);
> > -             else
> > +                     if (!inode->i_sb->s_op->rcu_destroy_inode)
> > +                             return NULL;
> > +             }
> > +             if (!inode->i_sb->s_op->rcu_destroy_inode ||
> > +                 !inode->i_sb->s_op->rcu_destroy_inode(inode))
> >                       kmem_cache_free(inode_cachep, inode);
>
> ITYM  i_callback(inode); here, possibly suitably renamed.

Yeah, I guess we could just call that i_callback() directly. I wrote
it that way because it was how the code was organized (it didn't call
i_callback() before either), but yes, it woudl avoid some duplicate
code to do it that way.

And yes, at that point we'd probably want to rename it too.

On the whole, looking at a few filesystems, I really think that
'rcu_destroy_inode()" would simplify several of them.  That ocfs2 case
I already mentioned, but looking around the exact same is trye in
v9fs, vxfs, dlmfs, hostfs, bfs, coda, fatfs, isofs, minix, nfs,
openprom, proc, qnx4, qnx6, sysv, befs, adfs, affs, efs, ext2, f2fs,
gfs2, hfs, hfsplus, hpfs, jffs2, nilfs, reiserfs, romfs, squashfs)

And in other filesystems that actually *do* have a real
destroy_inode() that does other things too, they still want the rcu
delayed part as well (eg btrfs, ceph, fuse, hugetlbfs, afs, ecryptfs,
ext4, jfs, organgefs, ovl, ubifs).

In fact, every single filesystem I looked at (and I looked at most)
would be simplified by that patch.

And for some it would make it easier to fix bugs in the process (ie
bpf) because they don't currently have that rcu delay and they need
it.

So looking at all the existing users of destroy_inode(), I really do
think we should have that rcu_destroy_inode() callback. Because that

 3 files changed, 28 insertions(+), 6 deletions(-)

will allow quite a lot of lines to be removed from low-level
filesystems with all the call_rcu() and callback container_of(head,
struct inode, i_rcu) noise just going away.

But yes, it would be good to have more documentation fo the drop/evict
phase. As mentioned, I think the destroy phase is actually the easy
one, with the RCU part being the most complex one that this would
actually simplify.

                   Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 21:45           ` Linus Torvalds
@ 2019-03-25 22:04             ` Daniel Borkmann
  2019-03-25 22:13               ` Linus Torvalds
  2019-03-25 22:49               ` Al Viro
  2019-03-25 23:37             ` Al Viro
  1 sibling, 2 replies; 37+ messages in thread
From: Daniel Borkmann @ 2019-03-25 22:04 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro
  Cc: syzbot, Alexei Starovoitov, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

On 03/25/2019 10:45 PM, Linus Torvalds wrote:
> On Mon, Mar 25, 2019 at 2:14 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> Maybe, but we really need to come up with sane documentation on the
>> entire drop_inode/evict_inode/destroy_inode/rcu_destroy_inode
>> group ;-/
> 
> Yeah.
> 
> I actually think the "destroy_inode/rcu_destroy_inode" part is the
> simplest one to understand: it's just tearing down the inode, and the
> RCU part is (obviously, as shown by this thread) somewhat subtle, but
> at the same time not really all that hard to explain: "inodes that can
> be looked up through RCU lookup need to be destroyed with RCU delay".
> 
> I think drop_inode->evict_inode is arguably the much more subtle
> stuff. That's the "we're not destroying things, we're making decisions
> about the state" kind of thing.
> 
> And we actually don't have any documentation at all for those two.
> Well, there's the "porting" thing, but..
> 
>> And I want to understand the writeback-related issues
>> in ocfs2 and f2fs - the current kludges in those smell fishy.
> 
> I'm assuming you're talking about exactly that drop_inode() kind of subtlety.
> 
> At least for ocfs2, the whole "rcu_destroy_inode()" patch I sent out
> would simplify things a lot. *All* that the ocfs2_destroy_inode()
> function does is to schedule the inode freeing for RCU delay.
> 
> Assuming my patch is fine (as mentioned, it was entirely untested),
> that patch would actually simplify that a lot. Get rid of
> ocfs2_destroy_inode() entirely, and just make
> ocfs2_rcu_destroy_inode() do that kmem_cache_free(). Mucho clean-up
> (we have that ocfs2_rcu_destroy_inode() currently as
> ocfs2_i_callback(), but with the ugly rcu-head interface).
> 
>>>       if (unlikely(inode_init_always(sb, inode))) {
>>> -             if (inode->i_sb->s_op->destroy_inode)
>>> +             if (inode->i_sb->s_op->destroy_inode) {
>>>                       inode->i_sb->s_op->destroy_inode(inode);
>>> -             else
>>> +                     if (!inode->i_sb->s_op->rcu_destroy_inode)
>>> +                             return NULL;
>>> +             }
>>> +             if (!inode->i_sb->s_op->rcu_destroy_inode ||
>>> +                 !inode->i_sb->s_op->rcu_destroy_inode(inode))
>>>                       kmem_cache_free(inode_cachep, inode);
>>
>> ITYM  i_callback(inode); here, possibly suitably renamed.
> 
> Yeah, I guess we could just call that i_callback() directly. I wrote
> it that way because it was how the code was organized (it didn't call
> i_callback() before either), but yes, it woudl avoid some duplicate
> code to do it that way.
> 
> And yes, at that point we'd probably want to rename it too.
> 
> On the whole, looking at a few filesystems, I really think that
> 'rcu_destroy_inode()" would simplify several of them.  That ocfs2 case
> I already mentioned, but looking around the exact same is trye in
> v9fs, vxfs, dlmfs, hostfs, bfs, coda, fatfs, isofs, minix, nfs,
> openprom, proc, qnx4, qnx6, sysv, befs, adfs, affs, efs, ext2, f2fs,
> gfs2, hfs, hfsplus, hpfs, jffs2, nilfs, reiserfs, romfs, squashfs)
> 
> And in other filesystems that actually *do* have a real
> destroy_inode() that does other things too, they still want the rcu
> delayed part as well (eg btrfs, ceph, fuse, hugetlbfs, afs, ecryptfs,
> ext4, jfs, organgefs, ovl, ubifs).
> 
> In fact, every single filesystem I looked at (and I looked at most)
> would be simplified by that patch.
> 
> And for some it would make it easier to fix bugs in the process (ie
> bpf) because they don't currently have that rcu delay and they need
> it.

I'm fine either way, I think the rcu_destroy_inode would indeed simplify
it nicely. In any case fwiw, here's what I'd have ready for standby on bpf
side and tested as well. Decided to get rid of bpf_evict_inode() entirely
since the only callback we'd really need is on final inode destruction:

From fc2111c4ebb487b630a423752c0f23b5f2535a55 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Mon, 25 Mar 2019 15:54:43 +0100
Subject: [PATCH bpf] bpf: fix use after free in bpf_evict_inode

syzkaller was able to generate the following UAF in bpf:

  BUG: KASAN: use-after-free in lookup_last fs/namei.c:2269 [inline]
  BUG: KASAN: use-after-free in path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
  Read of size 1 at addr ffff8801c4865c47 by task syz-executor2/9423

  CPU: 0 PID: 9423 Comm: syz-executor2 Not tainted 4.20.0-rc1-next-20181109+
  #110
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
  Google 01/01/2011
  Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x244/0x39d lib/dump_stack.c:113
    print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
    kasan_report_error mm/kasan/report.c:354 [inline]
    kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
    __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
    lookup_last fs/namei.c:2269 [inline]
    path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
    filename_lookup+0x26a/0x520 fs/namei.c:2348
    user_path_at_empty+0x40/0x50 fs/namei.c:2608
    user_path include/linux/namei.h:62 [inline]
    do_mount+0x180/0x1ff0 fs/namespace.c:2980
    ksys_mount+0x12d/0x140 fs/namespace.c:3258
    __do_sys_mount fs/namespace.c:3272 [inline]
    __se_sys_mount fs/namespace.c:3269 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3269
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x457569
  Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
  48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
  ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
  RSP: 002b:00007fde6ed96c78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000457569
  RDX: 0000000020000040 RSI: 0000000020000000 RDI: 0000000000000000
  RBP: 000000000072bf00 R08: 0000000020000340 R09: 0000000000000000
  R10: 0000000000200000 R11: 0000000000000246 R12: 00007fde6ed976d4
  R13: 00000000004c2c24 R14: 00000000004d4990 R15: 00000000ffffffff

  Allocated by task 9424:
    save_stack+0x43/0xd0 mm/kasan/kasan.c:448
    set_track mm/kasan/kasan.c:460 [inline]
    kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
    __do_kmalloc mm/slab.c:3722 [inline]
    __kmalloc_track_caller+0x157/0x760 mm/slab.c:3737
    kstrdup+0x39/0x70 mm/util.c:49
    bpf_symlink+0x26/0x140 kernel/bpf/inode.c:356
    vfs_symlink+0x37a/0x5d0 fs/namei.c:4127
    do_symlinkat+0x242/0x2d0 fs/namei.c:4154
    __do_sys_symlink fs/namei.c:4173 [inline]
    __se_sys_symlink fs/namei.c:4171 [inline]
    __x64_sys_symlink+0x59/0x80 fs/namei.c:4171
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

  Freed by task 9425:
    save_stack+0x43/0xd0 mm/kasan/kasan.c:448
    set_track mm/kasan/kasan.c:460 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
    kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
    __cache_free mm/slab.c:3498 [inline]
    kfree+0xcf/0x230 mm/slab.c:3817
    bpf_evict_inode+0x11f/0x150 kernel/bpf/inode.c:565
    evict+0x4b9/0x980 fs/inode.c:558
    iput_final fs/inode.c:1550 [inline]
    iput+0x674/0xa90 fs/inode.c:1576
    do_unlinkat+0x733/0xa30 fs/namei.c:4069
    __do_sys_unlink fs/namei.c:4110 [inline]
    __se_sys_unlink fs/namei.c:4108 [inline]
    __x64_sys_unlink+0x42/0x50 fs/namei.c:4108
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

In this scenario path lookup under RCU is racing with the final
unlink in case of symlinks. As Linus puts it in his analysis:

  [...] We actually RCU-delay the inode freeing itself, but
  when we do the final iput(), the "evict()" function is called
  synchronously. Now, the simple fix would seem to just RCU-delay
  the kfree() of the symlink data in bpf_evict_inode(). Maybe
  that's the right thing to do. [...]

Al suggested to piggy-back on the ->destroy_inode() callback in
order to implement RCU deferral there which can then kfree() the
inode->i_link eventually right before putting inode back into
inode cache. By reusing free_inode_nonrcu() from there we can
avoid the need for our own inode cache and just reuse generic
one as we currently do.

And in-fact on top of all this we should just get rid of the
bpf_evict_inode() entirely. This means truncate_inode_pages_final()
and clear_inode() will then simply be called by the fs core via
evict(). Dropping the reference should really only be done when
inode is unhashed and nothing reachable anymore, so it's better
also moved into the final ->destroy_inode() callback.

Fixes: 0f98621bef5d ("bpf, inode: add support for symlinks and fix mtime/ctime")
Reported-by: syzbot+fb731ca573367b7f6564@syzkaller.appspotmail.com
Reported-by: syzbot+a13e5ead792d6df37818@syzkaller.appspotmail.com
Reported-by: syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/lkml/0000000000006946d2057bbd0eef@google.com/T/
---
 kernel/bpf/inode.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 2ada5e21dfa6..4a8f390a2b82 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -554,19 +554,6 @@ struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type typ
 }
 EXPORT_SYMBOL(bpf_prog_get_type_path);

-static void bpf_evict_inode(struct inode *inode)
-{
-	enum bpf_type type;
-
-	truncate_inode_pages_final(&inode->i_data);
-	clear_inode(inode);
-
-	if (S_ISLNK(inode->i_mode))
-		kfree(inode->i_link);
-	if (!bpf_inode_type(inode, &type))
-		bpf_any_put(inode->i_private, type);
-}
-
 /*
  * Display the mount options in /proc/mounts.
  */
@@ -579,11 +566,28 @@ static int bpf_show_options(struct seq_file *m, struct dentry *root)
 	return 0;
 }

+static void bpf_destroy_inode_deferred(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	enum bpf_type type;
+
+	if (S_ISLNK(inode->i_mode))
+		kfree(inode->i_link);
+	if (!bpf_inode_type(inode, &type))
+		bpf_any_put(inode->i_private, type);
+	free_inode_nonrcu(inode);
+}
+
+static void bpf_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, bpf_destroy_inode_deferred);
+}
+
 static const struct super_operations bpf_super_ops = {
 	.statfs		= simple_statfs,
 	.drop_inode	= generic_delete_inode,
 	.show_options	= bpf_show_options,
-	.evict_inode	= bpf_evict_inode,
+	.destroy_inode	= bpf_destroy_inode,
 };

 enum {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 22:04             ` Daniel Borkmann
@ 2019-03-25 22:13               ` Linus Torvalds
  2019-03-25 22:41                 ` Daniel Borkmann
  2019-03-25 22:49               ` Al Viro
  1 sibling, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2019-03-25 22:13 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Al Viro, syzbot, Alexei Starovoitov, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

On Mon, Mar 25, 2019 at 3:04 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> I'm fine either way, I think the rcu_destroy_inode would indeed simplify
> it nicely. In any case fwiw, here's what I'd have ready for standby on bpf
> side and tested as well. Decided to get rid of bpf_evict_inode() entirely
> since the only callback we'd really need is on final inode destruction:

Yes, this looks correct to me.

I think this is worth doing regardless. Even if we then make the vfs
layer add that rcu_destroy_inode(), that will fit very well with this
patch, and getting rid of the special bpf_evict_inode() logic and just
letting the normal vfs inode cleanup happen looks like the right
thing.

So ack from me on your

> Subject: [PATCH bpf] bpf: fix use after free in bpf_evict_inode

patch regardless of what else we might end up doing in this area to
clean things up.

                 Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 22:13               ` Linus Torvalds
@ 2019-03-25 22:41                 ` Daniel Borkmann
  0 siblings, 0 replies; 37+ messages in thread
From: Daniel Borkmann @ 2019-03-25 22:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, syzbot, Alexei Starovoitov, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

On 03/25/2019 11:13 PM, Linus Torvalds wrote:
> On Mon, Mar 25, 2019 at 3:04 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> I'm fine either way, I think the rcu_destroy_inode would indeed simplify
>> it nicely. In any case fwiw, here's what I'd have ready for standby on bpf
>> side and tested as well. Decided to get rid of bpf_evict_inode() entirely
>> since the only callback we'd really need is on final inode destruction:
> 
> Yes, this looks correct to me.
> 
> I think this is worth doing regardless. Even if we then make the vfs
> layer add that rcu_destroy_inode(), that will fit very well with this
> patch, and getting rid of the special bpf_evict_inode() logic and just
> letting the normal vfs inode cleanup happen looks like the right
> thing.
> 
> So ack from me on your
> 
>> Subject: [PATCH bpf] bpf: fix use after free in bpf_evict_inode
> 
> patch regardless of what else we might end up doing in this area to
> clean things up.

Ok, thanks a lot. In that case, I just pushed this one to bpf tree, and
it will come your way the usual route via David.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 19:43       ` Al Viro
@ 2019-03-25 22:48         ` Dave Chinner
  2019-03-25 23:02           ` Al Viro
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2019-03-25 22:48 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann,
	linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs

On Mon, Mar 25, 2019 at 07:43:32PM +0000, Al Viro wrote:
> On Mon, Mar 25, 2019 at 11:36:01AM -0700, Linus Torvalds wrote:
> > Right. Not just move the existing destroy_inode() - because as you
> > say, people may not be able to to do that in RCU contect, but split it
> > up, and add a "final_free_inode()" callback or something for the RCU
> > phase.
> > 
> > In fact, I suspect that *every* user of destroy_inode() already has to
> > have its own RCU callback to another final freeing function anyway.
> 
> Nope - pipes do not, and for a good reason.
> 
> > Because they really shouldn't free the inode itself early. Maybe we
> > can just make that be a generic thing?
> 
> Maybe...  OTOH, we already have more methods on the destruction end
> than I'm comfortable trying to document.  Because we clearly *do*
> need a clear set of rules along the lines of "this kind of thing belongs
> in this method, this - in that one".
> 
> As it is, on the way to inode destruction we have
> 
> 1) [kinda] ->drop_inode() - decide whether this inode is not worth
> keeping around in icache once the refcount reaches zero.  Predicate,
> shouldn't change inode state at all.    Common instances:
> 	* default (encoded as NULL, available as generic_drop_inode()) -
> "keep it around if it's still hashed and link count is non-zero".
> 	* generic_delete_inode(): "don't retain that sucker"
> 
> However, 3 instances are more or less weird - f2fs, gfs and ocfs2.
> gfs2 one is the least odd of those, but the other two...
> What the hell is forced writeback doing in ocfs2_drop_inode()?
> If they don't want to retain anything, fine, but then why do
> the
>         inode->i_state |= I_WILL_FREE;
>         spin_unlock(&inode->i_lock);
>         write_inode_now(inode, 1);
>         spin_lock(&inode->i_lock);
>         WARN_ON(inode->i_state & I_NEW);
>         inode->i_state &= ~I_WILL_FREE;
> dance in ->drop_inode()?  It will be immediately followed by
> ->evict_inode(), and it feels like the damn thing would be
> a lot more natural there...  And what, for pity sake, f2fs is
> doing with truncation in that predicate, of all places?  The
> comment in there is
>         /*
>          * This is to avoid a deadlock condition like below.
>          * writeback_single_inode(inode)
>          *  - f2fs_write_data_page
>          *    - f2fs_gc -> iput -> evict
>          *       - inode_wait_for_writeback(inode)
>          */
> which looks... uninspiring, to put it mildly.
> 
> 2) ->evict_inode() - called when we kick the inode out.  Freeing
> on-disk inodes, etc. belongs there. inode is still in icache
> hash chain at that point, so any icache lookups for it will block
> until that thing is done.  If we have something non-trivial done
> by iget... test() callbacks, we must keep the data structures needed
> by those.
> 
> 3) ->destroy_inode() - destructor.  By that point all remaining
> references to inode are (stale) RCU ones.

At the VFS layer, perhaps, but the filesystem itself can still have
valid references to the struct inode the VFS knows nothing about.
i.e. filesystems typically wrap the struct inode in their own
extended inode structure and hence have more information in them
and a life cycle that the VFS knows nothing about. i.e. the VFS _does
not own_ the struct inode that the filesystem allocated for it's
use.

XFS has an inode cache layer below the VFS that provides an inode
life cycle that wraps around the outside of the VFS inode life
cycle. We do not use the VFS inode cache at all, and
fake the fact the inode is in the "icache hash chain" so that the
VFS behaves exactly as it should.

Further, we use lockless, RCU protected lookups on the XFS inode
cache, so we need to control XFS inode allocation and freeing
compeltely internally to XFS because it has RCU requirements the VFS
struct inode doesn't have.

And when it comes to VFS inode reclaim, XFS does not implement
->evict_inode because there is nothing at the VFS level to do.
And ->destroy_inode ends up doing cleanup work (e.g. freeing on-disk
inodes) which is non-trivial, blocking work, but then still requires
the struct xfs_inode to be written back to disk before it can bei
freed. So it just gets marked "reclaimable" and background reclaim
then takes care of it from there so we avoid synchronous IO in inode
reclaim...

This works because don't track dirty inode metadata in the VFS
writeback code (it's tracked with much more precision in the XFS log
infrastructure) and we don't write back inodes from the VFS
infrastructure, either. It's all done based on internal state
outside the VFS.

And, because of this, the VFS cannot assume that it can free
the struct inode after calling ->destroy_inode or even use
call_rcu() to run a filesystem destructor because the filesystem
may need to do work that needs to block and that's not allowed in an
RCU callback...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 22:04             ` Daniel Borkmann
  2019-03-25 22:13               ` Linus Torvalds
@ 2019-03-25 22:49               ` Al Viro
  1 sibling, 0 replies; 37+ messages in thread
From: Al Viro @ 2019-03-25 22:49 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Linus Torvalds, syzbot, Alexei Starovoitov, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

On Mon, Mar 25, 2019 at 11:04:53PM +0100, Daniel Borkmann wrote:

> +static void bpf_destroy_inode_deferred(struct rcu_head *head)
> +{
> +	struct inode *inode = container_of(head, struct inode, i_rcu);
> +	enum bpf_type type;
> +
> +	if (S_ISLNK(inode->i_mode))
> +		kfree(inode->i_link);
> +	if (!bpf_inode_type(inode, &type))
> +		bpf_any_put(inode->i_private, type);
> +	free_inode_nonrcu(inode);
> +}

Umm...  Can bpf_any_put() can be called in that context?  It's going
to run in softirq; if that's OK (I'm not familiar enough with BPF guts
to tell), sure, that's probably the best approach.

Note, BTW, that it's obviously -stable fodder, so any splitting
of ->destroy_inode() would live on top of that fix.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 22:48         ` Dave Chinner
@ 2019-03-25 23:02           ` Al Viro
       [not found]             ` <CAGe7X7mb=gK7zhSwmT_6mmmkcbjhZAOb=wj31BdUcHkNUPsm2Q@mail.gmail.com>
  2019-03-27 17:22             ` Jan Kara
  0 siblings, 2 replies; 37+ messages in thread
From: Al Viro @ 2019-03-25 23:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann,
	linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs,
	Jan Kara, Jaegeuk Kim, Joel Becker, Mark Fasheh

On Tue, Mar 26, 2019 at 09:48:23AM +1100, Dave Chinner wrote:

> And when it comes to VFS inode reclaim, XFS does not implement
> ->evict_inode because there is nothing at the VFS level to do.
> And ->destroy_inode ends up doing cleanup work (e.g. freeing on-disk
> inodes) which is non-trivial, blocking work, but then still requires
> the struct xfs_inode to be written back to disk before it can bei
> freed. So it just gets marked "reclaimable" and background reclaim
> then takes care of it from there so we avoid synchronous IO in inode
> reclaim...
> 
> This works because don't track dirty inode metadata in the VFS
> writeback code (it's tracked with much more precision in the XFS log
> infrastructure) and we don't write back inodes from the VFS
> infrastructure, either. It's all done based on internal state
> outside the VFS.
> 
> And, because of this, the VFS cannot assume that it can free
> the struct inode after calling ->destroy_inode or even use
> call_rcu() to run a filesystem destructor because the filesystem
> may need to do work that needs to block and that's not allowed in an
> RCU callback...

In Linus' patch that's what you get with non-NULL ->destroy_inode
+ NULL ->destroy_inode_rcu, so XFS won't be screwed by that.
Said that, yes, XFS adds another fun twist there (AFAICS, it's
the only in-tree filesystem that pulls that off).

I would really like some comments from f2fs and ocfs2 folks, as well
as Jan - he's had much more recent contact with writeback code than
I have...  Could somebody explain what's going on in f2fs and ocfs2
->drop_inode()?  It _should_ be just a predicate; looks like both
are playing very odd games to work around writeback problems and
I wonder if there's a cleaner solution for that.  I can try and dig
through maillist(s) archives, but I would really appreciate it
if somebody could give a braindump on the issues dealt with in there...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 21:45           ` Linus Torvalds
  2019-03-25 22:04             ` Daniel Borkmann
@ 2019-03-25 23:37             ` Al Viro
  2019-03-25 23:44               ` Alexei Starovoitov
                                 ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Al Viro @ 2019-03-25 23:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

On Mon, Mar 25, 2019 at 02:45:00PM -0700, Linus Torvalds wrote:
> On Mon, Mar 25, 2019 at 2:14 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Maybe, but we really need to come up with sane documentation on the
> > entire drop_inode/evict_inode/destroy_inode/rcu_destroy_inode
> > group ;-/
> 
> Yeah.
> 
> I actually think the "destroy_inode/rcu_destroy_inode" part is the
> simplest one to understand: it's just tearing down the inode, and the
> RCU part is (obviously, as shown by this thread) somewhat subtle, but
> at the same time not really all that hard to explain: "inodes that can
> be looked up through RCU lookup need to be destroyed with RCU delay".

... with at least "but remember that call_rcu()-scheduled stuff runs in
softirq, so there's a limit to what you can defer there; furthermore,
there are implications for any spinlocks taken in that callback".

> I think drop_inode->evict_inode is arguably the much more subtle
> stuff. That's the "we're not destroying things, we're making decisions
> about the state" kind of thing.

It's worse than that - another bloody subtle question is the location
of clear_inode() wrt the other work done in ->evict_inode() ;-/

> And we actually don't have any documentation at all for those two.
> Well, there's the "porting" thing, but..
>
> > And I want to understand the writeback-related issues
> > in ocfs2 and f2fs - the current kludges in those smell fishy.
> 
> I'm assuming you're talking about exactly that drop_inode() kind of subtlety.

Yes.

> At least for ocfs2, the whole "rcu_destroy_inode()" patch I sent out
> would simplify things a lot. *All* that the ocfs2_destroy_inode()
> function does is to schedule the inode freeing for RCU delay.
> 
> Assuming my patch is fine (as mentioned, it was entirely untested),
> that patch would actually simplify that a lot. Get rid of
> ocfs2_destroy_inode() entirely, and just make
> ocfs2_rcu_destroy_inode() do that kmem_cache_free(). Mucho clean-up
> (we have that ocfs2_rcu_destroy_inode() currently as
> ocfs2_i_callback(), but with the ugly rcu-head interface).

Sure, but I wonder if cleaner solution (if any) for whatever is going
on in their drop_inode/evict_inode area might spill into destroy_inode...

I certainly agree that having destroy_inode consist just of call_rcu()
is very common.  FWIW, non-trivial exceptions from that pattern are:

fs/afs/super.c:673:static void afs_destroy_inode(struct inode *inode)
fs/btrfs/inode.c:9215:void btrfs_destroy_inode(struct inode *inode)
fs/btrfs/inode.c:9202:void btrfs_test_destroy_inode(struct inode *inode)
fs/ceph/inode.c:530:void ceph_destroy_inode(struct inode *inode)
fs/ecryptfs/super.c:88:static void ecryptfs_destroy_inode(struct inode *inode)
fs/ext4/super.c:1106:static void ext4_destroy_inode(struct inode *inode)
fs/inode.c:228:void free_inode_nonrcu(struct inode *inode)
fs/fuse/inode.c:116:static void fuse_destroy_inode(struct inode *inode)
fs/hugetlbfs/inode.c:1052:static void hugetlbfs_destroy_inode(struct inode *inode)
fs/jfs/super.c:134:static void jfs_destroy_inode(struct inode *inode)
fs/ntfs/inode.c:341:void ntfs_destroy_big_inode(struct inode *inode)
fs/overlayfs/super.c:200:static void ovl_destroy_inode(struct inode *inode)
mm/shmem.c:3646:static void shmem_destroy_inode(struct inode *inode)
net/socket.c:266:static void sock_destroy_inode(struct inode *inode)
fs/ubifs/super.c:282:static void ubifs_destroy_inode(struct inode *inode)
fs/xfs/xfs_super.c:969:xfs_fs_destroy_inode(

Some of those could convert to that pattern, so it's even fewer than
that.  And getting rid of container_of in those callbacks would be
nice too.  One obvious observation, though - we want the actual
fixes to go before any method changes, for backporting reasons.

For debugfs it's clearly "use default ->evict_inode(), have explicit
->destroy_inode() using free_inode_nonrcu()" - there we have nothing
else done in ->evict_inode() and kfree is obviously safe in softirq.
I'll post that (or push to vfs.git#fixes), along with minimal fixes
for other 3.  If bpf_any_put() is softirq-safe, we'll have the full
set for -stable and the rest could be done on top of that.

Won't solve the documetation problem, unfortunately ;-/

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 23:37             ` Al Viro
@ 2019-03-25 23:44               ` Alexei Starovoitov
  2019-03-26  0:21                 ` Al Viro
  2019-03-26  1:38               ` ceph: fix use-after-free on symlink traversal Al Viro
  2019-03-26  1:45               ` KASAN: use-after-free Read in path_lookupat Al Viro
  2 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2019-03-25 23:44 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann,
	linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs

On Mon, Mar 25, 2019 at 11:37:32PM +0000, Al Viro wrote:
> 
> For debugfs it's clearly "use default ->evict_inode(), have explicit
> ->destroy_inode() using free_inode_nonrcu()" - there we have nothing
> else done in ->evict_inode() and kfree is obviously safe in softirq.
> I'll post that (or push to vfs.git#fixes), along with minimal fixes
> for other 3.  If bpf_any_put() is softirq-safe, we'll have the full
> set for -stable and the rest could be done on top of that.

Yes. bpf_any_put() is safe to be called from softirq.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 23:44               ` Alexei Starovoitov
@ 2019-03-26  0:21                 ` Al Viro
  0 siblings, 0 replies; 37+ messages in thread
From: Al Viro @ 2019-03-26  0:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann,
	linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs

On Mon, Mar 25, 2019 at 04:44:40PM -0700, Alexei Starovoitov wrote:
> On Mon, Mar 25, 2019 at 11:37:32PM +0000, Al Viro wrote:
> > 
> > For debugfs it's clearly "use default ->evict_inode(), have explicit
> > ->destroy_inode() using free_inode_nonrcu()" - there we have nothing
> > else done in ->evict_inode() and kfree is obviously safe in softirq.
> > I'll post that (or push to vfs.git#fixes), along with minimal fixes
> > for other 3.  If bpf_any_put() is softirq-safe, we'll have the full
> > set for -stable and the rest could be done on top of that.
> 
> Yes. bpf_any_put() is safe to be called from softirq.

In that case,

Acked-by: Al Viro <viro@zeniv.linux.org.uk>

on Daniel's patch.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* ceph: fix use-after-free on symlink traversal
  2019-03-25 23:37             ` Al Viro
  2019-03-25 23:44               ` Alexei Starovoitov
@ 2019-03-26  1:38               ` Al Viro
  2019-03-26  1:39                 ` jffs2: " Al Viro
                                   ` (4 more replies)
  2019-03-26  1:45               ` KASAN: use-after-free Read in path_lookupat Al Viro
  2 siblings, 5 replies; 37+ messages in thread
From: Al Viro @ 2019-03-26  1:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

free the symlink body after the same RCU delay we have for freeing the
struct inode itself, so that traversal during RCU pathwalk wouldn't step
into freed memory.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index e3346628efe2..2d61ddda9bf5 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -524,6 +524,7 @@ static void ceph_i_callback(struct rcu_head *head)
 	struct inode *inode = container_of(head, struct inode, i_rcu);
 	struct ceph_inode_info *ci = ceph_inode(inode);
 
+	kfree(ci->i_symlink);
 	kmem_cache_free(ceph_inode_cachep, ci);
 }
 
@@ -566,7 +567,6 @@ void ceph_destroy_inode(struct inode *inode)
 		}
 	}
 
-	kfree(ci->i_symlink);
 	while ((n = rb_first(&ci->i_fragtree)) != NULL) {
 		frag = rb_entry(n, struct ceph_inode_frag, node);
 		rb_erase(n, &ci->i_fragtree);

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* jffs2: fix use-after-free on symlink traversal
  2019-03-26  1:38               ` ceph: fix use-after-free on symlink traversal Al Viro
@ 2019-03-26  1:39                 ` Al Viro
  2019-03-26  1:40                 ` ubifs: " Al Viro
                                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 37+ messages in thread
From: Al Viro @ 2019-03-26  1:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

free the symlink body after the same RCU delay we have for freeing the 
struct inode itself, so that traversal during RCU pathwalk wouldn't step 
into freed memory.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/jffs2/readinode.c b/fs/jffs2/readinode.c
index 389ea53ea487..bccfc40b3a74 100644
--- a/fs/jffs2/readinode.c
+++ b/fs/jffs2/readinode.c
@@ -1414,11 +1414,6 @@ void jffs2_do_clear_inode(struct jffs2_sb_info *c, struct jffs2_inode_info *f)
 
 	jffs2_kill_fragtree(&f->fragtree, deleted?c:NULL);
 
-	if (f->target) {
-		kfree(f->target);
-		f->target = NULL;
-	}
-
 	fds = f->dents;
 	while(fds) {
 		fd = fds;
diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index bb6ae387469f..05d892c79339 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -47,7 +47,10 @@ static struct inode *jffs2_alloc_inode(struct super_block *sb)
 static void jffs2_i_callback(struct rcu_head *head)
 {
 	struct inode *inode = container_of(head, struct inode, i_rcu);
-	kmem_cache_free(jffs2_inode_cachep, JFFS2_INODE_INFO(inode));
+	struct jffs2_inode_info *f = JFFS2_INODE_INFO(inode);
+
+	kfree(f->target);
+	kmem_cache_free(jffs2_inode_cachep, f);
 }
 
 static void jffs2_destroy_inode(struct inode *inode)

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* ubifs: fix use-after-free on symlink traversal
  2019-03-26  1:38               ` ceph: fix use-after-free on symlink traversal Al Viro
  2019-03-26  1:39                 ` jffs2: " Al Viro
@ 2019-03-26  1:40                 ` Al Viro
  2019-03-26  1:43                 ` debugfs: " Al Viro
                                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 37+ messages in thread
From: Al Viro @ 2019-03-26  1:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

free the symlink body after the same RCU delay we have for freeing the 
struct inode itself, so that traversal during RCU pathwalk wouldn't step 
into freed memory.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 8dc2818fdd84..12628184772c 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -276,14 +276,12 @@ static void ubifs_i_callback(struct rcu_head *head)
 {
 	struct inode *inode = container_of(head, struct inode, i_rcu);
 	struct ubifs_inode *ui = ubifs_inode(inode);
+	kfree(ui->data);
 	kmem_cache_free(ubifs_inode_slab, ui);
 }
 
 static void ubifs_destroy_inode(struct inode *inode)
 {
-	struct ubifs_inode *ui = ubifs_inode(inode);
-
-	kfree(ui->data);
 	call_rcu(&inode->i_rcu, ubifs_i_callback);
 }
 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* debugfs: fix use-after-free on symlink traversal
  2019-03-26  1:38               ` ceph: fix use-after-free on symlink traversal Al Viro
  2019-03-26  1:39                 ` jffs2: " Al Viro
  2019-03-26  1:40                 ` ubifs: " Al Viro
@ 2019-03-26  1:43                 ` Al Viro
  2019-03-26 10:41                 ` ceph: " Jeff Layton
  2019-03-26 11:38                 ` Ilya Dryomov
  4 siblings, 0 replies; 37+ messages in thread
From: Al Viro @ 2019-03-26  1:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

symlink body shouldn't be freed without an RCU delay.  Switch debugfs to
->destroy_inode() and use of call_rcu(); free both the inode and symlink
body in the callback.  Similar to solution for bpf, only here it's even
more obvious that ->evict_inode() can be dropped.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 95b5e78c22b1..f25daa207421 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -163,19 +163,24 @@ static int debugfs_show_options(struct seq_file *m, struct dentry *root)
 	return 0;
 }
 
-static void debugfs_evict_inode(struct inode *inode)
+static void debugfs_i_callback(struct rcu_head *head)
 {
-	truncate_inode_pages_final(&inode->i_data);
-	clear_inode(inode);
+	struct inode *inode = container_of(head, struct inode, i_rcu);
 	if (S_ISLNK(inode->i_mode))
 		kfree(inode->i_link);
+	free_inode_nonrcu(inode);
+}
+
+static void debugfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, debugfs_i_callback);
 }
 
 static const struct super_operations debugfs_super_operations = {
 	.statfs		= simple_statfs,
 	.remount_fs	= debugfs_remount,
 	.show_options	= debugfs_show_options,
-	.evict_inode	= debugfs_evict_inode,
+	.destroy_inode	= debugfs_destroy_inode,
 };
 
 static void debugfs_release_dentry(struct dentry *dentry)

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 23:37             ` Al Viro
  2019-03-25 23:44               ` Alexei Starovoitov
  2019-03-26  1:38               ` ceph: fix use-after-free on symlink traversal Al Viro
@ 2019-03-26  1:45               ` Al Viro
  2019-04-10 18:11                 ` Al Viro
  2 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2019-03-26  1:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs

On Mon, Mar 25, 2019 at 11:37:32PM +0000, Al Viro wrote:

> For debugfs it's clearly "use default ->evict_inode(), have explicit
> ->destroy_inode() using free_inode_nonrcu()" - there we have nothing
> else done in ->evict_inode() and kfree is obviously safe in softirq.
> I'll post that (or push to vfs.git#fixes), along with minimal fixes
> for other 3.  If bpf_any_put() is softirq-safe, we'll have the full
> set for -stable and the rest could be done on top of that.
> 
> Won't solve the documetation problem, unfortunately ;-/

Posted; all of those (as well as Daniel's bpf patch) are Cc:stable
fodder.  Documentation is still, er, deficient...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
       [not found]             ` <CAGe7X7mb=gK7zhSwmT_6mmmkcbjhZAOb=wj31BdUcHkNUPsm2Q@mail.gmail.com>
@ 2019-03-26  4:15               ` Al Viro
  2019-03-27 16:58                 ` Jan Kara
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2019-03-26  4:15 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: Dave Chinner, Linus Torvalds, syzbot, Alexei Starovoitov,
	Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing,
	syzkaller-bugs, Jan Kara, Jaegeuk Kim, Joel Becker

On Mon, Mar 25, 2019 at 08:18:25PM -0700, Mark Fasheh wrote:

> Hey Al,
> 
> It's been a while since I've looked at that bit of code but it looks like
> Ocfs2 is syncing the inode to disk and disposing of it's memory
> representation (which would include the cluster locks held) so that other
> nodes get a chance to delete the potentially orphaned inode. In Ocfs2 we
> won't delete an inode if it exists in another nodes cache.

Wait a sec - what's the reason for forcing that write_inode_now(); why
doesn't the normal mechanism work?  I'm afraid I still don't get it -
we do wait for writeback in evict_inode(), or the local filesystems
wouldn't work.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ceph: fix use-after-free on symlink traversal
  2019-03-26  1:38               ` ceph: fix use-after-free on symlink traversal Al Viro
                                   ` (2 preceding siblings ...)
  2019-03-26  1:43                 ` debugfs: " Al Viro
@ 2019-03-26 10:41                 ` Jeff Layton
  2019-03-26 11:38                 ` Ilya Dryomov
  4 siblings, 0 replies; 37+ messages in thread
From: Jeff Layton @ 2019-03-26 10:41 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann,
	linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs

On Mon, Mar 25, 2019 at 9:39 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> free the symlink body after the same RCU delay we have for freeing the
> struct inode itself, so that traversal during RCU pathwalk wouldn't step
> into freed memory.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index e3346628efe2..2d61ddda9bf5 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -524,6 +524,7 @@ static void ceph_i_callback(struct rcu_head *head)
>         struct inode *inode = container_of(head, struct inode, i_rcu);
>         struct ceph_inode_info *ci = ceph_inode(inode);
>
> +       kfree(ci->i_symlink);
>         kmem_cache_free(ceph_inode_cachep, ci);
>  }
>
> @@ -566,7 +567,6 @@ void ceph_destroy_inode(struct inode *inode)
>                 }
>         }
>
> -       kfree(ci->i_symlink);
>         while ((n = rb_first(&ci->i_fragtree)) != NULL) {
>                 frag = rb_entry(n, struct ceph_inode_frag, node);
>                 rb_erase(n, &ci->i_fragtree);


Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ceph: fix use-after-free on symlink traversal
  2019-03-26  1:38               ` ceph: fix use-after-free on symlink traversal Al Viro
                                   ` (3 preceding siblings ...)
  2019-03-26 10:41                 ` ceph: " Jeff Layton
@ 2019-03-26 11:38                 ` Ilya Dryomov
  4 siblings, 0 replies; 37+ messages in thread
From: Ilya Dryomov @ 2019-03-26 11:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, syzbot, Alexei Starovoitov, Daniel Borkmann,
	linux-fsdevel, Linux List Kernel Mailing, syzkaller-bugs

On Tue, Mar 26, 2019 at 2:39 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> free the symlink body after the same RCU delay we have for freeing the
> struct inode itself, so that traversal during RCU pathwalk wouldn't step
> into freed memory.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index e3346628efe2..2d61ddda9bf5 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -524,6 +524,7 @@ static void ceph_i_callback(struct rcu_head *head)
>         struct inode *inode = container_of(head, struct inode, i_rcu);
>         struct ceph_inode_info *ci = ceph_inode(inode);
>
> +       kfree(ci->i_symlink);
>         kmem_cache_free(ceph_inode_cachep, ci);
>  }
>
> @@ -566,7 +567,6 @@ void ceph_destroy_inode(struct inode *inode)
>                 }
>         }
>
> -       kfree(ci->i_symlink);
>         while ((n = rb_first(&ci->i_fragtree)) != NULL) {
>                 frag = rb_entry(n, struct ceph_inode_frag, node);
>                 rb_erase(n, &ci->i_fragtree);

Al, I see you directed this patch at Linus instead of ceph-devel.
I can pick it up for -rc3 as I have an important libceph fix pending
anyway.  Let me know if you want me to handle it.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-26  4:15               ` Al Viro
@ 2019-03-27 16:58                 ` Jan Kara
  2019-03-27 18:59                   ` Al Viro
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Kara @ 2019-03-27 16:58 UTC (permalink / raw)
  To: Al Viro
  Cc: Mark Fasheh, Dave Chinner, Linus Torvalds, syzbot,
	Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs, Jan Kara, Jaegeuk Kim,
	Joel Becker

On Tue 26-03-19 04:15:10, Al Viro wrote:
> On Mon, Mar 25, 2019 at 08:18:25PM -0700, Mark Fasheh wrote:
> 
> > Hey Al,
> > 
> > It's been a while since I've looked at that bit of code but it looks like
> > Ocfs2 is syncing the inode to disk and disposing of it's memory
> > representation (which would include the cluster locks held) so that other
> > nodes get a chance to delete the potentially orphaned inode. In Ocfs2 we
> > won't delete an inode if it exists in another nodes cache.
> 
> Wait a sec - what's the reason for forcing that write_inode_now(); why
> doesn't the normal mechanism work?  I'm afraid I still don't get it -
> we do wait for writeback in evict_inode(), or the local filesystems
> wouldn't work.

I'm just guessing here but they don't want an inode cached once its last
dentry goes away (it makes cluster wide synchronization easier for them and
they do play tricks with cluster lock on dentries). There is some info in
513e2dae9422 "ocfs2: flush inode data to disk and free inode when i_count
becomes zero" which adds this ocfs2_drop_inode() implementation. So when
the last inode reference is dropped, they want to flush any dirty data to
disk and evict the inode. But AFAICT they should be fine with flushing the
inode from their ->evict_inode method. I_FREEING just stops the flusher
thread from touching the inode but explicit writeback through
write_inode_now(inode, 1) should go through just fine.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-25 23:02           ` Al Viro
       [not found]             ` <CAGe7X7mb=gK7zhSwmT_6mmmkcbjhZAOb=wj31BdUcHkNUPsm2Q@mail.gmail.com>
@ 2019-03-27 17:22             ` Jan Kara
  1 sibling, 0 replies; 37+ messages in thread
From: Jan Kara @ 2019-03-27 17:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Chinner, Linus Torvalds, syzbot, Alexei Starovoitov,
	Daniel Borkmann, linux-fsdevel, Linux List Kernel Mailing,
	syzkaller-bugs, Jan Kara, Jaegeuk Kim, Joel Becker, Mark Fasheh

On Mon 25-03-19 23:02:11, Al Viro wrote:
> On Tue, Mar 26, 2019 at 09:48:23AM +1100, Dave Chinner wrote:
> 
> > And when it comes to VFS inode reclaim, XFS does not implement
> > ->evict_inode because there is nothing at the VFS level to do.
> > And ->destroy_inode ends up doing cleanup work (e.g. freeing on-disk
> > inodes) which is non-trivial, blocking work, but then still requires
> > the struct xfs_inode to be written back to disk before it can bei
> > freed. So it just gets marked "reclaimable" and background reclaim
> > then takes care of it from there so we avoid synchronous IO in inode
> > reclaim...
> > 
> > This works because don't track dirty inode metadata in the VFS
> > writeback code (it's tracked with much more precision in the XFS log
> > infrastructure) and we don't write back inodes from the VFS
> > infrastructure, either. It's all done based on internal state
> > outside the VFS.
> > 
> > And, because of this, the VFS cannot assume that it can free
> > the struct inode after calling ->destroy_inode or even use
> > call_rcu() to run a filesystem destructor because the filesystem
> > may need to do work that needs to block and that's not allowed in an
> > RCU callback...
> 
> In Linus' patch that's what you get with non-NULL ->destroy_inode
> + NULL ->destroy_inode_rcu, so XFS won't be screwed by that.
> Said that, yes, XFS adds another fun twist there (AFAICS, it's
> the only in-tree filesystem that pulls that off).
> 
> I would really like some comments from f2fs and ocfs2 folks, as well
> as Jan - he's had much more recent contact with writeback code than
> I have...  Could somebody explain what's going on in f2fs and ocfs2
> ->drop_inode()?  It _should_ be just a predicate; looks like both
> are playing very odd games to work around writeback problems and
> I wonder if there's a cleaner solution for that.  I can try and dig
> through maillist(s) archives, but I would really appreciate it
> if somebody could give a braindump on the issues dealt with in there...

OCFS2 is discussed elsewhere and should be relatively easy to deal with.
F2FS seems harder. The problem is that AFAICS they get inode references
from their garbage collection code which can get called during page
writeback. And then they need to drop these references and they can be the
last ones to hold the inode reference for an unlinked inode forcing flush
worker into inode cleanup. Which generally causes problems and was the
reason why writeback code does not take inode references but relies on
I_SYNC to protect it from inode reclaim instead (see commit 169ebd90131b
"writeback: Avoid iput() from flusher thread"). And they noticed the
problem as well and hacked around it... Now I don't know enough about F2FS
and its garbage collection to tell how they can avoid dropping inode
references from flush worker context. But that's the right solution for
avoiding deadlocks.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-27 16:58                 ` Jan Kara
@ 2019-03-27 18:59                   ` Al Viro
  2019-03-28  9:00                     ` Jan Kara
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2019-03-27 18:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Mark Fasheh, Dave Chinner, Linus Torvalds, syzbot,
	Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs, Jaegeuk Kim,
	Joel Becker

On Wed, Mar 27, 2019 at 05:58:31PM +0100, Jan Kara wrote:
> On Tue 26-03-19 04:15:10, Al Viro wrote:
> > On Mon, Mar 25, 2019 at 08:18:25PM -0700, Mark Fasheh wrote:
> > 
> > > Hey Al,
> > > 
> > > It's been a while since I've looked at that bit of code but it looks like
> > > Ocfs2 is syncing the inode to disk and disposing of it's memory
> > > representation (which would include the cluster locks held) so that other
> > > nodes get a chance to delete the potentially orphaned inode. In Ocfs2 we
> > > won't delete an inode if it exists in another nodes cache.
> > 
> > Wait a sec - what's the reason for forcing that write_inode_now(); why
> > doesn't the normal mechanism work?  I'm afraid I still don't get it -
> > we do wait for writeback in evict_inode(), or the local filesystems
> > wouldn't work.
> 
> I'm just guessing here but they don't want an inode cached once its last
> dentry goes away (it makes cluster wide synchronization easier for them and
> they do play tricks with cluster lock on dentries).

Sure, but that's as simple as "return 1 from ->drop_inode()".

> There is some info in
> 513e2dae9422 "ocfs2: flush inode data to disk and free inode when i_count
> becomes zero" which adds this ocfs2_drop_inode() implementation. So when
> the last inode reference is dropped, they want to flush any dirty data to
> disk and evict the inode. But AFAICT they should be fine with flushing the
> inode from their ->evict_inode method. I_FREEING just stops the flusher
> thread from touching the inode but explicit writeback through
> write_inode_now(inode, 1) should go through just fine.

Umm...  Why is that write_inode_now() needed in either place?  I agree that
moving it to ->evict_inode() ought to be safe, but what makes it necessary
in the first place?  Put it another way, what dirties the data and/or
metadata without marking it dirty?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-27 18:59                   ` Al Viro
@ 2019-03-28  9:00                     ` Jan Kara
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Kara @ 2019-03-28  9:00 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Mark Fasheh, Dave Chinner, Linus Torvalds, syzbot,
	Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs, Jaegeuk Kim,
	Joel Becker

On Wed 27-03-19 18:59:48, Al Viro wrote:
> On Wed, Mar 27, 2019 at 05:58:31PM +0100, Jan Kara wrote:
> > On Tue 26-03-19 04:15:10, Al Viro wrote:
> > > On Mon, Mar 25, 2019 at 08:18:25PM -0700, Mark Fasheh wrote:
> > > 
> > > > Hey Al,
> > > > 
> > > > It's been a while since I've looked at that bit of code but it looks like
> > > > Ocfs2 is syncing the inode to disk and disposing of it's memory
> > > > representation (which would include the cluster locks held) so that other
> > > > nodes get a chance to delete the potentially orphaned inode. In Ocfs2 we
> > > > won't delete an inode if it exists in another nodes cache.
> > > 
> > > Wait a sec - what's the reason for forcing that write_inode_now(); why
> > > doesn't the normal mechanism work?  I'm afraid I still don't get it -
> > > we do wait for writeback in evict_inode(), or the local filesystems
> > > wouldn't work.
> > 
> > I'm just guessing here but they don't want an inode cached once its last
> > dentry goes away (it makes cluster wide synchronization easier for them and
> > they do play tricks with cluster lock on dentries).
> 
> Sure, but that's as simple as "return 1 from ->drop_inode()".

Right.

> > There is some info in
> > 513e2dae9422 "ocfs2: flush inode data to disk and free inode when i_count
> > becomes zero" which adds this ocfs2_drop_inode() implementation. So when
> > the last inode reference is dropped, they want to flush any dirty data to
> > disk and evict the inode. But AFAICT they should be fine with flushing the
> > inode from their ->evict_inode method. I_FREEING just stops the flusher
> > thread from touching the inode but explicit writeback through
> > write_inode_now(inode, 1) should go through just fine.
> 
> Umm...  Why is that write_inode_now() needed in either place?  I agree that
> moving it to ->evict_inode() ought to be safe, but what makes it necessary
> in the first place?  Put it another way, what dirties the data and/or
> metadata without marking it dirty?

Well, the inode & pages are marked dirty and they are dirty when we get to
iput_final(). But if ->drop_inode() returns 1 (which normally happens only
for unlinked files), we will not write out the inode in iput_final() and
the dirty data just gets discarded in ->evict_inode(). OCFS2 doesn't want
this so they have to write-out by hand.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-03-26  1:45               ` KASAN: use-after-free Read in path_lookupat Al Viro
@ 2019-04-10 18:11                 ` Al Viro
  2019-04-10 19:44                   ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2019-04-10 18:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs, James Morris,
	linux-security-module

On Tue, Mar 26, 2019 at 01:45:52AM +0000, Al Viro wrote:
> On Mon, Mar 25, 2019 at 11:37:32PM +0000, Al Viro wrote:
> 
> > For debugfs it's clearly "use default ->evict_inode(), have explicit
> > ->destroy_inode() using free_inode_nonrcu()" - there we have nothing
> > else done in ->evict_inode() and kfree is obviously safe in softirq.
> > I'll post that (or push to vfs.git#fixes), along with minimal fixes
> > for other 3.  If bpf_any_put() is softirq-safe, we'll have the full
> > set for -stable and the rest could be done on top of that.
> > 
> > Won't solve the documetation problem, unfortunately ;-/
> 
> Posted; all of those (as well as Daniel's bpf patch) are Cc:stable
> fodder.  Documentation is still, er, deficient...

... and unfortunately there are two more, exactly like debugfs -
securityfs and apparmorfs, found while sorting out the series
for separate rcu-delayed counterpart of ->destroy_inode().

Both are in vfs.git#fixes.  Which way should that go - directly or
via linux-security.git?  Both are stable fodder, in theory, but
much harder to hit than their ubifs/debugfs/bpf counterparts...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: KASAN: use-after-free Read in path_lookupat
  2019-04-10 18:11                 ` Al Viro
@ 2019-04-10 19:44                   ` Linus Torvalds
  0 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2019-04-10 19:44 UTC (permalink / raw)
  To: Al Viro
  Cc: syzbot, Alexei Starovoitov, Daniel Borkmann, linux-fsdevel,
	Linux List Kernel Mailing, syzkaller-bugs, James Morris,
	linux-security-module

On Wed, Apr 10, 2019 at 8:11 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Both are in vfs.git#fixes.  Which way should that go - directly or
> via linux-security.git?

Just do it directly. I doubt you can trigger them for securityfs and
apparmourfs, since normal users have no way to remove any files from
them, so the race with final unlink sounds fairly irrelevant in
practice, no?

               Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2019-04-10 19:45 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-28 17:40 KASAN: use-after-free Read in path_lookupat syzbot
2019-03-25  0:44 ` syzbot
2019-03-25  1:25   ` Linus Torvalds
2019-03-25  1:23 ` Linus Torvalds
2019-03-25  4:57   ` Al Viro
2019-03-25  9:15     ` Daniel Borkmann
2019-03-25 11:11       ` Al Viro
2019-03-25 11:17         ` Al Viro
2019-03-25 11:21           ` Daniel Borkmann
2019-03-25 18:36     ` Linus Torvalds
2019-03-25 19:18       ` Linus Torvalds
2019-03-25 21:14         ` Al Viro
2019-03-25 21:45           ` Linus Torvalds
2019-03-25 22:04             ` Daniel Borkmann
2019-03-25 22:13               ` Linus Torvalds
2019-03-25 22:41                 ` Daniel Borkmann
2019-03-25 22:49               ` Al Viro
2019-03-25 23:37             ` Al Viro
2019-03-25 23:44               ` Alexei Starovoitov
2019-03-26  0:21                 ` Al Viro
2019-03-26  1:38               ` ceph: fix use-after-free on symlink traversal Al Viro
2019-03-26  1:39                 ` jffs2: " Al Viro
2019-03-26  1:40                 ` ubifs: " Al Viro
2019-03-26  1:43                 ` debugfs: " Al Viro
2019-03-26 10:41                 ` ceph: " Jeff Layton
2019-03-26 11:38                 ` Ilya Dryomov
2019-03-26  1:45               ` KASAN: use-after-free Read in path_lookupat Al Viro
2019-04-10 18:11                 ` Al Viro
2019-04-10 19:44                   ` Linus Torvalds
2019-03-25 19:43       ` Al Viro
2019-03-25 22:48         ` Dave Chinner
2019-03-25 23:02           ` Al Viro
     [not found]             ` <CAGe7X7mb=gK7zhSwmT_6mmmkcbjhZAOb=wj31BdUcHkNUPsm2Q@mail.gmail.com>
2019-03-26  4:15               ` Al Viro
2019-03-27 16:58                 ` Jan Kara
2019-03-27 18:59                   ` Al Viro
2019-03-28  9:00                     ` Jan Kara
2019-03-27 17:22             ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).