* fs: GPF in locked_inode_to_wb_and_lock_list @ 2016-04-18 9:44 Dmitry Vyukov 2016-04-20 21:14 ` Tejun Heo 0 siblings, 1 reply; 17+ messages in thread From: Dmitry Vyukov @ 2016-04-18 9:44 UTC (permalink / raw) To: Alexander Viro, linux-fsdevel, LKML, Tejun Heo Cc: syzkaller, Kostya Serebryany, Alexander Potapenko Hello, I've hit the following GPF while running syzkaller fuzzer on commit 806fdcce017dc98c4dbf8ed001750a0d7d2bb0af (Apr 14): general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN Modules linked in: CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Workqueue: writeback wb_workfn (flush-11:0) task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000 RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050 RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286 RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000 R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0 R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40 FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0 DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 Stack: ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948 ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948 ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e Call Trace: [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309 [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554 [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600 [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709 [< inline >] wb_do_writeback fs/fs-writeback.c:1844 [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884 [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094 [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228 [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392 Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3 RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212 RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 fs/fs-writeback.c:281 RSP <ffff88006cdaf7d0> ---[ end trace 986a4d314dcb2694 ]--- The crash happened here: if (wb != &wb->bdi->wb) ffffffff818884cb: 48 89 d8 mov %rbx,%rax ffffffff818884ce: 48 c1 e8 03 shr $0x3,%rax ffffffff818884d2: 42 80 3c 28 00 cmpb $0x0,(%rax,%r13,1) ffffffff818884d7: 0f 85 17 06 00 00 jne ffffffff81888af4 <locked_inode_to_wb_and_lock_list+0x6c4> ffffffff818884dd: 48 8b 03 mov (%rbx),%rax ffffffff818884e0: 48 83 c0 50 add $0x50,%rax ffffffff818884e4: 48 39 c3 cmp %rax,%rbx ffffffff818884e7: 0f 84 c3 00 00 00 je ffffffff818885b0 <locked_inode_to_wb_and_lock_list+0x180> Which means that bdi is NULL (if I get indirections right). The block device is flush-11:0 (/dev/sr0). I was able to reproduce it once by re-running programs from the crash log: https://gist.githubusercontent.com/dvyukov/7c9e6358b661e2d50a774a70b5ee348a/raw/5cd94e32bbcf2a788b5efd0768cfd883c5921070/gistfile1.txt But it took half an hour, and then another hour long run did not trigger it. So it is probably triggered by some rare race. Thank you ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fs: GPF in locked_inode_to_wb_and_lock_list 2016-04-18 9:44 fs: GPF in locked_inode_to_wb_and_lock_list Dmitry Vyukov @ 2016-04-20 21:14 ` Tejun Heo 2016-04-21 8:25 ` Dmitry Vyukov 2016-04-21 8:35 ` Dmitry Vyukov 0 siblings, 2 replies; 17+ messages in thread From: Tejun Heo @ 2016-04-20 21:14 UTC (permalink / raw) To: Dmitry Vyukov Cc: Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko Hello, Dmitry. On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote: > general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN ... > RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>] > locked_inode_to_wb_and_lock_list+0xa2/0x750 > RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246 > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050 > RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286 > RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000 > R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0 > R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40 > FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0 > DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 > Stack: > ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948 > ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948 > ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e > Call Trace: > [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309 > [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554 > [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600 > [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709 > [< inline >] wb_do_writeback fs/fs-writeback.c:1844 > [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884 > [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094 > [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228 > [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 > [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392 > Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e > 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42> > 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3 > RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212 > RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 > fs/fs-writeback.c:281 > RSP <ffff88006cdaf7d0> Man, that's a beautiful trace w/ decoding of inline functions. When did we start doing that? Is there a specific config option for this? > ---[ end trace 986a4d314dcb2694 ]--- > The crash happened here: > > if (wb != &wb->bdi->wb) > ffffffff818884cb: 48 89 d8 mov %rbx,%rax > ffffffff818884ce: 48 c1 e8 03 shr $0x3,%rax > ffffffff818884d2: 42 80 3c 28 00 cmpb $0x0,(%rax,%r13,1) So, it's the above instruction. > ffffffff818884d7: 0f 85 17 06 00 00 jne > ffffffff81888af4 <locked_inode_to_wb_and_lock_list+0x6c4> > ffffffff818884dd: 48 8b 03 mov (%rbx),%rax > ffffffff818884e0: 48 83 c0 50 add $0x50,%rax > ffffffff818884e4: 48 39 c3 cmp %rax,%rbx > ffffffff818884e7: 0f 84 c3 00 00 00 je > ffffffff818885b0 <locked_inode_to_wb_and_lock_list+0x180> > > Which means that bdi is NULL (if I get indirections right). So, the wb != &wb->bdi->wb comparison would be the cmp at 0xffffffff818884e4 and given that it just compares the address of &bdi->wb, bdi being NULL wouldn't trigger the fault. cmpb $0x0,(%rax,%r13,1) -> *(u8 *)(%rax + %r13) == 0 -> *(u8 *)((%rbx >> 3) + %r13) == 0 Where can that be from? I can't find anything matching even in the surrounding functions. Hmmm... The base address %r13 is 0xdffffc0000000000 which isn't a proper canonical address and in general suspcious. Ooh, it's KASAN_SHADOW_OFFSET. It looks like something is making KASAN trigger a fault. Can we please bring in someone who's more familiar with KASAN? Thanks. -- tejun ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fs: GPF in locked_inode_to_wb_and_lock_list 2016-04-20 21:14 ` Tejun Heo @ 2016-04-21 8:25 ` Dmitry Vyukov 2016-04-21 9:10 ` Andrey Ryabinin 2016-04-21 16:14 ` Tejun Heo 2016-04-21 8:35 ` Dmitry Vyukov 1 sibling, 2 replies; 17+ messages in thread From: Dmitry Vyukov @ 2016-04-21 8:25 UTC (permalink / raw) To: Tejun Heo Cc: Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko On Wed, Apr 20, 2016 at 11:14 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, Dmitry. > > On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote: > >> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN > ... >> RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>] >> locked_inode_to_wb_and_lock_list+0xa2/0x750 >> RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246 >> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050 >> RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286 >> RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000 >> R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0 >> R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40 >> FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0 >> DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 >> Stack: >> ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948 >> ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948 >> ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e >> Call Trace: >> [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309 >> [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554 >> [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600 >> [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709 >> [< inline >] wb_do_writeback fs/fs-writeback.c:1844 >> [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884 >> [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094 >> [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228 >> [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 >> [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392 >> Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e >> 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42> >> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3 >> RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212 >> RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 >> fs/fs-writeback.c:281 >> RSP <ffff88006cdaf7d0> > > Man, that's a beautiful trace w/ decoding of inline functions. When > did we start doing that? Is there a specific config option for this? Thanks! :) I use this script for symbolization: https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py It invokes addr2line to provide file:line info, adds inline frames, strips ? frames (are they ever useful?) and strips timestamps. You just need to pipe a report through it. It assumes that vmlinux is in the current dir, but you can override it with --linux flag. It can also insert actual source lines into a report (amount of context is controlled with flags), for example: Call Trace: [< inline >] khugepaged_scan_pmd include/linux/uaccess.h:9 8 { 9 current->pagefault_disabled++; 10 } [< inline >] khugepaged_scan_mm_slot mm/huge_memory.c:2716 2715 hend); 2716 ret = khugepaged_scan_pmd(mm, vma, 2717 khugepaged_scan.address, [< inline >] khugepaged_do_scan mm/huge_memory.c:2796 2795 pass_through_head < 2) 2796 progress += khugepaged_scan_mm_slot(pages - progress, 2797 &hpage); [<ffffffff81790b33>] khugepaged+0x993/0x48e0 mm/huge_memory.c:2831 2830 while (!kthread_should_stop()) { 2831 khugepaged_do_scan(); 2832 khugepaged_wait_work(); [<ffffffff813c195f>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 1302 add_wait_queue(k->waitq, &wait); 1303 __set_current_state(TASK_INTERRUPTIBLE); 1304 } [<ffffffff866d1b2f>] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:405 404 SWAPGS 405 jmp restore_regs_and_iret 406 END(ret_from_fork) ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fs: GPF in locked_inode_to_wb_and_lock_list 2016-04-21 8:25 ` Dmitry Vyukov @ 2016-04-21 9:10 ` Andrey Ryabinin 2016-04-21 9:29 ` Dmitry Vyukov 2016-04-21 16:14 ` Tejun Heo 1 sibling, 1 reply; 17+ messages in thread From: Andrey Ryabinin @ 2016-04-21 9:10 UTC (permalink / raw) To: Dmitry Vyukov Cc: Tejun Heo, Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin 2016-04-21 11:25 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>: > On Wed, Apr 20, 2016 at 11:14 PM, Tejun Heo <tj@kernel.org> wrote: >> Hello, Dmitry. >> >> On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote: >> >>> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN >> ... >>> RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>] >>> locked_inode_to_wb_and_lock_list+0xa2/0x750 >>> RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246 >>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050 >>> RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286 >>> RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000 >>> R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0 >>> R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40 >>> FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000 >>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0 >>> DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000 >>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 >>> Stack: >>> ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948 >>> ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948 >>> ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e >>> Call Trace: >>> [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309 >>> [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554 >>> [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600 >>> [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709 >>> [< inline >] wb_do_writeback fs/fs-writeback.c:1844 >>> [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884 >>> [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094 >>> [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228 >>> [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 >>> [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392 >>> Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e >>> 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42> >>> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3 >>> RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212 >>> RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 >>> fs/fs-writeback.c:281 >>> RSP <ffff88006cdaf7d0> >> >> Man, that's a beautiful trace w/ decoding of inline functions. When >> did we start doing that? Is there a specific config option for this? > > Thanks! :) > I use this script for symbolization: > https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py BTW, it would be nice to have that script in kernel tree, instead of scripts/decode_stacktrace.sh which is far less advanced. > It invokes addr2line to provide file:line info, adds inline frames, > strips ? frames (are they ever useful?) and strips timestamps. > You just need to pipe a report through it. It assumes that vmlinux is > in the current dir, but you can override it with --linux flag. > > It can also insert actual source lines into a report (amount of > context is controlled with flags), for example: > > > Call Trace: > [< inline >] khugepaged_scan_pmd include/linux/uaccess.h:9 > 8 { > 9 current->pagefault_disabled++; > 10 } > [< inline >] khugepaged_scan_mm_slot mm/huge_memory.c:2716 > 2715 hend); > 2716 ret = khugepaged_scan_pmd(mm, vma, > 2717 > khugepaged_scan.address, > [< inline >] khugepaged_do_scan mm/huge_memory.c:2796 > 2795 pass_through_head < 2) > 2796 progress += > khugepaged_scan_mm_slot(pages - progress, > 2797 &hpage); > [<ffffffff81790b33>] khugepaged+0x993/0x48e0 mm/huge_memory.c:2831 > 2830 while (!kthread_should_stop()) { > 2831 khugepaged_do_scan(); > 2832 khugepaged_wait_work(); > [<ffffffff813c195f>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 > 1302 add_wait_queue(k->waitq, &wait); > 1303 __set_current_state(TASK_INTERRUPTIBLE); > 1304 } > [<ffffffff866d1b2f>] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:405 > 404 SWAPGS > 405 jmp restore_regs_and_iret > 406 END(ret_from_fork) ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fs: GPF in locked_inode_to_wb_and_lock_list 2016-04-21 9:10 ` Andrey Ryabinin @ 2016-04-21 9:29 ` Dmitry Vyukov 0 siblings, 0 replies; 17+ messages in thread From: Dmitry Vyukov @ 2016-04-21 9:29 UTC (permalink / raw) To: Andrey Ryabinin Cc: Tejun Heo, Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Andrey Konovalov On Thu, Apr 21, 2016 at 11:10 AM, Andrey Ryabinin <ryabinin.a.a@gmail.com> wrote: > 2016-04-21 11:25 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>: >> On Wed, Apr 20, 2016 at 11:14 PM, Tejun Heo <tj@kernel.org> wrote: >>> Hello, Dmitry. >>> >>> On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote: >>> >>>> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN >>> ... >>>> RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>] >>>> locked_inode_to_wb_and_lock_list+0xa2/0x750 >>>> RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246 >>>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050 >>>> RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286 >>>> RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000 >>>> R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0 >>>> R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40 >>>> FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000 >>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>> CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0 >>>> DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000 >>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 >>>> Stack: >>>> ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948 >>>> ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948 >>>> ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e >>>> Call Trace: >>>> [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309 >>>> [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554 >>>> [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600 >>>> [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709 >>>> [< inline >] wb_do_writeback fs/fs-writeback.c:1844 >>>> [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884 >>>> [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094 >>>> [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228 >>>> [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 >>>> [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392 >>>> Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e >>>> 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42> >>>> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3 >>>> RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212 >>>> RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 >>>> fs/fs-writeback.c:281 >>>> RSP <ffff88006cdaf7d0> >>> >>> Man, that's a beautiful trace w/ decoding of inline functions. When >>> did we start doing that? Is there a specific config option for this? >> >> Thanks! :) >> I use this script for symbolization: >> https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py > > BTW, it would be nice to have that script in kernel tree, instead of > scripts/decode_stacktrace.sh which > is far less advanced. I am all for it! We have 2 minor, unresolved bugs in the script: https://github.com/google/sanitizers/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+kasan_symbolize It would be good to resolve them before submitting it to kernel tree. But I am not particularly experienced in Python. +Andrey Konovalov will be able to do it in autumn. If anybody is willing to take it earlier, you are welcome (please drop a note to https://github.com/google/sanitizers/issues/674 to coordinate efforts). >> It invokes addr2line to provide file:line info, adds inline frames, >> strips ? frames (are they ever useful?) and strips timestamps. >> You just need to pipe a report through it. It assumes that vmlinux is >> in the current dir, but you can override it with --linux flag. >> >> It can also insert actual source lines into a report (amount of >> context is controlled with flags), for example: >> >> >> Call Trace: >> [< inline >] khugepaged_scan_pmd include/linux/uaccess.h:9 >> 8 { >> 9 current->pagefault_disabled++; >> 10 } >> [< inline >] khugepaged_scan_mm_slot mm/huge_memory.c:2716 >> 2715 hend); >> 2716 ret = khugepaged_scan_pmd(mm, vma, >> 2717 >> khugepaged_scan.address, >> [< inline >] khugepaged_do_scan mm/huge_memory.c:2796 >> 2795 pass_through_head < 2) >> 2796 progress += >> khugepaged_scan_mm_slot(pages - progress, >> 2797 &hpage); >> [<ffffffff81790b33>] khugepaged+0x993/0x48e0 mm/huge_memory.c:2831 >> 2830 while (!kthread_should_stop()) { >> 2831 khugepaged_do_scan(); >> 2832 khugepaged_wait_work(); >> [<ffffffff813c195f>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 >> 1302 add_wait_queue(k->waitq, &wait); >> 1303 __set_current_state(TASK_INTERRUPTIBLE); >> 1304 } >> [<ffffffff866d1b2f>] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:405 >> 404 SWAPGS >> 405 jmp restore_regs_and_iret >> 406 END(ret_from_fork) ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fs: GPF in locked_inode_to_wb_and_lock_list 2016-04-21 8:25 ` Dmitry Vyukov 2016-04-21 9:10 ` Andrey Ryabinin @ 2016-04-21 16:14 ` Tejun Heo 1 sibling, 0 replies; 17+ messages in thread From: Tejun Heo @ 2016-04-21 16:14 UTC (permalink / raw) To: Dmitry Vyukov Cc: Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko Hello, On Thu, Apr 21, 2016 at 10:25:12AM +0200, Dmitry Vyukov wrote: > I use this script for symbolization: > https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py > It invokes addr2line to provide file:line info, adds inline frames, > strips ? frames (are they ever useful?) and strips timestamps. Heh, sometimes they are. They obviously are useful when backlinks are broken for some reason. Through the remnants of the old call stacks, they also can sometimes tell the history of the task - where it was before it got to the current backtrace. It isn't reliable at all but I can recall several occasions where they were useful in root causing issues. It *might* make a sense to have an option to not strip them. And, yeah, I'm all for it being in the kernel tree. Thanks. -- tejun ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fs: GPF in locked_inode_to_wb_and_lock_list 2016-04-20 21:14 ` Tejun Heo 2016-04-21 8:25 ` Dmitry Vyukov @ 2016-04-21 8:35 ` Dmitry Vyukov 2016-04-21 9:45 ` Andrey Ryabinin 1 sibling, 1 reply; 17+ messages in thread From: Dmitry Vyukov @ 2016-04-21 8:35 UTC (permalink / raw) To: Tejun Heo Cc: Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko On Wed, Apr 20, 2016 at 11:14 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, Dmitry. > > On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote: > >> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN > ... >> RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>] >> locked_inode_to_wb_and_lock_list+0xa2/0x750 >> RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246 >> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050 >> RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286 >> RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000 >> R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0 >> R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40 >> FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0 >> DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 >> Stack: >> ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948 >> ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948 >> ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e >> Call Trace: >> [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309 >> [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554 >> [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600 >> [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709 >> [< inline >] wb_do_writeback fs/fs-writeback.c:1844 >> [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884 >> [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094 >> [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228 >> [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 >> [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392 >> Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e >> 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42> >> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3 >> RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212 >> RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 >> fs/fs-writeback.c:281 >> RSP <ffff88006cdaf7d0> > > Man, that's a beautiful trace w/ decoding of inline functions. When > did we start doing that? Is there a specific config option for this? > >> ---[ end trace 986a4d314dcb2694 ]--- >> The crash happened here: >> >> if (wb != &wb->bdi->wb) >> ffffffff818884cb: 48 89 d8 mov %rbx,%rax >> ffffffff818884ce: 48 c1 e8 03 shr $0x3,%rax >> ffffffff818884d2: 42 80 3c 28 00 cmpb $0x0,(%rax,%r13,1) > > So, it's the above instruction. > >> ffffffff818884d7: 0f 85 17 06 00 00 jne >> ffffffff81888af4 <locked_inode_to_wb_and_lock_list+0x6c4> >> ffffffff818884dd: 48 8b 03 mov (%rbx),%rax >> ffffffff818884e0: 48 83 c0 50 add $0x50,%rax >> ffffffff818884e4: 48 39 c3 cmp %rax,%rbx >> ffffffff818884e7: 0f 84 c3 00 00 00 je >> ffffffff818885b0 <locked_inode_to_wb_and_lock_list+0x180> >> >> Which means that bdi is NULL (if I get indirections right). > > So, the wb != &wb->bdi->wb comparison would be the cmp at > 0xffffffff818884e4 and given that it just compares the address of > &bdi->wb, bdi being NULL wouldn't trigger the fault. > > cmpb $0x0,(%rax,%r13,1) > -> *(u8 *)(%rax + %r13) == 0 > -> *(u8 *)((%rbx >> 3) + %r13) == 0 > > Where can that be from? I can't find anything matching even in the > surrounding functions. > > Hmmm... The base address %r13 is 0xdffffc0000000000 which isn't a > proper canonical address and in general suspcious. Ooh, it's > KASAN_SHADOW_OFFSET. It looks like something is making KASAN trigger > a fault. Can we please bring in someone who's more familiar with > KASAN? I am here. For every memory access to ADDR, KASAN makes a byte load from KASAN_SHADOW_OFFSET+ADDR>>8 first. For accesses to kernel memory, the byte is addressable and contains addressability state of the ADDR. It also has a positive side effect of catching any accesses to user memory at the place of occurrence as the address computations produces a non-valid address. But the negative side effect that instead of a usual NULL-deref GPF, you now get what we have here. %r13 contains KASAN_SHADOW_OFFSET, and %rax contains the address that normal code is going to dereference few instructions later. So without KASAN the code would trigger NULL deref on (%rbx is a copy of %rax): ffffffff818884dd: 48 8b 03 mov (%rbx),%rax So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb that is NULL?) Sorry for this mess. It is a known issue in KASAN, but we don't know how to fix it without slowing down execution and sacrificing other properties. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fs: GPF in locked_inode_to_wb_and_lock_list 2016-04-21 8:35 ` Dmitry Vyukov @ 2016-04-21 9:45 ` Andrey Ryabinin 2016-04-21 10:00 ` Dmitry Vyukov 0 siblings, 1 reply; 17+ messages in thread From: Andrey Ryabinin @ 2016-04-21 9:45 UTC (permalink / raw) To: Dmitry Vyukov Cc: Tejun Heo, Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko 2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>: > > ffffffff818884dd: 48 8b 03 mov (%rbx),%rax > > So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb > that is NULL?) Yes it's NULL wb, because there is only one load: mov (%rbx),%rax => rax = wb->bdi add $0x50,%rax => rax = &bdi->wb ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fs: GPF in locked_inode_to_wb_and_lock_list 2016-04-21 9:45 ` Andrey Ryabinin @ 2016-04-21 10:00 ` Dmitry Vyukov 2016-04-21 17:06 ` Tejun Heo 0 siblings, 1 reply; 17+ messages in thread From: Dmitry Vyukov @ 2016-04-21 10:00 UTC (permalink / raw) To: Andrey Ryabinin Cc: Tejun Heo, Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko On Thu, Apr 21, 2016 at 11:45 AM, Andrey Ryabinin <ryabinin.a.a@gmail.com> wrote: > 2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>: >> >> ffffffff818884dd: 48 8b 03 mov (%rbx),%rax >> >> So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb >> that is NULL?) > > Yes it's NULL wb, because there is only one load: > mov (%rbx),%rax => rax = wb->bdi > add $0x50,%rax => rax = &bdi->wb I bet that wb becomes NULL on the second iteration of the loop. The loop loops in case of a race with another thread, so it would also explain why it is difficult to reproduce. Tejun, does it make any sense to you? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fs: GPF in locked_inode_to_wb_and_lock_list 2016-04-21 10:00 ` Dmitry Vyukov @ 2016-04-21 17:06 ` Tejun Heo 2016-04-22 18:55 ` Dmitry Vyukov 0 siblings, 1 reply; 17+ messages in thread From: Tejun Heo @ 2016-04-21 17:06 UTC (permalink / raw) To: Dmitry Vyukov Cc: Andrey Ryabinin, Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jens Axboe, Jan Kara Hello, (cc'ing Ilya, Jan and Jens) On Thu, Apr 21, 2016 at 12:00:38PM +0200, Dmitry Vyukov wrote: > On Thu, Apr 21, 2016 at 11:45 AM, Andrey Ryabinin > <ryabinin.a.a@gmail.com> wrote: > > 2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>: > >> > >> ffffffff818884dd: 48 8b 03 mov (%rbx),%rax > >> > >> So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb > >> that is NULL?) > > > > Yes it's NULL wb, because there is only one load: > > mov (%rbx),%rax => rax = wb->bdi > > add $0x50,%rax => rax = &bdi->wb > > > I bet that wb becomes NULL on the second iteration of the loop. The > loop loops in case of a race with another thread, so it would also > explain why it is difficult to reproduce. > > Tejun, does it make any sense to you? Yeah, that makes sense. I think the culprit is 43d1c0eb7e11 ("block: detach bdev inode from its wb in __blkdev_put()") which allows inode to wb association to be broken while other operations including writeback are in progress. I thought it should be okay as the inode must be clean at that point but that obviously doesn't mean that there can be no writeback operations in flight. I hope we could eventually move away from the current model where we try to swap out an underlying data structure while upper layers may still be referring to it in the future but for now we can make sure the writeback operation is finished before detaching wb. Dmitry, I understand that the bug is difficult to reproduce but can you please give the following patch a try? Thanks. diff --git a/fs/block_dev.c b/fs/block_dev.c index 20a2c02..209ea33 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1530,12 +1530,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part) kill_bdev(bdev); bdev_write_inode(bdev); - /* - * Detaching bdev inode from its wb in __destroy_inode() - * is too late: the queue which embeds its bdi (along with - * root wb) can be gone as soon as we put_disk() below. - */ - inode_detach_wb(bdev->bd_inode); + inode_detach_blkdev_wb(bdev); } if (bdev->bd_contains == bdev) { if (disk->fops->release) diff --git a/include/linux/writeback.h b/include/linux/writeback.h index d0b5ca5..ec1f530 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -230,6 +230,25 @@ static inline void inode_detach_wb(struct inode *inode) } /** + * inode_detach_blkdev_wb - disassociate a bd_inode from its wb + * @bdev: block_device of interest + * + * @bdev is being put for the last time. Detaching bdev inode in + * __destroy_inode() is too late: the queue which embeds its bdi (along + * with root wb) can be gone as soon as the containing disk is put. + * + * This function dissociates @bdev->bd_inode from its wb. The inode must + * be clean and no further operations should be started on it. + */ +static inline void inode_detach_blkdev_wb(struct block_device *bdev) +{ + if (bdev->bd_inode->i_wb) { + flush_delayed_work(&bdev->bd_inode->i_wb->dwork); + inode_detach_wb(bdev->bd_inode); + } +} + +/** * wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite * @wbc: writeback_control of interest * @inode: target inode @@ -277,6 +296,10 @@ static inline void inode_detach_wb(struct inode *inode) { } +static inline void inode_detach_blkdev_wb(struct block_device *bdev) +{ +} + static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc, struct inode *inode) __releases(&inode->i_lock) ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: fs: GPF in locked_inode_to_wb_and_lock_list 2016-04-21 17:06 ` Tejun Heo @ 2016-04-22 18:55 ` Dmitry Vyukov 2016-06-06 17:46 ` Dmitry Vyukov 0 siblings, 1 reply; 17+ messages in thread From: Dmitry Vyukov @ 2016-04-22 18:55 UTC (permalink / raw) To: Tejun Heo Cc: Andrey Ryabinin, Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jens Axboe, Jan Kara On Thu, Apr 21, 2016 at 7:06 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > (cc'ing Ilya, Jan and Jens) > > On Thu, Apr 21, 2016 at 12:00:38PM +0200, Dmitry Vyukov wrote: >> On Thu, Apr 21, 2016 at 11:45 AM, Andrey Ryabinin >> <ryabinin.a.a@gmail.com> wrote: >> > 2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>: >> >> >> >> ffffffff818884dd: 48 8b 03 mov (%rbx),%rax >> >> >> >> So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb >> >> that is NULL?) >> > >> > Yes it's NULL wb, because there is only one load: >> > mov (%rbx),%rax => rax = wb->bdi >> > add $0x50,%rax => rax = &bdi->wb >> >> >> I bet that wb becomes NULL on the second iteration of the loop. The >> loop loops in case of a race with another thread, so it would also >> explain why it is difficult to reproduce. >> >> Tejun, does it make any sense to you? > > Yeah, that makes sense. I think the culprit is 43d1c0eb7e11 ("block: > detach bdev inode from its wb in __blkdev_put()") which allows inode > to wb association to be broken while other operations including > writeback are in progress. I thought it should be okay as the inode > must be clean at that point but that obviously doesn't mean that there > can be no writeback operations in flight. > > I hope we could eventually move away from the current model where we > try to swap out an underlying data structure while upper layers may > still be referring to it in the future but for now we can make sure > the writeback operation is finished before detaching wb. > > Dmitry, I understand that the bug is difficult to reproduce but can > you please give the following patch a try? I've merged it into my tree and will restart the fuzzer and leave it running for the weekend. Though, yeah, it is difficult to reproduce... > Thanks. > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 20a2c02..209ea33 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -1530,12 +1530,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part) > kill_bdev(bdev); > > bdev_write_inode(bdev); > - /* > - * Detaching bdev inode from its wb in __destroy_inode() > - * is too late: the queue which embeds its bdi (along with > - * root wb) can be gone as soon as we put_disk() below. > - */ > - inode_detach_wb(bdev->bd_inode); > + inode_detach_blkdev_wb(bdev); > } > if (bdev->bd_contains == bdev) { > if (disk->fops->release) > diff --git a/include/linux/writeback.h b/include/linux/writeback.h > index d0b5ca5..ec1f530 100644 > --- a/include/linux/writeback.h > +++ b/include/linux/writeback.h > @@ -230,6 +230,25 @@ static inline void inode_detach_wb(struct inode *inode) > } > > /** > + * inode_detach_blkdev_wb - disassociate a bd_inode from its wb > + * @bdev: block_device of interest > + * > + * @bdev is being put for the last time. Detaching bdev inode in > + * __destroy_inode() is too late: the queue which embeds its bdi (along > + * with root wb) can be gone as soon as the containing disk is put. > + * > + * This function dissociates @bdev->bd_inode from its wb. The inode must > + * be clean and no further operations should be started on it. > + */ > +static inline void inode_detach_blkdev_wb(struct block_device *bdev) > +{ > + if (bdev->bd_inode->i_wb) { > + flush_delayed_work(&bdev->bd_inode->i_wb->dwork); > + inode_detach_wb(bdev->bd_inode); > + } > +} > + > +/** > * wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite > * @wbc: writeback_control of interest > * @inode: target inode > @@ -277,6 +296,10 @@ static inline void inode_detach_wb(struct inode *inode) > { > } > > +static inline void inode_detach_blkdev_wb(struct block_device *bdev) > +{ > +} > + > static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc, > struct inode *inode) > __releases(&inode->i_lock) ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fs: GPF in locked_inode_to_wb_and_lock_list 2016-04-22 18:55 ` Dmitry Vyukov @ 2016-06-06 17:46 ` Dmitry Vyukov 2016-06-17 16:04 ` [PATCH] block: flush writeback dwork before detaching a bdev inode from it Tejun Heo 0 siblings, 1 reply; 17+ messages in thread From: Dmitry Vyukov @ 2016-06-06 17:46 UTC (permalink / raw) To: Tejun Heo Cc: Andrey Ryabinin, Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jens Axboe, Jan Kara Hi Tejun, I have not seem the crash since then. Please mail the patch. On Fri, Apr 22, 2016 at 8:55 PM, Dmitry Vyukov <dvyukov@google.com> wrote: > On Thu, Apr 21, 2016 at 7:06 PM, Tejun Heo <tj@kernel.org> wrote: >> Hello, >> >> (cc'ing Ilya, Jan and Jens) >> >> On Thu, Apr 21, 2016 at 12:00:38PM +0200, Dmitry Vyukov wrote: >>> On Thu, Apr 21, 2016 at 11:45 AM, Andrey Ryabinin >>> <ryabinin.a.a@gmail.com> wrote: >>> > 2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>: >>> >> >>> >> ffffffff818884dd: 48 8b 03 mov (%rbx),%rax >>> >> >>> >> So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb >>> >> that is NULL?) >>> > >>> > Yes it's NULL wb, because there is only one load: >>> > mov (%rbx),%rax => rax = wb->bdi >>> > add $0x50,%rax => rax = &bdi->wb >>> >>> >>> I bet that wb becomes NULL on the second iteration of the loop. The >>> loop loops in case of a race with another thread, so it would also >>> explain why it is difficult to reproduce. >>> >>> Tejun, does it make any sense to you? >> >> Yeah, that makes sense. I think the culprit is 43d1c0eb7e11 ("block: >> detach bdev inode from its wb in __blkdev_put()") which allows inode >> to wb association to be broken while other operations including >> writeback are in progress. I thought it should be okay as the inode >> must be clean at that point but that obviously doesn't mean that there >> can be no writeback operations in flight. >> >> I hope we could eventually move away from the current model where we >> try to swap out an underlying data structure while upper layers may >> still be referring to it in the future but for now we can make sure >> the writeback operation is finished before detaching wb. >> >> Dmitry, I understand that the bug is difficult to reproduce but can >> you please give the following patch a try? > > > I've merged it into my tree and will restart the fuzzer and leave it > running for the weekend. > Though, yeah, it is difficult to reproduce... > > >> Thanks. >> >> diff --git a/fs/block_dev.c b/fs/block_dev.c >> index 20a2c02..209ea33 100644 >> --- a/fs/block_dev.c >> +++ b/fs/block_dev.c >> @@ -1530,12 +1530,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part) >> kill_bdev(bdev); >> >> bdev_write_inode(bdev); >> - /* >> - * Detaching bdev inode from its wb in __destroy_inode() >> - * is too late: the queue which embeds its bdi (along with >> - * root wb) can be gone as soon as we put_disk() below. >> - */ >> - inode_detach_wb(bdev->bd_inode); >> + inode_detach_blkdev_wb(bdev); >> } >> if (bdev->bd_contains == bdev) { >> if (disk->fops->release) >> diff --git a/include/linux/writeback.h b/include/linux/writeback.h >> index d0b5ca5..ec1f530 100644 >> --- a/include/linux/writeback.h >> +++ b/include/linux/writeback.h >> @@ -230,6 +230,25 @@ static inline void inode_detach_wb(struct inode *inode) >> } >> >> /** >> + * inode_detach_blkdev_wb - disassociate a bd_inode from its wb >> + * @bdev: block_device of interest >> + * >> + * @bdev is being put for the last time. Detaching bdev inode in >> + * __destroy_inode() is too late: the queue which embeds its bdi (along >> + * with root wb) can be gone as soon as the containing disk is put. >> + * >> + * This function dissociates @bdev->bd_inode from its wb. The inode must >> + * be clean and no further operations should be started on it. >> + */ >> +static inline void inode_detach_blkdev_wb(struct block_device *bdev) >> +{ >> + if (bdev->bd_inode->i_wb) { >> + flush_delayed_work(&bdev->bd_inode->i_wb->dwork); >> + inode_detach_wb(bdev->bd_inode); >> + } >> +} >> + >> +/** >> * wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite >> * @wbc: writeback_control of interest >> * @inode: target inode >> @@ -277,6 +296,10 @@ static inline void inode_detach_wb(struct inode *inode) >> { >> } >> >> +static inline void inode_detach_blkdev_wb(struct block_device *bdev) >> +{ >> +} >> + >> static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc, >> struct inode *inode) >> __releases(&inode->i_lock) ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH] block: flush writeback dwork before detaching a bdev inode from it 2016-06-06 17:46 ` Dmitry Vyukov @ 2016-06-17 16:04 ` Tejun Heo 2016-06-20 13:31 ` Jan Kara 0 siblings, 1 reply; 17+ messages in thread From: Tejun Heo @ 2016-06-17 16:04 UTC (permalink / raw) To: Jens Axboe Cc: Andrey Ryabinin, Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jan Kara, Dmitry Vyukov, kernel-team 43d1c0eb7e11 ("block: detach bdev inode from its wb in __blkdev_put()") detached bdev inode from its wb as the bdev inode may outlive the underlying bdi and thus the wb. This is accomplished by invoking inode_detach_wb() from __blkdev_put(); however, while the inode can't be dirtied by the time control reaches there, that doesn't guarantee that writeback isn't already in progress on the inode. This can lead to the inode being disassociated from its wb while writeback operation is in flight causing oopses like the following. general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN Modules linked in: CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Workqueue: writeback wb_workfn (flush-11:0) task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000 RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050 RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286 RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000 R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0 R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40 FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0 DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 Stack: ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948 ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948 ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e Call Trace: [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309 [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554 [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600 [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709 [< inline >] wb_do_writeback fs/fs-writeback.c:1844 [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884 [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094 [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228 [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392 Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3 RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212 RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 fs/fs-writeback.c:281 RSP <ffff88006cdaf7d0> ---[ end trace 986a4d314dcb2694 ]--- Fix it by flushing the wb dwork before detaching the inode. Combined with the fact that the inode can no longer be dirtied, this guarantees that no writeback operation can be in flight or initiated. As this involves details on writeback side which doesn't have much to do with block_dev, encapsulate it in a helper function - inode_detach_blkdev_wb(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com> Link: http://lkml.kernel.org/g/CACT4Y+YAjq8mcfiVxR075didJKCyOCVrqxdbfKdgUxabstbfmA@mail.gmail.com Fixes: 43d1c0eb7e11 ("block: detach bdev inode from its wb in __blkdev_put()") Cc: Ilya Dryomov <idryomov@gmail.com> Cc: stable@vger.kernel.org # v4.2+ --- fs/block_dev.c | 7 +------ include/linux/writeback.h | 23 +++++++++++++++++++++++ 2 files changed, 24 insertions(+), 6 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 71ccab1..6a14100 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1612,12 +1612,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part) kill_bdev(bdev); bdev_write_inode(bdev); - /* - * Detaching bdev inode from its wb in __destroy_inode() - * is too late: the queue which embeds its bdi (along with - * root wb) can be gone as soon as we put_disk() below. - */ - inode_detach_wb(bdev->bd_inode); + inode_detach_blkdev_wb(bdev); } if (bdev->bd_contains == bdev) { if (disk->fops->release) diff --git a/include/linux/writeback.h b/include/linux/writeback.h index d0b5ca5..ec1f530 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -230,6 +230,25 @@ static inline void inode_detach_wb(struct inode *inode) } /** + * inode_detach_blkdev_wb - disassociate a bd_inode from its wb + * @bdev: block_device of interest + * + * @bdev is being put for the last time. Detaching bdev inode in + * __destroy_inode() is too late: the queue which embeds its bdi (along + * with root wb) can be gone as soon as the containing disk is put. + * + * This function dissociates @bdev->bd_inode from its wb. The inode must + * be clean and no further operations should be started on it. + */ +static inline void inode_detach_blkdev_wb(struct block_device *bdev) +{ + if (bdev->bd_inode->i_wb) { + flush_delayed_work(&bdev->bd_inode->i_wb->dwork); + inode_detach_wb(bdev->bd_inode); + } +} + +/** * wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite * @wbc: writeback_control of interest * @inode: target inode @@ -277,6 +296,10 @@ static inline void inode_detach_wb(struct inode *inode) { } +static inline void inode_detach_blkdev_wb(struct block_device *bdev) +{ +} + static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc, struct inode *inode) __releases(&inode->i_lock) ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] block: flush writeback dwork before detaching a bdev inode from it 2016-06-17 16:04 ` [PATCH] block: flush writeback dwork before detaching a bdev inode from it Tejun Heo @ 2016-06-20 13:31 ` Jan Kara 2016-06-20 13:38 ` Dmitry Vyukov 0 siblings, 1 reply; 17+ messages in thread From: Jan Kara @ 2016-06-20 13:31 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, Andrey Ryabinin, Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jan Kara, Dmitry Vyukov, kernel-team, Ted Tso [-- Attachment #1: Type: text/plain, Size: 2248 bytes --] Hi, On Fri 17-06-16 12:04:05, Tejun Heo wrote: > 43d1c0eb7e11 ("block: detach bdev inode from its wb in > __blkdev_put()") detached bdev inode from its wb as the bdev inode may > outlive the underlying bdi and thus the wb. This is accomplished by > invoking inode_detach_wb() from __blkdev_put(); however, while the > inode can't be dirtied by the time control reaches there, that doesn't > guarantee that writeback isn't already in progress on the inode. This > can lead to the inode being disassociated from its wb while writeback > operation is in flight causing oopses like the following. <snip> > Fix it by flushing the wb dwork before detaching the inode. Combined > with the fact that the inode can no longer be dirtied, this guarantees > that no writeback operation can be in flight or initiated. Sorry for the late reply but now when thinking about the patch I don't think it is quite right. Writeback can happen from other contexts than just the worker one (e.g. kswapd can do writeback, or in some out-of-memory situations we may punt to doing writeback directly instead of calling the worker, or sync(2) calls fdatawrite() for block device inode directly when iterating through blockdev superblock). So flushing the workqueue IMHO is not covering 100% of cases. I wanted to suggest to use inode_wait_for_writeback() which will make sure writeback is done with the inode. However we effectively already do this by calling bdev_write_inode() which calls writeback_single_inode() in WB_SYNC_ALL mode. So by the time that call completes we are sure writeback code is not looking at the inode. *However* what I think is happening is that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes the inode and clears all dirty bits, however the inode still stays in the b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently once flusher work runs, it finds the inode, looks at it and boom. And the problem seems to be that write_inode_now(inode, true) does not result in I_DIRTY_TIME being cleared. Attached patch should fix this issue - it is compile-tested only. Dmitry, can you check whether this patch fixes the issue for you as well? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR [-- Attachment #2: 0001-writeback-Write-dirty-times-for-WB_SYNC_ALL-writebac.patch --] [-- Type: text/x-patch, Size: 4228 bytes --] >From 907aceb33f6b598f00c9b1d10a110c74e4a0f954 Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Mon, 20 Jun 2016 15:13:42 +0200 Subject: [PATCH] writeback: Write dirty times for WB_SYNC_ALL writeback Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and queue_io() so that inodes which have only dirty timestamps are properly written on fsync(2) and sync(2). However there are other call sites - most notably going through write_inode_now() - which expect inode to be clean after WB_SYNC_ALL writeback. This is not currently true as we do not clear I_DIRTY_TIME in __writeback_single_inode() even for WB_SYNC_ALL writeback in all the cases. This then resulted in the following oops because bdev_write_inode() did not clean the inode and writeback code later stumbled over a dirty inode with detached wb. general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN Modules linked in: CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Workqueue: writeback wb_workfn (flush-11:0) task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000 RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050 RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286 RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000 R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0 R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40 FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0 DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 Stack: ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948 ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948 ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e Call Trace: [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309 [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554 [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600 [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709 [< inline >] wb_do_writeback fs/fs-writeback.c:1844 [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884 [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094 [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228 [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392 Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3 RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212 RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 fs/fs-writeback.c:281 RSP <ffff88006cdaf7d0> ---[ end trace 986a4d314dcb2694 ]--- Fix the problem by making sure __writeback_single_inode() writes inode only with dirty times in WB_SYNC_ALL mode. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jan Kara <jack@suse.cz> --- fs/fs-writeback.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 989a2cef6b76..51efb644410a 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1291,6 +1291,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc) dirty = inode->i_state & I_DIRTY; if (inode->i_state & I_DIRTY_TIME) { if ((dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) || + wbc->sync_mode == WB_SYNC_ALL || unlikely(inode->i_state & I_DIRTY_TIME_EXPIRED) || unlikely(time_after(jiffies, (inode->dirtied_time_when + -- 2.6.6 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] block: flush writeback dwork before detaching a bdev inode from it 2016-06-20 13:31 ` Jan Kara @ 2016-06-20 13:38 ` Dmitry Vyukov 2016-06-20 17:40 ` Tejun Heo 0 siblings, 1 reply; 17+ messages in thread From: Dmitry Vyukov @ 2016-06-20 13:38 UTC (permalink / raw) To: Jan Kara Cc: Tejun Heo, Jens Axboe, Andrey Ryabinin, Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jan Kara, kernel-team, Ted Tso On Mon, Jun 20, 2016 at 3:31 PM, Jan Kara <jack@suse.cz> wrote: > Hi, > > On Fri 17-06-16 12:04:05, Tejun Heo wrote: >> 43d1c0eb7e11 ("block: detach bdev inode from its wb in >> __blkdev_put()") detached bdev inode from its wb as the bdev inode may >> outlive the underlying bdi and thus the wb. This is accomplished by >> invoking inode_detach_wb() from __blkdev_put(); however, while the >> inode can't be dirtied by the time control reaches there, that doesn't >> guarantee that writeback isn't already in progress on the inode. This >> can lead to the inode being disassociated from its wb while writeback >> operation is in flight causing oopses like the following. > > <snip> > >> Fix it by flushing the wb dwork before detaching the inode. Combined >> with the fact that the inode can no longer be dirtied, this guarantees >> that no writeback operation can be in flight or initiated. > > Sorry for the late reply but now when thinking about the patch I don't > think it is quite right. Writeback can happen from other contexts than just > the worker one (e.g. kswapd can do writeback, or in some out-of-memory > situations we may punt to doing writeback directly instead of calling the > worker, or sync(2) calls fdatawrite() for block device inode directly when > iterating through blockdev superblock). So flushing the workqueue IMHO is > not covering 100% of cases. > > I wanted to suggest to use inode_wait_for_writeback() which will make sure > writeback is done with the inode. However we effectively already do this > by calling bdev_write_inode() which calls writeback_single_inode() in > WB_SYNC_ALL mode. So by the time that call completes we are sure writeback > code is not looking at the inode. *However* what I think is happening is > that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes > the inode and clears all dirty bits, however the inode still stays in the > b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently > once flusher work runs, it finds the inode, looks at it and boom. And the > problem seems to be that write_inode_now(inode, true) does not result in > I_DIRTY_TIME being cleared. > > Attached patch should fix this issue - it is compile-tested only. Dmitry, > can you check whether this patch fixes the issue for you as well? I can't directly test it because crash happened very infrequently. If Tejun/Ted agree that it is the right way to fix it, then I can patch it, restart the fuzzer and leave it running for a while. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] block: flush writeback dwork before detaching a bdev inode from it 2016-06-20 13:38 ` Dmitry Vyukov @ 2016-06-20 17:40 ` Tejun Heo 2016-06-21 12:58 ` Dmitry Vyukov 0 siblings, 1 reply; 17+ messages in thread From: Tejun Heo @ 2016-06-20 17:40 UTC (permalink / raw) To: Dmitry Vyukov Cc: Jan Kara, Jens Axboe, Andrey Ryabinin, Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jan Kara, kernel-team, Ted Tso Hello, On Mon, Jun 20, 2016 at 03:38:41PM +0200, Dmitry Vyukov wrote: > > Sorry for the late reply but now when thinking about the patch I don't > > think it is quite right. Writeback can happen from other contexts than just > > the worker one (e.g. kswapd can do writeback, or in some out-of-memory > > situations we may punt to doing writeback directly instead of calling the > > worker, or sync(2) calls fdatawrite() for block device inode directly when > > iterating through blockdev superblock). So flushing the workqueue IMHO is > > not covering 100% of cases. Hmmm, yeah, the patch undoes what the cgroup writeback changes added but it looks like the addition was exposing the existing problem more rather than causing a new one. > > I wanted to suggest to use inode_wait_for_writeback() which will make sure > > writeback is done with the inode. However we effectively already do this > > by calling bdev_write_inode() which calls writeback_single_inode() in > > WB_SYNC_ALL mode. So by the time that call completes we are sure writeback > > code is not looking at the inode. *However* what I think is happening is > > that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes > > the inode and clears all dirty bits, however the inode still stays in the > > b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently > > once flusher work runs, it finds the inode, looks at it and boom. And the > > problem seems to be that write_inode_now(inode, true) does not result in > > I_DIRTY_TIME being cleared. > > > > Attached patch should fix this issue - it is compile-tested only. Dmitry, > > can you check whether this patch fixes the issue for you as well? > > I can't directly test it because crash happened very infrequently. > If Tejun/Ted agree that it is the right way to fix it, then I can > patch it, restart the fuzzer and leave it running for a while. Yes, please try out the patch. Thanks a lot. -- tejun ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] block: flush writeback dwork before detaching a bdev inode from it 2016-06-20 17:40 ` Tejun Heo @ 2016-06-21 12:58 ` Dmitry Vyukov 0 siblings, 0 replies; 17+ messages in thread From: Dmitry Vyukov @ 2016-06-21 12:58 UTC (permalink / raw) To: Tejun Heo Cc: Jan Kara, Jens Axboe, Andrey Ryabinin, Alexander Viro, linux-fsdevel, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jan Kara, kernel-team, Ted Tso On Mon, Jun 20, 2016 at 7:40 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Mon, Jun 20, 2016 at 03:38:41PM +0200, Dmitry Vyukov wrote: >> > Sorry for the late reply but now when thinking about the patch I don't >> > think it is quite right. Writeback can happen from other contexts than just >> > the worker one (e.g. kswapd can do writeback, or in some out-of-memory >> > situations we may punt to doing writeback directly instead of calling the >> > worker, or sync(2) calls fdatawrite() for block device inode directly when >> > iterating through blockdev superblock). So flushing the workqueue IMHO is >> > not covering 100% of cases. > > Hmmm, yeah, the patch undoes what the cgroup writeback changes added > but it looks like the addition was exposing the existing problem more > rather than causing a new one. > >> > I wanted to suggest to use inode_wait_for_writeback() which will make sure >> > writeback is done with the inode. However we effectively already do this >> > by calling bdev_write_inode() which calls writeback_single_inode() in >> > WB_SYNC_ALL mode. So by the time that call completes we are sure writeback >> > code is not looking at the inode. *However* what I think is happening is >> > that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes >> > the inode and clears all dirty bits, however the inode still stays in the >> > b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently >> > once flusher work runs, it finds the inode, looks at it and boom. And the >> > problem seems to be that write_inode_now(inode, true) does not result in >> > I_DIRTY_TIME being cleared. >> > >> > Attached patch should fix this issue - it is compile-tested only. Dmitry, >> > can you check whether this patch fixes the issue for you as well? >> >> I can't directly test it because crash happened very infrequently. >> If Tejun/Ted agree that it is the right way to fix it, then I can >> patch it, restart the fuzzer and leave it running for a while. > > Yes, please try out the patch. Done At least it should catch any new introduced bugs. ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2016-06-21 13:05 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-04-18 9:44 fs: GPF in locked_inode_to_wb_and_lock_list Dmitry Vyukov 2016-04-20 21:14 ` Tejun Heo 2016-04-21 8:25 ` Dmitry Vyukov 2016-04-21 9:10 ` Andrey Ryabinin 2016-04-21 9:29 ` Dmitry Vyukov 2016-04-21 16:14 ` Tejun Heo 2016-04-21 8:35 ` Dmitry Vyukov 2016-04-21 9:45 ` Andrey Ryabinin 2016-04-21 10:00 ` Dmitry Vyukov 2016-04-21 17:06 ` Tejun Heo 2016-04-22 18:55 ` Dmitry Vyukov 2016-06-06 17:46 ` Dmitry Vyukov 2016-06-17 16:04 ` [PATCH] block: flush writeback dwork before detaching a bdev inode from it Tejun Heo 2016-06-20 13:31 ` Jan Kara 2016-06-20 13:38 ` Dmitry Vyukov 2016-06-20 17:40 ` Tejun Heo 2016-06-21 12:58 ` Dmitry Vyukov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).