linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* fs: GPF in locked_inode_to_wb_and_lock_list
@ 2016-04-18  9:44 Dmitry Vyukov
  2016-04-20 21:14 ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Dmitry Vyukov @ 2016-04-18  9:44 UTC (permalink / raw)
  To: Alexander Viro, linux-fsdevel, LKML, Tejun Heo
  Cc: syzkaller, Kostya Serebryany, Alexander Potapenko

Hello,

I've hit the following GPF while running syzkaller fuzzer on commit
806fdcce017dc98c4dbf8ed001750a0d7d2bb0af (Apr 14):

general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
Modules linked in:
CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: writeback wb_workfn (flush-11:0)
task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
locked_inode_to_wb_and_lock_list+0xa2/0x750
RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
 ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
 ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
 ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
Call Trace:
 [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
 [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
 [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
 [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
 [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
 [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
 [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
 [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
 [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
 [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
fs/fs-writeback.c:281
 RSP <ffff88006cdaf7d0>
---[ end trace 986a4d314dcb2694 ]---


The crash happened here:

        if (wb != &wb->bdi->wb)
ffffffff818884cb:       48 89 d8                mov    %rbx,%rax
ffffffff818884ce:       48 c1 e8 03             shr    $0x3,%rax
ffffffff818884d2:       42 80 3c 28 00          cmpb   $0x0,(%rax,%r13,1)
ffffffff818884d7:       0f 85 17 06 00 00       jne
ffffffff81888af4 <locked_inode_to_wb_and_lock_list+0x6c4>
ffffffff818884dd:       48 8b 03                mov    (%rbx),%rax
ffffffff818884e0:       48 83 c0 50             add    $0x50,%rax
ffffffff818884e4:       48 39 c3                cmp    %rax,%rbx
ffffffff818884e7:       0f 84 c3 00 00 00       je
ffffffff818885b0 <locked_inode_to_wb_and_lock_list+0x180>

Which means that bdi is NULL (if I get indirections right).

The block device is flush-11:0 (/dev/sr0).

I was able to reproduce it once by re-running programs from the crash log:
https://gist.githubusercontent.com/dvyukov/7c9e6358b661e2d50a774a70b5ee348a/raw/5cd94e32bbcf2a788b5efd0768cfd883c5921070/gistfile1.txt
But it took half an hour, and then another hour long run did not
trigger it. So it is probably triggered by some rare race.

Thank you

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fs: GPF in locked_inode_to_wb_and_lock_list
  2016-04-18  9:44 fs: GPF in locked_inode_to_wb_and_lock_list Dmitry Vyukov
@ 2016-04-20 21:14 ` Tejun Heo
  2016-04-21  8:25   ` Dmitry Vyukov
  2016-04-21  8:35   ` Dmitry Vyukov
  0 siblings, 2 replies; 17+ messages in thread
From: Tejun Heo @ 2016-04-20 21:14 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Alexander Viro, linux-fsdevel, LKML, syzkaller,
	Kostya Serebryany, Alexander Potapenko

Hello, Dmitry.

On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote:

> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
...
> RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
> locked_inode_to_wb_and_lock_list+0xa2/0x750
> RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
> RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
> RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
> R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
> R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
> FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
> DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> Stack:
>  ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
>  ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
>  ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
> Call Trace:
>  [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
>  [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
>  [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
>  [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
>  [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
>  [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
>  [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
>  [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
>  [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
>  [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
> Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
> 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
> RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
> RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
> fs/fs-writeback.c:281
>  RSP <ffff88006cdaf7d0>

Man, that's a beautiful trace w/ decoding of inline functions.  When
did we start doing that?  Is there a specific config option for this?

> ---[ end trace 986a4d314dcb2694 ]---
> The crash happened here:
> 
>         if (wb != &wb->bdi->wb)
> ffffffff818884cb:       48 89 d8                mov    %rbx,%rax
> ffffffff818884ce:       48 c1 e8 03             shr    $0x3,%rax
> ffffffff818884d2:       42 80 3c 28 00          cmpb   $0x0,(%rax,%r13,1)

So, it's the above instruction.

> ffffffff818884d7:       0f 85 17 06 00 00       jne
> ffffffff81888af4 <locked_inode_to_wb_and_lock_list+0x6c4>
> ffffffff818884dd:       48 8b 03                mov    (%rbx),%rax
> ffffffff818884e0:       48 83 c0 50             add    $0x50,%rax
> ffffffff818884e4:       48 39 c3                cmp    %rax,%rbx
> ffffffff818884e7:       0f 84 c3 00 00 00       je
> ffffffff818885b0 <locked_inode_to_wb_and_lock_list+0x180>
> 
> Which means that bdi is NULL (if I get indirections right).

So, the wb != &wb->bdi->wb comparison would be the cmp at
0xffffffff818884e4 and given that it just compares the address of
&bdi->wb, bdi being NULL wouldn't trigger the fault.

   cmpb $0x0,(%rax,%r13,1) 
-> *(u8 *)(%rax + %r13) == 0
-> *(u8 *)((%rbx >> 3) + %r13) == 0

Where can that be from?  I can't find anything matching even in the
surrounding functions.

Hmmm... The base address %r13 is 0xdffffc0000000000 which isn't a
proper canonical address and in general suspcious.  Ooh, it's
KASAN_SHADOW_OFFSET.  It looks like something is making KASAN trigger
a fault.  Can we please bring in someone who's more familiar with
KASAN?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fs: GPF in locked_inode_to_wb_and_lock_list
  2016-04-20 21:14 ` Tejun Heo
@ 2016-04-21  8:25   ` Dmitry Vyukov
  2016-04-21  9:10     ` Andrey Ryabinin
  2016-04-21 16:14     ` Tejun Heo
  2016-04-21  8:35   ` Dmitry Vyukov
  1 sibling, 2 replies; 17+ messages in thread
From: Dmitry Vyukov @ 2016-04-21  8:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexander Viro, linux-fsdevel, LKML, syzkaller,
	Kostya Serebryany, Alexander Potapenko

On Wed, Apr 20, 2016 at 11:14 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Dmitry.
>
> On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote:
>
>> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
> ...
>> RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
>> locked_inode_to_wb_and_lock_list+0xa2/0x750
>> RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
>> RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
>> RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
>> R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
>> R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
>> FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
>> DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>> Stack:
>>  ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
>>  ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
>>  ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
>> Call Trace:
>>  [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
>>  [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
>>  [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
>>  [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
>>  [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
>>  [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
>>  [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
>>  [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
>>  [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
>>  [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
>> Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
>> 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
>> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
>> RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
>> RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
>> fs/fs-writeback.c:281
>>  RSP <ffff88006cdaf7d0>
>
> Man, that's a beautiful trace w/ decoding of inline functions.  When
> did we start doing that?  Is there a specific config option for this?

Thanks! :)
I use this script for symbolization:
https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py
It invokes addr2line to provide file:line info, adds inline frames,
strips ? frames (are they ever useful?) and strips timestamps.
You just need to pipe a report through it. It assumes that vmlinux is
in the current dir, but you can override it with --linux flag.

It can also insert actual source lines into a report (amount of
context is controlled with flags), for example:


Call Trace:
 [<     inline     >] khugepaged_scan_pmd include/linux/uaccess.h:9
      8 {
      9         current->pagefault_disabled++;
     10 }
 [<     inline     >] khugepaged_scan_mm_slot mm/huge_memory.c:2716
   2715                                   hend);
   2716                         ret = khugepaged_scan_pmd(mm, vma,
   2717
khugepaged_scan.address,
 [<     inline     >] khugepaged_do_scan mm/huge_memory.c:2796
   2795                     pass_through_head < 2)
   2796                         progress +=
khugepaged_scan_mm_slot(pages - progress,
   2797                                                             &hpage);
 [<ffffffff81790b33>] khugepaged+0x993/0x48e0 mm/huge_memory.c:2831
   2830         while (!kthread_should_stop()) {
   2831                 khugepaged_do_scan();
   2832                 khugepaged_wait_work();
 [<ffffffff813c195f>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
   1302                         add_wait_queue(k->waitq, &wait);
   1303                         __set_current_state(TASK_INTERRUPTIBLE);
   1304                 }
 [<ffffffff866d1b2f>] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:405
    404         SWAPGS
    405         jmp        restore_regs_and_iret
    406 END(ret_from_fork)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fs: GPF in locked_inode_to_wb_and_lock_list
  2016-04-20 21:14 ` Tejun Heo
  2016-04-21  8:25   ` Dmitry Vyukov
@ 2016-04-21  8:35   ` Dmitry Vyukov
  2016-04-21  9:45     ` Andrey Ryabinin
  1 sibling, 1 reply; 17+ messages in thread
From: Dmitry Vyukov @ 2016-04-21  8:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexander Viro, linux-fsdevel, LKML, syzkaller,
	Kostya Serebryany, Alexander Potapenko

On Wed, Apr 20, 2016 at 11:14 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Dmitry.
>
> On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote:
>
>> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
> ...
>> RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
>> locked_inode_to_wb_and_lock_list+0xa2/0x750
>> RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
>> RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
>> RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
>> R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
>> R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
>> FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
>> DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>> Stack:
>>  ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
>>  ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
>>  ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
>> Call Trace:
>>  [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
>>  [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
>>  [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
>>  [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
>>  [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
>>  [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
>>  [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
>>  [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
>>  [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
>>  [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
>> Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
>> 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
>> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
>> RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
>> RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
>> fs/fs-writeback.c:281
>>  RSP <ffff88006cdaf7d0>
>
> Man, that's a beautiful trace w/ decoding of inline functions.  When
> did we start doing that?  Is there a specific config option for this?
>
>> ---[ end trace 986a4d314dcb2694 ]---
>> The crash happened here:
>>
>>         if (wb != &wb->bdi->wb)
>> ffffffff818884cb:       48 89 d8                mov    %rbx,%rax
>> ffffffff818884ce:       48 c1 e8 03             shr    $0x3,%rax
>> ffffffff818884d2:       42 80 3c 28 00          cmpb   $0x0,(%rax,%r13,1)
>
> So, it's the above instruction.
>
>> ffffffff818884d7:       0f 85 17 06 00 00       jne
>> ffffffff81888af4 <locked_inode_to_wb_and_lock_list+0x6c4>
>> ffffffff818884dd:       48 8b 03                mov    (%rbx),%rax
>> ffffffff818884e0:       48 83 c0 50             add    $0x50,%rax
>> ffffffff818884e4:       48 39 c3                cmp    %rax,%rbx
>> ffffffff818884e7:       0f 84 c3 00 00 00       je
>> ffffffff818885b0 <locked_inode_to_wb_and_lock_list+0x180>
>>
>> Which means that bdi is NULL (if I get indirections right).
>
> So, the wb != &wb->bdi->wb comparison would be the cmp at
> 0xffffffff818884e4 and given that it just compares the address of
> &bdi->wb, bdi being NULL wouldn't trigger the fault.
>
>    cmpb $0x0,(%rax,%r13,1)
> -> *(u8 *)(%rax + %r13) == 0
> -> *(u8 *)((%rbx >> 3) + %r13) == 0
>
> Where can that be from?  I can't find anything matching even in the
> surrounding functions.
>
> Hmmm... The base address %r13 is 0xdffffc0000000000 which isn't a
> proper canonical address and in general suspcious.  Ooh, it's
> KASAN_SHADOW_OFFSET.  It looks like something is making KASAN trigger
> a fault.  Can we please bring in someone who's more familiar with
> KASAN?

I am here.
For every memory access to ADDR, KASAN makes a byte load from
KASAN_SHADOW_OFFSET+ADDR>>8 first.
For accesses to kernel memory, the byte is addressable and contains
addressability state of the ADDR.
It also has a positive side effect of catching any accesses to user
memory at the place of occurrence as the address computations produces
a non-valid address.
But the negative side effect that instead of a usual NULL-deref GPF,
you now get what we have here. %r13 contains KASAN_SHADOW_OFFSET, and
%rax contains the address that normal code is going to dereference few
instructions later. So without KASAN the code would trigger NULL deref
on (%rbx is a copy of %rax):

ffffffff818884dd:       48 8b 03                mov    (%rbx),%rax

So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb
that is NULL?)

Sorry for this mess. It is a known issue in KASAN, but we don't know
how to fix it without slowing down execution and sacrificing other
properties.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fs: GPF in locked_inode_to_wb_and_lock_list
  2016-04-21  8:25   ` Dmitry Vyukov
@ 2016-04-21  9:10     ` Andrey Ryabinin
  2016-04-21  9:29       ` Dmitry Vyukov
  2016-04-21 16:14     ` Tejun Heo
  1 sibling, 1 reply; 17+ messages in thread
From: Andrey Ryabinin @ 2016-04-21  9:10 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Tejun Heo, Alexander Viro, linux-fsdevel, LKML, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin

2016-04-21 11:25 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
> On Wed, Apr 20, 2016 at 11:14 PM, Tejun Heo <tj@kernel.org> wrote:
>> Hello, Dmitry.
>>
>> On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote:
>>
>>> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
>> ...
>>> RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
>>> locked_inode_to_wb_and_lock_list+0xa2/0x750
>>> RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
>>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
>>> RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
>>> RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
>>> R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
>>> R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
>>> FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
>>> DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>>> Stack:
>>>  ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
>>>  ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
>>>  ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
>>> Call Trace:
>>>  [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
>>>  [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
>>>  [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
>>>  [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
>>>  [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
>>>  [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
>>>  [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
>>>  [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
>>>  [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
>>>  [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
>>> Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
>>> 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
>>> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
>>> RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
>>> RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
>>> fs/fs-writeback.c:281
>>>  RSP <ffff88006cdaf7d0>
>>
>> Man, that's a beautiful trace w/ decoding of inline functions.  When
>> did we start doing that?  Is there a specific config option for this?
>
> Thanks! :)
> I use this script for symbolization:
> https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py

BTW, it would be nice to have that script in kernel tree, instead of
scripts/decode_stacktrace.sh which
is far less advanced.


> It invokes addr2line to provide file:line info, adds inline frames,
> strips ? frames (are they ever useful?) and strips timestamps.
> You just need to pipe a report through it. It assumes that vmlinux is
> in the current dir, but you can override it with --linux flag.
>
> It can also insert actual source lines into a report (amount of
> context is controlled with flags), for example:
>
>
> Call Trace:
>  [<     inline     >] khugepaged_scan_pmd include/linux/uaccess.h:9
>       8 {
>       9         current->pagefault_disabled++;
>      10 }
>  [<     inline     >] khugepaged_scan_mm_slot mm/huge_memory.c:2716
>    2715                                   hend);
>    2716                         ret = khugepaged_scan_pmd(mm, vma,
>    2717
> khugepaged_scan.address,
>  [<     inline     >] khugepaged_do_scan mm/huge_memory.c:2796
>    2795                     pass_through_head < 2)
>    2796                         progress +=
> khugepaged_scan_mm_slot(pages - progress,
>    2797                                                             &hpage);
>  [<ffffffff81790b33>] khugepaged+0x993/0x48e0 mm/huge_memory.c:2831
>    2830         while (!kthread_should_stop()) {
>    2831                 khugepaged_do_scan();
>    2832                 khugepaged_wait_work();
>  [<ffffffff813c195f>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
>    1302                         add_wait_queue(k->waitq, &wait);
>    1303                         __set_current_state(TASK_INTERRUPTIBLE);
>    1304                 }
>  [<ffffffff866d1b2f>] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:405
>     404         SWAPGS
>     405         jmp        restore_regs_and_iret
>     406 END(ret_from_fork)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fs: GPF in locked_inode_to_wb_and_lock_list
  2016-04-21  9:10     ` Andrey Ryabinin
@ 2016-04-21  9:29       ` Dmitry Vyukov
  0 siblings, 0 replies; 17+ messages in thread
From: Dmitry Vyukov @ 2016-04-21  9:29 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Tejun Heo, Alexander Viro, linux-fsdevel, LKML, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	Andrey Konovalov

On Thu, Apr 21, 2016 at 11:10 AM, Andrey Ryabinin
<ryabinin.a.a@gmail.com> wrote:
> 2016-04-21 11:25 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
>> On Wed, Apr 20, 2016 at 11:14 PM, Tejun Heo <tj@kernel.org> wrote:
>>> Hello, Dmitry.
>>>
>>> On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote:
>>>
>>>> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
>>> ...
>>>> RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
>>>> locked_inode_to_wb_and_lock_list+0xa2/0x750
>>>> RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
>>>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
>>>> RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
>>>> RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
>>>> R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
>>>> R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
>>>> FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
>>>> DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>>>> Stack:
>>>>  ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
>>>>  ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
>>>>  ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
>>>> Call Trace:
>>>>  [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
>>>>  [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
>>>>  [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
>>>>  [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
>>>>  [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
>>>>  [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
>>>>  [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
>>>>  [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
>>>>  [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
>>>>  [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
>>>> Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
>>>> 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
>>>> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
>>>> RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
>>>> RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
>>>> fs/fs-writeback.c:281
>>>>  RSP <ffff88006cdaf7d0>
>>>
>>> Man, that's a beautiful trace w/ decoding of inline functions.  When
>>> did we start doing that?  Is there a specific config option for this?
>>
>> Thanks! :)
>> I use this script for symbolization:
>> https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py
>
> BTW, it would be nice to have that script in kernel tree, instead of
> scripts/decode_stacktrace.sh which
> is far less advanced.


I am all for it!
We have 2 minor, unresolved bugs in the script:
https://github.com/google/sanitizers/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+kasan_symbolize
It would be good to resolve them before submitting it to kernel tree.
But I am not particularly experienced in Python. +Andrey Konovalov
will be able to do it in autumn. If anybody is willing to take it
earlier, you are welcome (please drop a note to
https://github.com/google/sanitizers/issues/674 to coordinate
efforts).


>> It invokes addr2line to provide file:line info, adds inline frames,
>> strips ? frames (are they ever useful?) and strips timestamps.
>> You just need to pipe a report through it. It assumes that vmlinux is
>> in the current dir, but you can override it with --linux flag.
>>
>> It can also insert actual source lines into a report (amount of
>> context is controlled with flags), for example:
>>
>>
>> Call Trace:
>>  [<     inline     >] khugepaged_scan_pmd include/linux/uaccess.h:9
>>       8 {
>>       9         current->pagefault_disabled++;
>>      10 }
>>  [<     inline     >] khugepaged_scan_mm_slot mm/huge_memory.c:2716
>>    2715                                   hend);
>>    2716                         ret = khugepaged_scan_pmd(mm, vma,
>>    2717
>> khugepaged_scan.address,
>>  [<     inline     >] khugepaged_do_scan mm/huge_memory.c:2796
>>    2795                     pass_through_head < 2)
>>    2796                         progress +=
>> khugepaged_scan_mm_slot(pages - progress,
>>    2797                                                             &hpage);
>>  [<ffffffff81790b33>] khugepaged+0x993/0x48e0 mm/huge_memory.c:2831
>>    2830         while (!kthread_should_stop()) {
>>    2831                 khugepaged_do_scan();
>>    2832                 khugepaged_wait_work();
>>  [<ffffffff813c195f>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
>>    1302                         add_wait_queue(k->waitq, &wait);
>>    1303                         __set_current_state(TASK_INTERRUPTIBLE);
>>    1304                 }
>>  [<ffffffff866d1b2f>] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:405
>>     404         SWAPGS
>>     405         jmp        restore_regs_and_iret
>>     406 END(ret_from_fork)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fs: GPF in locked_inode_to_wb_and_lock_list
  2016-04-21  8:35   ` Dmitry Vyukov
@ 2016-04-21  9:45     ` Andrey Ryabinin
  2016-04-21 10:00       ` Dmitry Vyukov
  0 siblings, 1 reply; 17+ messages in thread
From: Andrey Ryabinin @ 2016-04-21  9:45 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Tejun Heo, Alexander Viro, linux-fsdevel, LKML, syzkaller,
	Kostya Serebryany, Alexander Potapenko

2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
>
> ffffffff818884dd:       48 8b 03                mov    (%rbx),%rax
>
> So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb
> that is NULL?)

Yes it's NULL wb, because there is only one load:
    mov    (%rbx),%rax        =>       rax = wb->bdi
    add    $0x50,%rax         =>       rax = &bdi->wb

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fs: GPF in locked_inode_to_wb_and_lock_list
  2016-04-21  9:45     ` Andrey Ryabinin
@ 2016-04-21 10:00       ` Dmitry Vyukov
  2016-04-21 17:06         ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Dmitry Vyukov @ 2016-04-21 10:00 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Tejun Heo, Alexander Viro, linux-fsdevel, LKML, syzkaller,
	Kostya Serebryany, Alexander Potapenko

On Thu, Apr 21, 2016 at 11:45 AM, Andrey Ryabinin
<ryabinin.a.a@gmail.com> wrote:
> 2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
>>
>> ffffffff818884dd:       48 8b 03                mov    (%rbx),%rax
>>
>> So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb
>> that is NULL?)
>
> Yes it's NULL wb, because there is only one load:
>     mov    (%rbx),%rax        =>       rax = wb->bdi
>     add    $0x50,%rax         =>       rax = &bdi->wb


I bet that wb becomes NULL on the second iteration of the loop. The
loop loops in case of a race with another thread, so it would also
explain why it is difficult to reproduce.

Tejun, does it make any sense to you?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fs: GPF in locked_inode_to_wb_and_lock_list
  2016-04-21  8:25   ` Dmitry Vyukov
  2016-04-21  9:10     ` Andrey Ryabinin
@ 2016-04-21 16:14     ` Tejun Heo
  1 sibling, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2016-04-21 16:14 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Alexander Viro, linux-fsdevel, LKML, syzkaller,
	Kostya Serebryany, Alexander Potapenko

Hello,

On Thu, Apr 21, 2016 at 10:25:12AM +0200, Dmitry Vyukov wrote:
> I use this script for symbolization:
> https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py
> It invokes addr2line to provide file:line info, adds inline frames,
> strips ? frames (are they ever useful?) and strips timestamps.

Heh, sometimes they are.  They obviously are useful when backlinks are
broken for some reason.  Through the remnants of the old call stacks,
they also can sometimes tell the history of the task - where it was
before it got to the current backtrace.  It isn't reliable at all but
I can recall several occasions where they were useful in root causing
issues.  It *might* make a sense to have an option to not strip them.

And, yeah, I'm all for it being in the kernel tree.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fs: GPF in locked_inode_to_wb_and_lock_list
  2016-04-21 10:00       ` Dmitry Vyukov
@ 2016-04-21 17:06         ` Tejun Heo
  2016-04-22 18:55           ` Dmitry Vyukov
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2016-04-21 17:06 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Andrey Ryabinin, Alexander Viro, linux-fsdevel, LKML, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jens Axboe,
	Jan Kara

Hello,

(cc'ing Ilya, Jan and Jens)

On Thu, Apr 21, 2016 at 12:00:38PM +0200, Dmitry Vyukov wrote:
> On Thu, Apr 21, 2016 at 11:45 AM, Andrey Ryabinin
> <ryabinin.a.a@gmail.com> wrote:
> > 2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
> >>
> >> ffffffff818884dd:       48 8b 03                mov    (%rbx),%rax
> >>
> >> So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb
> >> that is NULL?)
> >
> > Yes it's NULL wb, because there is only one load:
> >     mov    (%rbx),%rax        =>       rax = wb->bdi
> >     add    $0x50,%rax         =>       rax = &bdi->wb
> 
> 
> I bet that wb becomes NULL on the second iteration of the loop. The
> loop loops in case of a race with another thread, so it would also
> explain why it is difficult to reproduce.
> 
> Tejun, does it make any sense to you?

Yeah, that makes sense.  I think the culprit is 43d1c0eb7e11 ("block:
detach bdev inode from its wb in __blkdev_put()") which allows inode
to wb association to be broken while other operations including
writeback are in progress.  I thought it should be okay as the inode
must be clean at that point but that obviously doesn't mean that there
can be no writeback operations in flight.

I hope we could eventually move away from the current model where we
try to swap out an underlying data structure while upper layers may
still be referring to it in the future but for now we can make sure
the writeback operation is finished before detaching wb.

Dmitry, I understand that the bug is difficult to reproduce but can
you please give the following patch a try?

Thanks.

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 20a2c02..209ea33 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1530,12 +1530,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		kill_bdev(bdev);
 
 		bdev_write_inode(bdev);
-		/*
-		 * Detaching bdev inode from its wb in __destroy_inode()
-		 * is too late: the queue which embeds its bdi (along with
-		 * root wb) can be gone as soon as we put_disk() below.
-		 */
-		inode_detach_wb(bdev->bd_inode);
+		inode_detach_blkdev_wb(bdev);
 	}
 	if (bdev->bd_contains == bdev) {
 		if (disk->fops->release)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d0b5ca5..ec1f530 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -230,6 +230,25 @@ static inline void inode_detach_wb(struct inode *inode)
 }
 
 /**
+ * inode_detach_blkdev_wb - disassociate a bd_inode from its wb
+ * @bdev: block_device of interest
+ *
+ * @bdev is being put for the last time.  Detaching bdev inode in
+ * __destroy_inode() is too late: the queue which embeds its bdi (along
+ * with root wb) can be gone as soon as the containing disk is put.
+ *
+ * This function dissociates @bdev->bd_inode from its wb.  The inode must
+ * be clean and no further operations should be started on it.
+ */
+static inline void inode_detach_blkdev_wb(struct block_device *bdev)
+{
+	if (bdev->bd_inode->i_wb) {
+		flush_delayed_work(&bdev->bd_inode->i_wb->dwork);
+		inode_detach_wb(bdev->bd_inode);
+	}
+}
+
+/**
  * wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite
  * @wbc: writeback_control of interest
  * @inode: target inode
@@ -277,6 +296,10 @@ static inline void inode_detach_wb(struct inode *inode)
 {
 }
 
+static inline void inode_detach_blkdev_wb(struct block_device *bdev)
+{
+}
+
 static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 					       struct inode *inode)
 	__releases(&inode->i_lock)

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: fs: GPF in locked_inode_to_wb_and_lock_list
  2016-04-21 17:06         ` Tejun Heo
@ 2016-04-22 18:55           ` Dmitry Vyukov
  2016-06-06 17:46             ` Dmitry Vyukov
  0 siblings, 1 reply; 17+ messages in thread
From: Dmitry Vyukov @ 2016-04-22 18:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrey Ryabinin, Alexander Viro, linux-fsdevel, LKML, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jens Axboe,
	Jan Kara

On Thu, Apr 21, 2016 at 7:06 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> (cc'ing Ilya, Jan and Jens)
>
> On Thu, Apr 21, 2016 at 12:00:38PM +0200, Dmitry Vyukov wrote:
>> On Thu, Apr 21, 2016 at 11:45 AM, Andrey Ryabinin
>> <ryabinin.a.a@gmail.com> wrote:
>> > 2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
>> >>
>> >> ffffffff818884dd:       48 8b 03                mov    (%rbx),%rax
>> >>
>> >> So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb
>> >> that is NULL?)
>> >
>> > Yes it's NULL wb, because there is only one load:
>> >     mov    (%rbx),%rax        =>       rax = wb->bdi
>> >     add    $0x50,%rax         =>       rax = &bdi->wb
>>
>>
>> I bet that wb becomes NULL on the second iteration of the loop. The
>> loop loops in case of a race with another thread, so it would also
>> explain why it is difficult to reproduce.
>>
>> Tejun, does it make any sense to you?
>
> Yeah, that makes sense.  I think the culprit is 43d1c0eb7e11 ("block:
> detach bdev inode from its wb in __blkdev_put()") which allows inode
> to wb association to be broken while other operations including
> writeback are in progress.  I thought it should be okay as the inode
> must be clean at that point but that obviously doesn't mean that there
> can be no writeback operations in flight.
>
> I hope we could eventually move away from the current model where we
> try to swap out an underlying data structure while upper layers may
> still be referring to it in the future but for now we can make sure
> the writeback operation is finished before detaching wb.
>
> Dmitry, I understand that the bug is difficult to reproduce but can
> you please give the following patch a try?


I've merged it into my tree and will restart the fuzzer and leave it
running for the weekend.
Though, yeah, it is difficult to reproduce...


> Thanks.
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 20a2c02..209ea33 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1530,12 +1530,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
>                 kill_bdev(bdev);
>
>                 bdev_write_inode(bdev);
> -               /*
> -                * Detaching bdev inode from its wb in __destroy_inode()
> -                * is too late: the queue which embeds its bdi (along with
> -                * root wb) can be gone as soon as we put_disk() below.
> -                */
> -               inode_detach_wb(bdev->bd_inode);
> +               inode_detach_blkdev_wb(bdev);
>         }
>         if (bdev->bd_contains == bdev) {
>                 if (disk->fops->release)
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index d0b5ca5..ec1f530 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -230,6 +230,25 @@ static inline void inode_detach_wb(struct inode *inode)
>  }
>
>  /**
> + * inode_detach_blkdev_wb - disassociate a bd_inode from its wb
> + * @bdev: block_device of interest
> + *
> + * @bdev is being put for the last time.  Detaching bdev inode in
> + * __destroy_inode() is too late: the queue which embeds its bdi (along
> + * with root wb) can be gone as soon as the containing disk is put.
> + *
> + * This function dissociates @bdev->bd_inode from its wb.  The inode must
> + * be clean and no further operations should be started on it.
> + */
> +static inline void inode_detach_blkdev_wb(struct block_device *bdev)
> +{
> +       if (bdev->bd_inode->i_wb) {
> +               flush_delayed_work(&bdev->bd_inode->i_wb->dwork);
> +               inode_detach_wb(bdev->bd_inode);
> +       }
> +}
> +
> +/**
>   * wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite
>   * @wbc: writeback_control of interest
>   * @inode: target inode
> @@ -277,6 +296,10 @@ static inline void inode_detach_wb(struct inode *inode)
>  {
>  }
>
> +static inline void inode_detach_blkdev_wb(struct block_device *bdev)
> +{
> +}
> +
>  static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
>                                                struct inode *inode)
>         __releases(&inode->i_lock)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fs: GPF in locked_inode_to_wb_and_lock_list
  2016-04-22 18:55           ` Dmitry Vyukov
@ 2016-06-06 17:46             ` Dmitry Vyukov
  2016-06-17 16:04               ` [PATCH] block: flush writeback dwork before detaching a bdev inode from it Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Dmitry Vyukov @ 2016-06-06 17:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrey Ryabinin, Alexander Viro, linux-fsdevel, LKML, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jens Axboe,
	Jan Kara

Hi Tejun,

I have not seem the crash since then. Please mail the patch.


On Fri, Apr 22, 2016 at 8:55 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Thu, Apr 21, 2016 at 7:06 PM, Tejun Heo <tj@kernel.org> wrote:
>> Hello,
>>
>> (cc'ing Ilya, Jan and Jens)
>>
>> On Thu, Apr 21, 2016 at 12:00:38PM +0200, Dmitry Vyukov wrote:
>>> On Thu, Apr 21, 2016 at 11:45 AM, Andrey Ryabinin
>>> <ryabinin.a.a@gmail.com> wrote:
>>> > 2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
>>> >>
>>> >> ffffffff818884dd:       48 8b 03                mov    (%rbx),%rax
>>> >>
>>> >> So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb
>>> >> that is NULL?)
>>> >
>>> > Yes it's NULL wb, because there is only one load:
>>> >     mov    (%rbx),%rax        =>       rax = wb->bdi
>>> >     add    $0x50,%rax         =>       rax = &bdi->wb
>>>
>>>
>>> I bet that wb becomes NULL on the second iteration of the loop. The
>>> loop loops in case of a race with another thread, so it would also
>>> explain why it is difficult to reproduce.
>>>
>>> Tejun, does it make any sense to you?
>>
>> Yeah, that makes sense.  I think the culprit is 43d1c0eb7e11 ("block:
>> detach bdev inode from its wb in __blkdev_put()") which allows inode
>> to wb association to be broken while other operations including
>> writeback are in progress.  I thought it should be okay as the inode
>> must be clean at that point but that obviously doesn't mean that there
>> can be no writeback operations in flight.
>>
>> I hope we could eventually move away from the current model where we
>> try to swap out an underlying data structure while upper layers may
>> still be referring to it in the future but for now we can make sure
>> the writeback operation is finished before detaching wb.
>>
>> Dmitry, I understand that the bug is difficult to reproduce but can
>> you please give the following patch a try?
>
>
> I've merged it into my tree and will restart the fuzzer and leave it
> running for the weekend.
> Though, yeah, it is difficult to reproduce...
>
>
>> Thanks.
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 20a2c02..209ea33 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -1530,12 +1530,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
>>                 kill_bdev(bdev);
>>
>>                 bdev_write_inode(bdev);
>> -               /*
>> -                * Detaching bdev inode from its wb in __destroy_inode()
>> -                * is too late: the queue which embeds its bdi (along with
>> -                * root wb) can be gone as soon as we put_disk() below.
>> -                */
>> -               inode_detach_wb(bdev->bd_inode);
>> +               inode_detach_blkdev_wb(bdev);
>>         }
>>         if (bdev->bd_contains == bdev) {
>>                 if (disk->fops->release)
>> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
>> index d0b5ca5..ec1f530 100644
>> --- a/include/linux/writeback.h
>> +++ b/include/linux/writeback.h
>> @@ -230,6 +230,25 @@ static inline void inode_detach_wb(struct inode *inode)
>>  }
>>
>>  /**
>> + * inode_detach_blkdev_wb - disassociate a bd_inode from its wb
>> + * @bdev: block_device of interest
>> + *
>> + * @bdev is being put for the last time.  Detaching bdev inode in
>> + * __destroy_inode() is too late: the queue which embeds its bdi (along
>> + * with root wb) can be gone as soon as the containing disk is put.
>> + *
>> + * This function dissociates @bdev->bd_inode from its wb.  The inode must
>> + * be clean and no further operations should be started on it.
>> + */
>> +static inline void inode_detach_blkdev_wb(struct block_device *bdev)
>> +{
>> +       if (bdev->bd_inode->i_wb) {
>> +               flush_delayed_work(&bdev->bd_inode->i_wb->dwork);
>> +               inode_detach_wb(bdev->bd_inode);
>> +       }
>> +}
>> +
>> +/**
>>   * wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite
>>   * @wbc: writeback_control of interest
>>   * @inode: target inode
>> @@ -277,6 +296,10 @@ static inline void inode_detach_wb(struct inode *inode)
>>  {
>>  }
>>
>> +static inline void inode_detach_blkdev_wb(struct block_device *bdev)
>> +{
>> +}
>> +
>>  static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
>>                                                struct inode *inode)
>>         __releases(&inode->i_lock)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] block: flush writeback dwork before detaching a bdev inode from it
  2016-06-06 17:46             ` Dmitry Vyukov
@ 2016-06-17 16:04               ` Tejun Heo
  2016-06-20 13:31                 ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2016-06-17 16:04 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrey Ryabinin, Alexander Viro, linux-fsdevel, LKML, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jan Kara,
	Dmitry Vyukov, kernel-team

43d1c0eb7e11 ("block: detach bdev inode from its wb in
__blkdev_put()") detached bdev inode from its wb as the bdev inode may
outlive the underlying bdi and thus the wb.  This is accomplished by
invoking inode_detach_wb() from __blkdev_put(); however, while the
inode can't be dirtied by the time control reaches there, that doesn't
guarantee that writeback isn't already in progress on the inode.  This
can lead to the inode being disassociated from its wb while writeback
operation is in flight causing oopses like the following.

  general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
  Modules linked in:
  CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Workqueue: writeback wb_workfn (flush-11:0)
  task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
  RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
  locked_inode_to_wb_and_lock_list+0xa2/0x750
  RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
  RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
  RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
  R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
  R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
  FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
  DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
  Stack:
   ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
   ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
   ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
  Call Trace:
   [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
   [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
   [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
   [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
   [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
   [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
   [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
   [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
   [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
   [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
  Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
  00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
  80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
  RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
  RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
  fs/fs-writeback.c:281
   RSP <ffff88006cdaf7d0>
  ---[ end trace 986a4d314dcb2694 ]---

Fix it by flushing the wb dwork before detaching the inode.  Combined
with the fact that the inode can no longer be dirtied, this guarantees
that no writeback operation can be in flight or initiated.  As this
involves details on writeback side which doesn't have much to do with
block_dev, encapsulate it in a helper function -
inode_detach_blkdev_wb().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com>
Link: http://lkml.kernel.org/g/CACT4Y+YAjq8mcfiVxR075didJKCyOCVrqxdbfKdgUxabstbfmA@mail.gmail.com
Fixes: 43d1c0eb7e11 ("block: detach bdev inode from its wb in __blkdev_put()")
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: stable@vger.kernel.org # v4.2+
---
 fs/block_dev.c            |    7 +------
 include/linux/writeback.h |   23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 71ccab1..6a14100 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1612,12 +1612,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		kill_bdev(bdev);
 
 		bdev_write_inode(bdev);
-		/*
-		 * Detaching bdev inode from its wb in __destroy_inode()
-		 * is too late: the queue which embeds its bdi (along with
-		 * root wb) can be gone as soon as we put_disk() below.
-		 */
-		inode_detach_wb(bdev->bd_inode);
+		inode_detach_blkdev_wb(bdev);
 	}
 	if (bdev->bd_contains == bdev) {
 		if (disk->fops->release)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d0b5ca5..ec1f530 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -230,6 +230,25 @@ static inline void inode_detach_wb(struct inode *inode)
 }
 
 /**
+ * inode_detach_blkdev_wb - disassociate a bd_inode from its wb
+ * @bdev: block_device of interest
+ *
+ * @bdev is being put for the last time.  Detaching bdev inode in
+ * __destroy_inode() is too late: the queue which embeds its bdi (along
+ * with root wb) can be gone as soon as the containing disk is put.
+ *
+ * This function dissociates @bdev->bd_inode from its wb.  The inode must
+ * be clean and no further operations should be started on it.
+ */
+static inline void inode_detach_blkdev_wb(struct block_device *bdev)
+{
+	if (bdev->bd_inode->i_wb) {
+		flush_delayed_work(&bdev->bd_inode->i_wb->dwork);
+		inode_detach_wb(bdev->bd_inode);
+	}
+}
+
+/**
  * wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite
  * @wbc: writeback_control of interest
  * @inode: target inode
@@ -277,6 +296,10 @@ static inline void inode_detach_wb(struct inode *inode)
 {
 }
 
+static inline void inode_detach_blkdev_wb(struct block_device *bdev)
+{
+}
+
 static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 					       struct inode *inode)
 	__releases(&inode->i_lock)

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] block: flush writeback dwork before detaching a bdev inode from it
  2016-06-17 16:04               ` [PATCH] block: flush writeback dwork before detaching a bdev inode from it Tejun Heo
@ 2016-06-20 13:31                 ` Jan Kara
  2016-06-20 13:38                   ` Dmitry Vyukov
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2016-06-20 13:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Andrey Ryabinin, Alexander Viro, linux-fsdevel, LKML,
	syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov,
	Jan Kara, Dmitry Vyukov, kernel-team, Ted Tso

[-- Attachment #1: Type: text/plain, Size: 2248 bytes --]

Hi,

On Fri 17-06-16 12:04:05, Tejun Heo wrote:
> 43d1c0eb7e11 ("block: detach bdev inode from its wb in
> __blkdev_put()") detached bdev inode from its wb as the bdev inode may
> outlive the underlying bdi and thus the wb.  This is accomplished by
> invoking inode_detach_wb() from __blkdev_put(); however, while the
> inode can't be dirtied by the time control reaches there, that doesn't
> guarantee that writeback isn't already in progress on the inode.  This
> can lead to the inode being disassociated from its wb while writeback
> operation is in flight causing oopses like the following.

<snip>

> Fix it by flushing the wb dwork before detaching the inode.  Combined
> with the fact that the inode can no longer be dirtied, this guarantees
> that no writeback operation can be in flight or initiated.

Sorry for the late reply but now when thinking about the patch I don't
think it is quite right. Writeback can happen from other contexts than just
the worker one (e.g. kswapd can do writeback, or in some out-of-memory
situations we may punt to doing writeback directly instead of calling the
worker, or sync(2) calls fdatawrite() for block device inode directly when
iterating through blockdev superblock). So flushing the workqueue IMHO is
not covering 100% of cases. 

I wanted to suggest to use inode_wait_for_writeback() which will make sure
writeback is done with the inode. However we effectively already do this
by calling bdev_write_inode() which calls writeback_single_inode() in
WB_SYNC_ALL mode. So by the time that call completes we are sure writeback
code is not looking at the inode. *However* what I think is happening is
that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes
the inode and clears all dirty bits, however the inode still stays in the
b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently
once flusher work runs, it finds the inode, looks at it and boom. And the
problem seems to be that write_inode_now(inode, true) does not result in
I_DIRTY_TIME being cleared.

Attached patch should fix this issue - it is compile-tested only. Dmitry,
can you check whether this patch fixes the issue for you as well?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-writeback-Write-dirty-times-for-WB_SYNC_ALL-writebac.patch --]
[-- Type: text/x-patch, Size: 4228 bytes --]

>From 907aceb33f6b598f00c9b1d10a110c74e4a0f954 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Mon, 20 Jun 2016 15:13:42 +0200
Subject: [PATCH] writeback: Write dirty times for WB_SYNC_ALL writeback

Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and
queue_io() so that inodes which have only dirty timestamps are properly
written on fsync(2) and sync(2). However there are other call sites -
most notably going through write_inode_now() - which expect inode to be
clean after WB_SYNC_ALL writeback. This is not currently true as we do
not clear I_DIRTY_TIME in __writeback_single_inode() even for
WB_SYNC_ALL writeback in all the cases. This then resulted in the
following oops because bdev_write_inode() did not clean the inode and
writeback code later stumbled over a dirty inode with detached wb.

  general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
  Modules linked in:
  CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Workqueue: writeback wb_workfn (flush-11:0)
  task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
  RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
  locked_inode_to_wb_and_lock_list+0xa2/0x750
  RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
  RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
  RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
  R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
  R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
  FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
  DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
  Stack:
   ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
   ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
   ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
  Call Trace:
   [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
   [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
   [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
   [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
   [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
   [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
   [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
   [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
   [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
   [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
  Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
  00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
  80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
  RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
  RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
  fs/fs-writeback.c:281
   RSP <ffff88006cdaf7d0>
  ---[ end trace 986a4d314dcb2694 ]---

Fix the problem by making sure __writeback_single_inode() writes inode
only with dirty times in WB_SYNC_ALL mode.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 989a2cef6b76..51efb644410a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1291,6 +1291,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	dirty = inode->i_state & I_DIRTY;
 	if (inode->i_state & I_DIRTY_TIME) {
 		if ((dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) ||
+		    wbc->sync_mode == WB_SYNC_ALL ||
 		    unlikely(inode->i_state & I_DIRTY_TIME_EXPIRED) ||
 		    unlikely(time_after(jiffies,
 					(inode->dirtied_time_when +
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] block: flush writeback dwork before detaching a bdev inode from it
  2016-06-20 13:31                 ` Jan Kara
@ 2016-06-20 13:38                   ` Dmitry Vyukov
  2016-06-20 17:40                     ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Dmitry Vyukov @ 2016-06-20 13:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, Jens Axboe, Andrey Ryabinin, Alexander Viro,
	linux-fsdevel, LKML, syzkaller, Kostya Serebryany,
	Alexander Potapenko, Ilya Dryomov, Jan Kara, kernel-team,
	Ted Tso

On Mon, Jun 20, 2016 at 3:31 PM, Jan Kara <jack@suse.cz> wrote:
> Hi,
>
> On Fri 17-06-16 12:04:05, Tejun Heo wrote:
>> 43d1c0eb7e11 ("block: detach bdev inode from its wb in
>> __blkdev_put()") detached bdev inode from its wb as the bdev inode may
>> outlive the underlying bdi and thus the wb.  This is accomplished by
>> invoking inode_detach_wb() from __blkdev_put(); however, while the
>> inode can't be dirtied by the time control reaches there, that doesn't
>> guarantee that writeback isn't already in progress on the inode.  This
>> can lead to the inode being disassociated from its wb while writeback
>> operation is in flight causing oopses like the following.
>
> <snip>
>
>> Fix it by flushing the wb dwork before detaching the inode.  Combined
>> with the fact that the inode can no longer be dirtied, this guarantees
>> that no writeback operation can be in flight or initiated.
>
> Sorry for the late reply but now when thinking about the patch I don't
> think it is quite right. Writeback can happen from other contexts than just
> the worker one (e.g. kswapd can do writeback, or in some out-of-memory
> situations we may punt to doing writeback directly instead of calling the
> worker, or sync(2) calls fdatawrite() for block device inode directly when
> iterating through blockdev superblock). So flushing the workqueue IMHO is
> not covering 100% of cases.
>
> I wanted to suggest to use inode_wait_for_writeback() which will make sure
> writeback is done with the inode. However we effectively already do this
> by calling bdev_write_inode() which calls writeback_single_inode() in
> WB_SYNC_ALL mode. So by the time that call completes we are sure writeback
> code is not looking at the inode. *However* what I think is happening is
> that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes
> the inode and clears all dirty bits, however the inode still stays in the
> b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently
> once flusher work runs, it finds the inode, looks at it and boom. And the
> problem seems to be that write_inode_now(inode, true) does not result in
> I_DIRTY_TIME being cleared.
>
> Attached patch should fix this issue - it is compile-tested only. Dmitry,
> can you check whether this patch fixes the issue for you as well?


I can't directly test it because crash happened very infrequently.
If Tejun/Ted agree that it is the right way to fix it, then I can
patch it, restart the fuzzer and leave it running for a while.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block: flush writeback dwork before detaching a bdev inode from it
  2016-06-20 13:38                   ` Dmitry Vyukov
@ 2016-06-20 17:40                     ` Tejun Heo
  2016-06-21 12:58                       ` Dmitry Vyukov
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2016-06-20 17:40 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Jan Kara, Jens Axboe, Andrey Ryabinin, Alexander Viro,
	linux-fsdevel, LKML, syzkaller, Kostya Serebryany,
	Alexander Potapenko, Ilya Dryomov, Jan Kara, kernel-team,
	Ted Tso

Hello,

On Mon, Jun 20, 2016 at 03:38:41PM +0200, Dmitry Vyukov wrote:
> > Sorry for the late reply but now when thinking about the patch I don't
> > think it is quite right. Writeback can happen from other contexts than just
> > the worker one (e.g. kswapd can do writeback, or in some out-of-memory
> > situations we may punt to doing writeback directly instead of calling the
> > worker, or sync(2) calls fdatawrite() for block device inode directly when
> > iterating through blockdev superblock). So flushing the workqueue IMHO is
> > not covering 100% of cases.

Hmmm, yeah, the patch undoes what the cgroup writeback changes added
but it looks like the addition was exposing the existing problem more
rather than causing a new one.

> > I wanted to suggest to use inode_wait_for_writeback() which will make sure
> > writeback is done with the inode. However we effectively already do this
> > by calling bdev_write_inode() which calls writeback_single_inode() in
> > WB_SYNC_ALL mode. So by the time that call completes we are sure writeback
> > code is not looking at the inode. *However* what I think is happening is
> > that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes
> > the inode and clears all dirty bits, however the inode still stays in the
> > b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently
> > once flusher work runs, it finds the inode, looks at it and boom. And the
> > problem seems to be that write_inode_now(inode, true) does not result in
> > I_DIRTY_TIME being cleared.
> >
> > Attached patch should fix this issue - it is compile-tested only. Dmitry,
> > can you check whether this patch fixes the issue for you as well?
> 
> I can't directly test it because crash happened very infrequently.
> If Tejun/Ted agree that it is the right way to fix it, then I can
> patch it, restart the fuzzer and leave it running for a while.

Yes, please try out the patch.

Thanks a lot.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block: flush writeback dwork before detaching a bdev inode from it
  2016-06-20 17:40                     ` Tejun Heo
@ 2016-06-21 12:58                       ` Dmitry Vyukov
  0 siblings, 0 replies; 17+ messages in thread
From: Dmitry Vyukov @ 2016-06-21 12:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Jens Axboe, Andrey Ryabinin, Alexander Viro,
	linux-fsdevel, LKML, syzkaller, Kostya Serebryany,
	Alexander Potapenko, Ilya Dryomov, Jan Kara, kernel-team,
	Ted Tso

On Mon, Jun 20, 2016 at 7:40 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Mon, Jun 20, 2016 at 03:38:41PM +0200, Dmitry Vyukov wrote:
>> > Sorry for the late reply but now when thinking about the patch I don't
>> > think it is quite right. Writeback can happen from other contexts than just
>> > the worker one (e.g. kswapd can do writeback, or in some out-of-memory
>> > situations we may punt to doing writeback directly instead of calling the
>> > worker, or sync(2) calls fdatawrite() for block device inode directly when
>> > iterating through blockdev superblock). So flushing the workqueue IMHO is
>> > not covering 100% of cases.
>
> Hmmm, yeah, the patch undoes what the cgroup writeback changes added
> but it looks like the addition was exposing the existing problem more
> rather than causing a new one.
>
>> > I wanted to suggest to use inode_wait_for_writeback() which will make sure
>> > writeback is done with the inode. However we effectively already do this
>> > by calling bdev_write_inode() which calls writeback_single_inode() in
>> > WB_SYNC_ALL mode. So by the time that call completes we are sure writeback
>> > code is not looking at the inode. *However* what I think is happening is
>> > that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes
>> > the inode and clears all dirty bits, however the inode still stays in the
>> > b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently
>> > once flusher work runs, it finds the inode, looks at it and boom. And the
>> > problem seems to be that write_inode_now(inode, true) does not result in
>> > I_DIRTY_TIME being cleared.
>> >
>> > Attached patch should fix this issue - it is compile-tested only. Dmitry,
>> > can you check whether this patch fixes the issue for you as well?
>>
>> I can't directly test it because crash happened very infrequently.
>> If Tejun/Ted agree that it is the right way to fix it, then I can
>> patch it, restart the fuzzer and leave it running for a while.
>
> Yes, please try out the patch.

Done
At least it should catch any new introduced bugs.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-06-21 13:05 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-18  9:44 fs: GPF in locked_inode_to_wb_and_lock_list Dmitry Vyukov
2016-04-20 21:14 ` Tejun Heo
2016-04-21  8:25   ` Dmitry Vyukov
2016-04-21  9:10     ` Andrey Ryabinin
2016-04-21  9:29       ` Dmitry Vyukov
2016-04-21 16:14     ` Tejun Heo
2016-04-21  8:35   ` Dmitry Vyukov
2016-04-21  9:45     ` Andrey Ryabinin
2016-04-21 10:00       ` Dmitry Vyukov
2016-04-21 17:06         ` Tejun Heo
2016-04-22 18:55           ` Dmitry Vyukov
2016-06-06 17:46             ` Dmitry Vyukov
2016-06-17 16:04               ` [PATCH] block: flush writeback dwork before detaching a bdev inode from it Tejun Heo
2016-06-20 13:31                 ` Jan Kara
2016-06-20 13:38                   ` Dmitry Vyukov
2016-06-20 17:40                     ` Tejun Heo
2016-06-21 12:58                       ` Dmitry Vyukov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).