* btrfs bio linked list corruption.
@ 2016-10-11 14:45 Dave Jones
  2016-10-11 15:11 ` Al Viro
  ` (2 more replies)
  0 siblings, 3 replies; 118+ messages in thread

From: Dave Jones @ 2016-10-11 14:45 UTC (permalink / raw)
To: Al Viro; +Cc: Chris Mason, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel

This is from Linus' current tree, with Al's iovec fixups on top.

------------[ cut here ]------------
WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
 ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
 ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
 ffff880503878b80 ffffe8ffff806648 ffffe8ffffc06600 ffff880502808008
Call Trace:
 [<ffffffff8d32007c>] dump_stack+0x4f/0x73
 [<ffffffff8d07a6c1>] __warn+0xc1/0xe0
 [<ffffffff8d07a73a>] warn_slowpath_fmt+0x5a/0x80
 [<ffffffff8d33e689>] __list_add+0x89/0xb0
 [<ffffffff8d30a1c8>] blk_sq_make_request+0x2f8/0x350
 [<ffffffff8d2fd9cc>] ? generic_make_request+0xec/0x240
 [<ffffffff8d2fd9d9>] generic_make_request+0xf9/0x240
 [<ffffffff8d2fdb98>] submit_bio+0x78/0x150
 [<ffffffff8d349c05>] ? __percpu_counter_add+0x85/0xb0
 [<ffffffffc03627de>] btrfs_map_bio+0x19e/0x330 [btrfs]
 [<ffffffffc03289ca>] btree_submit_bio_hook+0xfa/0x110 [btrfs]
 [<ffffffffc034ff15>] submit_one_bio+0x65/0xa0 [btrfs]
 [<ffffffffc0358cb0>] read_extent_buffer_pages+0x2f0/0x3d0 [btrfs]
 [<ffffffffc0327020>] ? free_root_pointers+0x60/0x60 [btrfs]
 [<ffffffffc03283c8>] btree_read_extent_buffer_pages.constprop.55+0xa8/0x110 [btrfs]
 [<ffffffffc0328bcd>] read_tree_block+0x2d/0x50 [btrfs]
 [<ffffffffc03080a4>] read_block_for_search.isra.33+0x134/0x330 [btrfs]
 [<ffffffff8d7c2d6c>] ? _raw_write_unlock+0x2c/0x50
 [<ffffffffc0302fec>] ? unlock_up+0x16c/0x1a0 [btrfs]
 [<ffffffffc030a3d0>] btrfs_search_slot+0x450/0xa40 [btrfs]
 [<ffffffffc0324983>] btrfs_del_csums+0xe3/0x2e0 [btrfs]
 [<ffffffffc03134fd>] __btrfs_free_extent.isra.82+0x32d/0xc90 [btrfs]
 [<ffffffffc03178b3>] __btrfs_run_delayed_refs+0x4d3/0x1010 [btrfs]
 [<ffffffff8d33e5d7>] ? debug_smp_processor_id+0x17/0x20
 [<ffffffff8d0c6109>] ? get_lock_stats+0x19/0x50
 [<ffffffffc031b32c>] btrfs_run_delayed_refs+0x9c/0x2d0 [btrfs]
 [<ffffffffc033d628>] btrfs_truncate_inode_items+0x888/0xda0 [btrfs]
 [<ffffffffc033dc25>] btrfs_truncate+0xe5/0x2b0 [btrfs]
 [<ffffffffc033e569>] btrfs_setattr+0x249/0x360 [btrfs]
 [<ffffffff8d1f4092>] notify_change+0x252/0x440
 [<ffffffff8d1d164e>] do_truncate+0x6e/0xc0
 [<ffffffff8d1d1a4c>] do_sys_ftruncate.constprop.19+0x10c/0x170
 [<ffffffff8d33e5f3>] ? __this_cpu_preempt_check+0x13/0x20
 [<ffffffff8d1d1ad9>] SyS_ftruncate+0x9/0x10
 [<ffffffff8d00259c>] do_syscall_64+0x5c/0x170
 [<ffffffff8d7c2f8b>] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace 906673a2f703b373 ]---

^ permalink raw reply	[flat|nested] 118+ messages in thread
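For readers unfamiliar with the warning text above: CONFIG_DEBUG_LIST validates a list's neighbour pointers before linking a new entry, which is what produces the "prev->next should be next" message. A minimal userspace sketch of the same invariant (not the kernel's actual code, just the idea behind lib/list_debug.c):

```c
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

/* Mirror of the debug-list sanity check: before linking an entry
 * between prev and next, verify the neighbours still point at each
 * other.  If someone scribbled on prev->next, refuse the insertion. */
static int list_add_valid(struct list_head *prev, struct list_head *next)
{
	if (prev->next != next) {
		fprintf(stderr,
			"list_add corruption. prev->next should be next (%p), but was %p.\n",
			(void *)next, (void *)prev->next);
		return 0;
	}
	if (next->prev != prev)
		return 0;
	return 1;
}

/* Insert entry at the tail of the list headed by head.
 * Returns 0 on success, -1 if the list is corrupted. */
static int list_add_tail(struct list_head *entry, struct list_head *head)
{
	struct list_head *prev = head->prev;

	if (!list_add_valid(prev, head))
		return -1;
	entry->next = head;
	entry->prev = prev;
	prev->next = entry;
	head->prev = entry;
	return 0;
}
```

The warning therefore does not identify the corrupter, only the first well-behaved caller to trip over the damage, which is why the traces below all point at innocent-looking insertion sites.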
* Re: btrfs bio linked list corruption.
From: Al Viro @ 2016-10-11 15:11 UTC (permalink / raw)
To: Dave Jones, Chris Mason, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel

On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
> This is from Linus' current tree, with Al's iovec fixups on top.

Those iovec fixups are in the current tree... TBH, I don't see anything in splice-related stuff that could come anywhere near that (short of some general memory corruption having random effects of that sort).

Could you try to bisect that sucker, or is it too hard to reproduce?
* Re: btrfs bio linked list corruption.
From: Dave Jones @ 2016-10-11 15:19 UTC (permalink / raw)
To: Al Viro; +Cc: Chris Mason, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel

On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote:
> On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
> > This is from Linus' current tree, with Al's iovec fixups on top.
>
> Those iovec fixups are in the current tree...

ah yeah, git quietly dropped my local copy when I rebased so I didn't notice.

> TBH, I don't see anything
> in splice-related stuff that could come anywhere near that (short of
> some general memory corruption having random effects of that sort).
>
> Could you try to bisect that sucker, or is it too hard to reproduce?

Only hit it the once overnight so far. Will see if I can find a better way to reproduce today.

	Dave
* Re: btrfs bio linked list corruption.
From: Chris Mason @ 2016-10-11 15:20 UTC (permalink / raw)
To: Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel

On 10/11/2016 11:19 AM, Dave Jones wrote:
> On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote:
> > On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
> > > This is from Linus' current tree, with Al's iovec fixups on top.
> >
> > Those iovec fixups are in the current tree...
>
> ah yeah, git quietly dropped my local copy when I rebased so I didn't notice.
>
> > TBH, I don't see anything
> > in splice-related stuff that could come anywhere near that (short of
> > some general memory corruption having random effects of that sort).
> >
> > Could you try to bisect that sucker, or is it too hard to reproduce?
>
> Only hit it the once overnight so far. Will see if I can find a better way to
> reproduce today.

This call trace is reading metadata so we can finish the truncate. I'd say adding more memory pressure would make it happen more often. I'll try to trigger.

-chris
* Re: btrfs bio linked list corruption.
From: Dave Jones @ 2016-10-11 15:49 UTC (permalink / raw)
To: Chris Mason; +Cc: Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel

On Tue, Oct 11, 2016 at 11:20:41AM -0400, Chris Mason wrote:
> On 10/11/2016 11:19 AM, Dave Jones wrote:
> > On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote:
> > > On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
> > > > This is from Linus' current tree, with Al's iovec fixups on top.
> > >
> > > Those iovec fixups are in the current tree...
> >
> > ah yeah, git quietly dropped my local copy when I rebased so I didn't notice.
> >
> > > TBH, I don't see anything
> > > in splice-related stuff that could come anywhere near that (short of
> > > some general memory corruption having random effects of that sort).
> > >
> > > Could you try to bisect that sucker, or is it too hard to reproduce?
> >
> > Only hit it the once overnight so far. Will see if I can find a better way to
> > reproduce today.
>
> This call trace is reading metadata so we can finish the truncate. I'd
> say adding more memory pressure would make it happen more often. I'll
> try to trigger.

That story checks out. There were a bunch of oom's in the log before this.

	Dave
* Re: btrfs bio linked list corruption.
From: Chris Mason @ 2016-10-11 15:54 UTC (permalink / raw)
To: Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel

On 10/11/2016 10:45 AM, Dave Jones wrote:
> This is from Linus' current tree, with Al's iovec fixups on top.
>
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
> ffff880503878b80 ffffe8ffff806648 ffffe8ffffc06600 ffff880502808008
> Call Trace:
> [<ffffffff8d32007c>] dump_stack+0x4f/0x73
> [<ffffffff8d07a6c1>] __warn+0xc1/0xe0
> [<ffffffff8d07a73a>] warn_slowpath_fmt+0x5a/0x80
> [<ffffffff8d33e689>] __list_add+0x89/0xb0
> [<ffffffff8d30a1c8>] blk_sq_make_request+0x2f8/0x350

	/*
	 * A task plug currently exists. Since this is completely lockless,
	 * utilize that to temporarily store requests until the task is
	 * either done or scheduled away.
	 */
	plug = current->plug;
	if (plug) {
		blk_mq_bio_to_request(rq, bio);
		if (!request_count)
			trace_block_plug(q);

		blk_mq_put_ctx(data.ctx);

		if (request_count >= BLK_MAX_REQUEST_COUNT) {
			blk_flush_plug_list(plug, false);
			trace_block_plug(q);
		}

		list_add_tail(&rq->queuelist, &plug->mq_list);
		^^^^^^^^^^^^^^^^^^^^^^

Dave, is this where we're crashing?  This seems strange.

-chris
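The plugging pattern Chris quotes is worth spelling out: requests accumulate on a per-task list and get flushed down to the driver in a batch once the list grows past BLK_MAX_REQUEST_COUNT. A self-contained toy model of that batching (names and structure simplified; this is not the blk-mq code itself):

```c
#include <stddef.h>

#define BLK_MAX_REQUEST_COUNT 16

struct request { struct request *next; };

/* Toy per-task plug: a singly linked batch of pending requests. */
struct plug {
	struct request *head;
	int count;
	int flushes;	/* how many times the batch was handed down */
};

/* Hand the whole accumulated batch to the "driver", then reset. */
static void flush_plug_list(struct plug *plug)
{
	plug->head = NULL;
	plug->count = 0;
	plug->flushes++;
}

/* Queue a request on the plug, flushing first if the batch is full.
 * This mirrors the request_count >= BLK_MAX_REQUEST_COUNT check in
 * the quoted block-layer code. */
static void plug_request(struct plug *plug, struct request *rq)
{
	if (plug->count >= BLK_MAX_REQUEST_COUNT)
		flush_plug_list(plug);
	rq->next = plug->head;
	plug->head = rq;
	plug->count++;
}
```

The real list is only ever touched by the owning task, which is why it can be lockless, and also why a corrupted neighbour pointer there implies either a stray write from elsewhere or a request freed while still on the list.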
* Re: btrfs bio linked list corruption.
From: Dave Jones @ 2016-10-11 16:25 UTC (permalink / raw)
To: Chris Mason; +Cc: Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel

On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
> On 10/11/2016 10:45 AM, Dave Jones wrote:
> > This is from Linus' current tree, with Al's iovec fixups on top.
> >
> > ------------[ cut here ]------------
> > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
> > ffff880503878b80 ffffe8ffff806648 ffffe8ffffc06600 ffff880502808008
> > Call Trace:
> > [<ffffffff8d32007c>] dump_stack+0x4f/0x73
> > [<ffffffff8d07a6c1>] __warn+0xc1/0xe0
> > [<ffffffff8d07a73a>] warn_slowpath_fmt+0x5a/0x80
> > [<ffffffff8d33e689>] __list_add+0x89/0xb0
> > [<ffffffff8d30a1c8>] blk_sq_make_request+0x2f8/0x350
>
> 	list_add_tail(&rq->queuelist, &plug->mq_list);
> 	^^^^^^^^^^^^^^^^^^^^^^
>
> Dave, is this where we're crashing?  This seems strange.

According to objdump -S ..

ffffffff8130a1b7:	48 8b 70 50	mov    0x50(%rax),%rsi
	list_add_tail(&rq->queuelist, &ctx->rq_list);
ffffffff8130a1bb:	48 8d 50 48	lea    0x48(%rax),%rdx
ffffffff8130a1bf:	48 89 45 a8	mov    %rax,-0x58(%rbp)
ffffffff8130a1c3:	e8 38 44 03 00	callq  ffffffff8133e600 <__list_add>
	blk_mq_hctx_mark_pending(hctx, ctx);
ffffffff8130a1c8:	48 8b 45 a8	mov    -0x58(%rbp),%rax
ffffffff8130a1cc:	4c 89 ff	mov    %r15,%rdi

That looks like the list_add_tail from __blk_mq_insert_req_list

	Dave
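The same answer can be had without reading raw disassembly: given a vmlinux built with debug info, a trace address resolves straight to a source line. A sketch of the two usual approaches (the vmlinux path and the address are placeholders; substitute your own build and an address from the trace — note trace entries are return addresses, so the call of interest sits just before the resolved instruction):

```shell
VMLINUX=./vmlinux            # kernel image built with CONFIG_DEBUG_INFO
ADDR=ffffffff8130a1c8        # return address lifted from the Call Trace
STOP=ffffffff8130a208        # a few instructions past ADDR

if [ -r "$VMLINUX" ]; then
    # Resolve address -> function and file:line, following inlines:
    addr2line -f -i -e "$VMLINUX" "$ADDR"
    # Or disassemble just that window with interleaved source,
    # like the objdump -S output above:
    objdump -S --start-address=0x$ADDR --stop-address=0x$STOP "$VMLINUX"
else
    echo "no vmlinux here; rebuild with CONFIG_DEBUG_INFO to use this"
fi
```

For module addresses (the [btrfs] frames) the module's .ko with debug info is needed instead, offset by the module's load address.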
* Re: btrfs bio linked list corruption.
From: Dave Jones @ 2016-10-12 13:47 UTC (permalink / raw)
To: Chris Mason; +Cc: Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel

On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
> On 10/11/2016 10:45 AM, Dave Jones wrote:
> > This is from Linus' current tree, with Al's iovec fixups on top.
> >
> > ------------[ cut here ]------------
> > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880

I hit this again overnight, it's the same trace, the only difference being slightly different addresses in the list pointers:

[42572.777196] list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc90000647cd8. (prev=ffff880503a0ba00).

I'm actually a little surprised that ->next was the same across two reboots on two different kernel builds. That might be a sign this is more repeatable than I'd thought, even if it does take hours of runtime right now to trigger it. I'll try and narrow the scope of what trinity is doing to see if I can make it happen faster.

	Dave
* Re: btrfs bio linked list corruption.
From: Dave Jones @ 2016-10-12 14:40 UTC (permalink / raw)
To: Chris Mason, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel

On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
> On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
> >
> > On 10/11/2016 10:45 AM, Dave Jones wrote:
> > > This is from Linus' current tree, with Al's iovec fixups on top.
> > >
> > > ------------[ cut here ]------------
> > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> > > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
>
> I hit this again overnight, it's the same trace, the only difference
> being slightly different addresses in the list pointers:
>
> [42572.777196] list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc90000647cd8. (prev=ffff880503a0ba00).
>
> I'm actually a little surprised that ->next was the same across two
> reboots on two different kernel builds. That might be a sign this is
> more repeatable than I'd thought, even if it does take hours of runtime
> right now to trigger it. I'll try and narrow the scope of what trinity
> is doing to see if I can make it happen faster.

.. and of course the first thing that happens is a completely different btrfs trace..

WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
 ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
 ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
 0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a
Call Trace:
 [<ffffffffb731ff3c>] dump_stack+0x4f/0x73
 [<ffffffffb707a6c1>] __warn+0xc1/0xe0
 [<ffffffffb707a7e8>] warn_slowpath_null+0x18/0x20
 [<ffffffffc01d312a>] start_transaction+0x40a/0x440 [btrfs]
 [<ffffffffc01a6215>] ? btrfs_alloc_path+0x15/0x20 [btrfs]
 [<ffffffffc01d31b2>] btrfs_join_transaction+0x12/0x20 [btrfs]
 [<ffffffffc01d92cf>] cow_file_range_inline+0xef/0x830 [btrfs]
 [<ffffffffc01d9d75>] cow_file_range.isra.64+0x365/0x480 [btrfs]
 [<ffffffffb77c24cc>] ? _raw_spin_unlock+0x2c/0x50
 [<ffffffffc01f229f>] ? release_extent_buffer+0x9f/0x110 [btrfs]
 [<ffffffffc01da299>] run_delalloc_nocow+0x409/0xbd0 [btrfs]
 [<ffffffffb70c6109>] ? get_lock_stats+0x19/0x50
 [<ffffffffc01dadea>] run_delalloc_range+0x38a/0x3e0 [btrfs]
 [<ffffffffc01f4aba>] writepage_delalloc.isra.47+0x10a/0x190 [btrfs]
 [<ffffffffc01f7678>] __extent_writepage+0xd8/0x2c0 [btrfs]
 [<ffffffffc01f7b2e>] extent_write_cache_pages.isra.44.constprop.63+0x2ce/0x430 [btrfs]
 [<ffffffffb733e497>] ? debug_smp_processor_id+0x17/0x20
 [<ffffffffb70c6109>] ? get_lock_stats+0x19/0x50
 [<ffffffffc01f8278>] extent_writepages+0x58/0x80 [btrfs]
 [<ffffffffc01d7a80>] ? btrfs_releasepage+0x40/0x40 [btrfs]
 [<ffffffffc01d4a63>] btrfs_writepages+0x23/0x30 [btrfs]
 [<ffffffffb7162e9c>] do_writepages+0x1c/0x30
 [<ffffffffb71550f1>] __filemap_fdatawrite_range+0xc1/0x100
 [<ffffffffb71551ee>] filemap_fdatawrite_range+0xe/0x10
 [<ffffffffc01eb2fb>] btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
 [<ffffffffc01f0820>] btrfs_wait_ordered_range+0x40/0x100 [btrfs]
 [<ffffffffc01eb5d5>] btrfs_sync_file+0x285/0x390 [btrfs]
 [<ffffffffb7207626>] vfs_fsync_range+0x46/0xa0
 [<ffffffffb72076d8>] do_fsync+0x38/0x60
 [<ffffffffb720795b>] SyS_fsync+0xb/0x10
 [<ffffffffb700259c>] do_syscall_64+0x5c/0x170
 [<ffffffffb77c2e4b>] entry_SYSCALL64_slow_path+0x25/0x25
* Re: btrfs bio linked list corruption.
From: Chris Mason @ 2016-10-12 14:42 UTC (permalink / raw)
To: Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel

On 10/12/2016 10:40 AM, Dave Jones wrote:
> On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
> > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
> > >
> > > On 10/11/2016 10:45 AM, Dave Jones wrote:
> > > > This is from Linus' current tree, with Al's iovec fixups on top.
> > > >
> > > > ------------[ cut here ]------------
> > > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > > > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> > > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > > > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> > > > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
> >
> > I hit this again overnight, it's the same trace, the only difference
> > being slightly different addresses in the list pointers:
> >
> > [42572.777196] list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc90000647cd8. (prev=ffff880503a0ba00).
> >
> > I'm actually a little surprised that ->next was the same across two
> > reboots on two different kernel builds. That might be a sign this is
> > more repeatable than I'd thought, even if it does take hours of runtime
> > right now to trigger it. I'll try and narrow the scope of what trinity
> > is doing to see if I can make it happen faster.
>
> .. and of course the first thing that happens is a completely different
> btrfs trace..
>
> WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
> ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
> 0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a

This isn't even IO. Uuughhhh. We're going to need a fast enough test that we can bisect.

-chris
* Re: btrfs bio linked list corruption.
From: Dave Jones @ 2016-10-13 18:16 UTC (permalink / raw)
To: Chris Mason; +Cc: Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel

On Wed, Oct 12, 2016 at 10:42:46AM -0400, Chris Mason wrote:
> On 10/12/2016 10:40 AM, Dave Jones wrote:
> > On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
> > > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
> > > >
> > > > On 10/11/2016 10:45 AM, Dave Jones wrote:
> > > > > This is from Linus' current tree, with Al's iovec fixups on top.
> > > > >
> > > > > ------------[ cut here ]------------
> > > > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > > > > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> > > > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > > > > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> > > > > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
> > >
> > > I hit this again overnight, it's the same trace, the only difference
> > > being slightly different addresses in the list pointers:
> > >
> > > [42572.777196] list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc90000647cd8. (prev=ffff880503a0ba00).
> > >
> > > I'm actually a little surprised that ->next was the same across two
> > > reboots on two different kernel builds. That might be a sign this is
> > > more repeatable than I'd thought, even if it does take hours of runtime
> > > right now to trigger it. I'll try and narrow the scope of what trinity
> > > is doing to see if I can make it happen faster.
> >
> > .. and of course the first thing that happens is a completely different
> > btrfs trace..
> >
> > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> > ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
> > ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
> > 0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a
>
> This isn't even IO. Uuughhhh. We're going to need a fast enough test
> that we can bisect.

Progress...
I've found that this combination of syscalls..

./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2

hits one of these two bugs in a few minutes runtime.

Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
Mix them together though, and something goes awry.

	Dave
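To turn a probabilistic reproducer like this into something `git bisect` can score, one option is a bounded-time wrapper that greps the kernel log afterwards. A hypothetical sketch (the trinity path, the runtime, and the scratch-filesystem setup are all assumptions about the local test rig):

```shell
TRINITY=./trinity    # assumed: a trinity binary in the current directory
RUNTIME=600          # seconds to give each kernel before calling it good

if [ -x "$TRINITY" ]; then
    # Same invocation Dave narrowed things down to:
    "$TRINITY" -C64 -q -l off -a64 --enable-fds=testfile \
        -c fsync -c fsetxattr -c lremovexattr -c pwritev2 &
    pid=$!
    sleep "$RUNTIME"
    kill "$pid" 2>/dev/null

    # Score this kernel for the bisect step:
    if dmesg | grep -q 'list_add corruption'; then
        echo "bad"
    else
        echo "good"
    fi
else
    echo "trinity not found; skipping"
fi
```

The obvious caveat with any fuzzer-driven bisect is false "good" verdicts when the bug simply didn't fire within the time budget, so the runtime has to be comfortably longer than the typical time-to-trigger.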
* Re: btrfs bio linked list corruption.
From: Chris Mason @ 2016-10-13 21:18 UTC (permalink / raw)
To: Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel

On 10/13/2016 02:16 PM, Dave Jones wrote:
> On Wed, Oct 12, 2016 at 10:42:46AM -0400, Chris Mason wrote:
> > On 10/12/2016 10:40 AM, Dave Jones wrote:
> > > On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
> > > > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
> > > > >
> > > > > On 10/11/2016 10:45 AM, Dave Jones wrote:
> > > > > > This is from Linus' current tree, with Al's iovec fixups on top.
> > > > > >
> > > > > > ------------[ cut here ]------------
> > > > > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > > > > > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> > > > > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > > > > > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> > > > > > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
> > > >
> > > > I hit this again overnight, it's the same trace, the only difference
> > > > being slightly different addresses in the list pointers:
> > > >
> > > > [42572.777196] list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc90000647cd8. (prev=ffff880503a0ba00).
> > > >
> > > > I'm actually a little surprised that ->next was the same across two
> > > > reboots on two different kernel builds. That might be a sign this is
> > > > more repeatable than I'd thought, even if it does take hours of runtime
> > > > right now to trigger it. I'll try and narrow the scope of what trinity
> > > > is doing to see if I can make it happen faster.
> > >
> > > .. and of course the first thing that happens is a completely different
> > > btrfs trace..
> > >
> > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> > > ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
> > > ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
> > > 0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a
> >
> > This isn't even IO. Uuughhhh. We're going to need a fast enough test
> > that we can bisect.
>
> Progress...
> I've found that this combination of syscalls..
>
> ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2
>
> hits one of these two bugs in a few minutes runtime.
>
> Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
> Mix them together though, and something goes awry.

Hasn't triggered here yet. I'll leave it running though.

-chris
* Re: btrfs bio linked list corruption.
From: Dave Jones @ 2016-10-13 21:56 UTC (permalink / raw)
To: Chris Mason; +Cc: Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel

On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote:
> > > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> > > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> > > > ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
> > > > ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
> > > > 0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a
> > >
> > > This isn't even IO. Uuughhhh. We're going to need a fast enough test
> > > that we can bisect.
> >
> > Progress...
> > I've found that this combination of syscalls..
> >
> > ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2
> >
> > hits one of these two bugs in a few minutes runtime.
> >
> > Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
> > Mix them together though, and something goes awry.
>
> Hasn't triggered here yet. I'll leave it running though.

With that combo of params I triggered it 3-4 times in a row within minutes.. Then as soon as I posted, it stopped being so easy to repro. There's some other variable I haven't figured out yet (maybe the random way that files get opened in fds/testfiles.c), but it does seem to point at the xattr changes. I'll poke at it some more tomorrow.

	Dave
* Re: btrfs bio linked list corruption.
From: Dave Jones @ 2016-10-16 0:42 UTC (permalink / raw)
To: Chris Mason; +Cc: Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel

On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote:
> > > > .. and of course the first thing that happens is a completely different
> > > > btrfs trace..
> > > >
> > > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> > > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> > > > ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
> > > > ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
> > > > 0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a
> > >
> > > This isn't even IO. Uuughhhh. We're going to need a fast enough test
> > > that we can bisect.
> >
> > Progress...
> > I've found that this combination of syscalls..
> >
> > ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2
> >
> > hits one of these two bugs in a few minutes runtime.
> >
> > Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
> > Mix them together though, and something goes awry.
>
> Hasn't triggered here yet. I'll leave it running though.

The hits keep coming..

BUG: Bad page state in process kworker/u8:12 pfn:4988fa
page:ffffea0012623e80 count:0 mapcount:0 mapping:ffff8804450456e0 index:0x9
flags: 0x400000000000000c(referenced|uptodate)
page dumped because: non-NULL mapping
CPU: 2 PID: 1388 Comm: kworker/u8:12 Not tainted 4.8.0-think+ #18
Workqueue: writeback wb_workfn (flush-btrfs-1)
 ffffc90000aef7e8 ffffffff81320e7c ffffea0012623e80 ffffffff819fe6ec
 ffffc90000aef810 ffffffff81159b3f 0000000000000000 ffffea0012623e80
 400000000000000c ffffc90000aef820 ffffffff81159bfa ffffc90000aef868
Call Trace:
 [<ffffffff81320e7c>] dump_stack+0x4f/0x73
 [<ffffffff81159b3f>] bad_page+0xbf/0x120
 [<ffffffff81159bfa>] free_pages_check_bad+0x5a/0x70
 [<ffffffff8115c0fb>] free_hot_cold_page+0x20b/0x270
 [<ffffffff8115c41b>] free_hot_cold_page_list+0x2b/0x50
 [<ffffffff81165062>] release_pages+0x2d2/0x380
 [<ffffffff811665d2>] __pagevec_release+0x22/0x30
 [<ffffffffa009f810>] extent_write_cache_pages.isra.48.constprop.63+0x350/0x430 [btrfs]
 [<ffffffff8133f487>] ? debug_smp_processor_id+0x17/0x20
 [<ffffffff810c6999>] ? get_lock_stats+0x19/0x50
 [<ffffffffa009fce8>] extent_writepages+0x58/0x80 [btrfs]
 [<ffffffffa007f150>] ? btrfs_releasepage+0x40/0x40 [btrfs]
 [<ffffffffa007c0d3>] btrfs_writepages+0x23/0x30 [btrfs]
 [<ffffffff8116370c>] do_writepages+0x1c/0x30
 [<ffffffff81202d63>] __writeback_single_inode+0x33/0x180
 [<ffffffff8120357b>] writeback_sb_inodes+0x2cb/0x5d0
 [<ffffffff8120390d>] __writeback_inodes_wb+0x8d/0xc0
 [<ffffffff81203c03>] wb_writeback+0x203/0x210
 [<ffffffff81204197>] wb_workfn+0xe7/0x2a0
 [<ffffffff810c8b7f>] ? __lock_acquire.isra.32+0x1cf/0x8c0
 [<ffffffff8109458a>] process_one_work+0x1da/0x4b0
 [<ffffffff8109452a>] ? process_one_work+0x17a/0x4b0
 [<ffffffff810948a9>] worker_thread+0x49/0x490
 [<ffffffff81094860>] ? process_one_work+0x4b0/0x4b0
 [<ffffffff81094860>] ? process_one_work+0x4b0/0x4b0
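The "non-NULL mapping" line in this report is the page allocator's free-path sanity check: a page being returned to the free lists must no longer claim membership in an address_space, so a set ->mapping means the page was freed while writeback still thought it owned it. A toy model of that check (simplified field names, not the kernel's struct page):

```c
#include <stddef.h>

/* Toy model of the struct page fields the free path sanity-checks. */
struct page {
	int refcount;	/* stand-in for _refcount */
	int mapcount;	/* stand-in for _mapcount */
	void *mapping;	/* owning address_space; must be NULL by free time */
};

/* Mirrors the idea behind free_pages_check_bad(): return the reason a
 * page reaching the free lists is bad, or NULL if it is clean. */
static const char *page_bad_reason(const struct page *page)
{
	if (page->mapping != NULL)
		return "non-NULL mapping";
	if (page->refcount != 0)
		return "nonzero refcount";
	if (page->mapcount != 0)
		return "nonzero mapcount";
	return NULL;
}
```

Like the list debug warning, this fires at the detection point rather than the corruption point: the trace shows who freed the page, not who left the stale mapping behind.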
* Re: btrfs bio linked list corruption.
From: Chris Mason @ 2016-10-18 1:07 UTC (permalink / raw)
To: Dave Jones; +Cc: Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel

On Sat, Oct 15, 2016 at 08:42:40PM -0400, Dave Jones wrote:
>On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote:
> > > > > .. and of course the first thing that happens is a completely different
> > > > > btrfs trace..
> > > > >
> > > > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> > > > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> > > > > ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
> > > > > ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
> > > > > 0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a
> > > >
> > > > This isn't even IO. Uuughhhh. We're going to need a fast enough test
> > > > that we can bisect.
> > >
> > > Progress...
> > > I've found that this combination of syscalls..
> > >
> > > ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2
> > >
> > > hits one of these two bugs in a few minutes runtime.
> > >
> > > Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
> > > Mix them together though, and something goes awry.
> >
> > Hasn't triggered here yet. I'll leave it running though.
>
>The hits keep coming..
>
>BUG: Bad page state in process kworker/u8:12 pfn:4988fa
>page:ffffea0012623e80 count:0 mapcount:0 mapping:ffff8804450456e0 index:0x9

Hmpf, I've had this running since Friday without failing. Can you send me your .config please?

-chris
* Re: bio linked list corruption. 2016-10-11 14:45 btrfs bio linked list corruption Dave Jones 2016-10-11 15:11 ` Al Viro 2016-10-11 15:54 ` Chris Mason @ 2016-10-18 22:42 ` Dave Jones 2016-10-18 23:12 ` Jens Axboe 2 siblings, 1 reply; 118+ messages in thread From: Dave Jones @ 2016-10-18 22:42 UTC (permalink / raw) To: Al Viro, Chris Mason, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, axboe, Linus Torvalds On Tue, Oct 11, 2016 at 10:45:07AM -0400, Dave Jones wrote: > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0 > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80). > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13 > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000 > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880 > ffff880503878b80 ffffe8ffff806648 ffffe8ffffc06600 ffff880502808008 > Call Trace: > [<ffffffff8d32007c>] dump_stack+0x4f/0x73 > [<ffffffff8d07a6c1>] __warn+0xc1/0xe0 > [<ffffffff8d07a73a>] warn_slowpath_fmt+0x5a/0x80 > [<ffffffff8d33e689>] __list_add+0x89/0xb0 > [<ffffffff8d30a1c8>] blk_sq_make_request+0x2f8/0x350 > [<ffffffff8d2fd9cc>] ? generic_make_request+0xec/0x240 > [<ffffffff8d2fd9d9>] generic_make_request+0xf9/0x240 > [<ffffffff8d2fdb98>] submit_bio+0x78/0x150 > [<ffffffff8d349c05>] ? __percpu_counter_add+0x85/0xb0 > [<ffffffffc03627de>] btrfs_map_bio+0x19e/0x330 [btrfs] > [<ffffffffc03289ca>] btree_submit_bio_hook+0xfa/0x110 [btrfs] > [<ffffffffc034ff15>] submit_one_bio+0x65/0xa0 [btrfs] > [<ffffffffc0358cb0>] read_extent_buffer_pages+0x2f0/0x3d0 [btrfs] > [<ffffffffc0327020>] ? free_root_pointers+0x60/0x60 [btrfs] > [<ffffffffc03283c8>] btree_read_extent_buffer_pages.constprop.55+0xa8/0x110 [btrfs] > [<ffffffffc0328bcd>] read_tree_block+0x2d/0x50 [btrfs] > [<ffffffffc03080a4>] read_block_for_search.isra.33+0x134/0x330 [btrfs] > [<ffffffff8d7c2d6c>] ? 
_raw_write_unlock+0x2c/0x50 > [<ffffffffc0302fec>] ? unlock_up+0x16c/0x1a0 [btrfs] > [<ffffffffc030a3d0>] btrfs_search_slot+0x450/0xa40 [btrfs] > [<ffffffffc0324983>] btrfs_del_csums+0xe3/0x2e0 [btrfs] > [<ffffffffc03134fd>] __btrfs_free_extent.isra.82+0x32d/0xc90 [btrfs] > [<ffffffffc03178b3>] __btrfs_run_delayed_refs+0x4d3/0x1010 [btrfs] > [<ffffffff8d33e5d7>] ? debug_smp_processor_id+0x17/0x20 > [<ffffffff8d0c6109>] ? get_lock_stats+0x19/0x50 > [<ffffffffc031b32c>] btrfs_run_delayed_refs+0x9c/0x2d0 [btrfs] > [<ffffffffc033d628>] btrfs_truncate_inode_items+0x888/0xda0 [btrfs] > [<ffffffffc033dc25>] btrfs_truncate+0xe5/0x2b0 [btrfs] > [<ffffffffc033e569>] btrfs_setattr+0x249/0x360 [btrfs] > [<ffffffff8d1f4092>] notify_change+0x252/0x440 > [<ffffffff8d1d164e>] do_truncate+0x6e/0xc0 > [<ffffffff8d1d1a4c>] do_sys_ftruncate.constprop.19+0x10c/0x170 > [<ffffffff8d33e5f3>] ? __this_cpu_preempt_check+0x13/0x20 > [<ffffffff8d1d1ad9>] SyS_ftruncate+0x9/0x10 > [<ffffffff8d00259c>] do_syscall_64+0x5c/0x170 > [<ffffffff8d7c2f8b>] entry_SYSCALL64_slow_path+0x25/0x25 So Chris had me do a run on ext4 just for giggles. It took a while, but eventually this fell out... WARNING: CPU: 3 PID: 21324 at lib/list_debug.c:33 __list_add+0x89/0xb0 list_add corruption. prev->next should be next (ffffe8ffffc05648), but was ffffc9000028bcd8. (prev=ffff880503a145c0). CPU: 3 PID: 21324 Comm: modprobe Not tainted 4.9.0-rc1-think+ #1 ffffc90000a6b7b8 ffffffff81320e3c ffffc90000a6b808 0000000000000000 ffffc90000a6b7f8 ffffffff8107a711 0000002100000246 ffff8805039f1740 ffff880503a145c0 ffffe8ffffc05648 ffffe8ffffa05600 ffff880502c39548 Call Trace: [<ffffffff81320e3c>] dump_stack+0x4f/0x73 [<ffffffff8107a711>] __warn+0xc1/0xe0 [<ffffffff8107a78a>] warn_slowpath_fmt+0x5a/0x80 [<ffffffff8133f499>] __list_add+0x89/0xb0 [<ffffffff8130af88>] blk_sq_make_request+0x2f8/0x350 [<ffffffff812fe6dc>] ? 
generic_make_request+0xec/0x240 [<ffffffff812fe6e9>] generic_make_request+0xf9/0x240 [<ffffffff812fe8a8>] submit_bio+0x78/0x150 [<ffffffff8120bde6>] ? __find_get_block+0x126/0x130 [<ffffffff8120cbff>] submit_bh_wbc+0x16f/0x1e0 [<ffffffff8120a400>] ? __end_buffer_read_notouch+0x20/0x20 [<ffffffff8120d958>] ll_rw_block+0xa8/0xb0 [<ffffffff8120da0f>] __breadahead+0x3f/0x70 [<ffffffff81264ffc>] __ext4_get_inode_loc+0x37c/0x3d0 [<ffffffff8126806d>] ext4_iget+0x8d/0xb90 [<ffffffff811f0759>] ? d_alloc_parallel+0x329/0x700 [<ffffffff81268b9a>] ext4_iget_normal+0x2a/0x30 [<ffffffff81273cd6>] ext4_lookup+0x136/0x250 [<ffffffff811e118d>] lookup_slow+0x12d/0x220 [<ffffffff811e3897>] walk_component+0x1e7/0x310 [<ffffffff811e33f8>] ? path_init+0x4d8/0x520 [<ffffffff811e4022>] path_lookupat+0x62/0x120 [<ffffffff811e4f22>] ? getname_flags+0x32/0x180 [<ffffffff811e5278>] filename_lookup+0xa8/0x130 [<ffffffff81352526>] ? strncpy_from_user+0x46/0x170 [<ffffffff811e4f3e>] ? getname_flags+0x4e/0x180 [<ffffffff811e53d1>] user_path_at_empty+0x31/0x40 [<ffffffff811d9df1>] vfs_fstatat+0x61/0xc0 [<ffffffff810c8b9f>] ? __lock_acquire.isra.32+0x1cf/0x8c0 [<ffffffff811da30e>] SYSC_newstat+0x2e/0x60 [<ffffffff8133f403>] ? __this_cpu_preempt_check+0x13/0x20 [<ffffffff811da499>] SyS_newstat+0x9/0x10 [<ffffffff8100259c>] do_syscall_64+0x5c/0x170 [<ffffffff817c27cb>] entry_SYSCALL64_slow_path+0x25/0x25 So this one isn't a btrfs specific problem as I first thought. This sometimes reproduces within minutes, sometimes hours, which makes it a pain to bisect. It only started showing up this merge window though. Dave ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-18 22:42 ` Dave Jones @ 2016-10-18 23:12 ` Jens Axboe 2016-10-18 23:31 ` Chris Mason 0 siblings, 1 reply; 118+ messages in thread From: Jens Axboe @ 2016-10-18 23:12 UTC (permalink / raw) To: Dave Jones, Al Viro, Chris Mason, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Linus Torvalds On 10/18/2016 04:42 PM, Dave Jones wrote: > On Tue, Oct 11, 2016 at 10:45:07AM -0400, Dave Jones wrote: > > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0 > > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80). > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13 > > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000 > > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880 > > ffff880503878b80 ffffe8ffff806648 ffffe8ffffc06600 ffff880502808008 > > Call Trace: > > [<ffffffff8d32007c>] dump_stack+0x4f/0x73 > > [<ffffffff8d07a6c1>] __warn+0xc1/0xe0 > > [<ffffffff8d07a73a>] warn_slowpath_fmt+0x5a/0x80 > > [<ffffffff8d33e689>] __list_add+0x89/0xb0 > > [<ffffffff8d30a1c8>] blk_sq_make_request+0x2f8/0x350 > > [<ffffffff8d2fd9cc>] ? generic_make_request+0xec/0x240 > > [<ffffffff8d2fd9d9>] generic_make_request+0xf9/0x240 > > [<ffffffff8d2fdb98>] submit_bio+0x78/0x150 > > [<ffffffff8d349c05>] ? __percpu_counter_add+0x85/0xb0 > > [<ffffffffc03627de>] btrfs_map_bio+0x19e/0x330 [btrfs] > > [<ffffffffc03289ca>] btree_submit_bio_hook+0xfa/0x110 [btrfs] > > [<ffffffffc034ff15>] submit_one_bio+0x65/0xa0 [btrfs] > > [<ffffffffc0358cb0>] read_extent_buffer_pages+0x2f0/0x3d0 [btrfs] > > [<ffffffffc0327020>] ? free_root_pointers+0x60/0x60 [btrfs] > > [<ffffffffc03283c8>] btree_read_extent_buffer_pages.constprop.55+0xa8/0x110 [btrfs] > > [<ffffffffc0328bcd>] read_tree_block+0x2d/0x50 [btrfs] > > [<ffffffffc03080a4>] read_block_for_search.isra.33+0x134/0x330 [btrfs] > > [<ffffffff8d7c2d6c>] ? 
_raw_write_unlock+0x2c/0x50 > > [<ffffffffc0302fec>] ? unlock_up+0x16c/0x1a0 [btrfs] > > [<ffffffffc030a3d0>] btrfs_search_slot+0x450/0xa40 [btrfs] > > [<ffffffffc0324983>] btrfs_del_csums+0xe3/0x2e0 [btrfs] > > [<ffffffffc03134fd>] __btrfs_free_extent.isra.82+0x32d/0xc90 [btrfs] > > [<ffffffffc03178b3>] __btrfs_run_delayed_refs+0x4d3/0x1010 [btrfs] > > [<ffffffff8d33e5d7>] ? debug_smp_processor_id+0x17/0x20 > > [<ffffffff8d0c6109>] ? get_lock_stats+0x19/0x50 > > [<ffffffffc031b32c>] btrfs_run_delayed_refs+0x9c/0x2d0 [btrfs] > > [<ffffffffc033d628>] btrfs_truncate_inode_items+0x888/0xda0 [btrfs] > > [<ffffffffc033dc25>] btrfs_truncate+0xe5/0x2b0 [btrfs] > > [<ffffffffc033e569>] btrfs_setattr+0x249/0x360 [btrfs] > > [<ffffffff8d1f4092>] notify_change+0x252/0x440 > > [<ffffffff8d1d164e>] do_truncate+0x6e/0xc0 > > [<ffffffff8d1d1a4c>] do_sys_ftruncate.constprop.19+0x10c/0x170 > > [<ffffffff8d33e5f3>] ? __this_cpu_preempt_check+0x13/0x20 > > [<ffffffff8d1d1ad9>] SyS_ftruncate+0x9/0x10 > > [<ffffffff8d00259c>] do_syscall_64+0x5c/0x170 > > [<ffffffff8d7c2f8b>] entry_SYSCALL64_slow_path+0x25/0x25 > > So Chris had me do a run on ext4 just for giggles. It took a while, but > eventually this fell out... > > > WARNING: CPU: 3 PID: 21324 at lib/list_debug.c:33 __list_add+0x89/0xb0 > list_add corruption. prev->next should be next (ffffe8ffffc05648), but was ffffc9000028bcd8. (prev=ffff880503a145c0). > CPU: 3 PID: 21324 Comm: modprobe Not tainted 4.9.0-rc1-think+ #1 > ffffc90000a6b7b8 ffffffff81320e3c ffffc90000a6b808 0000000000000000 > ffffc90000a6b7f8 ffffffff8107a711 0000002100000246 ffff8805039f1740 > ffff880503a145c0 ffffe8ffffc05648 ffffe8ffffa05600 ffff880502c39548 > Call Trace: > [<ffffffff81320e3c>] dump_stack+0x4f/0x73 > [<ffffffff8107a711>] __warn+0xc1/0xe0 > [<ffffffff8107a78a>] warn_slowpath_fmt+0x5a/0x80 > [<ffffffff8133f499>] __list_add+0x89/0xb0 > [<ffffffff8130af88>] blk_sq_make_request+0x2f8/0x350 > [<ffffffff812fe6dc>] ? 
generic_make_request+0xec/0x240 > [<ffffffff812fe6e9>] generic_make_request+0xf9/0x240 > [<ffffffff812fe8a8>] submit_bio+0x78/0x150 > [<ffffffff8120bde6>] ? __find_get_block+0x126/0x130 > [<ffffffff8120cbff>] submit_bh_wbc+0x16f/0x1e0 > [<ffffffff8120a400>] ? __end_buffer_read_notouch+0x20/0x20 > [<ffffffff8120d958>] ll_rw_block+0xa8/0xb0 > [<ffffffff8120da0f>] __breadahead+0x3f/0x70 > [<ffffffff81264ffc>] __ext4_get_inode_loc+0x37c/0x3d0 > [<ffffffff8126806d>] ext4_iget+0x8d/0xb90 > [<ffffffff811f0759>] ? d_alloc_parallel+0x329/0x700 > [<ffffffff81268b9a>] ext4_iget_normal+0x2a/0x30 > [<ffffffff81273cd6>] ext4_lookup+0x136/0x250 > [<ffffffff811e118d>] lookup_slow+0x12d/0x220 > [<ffffffff811e3897>] walk_component+0x1e7/0x310 > [<ffffffff811e33f8>] ? path_init+0x4d8/0x520 > [<ffffffff811e4022>] path_lookupat+0x62/0x120 > [<ffffffff811e4f22>] ? getname_flags+0x32/0x180 > [<ffffffff811e5278>] filename_lookup+0xa8/0x130 > [<ffffffff81352526>] ? strncpy_from_user+0x46/0x170 > [<ffffffff811e4f3e>] ? getname_flags+0x4e/0x180 > [<ffffffff811e53d1>] user_path_at_empty+0x31/0x40 > [<ffffffff811d9df1>] vfs_fstatat+0x61/0xc0 > [<ffffffff810c8b9f>] ? __lock_acquire.isra.32+0x1cf/0x8c0 > [<ffffffff811da30e>] SYSC_newstat+0x2e/0x60 > [<ffffffff8133f403>] ? __this_cpu_preempt_check+0x13/0x20 > [<ffffffff811da499>] SyS_newstat+0x9/0x10 > [<ffffffff8100259c>] do_syscall_64+0x5c/0x170 > [<ffffffff817c27cb>] entry_SYSCALL64_slow_path+0x25/0x25 > > So this one isn't a btrfs specific problem as I first thought. > > This sometimes reproduces within minutes, sometimes hours, which makes > it a pain to bisect. It only started showing up this merge window though. Chinner reported the same thing on XFS, I'll look into it asap. -- Jens Axboe ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-18 23:12 ` Jens Axboe @ 2016-10-18 23:31 ` Chris Mason 2016-10-18 23:36 ` Jens Axboe 2016-10-18 23:39 ` Linus Torvalds 0 siblings, 2 replies; 118+ messages in thread From: Chris Mason @ 2016-10-18 23:31 UTC (permalink / raw) To: Jens Axboe Cc: Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Linus Torvalds On Tue, Oct 18, 2016 at 05:12:41PM -0600, Jens Axboe wrote: >On 10/18/2016 04:42 PM, Dave Jones wrote: >>So Chris had me do a run on ext4 just for giggles. It took a while, but >>eventually this fell out... >> >> >>WARNING: CPU: 3 PID: 21324 at lib/list_debug.c:33 __list_add+0x89/0xb0 >>list_add corruption. prev->next should be next (ffffe8ffffc05648), but was ffffc9000028bcd8. (prev=ffff880503a145c0). >>CPU: 3 PID: 21324 Comm: modprobe Not tainted 4.9.0-rc1-think+ #1 >> ffffc90000a6b7b8 ffffffff81320e3c ffffc90000a6b808 0000000000000000 >> ffffc90000a6b7f8 ffffffff8107a711 0000002100000246 ffff8805039f1740 >> ffff880503a145c0 ffffe8ffffc05648 ffffe8ffffa05600 ffff880502c39548 >>Call Trace: >> [<ffffffff81320e3c>] dump_stack+0x4f/0x73 >> [<ffffffff8107a711>] __warn+0xc1/0xe0 >> [<ffffffff8107a78a>] warn_slowpath_fmt+0x5a/0x80 >> [<ffffffff8133f499>] __list_add+0x89/0xb0 >> [<ffffffff8130af88>] blk_sq_make_request+0x2f8/0x350 >> [<ffffffff812fe6dc>] ? generic_make_request+0xec/0x240 >> [<ffffffff812fe6e9>] generic_make_request+0xf9/0x240 >> [<ffffffff812fe8a8>] submit_bio+0x78/0x150 >> [<ffffffff8120bde6>] ? __find_get_block+0x126/0x130 >> [<ffffffff8120cbff>] submit_bh_wbc+0x16f/0x1e0 >> [<ffffffff8120a400>] ? __end_buffer_read_notouch+0x20/0x20 >> [<ffffffff8120d958>] ll_rw_block+0xa8/0xb0 >> [<ffffffff8120da0f>] __breadahead+0x3f/0x70 >> [<ffffffff81264ffc>] __ext4_get_inode_loc+0x37c/0x3d0 >> [<ffffffff8126806d>] ext4_iget+0x8d/0xb90 >> [<ffffffff811f0759>] ? 
d_alloc_parallel+0x329/0x700 >> [<ffffffff81268b9a>] ext4_iget_normal+0x2a/0x30 >> [<ffffffff81273cd6>] ext4_lookup+0x136/0x250 >> [<ffffffff811e118d>] lookup_slow+0x12d/0x220 >> [<ffffffff811e3897>] walk_component+0x1e7/0x310 >> [<ffffffff811e33f8>] ? path_init+0x4d8/0x520 >> [<ffffffff811e4022>] path_lookupat+0x62/0x120 >> [<ffffffff811e4f22>] ? getname_flags+0x32/0x180 >> [<ffffffff811e5278>] filename_lookup+0xa8/0x130 >> [<ffffffff81352526>] ? strncpy_from_user+0x46/0x170 >> [<ffffffff811e4f3e>] ? getname_flags+0x4e/0x180 >> [<ffffffff811e53d1>] user_path_at_empty+0x31/0x40 >> [<ffffffff811d9df1>] vfs_fstatat+0x61/0xc0 >> [<ffffffff810c8b9f>] ? __lock_acquire.isra.32+0x1cf/0x8c0 >> [<ffffffff811da30e>] SYSC_newstat+0x2e/0x60 >> [<ffffffff8133f403>] ? __this_cpu_preempt_check+0x13/0x20 >> [<ffffffff811da499>] SyS_newstat+0x9/0x10 >> [<ffffffff8100259c>] do_syscall_64+0x5c/0x170 >> [<ffffffff817c27cb>] entry_SYSCALL64_slow_path+0x25/0x25 >> >>So this one isn't a btrfs specific problem as I first thought. >> >>This sometimes reproduces within minutes, sometimes hours, which makes >>it a pain to bisect. It only started showing up this merge window though. > >Chinner reported the same thing on XFS, I'll look into it asap. Jens, not sure if you saw the whole thread. This has triggered bad page state errors, and also corrupted a btrfs list. It hurts me to say, but it might not actually be your fault. -chris ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-18 23:31 ` Chris Mason @ 2016-10-18 23:36 ` Jens Axboe 2016-10-18 23:39 ` Linus Torvalds 1 sibling, 0 replies; 118+ messages in thread From: Jens Axboe @ 2016-10-18 23:36 UTC (permalink / raw) To: Chris Mason, Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Linus Torvalds On 10/18/2016 05:31 PM, Chris Mason wrote: > On Tue, Oct 18, 2016 at 05:12:41PM -0600, Jens Axboe wrote: >> On 10/18/2016 04:42 PM, Dave Jones wrote: >>> So Chris had me do a run on ext4 just for giggles. It took a while, but >>> eventually this fell out... >>> >>> >>> WARNING: CPU: 3 PID: 21324 at lib/list_debug.c:33 __list_add+0x89/0xb0 >>> list_add corruption. prev->next should be next (ffffe8ffffc05648), >>> but was ffffc9000028bcd8. (prev=ffff880503a145c0). >>> CPU: 3 PID: 21324 Comm: modprobe Not tainted 4.9.0-rc1-think+ #1 >>> ffffc90000a6b7b8 ffffffff81320e3c ffffc90000a6b808 0000000000000000 >>> ffffc90000a6b7f8 ffffffff8107a711 0000002100000246 ffff8805039f1740 >>> ffff880503a145c0 ffffe8ffffc05648 ffffe8ffffa05600 ffff880502c39548 >>> Call Trace: >>> [<ffffffff81320e3c>] dump_stack+0x4f/0x73 >>> [<ffffffff8107a711>] __warn+0xc1/0xe0 >>> [<ffffffff8107a78a>] warn_slowpath_fmt+0x5a/0x80 >>> [<ffffffff8133f499>] __list_add+0x89/0xb0 >>> [<ffffffff8130af88>] blk_sq_make_request+0x2f8/0x350 >>> [<ffffffff812fe6dc>] ? generic_make_request+0xec/0x240 >>> [<ffffffff812fe6e9>] generic_make_request+0xf9/0x240 >>> [<ffffffff812fe8a8>] submit_bio+0x78/0x150 >>> [<ffffffff8120bde6>] ? __find_get_block+0x126/0x130 >>> [<ffffffff8120cbff>] submit_bh_wbc+0x16f/0x1e0 >>> [<ffffffff8120a400>] ? __end_buffer_read_notouch+0x20/0x20 >>> [<ffffffff8120d958>] ll_rw_block+0xa8/0xb0 >>> [<ffffffff8120da0f>] __breadahead+0x3f/0x70 >>> [<ffffffff81264ffc>] __ext4_get_inode_loc+0x37c/0x3d0 >>> [<ffffffff8126806d>] ext4_iget+0x8d/0xb90 >>> [<ffffffff811f0759>] ? 
d_alloc_parallel+0x329/0x700 >>> [<ffffffff81268b9a>] ext4_iget_normal+0x2a/0x30 >>> [<ffffffff81273cd6>] ext4_lookup+0x136/0x250 >>> [<ffffffff811e118d>] lookup_slow+0x12d/0x220 >>> [<ffffffff811e3897>] walk_component+0x1e7/0x310 >>> [<ffffffff811e33f8>] ? path_init+0x4d8/0x520 >>> [<ffffffff811e4022>] path_lookupat+0x62/0x120 >>> [<ffffffff811e4f22>] ? getname_flags+0x32/0x180 >>> [<ffffffff811e5278>] filename_lookup+0xa8/0x130 >>> [<ffffffff81352526>] ? strncpy_from_user+0x46/0x170 >>> [<ffffffff811e4f3e>] ? getname_flags+0x4e/0x180 >>> [<ffffffff811e53d1>] user_path_at_empty+0x31/0x40 >>> [<ffffffff811d9df1>] vfs_fstatat+0x61/0xc0 >>> [<ffffffff810c8b9f>] ? __lock_acquire.isra.32+0x1cf/0x8c0 >>> [<ffffffff811da30e>] SYSC_newstat+0x2e/0x60 >>> [<ffffffff8133f403>] ? __this_cpu_preempt_check+0x13/0x20 >>> [<ffffffff811da499>] SyS_newstat+0x9/0x10 >>> [<ffffffff8100259c>] do_syscall_64+0x5c/0x170 >>> [<ffffffff817c27cb>] entry_SYSCALL64_slow_path+0x25/0x25 >>> >>> So this one isn't a btrfs specific problem as I first thought. >>> >>> This sometimes reproduces within minutes, sometimes hours, which makes >>> it a pain to bisect. It only started showing up this merge window >>> though. >> >> Chinner reported the same thing on XFS, I'll look into it asap. > > Jens, not sure if you saw the whole thread. This has triggered bad page > state errors, and also corrupted a btrfs list. It hurts me to say, but > it might not actually be your fault. Just went through the block changes again, pretty basic this round and nothing that touches plugging at all. General memory corruption could certainly explain it... -- Jens Axboe ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-18 23:31 ` Chris Mason 2016-10-18 23:36 ` Jens Axboe @ 2016-10-18 23:39 ` Linus Torvalds 2016-10-18 23:42 ` Chris Mason 1 sibling, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-10-18 23:39 UTC (permalink / raw) To: Chris Mason, Jens Axboe, Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Linus Torvalds On Tue, Oct 18, 2016 at 4:31 PM, Chris Mason <clm@fb.com> wrote: > > Jens, not sure if you saw the whole thread. This has triggered bad page > state errors, and also corrupted a btrfs list. It hurts me to say, but it > might not actually be your fault. Where is that thread, and what is the "this" that triggers problems? Looking at the "->mq_list" users, I'm not seeing any changes there in the last year or so. So I don't think it's the list itself. Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-18 23:39 ` Linus Torvalds @ 2016-10-18 23:42 ` Chris Mason 2016-10-19 0:10 ` Linus Torvalds 2016-10-19 17:09 ` Philipp Hahn 0 siblings, 2 replies; 118+ messages in thread From: Chris Mason @ 2016-10-18 23:42 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Tue, Oct 18, 2016 at 04:39:22PM -0700, Linus Torvalds wrote: >On Tue, Oct 18, 2016 at 4:31 PM, Chris Mason <clm@fb.com> wrote: >> >> Jens, not sure if you saw the whole thread. This has triggered bad page >> state errors, and also corrupted a btrfs list. It hurts me to say, but it >> might not actually be your fault. > >Where is that thread, and what is the "this" that triggers problems? > >Looking at the "->mq_list" users, I'm not seeing any changes there in >the last year or so. So I don't think it's the list itself. Seems to be the whole thing: http://www.gossamer-threads.com/lists/linux/kernel/2545792 My guess is xattr, but I don't have a good reason for that. -chris ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-18 23:42 ` Chris Mason @ 2016-10-19 0:10 ` Linus Torvalds 2016-10-19 0:19 ` Chris Mason ` (2 more replies) 2016-10-19 17:09 ` Philipp Hahn 1 sibling, 3 replies; 118+ messages in thread From: Linus Torvalds @ 2016-10-19 0:10 UTC (permalink / raw) To: Chris Mason, Linus Torvalds, Jens Axboe, Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Andrew Lutomirski

On Tue, Oct 18, 2016 at 4:42 PM, Chris Mason <clm@fb.com> wrote:
>
> Seems to be the whole thing:

Ahh. On lkml, so I do have it in my mailbox, but Dave changed the subject line when he tested on ext4 rather than btrfs..

Anyway, the corrupted address is somewhat interesting. As Dave Jones said, he saw

  list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
  list_add corruption. prev->next should be next (ffffe8ffffc05648), but was ffffc9000028bcd8. (prev=ffff880503a145c0).

and Dave Chinner reports

  list_add corruption. prev->next should be next (ffffe8ffffc02808), but was ffffc90005f6bda8. (prev=ffff88013363bb80).

and it's worth noting that the "but was" is a remarkably consistent vmalloc address (the ffffc9000.. pattern gives it away). In fact, it's identical across two boots for DaveJ in the low 14 bits, and fairly high up in those low 14 bits (0x3cd8).

DaveC has a different address, but it's also in the vmalloc space, and also looks like it is fairly high up in 14 bits (0x3da8). So in both cases it's almost certainly a stack address with a fairly empty stack. The differences are presumably due to different kernel configurations and/or just different filesystems calling the same function that does the same bad thing but now at different depths in the stack.

Adding Andy to the cc, because this *might* be triggered by the vmalloc stack code itself. Maybe the re-use of stacks showing some problem? Maybe Chris (who can't see the problem) doesn't have CONFIG_VMAP_STACK enabled? 
Andy - this is on lkml, under

  Dave Chinner: [regression, 4.9-rc1] blk-mq: list corruption in request queue
  Dave Jones: btrfs bio linked list corruption.
  Re: bio linked list corruption.

and they are definitely the same thing across three different filesystems (xfs, btrfs and ext4), and they are consistent enough that there is almost certainly a single very specific memory corrupting issue that overwrites something with a stack pointer.

Linus

^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-19 0:10 ` Linus Torvalds @ 2016-10-19 0:19 ` Chris Mason 2016-10-19 0:28 ` Linus Torvalds 2016-10-19 1:05 ` Andy Lutomirski 2 siblings, 0 replies; 118+ messages in thread From: Chris Mason @ 2016-10-19 0:19 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Andrew Lutomirski On Tue, Oct 18, 2016 at 05:10:56PM -0700, Linus Torvalds wrote: >On Tue, Oct 18, 2016 at 4:42 PM, Chris Mason <clm@fb.com> wrote: >> >> Seems to be the whole thing: > >Ahh. On lkml, so I do have it in my mailbox, but Dave changed the >subject line when he tested on ext4 rather than btrfs.. > >Anyway, the corrupted address is somewhat interesting. As Dave Jones >said, he saw > > list_add corruption. prev->next should be next (ffffe8ffff806648), >but was ffffc9000067fcd8. (prev=ffff880503878b80). > list_add corruption. prev->next should be next (ffffe8ffffc05648), >but was ffffc9000028bcd8. (prev=ffff880503a145c0). > >and Dave Chinner reports > > list_add corruption. prev->next should be next (ffffe8ffffc02808), >but was ffffc90005f6bda8. (prev=ffff88013363bb80). > >and it's worth noting that the "but was" is a remarkably consistent >vmalloc address (the ffffc9000.. pattern gives it away). In fact, it's >identical across two boots for DaveJ in the low 14 bits, and fairly >high up in those low 14 bots (0x3cd8). > >DaveC has a different address, but it's also in the vmalloc space, and >also looks like it is fairly high up in 14 bits (0x3da8). So in both >cases it's almost certainly a stack address with a fairly empty stack. >The differences are presumably due to different kernel configurations >and/or just different filesystems calling the same function that does >the same bad thing but now at different depths in the stack. > >Adding Andy to the cc, because this *might* be triggered by the >vmalloc stack code itself. Maybe the re-use of stacks showing some >problem? 
Maybe Chris (who can't see the problem) doesn't have >CONFIG_VMAP_STACK enabled? CONFIG_VMAP_STACK=y, but maybe I just need to hammer on process creation more. I'm testing in a hugely stripped down VM, so Dave might have more background stuff going on. -chris ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-19 0:10 ` Linus Torvalds 2016-10-19 0:19 ` Chris Mason @ 2016-10-19 0:28 ` Linus Torvalds 2016-10-20 22:48 ` Dave Jones 2016-10-19 1:05 ` Andy Lutomirski 2 siblings, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-10-19 0:28 UTC (permalink / raw) To: Chris Mason, Linus Torvalds, Jens Axboe, Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Andrew Lutomirski

On Tue, Oct 18, 2016 at 5:10 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Adding Andy to the cc, because this *might* be triggered by the
> vmalloc stack code itself. Maybe the re-use of stacks showing some
> problem? Maybe Chris (who can't see the problem) doesn't have
> CONFIG_VMAP_STACK enabled?

I bet it's the plug itself that is the stack address. In fact, it's probably that mq_list head pointer.

I think every single user of block plugging uses the pattern

  struct blk_plug plug;
  blk_start_plug(&plug);

and then we'll have

  INIT_LIST_HEAD(&plug->mq_list);

which initializes that mq_list head with the stack address pointing to itself. So when we see something like this:

  list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80)

and it comes from

  list_add_tail(&rq->queuelist, &plug->mq_list);

which will expand to

  __list_add(new, head->prev, head)

which in this case *should* be:

  __list_add(&rq->queuelist, plug->mq_list.prev, &plug->mq_list);

so in fact we *should* have "next" be a stack address.

So that debug message is really really odd. I would expect that "next" is the stack address (because we're adding to the tail of the list, so "next" is the list head itself), but the debug message corruption printout says that "was" is the stack address, but next isn't.

Weird. The "but was" value actually looks like the right address should look, but the actual address (which *should* be just "&plug->mq_list" and really should be on the stack) looks bogus. 
I'm now very confused.

Linus

^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-19 0:28 ` Linus Torvalds @ 2016-10-20 22:48 ` Dave Jones 0 siblings, 0 replies; 118+ messages in thread From: Dave Jones @ 2016-10-20 22:48 UTC (permalink / raw) To: Linus Torvalds Cc: Chris Mason, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Andrew Lutomirski

On Tue, Oct 18, 2016 at 05:28:44PM -0700, Linus Torvalds wrote:
 > On Tue, Oct 18, 2016 at 5:10 PM, Linus Torvalds
 > <torvalds@linux-foundation.org> wrote:
 > >
 > > Adding Andy to the cc, because this *might* be triggered by the
 > > vmalloc stack code itself. Maybe the re-use of stacks showing some
 > > problem? Maybe Chris (who can't see the problem) doesn't have
 > > CONFIG_VMAP_STACK enabled?
 >
 > I bet it's the plug itself that is the stack address. In fact, it's
 > probably that mq_list head pointer

So I've done a few experiments the last couple of days.

1, I see some kind of disaster happen with every filesystem: ext4, btrfs, xfs. For some reason I can repro it faster on btrfs (though xfs blew up pretty quickly too, but I don't know if it's the same as this list corruption bug).

2, I ran for 24 hours with VMAP_STACK turned off. I saw some _different_ btrfs problems, but I never hit that list debug corruption once.

3, I turned vmap stacks back on, and got this pretty quickly. Another new flavor of crash, but Chris recommended I post this one because it looks interesting. 
[ 3943.514961] BUG: Bad page state in process kworker/u8:14 pfn:482244 [ 3943.532400] page:ffffea0012089100 count:0 mapcount:0 mapping:ffff8804c40d6ae0 index:0x2f [ 3943.551865] flags: 0x4000000000000008(uptodate) [ 3943.561652] page dumped because: non-NULL mapping [ 3943.587698] CPU: 2 PID: 26823 Comm: kworker/u8:14 Not tainted 4.9.0-rc1-think+ #9 [ 3943.607409] Workqueue: writeback wb_workfn [ 3943.617194] (flush-btrfs-2) [ 3943.617260] ffffc90001bf7870 [ 3943.627007] ffffffff8130c93c [ 3943.627075] ffffea0012089100 [ 3943.627112] ffffffff819ff37c [ 3943.627149] ffffc90001bf7898 [ 3943.636918] ffffffff81150fef [ 3943.636985] 0000000000000000 [ 3943.637021] ffffea0012089100 [ 3943.637059] 4000000000000008 [ 3943.646965] ffffc90001bf78a8 [ 3943.647041] ffffffff811510aa [ 3943.647081] ffffc90001bf78f0 [ 3943.647126] Call Trace: [ 3943.657068] [<ffffffff8130c93c>] dump_stack+0x4f/0x73 [ 3943.666996] [<ffffffff81150fef>] bad_page+0xbf/0x120 [ 3943.676839] [<ffffffff811510aa>] free_pages_check_bad+0x5a/0x70 [ 3943.686646] [<ffffffff8115355b>] free_hot_cold_page+0x20b/0x270 [ 3943.696402] [<ffffffff8115387b>] free_hot_cold_page_list+0x2b/0x50 [ 3943.706092] [<ffffffff8115c1fd>] release_pages+0x2bd/0x350 [ 3943.715726] [<ffffffff8115d732>] __pagevec_release+0x22/0x30 [ 3943.725358] [<ffffffffa00a0d4e>] extent_write_cache_pages.isra.48.constprop.63+0x32e/0x400 [btrfs] [ 3943.735126] [<ffffffffa00a1199>] extent_writepages+0x49/0x60 [btrfs] [ 3943.744808] [<ffffffffa0081840>] ? 
btrfs_releasepage+0x40/0x40 [btrfs] [ 3943.754457] [<ffffffffa007e993>] btrfs_writepages+0x23/0x30 [btrfs] [ 3943.764085] [<ffffffff8115a91c>] do_writepages+0x1c/0x30 [ 3943.773667] [<ffffffff811f65f3>] __writeback_single_inode+0x33/0x180 [ 3943.783233] [<ffffffff811f6de8>] writeback_sb_inodes+0x2a8/0x5b0 [ 3943.792870] [<ffffffff811f733b>] wb_writeback+0xeb/0x1f0 [ 3943.802326] [<ffffffff811f7972>] wb_workfn+0xd2/0x280 [ 3943.811673] [<ffffffff810906e5>] process_one_work+0x1d5/0x490 [ 3943.821044] [<ffffffff81090685>] ? process_one_work+0x175/0x490 [ 3943.830447] [<ffffffff810909e9>] worker_thread+0x49/0x490 [ 3943.839756] [<ffffffff810909a0>] ? process_one_work+0x490/0x490 [ 3943.849074] [<ffffffff810909a0>] ? process_one_work+0x490/0x490 [ 3943.858264] [<ffffffff81095b5e>] kthread+0xee/0x110 [ 3943.867451] [<ffffffff81095a70>] ? kthread_park+0x60/0x60 [ 3943.876616] [<ffffffff81095a70>] ? kthread_park+0x60/0x60 [ 3943.885624] [<ffffffff81095a70>] ? kthread_park+0x60/0x60 [ 3943.894580] [<ffffffff81790492>] ret_from_fork+0x22/0x30 This feels like chasing a moving target, because the crash keeps changing.. I'm going to spend some time trying to at least pin down a selection of syscalls that trinity can reproduce this with quickly. Early-on, it seemed like this was xattr related, but now I'm not so sure. Once or twice, I was able to repro it within a few minutes using just writev, fsync, lsetxattr and lremovexattr. Then a day later, I found I could run for a day before seeing it. Position of the moon or something. Or it could have been entirely unrelated to the actual syscalls being run, and based just on how contended the cpu/memory was. Dave ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-19 0:10 ` Linus Torvalds 2016-10-19 0:19 ` Chris Mason 2016-10-19 0:28 ` Linus Torvalds @ 2016-10-19 1:05 ` Andy Lutomirski 2016-10-20 22:50 ` Dave Jones 2 siblings, 1 reply; 118+ messages in thread From: Andy Lutomirski @ 2016-10-19 1:05 UTC (permalink / raw) To: Linus Torvalds, Chris Mason, Jens Axboe, Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Andrew Lutomirski On 10/18/2016 05:10 PM, Linus Torvalds wrote: > On Tue, Oct 18, 2016 at 4:42 PM, Chris Mason <clm@fb.com> wrote: >> >> Seems to be the whole thing: > > Ahh. On lkml, so I do have it in my mailbox, but Dave changed the > subject line when he tested on ext4 rather than btrfs.. > > Anyway, the corrupted address is somewhat interesting. As Dave Jones > said, he saw > > list_add corruption. prev->next should be next (ffffe8ffff806648), > but was ffffc9000067fcd8. (prev=ffff880503878b80). > list_add corruption. prev->next should be next (ffffe8ffffc05648), > but was ffffc9000028bcd8. (prev=ffff880503a145c0). > > and Dave Chinner reports > > list_add corruption. prev->next should be next (ffffe8ffffc02808), > but was ffffc90005f6bda8. (prev=ffff88013363bb80). > > and it's worth noting that the "but was" is a remarkably consistent > vmalloc address (the ffffc9000.. pattern gives it away). In fact, it's > identical across two boots for DaveJ in the low 14 bits, and fairly > high up in those low 14 bots (0x3cd8). > > DaveC has a different address, but it's also in the vmalloc space, and > also looks like it is fairly high up in 14 bits (0x3da8). So in both > cases it's almost certainly a stack address with a fairly empty stack. > The differences are presumably due to different kernel configurations > and/or just different filesystems calling the same function that does > the same bad thing but now at different depths in the stack. > > Adding Andy to the cc, because this *might* be triggered by the > vmalloc stack code itself. 
Maybe the re-use of stacks showing some > problem? Maybe Chris (who can't see the problem) doesn't have > CONFIG_VMAP_STACK enabled? Wouldn't this cause the exact opposite problem? If the warning is to be believed, then prev is *not* on the stack but somehow prev->next ended up pointing to the stack. If stack reuse caused something to corrupt a value on the stack, then how would this cause a stack address to be written to a non-stack location? All I can think of is that "prev" itself is corrupted somehow. One possible debugging approach would be to change: #define NR_CACHED_STACKS 2 to #define NR_CACHED_STACKS 0 in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will force an immediate TLB flush after vfree. Also, CONFIG_DEBUG_VIRTUAL=y can be quite helpful for debugging stack issues. I'm tempted to do something equivalent to hardwiring that option on for a while if CONFIG_VMAP_STACK=y. ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-19 1:05 ` Andy Lutomirski @ 2016-10-20 22:50 ` Dave Jones 2016-10-20 23:01 ` Andy Lutomirski 0 siblings, 1 reply; 118+ messages in thread From: Dave Jones @ 2016-10-20 22:50 UTC (permalink / raw) To: Andy Lutomirski Cc: Linus Torvalds, Chris Mason, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote: > One possible debugging approach would be to change: > > #define NR_CACHED_STACKS 2 > > to > > #define NR_CACHED_STACKS 0 > > in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will > force an immediate TLB flush after vfree. I can give that idea some runtime, but it sounds like this a case where we're trying to prove a negative, and that'll just run and run ? In which case I might do this when I'm travelling on Sunday. > Also, CONFIG_DEBUG_VIRTUAL=y can be quite helpful for debugging stack > issues. I'm tempted to do something equivalent to hardwiring that > option on for a while if CONFIG_VMAP_STACK=y. This one I had on. Nothing interesting jumped out. Dave ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-20 22:50 ` Dave Jones @ 2016-10-20 23:01 ` Andy Lutomirski 2016-10-20 23:03 ` Dave Jones 0 siblings, 1 reply; 118+ messages in thread From: Andy Lutomirski @ 2016-10-20 23:01 UTC (permalink / raw) To: Dave Jones, Andy Lutomirski, Linus Torvalds, Chris Mason, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Thu, Oct 20, 2016 at 3:50 PM, Dave Jones <davej@codemonkey.org.uk> wrote: > On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote: > > > One possible debugging approach would be to change: > > > > #define NR_CACHED_STACKS 2 > > > > to > > > > #define NR_CACHED_STACKS 0 > > > > in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will > > force an immediate TLB flush after vfree. > > I can give that idea some runtime, but it sounds like this a case where > we're trying to prove a negative, and that'll just run and run ? In which case I > might do this when I'm travelling on Sunday. The idea is that the stack will be free and unmapped immediately upon process exit if configured like this so that bogus stack accesses (by the CPU, not DMA) would OOPS immediately. ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-20 23:01 ` Andy Lutomirski @ 2016-10-20 23:03 ` Dave Jones 2016-10-20 23:23 ` Andy Lutomirski 0 siblings, 1 reply; 118+ messages in thread From: Dave Jones @ 2016-10-20 23:03 UTC (permalink / raw) To: Andy Lutomirski Cc: Andy Lutomirski, Linus Torvalds, Chris Mason, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Thu, Oct 20, 2016 at 04:01:12PM -0700, Andy Lutomirski wrote: > On Thu, Oct 20, 2016 at 3:50 PM, Dave Jones <davej@codemonkey.org.uk> wrote: > > On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote: > > > > > One possible debugging approach would be to change: > > > > > > #define NR_CACHED_STACKS 2 > > > > > > to > > > > > > #define NR_CACHED_STACKS 0 > > > > > > in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will > > > force an immediate TLB flush after vfree. > > > > I can give that idea some runtime, but it sounds like this a case where > > we're trying to prove a negative, and that'll just run and run ? In which case I > > might do this when I'm travelling on Sunday. > > The idea is that the stack will be free and unmapped immediately upon > process exit if configured like this so that bogus stack accesses (by > the CPU, not DMA) would OOPS immediately. oh, misparsed. ok, I can definitely get behind that idea then. I'll do that next. Dave ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-20 23:03 ` Dave Jones @ 2016-10-20 23:23 ` Andy Lutomirski 2016-10-21 20:02 ` Dave Jones 0 siblings, 1 reply; 118+ messages in thread From: Andy Lutomirski @ 2016-10-20 23:23 UTC (permalink / raw) To: Dave Jones, Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Chris Mason, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Thu, Oct 20, 2016 at 4:03 PM, Dave Jones <davej@codemonkey.org.uk> wrote: > On Thu, Oct 20, 2016 at 04:01:12PM -0700, Andy Lutomirski wrote: > > On Thu, Oct 20, 2016 at 3:50 PM, Dave Jones <davej@codemonkey.org.uk> wrote: > > > On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote: > > > > > > > One possible debugging approach would be to change: > > > > > > > > #define NR_CACHED_STACKS 2 > > > > > > > > to > > > > > > > > #define NR_CACHED_STACKS 0 > > > > > > > > in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will > > > > force an immediate TLB flush after vfree. > > > > > > I can give that idea some runtime, but it sounds like this a case where > > > we're trying to prove a negative, and that'll just run and run ? In which case I > > > might do this when I'm travelling on Sunday. > > > > The idea is that the stack will be free and unmapped immediately upon > > process exit if configured like this so that bogus stack accesses (by > > the CPU, not DMA) would OOPS immediately. > > oh, misparsed. ok, I can definitely get behind that idea then. > I'll do that next. > It could be worth trying this, too: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8 It occurred to me that the current code is a little bit fragile. --Andy ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-20 23:23 ` Andy Lutomirski @ 2016-10-21 20:02 ` Dave Jones 2016-10-21 20:17 ` Chris Mason 2016-10-22 15:20 ` Dave Jones 0 siblings, 2 replies; 118+ messages in thread From: Dave Jones @ 2016-10-21 20:02 UTC (permalink / raw) To: Andy Lutomirski Cc: Andy Lutomirski, Linus Torvalds, Chris Mason, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Thu, Oct 20, 2016 at 04:23:32PM -0700, Andy Lutomirski wrote: > On Thu, Oct 20, 2016 at 4:03 PM, Dave Jones <davej@codemonkey.org.uk> wrote: > > On Thu, Oct 20, 2016 at 04:01:12PM -0700, Andy Lutomirski wrote: > > > On Thu, Oct 20, 2016 at 3:50 PM, Dave Jones <davej@codemonkey.org.uk> wrote: > > > > On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote: > > > > > > > > > One possible debugging approach would be to change: > > > > > > > > > > #define NR_CACHED_STACKS 2 > > > > > > > > > > to > > > > > > > > > > #define NR_CACHED_STACKS 0 > > > > > > > > > > in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will > > > > > force an immediate TLB flush after vfree. > > > > > > > > I can give that idea some runtime, but it sounds like this a case where > > > > we're trying to prove a negative, and that'll just run and run ? In which case I > > > > might do this when I'm travelling on Sunday. > > > > > > The idea is that the stack will be free and unmapped immediately upon > > > process exit if configured like this so that bogus stack accesses (by > > > the CPU, not DMA) would OOPS immediately. > > > > oh, misparsed. ok, I can definitely get behind that idea then. > > I'll do that next. > > > > It could be worth trying this, too: > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8 > > It occurred to me that the current code is a little bit fragile. It's been nearly 24hrs with the above changes, and it's been pretty much silent the whole time. 
The only thing of note over that time period has been a btrfs lockdep warning that's been around for a while, and occasional btrfs checksum failures, which I've been seeing for a while, but seem to have gotten worse since 4.8. I'm pretty confident in the disk being ok in this machine, so I think the checksum warnings are bogus. Chris suggested they may be the result of memory corruption, but there's little else going on. BTRFS warning (device sda3): csum failed ino 130654 off 0 csum 2566472073 expected csum 3008371513 BTRFS warning (device sda3): csum failed ino 131057 off 4096 csum 3563910319 expected csum 738595262 BTRFS warning (device sda3): csum failed ino 131176 off 4096 csum 1344477721 expected csum 441864825 BTRFS warning (device sda3): csum failed ino 131241 off 245760 csum 3576232181 expected csum 2566472073 BTRFS warning (device sda3): csum failed ino 131429 off 0 csum 1494450239 expected csum 2646577722 BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800 BTRFS warning (device sda3): csum failed ino 131471 off 4096 csum 3475108475 expected csum 2566472073 BTRFS warning (device sda3): csum failed ino 131471 off 958464 csum 142982740 expected csum 2566472073 BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800 BTRFS warning (device sda3): csum failed ino 131532 off 270336 csum 3138898528 expected csum 2566472073 BTRFS warning (device sda3): csum failed ino 131532 off 1249280 csum 2169165042 expected csum 2566472073 BTRFS warning (device sda3): csum failed ino 131649 off 16384 csum 2914965650 expected csum 1425742005 A curious thing: the expected csum 2566472073 turns up a number of times for different inodes, and gets differing actual csums each time. I suppose this could be something like a block of all zeros in multiple files, but it struck me as surprising. btrfs people: is there an easy way to map those inodes to a filename ? 
I'm betting those are the test files that trinity generates. If so, it might point to a race somewhere. Dave ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-21 20:02 ` Dave Jones @ 2016-10-21 20:17 ` Chris Mason 2016-10-21 20:23 ` Dave Jones 2016-10-22 15:20 ` Dave Jones 1 sibling, 1 reply; 118+ messages in thread From: Chris Mason @ 2016-10-21 20:17 UTC (permalink / raw) To: Dave Jones, Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On 10/21/2016 04:02 PM, Dave Jones wrote: > On Thu, Oct 20, 2016 at 04:23:32PM -0700, Andy Lutomirski wrote: > > On Thu, Oct 20, 2016 at 4:03 PM, Dave Jones <davej@codemonkey.org.uk> wrote: > > > On Thu, Oct 20, 2016 at 04:01:12PM -0700, Andy Lutomirski wrote: > > > > On Thu, Oct 20, 2016 at 3:50 PM, Dave Jones <davej@codemonkey.org.uk> wrote: > > > > > On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote: > > > > > > > > > > > One possible debugging approach would be to change: > > > > > > > > > > > > #define NR_CACHED_STACKS 2 > > > > > > > > > > > > to > > > > > > > > > > > > #define NR_CACHED_STACKS 0 > > > > > > > > > > > > in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will > > > > > > force an immediate TLB flush after vfree. > > > > > > > > > > I can give that idea some runtime, but it sounds like this a case where > > > > > we're trying to prove a negative, and that'll just run and run ? In which case I > > > > > might do this when I'm travelling on Sunday. > > > > > > > > The idea is that the stack will be free and unmapped immediately upon > > > > process exit if configured like this so that bogus stack accesses (by > > > > the CPU, not DMA) would OOPS immediately. > > > > > > oh, misparsed. ok, I can definitely get behind that idea then. > > > I'll do that next. > > > > > > > It could be worth trying this, too: > > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8 > > > > It occurred to me that the current code is a little bit fragile. 
> > It's been nearly 24hrs with the above changes, and it's been pretty much > silent the whole time. > > The only thing of note over that time period has been a btrfs lockdep > warning that's been around for a while, and occasional btrfs checksum > failures, which I've been seeing for a while, but seem to have gotten > worse since 4.8. Meaning you hit them with v4.8 or not? > > I'm pretty confident in the disk being ok in this machine, so I think > the checksum warnings are bogus. Chris suggested they may be the result > of memory corruption, but there's little else going on. > > > BTRFS warning (device sda3): csum failed ino 130654 off 0 csum 2566472073 expected csum 3008371513 > BTRFS warning (device sda3): csum failed ino 131057 off 4096 csum 3563910319 expected csum 738595262 > BTRFS warning (device sda3): csum failed ino 131176 off 4096 csum 1344477721 expected csum 441864825 > BTRFS warning (device sda3): csum failed ino 131241 off 245760 csum 3576232181 expected csum 2566472073 > BTRFS warning (device sda3): csum failed ino 131429 off 0 csum 1494450239 expected csum 2646577722 > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800 > BTRFS warning (device sda3): csum failed ino 131471 off 4096 csum 3475108475 expected csum 2566472073 > BTRFS warning (device sda3): csum failed ino 131471 off 958464 csum 142982740 expected csum 2566472073 > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800 > BTRFS warning (device sda3): csum failed ino 131532 off 270336 csum 3138898528 expected csum 2566472073 > BTRFS warning (device sda3): csum failed ino 131532 off 1249280 csum 2169165042 expected csum 2566472073 > BTRFS warning (device sda3): csum failed ino 131649 off 16384 csum 2914965650 expected csum 1425742005 > > > A curious thing: the expected csum 2566472073 turns up a number of times for different inodes, and gets > differing actual csums each time. 
I suppose this could be something like a block of all zeros in multiple files, > but it struck me as surprising. > > btrfs people: is there an easy way to map those inodes to a filename ? I'm betting those are the > test files that trinity generates. If so, it might point to a race somewhere. btrfs inspect inode 130654 mntpoint -chris ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-21 20:17 ` Chris Mason @ 2016-10-21 20:23 ` Dave Jones 2016-10-21 20:38 ` Chris Mason 0 siblings, 1 reply; 118+ messages in thread From: Dave Jones @ 2016-10-21 20:23 UTC (permalink / raw) To: Chris Mason Cc: Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Fri, Oct 21, 2016 at 04:17:48PM -0400, Chris Mason wrote: > > BTRFS warning (device sda3): csum failed ino 130654 off 0 csum 2566472073 expected csum 3008371513 > > BTRFS warning (device sda3): csum failed ino 131057 off 4096 csum 3563910319 expected csum 738595262 > > BTRFS warning (device sda3): csum failed ino 131176 off 4096 csum 1344477721 expected csum 441864825 > > BTRFS warning (device sda3): csum failed ino 131241 off 245760 csum 3576232181 expected csum 2566472073 > > BTRFS warning (device sda3): csum failed ino 131429 off 0 csum 1494450239 expected csum 2646577722 > > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800 > > BTRFS warning (device sda3): csum failed ino 131471 off 4096 csum 3475108475 expected csum 2566472073 > > BTRFS warning (device sda3): csum failed ino 131471 off 958464 csum 142982740 expected csum 2566472073 > > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800 > > BTRFS warning (device sda3): csum failed ino 131532 off 270336 csum 3138898528 expected csum 2566472073 > > BTRFS warning (device sda3): csum failed ino 131532 off 1249280 csum 2169165042 expected csum 2566472073 > > BTRFS warning (device sda3): csum failed ino 131649 off 16384 csum 2914965650 expected csum 1425742005 > > > > > > A curious thing: the expected csum 2566472073 turns up a number of times for different inodes, and gets > > differing actual csums each time. I suppose this could be something like a block of all zeros in multiple files, > > but it struck me as surprising. 
> > > > btrfs people: is there an easy way to map those inodes to a filename ? I'm betting those are the > > test files that trinity generates. If so, it might point to a race somewhere. > > btrfs inspect inode 130654 mntpoint Interesting, they all return ERROR: ino paths ioctl: No such file or directory So these files got deleted perhaps ? Dave ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-21 20:23 ` Dave Jones @ 2016-10-21 20:38 ` Chris Mason 2016-10-21 20:41 ` Josef Bacik 0 siblings, 1 reply; 118+ messages in thread From: Chris Mason @ 2016-10-21 20:38 UTC (permalink / raw) To: Dave Jones, Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On 10/21/2016 04:23 PM, Dave Jones wrote: > On Fri, Oct 21, 2016 at 04:17:48PM -0400, Chris Mason wrote: > > > > BTRFS warning (device sda3): csum failed ino 130654 off 0 csum 2566472073 expected csum 3008371513 > > > BTRFS warning (device sda3): csum failed ino 131057 off 4096 csum 3563910319 expected csum 738595262 > > > BTRFS warning (device sda3): csum failed ino 131176 off 4096 csum 1344477721 expected csum 441864825 > > > BTRFS warning (device sda3): csum failed ino 131241 off 245760 csum 3576232181 expected csum 2566472073 > > > BTRFS warning (device sda3): csum failed ino 131429 off 0 csum 1494450239 expected csum 2646577722 > > > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800 > > > BTRFS warning (device sda3): csum failed ino 131471 off 4096 csum 3475108475 expected csum 2566472073 > > > BTRFS warning (device sda3): csum failed ino 131471 off 958464 csum 142982740 expected csum 2566472073 > > > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800 > > > BTRFS warning (device sda3): csum failed ino 131532 off 270336 csum 3138898528 expected csum 2566472073 > > > BTRFS warning (device sda3): csum failed ino 131532 off 1249280 csum 2169165042 expected csum 2566472073 > > > BTRFS warning (device sda3): csum failed ino 131649 off 16384 csum 2914965650 expected csum 1425742005 > > > > > > > > > A curious thing: the expected csum 2566472073 turns up a number of times for different inodes, and gets > > > differing actual csums each time. 
I suppose this could be something like a block of all zeros in multiple files, > > > but it struck me as surprising. > > > > > > btrfs people: is there an easy way to map those inodes to a filename ? I'm betting those are the > > > test files that trinity generates. If so, it might point to a race somewhere. > > > > btrfs inspect inode 130654 mntpoint > > Interesting, they all return > > ERROR: ino paths ioctl: No such file or directory > > So these files got deleted perhaps ? > Yeah, they must have. -chris ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-21 20:38 ` Chris Mason @ 2016-10-21 20:41 ` Josef Bacik 2016-10-21 21:11 ` Dave Jones 0 siblings, 1 reply; 118+ messages in thread From: Josef Bacik @ 2016-10-21 20:41 UTC (permalink / raw) To: Chris Mason, Dave Jones, Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Jens Axboe, Al Viro, David Sterba, linux-btrfs, Linux Kernel On 10/21/2016 04:38 PM, Chris Mason wrote: > > > On 10/21/2016 04:23 PM, Dave Jones wrote: >> On Fri, Oct 21, 2016 at 04:17:48PM -0400, Chris Mason wrote: >> >> > > BTRFS warning (device sda3): csum failed ino 130654 off 0 csum 2566472073 >> expected csum 3008371513 >> > > BTRFS warning (device sda3): csum failed ino 131057 off 4096 csum >> 3563910319 expected csum 738595262 >> > > BTRFS warning (device sda3): csum failed ino 131176 off 4096 csum >> 1344477721 expected csum 441864825 >> > > BTRFS warning (device sda3): csum failed ino 131241 off 245760 csum >> 3576232181 expected csum 2566472073 >> > > BTRFS warning (device sda3): csum failed ino 131429 off 0 csum 1494450239 >> expected csum 2646577722 >> > > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 >> expected csum 3828807800 >> > > BTRFS warning (device sda3): csum failed ino 131471 off 4096 csum >> 3475108475 expected csum 2566472073 >> > > BTRFS warning (device sda3): csum failed ino 131471 off 958464 csum >> 142982740 expected csum 2566472073 >> > > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 >> expected csum 3828807800 >> > > BTRFS warning (device sda3): csum failed ino 131532 off 270336 csum >> 3138898528 expected csum 2566472073 >> > > BTRFS warning (device sda3): csum failed ino 131532 off 1249280 csum >> 2169165042 expected csum 2566472073 >> > > BTRFS warning (device sda3): csum failed ino 131649 off 16384 csum >> 2914965650 expected csum 1425742005 >> > > >> > > >> > > A curious thing: the expected csum 2566472073 turns up a number of times >> for different inodes, and gets >> 
> > differing actual csums each time. I suppose this could be something like >> a block of all zeros in multiple files, >> > > but it struck me as surprising. >> > > >> > > btrfs people: is there an easy way to map those inodes to a filename ? >> I'm betting those are the >> > > test files that trinity generates. If so, it might point to a race >> somewhere. >> > >> > btrfs inspect inode 130654 mntpoint >> >> Interesting, they all return >> >> ERROR: ino paths ioctl: No such file or directory >> >> So these files got deleted perhaps ? >> > Yeah, they must have. > So one thing that will cause spurious csum errors is if you do things like change the memory while it is in flight during O_DIRECT. Does trinity do that? If so then that would explain it. If not we should probably dig into it. Thanks, Josef ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-21 20:41 ` Josef Bacik @ 2016-10-21 21:11 ` Dave Jones 0 siblings, 0 replies; 118+ messages in thread From: Dave Jones @ 2016-10-21 21:11 UTC (permalink / raw) To: Josef Bacik Cc: Chris Mason, Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Jens Axboe, Al Viro, David Sterba, linux-btrfs, Linux Kernel On Fri, Oct 21, 2016 at 04:41:09PM -0400, Josef Bacik wrote: > >> > > >> > btrfs inspect inode 130654 mntpoint > >> > >> Interesting, they all return > >> > >> ERROR: ino paths ioctl: No such file or directory > >> > >> So these files got deleted perhaps ? > >> > > Yeah, they must have. > > > > So one thing that will cause spurious csum errors is if you do things like > change the memory while it is in flight during O_DIRECT. Does trinity do that? > If so then that would explain it. If not we should probably dig into it. Thanks, Yeah, that's definitely possible. And it wasn't that long ago I added some code to always open testfiles multiple times with different modes, so the likely of O_DIRECT went up. That would explain why I've started seeing this more. Dave ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-21 20:02 ` Dave Jones 2016-10-21 20:17 ` Chris Mason @ 2016-10-22 15:20 ` Dave Jones 2016-10-23 21:32 ` Chris Mason 1 sibling, 1 reply; 118+ messages in thread From: Dave Jones @ 2016-10-22 15:20 UTC (permalink / raw) To: Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Chris Mason, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote: > > It could be worth trying this, too: > > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8 > > > > It occurred to me that the current code is a little bit fragile. > > It's been nearly 24hrs with the above changes, and it's been pretty much > silent the whole time. > > The only thing of note over that time period has been a btrfs lockdep > warning that's been around for a while, and occasional btrfs checksum > failures, which I've been seeing for a while, but seem to have gotten > worse since 4.8. > > I'm pretty confident in the disk being ok in this machine, so I think > the checksum warnings are bogus. Chris suggested they may be the result > of memory corruption, but there's little else going on. The only interesting thing last nights run was this.. 
BUG: Bad page state in process kworker/u8:1 pfn:4e2b70 page:ffffea00138adc00 count:0 mapcount:0 mapping:ffff88046e9fc2e0 index:0xdf0 flags: 0x400000000000000c(referenced|uptodate) page dumped because: non-NULL mapping CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11 Workqueue: writeback wb_workfn (flush-btrfs-2) ffffc90001f97828 ffffffff8130d07c ffffea00138adc00 ffffffff819ff524 ffffc90001f97850 ffffffff8115117f 0000000000000000 ffffea00138adc00 400000000000000c ffffc90001f97860 ffffffff8115123a ffffc90001f978a8 Call Trace: [<ffffffff8130d07c>] dump_stack+0x4f/0x73 [<ffffffff8115117f>] bad_page+0xbf/0x120 [<ffffffff8115123a>] free_pages_check_bad+0x5a/0x70 [<ffffffff81153b38>] free_hot_cold_page+0x248/0x290 [<ffffffff81153e3b>] free_hot_cold_page_list+0x2b/0x50 [<ffffffff8115c84d>] release_pages+0x2bd/0x350 [<ffffffff8115dd82>] __pagevec_release+0x22/0x30 [<ffffffffa009cd4e>] extent_write_cache_pages.isra.48.constprop.63+0x32e/0x400 [btrfs] [<ffffffffa009d199>] extent_writepages+0x49/0x60 [btrfs] [<ffffffffa007d840>] ? btrfs_releasepage+0x40/0x40 [btrfs] [<ffffffffa007a993>] btrfs_writepages+0x23/0x30 [btrfs] [<ffffffff8115af6c>] do_writepages+0x1c/0x30 [<ffffffff811f6d33>] __writeback_single_inode+0x33/0x180 [<ffffffff811f7528>] writeback_sb_inodes+0x2a8/0x5b0 [<ffffffff811f78bd>] __writeback_inodes_wb+0x8d/0xc0 [<ffffffff811f7b73>] wb_writeback+0x1e3/0x1f0 [<ffffffff811f80b2>] wb_workfn+0xd2/0x280 [<ffffffff81090875>] process_one_work+0x1d5/0x490 [<ffffffff81090815>] ? process_one_work+0x175/0x490 [<ffffffff81090b79>] worker_thread+0x49/0x490 [<ffffffff81090b30>] ? process_one_work+0x490/0x490 [<ffffffff81095cee>] kthread+0xee/0x110 [<ffffffff81095c00>] ? kthread_park+0x60/0x60 [<ffffffff81790bd2>] ret_from_fork+0x22/0x30 ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-22 15:20 ` Dave Jones @ 2016-10-23 21:32 ` Chris Mason 2016-10-24 4:40 ` Dave Jones 0 siblings, 1 reply; 118+ messages in thread From: Chris Mason @ 2016-10-23 21:32 UTC (permalink / raw) To: Dave Jones, Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On 10/22/2016 11:20 AM, Dave Jones wrote: > On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote: > > > > It could be worth trying this, too: > > > > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8 > > > > > > It occurred to me that the current code is a little bit fragile. > > > > It's been nearly 24hrs with the above changes, and it's been pretty much > > silent the whole time. > > > > The only thing of note over that time period has been a btrfs lockdep > > warning that's been around for a while, and occasional btrfs checksum > > failures, which I've been seeing for a while, but seem to have gotten > > worse since 4.8. > > > > I'm pretty confident in the disk being ok in this machine, so I think > > the checksum warnings are bogus. Chris suggested they may be the result > > of memory corruption, but there's little else going on. > > The only interesting thing last nights run was this.. > > BUG: Bad page state in process kworker/u8:1 pfn:4e2b70 > page:ffffea00138adc00 count:0 mapcount:0 mapping:ffff88046e9fc2e0 index:0xdf0 > flags: 0x400000000000000c(referenced|uptodate) > page dumped because: non-NULL mapping > CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11 > Workqueue: writeback wb_workfn (flush-btrfs-2) Well crud, we're back to wondering if this is Btrfs or the stack corruption. Since the pagevecs are on the stack and this is a new crash, my guess is you'll be able to trigger it on xfs/ext4 too. But we should make sure. -chris ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-23 21:32 ` Chris Mason @ 2016-10-24 4:40 ` Dave Jones 2016-10-24 13:42 ` Chris Mason 2016-10-24 20:06 ` Andy Lutomirski 0 siblings, 2 replies; 118+ messages in thread From: Dave Jones @ 2016-10-24 4:40 UTC (permalink / raw) To: Chris Mason Cc: Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Sun, Oct 23, 2016 at 05:32:21PM -0400, Chris Mason wrote: > > > On 10/22/2016 11:20 AM, Dave Jones wrote: > > On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote: > > > > > > It could be worth trying this, too: > > > > > > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8 > > > > > > > > It occurred to me that the current code is a little bit fragile. > > > > > > It's been nearly 24hrs with the above changes, and it's been pretty much > > > silent the whole time. > > > > > > The only thing of note over that time period has been a btrfs lockdep > > > warning that's been around for a while, and occasional btrfs checksum > > > failures, which I've been seeing for a while, but seem to have gotten > > > worse since 4.8. > > > > > > I'm pretty confident in the disk being ok in this machine, so I think > > > the checksum warnings are bogus. Chris suggested they may be the result > > > of memory corruption, but there's little else going on. > > > > The only interesting thing last nights run was this.. > > > > BUG: Bad page state in process kworker/u8:1 pfn:4e2b70 > > page:ffffea00138adc00 count:0 mapcount:0 mapping:ffff88046e9fc2e0 index:0xdf0 > > flags: 0x400000000000000c(referenced|uptodate) > > page dumped because: non-NULL mapping > > CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11 > > Workqueue: writeback wb_workfn (flush-btrfs-2) > > Well crud, we're back to wondering if this is Btrfs or the stack > corruption. 
Since the pagevecs are on the stack and this is a new > crash, my guess is you'll be able to trigger it on xfs/ext4 too. But we > should make sure. Here's an interesting one from today, pointing the finger at xattrs again. [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11 [69943.463510] task: ffff8804f8dd3740 task.stack: ffffc9000b108000 [69943.468077] RIP: 0010:[<ffffffff810c3f6b>] [69943.472704] [<ffffffff810c3f6b>] __lock_acquire.isra.32+0x6b/0x8c0 [69943.477489] RSP: 0018:ffffc9000b10b9e8 EFLAGS: 00010086 [69943.482368] RAX: ffffffff81789b90 RBX: ffff8804f8dd3740 RCX: 0000000000000000 [69943.487410] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [69943.492515] RBP: ffffc9000b10ba18 R08: 0000000000000001 R09: 0000000000000000 [69943.497666] R10: 0000000000000001 R11: 00003f9cfa7f4e73 R12: 0000000000000000 [69943.502880] R13: 0000000000000000 R14: ffffc9000af7bd48 R15: ffff8804f8dd3740 [69943.508163] FS: 00007f64904a2b40(0000) GS:ffff880507a00000(0000) knlGS:0000000000000000 [69943.513591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [69943.518917] CR2: ffffffff81789d28 CR3: 00000004a8f16000 CR4: 00000000001406e0 [69943.524253] DR0: 00007f5b97fd4000 DR1: 0000000000000000 DR2: 0000000000000000 [69943.529488] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 [69943.534771] Stack: [69943.540023] ffff880507bd74c0 [69943.545317] ffff8804f8dd3740 0000000000000046 0000000000000286[69943.545456] ffffc9000af7bd08 [69943.550930] 0000000000000100 ffffc9000b10ba50 ffffffff810c4b68[69943.551069] ffffffff810ba40c [69943.556657] ffff880400000000 0000000000000000 ffffc9000af7bd48[69943.556796] Call Trace: [69943.562465] [<ffffffff810c4b68>] lock_acquire+0x58/0x70 [69943.568354] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70 [69943.574306] [<ffffffff8178fef2>] _raw_spin_lock_irqsave+0x42/0x80 [69943.580335] [<ffffffff810ba40c>] ? 
finish_wait+0x3c/0x70 [69943.586237] [<ffffffff810ba40c>] finish_wait+0x3c/0x70 [69943.591992] [<ffffffff81169727>] shmem_fault+0x167/0x1b0 [69943.597807] [<ffffffff810ba6c0>] ? prepare_to_wait_event+0x100/0x100 [69943.603741] [<ffffffff8117b46d>] __do_fault+0x6d/0x1b0 [69943.609743] [<ffffffff8117f168>] handle_mm_fault+0xc58/0x1170 [69943.615822] [<ffffffff8117e553>] ? handle_mm_fault+0x43/0x1170 [69943.621971] [<ffffffff81044982>] __do_page_fault+0x172/0x4e0 [69943.628184] [<ffffffff81044d10>] do_page_fault+0x20/0x70 [69943.634449] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20 [69943.640784] [<ffffffff81791f3f>] page_fault+0x1f/0x30 [69943.647170] [<ffffffff8133d69c>] ? strncpy_from_user+0x5c/0x170 [69943.653480] [<ffffffff8133d686>] ? strncpy_from_user+0x46/0x170 [69943.659632] [<ffffffff811f22a7>] setxattr+0x57/0x170 [69943.665846] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20 [69943.672172] [<ffffffff810c1f09>] ? get_lock_stats+0x19/0x50 [69943.678558] [<ffffffff810a58f6>] ? sched_clock_cpu+0xb6/0xd0 [69943.685007] [<ffffffff810c40cf>] ? __lock_acquire.isra.32+0x1cf/0x8c0 [69943.691542] [<ffffffff8132a8b3>] ? __this_cpu_preempt_check+0x13/0x20 [69943.698130] [<ffffffff8109b9bc>] ? preempt_count_add+0x7c/0xc0 [69943.704791] [<ffffffff811ecda1>] ? __mnt_want_write+0x61/0x90 [69943.711519] [<ffffffff811f2638>] SyS_fsetxattr+0x78/0xa0 [69943.718300] [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [69943.724949] [<ffffffff81790a4b>] entry_SYSCALL64_slow_path+0x25/0x25 [69943.731521] Code: [69943.738124] 00 83 fe 01 0f 86 0e 03 00 00 31 d2 4c 89 f7 44 89 45 d0 89 4d d4 e8 75 e7 ff ff 8b 4d d4 48 85 c0 44 8b 45 d0 0f 84 d8 02 00 00 <f0> ff 80 98 01 00 00 8b 15 e0 21 8f 01 45 8b 8f 50 08 00 00 85 [69943.745506] RIP [69943.752432] [<ffffffff810c3f6b>] __lock_acquire.isra.32+0x6b/0x8c0 RSP <ffffc9000b10b9e8> [69943.766856] CR2: ffffffff81789d28 ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-24 4:40 ` Dave Jones @ 2016-10-24 13:42 ` Chris Mason 2016-10-26 0:27 ` Dave Jones 2016-10-24 20:06 ` Andy Lutomirski 1 sibling, 1 reply; 118+ messages in thread From: Chris Mason @ 2016-10-24 13:42 UTC (permalink / raw) To: Dave Jones, Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On 10/24/2016 12:40 AM, Dave Jones wrote: > On Sun, Oct 23, 2016 at 05:32:21PM -0400, Chris Mason wrote: > > > > > > On 10/22/2016 11:20 AM, Dave Jones wrote: > > > On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote: > > > > > > > > It could be worth trying this, too: > > > > > > > > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8 > > > > > > > > > > It occurred to me that the current code is a little bit fragile. > > > > > > > > It's been nearly 24hrs with the above changes, and it's been pretty much > > > > silent the whole time. > > > > > > > > The only thing of note over that time period has been a btrfs lockdep > > > > warning that's been around for a while, and occasional btrfs checksum > > > > failures, which I've been seeing for a while, but seem to have gotten > > > > worse since 4.8. > > > > > > > > I'm pretty confident in the disk being ok in this machine, so I think > > > > the checksum warnings are bogus. Chris suggested they may be the result > > > > of memory corruption, but there's little else going on. > > > > > > The only interesting thing last nights run was this.. 
> > > > > > BUG: Bad page state in process kworker/u8:1 pfn:4e2b70 > > > page:ffffea00138adc00 count:0 mapcount:0 mapping:ffff88046e9fc2e0 index:0xdf0 > > > flags: 0x400000000000000c(referenced|uptodate) > > > page dumped because: non-NULL mapping > > > CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11 > > > Workqueue: writeback wb_workfn (flush-btrfs-2) > > > > Well crud, we're back to wondering if this is Btrfs or the stack > > corruption. Since the pagevecs are on the stack and this is a new > > crash, my guess is you'll be able to trigger it on xfs/ext4 too. But we > > should make sure. > > Here's an interesting one from today, pointing the finger at xattrs again. > > > [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC > [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11 > [69943.463510] task: ffff8804f8dd3740 task.stack: ffffc9000b108000 > [69943.468077] RIP: 0010:[<ffffffff810c3f6b>] Was this btrfs? -chris ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-24 13:42 ` Chris Mason @ 2016-10-26 0:27 ` Dave Jones 2016-10-26 1:33 ` Linus Torvalds 2016-10-27 5:41 ` Dave Chinner 0 siblings, 2 replies; 118+ messages in thread From: Dave Jones @ 2016-10-26 0:27 UTC (permalink / raw) To: Chris Mason Cc: Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Mon, Oct 24, 2016 at 09:42:39AM -0400, Chris Mason wrote: > > > Well crud, we're back to wondering if this is Btrfs or the stack > > > corruption. Since the pagevecs are on the stack and this is a new > > > crash, my guess is you'll be able to trigger it on xfs/ext4 too. But we > > > should make sure. > > > > Here's an interesting one from today, pointing the finger at xattrs again. > > > > > > [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC > > [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11 > > [69943.463510] task: ffff8804f8dd3740 task.stack: ffffc9000b108000 > > [69943.468077] RIP: 0010:[<ffffffff810c3f6b>] > > Was this btrfs? I already told you elsewhere, but for benefit of everyone else, yes, it was. At Chris' behest, I gave ext4 some more air-time with this workload. It ran for 1 day 6 hrs without incident before I got bored and tried something else. I threw XFS on the test partition, restarted the test, and got the warnings below across two reboots. DaveC: Do these look like real problems, or is this more "looks like random memory corruption" ? It's been a while since I did some stress testing on XFS, so these might not be new.. XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2938 ------------[ cut here ]------------ kernel BUG at fs/xfs/xfs_message.c:113! 
invalid opcode: 0000 [#1] PREEMPT SMP CPU: 1 PID: 6227 Comm: trinity-c9 Not tainted 4.9.0-rc1-think+ #6 task: ffff8804f4658040 task.stack: ffff88050568c000 RIP: 0010:[<ffffffffa02d3e2b>] [<ffffffffa02d3e2b>] assfail+0x1b/0x20 [xfs] RSP: 0000:ffff88050568f9e8 EFLAGS: 00010282 RAX: 00000000ffffffea RBX: 0000000000000046 RCX: 0000000000000001 RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffa02fe34d RBP: ffff88050568f9e8 R08: 0000000000000000 R09: 0000000000000000 R10: 000000000000000a R11: f000000000000000 R12: ffff88050568fb44 R13: 00000000000000f3 R14: ffff8804f292bf88 R15: 000ffffffffe0046 FS: 00007fe2ddfdfb40(0000) GS:ffff88050a000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe2dbabd000 CR3: 00000004f461f000 CR4: 00000000001406e0 Stack: ffff88050568fa88 ffffffffa027ccee fffffffffffffff9 ffff8804f16fd8b0 0000000000003ffa 0000000000000032 ffff8804f292bf40 0000000000004976 000ffffffffe0008 00000000000004fd ffff880400000000 0000000000005107 Call Trace: [<ffffffffa027ccee>] xfs_bmap_add_extent_hole_delay+0x54e/0x620 [xfs] [<ffffffffa027f2d4>] xfs_bmapi_reserve_delalloc+0x2b4/0x400 [xfs] [<ffffffffa02cadd7>] xfs_file_iomap_begin_delay.isra.12+0x247/0x3c0 [xfs] [<ffffffffa02cb0d1>] xfs_file_iomap_begin+0x181/0x270 [xfs] [<ffffffffa02ca13e>] ? xfs_file_iomap_end+0x9e/0xe0 [xfs] [<ffffffff8122e573>] iomap_apply+0x53/0x100 [<ffffffff8122df10>] ? iomap_write_end+0x70/0x70 [<ffffffff8122e68b>] iomap_file_buffered_write+0x6b/0x90 [<ffffffff8122df10>] ? iomap_write_end+0x70/0x70 [<ffffffffa02c1dd8>] xfs_file_buffered_aio_write+0xe8/0x1d0 [xfs] [<ffffffff810c3b7f>] ? __lock_acquire.isra.32+0x1cf/0x8c0 [<ffffffffa02c1f45>] xfs_file_write_iter+0x85/0x120 [xfs] [<ffffffff811c8c98>] do_iter_readv_writev+0xa8/0x100 [<ffffffff811c9622>] do_readv_writev+0x172/0x210 [<ffffffffa02c1ec0>] ? xfs_file_buffered_aio_write+0x1d0/0x1d0 [xfs] [<ffffffff811e9794>] ? __fdget_pos+0x44/0x50 [<ffffffff8178b2f2>] ? 
mutex_lock_nested+0x272/0x3f0 [<ffffffff811e9794>] ? __fdget_pos+0x44/0x50 [<ffffffff811e9794>] ? __fdget_pos+0x44/0x50 [<ffffffff811c98ea>] vfs_writev+0x3a/0x50 [<ffffffff811c9950>] do_writev+0x50/0xd0 [<ffffffff811ca9fb>] SyS_writev+0xb/0x10 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff8178ff4b>] entry_SYSCALL64_slow_path+0x25/0x25 Code: 48 c7 c7 65 e3 2f a0 e8 74 37 da e0 5d c3 66 90 55 48 89 f1 41 89 d0 48 c7 c6 18 93 30 a0 48 89 fa 48 89 e5 31 ff e8 65 fa ff ff <0f> 0b 0f 1f 00 55 48 63 f6 49 89 f9 41 b8 01 00 00 00 48 89 e5 RIP [<ffffffffa02d3e2b>] assfail+0x1b/0x20 [xfs] RSP <ffff88050568f9e8> XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: fs/xfs/xfs_trans.c, line: 309 kernel BUG at fs/xfs/xfs_message.c:113! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 0 PID: 7309 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11 Workqueue: writeback wb_workfn (flush-8:0) task: ffff88025eb98040 task.stack: ffffc9000a914000 RIP: 0010:[<ffffffffa0571e2b>] [<ffffffffa0571e2b>] assfail+0x1b/0x20 [xfs] RSP: 0018:ffffc9000a917410 EFLAGS: 00010282 RAX: 00000000ffffffea RBX: ffff8804538d22b8 RCX: 0000000000000001 RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffa059c34d RBP: ffffc9000a917410 R08: 0000000000000000 R09: 0000000000000000 R10: 000000000000000a R11: f000000000000000 R12: ffffffffffffffff R13: ffff88047c765698 R14: 0000000000000001 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff880507800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 00000004c56e7000 CR4: 00000000001406f0 DR0: 00007fec5e3c9000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 Stack: ffffc9000a917438 ffffffffa057abe1 ffffc9000a917510 ffffc9000a917510 ffffc9000a917510 ffffc9000a917460 ffffffffa0548eff ffffc9000a917510 0000000000000001 ffffc9000a917510 ffffc9000a917480 ffffffffa050aa3d Call Trace: [<ffffffffa057abe1>] 
xfs_trans_mod_sb+0x241/0x280 [xfs] [<ffffffffa0548eff>] xfs_ag_resv_alloc_extent+0x4f/0xc0 [xfs] [<ffffffffa050aa3d>] xfs_alloc_ag_vextent+0x23d/0x300 [xfs] [<ffffffffa050bb1b>] xfs_alloc_vextent+0x5fb/0x6d0 [xfs] [<ffffffffa051c1b4>] xfs_bmap_btalloc+0x304/0x8e0 [xfs] [<ffffffffa054648e>] ? xfs_iext_bno_to_ext+0xee/0x170 [xfs] [<ffffffffa051c8db>] xfs_bmap_alloc+0x2b/0x40 [xfs] [<ffffffffa051dc30>] xfs_bmapi_write+0x640/0x1210 [xfs] [<ffffffffa0569326>] xfs_iomap_write_allocate+0x166/0x350 [xfs] [<ffffffffa05540b0>] xfs_map_blocks+0x1b0/0x260 [xfs] [<ffffffffa0554beb>] xfs_do_writepage+0x23b/0x730 [xfs] [<ffffffff81159ef8>] ? clear_page_dirty_for_io+0x128/0x210 [<ffffffff81159e71>] ? clear_page_dirty_for_io+0xa1/0x210 [<ffffffff8115a1b6>] write_cache_pages+0x1d6/0x4a0 [<ffffffffa05549b0>] ? xfs_aops_discard_page+0x140/0x140 [xfs] [<ffffffffa0554419>] xfs_vm_writepages+0x59/0x80 [xfs] [<ffffffff8115af6c>] do_writepages+0x1c/0x30 [<ffffffff811f6d33>] __writeback_single_inode+0x33/0x180 [<ffffffff811f7528>] writeback_sb_inodes+0x2a8/0x5b0 [<ffffffff811f78bd>] __writeback_inodes_wb+0x8d/0xc0 [<ffffffff811f7b73>] wb_writeback+0x1e3/0x1f0 [<ffffffff811f80b2>] wb_workfn+0xd2/0x280 [<ffffffff81090875>] process_one_work+0x1d5/0x490 [<ffffffff81090815>] ? process_one_work+0x175/0x490 [<ffffffff81090b79>] worker_thread+0x49/0x490 [<ffffffff81090b30>] ? process_one_work+0x490/0x490 [<ffffffff81090b30>] ? process_one_work+0x490/0x490 [<ffffffff81095cee>] kthread+0xee/0x110 [<ffffffff81095c00>] ? kthread_park+0x60/0x60 [<ffffffff81790bd2>] ret_from_fork+0x22/0x30 Code: 48 c7 c7 65 c3 59 a0 e8 c4 5c b0 e0 5d c3 66 90 55 48 89 f1 41 89 d0 48 c7 c6 18 73 5a a0 48 89 fa 48 89 e5 31 ff e8 65 fa ff ff <0f> 0b 0f 1f 00 55 48 63 f6 49 89 f9 41 b8 01 00 00 00 48 89 e5 RIP [<ffffffffa0571e2b>] assfail+0x1b/0x20 [xfs] RSP <ffffc9000a917410> ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 0:27 ` Dave Jones @ 2016-10-26 1:33 ` Linus Torvalds 2016-10-26 1:39 ` Linus Torvalds 2016-10-27 5:41 ` Dave Chinner 1 sibling, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-10-26 1:33 UTC (permalink / raw) To: Dave Jones, Chris Mason, Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Tue, Oct 25, 2016 at 5:27 PM, Dave Jones <davej@codemonkey.org.uk> wrote: > > DaveC: Do these look like real problems, or is this more "looks like > random memory corruption" ? It's been a while since I did some stress > testing on XFS, so these might not be new.. Andy, do you think we could just do some poisoning of the stack as we free it, to see if that catches anything? Something truly stupid like just --- a/kernel/fork.c +++ b/kernel/fork.c @@ -218,6 +218,7 @@ static inline void free_thread_stack(struct task_struct *tsk) unsigned long flags; int i; + memset(tsk->stack_vm_area->addr, 0xd0, THREAD_SIZE); local_irq_save(flags); for (i = 0; i < NR_CACHED_STACKS; i++) { if (this_cpu_read(cached_stacks[i])) or similar? It seems like DaveJ had an easier time triggering these problems with the stack cache, but they clearly didn't go away when the stack cache was disabled. So maybe the stack cache just made the reuse more likely and faster, making the problem show up faster too. But if we actively poison things, we'll corrupt the free'd stack *immediately* if there is some stale use.. Completely untested. Maybe there's some reason we can't write to the whole thing like that? Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
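[Editorial note: Linus's one-liner above can be illustrated with a plain userspace sketch of the same trick. This is hypothetical code, not kernel code — the helper names `poison_region` and `region_is_poisoned` are invented for this example. The idea is the same: fill a freed region with a recognizable byte, so any stale user immediately reads or clobbers the 0xd0 pattern instead of silently reusing old data, which is why a later mail asks whether the oops registers contained 0xd0d0d0d0d0d0.]

```c
#include <string.h>
#include <stdint.h>

#define POISON 0xd0   /* same byte as the memset(..., 0xd0, THREAD_SIZE) one-liner */

/* Fill a "freed" region with the poison byte.  In the kernel patch this
 * happens in free_thread_stack(); here it is just a memset over a buffer. */
static void poison_region(void *p, size_t len)
{
    memset(p, POISON, len);
}

/* Report whether the region still carries the intact poison pattern.
 * A stale writer destroys the pattern; a stale reader pulls 0xd0d0...
 * into registers, which is what one would grep a subsequent oops for. */
static int region_is_poisoned(const void *p, size_t len)
{
    const uint8_t *b = p;

    for (size_t i = 0; i < len; i++)
        if (b[i] != POISON)
            return 0;
    return 1;
}
```

The design point is that poisoning converts a "maybe someone reuses this later" bug into an immediate, visible corruption at the moment of the stale access, rather than depending on the stack cache to hand the memory out again quickly.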
* Re: bio linked list corruption. 2016-10-26 1:33 ` Linus Torvalds @ 2016-10-26 1:39 ` Linus Torvalds 2016-10-26 16:30 ` Dave Jones 0 siblings, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-10-26 1:39 UTC (permalink / raw) To: Dave Jones, Chris Mason, Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Tue, Oct 25, 2016 at 6:33 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Completely untested. Maybe there's some reason we can't write to the > whole thing like that? That hack boots and seems to work for me, but doesn't show anything. Dave, mind just trying that oneliner? Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 1:39 ` Linus Torvalds @ 2016-10-26 16:30 ` Dave Jones 2016-10-26 16:48 ` Linus Torvalds 0 siblings, 1 reply; 118+ messages in thread From: Dave Jones @ 2016-10-26 16:30 UTC (permalink / raw) To: Linus Torvalds Cc: Chris Mason, Andy Lutomirski, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Tue, Oct 25, 2016 at 06:39:03PM -0700, Linus Torvalds wrote: > On Tue, Oct 25, 2016 at 6:33 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Completely untested. Maybe there's some reason we can't write to the > > whole thing like that? > > That hack boots and seems to work for me, but doesn't show anything. > > Dave, mind just trying that oneliner? I gave this a go last thing last night. It crashed within 5 minutes, but it was one we've already seen (the bad page map trace) with nothing additional that looked interesting. I rebooted, and tried again and went to bed. 12 hours later, it's still doing its thing. Heisenbugs man, literally the worst. Dave ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 16:30 ` Dave Jones @ 2016-10-26 16:48 ` Linus Torvalds 2016-10-26 18:18 ` Dave Jones 2016-10-26 18:42 ` Dave Jones 0 siblings, 2 replies; 118+ messages in thread From: Linus Torvalds @ 2016-10-26 16:48 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Chris Mason, Andy Lutomirski, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 9:30 AM, Dave Jones <davej@codemonkey.org.uk> wrote: > > I gave this a go last thing last night. It crashed within 5 minutes, > but it was one we've already seen (the bad page map trace) with nothing > additional that looked interesting. Did the bad page map trace have any registers that looked like they had 0xd0d0d0d0d0d0 in them? I assume not, but worth checking. > Heisenbugs man, literally the worst. I know you already had this in some email, but I lost it. I think you narrowed it down to a specific set of system calls that seems to trigger this best. fallocate and xattrs or something? Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 16:48 ` Linus Torvalds @ 2016-10-26 18:18 ` Dave Jones 2016-10-26 18:42 ` Dave Jones 1 sibling, 0 replies; 118+ messages in thread From: Dave Jones @ 2016-10-26 18:18 UTC (permalink / raw) To: Linus Torvalds Cc: Chris Mason, Andy Lutomirski, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 09:48:39AM -0700, Linus Torvalds wrote: > On Wed, Oct 26, 2016 at 9:30 AM, Dave Jones <davej@codemonkey.org.uk> wrote: > > > > I gave this a go last thing last night. It crashed within 5 minutes, > > but it was one we've already seen (the bad page map trace) with nothing > > additional that looked interesting. > > Did the bad page map trace have any registers that looked like they > had 0xd0d0d0d0d0d0 in them? > > I assume not, but worth checking. sadly not. In case I did overlook something, here's last nights.. BUG: Bad page state in process kworker/u8:13 pfn:4dae31 page:ffffea00136b8c40 count:0 mapcount:0 mapping:ffff8804f011d6e0 index:0xd1530 flags: 0x400000000000000c(referenced|uptodate) page dumped because: non-NULL mapping CPU: 3 PID: 1207 Comm: kworker/u8:13 Not tainted 4.9.0-rc2-think+ #3 Workqueue: writeback wb_workfn (flush-btrfs-1) ffffc900006fb870 ffffffff8130cf3c ffffea00136b8c40 ffffffff819ff54c ffffc900006fb898 ffffffff811511af 0000000000000000 ffffea00136b8c40 400000000000000c ffffc900006fb8a8 ffffffff8115126a ffffc900006fb8f0 Call Trace: [<ffffffff8130cf3c>] dump_stack+0x4f/0x73 [<ffffffff811511af>] bad_page+0xbf/0x120 [<ffffffff8115126a>] free_pages_check_bad+0x5a/0x70 [<ffffffff81153b68>] free_hot_cold_page+0x248/0x290 [<ffffffff81153e6b>] free_hot_cold_page_list+0x2b/0x50 [<ffffffff8115c87d>] release_pages+0x2bd/0x350 [<ffffffff8115ddb2>] __pagevec_release+0x22/0x30 [<ffffffffa00a0d4e>] extent_write_cache_pages.isra.48.constprop.63+0x32e/0x400 [btrfs] [<ffffffffa00a1199>] extent_writepages+0x49/0x60 [btrfs] [<ffffffffa0081840>] ? 
btrfs_releasepage+0x40/0x40 [btrfs] [<ffffffffa007e993>] btrfs_writepages+0x23/0x30 [btrfs] [<ffffffff8115af9c>] do_writepages+0x1c/0x30 [<ffffffff811f6d73>] __writeback_single_inode+0x33/0x180 [<ffffffff811f7568>] writeback_sb_inodes+0x2a8/0x5b0 [<ffffffff811f7abb>] wb_writeback+0xeb/0x1f0 [<ffffffff811f80f2>] wb_workfn+0xd2/0x280 [<ffffffff810908a5>] process_one_work+0x1d5/0x490 [<ffffffff81090845>] ? process_one_work+0x175/0x490 [<ffffffff81090ba9>] worker_thread+0x49/0x490 [<ffffffff81090b60>] ? process_one_work+0x490/0x490 [<ffffffff81095d1e>] kthread+0xee/0x110 [<ffffffff81095c30>] ? kthread_park+0x60/0x60 [<ffffffff81790a52>] ret_from_fork+0x22/0x30 > > Heisenbugs man, literally the worst. > > I know you already had this in some email, but I lost it. I think you > narrowed it down to a specific set of system calls that seems to > trigger this best. fallocate and xattrs or something? I did. Or so I thought. Then iirc, it ran for a really long time with those. I'll revisit that just to be sure. Something else I tried: Trinity is very heavily stressing the fork path, with new children forking off and doing crazy shit all the time, segfaulting, repeat. I wrote a small test app that models all of that sans the "do crazy shit with syscalls", and that didn't do anything useful in terms of reproducing this. I hoped that had panned out, because I could totally see the relationship between reusing vmap'ing stacks and heavy fork() use. Perhaps I'm still missing something non-obvious. Dave ^ permalink raw reply [flat|nested] 118+ messages in thread
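[Editorial note: the fork-heavy reproducer Dave describes — children forking off, doing work, and segfaulting in a tight loop — can be modeled with the hypothetical sketch below. This is not Dave's actual test app; `churn_children` and its behavior are invented for illustration. The point of interest for the bug is kernel-side: every reaped child frees its task stack, and with vmap'd stacks plus the per-cpu stack cache, that stack is recycled to the next fork almost immediately.]

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical model of trinity's process churn.  Each child stands in
 * for "fork, do crazy shit with syscalls, segfault"; here it simply
 * exits.  The parent reaps it at once, which is what forces the kernel
 * to free the child's stack and (with the stack cache) hand the same
 * vmap area straight to the next fork.  Returns children reaped. */
static int churn_children(int count)
{
    int reaped = 0;

    for (int i = 0; i < count; i++) {
        pid_t pid = fork();

        if (pid == 0)
            _exit(0);                          /* child: die right away */
        if (pid > 0 && waitpid(pid, NULL, 0) == pid)
            reaped++;                          /* parent: reap -> stack freed */
    }
    return reaped;
}
```

As Dave notes, a model like this on its own did not reproduce the corruption, suggesting the missing ingredient is in what the children do between fork and exit, not in the fork/exit churn itself.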
* Re: bio linked list corruption. 2016-10-26 16:48 ` Linus Torvalds 2016-10-26 18:18 ` Dave Jones @ 2016-10-26 18:42 ` Dave Jones 2016-10-26 19:06 ` Linus Torvalds 1 sibling, 1 reply; 118+ messages in thread From: Dave Jones @ 2016-10-26 18:42 UTC (permalink / raw) To: Linus Torvalds Cc: Chris Mason, Andy Lutomirski, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 09:48:39AM -0700, Linus Torvalds wrote: > I know you already had this in some email, but I lost it. I think you > narrowed it down to a specific set of system calls that seems to > trigger this best. fallocate and xattrs or something? So I was about to give that a shot again. That this has been running doing for 24hrs was bugging me. I ctrl-c'd the current run, and trinity just sat there, because all of its child processes are stuck in D state. The stacks show nearly all of them are stuck in sync_inodes_sb iotop & vmstat shows there is _zero_ io actually happening. So it's spent most the night doing next to nothing useful afaict. Chris ? Here's the /proc/pid/stack's of those children.. 
[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffffa009554f>] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs] [<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs] [<ffffffff811fbd4e>] sync_filesystem+0x6e/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30 [<ffffffff811fc294>] utimes_common+0xd4/0x190 [<ffffffff811fc457>] do_utimes+0x107/0x120 [<ffffffff811fc641>] SyS_futimesat+0xa1/0xd0 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 
0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] 
entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30 [<ffffffffa008ed32>] btrfs_fallocate+0xb2/0xfd0 [btrfs] [<ffffffff811c6c3e>] vfs_fallocate+0x13e/0x220 [<ffffffff811c79f3>] SyS_fallocate+0x43/0x80 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffffa009554f>] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs] [<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs] [<ffffffff811fbcdb>] sync_fs_one_sb+0x1b/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdd0>] sys_sync+0x50/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30 [<ffffffffa008c0f4>] btrfs_file_llseek+0x34/0x290 [btrfs] [<ffffffff811c9ab5>] SyS_lseek+0x85/0xa0 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30 [<ffffffffa0090a0e>] btrfs_sync_file+0x7e/0x360 [btrfs] [<ffffffff811fbbe6>] vfs_fsync_range+0x46/0xa0 [<ffffffffa00910e0>] btrfs_file_write_iter+0x3f0/0x550 [btrfs] [<ffffffff811c97d8>] do_iter_readv_writev+0xa8/0x100 [<ffffffff811ca162>] do_readv_writev+0x172/0x210 
[<ffffffff811ca42a>] vfs_writev+0x3a/0x50 [<ffffffff811ca5c0>] do_pwritev+0xb0/0xd0 [<ffffffff811cb592>] SyS_pwritev2+0x12/0x20 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811ea2d4>] __fdget_pos+0x44/0x50 [<ffffffff811de8ec>] SyS_getdents+0x6c/0x110 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30 [<ffffffff811c752a>] do_truncate+0x4a/0x90 [<ffffffff811c790c>] do_sys_ftruncate.constprop.19+0x10c/0x170 [<ffffffff811c7999>] SyS_ftruncate+0x9/0x10 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffffa0095413>] btrfs_wait_ordered_extents+0x1e3/0x2e0 [btrfs] [<ffffffffa0095647>] btrfs_wait_ordered_roots+0x137/0x200 [btrfs] [<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs] [<ffffffff811fbcdb>] sync_fs_one_sb+0x1b/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdd0>] sys_sync+0x50/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] 
sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff81149fbf>] wait_on_page_bit+0xaf/0xc0 [<ffffffff8114a121>] __filemap_fdatawait_range+0x151/0x170 [<ffffffff8114d79c>] filemap_fdatawait_keep_errors+0x1c/0x20 [<ffffffff811f59b3>] sync_inodes_sb+0x273/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30 [<ffffffffa0090d50>] btrfs_file_write_iter+0x60/0x550 [btrfs] [<ffffffff811c97d8>] do_iter_readv_writev+0xa8/0x100 [<ffffffff811ca162>] do_readv_writev+0x172/0x210 [<ffffffff811ca42a>] vfs_writev+0x3a/0x50 [<ffffffff811ca5c0>] do_pwritev+0xb0/0xd0 [<ffffffff811cb57c>] SyS_pwritev+0xc/0x10 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811ea2d4>] __fdget_pos+0x44/0x50 [<ffffffff811c9a48>] SyS_lseek+0x18/0xa0 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff 
[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] 
entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30 [<ffffffffa0090d50>] btrfs_file_write_iter+0x60/0x550 [btrfs] [<ffffffff811c8e64>] __vfs_write+0xc4/0x120 [<ffffffff811c9f03>] vfs_write+0xb3/0x1a0 [<ffffffff811cb3d4>] SyS_pwrite64+0x74/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] 
sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0 [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffffa009576b>] btrfs_start_ordered_extent+0x5b/0xb0 [btrfs] [<ffffffffa008bf5d>] lock_and_cleanup_extent_if_need+0x22d/0x290 [btrfs] [<ffffffffa008d1e8>] __btrfs_buffered_write+0x1b8/0x6e0 [btrfs] [<ffffffffa0090e60>] btrfs_file_write_iter+0x170/0x550 [btrfs] [<ffffffff811c97d8>] do_iter_readv_writev+0xa8/0x100 [<ffffffff811ca162>] do_readv_writev+0x172/0x210 [<ffffffff811ca42a>] vfs_writev+0x3a/0x50 [<ffffffff811ca5c0>] do_pwritev+0xb0/0xd0 [<ffffffff811cb57c>] 
SyS_pwritev+0xc/0x10 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30 [<ffffffffa0090d50>] btrfs_file_write_iter+0x60/0x550 [btrfs] [<ffffffff811c8e64>] __vfs_write+0xc4/0x120 [<ffffffff811c998d>] __kernel_write+0x4d/0xf0 [<ffffffff811f995d>] write_pipe_buf+0x6d/0x80 [<ffffffff811f9cdf>] __splice_from_pipe+0x12f/0x1b0 [<ffffffff811fabdc>] splice_from_pipe+0x4c/0x70 [<ffffffff811fac34>] default_file_splice_write+0x14/0x20 [<ffffffff811f9081>] direct_splice_actor+0x31/0x40 [<ffffffff811f972c>] splice_direct_to_actor+0xcc/0x1e0 [<ffffffff811f98d0>] do_splice_direct+0x90/0xb0 [<ffffffff811cad30>] do_sendfile+0x1b0/0x390 [<ffffffff811cb7cf>] SyS_sendfile64+0x5f/0xd0 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300 [<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20 [<ffffffff811cdd3f>] iterate_supers+0xaf/0x100 [<ffffffff811fbdb0>] sys_sync+0x30/0x90 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff81086eee>] do_sigtimedwait+0x16e/0x260 [<ffffffff81087084>] SYSC_rt_sigtimedwait+0xa4/0x100 [<ffffffff810870e9>] SyS_rt_sigtimedwait+0x9/0x10 [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 [<ffffffffffffffff>] 0xffffffffffffffff ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 18:42 ` Dave Jones @ 2016-10-26 19:06 ` Linus Torvalds 2016-10-26 20:00 ` Chris Mason 0 siblings, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-10-26 19:06 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Chris Mason, Andy Lutomirski, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner

On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones <davej@codemonkey.org.uk> wrote:
>
> The stacks show nearly all of them are stuck in sync_inodes_sb

That's just wb_wait_for_completion(), and it means that some IO isn't completing.

There's also a lot of processes waiting for inode_lock(), and a few waiting for mnt_want_write()

Ignoring those, we have

> [<ffffffffa009554f>] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
> [<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs]
> [<ffffffff811fbd4e>] sync_filesystem+0x6e/0xa0
> [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff

Don't know this one. There's a couple of them. Could there be some ABBA deadlock on the ordered roots waiting?

> [<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
> [<ffffffffa008ed32>] btrfs_fallocate+0xb2/0xfd0 [btrfs]
> [<ffffffff811c6c3e>] vfs_fallocate+0x13e/0x220
> [<ffffffff811c79f3>] SyS_fallocate+0x43/0x80
> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff

This one is also inode_lock(), and is interesting only because it's fallocate(), which has shown up so many times before..

But there are other threads blocked on do_truncate, or btrfs_file_write_iter instead, or on lseek, so this is not different for any other reason.
> [<ffffffff81149fbf>] wait_on_page_bit+0xaf/0xc0
> [<ffffffff8114a121>] __filemap_fdatawait_range+0x151/0x170
> [<ffffffff8114d79c>] filemap_fdatawait_keep_errors+0x1c/0x20
> [<ffffffff811f59b3>] sync_inodes_sb+0x273/0x300
> [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
> [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff

This is actually waiting on the page. Possibly this is the IO that is never completing, and keeps the inode lock.

> [<ffffffffa009576b>] btrfs_start_ordered_extent+0x5b/0xb0 [btrfs]
> [<ffffffffa008bf5d>] lock_and_cleanup_extent_if_need+0x22d/0x290 [btrfs]
> [<ffffffffa008d1e8>] __btrfs_buffered_write+0x1b8/0x6e0 [btrfs]
> [<ffffffffa0090e60>] btrfs_file_write_iter+0x170/0x550 [btrfs]
> [<ffffffff811c97d8>] do_iter_readv_writev+0xa8/0x100
> [<ffffffff811ca162>] do_readv_writev+0x172/0x210
> [<ffffffff811ca42a>] vfs_writev+0x3a/0x50
> [<ffffffff811ca5c0>] do_pwritev+0xb0/0xd0
> [<ffffffff811cb57c>] SyS_pwritev+0xc/0x10
> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25

Hmm. This is the one that *started* the ordered extents (as opposed to the ones waiting for it)

I dunno. There might be a lost IO. More likely it's the same corruption that causes it, it just didn't result in an oops this time.

Linus

^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 19:06 ` Linus Torvalds @ 2016-10-26 20:00 ` Chris Mason 2016-10-26 21:52 ` Chris Mason 2016-10-26 22:07 ` Linus Torvalds 0 siblings, 2 replies; 118+ messages in thread From: Chris Mason @ 2016-10-26 20:00 UTC (permalink / raw) To: Linus Torvalds, Dave Jones, Andy Lutomirski, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 10/26/2016 03:06 PM, Linus Torvalds wrote: > On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones <davej@codemonkey.org.uk> wrote: >> >> The stacks show nearly all of them are stuck in sync_inodes_sb > > That's just wb_wait_for_completion(), and it means that some IO isn't > completing. > > There's also a lot of processes waiting for inode_lock(), and a few > waiting for mnt_want_write() > > Ignoring those, we have > >> [<ffffffffa009554f>] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs] >> [<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs] >> [<ffffffff811fbd4e>] sync_filesystem+0x6e/0xa0 >> [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 >> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 >> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 >> [<ffffffffffffffff>] 0xffffffffffffffff > > Don't know this one. There's a couple of them. Could there be some > ABBA deadlock on the ordered roots waiting? It's always possible, but we haven't changed anything here. I've tried a long list of things to reproduce this on my test boxes, including days of trinity runs and a kernel module to exercise vmalloc, and thread creation. Today I turned off every CONFIG_DEBUG_* except for list debugging, and ran dbench 2048: [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 __list_add+0xbe/0xd0 [ 2759.119652] list_add corruption. prev->next should be next (ffffe8ffffc80308), but was ffffc90000ccfb88. (prev=ffff880128522380). 
[ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
[ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 4.9.0-rc1-15246-g4ce9206-dirty #317
[ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
[ 2759.125077] ffffc9000f6fb868 ffffffff814fe4ff ffffffff8151cb5e ffffc9000f6fb8c8
[ 2759.125077] ffffc9000f6fb8c8 0000000000000000 ffffc9000f6fb8b8 ffffffff81064bbf
[ 2759.127444] ffff880128523680 0000002139968000 ffff880138b7a4a0 ffff880128523540
[ 2759.127444] Call Trace:
[ 2759.127444]  [<ffffffff814fe4ff>] dump_stack+0x53/0x74
[ 2759.127444]  [<ffffffff8151cb5e>] ? __list_add+0xbe/0xd0
[ 2759.127444]  [<ffffffff81064bbf>] __warn+0xff/0x120
[ 2759.127444]  [<ffffffff81064c99>] warn_slowpath_fmt+0x49/0x50
[ 2759.127444]  [<ffffffff8151cb5e>] __list_add+0xbe/0xd0
[ 2759.127444]  [<ffffffff814df338>] blk_sq_make_request+0x388/0x580
[ 2759.127444]  [<ffffffff814d5b44>] generic_make_request+0x104/0x200
[ 2759.127444]  [<ffffffff814d5ca5>] submit_bio+0x65/0x130
[ 2759.127444]  [<ffffffff8152a946>] ? __percpu_counter_add+0x96/0xd0
[ 2759.127444]  [<ffffffff814260bc>] btrfs_map_bio+0x23c/0x310
[ 2759.127444]  [<ffffffff813f42b3>] btrfs_submit_bio_hook+0xd3/0x190
[ 2759.127444]  [<ffffffff814117ad>] submit_one_bio+0x6d/0xa0
[ 2759.127444]  [<ffffffff8141182e>] flush_epd_write_bio+0x4e/0x70
[ 2759.127444]  [<ffffffff81418d8d>] extent_writepages+0x5d/0x70
[ 2759.127444]  [<ffffffff813f84e0>] ? btrfs_releasepage+0x50/0x50
[ 2759.127444]  [<ffffffff81220ffe>] ? wbc_attach_and_unlock_inode+0x6e/0x170
[ 2759.127444]  [<ffffffff813f5047>] btrfs_writepages+0x27/0x30
[ 2759.127444]  [<ffffffff81178690>] do_writepages+0x20/0x30
[ 2759.127444]  [<ffffffff81167d85>] __filemap_fdatawrite_range+0xb5/0x100
[ 2759.127444]  [<ffffffff81168263>] filemap_fdatawrite_range+0x13/0x20
[ 2759.127444]  [<ffffffff81405a7b>] btrfs_fdatawrite_range+0x2b/0x70
[ 2759.127444]  [<ffffffff81405ba8>] btrfs_sync_file+0x88/0x490
[ 2759.127444]  [<ffffffff810751e2>] ? group_send_sig_info+0x42/0x80
[ 2759.127444]  [<ffffffff8107527d>] ? kill_pid_info+0x5d/0x90
[ 2759.127444]  [<ffffffff8107564a>] ? SYSC_kill+0xba/0x1d0
[ 2759.127444]  [<ffffffff811f2638>] ? __sb_end_write+0x58/0x80
[ 2759.127444]  [<ffffffff81225b9c>] vfs_fsync_range+0x4c/0xb0
[ 2759.127444]  [<ffffffff81002501>] ? syscall_trace_enter+0x201/0x2e0
[ 2759.127444]  [<ffffffff81225c1c>] vfs_fsync+0x1c/0x20
[ 2759.127444]  [<ffffffff81225c5d>] do_fsync+0x3d/0x70
[ 2759.127444]  [<ffffffff810029cb>] ? syscall_slow_exit_work+0xfb/0x100
[ 2759.127444]  [<ffffffff81225cc0>] SyS_fsync+0x10/0x20
[ 2759.127444]  [<ffffffff81002b65>] do_syscall_64+0x55/0xd0
[ 2759.127444]  [<ffffffff810026d7>] ? prepare_exit_to_usermode+0x37/0x40
[ 2759.127444]  [<ffffffff819ad986>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2759.150635] ---[ end trace 3b5b7e2ef61c3d02 ]---

I put a variant of your suggested patch in place, but my printk never triggered. Now that I've made it happen once, I'll make sure I can do it over and over again. This doesn't have the patches that Andy asked Davej to try out yet, but I'll try them once I have a reliable reproducer.

diff --git a/kernel/fork.c b/kernel/fork.c
index 623259f..de95e19 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -165,7 +165,7 @@ void __weak arch_release_thread_stack(unsigned long *stack)
  * vmalloc() is a bit slow, and calling vfree() enough times will force a TLB
  * flush. Try to minimize the number of calls by caching stacks.
  */
-#define NR_CACHED_STACKS 2
+#define NR_CACHED_STACKS 256
 static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
 #endif
@@ -173,7 +173,9 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 #ifdef CONFIG_VMAP_STACK
 	void *stack;
+	char *p;
 	int i;
+	int j;
 
 	local_irq_disable();
 	for (i = 0; i < NR_CACHED_STACKS; i++) {
@@ -183,7 +185,15 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 			continue;
 		this_cpu_write(cached_stacks[i], NULL);
 
+		p = s->addr;
+		for (j = 0; j < THREAD_SIZE; j++) {
+			if (p[j] != 'c') {
+				printk_ratelimited(KERN_CRIT "bad poison %c byte %d\n", p[j], j);
+				break;
+			}
+		}
 		tsk->stack_vm_area = s;
+
 		local_irq_enable();
 		return s->addr;
 	}
@@ -219,6 +229,7 @@ static inline void free_thread_stack(struct task_struct *tsk)
 	int i;
 
 	local_irq_save(flags);
+	memset(tsk->stack_vm_area->addr, 'c', THREAD_SIZE);
 	for (i = 0; i < NR_CACHED_STACKS; i++) {
 		if (this_cpu_read(cached_stacks[i]))
 			continue;

^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 20:00 ` Chris Mason @ 2016-10-26 21:52 ` Chris Mason 2016-10-26 22:21 ` Linus Torvalds 2016-10-26 22:07 ` Linus Torvalds 1 sibling, 1 reply; 118+ messages in thread From: Chris Mason @ 2016-10-26 21:52 UTC (permalink / raw) To: Linus Torvalds, Dave Jones, Andy Lutomirski, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 10/26/2016 04:00 PM, Chris Mason wrote: > > > On 10/26/2016 03:06 PM, Linus Torvalds wrote: >> On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones <davej@codemonkey.org.uk> wrote: >>> >>> The stacks show nearly all of them are stuck in sync_inodes_sb >> >> That's just wb_wait_for_completion(), and it means that some IO isn't >> completing. >> >> There's also a lot of processes waiting for inode_lock(), and a few >> waiting for mnt_want_write() >> >> Ignoring those, we have >> >>> [<ffffffffa009554f>] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs] >>> [<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs] >>> [<ffffffff811fbd4e>] sync_filesystem+0x6e/0xa0 >>> [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70 >>> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 >>> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25 >>> [<ffffffffffffffff>] 0xffffffffffffffff >> >> Don't know this one. There's a couple of them. Could there be some >> ABBA deadlock on the ordered roots waiting? > > It's always possible, but we haven't changed anything here. > > I've tried a long list of things to reproduce this on my test boxes, > including days of trinity runs and a kernel module to exercise vmalloc, > and thread creation. > > Today I turned off every CONFIG_DEBUG_* except for list debugging, and > ran dbench 2048: > This one is special because CONFIG_VMAP_STACK is not set. Btrfs triggers in < 10 minutes. I've done 30 minutes each with XFS and Ext4 without luck. This is all in a virtual machine that I can copy on to a bunch of hosts. 
So I'll get some parallel tests going tonight to narrow it down.

------------[ cut here ]------------
WARNING: CPU: 6 PID: 4481 at lib/list_debug.c:33 __list_add+0xbe/0xd0
list_add corruption. prev->next should be next (ffffe8ffffd80b08), but was ffff88012b65fb88. (prev=ffff880128c8d500).
Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper i2c_piix4 cryptd i2c_core virtio_net serio_raw floppy button pcspkr sch_fq_codel autofs4 virtio_blk
CPU: 6 PID: 4481 Comm: dbench Not tainted 4.9.0-rc2-15419-g811d54d #319
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
 ffff880104eff868 ffffffff814fde0f ffffffff8151c46e ffff880104eff8c8
 ffff880104eff8c8 0000000000000000 ffff880104eff8b8 ffffffff810648cf
 ffff880128cab2c0 000000213fc57c68 ffff8801384e8928 ffff880128cab180
Call Trace:
 [<ffffffff814fde0f>] dump_stack+0x53/0x74
 [<ffffffff8151c46e>] ? __list_add+0xbe/0xd0
 [<ffffffff810648cf>] __warn+0xff/0x120
 [<ffffffff810649a9>] warn_slowpath_fmt+0x49/0x50
 [<ffffffff8151c46e>] __list_add+0xbe/0xd0
 [<ffffffff814dec38>] blk_sq_make_request+0x388/0x580
 [<ffffffff814d5444>] generic_make_request+0x104/0x200
 [<ffffffff814d55a5>] submit_bio+0x65/0x130
 [<ffffffff8152a256>] ? __percpu_counter_add+0x96/0xd0
 [<ffffffff81425c7c>] btrfs_map_bio+0x23c/0x310
 [<ffffffff813f3e73>] btrfs_submit_bio_hook+0xd3/0x190
 [<ffffffff8141136d>] submit_one_bio+0x6d/0xa0
 [<ffffffff814113ee>] flush_epd_write_bio+0x4e/0x70
 [<ffffffff8141894d>] extent_writepages+0x5d/0x70
 [<ffffffff813f80a0>] ? btrfs_releasepage+0x50/0x50
 [<ffffffff81220d0e>] ? wbc_attach_and_unlock_inode+0x6e/0x170
 [<ffffffff813f4c07>] btrfs_writepages+0x27/0x30
 [<ffffffff811783a0>] do_writepages+0x20/0x30
 [<ffffffff81167a95>] __filemap_fdatawrite_range+0xb5/0x100
 [<ffffffff81167f73>] filemap_fdatawrite_range+0x13/0x20
 [<ffffffff8140563b>] btrfs_fdatawrite_range+0x2b/0x70
 [<ffffffff81405768>] btrfs_sync_file+0x88/0x490
 [<ffffffff81074ef2>] ? group_send_sig_info+0x42/0x80
 [<ffffffff81074f8d>] ? kill_pid_info+0x5d/0x90
 [<ffffffff8107535a>] ? SYSC_kill+0xba/0x1d0
 [<ffffffff811f2348>] ? __sb_end_write+0x58/0x80
 [<ffffffff812258ac>] vfs_fsync_range+0x4c/0xb0
 [<ffffffff81002501>] ? syscall_trace_enter+0x201/0x2e0
 [<ffffffff8122592c>] vfs_fsync+0x1c/0x20
 [<ffffffff8122596d>] do_fsync+0x3d/0x70
 [<ffffffff810029cb>] ? syscall_slow_exit_work+0xfb/0x100
 [<ffffffff812259d0>] SyS_fsync+0x10/0x20
 [<ffffffff81002b65>] do_syscall_64+0x55/0xd0
 [<ffffffff810026d7>] ? prepare_exit_to_usermode+0x37/0x40
 [<ffffffff819ad246>] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace efe6b17c6dba2a6e ]---

^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 21:52 ` Chris Mason @ 2016-10-26 22:21 ` Linus Torvalds 2016-10-26 22:40 ` Dave Jones 0 siblings, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-10-26 22:21 UTC (permalink / raw) To: Chris Mason Cc: Dave Jones, Andy Lutomirski, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner

[-- Attachment #1: Type: text/plain, Size: 2060 bytes --]

On Wed, Oct 26, 2016 at 2:52 PM, Chris Mason <clm@fb.com> wrote:
>
> This one is special because CONFIG_VMAP_STACK is not set. Btrfs triggers in < 10 minutes.
> I've done 30 minutes each with XFS and Ext4 without luck.

Ok, see the email I wrote that crossed yours - if it's really some list corruption on ctx->rq_list due to some locking problem, I really would expect CONFIG_VMAP_STACK to be entirely irrelevant, except perhaps from a timing standpoint.

> WARNING: CPU: 6 PID: 4481 at lib/list_debug.c:33 __list_add+0xbe/0xd0
> list_add corruption. prev->next should be next (ffffe8ffffd80b08), but was ffff88012b65fb88. (prev=ffff880128c8d500).
> Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper i2c_piix4 cryptd i2c_core virtio_net serio_raw floppy button pcspkr sch_fq_codel autofs4 virtio_blk
> CPU: 6 PID: 4481 Comm: dbench Not tainted 4.9.0-rc2-15419-g811d54d #319
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
>  ffff880104eff868 ffffffff814fde0f ffffffff8151c46e ffff880104eff8c8
>  ffff880104eff8c8 0000000000000000 ffff880104eff8b8 ffffffff810648cf
>  ffff880128cab2c0 000000213fc57c68 ffff8801384e8928 ffff880128cab180
> Call Trace:
>  [<ffffffff814fde0f>] dump_stack+0x53/0x74
>  [<ffffffff8151c46e>] ? __list_add+0xbe/0xd0
>  [<ffffffff810648cf>] __warn+0xff/0x120
>  [<ffffffff810649a9>] warn_slowpath_fmt+0x49/0x50
>  [<ffffffff8151c46e>] __list_add+0xbe/0xd0
>  [<ffffffff814dec38>] blk_sq_make_request+0x388/0x580
>  [<ffffffff814d5444>] generic_make_request+0x104/0x200

Well, it's very consistent, I have to say. So I really don't think this is random corruption.

Could you try the attached patch? It adds a couple of sanity tests:

 - a number of tests to verify that 'rq->queuelist' isn't already on some queue when it is added to a queue

 - one test to verify that rq->mq_ctx is the same ctx that we have locked.

I may be completely full of shit, and this patch may be pure garbage or "obviously will never trigger", but humor me.

Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 2059 bytes --]

 block/blk-mq.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..4f575de7fdd0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -521,6 +521,8 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head)
 	 */
 	BUG_ON(rq->cmd_flags & REQ_SOFTBARRIER);
 
+WARN_ON_ONCE(!list_empty(&rq->queuelist));
+
 	spin_lock_irqsave(&q->requeue_lock, flags);
 	if (at_head) {
 		rq->cmd_flags |= REQ_SOFTBARRIER;
@@ -838,6 +840,7 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 			queued++;
 			break;
 		case BLK_MQ_RQ_QUEUE_BUSY:
+WARN_ON_ONCE(!list_empty(&rq->queuelist));
 			list_add(&rq->queuelist, &rq_list);
 			__blk_mq_requeue_request(rq);
 			break;
@@ -1034,6 +1037,8 @@ static inline void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
 	trace_block_rq_insert(hctx->queue, rq);
 
+WARN_ON_ONCE(!list_empty(&rq->queuelist));
+
 	if (at_head)
 		list_add(&rq->queuelist, &ctx->rq_list);
 	else
@@ -1137,6 +1142,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 			depth = 0;
 		}
 
+WARN_ON_ONCE(!list_empty(&rq->queuelist));
 		depth++;
 		list_add_tail(&rq->queuelist, &ctx_list);
 	}
@@ -1172,6 +1178,7 @@ static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
 		blk_mq_bio_to_request(rq, bio);
 		spin_lock(&ctx->lock);
 insert_rq:
+WARN_ON_ONCE(rq->mq_ctx != ctx);
 		__blk_mq_insert_request(hctx, rq, false);
 		spin_unlock(&ctx->lock);
 		return false;
@@ -1326,6 +1333,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 			old_rq = same_queue_rq;
 			list_del_init(&old_rq->queuelist);
 		}
+WARN_ON_ONCE(!list_empty(&rq->queuelist));
 		list_add_tail(&rq->queuelist, &plug->mq_list);
 	} else /* is_sync */
 		old_rq = rq;
@@ -1412,6 +1420,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 			trace_block_plug(q);
 		}
 
+WARN_ON_ONCE(!list_empty(&rq->queuelist));
 		list_add_tail(&rq->queuelist, &plug->mq_list);
 		return cookie;
 	}

^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 22:21 ` Linus Torvalds @ 2016-10-26 22:40 ` Dave Jones 2016-10-26 22:51 ` Linus Torvalds 2016-10-26 22:52 ` Jens Axboe 0 siblings, 2 replies; 118+ messages in thread From: Dave Jones @ 2016-10-26 22:40 UTC (permalink / raw) To: Linus Torvalds Cc: Chris Mason, Andy Lutomirski, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 03:21:53PM -0700, Linus Torvalds wrote: > Could you try the attached patch? It adds a couple of sanity tests: > > - a number of tests to verify that 'rq->queuelist' isn't already on > some queue when it is added to a queue > > - one test to verify that rq->mq_ctx is the same ctx that we have locked. > > I may be completely full of shit, and this patch may be pure garbage > or "obviously will never trigger", but humor me. I gave it a shot too for shits & giggles. This falls out during boot. [ 9.244030] EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: (null) [ 9.271391] ------------[ cut here ]------------ [ 9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 blk_sq_make_request+0x465/0x4a0 [ 9.285613] CPU: 0 PID: 1 Comm: init Not tainted 4.9.0-rc2-think+ #4 [ 9.300106] ffffc90000013848 [ 9.307353] ffffffff8130d27c [ 9.307420] 0000000000000000 [ 9.307456] 0000000000000000 [ 9.307494] ffffc90000013888 [ 9.314780] ffffffff81077a41 [ 9.314847] 0000049d00013898 [ 9.314884] 0000000000000001 [ 9.314922] 0000000000000002 [ 9.322213] ffffe8ffffc06600 [ 9.322280] ffffe8ffff606600 [ 9.322317] ffff8805015c2e80 [ 9.322355] Call Trace: [ 9.329689] [<ffffffff8130d27c>] dump_stack+0x4f/0x73 [ 9.337208] [<ffffffff81077a41>] __warn+0xc1/0xe0 [ 9.344730] [<ffffffff81077b18>] warn_slowpath_null+0x18/0x20 [ 9.352282] [<ffffffff812f7975>] blk_sq_make_request+0x465/0x4a0 [ 9.359816] [<ffffffff812eb44a>] ? 
generic_make_request+0xca/0x210 [ 9.367367] [<ffffffff812eb457>] generic_make_request+0xd7/0x210 [ 9.374931] [<ffffffff812eb5f9>] submit_bio+0x69/0x120 [ 9.382488] [<ffffffff812004ef>] ? submit_bh_wbc+0x16f/0x1e0 [ 9.390083] [<ffffffff812004ef>] submit_bh_wbc+0x16f/0x1e0 [ 9.397673] [<ffffffff81200bfe>] submit_bh+0xe/0x10 [ 9.405205] [<ffffffff81255eac>] __ext4_get_inode_loc+0x1ac/0x3d0 [ 9.412795] [<ffffffff81258f4b>] ext4_iget+0x6b/0xb70 [ 9.420366] [<ffffffff811d6018>] ? lookup_slow+0xf8/0x1f0 [ 9.427947] [<ffffffff81259a7a>] ext4_iget_normal+0x2a/0x30 [ 9.435541] [<ffffffff81264734>] ext4_lookup+0x114/0x230 ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 22:40 ` Dave Jones @ 2016-10-26 22:51 ` Linus Torvalds 2016-10-26 22:55 ` Jens Axboe ` (2 more replies) 2016-10-26 22:52 ` Jens Axboe 1 sibling, 3 replies; 118+ messages in thread From: Linus Torvalds @ 2016-10-26 22:51 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Chris Mason, Andy Lutomirski, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 3:40 PM, Dave Jones <davej@codemonkey.org.uk> wrote: > > I gave it a shot too for shits & giggles. > This falls out during boot. > > [ 9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 blk_sq_make_request+0x465/0x4a0 Hmm. That's the WARN_ON_ONCE(rq->mq_ctx != ctx); that I added to blk_mq_merge_queue_io(), and I really think that warning is valid, and the fact that it triggers shows that something is wrong with locking. We just did a spin_lock(&ctx->lock); and that lock is *supposed* to protect the __blk_mq_insert_request(), but that uses rq->mq_ctx. So if rq->mq_ctx != ctx, then we're locking the wrong context. Jens - please explain to me why I'm wrong. Or maybe I actually might have found the problem? In which case please send me a patch that fixes it ;) Dave: it might be a good idea to split that "WARN_ON_ONCE()" in blk_mq_merge_queue_io() into two, since right now it can trigger both for the blk_mq_bio_to_request(rq, bio); path _and_ for the if (!blk_mq_attempt_merge(q, ctx, bio)) { blk_mq_bio_to_request(rq, bio); goto insert_rq; path. If you split it into two: one before that "insert_rq:" label, and one before the "goto insert_rq" thing, then we could see if it is just one of the blk_mq_merge_queue_io() cases (or both) that is broken.. Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 22:51 ` Linus Torvalds @ 2016-10-26 22:55 ` Jens Axboe 2016-10-26 22:58 ` Linus Torvalds 2016-10-26 23:01 ` Dave Jones 2 siblings, 0 replies; 118+ messages in thread From: Jens Axboe @ 2016-10-26 22:55 UTC (permalink / raw) To: Linus Torvalds, Dave Jones, Chris Mason, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 10/26/2016 04:51 PM, Linus Torvalds wrote: > On Wed, Oct 26, 2016 at 3:40 PM, Dave Jones <davej@codemonkey.org.uk> wrote: >> >> I gave it a shot too for shits & giggles. >> This falls out during boot. >> >> [ 9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 blk_sq_make_request+0x465/0x4a0 > > Hmm. That's the > > WARN_ON_ONCE(rq->mq_ctx != ctx); > > that I added to blk_mq_merge_queue_io(), and I really think that > warning is valid, and the fact that it triggers shows that something > is wrong with locking. > > We just did a > > spin_lock(&ctx->lock); > > and that lock is *supposed* to protect the __blk_mq_insert_request(), > but that uses rq->mq_ctx. > > So if rq->mq_ctx != ctx, then we're locking the wrong context. > > Jens - please explain to me why I'm wrong. > > Or maybe I actually might have found the problem? In which case please > send me a patch that fixes it ;) I think you're pretty close, the two should not be different and I don't immediately see how. I'll run some testing here, should be easier with this knowledge. > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > blk_mq_merge_queue_io() into two, since right now it can trigger both > for the > > blk_mq_bio_to_request(rq, bio); > > path _and_ for the > > if (!blk_mq_attempt_merge(q, ctx, bio)) { > blk_mq_bio_to_request(rq, bio); > goto insert_rq; And just in case I can't trigger, would be interesting to add a call to blk_rq_dump_flags() as well, in case this is some special request. -- Jens Axboe ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 22:51 ` Linus Torvalds 2016-10-26 22:55 ` Jens Axboe @ 2016-10-26 22:58 ` Linus Torvalds 2016-10-26 23:03 ` Jens Axboe 2016-10-26 23:01 ` Dave Jones 2 siblings, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-10-26 22:58 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Chris Mason, Andy Lutomirski, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > blk_mq_merge_queue_io() into two I did that myself too, since Dave sees this during boot. But I'm not getting the warning ;( Dave gets it with ext4, and that's what I have too, so I'm not sure what the required trigger would be. Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 22:58 ` Linus Torvalds @ 2016-10-26 23:03 ` Jens Axboe 2016-10-26 23:07 ` Dave Jones ` (3 more replies) 0 siblings, 4 replies; 118+ messages in thread From: Jens Axboe @ 2016-10-26 23:03 UTC (permalink / raw) To: Linus Torvalds, Dave Jones, Chris Mason, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 10/26/2016 04:58 PM, Linus Torvalds wrote: > On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in >> blk_mq_merge_queue_io() into two > > I did that myself too, since Dave sees this during boot. > > But I'm not getting the warning ;( > > Dave gets it with ext4, and thats' what I have too, so I'm not sure > what the required trigger would be. Actually, I think I see what might trigger it. You are on nvme, iirc, and that has a deep queue. Dave, are you testing on a sata drive or something similar with a shallower queue depth? If we end up sleeping for a request, I think we could trigger data->ctx being different. Dave, can you hit the warnings with this? Totally untested... diff --git a/block/blk-mq.c b/block/blk-mq.c index ddc2eed64771..80a9c45a9235 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1217,9 +1217,7 @@ static struct request *blk_mq_map_request(struct request_queue *q, blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx); rq = __blk_mq_alloc_request(&alloc_data, op, op_flags); - hctx->queued++; - data->hctx = hctx; - data->ctx = ctx; + data->hctx->queued++; return rq; } -- Jens Axboe ^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 23:03 ` Jens Axboe @ 2016-10-26 23:07 ` Dave Jones 2016-10-26 23:08 ` Linus Torvalds ` (2 subsequent siblings) 3 siblings, 0 replies; 118+ messages in thread From: Dave Jones @ 2016-10-26 23:07 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Chris Mason, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote: > On 10/26/2016 04:58 PM, Linus Torvalds wrote: > > On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > >> > >> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > >> blk_mq_merge_queue_io() into two > > > > I did that myself too, since Dave sees this during boot. > > > > But I'm not getting the warning ;( > > > > Dave gets it with ext4, and thats' what I have too, so I'm not sure > > what the required trigger would be. > > Actually, I think I see what might trigger it. You are on nvme, iirc, > and that has a deep queue. Dave, are you testing on a sata drive or > something similar with a shallower queue depth? If we end up sleeping > for a request, I think we could trigger data->ctx being different. yeah, just regular sata. I've been meaning to put an ssd in that box for a while, but now it sounds like my procrastination may have paid off. > Dave, can you hit the warnings with this? Totally untested... Coming up.. Dave ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 23:03 ` Jens Axboe 2016-10-26 23:07 ` Dave Jones @ 2016-10-26 23:08 ` Linus Torvalds 2016-10-26 23:20 ` Jens Axboe 2016-10-26 23:19 ` Chris Mason 2016-10-27 6:33 ` Christoph Hellwig 3 siblings, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-10-26 23:08 UTC (permalink / raw) To: Jens Axboe Cc: Dave Jones, Chris Mason, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe <axboe@fb.com> wrote: > > Actually, I think I see what might trigger it. You are on nvme, iirc, > and that has a deep queue. Yes. I have long since moved on from slow disks, so all my systems are not just flash, but m.2 nvme ssd's. So at least that could explain why Dave sees it at bootup but I don't. Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 23:08 ` Linus Torvalds @ 2016-10-26 23:20 ` Jens Axboe 2016-10-26 23:38 ` Chris Mason 0 siblings, 1 reply; 118+ messages in thread From: Jens Axboe @ 2016-10-26 23:20 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Chris Mason, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 10/26/2016 05:08 PM, Linus Torvalds wrote: > On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe <axboe@fb.com> wrote: >> >> Actually, I think I see what might trigger it. You are on nvme, iirc, >> and that has a deep queue. > > Yes. I have long since moved on from slow disks, so all my systems are > not just flash, but m.2 nvme ssd's. > > So at least that could explain why Dave sees it at bootup but I don't. Yep, you'd never sleep during normal boot. The original patch had a problem, though... This one should be better. Too many 'data's, we'll still need to assign ctx/hctx, we should just use the current ones, not the original ones. diff --git a/block/blk-mq.c b/block/blk-mq.c index ddc2eed64771..e56fec187ba6 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1217,9 +1224,9 @@ static struct request *blk_mq_map_request(struct request_queue *q, blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx); rq = __blk_mq_alloc_request(&alloc_data, op, op_flags); - hctx->queued++; - data->hctx = hctx; - data->ctx = ctx; + data->hctx = alloc_data.hctx; + data->ctx = alloc_data.ctx; + data->hctx->queued++; return rq; } -- Jens Axboe ^ permalink raw reply related [flat|nested] 118+ messages in thread
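[Editorial note: the reason the v2 patch copies alloc_data.hctx/alloc_data.ctx back instead of the pre-allocation locals can be shown with a minimal userspace model. This is a heavily simplified sketch, not blk_mq_map_request() itself — the allocator stub and the use_fixed_version switch are assumptions for illustration only.]

```c
/*
 * Model of the stale-context bug: if __blk_mq_alloc_request() blocks
 * waiting for a free request (shallow SATA queue), the task can migrate
 * to another CPU; the allocator re-derives the ctx and records it in
 * alloc_data. Copying the locals sampled *before* sleeping hands the
 * caller a stale context, so it later takes the wrong ctx->lock and
 * trips WARN_ON_ONCE(rq->mq_ctx != ctx).
 */
struct blk_mq_ctx { int cpu; };

struct blk_mq_alloc_data {
	struct blk_mq_ctx *ctx;   /* updated by the allocator if it sleeps */
};

/* stub allocator: pretend we slept and woke up on migrated's CPU */
static void alloc_request_may_sleep(struct blk_mq_alloc_data *ad,
				    struct blk_mq_ctx *migrated)
{
	ad->ctx = migrated;
}

/*
 * use_fixed_version == 0 models the old "data->ctx = ctx";
 * use_fixed_version == 1 models Jens's "data->ctx = alloc_data.ctx".
 * Returns the ctx the caller would lock against.
 */
static struct blk_mq_ctx *map_request(struct blk_mq_ctx *start,
				      struct blk_mq_ctx *migrated,
				      int use_fixed_version)
{
	struct blk_mq_ctx *ctx = start;   /* sampled before sleeping */
	struct blk_mq_alloc_data alloc_data = { ctx };

	alloc_request_may_sleep(&alloc_data, migrated);

	return use_fixed_version ? alloc_data.ctx : ctx;
}
```

On a deep nvme queue the allocation never blocks, the two pointers stay equal, and the buggy copy is harmless — consistent with Dave hitting it on SATA while Linus couldn't.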
* Re: bio linked list corruption. 2016-10-26 23:20 ` Jens Axboe @ 2016-10-26 23:38 ` Chris Mason 2016-10-26 23:47 ` Dave Jones 0 siblings, 1 reply; 118+ messages in thread From: Chris Mason @ 2016-10-26 23:38 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Dave Jones, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 05:20:01PM -0600, Jens Axboe wrote: >On 10/26/2016 05:08 PM, Linus Torvalds wrote: >>On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe <axboe@fb.com> wrote: >>> >>>Actually, I think I see what might trigger it. You are on nvme, iirc, >>>and that has a deep queue. >> >>Yes. I have long since moved on from slow disks, so all my systems are >>not just flash, but m.2 nvme ssd's. >> >>So at least that could explain why Dave sees it at bootup but I don't. > >Yep, you'd never sleep during normal boot. The original patch had a >problem, though... This one should be better. Too many 'data's, we'll >still need to assign ctx/hctx, we should just use the current ones, not >the original ones. > >diff --git a/block/blk-mq.c b/block/blk-mq.c >index ddc2eed64771..e56fec187ba6 100644 >--- a/block/blk-mq.c >+++ b/block/blk-mq.c >@@ -1217,9 +1224,9 @@ static struct request *blk_mq_map_request(struct >request_queue *q, > blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx); > rq = __blk_mq_alloc_request(&alloc_data, op, op_flags); > >- hctx->queued++; >- data->hctx = hctx; >- data->ctx = ctx; >+ data->hctx = alloc_data.hctx; >+ data->ctx = alloc_data.ctx; >+ data->hctx->queued++; > return rq; > } This made it through an entire dbench 2048 run on btrfs. My script has it running in a loop, but this is farther than I've gotten before. Looking great so far. -chris ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 23:38 ` Chris Mason @ 2016-10-26 23:47 ` Dave Jones 2016-10-27 0:00 ` Jens Axboe 2016-10-31 18:55 ` Dave Jones 0 siblings, 2 replies; 118+ messages in thread From: Dave Jones @ 2016-10-26 23:47 UTC (permalink / raw) To: Chris Mason, Jens Axboe, Linus Torvalds, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: > >- hctx->queued++; > >- data->hctx = hctx; > >- data->ctx = ctx; > >+ data->hctx = alloc_data.hctx; > >+ data->ctx = alloc_data.ctx; > >+ data->hctx->queued++; > > return rq; > > } > > This made it through an entire dbench 2048 run on btrfs. My script has > it running in a loop, but this is farther than I've gotten before. > Looking great so far. Fixed the splat during boot for me too. Now the fun part, let's see if it fixed the 'weird shit' that Trinity was stumbling on. Dave ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 23:47 ` Dave Jones @ 2016-10-27 0:00 ` Jens Axboe 2016-10-27 13:33 ` Chris Mason 2016-10-31 18:55 ` Dave Jones 1 sibling, 1 reply; 118+ messages in thread From: Jens Axboe @ 2016-10-27 0:00 UTC (permalink / raw) To: Dave Jones, Chris Mason, Linus Torvalds, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 10/26/2016 05:47 PM, Dave Jones wrote: > On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: > > > >- hctx->queued++; > > >- data->hctx = hctx; > > >- data->ctx = ctx; > > >+ data->hctx = alloc_data.hctx; > > >+ data->ctx = alloc_data.ctx; > > >+ data->hctx->queued++; > > > return rq; > > > } > > > > This made it through an entire dbench 2048 run on btrfs. My script has > > it running in a loop, but this is farther than I've gotten before. > > Looking great so far. > > Fixed the splat during boot for me too. > Now the fun part, let's see if it fixed the 'weird shit' that Trinity > was stumbling on. Let's let the testing simmer overnight, then I'll turn this into a real patch tomorrow and get it submitted. -- Jens Axboe ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-27 0:00 ` Jens Axboe @ 2016-10-27 13:33 ` Chris Mason 0 siblings, 0 replies; 118+ messages in thread From: Chris Mason @ 2016-10-27 13:33 UTC (permalink / raw) To: Jens Axboe, Dave Jones, Linus Torvalds, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 10/26/2016 08:00 PM, Jens Axboe wrote: > On 10/26/2016 05:47 PM, Dave Jones wrote: >> On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: >> >> > >- hctx->queued++; >> > >- data->hctx = hctx; >> > >- data->ctx = ctx; >> > >+ data->hctx = alloc_data.hctx; >> > >+ data->ctx = alloc_data.ctx; >> > >+ data->hctx->queued++; >> > > return rq; >> > > } >> > >> > This made it through an entire dbench 2048 run on btrfs. My script >> has >> > it running in a loop, but this is farther than I've gotten before. >> > Looking great so far. >> >> Fixed the splat during boot for me too. >> Now the fun part, let's see if it fixed the 'weird shit' that Trinity >> was stumbling on. > > Let's let the testing simmer overnight, then I'll turn this into a real > patch tomorrow and get it submitted. > I ran all night on both btrfs and xfs. XFS came out clean, but btrfs hit the WARN_ON below. I hit it a few times with Jens' patch, always the same warning. It's pretty obviously a btrfs bug, we're not cleaning up this list properly during fsync. I tried a v1 of a btrfs fix overnight, but I see where it was incomplete now and will re-run. For the blk-mq bug, I think we got it! Tested-by: always-blaming-jens-from-now-on <clm@fb.com> WARNING: CPU: 5 PID: 16163 at lib/list_debug.c:62 __list_del_entry+0x86/0xd0 list_del corruption. 
next->prev should be ffff8801196d3be0, but was ffff88010fc63308 Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper i2c_piix4 lrw i2c_core gf128mul ablk_helper virtio_net serio_raw button pcspkr floppy cryptd sch_fq_codel autofs4 virtio_blk CPU: 5 PID: 16163 Comm: dbench Not tainted 4.9.0-rc2-00041-g811d54d-dirty #322 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014 ffff8801196d3a68 ffffffff814fde3f ffffffff8151c356 ffff8801196d3ac8 ffff8801196d3ac8 0000000000000000 ffff8801196d3ab8 ffffffff810648cf dead000000000100 0000003e813bfc4a ffff8801196d3b98 ffff880122b5c800 Call Trace: [<ffffffff814fde3f>] dump_stack+0x53/0x74 [<ffffffff8151c356>] ? __list_del_entry+0x86/0xd0 [<ffffffff810648cf>] __warn+0xff/0x120 [<ffffffff810649a9>] warn_slowpath_fmt+0x49/0x50 [<ffffffff8151c356>] __list_del_entry+0x86/0xd0 [<ffffffff8143618d>] btrfs_sync_log+0x75d/0xbd0 [<ffffffff8143cfa7>] ? btrfs_log_inode_parent+0x547/0xbb0 [<ffffffff819ad01b>] ? _raw_spin_lock+0x1b/0x40 [<ffffffff8108d1a3>] ? __might_sleep+0x53/0xa0 [<ffffffff812095c5>] ? dput+0x65/0x280 [<ffffffff8143d717>] ? btrfs_log_dentry_safe+0x77/0x90 [<ffffffff81405b04>] btrfs_sync_file+0x424/0x490 [<ffffffff8107535a>] ? SYSC_kill+0xba/0x1d0 [<ffffffff811f2348>] ? __sb_end_write+0x58/0x80 [<ffffffff812258ac>] vfs_fsync_range+0x4c/0xb0 [<ffffffff81002501>] ? syscall_trace_enter+0x201/0x2e0 [<ffffffff8122592c>] vfs_fsync+0x1c/0x20 [<ffffffff8122596d>] do_fsync+0x3d/0x70 [<ffffffff810029cb>] ? syscall_slow_exit_work+0xfb/0x100 [<ffffffff812259d0>] SyS_fsync+0x10/0x20 [<ffffffff81002b65>] do_syscall_64+0x55/0xd0 [<ffffffff810026d7>] ? prepare_exit_to_usermode+0x37/0x40 [<ffffffff819ad286>] entry_SYSCALL64_slow_path+0x25/0x25 ---[ end trace c93288442a6424aa ]--- ^ permalink raw reply [flat|nested] 118+ messages in thread
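[Editorial note: the "list_del corruption. next->prev should be X, but was Y" message above comes from the kernel's CONFIG_DEBUG_LIST sanity check: before unlinking, it verifies that both neighbours still point back at the entry being removed. A simplified standalone version of that check — the real one lives in lib/list_debug.c and warns rather than returning an error code:]

```c
/* Minimal circular doubly linked list with a del-time consistency
 * check modeled on the kernel's __list_del_entry() debug path. */
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *e, struct list_head *h)
{
	e->prev = h->prev;
	e->next = h;
	h->prev->next = e;
	h->prev = e;
}

/* returns 0 on success, -1 if the list is corrupted (kernel would WARN) */
static int checked_list_del(struct list_head *entry)
{
	if (entry->next->prev != entry || entry->prev->next != entry)
		return -1;   /* someone else rewired the neighbours */
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry->prev = NULL;
	return 0;
}
```

The check catches exactly the situation in the splat: another writer (here, the unsynchronized log-list handling in btrfs_sync_log()) modified a neighbour's back-pointer between insertion and removal.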
* Re: bio linked list corruption. 2016-10-26 23:47 ` Dave Jones 2016-10-27 0:00 ` Jens Axboe @ 2016-10-31 18:55 ` Dave Jones 2016-10-31 19:35 ` Linus Torvalds 1 sibling, 1 reply; 118+ messages in thread From: Dave Jones @ 2016-10-31 18:55 UTC (permalink / raw) To: Chris Mason, Jens Axboe, Linus Torvalds, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 07:47:51PM -0400, Dave Jones wrote: > On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: > > > >- hctx->queued++; > > >- data->hctx = hctx; > > >- data->ctx = ctx; > > >+ data->hctx = alloc_data.hctx; > > >+ data->ctx = alloc_data.ctx; > > >+ data->hctx->queued++; > > > return rq; > > > } > > > > This made it through an entire dbench 2048 run on btrfs. My script has > > it running in a loop, but this is farther than I've gotten before. > > Looking great so far. > > Fixed the splat during boot for me too. > Now the fun part, let's see if it fixed the 'weird shit' that Trinity > was stumbling on. It took a while, but.. bad news. 
BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c flags: 0x400000000000000c(referenced|uptodate) page dumped because: non-NULL mapping CPU: 3 PID: 1586 Comm: kworker/u8:12 Not tainted 4.9.0-rc3-think+ #1 Workqueue: writeback wb_workfn (flush-btrfs-2) ffffc90000777828 ffffffff8130d04c ffffea0013838e40 ffffffff819ff654 ffffc90000777850 ffffffff81150e5f 0000000000000000 ffffea0013838e40 400000000000000c ffffc90000777860 ffffffff81150f1a ffffc900007778a8 Call Trace: [<ffffffff8130d04c>] dump_stack+0x4f/0x73 [<ffffffff81150e5f>] bad_page+0xbf/0x120 [<ffffffff81150f1a>] free_pages_check_bad+0x5a/0x70 [<ffffffff81153818>] free_hot_cold_page+0x248/0x290 [<ffffffff81153b1b>] free_hot_cold_page_list+0x2b/0x50 [<ffffffff8115c52d>] release_pages+0x2bd/0x350 [<ffffffff8115da62>] __pagevec_release+0x22/0x30 [<ffffffffa00a0d4e>] extent_write_cache_pages.isra.48.constprop.63+0x32e/0x400 [btrfs] [<ffffffffa00a1199>] extent_writepages+0x49/0x60 [btrfs] [<ffffffffa0081840>] ? btrfs_releasepage+0x40/0x40 [btrfs] [<ffffffffa007e993>] btrfs_writepages+0x23/0x30 [btrfs] [<ffffffff8115ac4c>] do_writepages+0x1c/0x30 [<ffffffff811f69f3>] __writeback_single_inode+0x33/0x180 [<ffffffff811f71e8>] writeback_sb_inodes+0x2a8/0x5b0 [<ffffffff811f757d>] __writeback_inodes_wb+0x8d/0xc0 [<ffffffff811f7833>] wb_writeback+0x1e3/0x1f0 [<ffffffff811f7d72>] wb_workfn+0xd2/0x280 [<ffffffff810908d5>] process_one_work+0x1d5/0x490 [<ffffffff81090875>] ? process_one_work+0x175/0x490 [<ffffffff81090bd9>] worker_thread+0x49/0x490 [<ffffffff81090b90>] ? process_one_work+0x490/0x490 [<ffffffff81095d4e>] kthread+0xee/0x110 [<ffffffff81095c60>] ? kthread_park+0x60/0x60 [<ffffffff81790b12>] ret_from_fork+0x22/0x30 ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-31 18:55 ` Dave Jones @ 2016-10-31 19:35 ` Linus Torvalds 2016-10-31 19:44 ` Chris Mason 0 siblings, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-10-31 19:35 UTC (permalink / raw) To: Dave Jones, Chris Mason, Jens Axboe, Linus Torvalds, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones <davej@codemonkey.org.uk> wrote: > > BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 > page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c > flags: 0x400000000000000c(referenced|uptodate) > page dumped because: non-NULL mapping Hmm. So this seems to be btrfs-specific, right? I searched for all your "non-NULL mapping" cases, and they all seem to have basically the same call trace, with some work thread doing writeback and going through btrfs_writepages(). Sounds like it's a race with either fallocate hole-punching or truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ clearly ran other filesystems too but I am not seeing this backtrace for anything else. Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-31 19:35 ` Linus Torvalds @ 2016-10-31 19:44 ` Chris Mason 2016-11-06 16:55 ` btrfs btree_ctree_super fault Dave Jones 2016-11-23 19:34 ` bio linked list corruption Dave Jones 0 siblings, 2 replies; 118+ messages in thread From: Chris Mason @ 2016-10-31 19:44 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote: >On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones <davej@codemonkey.org.uk> wrote: >> >> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 >> page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c >> flags: 0x400000000000000c(referenced|uptodate) >> page dumped because: non-NULL mapping > >Hmm. So this seems to be btrfs-specific, right? > >I searched for all your "non-NULL mapping" cases, and they all seem to >have basically the same call trace, with some work thread doing >writeback and going through btrfs_writepages(). > >Sounds like it's a race with either fallocate hole-punching or >truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ >clearly ran other filesystems too but I am not seeing this backtrace >for anything else. Agreed, I think this is a separate bug, almost certainly btrfs specific. I'll work with Dave on a better reproducer. -chris ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: btrfs btree_ctree_super fault 2016-10-31 19:44 ` Chris Mason @ 2016-11-06 16:55 ` Dave Jones 2016-11-08 14:59 ` Dave Jones 2016-11-23 19:34 ` bio linked list corruption Dave Jones 1 sibling, 1 reply; 118+ messages in thread From: Dave Jones @ 2016-11-06 16:55 UTC (permalink / raw) To: Chris Mason, Linus Torvalds, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner <subject changed, hopefully we're done with bio corruption for now> On Mon, Oct 31, 2016 at 01:44:55PM -0600, Chris Mason wrote: > On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote: > >On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones <davej@codemonkey.org.uk> wrote: > >> > >> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 > >> page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c > >> flags: 0x400000000000000c(referenced|uptodate) > >> page dumped because: non-NULL mapping > > > >Hmm. So this seems to be btrfs-specific, right? > > > >I searched for all your "non-NULL mapping" cases, and they all seem to > >have basically the same call trace, with some work thread doing > >writeback and going through btrfs_writepages(). > > > >Sounds like it's a race with either fallocate hole-punching or > >truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ > >clearly ran other filesystems too but I am not seeing this backtrace > >for anything else. > > Agreed, I think this is a separate bug, almost certainly btrfs specific. > I'll work with Dave on a better reproducer. 
Still refining my 'capture ftrace when trinity detects taint' feature, but in the meantime, here's a variant I don't think we've seen before: general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 3 PID: 1913 Comm: trinity-c51 Not tainted 4.9.0-rc3-think+ #3 task: ffff880503350040 task.stack: ffffc90000240000 RIP: 0010:[<ffffffffa007c2f6>] [<ffffffffa007c2f6>] write_ctree_super+0x96/0xb30 [btrfs] RSP: 0018:ffffc90000243c90 EFLAGS: 00010286 RAX: dae05adadadad000 RBX: 0000000000000000 RCX: 0000000000000002 RDX: ffff8804fdfcc000 RSI: ffff8804edcee313 RDI: ffff8804edcee1c3 RBP: ffffc90000243d00 R08: 0000000000000003 R09: ffff880000000000 R10: 0000000000000001 R11: 0000000000000100 R12: ffff88045151c548 R13: 0000000000000000 R14: ffff8804ee5122a8 R15: ffff8804572267e8 FS: 00007f25c3e0eb40(0000) GS:ffff880507e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f25c1560d44 CR3: 0000000454e20000 CR4: 00000000001406e0 DR0: 00007fee93506000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 Stack: 0000000000000001 ffff88050227b3f8 ffff8804fff01b28 00000001810b7f35 ffffffffa007c265 0000000000000001 ffffc90000243cb8 000000002c9a8645 ffff8804fff01b28 ffff8804fff01b28 ffff88045151c548 0000000000000000 Call Trace: [<ffffffffa007c265>] ? write_ctree_super+0x5/0xb30 [btrfs] [<ffffffffa00d2956>] btrfs_sync_log+0x886/0xa60 [btrfs] [<ffffffffa009f4f9>] btrfs_sync_file+0x479/0x4d0 [btrfs] [<ffffffff812789ab>] vfs_fsync_range+0x4b/0xb0 [<ffffffff81260755>] ? __fget_light+0x5/0x60 [<ffffffff81278a6d>] do_fsync+0x3d/0x70 [<ffffffff81278a35>] ? 
do_fsync+0x5/0x70 [<ffffffff81278d20>] SyS_fsync+0x10/0x20 [<ffffffff81002d81>] do_syscall_64+0x61/0x170 [<ffffffff81894a8b>] entry_SYSCALL64_slow_path+0x25/0x25 Code: c7 48 8b 42 30 4c 8b 08 48 b8 00 00 00 00 00 16 00 00 49 03 81 a0 01 00 00 49 b9 00 00 00 00 00 88 ff ff 48 c1 f8 06 48 c1 e0 0c <4a> 8b 44 08 50 48 39 46 08 0f 84 8d 08 00 00 49 63 c0 48 8d 0c RIP [<ffffffffa007c2f6>] write_ctree_super+0x96/0xb30 [btrfs] RSP <ffffc90000243c90> All code ======== 0: c7 (bad) 1: 48 8b 42 30 mov 0x30(%rdx),%rax 5: 4c 8b 08 mov (%rax),%r9 8: 48 b8 00 00 00 00 00 movabs $0x160000000000,%rax f: 16 00 00 12: 49 03 81 a0 01 00 00 add 0x1a0(%r9),%rax 19: 49 b9 00 00 00 00 00 movabs $0xffff880000000000,%r9 20: 88 ff ff 23: 48 c1 f8 06 sar $0x6,%rax 27: 48 c1 e0 0c shl $0xc,%rax 2b:* 4a 8b 44 08 50 mov 0x50(%rax,%r9,1),%rax <-- trapping instruction 30: 48 39 46 08 cmp %rax,0x8(%rsi) 34: 0f 84 8d 08 00 00 je 0x8c7 3a: 49 63 c0 movslq %r8d,%rax 3d: 48 rex.W 3e: 8d .byte 0x8d 3f: 0c .byte 0xc Code starting with the faulting instruction =========================================== 0: 4a 8b 44 08 50 mov 0x50(%rax,%r9,1),%rax 5: 48 39 46 08 cmp %rax,0x8(%rsi) 9: 0f 84 8d 08 00 00 je 0x89c f: 49 63 c0 movslq %r8d,%rax 12: 48 rex.W 13: 8d .byte 0x8d 14: 0c .byte 0xc According to objdump -S, it looks like this is an inlined copy of backup_super_roots root_backup = info->super_for_commit->super_roots + last_backup; 2706: 48 8d b8 2b 0b 00 00 lea 0xb2b(%rax),%rdi 270d: 48 63 c1 movslq %ecx,%rax 2710: 48 8d 34 80 lea (%rax,%rax,4),%rsi 2714: 48 8d 04 b0 lea (%rax,%rsi,4),%rax 2718: 48 8d 34 c7 lea (%rdi,%rax,8),%rsi btrfs_header_generation(info->tree_root->node)) 271c: 48 8b 42 30 mov 0x30(%rdx),%rax 2720: 4c 8b 08 mov (%rax),%r9 2723: 48 b8 00 00 00 00 00 movabs $0x160000000000,%rax 272a: 16 00 00 272d: 49 03 81 a0 01 00 00 add 0x1a0(%r9),%rax if (btrfs_backup_tree_root_gen(root_backup) == 2734: 49 b9 00 00 00 00 00 movabs $0xffff880000000000,%r9 273b: 88 ff ff 273e: 48 c1 f8 06 sar 
$0x6,%rax 2742: 48 c1 e0 0c shl $0xc,%rax 2746: 4a 8b 44 08 50 mov 0x50(%rax,%r9,1),%rax <-- trapping instruction 274b: 48 39 46 08 cmp %rax,0x8(%rsi) 274f: 0f 84 8d 08 00 00 je 2fe2 <write_ctree_super+0x932> ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: btrfs btree_ctree_super fault 2016-11-06 16:55 ` btrfs btree_ctree_super fault Dave Jones @ 2016-11-08 14:59 ` Dave Jones 2016-11-08 15:08 ` Chris Mason 0 siblings, 1 reply; 118+ messages in thread From: Dave Jones @ 2016-11-08 14:59 UTC (permalink / raw) To: Chris Mason, Linus Torvalds, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Sun, Nov 06, 2016 at 11:55:39AM -0500, Dave Jones wrote: > <subject changed, hopefully we're done with bio corruption for now> > > On Mon, Oct 31, 2016 at 01:44:55PM -0600, Chris Mason wrote: > > On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote: > > >On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones <davej@codemonkey.org.uk> wrote: > > >> > > >> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 > > >> page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c > > >> flags: 0x400000000000000c(referenced|uptodate) > > >> page dumped because: non-NULL mapping > > > > > >Hmm. So this seems to be btrfs-specific, right? > > > > > >I searched for all your "non-NULL mapping" cases, and they all seem to > > >have basically the same call trace, with some work thread doing > > >writeback and going through btrfs_writepages(). > > > > > >Sounds like it's a race with either fallocate hole-punching or > > >truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ > > >clearly ran other filesystems too but I am not seeing this backtrace > > >for anything else. > > > > Agreed, I think this is a separate bug, almost certainly btrfs specific. > > I'll work with Dave on a better reproducer. > > Still refining my 'capture ftrace when trinity detects taint' feature, > but in the meantime, here's a variant I don't think we've seen before: And another new one: kernel BUG at fs/btrfs/ctree.c:3172! 
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 0 PID: 22702 Comm: trinity-c40 Not tainted 4.9.0-rc4-think+ #1 task: ffff8804ffde37c0 task.stack: ffffc90002188000 RIP: 0010:[<ffffffffa00576b9>] [<ffffffffa00576b9>] btrfs_set_item_key_safe+0x179/0x190 [btrfs] RSP: 0000:ffffc9000218b8a8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8804fddcf348 RCX: 0000000000001000 RDX: 0000000000000000 RSI: ffffc9000218b9ce RDI: ffffc9000218b8c7 RBP: ffffc9000218b908 R08: 0000000000004000 R09: ffffc9000218b8c8 R10: 0000000000000000 R11: 0000000000000001 R12: ffffc9000218b8b6 R13: ffffc9000218b9ce R14: 0000000000000001 R15: ffff880480684a88 FS: 00007f7c7f998b40(0000) GS:ffff880507800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000044f15f000 CR4: 00000000001406f0 DR0: 00007f4ce439d000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 Stack: ffff880501430000 d305ffffa00a2245 006c000000000002 0500000000000010 6c000000000002d3 0000000000001000 000000006427eebb ffff880480684a88 0000000000000000 ffff8804fddcf348 0000000000002000 0000000000000000 Call Trace: [<ffffffffa009cff0>] __btrfs_drop_extents+0xb00/0xe30 [btrfs] [<ffffffff8116c80c>] ? function_trace_call+0x13c/0x190 [<ffffffffa009c4f5>] ? __btrfs_drop_extents+0x5/0xe30 [btrfs] [<ffffffff810e2b00>] ? do_raw_write_lock+0xb0/0xc0 [<ffffffffa00cd43d>] btrfs_log_changed_extents+0x35d/0x630 [btrfs] [<ffffffffa00a6a74>] ? release_extent_buffer+0xa4/0x110 [btrfs] [<ffffffffa00cd0e5>] ? btrfs_log_changed_extents+0x5/0x630 [btrfs] [<ffffffffa00d1085>] btrfs_log_inode+0xb05/0x11d0 [btrfs] [<ffffffff8116536c>] ? trace_function+0x6c/0x80 [<ffffffffa00d0580>] ? log_directory_changes+0xc0/0xc0 [btrfs] [<ffffffffa00d1a20>] ? btrfs_log_inode_parent+0x240/0x940 [btrfs] [<ffffffff8116c80c>] ? function_trace_call+0x13c/0x190 [<ffffffffa00d1a20>] btrfs_log_inode_parent+0x240/0x940 [btrfs] [<ffffffffa00d17e5>] ? 
btrfs_log_inode_parent+0x5/0x940 [btrfs] [<ffffffff81259131>] ? dget_parent+0x71/0x150 [<ffffffffa00d3102>] btrfs_log_dentry_safe+0x62/0x80 [btrfs] [<ffffffffa009f404>] btrfs_sync_file+0x344/0x4d0 [btrfs] [<ffffffff81278a1b>] vfs_fsync_range+0x4b/0xb0 [<ffffffff812607c5>] ? __fget_light+0x5/0x60 [<ffffffff81278add>] do_fsync+0x3d/0x70 [<ffffffff81278aa5>] ? do_fsync+0x5/0x70 [<ffffffff81278db3>] SyS_fdatasync+0x13/0x20 [<ffffffff81002d81>] do_syscall_64+0x61/0x170 [<ffffffff81894ccb>] entry_SYSCALL64_slow_path+0x25/0x25 Code: 48 8b 45 b7 48 8d 7d bf 4c 89 ee 48 89 45 c8 0f b6 45 b6 88 45 c7 48 8b 45 ae 48 89 45 bf e8 af f2 ff ff 85 c0 0f 8f 43 ff ff ff <0f> 0b 0f 0b e8 ee f3 02 e1 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 Unfortunately, because this was a BUG_ON, it locked up the box so it didn't save any additional debug info. Tempted to see if making BUG_ON a no-op will at least let it live long enough to save the ftrace buffer. Given this seems to be mutating every time I see something go wrong, I'm wondering if this is fallout from memory corruption again. Dave ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: btrfs btree_ctree_super fault 2016-11-08 14:59 ` Dave Jones @ 2016-11-08 15:08 ` Chris Mason 2016-11-10 14:35 ` Dave Jones 0 siblings, 1 reply; 118+ messages in thread From: Chris Mason @ 2016-11-08 15:08 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 11/08/2016 09:59 AM, Dave Jones wrote: > On Sun, Nov 06, 2016 at 11:55:39AM -0500, Dave Jones wrote: > > <subject changed, hopefully we're done with bio corruption for now> > > > > On Mon, Oct 31, 2016 at 01:44:55PM -0600, Chris Mason wrote: > > > On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote: > > > >On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones <davej@codemonkey.org.uk> wrote: > > > >> > > > >> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 > > > >> page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c > > > >> flags: 0x400000000000000c(referenced|uptodate) > > > >> page dumped because: non-NULL mapping > > > > > > > >Hmm. So this seems to be btrfs-specific, right? > > > > > > > >I searched for all your "non-NULL mapping" cases, and they all seem to > > > >have basically the same call trace, with some work thread doing > > > >writeback and going through btrfs_writepages(). > > > > > > > >Sounds like it's a race with either fallocate hole-punching or > > > >truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ > > > >clearly ran other filesystems too but I am not seeing this backtrace > > > >for anything else. > > > > > > Agreed, I think this is a separate bug, almost certainly btrfs specific. > > > I'll work with Dave on a better reproducer. > > > > Still refining my 'capture ftrace when trinity detects taint' feature, > > but in the meantime, here's a variant I don't think we've seen before: > > And another new one: > > kernel BUG at fs/btrfs/ctree.c:3172! 
> invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > CPU: 0 PID: 22702 Comm: trinity-c40 Not tainted 4.9.0-rc4-think+ #1 > task: ffff8804ffde37c0 task.stack: ffffc90002188000 > RIP: 0010:[<ffffffffa00576b9>] > [<ffffffffa00576b9>] btrfs_set_item_key_safe+0x179/0x190 [btrfs] > RSP: 0000:ffffc9000218b8a8 EFLAGS: 00010246 > RAX: 0000000000000000 RBX: ffff8804fddcf348 RCX: 0000000000001000 > RDX: 0000000000000000 RSI: ffffc9000218b9ce RDI: ffffc9000218b8c7 > RBP: ffffc9000218b908 R08: 0000000000004000 R09: ffffc9000218b8c8 > R10: 0000000000000000 R11: 0000000000000001 R12: ffffc9000218b8b6 > R13: ffffc9000218b9ce R14: 0000000000000001 R15: ffff880480684a88 > FS: 00007f7c7f998b40(0000) GS:ffff880507800000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000000 CR3: 000000044f15f000 CR4: 00000000001406f0 > DR0: 00007f4ce439d000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 > Stack: > ffff880501430000 d305ffffa00a2245 006c000000000002 0500000000000010 > 6c000000000002d3 0000000000001000 000000006427eebb ffff880480684a88 > 0000000000000000 ffff8804fddcf348 0000000000002000 0000000000000000 > Call Trace: > [<ffffffffa009cff0>] __btrfs_drop_extents+0xb00/0xe30 [btrfs] We've been hunting this one for at least two years. It's the white whale of btrfs bugs. Josef has a semi-reliable reproducer now, but I think it's not the same as the pagevec based problems you reported earlier. -chris
* Re: btrfs btree_ctree_super fault 2016-11-08 15:08 ` Chris Mason @ 2016-11-10 14:35 ` Dave Jones 2016-11-10 15:27 ` Chris Mason 0 siblings, 1 reply; 118+ messages in thread From: Dave Jones @ 2016-11-10 14:35 UTC (permalink / raw) To: Chris Mason Cc: Linus Torvalds, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Tue, Nov 08, 2016 at 10:08:04AM -0500, Chris Mason wrote: > > And another new one: > > > > kernel BUG at fs/btrfs/ctree.c:3172! > > > > Call Trace: > > [<ffffffffa009cff0>] __btrfs_drop_extents+0xb00/0xe30 [btrfs] > > We've been hunting this one for at least two years. It's the white > whale of btrfs bugs. Josef has a semi-reliable reproducer now, but I > think it's not the same as the pagevec based problems you reported earlier. Great, now for whatever reason, I'm hitting this over and over. Even better, after the last time I hit it, it rebooted and this happened during boot.. BTRFS info (device sda6): disk space caching is enabled BTRFS info (device sda6): has skinny extents BTRFS info (device sda3): disk space caching is enabled ------------[ cut here ]------------ WARNING: CPU: 1 PID: 443 at fs/btrfs/file.c:546 btrfs_drop_extent_cache+0x411/0x420 [btrfs] CPU: 1 PID: 443 Comm: mount Not tainted 4.9.0-rc4-think+ #1 ffffc90000c4b468 ffffffff813b66bc 0000000000000000 0000000000000000 ffffc90000c4b4a8 ffffffff81086d2b 0000022200c4b488 000000000002f265 40c8dded1afd6000 ffff8804ff5cddc8 ffff8804ef26f2b8 40c8dded1afd5000 Call Trace: [<ffffffff813b66bc>] dump_stack+0x4f/0x73 [<ffffffff81086d2b>] __warn+0xcb/0xf0 [<ffffffff81086e5d>] warn_slowpath_null+0x1d/0x20 [<ffffffffa009c0f1>] btrfs_drop_extent_cache+0x411/0x420 [btrfs] [<ffffffff81215923>] ? alloc_debug_processing+0x73/0x1b0 [<ffffffffa009c93f>] __btrfs_drop_extents+0x44f/0xe30 [btrfs] [<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [<ffffffff8121842a>] ? 
kmem_cache_alloc+0x2aa/0x330 [<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [<ffffffffa009e399>] btrfs_drop_extents+0x79/0xa0 [btrfs] [<ffffffffa00ce4d1>] replay_one_extent+0x1e1/0x710 [btrfs] [<ffffffffa00cec6d>] replay_one_buffer+0x26d/0x7e0 [btrfs] [<ffffffff8121732c>] ? ___slab_alloc.constprop.83+0x27c/0x5c0 [<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [<ffffffff813d5b87>] ? debug_smp_processor_id+0x17/0x20 [<ffffffffa00ca3db>] walk_up_log_tree+0xeb/0x240 [btrfs] [<ffffffffa00ca5d6>] walk_log_tree+0xa6/0x1d0 [btrfs] [<ffffffffa00d32fc>] btrfs_recover_log_trees+0x1dc/0x460 [btrfs] [<ffffffffa00cea00>] ? replay_one_extent+0x710/0x710 [btrfs] [<ffffffffa0081f65>] open_ctree+0x2575/0x2670 [btrfs] [<ffffffffa005144b>] btrfs_mount+0xd0b/0xe10 [btrfs] [<ffffffff811da804>] ? pcpu_alloc+0x2d4/0x660 [<ffffffff810dce41>] ? lockdep_init_map+0x61/0x200 [<ffffffff810d39eb>] ? __init_waitqueue_head+0x3b/0x50 [<ffffffff81243794>] mount_fs+0x14/0xa0 [<ffffffff8126305b>] vfs_kern_mount+0x6b/0x150 [<ffffffffa0050a08>] btrfs_mount+0x2c8/0xe10 [btrfs] [<ffffffff811da804>] ? pcpu_alloc+0x2d4/0x660 [<ffffffff810dce41>] ? lockdep_init_map+0x61/0x200 [<ffffffff810dce41>] ? lockdep_init_map+0x61/0x200 [<ffffffff810d39eb>] ? __init_waitqueue_head+0x3b/0x50 [<ffffffff81243794>] mount_fs+0x14/0xa0 [<ffffffff8126305b>] vfs_kern_mount+0x6b/0x150 [<ffffffff81265bd2>] do_mount+0x1c2/0xda0 [<ffffffff811d41c0>] ? memdup_user+0x60/0x90 [<ffffffff81266ac3>] SyS_mount+0x83/0xd0 [<ffffffff81002d81>] do_syscall_64+0x61/0x170 [<ffffffff81894ccb>] entry_SYSCALL64_slow_path+0x25/0x25 ---[ end trace d3fa03bb9c115bbe ]--- BTRFS: error (device sda3) in btrfs_replay_log:2491: errno=-17 Object already exists (Failed to recover log tree) BTRFS error (device sda3): cleaner transaction attach returned -30 BTRFS error (device sda3): open_ctree failed Guess I'll hit it with btrfsck and hope for the best.. Dave
* Re: btrfs btree_ctree_super fault 2016-11-10 14:35 ` Dave Jones @ 2016-11-10 15:27 ` Chris Mason 0 siblings, 0 replies; 118+ messages in thread From: Chris Mason @ 2016-11-10 15:27 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 11/10/2016 09:35 AM, Dave Jones wrote: > On Tue, Nov 08, 2016 at 10:08:04AM -0500, Chris Mason wrote: > > > > And another new one: > > > > > > kernel BUG at fs/btrfs/ctree.c:3172! > > > > > > Call Trace: > > > [<ffffffffa009cff0>] __btrfs_drop_extents+0xb00/0xe30 [btrfs] > > > > We've been hunting this one for at least two years. It's the white > > whale of btrfs bugs. Josef has a semi-reliable reproducer now, but I > > think it's not the same as the pagevec based problems you reported earlier. > > Great, now for whatever reason, I'm hitting this over and over. > > Even better, after the last time I hit it, it reboot and this happened during boot.. > > BTRFS info (device sda6): disk space caching is enabled > BTRFS info (device sda6): has skinny extents > BTRFS info (device sda3): disk space caching is enabled > ------------[ cut here ]------------ > WARNING: CPU: 1 PID: 443 at fs/btrfs/file.c:546 btrfs_drop_extent_cache+0x411/0x420 [btrfs] > CPU: 1 PID: 443 Comm: mount Not tainted 4.9.0-rc4-think+ #1 > ffffc90000c4b468 ffffffff813b66bc 0000000000000000 0000000000000000 > ffffc90000c4b4a8 ffffffff81086d2b 0000022200c4b488 000000000002f265 > 40c8dded1afd6000 ffff8804ff5cddc8 ffff8804ef26f2b8 40c8dded1afd5000 > Call Trace: > [<ffffffff813b66bc>] dump_stack+0x4f/0x73 > [<ffffffff81086d2b>] __warn+0xcb/0xf0 > [<ffffffff81086e5d>] warn_slowpath_null+0x1d/0x20 > [<ffffffffa009c0f1>] btrfs_drop_extent_cache+0x411/0x420 [btrfs] > [<ffffffff81215923>] ? alloc_debug_processing+0x73/0x1b0 > [<ffffffffa009c93f>] __btrfs_drop_extents+0x44f/0xe30 [btrfs] > [<ffffffffa005426a>] ? 
btrfs_alloc_path+0x1a/0x20 [btrfs] > [<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] > [<ffffffff8121842a>] ? kmem_cache_alloc+0x2aa/0x330 > [<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] > [<ffffffffa009e399>] btrfs_drop_extents+0x79/0xa0 [btrfs] > [<ffffffffa00ce4d1>] replay_one_extent+0x1e1/0x710 [btrfs] > [<ffffffffa00cec6d>] replay_one_buffer+0x26d/0x7e0 [btrfs] > [<ffffffff8121732c>] ? ___slab_alloc.constprop.83+0x27c/0x5c0 > [<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] > [<ffffffff813d5b87>] ? debug_smp_processor_id+0x17/0x20 > [<ffffffffa00ca3db>] walk_up_log_tree+0xeb/0x240 [btrfs] > [<ffffffffa00ca5d6>] walk_log_tree+0xa6/0x1d0 [btrfs] > [<ffffffffa00d32fc>] btrfs_recover_log_trees+0x1dc/0x460 [btrfs] > [<ffffffffa00cea00>] ? replay_one_extent+0x710/0x710 [btrfs] > [<ffffffffa0081f65>] open_ctree+0x2575/0x2670 [btrfs] > [<ffffffffa005144b>] btrfs_mount+0xd0b/0xe10 [btrfs] > [<ffffffff811da804>] ? pcpu_alloc+0x2d4/0x660 > [<ffffffff810dce41>] ? lockdep_init_map+0x61/0x200 > [<ffffffff810d39eb>] ? __init_waitqueue_head+0x3b/0x50 > [<ffffffff81243794>] mount_fs+0x14/0xa0 > [<ffffffff8126305b>] vfs_kern_mount+0x6b/0x150 > [<ffffffffa0050a08>] btrfs_mount+0x2c8/0xe10 [btrfs] > [<ffffffff811da804>] ? pcpu_alloc+0x2d4/0x660 > [<ffffffff810dce41>] ? lockdep_init_map+0x61/0x200 > [<ffffffff810dce41>] ? lockdep_init_map+0x61/0x200 > [<ffffffff810d39eb>] ? __init_waitqueue_head+0x3b/0x50 > [<ffffffff81243794>] mount_fs+0x14/0xa0 > [<ffffffff8126305b>] vfs_kern_mount+0x6b/0x150 > [<ffffffff81265bd2>] do_mount+0x1c2/0xda0 > [<ffffffff811d41c0>] ? 
memdup_user+0x60/0x90 > [<ffffffff81266ac3>] SyS_mount+0x83/0xd0 > [<ffffffff81002d81>] do_syscall_64+0x61/0x170 > [<ffffffff81894ccb>] entry_SYSCALL64_slow_path+0x25/0x25 > ---[ end trace d3fa03bb9c115bbe ]--- > BTRFS: error (device sda3) in btrfs_replay_log:2491: errno=-17 Object already exists (Failed to recover log tree) > BTRFS error (device sda3): cleaner transaction attach returned -30 > BTRFS error (device sda3): open_ctree failed > > > Guess I'll hit it with btrfsck and hope for the best.. You can zero the log if you need to. Josef has a ton of tracing around this right now, so I'm hoping we nail it down very soon. -chris
* Re: bio linked list corruption. 2016-10-31 19:44 ` Chris Mason 2016-11-06 16:55 ` btrfs btree_ctree_super fault Dave Jones @ 2016-11-23 19:34 ` Dave Jones 2016-11-23 19:58 ` Dave Jones 1 sibling, 1 reply; 118+ messages in thread From: Dave Jones @ 2016-11-23 19:34 UTC (permalink / raw) To: Chris Mason, Linus Torvalds, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Mon, Oct 31, 2016 at 01:44:55PM -0600, Chris Mason wrote: > On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote: > >On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones <davej@codemonkey.org.uk> wrote: > >> > >> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 > >> page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c > >> flags: 0x400000000000000c(referenced|uptodate) > >> page dumped because: non-NULL mapping > > > >Hmm. So this seems to be btrfs-specific, right? > > > >I searched for all your "non-NULL mapping" cases, and they all seem to > >have basically the same call trace, with some work thread doing > >writeback and going through btrfs_writepages(). > > > >Sounds like it's a race with either fallocate hole-punching or > >truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ > >clearly ran other filesystems too but I am not seeing this backtrace > >for anything else. > > Agreed, I think this is a separate bug, almost certainly btrfs specific. > I'll work with Dave on a better reproducer. Ok, let's try this.. 
Here's the usual dump, from the current tree: [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 [ 317.702235] page:ffffea001363f500 count:0 mapcount:0 mapping:ffff8804eef8e0e0 index:0x29 [ 317.718068] flags: 0x400000000000000c(referenced|uptodate) [ 317.731086] page dumped because: non-NULL mapping [ 317.744118] CPU: 3 PID: 1558 Comm: kworker/u8:8 Not tainted 4.9.0-rc6-think+ #1 [ 317.770404] Workqueue: writeback wb_workfn [ 317.783490] (flush-btrfs-3) [ 317.783581] ffffc90000ff3798 [ 317.796651] ffffffff813b69bc [ 317.796742] ffffea001363f500 [ 317.796798] ffffffff81c474ee [ 317.796854] ffffc90000ff37c0 [ 317.809910] ffffffff811b30c4 [ 317.810001] 0000000000000000 [ 317.810055] ffffea001363f500 [ 317.810112] 0000000000000003 [ 317.823230] ffffc90000ff37d0 [ 317.823320] ffffffff811b318f [ 317.823376] ffffc90000ff3818 [ 317.823431] Call Trace: [ 317.836645] [<ffffffff813b69bc>] dump_stack+0x4f/0x73 [ 317.850063] [<ffffffff811b30c4>] bad_page+0xc4/0x130 [ 317.863426] [<ffffffff811b318f>] free_pages_check_bad+0x5f/0x70 [ 317.876805] [<ffffffff811b61f5>] free_hot_cold_page+0x305/0x3b0 [ 317.890132] [<ffffffff811b660e>] free_hot_cold_page_list+0x7e/0x170 [ 317.903410] [<ffffffff811c08b7>] release_pages+0x2e7/0x3a0 [ 317.916600] [<ffffffff811c2067>] __pagevec_release+0x27/0x40 [ 317.929817] [<ffffffffa00b0a55>] extent_write_cache_pages.isra.44.constprop.59+0x355/0x430 [btrfs] [ 317.943073] [<ffffffffa00b0fcd>] extent_writepages+0x5d/0x90 [btrfs] [ 317.956033] [<ffffffffa008ea60>] ? btrfs_releasepage+0x40/0x40 [btrfs] [ 317.968902] [<ffffffffa008b6a8>] btrfs_writepages+0x28/0x30 [btrfs] [ 317.981744] [<ffffffff811be6d1>] do_writepages+0x21/0x30 [ 317.994524] [<ffffffff812727cf>] __writeback_single_inode+0x7f/0x6e0 [ 318.007333] [<ffffffff81894cc1>] ? 
_raw_spin_unlock+0x31/0x50 [ 318.020090] [<ffffffff8127356d>] writeback_sb_inodes+0x31d/0x6c0 [ 318.032844] [<ffffffff812739a2>] __writeback_inodes_wb+0x92/0xc0 [ 318.045623] [<ffffffff81273e97>] wb_writeback+0x3f7/0x530 [ 318.058304] [<ffffffff81274a38>] wb_workfn+0x148/0x620 [ 318.070949] [<ffffffff810a4b54>] ? process_one_work+0x194/0x690 [ 318.083549] [<ffffffff812748f5>] ? wb_workfn+0x5/0x620 [ 318.096160] [<ffffffff810a4bf2>] process_one_work+0x232/0x690 [ 318.108722] [<ffffffff810a4b54>] ? process_one_work+0x194/0x690 [ 318.121225] [<ffffffff810a509e>] worker_thread+0x4e/0x490 [ 318.133680] [<ffffffff810a5050>] ? process_one_work+0x690/0x690 [ 318.146126] [<ffffffff810a5050>] ? process_one_work+0x690/0x690 [ 318.158507] [<ffffffff810aa7e2>] kthread+0x102/0x120 [ 318.170880] [<ffffffff810aa6e0>] ? kthread_park+0x60/0x60 [ 318.183254] [<ffffffff81895822>] ret_from_fork+0x22/0x30 This time, I managed to get my ftrace magic working, so we have this trace from just before this happened. Does this shed any light ? https://codemonkey.org.uk/junk/trace.txt (note that some of the spinlock/rcu calls will be missing here. I chose to add them to ftrace's exclude list because they dominate the trace buffer so much, especially in debug builds) Dave
* Re: bio linked list corruption. 2016-11-23 19:34 ` bio linked list corruption Dave Jones @ 2016-11-23 19:58 ` Dave Jones 2016-12-01 15:32 ` btrfs_destroy_inode warn (outstanding extents) Dave Jones 2016-12-04 23:04 ` bio linked list corruption Vegard Nossum 0 siblings, 2 replies; 118+ messages in thread From: Dave Jones @ 2016-11-23 19:58 UTC (permalink / raw) To: Chris Mason, Linus Torvalds, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 > trace from just before this happened. Does this shed any light ? > > https://codemonkey.org.uk/junk/trace.txt crap, I just noticed the timestamps in the trace come from quite a bit later. I'll tweak the code to do the taint checking/ftrace stop after every syscall, that should narrow the window some more. Getting closer.. Dave
* Re: btrfs_destroy_inode warn (outstanding extents) 2016-11-23 19:58 ` Dave Jones @ 2016-12-01 15:32 ` Dave Jones 2016-12-03 16:48 ` Dave Jones 2016-12-09 21:12 ` Steven Rostedt 1 sibling, 2 replies; 118+ messages in thread From: Dave Jones @ 2016-12-01 15:32 UTC (permalink / raw) To: Chris Mason, Linus Torvalds, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Nov 23, 2016 at 02:58:45PM -0500, Dave Jones wrote: > On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: > > > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 > > trace from just before this happened. Does this shed any light ? > > > > https://codemonkey.org.uk/junk/trace.txt > > crap, I just noticed the timestamps in the trace come from quite a bit > later. I'll tweak the code to do the taint checking/ftrace stop after > every syscall, that should narrow the window some more. > > Getting closer.. Ok, this is getting more like it. http://codemonkey.org.uk/junk/btrfs-destroy-inode-outstanding-extents.txt Also same bug, different run, but a different traceview http://codemonkey.org.uk/junk/btrfs-destroy-inode-outstanding-extents-function-graph.txt (function-graph screws up the RIP for some reason, 'return_to_handler' should actually be btrfs_destroy_inode) Anyways, I've got some code that works pretty well for dumping the ftrace buffer now when things go awry. I just need to run it enough times that I hit that bad page state instead of this, or a lockdep bug first. Dave
* Re: btrfs_destroy_inode warn (outstanding extents) 2016-12-01 15:32 ` btrfs_destroy_inode warn (outstanding extents) Dave Jones @ 2016-12-03 16:48 ` Dave Jones 2016-12-07 16:15 ` Dave Jones 1 sibling, 1 reply; 118+ messages in thread From: Dave Jones @ 2016-12-03 16:48 UTC (permalink / raw) To: Chris Mason, Linus Torvalds, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Thu, Dec 01, 2016 at 10:32:09AM -0500, Dave Jones wrote: > http://codemonkey.org.uk/junk/btrfs-destroy-inode-outstanding-extents.txt > > Also same bug, different run, but a different traceview http://codemonkey.org.uk/junk/btrfs-destroy-inode-outstanding-extents-function-graph.txt > > (function-graph screws up the RIP for some reason, 'return_to_handler' > should actually be btrfs_destroy_inode) Chris pointed me at a pending patch that took care of this warning. > Anyways, I've got some code that works pretty well for dumping the > ftrace buffer now when things go awry. I just need to run it enough > times that I hit that bad page state instead of this, or a lockdep bug first. Which allowed me to run long enough to get this trace.. http://codemonkey.org.uk/junk/bad-page-state.txt Does this shed any light ? The interesting process here seems to be kworker/u8:17, and the trace captures some of what that was doing before that bad page was hit. I've got another run going now, so I'll compare that trace when it happens to see if it matches my current theory that it's something to do with that btrfs_scrubparity_helper. I've seen that show up in stack traces a few times while chasing this. Dave
* Re: btrfs_destroy_inode warn (outstanding extents) 2016-12-03 16:48 ` Dave Jones @ 2016-12-07 16:15 ` Dave Jones 0 siblings, 0 replies; 118+ messages in thread From: Dave Jones @ 2016-12-07 16:15 UTC (permalink / raw) To: Chris Mason, Linus Torvalds, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Sat, Dec 03, 2016 at 11:48:33AM -0500, Dave Jones wrote: > The interesting process here seems to be kworker/u8:17, and the trace > captures some of what that was doing before that bad page was hit. I'm travelling next week, so I'm trying to braindump the stuff I've found so far and summarise so I can pick it back up later if no-one else figures it out first. I've hit the bad page map spew with enough regularity that I've now got a handful of good traces. http://codemonkey.org.uk/junk/btrfs/bad-page-state1.txt http://codemonkey.org.uk/junk/btrfs/bad-page-state2.txt http://codemonkey.org.uk/junk/btrfs/bad-page-state3.txt http://codemonkey.org.uk/junk/btrfs/bad-page-state4.txt http://codemonkey.org.uk/junk/btrfs/bad-page-state5.txt http://codemonkey.org.uk/junk/btrfs/bad-page-state6.txt It smells to me like a race between truncate and the writeback workqueue. The variety of traces here seem to show both sides of the race, sometimes it's kworker, sometimes a trinity child process. bad-page-state3.txt onwards have some bonus trace_printk's from btrfs_setsize as I was curious what sizes we were passing down to truncate. The only patterns I see are going from very large to very small sizes. Perhaps that causes truncate to generate so much writeback that it makes the race apparent ? Other stuff I keep hitting: Start transaction spew: http://codemonkey.org.uk/junk/btrfs/start_transaction.txt That's the WARN_ON(h->use_count > 2); I hit this with enough regularity that I had to comment it out. It's not clear to me whether this is related at all. 
Lockdep spew: http://codemonkey.org.uk/junk/btrfs/register_lock_class1.txt http://codemonkey.org.uk/junk/btrfs/register_lock_class2.txt This stuff has been around for a while (4.6ish iirc) Sometimes the fs got into a screwed up state that needed btrfscking. http://codemonkey.org.uk/junk/btrfs/replay-log-fail.txt Dave
* Re: btrfs_destroy_inode warn (outstanding extents) 2016-12-01 15:32 ` btrfs_destroy_inode warn (outstanding extents) Dave Jones 2016-12-03 16:48 ` Dave Jones @ 2016-12-09 21:12 ` Steven Rostedt 1 sibling, 0 replies; 118+ messages in thread From: Steven Rostedt @ 2016-12-09 21:12 UTC (permalink / raw) To: Dave Jones, Chris Mason, Linus Torvalds, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Thu, Dec 01, 2016 at 10:32:09AM -0500, Dave Jones wrote: > > (function-graph screws up the RIP for some reason, 'return_to_handler' > should actually be btrfs_destroy_inode) That's because function_graph hijacks the return address and replaces it with return_to_handler. The back trace has code to show what that handler is supposed to be. But I should see why the WARNING shows it instead, and fix that too. Thanks, -- Steve
* Re: bio linked list corruption. 2016-11-23 19:58 ` Dave Jones 2016-12-01 15:32 ` btrfs_destroy_inode warn (outstanding extents) Dave Jones @ 2016-12-04 23:04 ` Vegard Nossum 2016-12-05 11:10 ` Vegard Nossum 2016-12-05 18:11 ` Andy Lutomirski 1 sibling, 2 replies; 118+ messages in thread From: Vegard Nossum @ 2016-12-04 23:04 UTC (permalink / raw) To: Dave Jones, Chris Mason, Linus Torvalds, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 23 November 2016 at 20:58, Dave Jones <davej@codemonkey.org.uk> wrote: > On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: > > > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 > > trace from just before this happened. Does this shed any light ? > > > > https://codemonkey.org.uk/junk/trace.txt > > crap, I just noticed the timestamps in the trace come from quite a bit > later. I'll tweak the code to do the taint checking/ftrace stop after > every syscall, that should narrow the window some more. 
FWIW I hit this as well: BUG: unable to handle kernel paging request at ffffffff81ff08b7 IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30 PGD 461e067 PUD 461f063 PMD 1e001e1 Oops: 0003 [#1] PREEMPT SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: G B 4.9.0-rc7+ #217 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 task: ffff8801ee924080 task.stack: ffff8801bab88000 RIP: 0010:[<ffffffff8135f2ea>] [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30 RSP: 0018:ffff8801bab8f730 EFLAGS: 00010082 RAX: ffffffff81ff071f RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffffffff85ae7d00 RBP: ffff8801bab8f7b0 R08: 0000000000000001 R09: 0000000000000000 R10: ffff8801e727fc40 R11: fffffbfff0b54ced R12: ffffffff85ae7d00 R13: ffffffff84912920 R14: ffff8801ee924080 R15: 0000000000000000 FS: 00007f37ee653700(0000) GS:ffff8801f6a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff81ff08b7 CR3: 00000001daa70000 CR4: 00000000000006f0 DR0: 00007f37ee465000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 Stack: ffff8801ee9247d0 0000000000000000 0000000100000000 ffff8801ee924080 ffff8801f6a201c0 ffff8801f6a201c0 0000000000000000 0000000000000001 ffff880100000000 ffff880100000000 ffff8801e727fc40 ffff8801ee924080 Call Trace: [<ffffffff81361751>] lock_acquire+0x141/0x2b0 [<ffffffff813530c0>] ? finish_wait+0xb0/0x180 [<ffffffff83c95b29>] _raw_spin_lock_irqsave+0x49/0x60 [<ffffffff813530c0>] ? finish_wait+0xb0/0x180 [<ffffffff813530c0>] finish_wait+0xb0/0x180 [<ffffffff81576227>] shmem_fault+0x4c7/0x6b0 [<ffffffff83a9b7cd>] ? p9_client_rpc+0x13d/0xf40 [<ffffffff81575d60>] ? shmem_getpage_gfp+0x1c90/0x1c90 [<ffffffff81fbe777>] ? radix_tree_next_chunk+0x4f7/0x840 [<ffffffff81352150>] ? 
wake_atomic_t_function+0x210/0x210 [<ffffffff815ad316>] __do_fault+0x206/0x410 [<ffffffff815ad110>] ? do_page_mkwrite+0x320/0x320 [<ffffffff815b9bcf>] handle_mm_fault+0x1cef/0x2a60 [<ffffffff815b8012>] ? handle_mm_fault+0x132/0x2a60 [<ffffffff815b7ee0>] ? __pmd_alloc+0x370/0x370 [<ffffffff81692e7e>] ? inode_add_bytes+0x10e/0x160 [<ffffffff8162de11>] ? memset+0x31/0x40 [<ffffffff815cba10>] ? find_vma+0x30/0x150 [<ffffffff812373a2>] __do_page_fault+0x452/0x9f0 [<ffffffff81ff071f>] ? iov_iter_init+0xaf/0x1d0 [<ffffffff81237bf5>] trace_do_page_fault+0x1e5/0x3a0 [<ffffffff8122a007>] do_async_page_fault+0x27/0xa0 [<ffffffff83c97618>] async_page_fault+0x28/0x30 [<ffffffff82059341>] ? strnlen_user+0x91/0x1a0 [<ffffffff8205931e>] ? strnlen_user+0x6e/0x1a0 [<ffffffff8157e038>] strndup_user+0x28/0xb0 [<ffffffff81d83c17>] SyS_add_key+0xc7/0x370 [<ffffffff81d83b50>] ? key_get_type_from_user.constprop.6+0xd0/0xd0 [<ffffffff815143ea>] ? __context_tracking_exit.part.4+0x3a/0x1e0 [<ffffffff81d83b50>] ? key_get_type_from_user.constprop.6+0xd0/0xd0 [<ffffffff8100524f>] do_syscall_64+0x1af/0x4d0 [<ffffffff83c96534>] entry_SYSCALL64_slow_path+0x25/0x25 Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85 c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <3e> ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00 RIP [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30 I didn't read all the emails in this thread, but the crash site looks identical to one of the earlier traces although the caller may be different. I think you can rule out btrfs in any case, probably block layer as well, since it looks like this comes from shmem. Vegard
* Re: bio linked list corruption. 2016-12-04 23:04 ` bio linked list corruption Vegard Nossum @ 2016-12-05 11:10 ` Vegard Nossum 2016-12-05 17:09 ` Vegard Nossum 2016-12-05 18:11 ` Andy Lutomirski 1 sibling, 1 reply; 118+ messages in thread From: Vegard Nossum @ 2016-12-05 11:10 UTC (permalink / raw) To: Dave Jones, Chris Mason, Linus Torvalds, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 5 December 2016 at 00:04, Vegard Nossum <vegard.nossum@gmail.com> wrote: > FWIW I hit this as well: > > BUG: unable to handle kernel paging request at ffffffff81ff08b7 > IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30 > CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: G B 4.9.0-rc7+ #217 [...] > I think you can rule out btrfs in any case, probably block layer as > well, since it looks like this comes from shmem. I should rather say that the VM runs on a 9p root filesystem and it doesn't use/mount any block devices or disk-based filesystems. I have all the trinity logs for the crash if that's useful. I tried a couple of runs with just the (at the time) in-progress syscalls but it didn't turn up anything interesting. Otherwise it seems like a lot of data to go through by hand. 
The crashing child seems to have just started, though, if that's relevant: [child56:21744] [0] open(filename="/sys/block/loop2/power/runtime_active_time", flags=0x777b01, mode=666) = -1 (Not a directory) [child56:21744] [1] [32BIT] sched_getattr(pid=1, param=0x7f37ec26c000, size=3415) = -1 (Invalid argument) [child56:21744] [2] [32BIT] access(filename="/proc/2/task/2/net/stat/arp_cache", mode=2000) = -1 (Invalid argument) [child56:21744] [3] getegid() = 0xfffe [child56:21744] [4] swapoff(path="/proc/721/task/721/net/dev_snmp6/tunl0") = -1 (Operation not permitted) [child56:21744] [5] timerfd_create(clockid=0x0, flags=0x0) = 439 [child56:21744] [6] pkey_mprotect(start=0x7f37ee656000, len=0, prot=0x1000008, key=0x600000000000000) = 0 [child56:21744] [7] msync(start=0x7f37ee657000, len=0, flags=0x6) = 0 [child56:21744] [8] flock(fd=168, cmd=0xffffffc191f30b0c) = -1 (Invalid argument) [child56:21744] [9] add_key(_type=0x437a15, _description=0x7f37ec06c000, _payload=0x7f37ec06c000, plen=0, ringid=0xfffffffffffffff8) The other logfiles end thusly: ==> trinity-child0.log <== [child0:21593] [311] faccessat(dfd=246, filename="/proc/983/task/983/net/protocols", mode=2000) = -1 (Invalid argument) [child0:21593] [312] renameat(olddfd=246, oldname="/proc/13/task/13/attr/sockcreate", newdfd=377, newname="/proc/16/task/16/net/stat/rt_cache") = -1 (Permission denied) [child0:21593] [313] [32BIT] readv(fd=289, vec=0x2e1a3d0, vlen=215) = 0 ==> trinity-child100.log <== [child100:21536] [439] setgid(gid=0x2a000200) = -1 (Operation not permitted) [child100:21536] [440] waitid(which=175, upid=21587, infop=0x4, options=3542, ru=0x7f37ec76c000) = -1 (Invalid argument) [child100:21536] [441] getxattr(pathname="/proc/980/task/980/net/ptype", name=0x7f37ee466000, value=0x7f37ec26c000, size=49) = -1 (Operation not supported) ==> trinity-child101.log <== [child101:21537] [55] getcwd(buf=0x7f37ee466000, size=4096) = 39 [child101:21537] [56] [32BIT] munlock(addr=0x7f37ee658000, len=0) = 0 
[child101:21537] [57] semctl(semid=0xbd851e2b40e7df, semnum=0x1b1b1b1b1b, cmd=0x2000000000, arg=0xcacacacaca) = -1 (Invalid argument) ==> trinity-child102.log <== [child102:21542] [11] readahead(fd=353, offset=2, count=249) = -1 (Invalid argument) [child102:21542] [12] add_key(_type=0x43793f, _description=0x7f37ec46c000, _payload=0x7f37ee658000, plen=32, ringid=0xfffffffffffffffa) = -1 (Invalid argument) [child102:21542] [13] time(tloc=0x7f37ee466000) = 0x584474e0 ==> trinity-child103.log <== [child103:21543] [45] dup(fildes=183) = 512 [child103:21543] [46] rt_sigpending(set=0x7f37ec86c000, sigsetsize=32) = -1 (Invalid argument) [child103:21543] [47] newstat(filename="/proc/587/task/587/gid_map", statbuf=0x7f37ee466000) = 0 ==> trinity-child104.log <== [child104:21546] [49] getdents(fd=162, dirent=0x0, count=127) = -1 (Not a directory) [child104:21546] [50] [32BIT] clock_adjtime(which_clock=0, utx=0x4) = -1 (Bad address) [child104:21546] [51] setsid() = 0x542a ==> trinity-child105.log <== [child105:21547] [523] epoll_wait(epfd=244, events=0x8, maxevents=246, timeout=-1) = -1 (Invalid argument) [child105:21547] [524] dup2(oldfd=244, newfd=244) = 244 [child105:21547] [525] acct(name=0x7f37ec26c000) = -1 (Operation not permitted) ==> trinity-child106.log <== [child106:19910] [136] getegid() = 0xfffe [child106:19910] [137] munmap(addr=0x7f37ee65a000, len=4096) = 0 [child106:19910] [138] clock_nanosleep(which_clock=0x1, flags=0x1, rqtp=0x7f37ec06c000, rmtp=0x7f37ee466000) ==> trinity-child107.log <== [child107:21224] [994] copy_file_range(fd_in=373, off_in=0x2400e210, fd_out=373, off_out=8, len=8, flags=0x0) = -1 (Bad file descriptor) [child107:21224] [995] kcmp(pid1=1, pid2=21453, type=0x5, idx1=0x787878787878, idx2=0xffffff6060606060) = -1 (Operation not permitted) [child107:21224] [996] [32BIT] readv(fd=365, vec=0x2e27e10, vlen=36) = 0 ==> trinity-child108.log <== [child108:21226] [759] recvfrom(fd=219, ubuf=0x7f37ec26c000, size=8, flags=0x0, addr=0x2e1ed80, 
addr_len=110) = -1 (Bad file descriptor) [child108:21226] [760] shmat(shmid=-4097, shmaddr=0x7f37ee465000, shmflg=-195) = -1 (Invalid argument) [child108:21226] [761] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0) ==> trinity-child109.log <== [child109:21552] [241] fremovexattr(fd=133, name=0x4) = -1 (Bad address) [child109:21552] [242] sched_setattr(pid=0, uattr=0x1) = -1 (Invalid argument) [child109:21552] [243] msgget(key=0x71717171717171, msgflg=0xec00000000) = 0x2d805a ==> trinity-child10.log <== [child10:21619] [120] creat(pathname="/proc/6/attr/fscreate", mode=0) = 521 [child10:21619] [121] [32BIT] sethostname(name=0x7f37ec06c000, len=4096) = -1 (Operation not permitted) [child10:21619] [122] add_key(_type=0x43793f, _description=0x7f37ec66c000, _payload=0x7f37ec06c000, plen=8, ringid=0xfffffffffffffffe) = -1 (Invalid argument) ==> trinity-child110.log <== [child110:21554] [289] accept(fd=366, upeer_sockaddr=0x2e09f10, upeer_addrlen=12) = -1 (Socket operation on non-socket) [child110:21554] [290] unlinkat(dfd=147, pathname="/proc/16/task/16/net/atm/br2684", flag=-62) = -1 (Invalid argument) [child110:21554] [291] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0) ==> trinity-child111.log <== [child111:21556] [102] getsockname(fd=292, usockaddr=0x2e10190, usockaddr_len=110) = -1 (Socket operation on non-socket) [child111:21556] [103] munmap(addr=0x7f37ee65a000, len=4096) = 0 [child111:21556] [104] personality(personality=1) = 0 ==> trinity-child112.log <== [child112:21241] [848] setpgid(pid=21456, pgid=9) = -1 (No such process) [child112:21241] [849] splice(fd_in=182, off_in=0x7f37ec76c000, fd_out=182, off_out=0x7f37ec76c008, len=4, flags=0x7) = -1 (Bad file descriptor) [child112:21241] [850] setgid(gid=0) = -1 (Operation not permitted) ==> trinity-child113.log <== [child113:21560] [453] memfd_create(uname=0x7f37ee657000, flag=0x1) = 530 [child113:21560] [454] pwritev2(fd=342, vec=0x2e1e000, vlen=133, pos_l=0xffffff1d, pos_h=-4, flags=0x0) = -1 (Illegal seek) 
[child113:21560] [455] readahead(fd=338, offset=0xe000, count=4096) = -1 (Invalid argument) ==> trinity-child114.log <== [child114:21562] [27] [32BIT] getgroups16(gidsetsize=0xf12ec06dec, grouplist=0x7f37ee465000) = -1 (Bad address) [child114:21562] [28] keyctl(cmd=0x14, arg2=0xcecececece, arg3=8192, arg4=0x7cc57bd0d8cf577, arg5=0x898989898989) = -1 (Invalid argument) [child114:21562] [29] readlink(path="/sys/block/loop1/mq/0/pending", buf=0x7f37ee467000, bufsiz=0x1565a3aa) = -1 (Invalid argument) ==> trinity-child115.log <== [child115:21563] [392] eventfd(count=8) = 523 [child115:21563] [393] getegid() = 0xfffe [child115:21563] [394] open(filename="/proc/587/stat", flags=0x795ac0, mode=666) = -1 (Invalid argument) ==> trinity-child116.log <== [child116:21567] [80] mbind(start=0x7f37ee65a000, len=0, mode=0x3, nmask=0x0, maxnode=0x8000, flags=0x5) = -1 (Operation not permitted) [child116:21567] [81] flock(fd=357, cmd=0xffffff0fe2220862) = 0 [child116:21567] [82] clock_adjtime(which_clock=15, utx=0x8) = -1 (Invalid argument) ==> trinity-child117.log <== [child117:21570] [333] pipe(fildes=0x7f37ee466000) = 0 [child117:21570] [334] get_robust_list(pid=21439, head_ptr=0x1, len_ptr=0x7f37ebc18000) = -1 (Operation not permitted) [child117:21570] [335] execve(name="/proc/550/net/rt_cache", argv=0x2e230d0, envp=0x2ba92a0) ==> trinity-child118.log <== [child118:21572] [455] fstatat64(dfd=244, filename="/proc/1297/task/1297/fdinfo/110", statbuf=0x8, flag=97) = -1 (Invalid argument) [child118:21572] [456] flistxattr(fd=244, list=0x7f37ec26c000, size=8) = 0 [child118:21572] [457] ioprio_set(which=0x5d5d5d5d5d5d, who=0x222734221582, ioprio=0xffffff5b5b5b5b5b) = -1 (Invalid argument) ==> trinity-child119.log <== [child119:21574] [400] timerfd_create(clockid=0x0, flags=0x0) = 522 [child119:21574] [401] execveat(fd=338, name="/proc/1024/net/can-bcm", argv=0x2bda9f0, envp=0x2e21130, flags=0x100) = 522 [child119:21574] [402] execveat(fd=244, 
name="/sys/devices/platform/serial8250/tty/ttyS3/power/autosuspend_delay_ms", argv=0x2e09b30, envp=0x2e1d0f0, flags=0x1000) ==> trinity-child11.log <== [child11:21621] [352] symlink(oldname="/sys/devices/virtual/sound/timer/uevent", newname="/proc/549/task/549/net/rpc/auth.unix.gid") = -1 (File exists) [child11:21621] [353] [32BIT] setgroups16(gidsetsize=0xffffff18, grouplist=0x8) = -1 (Operation not permitted) [child11:21621] [354] semtimedop(semid=0x100000de63c0d6e9, tsops=0x4, nsops=0, timeout=0x8) ==> trinity-child120.log <== [child120:21578] [141] execveat(fd=147, name="/proc/2/task/2/net/llc/socket", argv=0x2e0a250, envp=0x2e10190, flags=0x1000) = -1 (Permission denied) [child120:21578] [142] readlinkat(dfd=330, pathname="/sys/devices/virtual/misc/ocfs2_control/power/runtime_active_kids", buf=0x7f37ee656000, bufsiz=4) = -1 (Invalid argument) [child120:21578] [143] fremovexattr(fd=330, name=0x7f37ec26c000) = -1 (Permission denied) ==> trinity-child121.log <== [child121:21263] [586] pread64(fd=237, buf=0x7f37ec76c000, count=3746, pos=75) = 0 [child121:21263] [587] inotify_add_watch(fd=237, pathname="/proc/974/task/974/net/atm/pvc", mask=0x5) = -1 (Invalid argument) [child121:21263] [588] removexattr(pathname="/proc/974/task/974/net/sco", name=0x7f37ec26c000) = -1 (Permission denied) ==> trinity-child122.log <== [child122:21580] [57] getrusage(who=0x3a2a97385da52b25, ru=0x0) = -1 (Invalid argument) [child122:21580] [58] getdents(fd=373, dirent=0x1, count=0x6cfbd4f4) = -1 (Bad file descriptor) [child122:21580] [59] msgsnd(msqid=-9, msgp=0x8, msgsz=59, msgflg=0x0) = -1 (Bad address) ==> trinity-child123.log <== [child123:21582] [289] recvmsg(fd=244, msg=0x7f37ec26c000, flags=0x5f058) = -1 (Socket operation on non-socket) [child123:21582] [290] poll(ufds=0x2d25880, nfds=7, timeout_msecs=0) = 5 [child123:21582] [291] getppid() = 1297 ==> trinity-child124.log <== [child124:20960] [493] uname(name=0x7f37ee466000) = 0 [child124:20960] [494] sysfs(option=0x1, arg1=-3, 
arg2=0x5982b49806138008) = -1 (Bad address) [child124:20960] [495] execve(name="/proc/sys/net/ipv4/neigh/sit0/base_reachable_time_ms", argv=0x2e122d0, envp=0x2bc95a0) ==> trinity-child125.log <== [child125:21586] [15] setgroups(gidsetsize=12, grouplist=0x0) = -1 (Operation not permitted) [child125:21586] [16] [32BIT] oldumount(name=0x7f37ec46c000) = -1 (Operation not permitted) [child125:21586] [17] utimes(filename="/proc/5/net/anycast6", utimes=0x8) = -1 (Bad address) ==> trinity-child126.log <== [child126:21587] [29] shmdt(shmaddr=0x7f37ec66c000) = -1 (Invalid argument) [child126:21587] [30] setxattr(pathname="/sys/devices/pci0000:00/0000:00:03.0/net/eth0/power/runtime_active_kids", name=0x4, value=0x7f37ec46c000, size=0xfffd, flags=0x1) = -1 (Bad address) [child126:21587] [31] chroot(filename="/proc/1240/net/hci") = -1 (Not a directory) ==> trinity-child127.log <== [child127:21588] [290] clock_gettime(which_clock=6156, tp=0x0) = -1 (Invalid argument) [child127:21588] [291] statfs(pathname="/proc/1252/net/ipv6_route", buf=0x7f37ee659000) = 0 [child127:21588] [292] msync(start=0x7f37ee658000, len=0, flags=0x6) = 0 ==> trinity-child12.log <== [child12:21625] [54] getsid(pid=21655) = 0 [child12:21625] [55] mlock2(start=0x7f37ec06c000, len=0x136000) = -1 (Cannot allocate memory) [child12:21625] [56] dup3(oldfd=180, newfd=180, flags=0x80000) = -1 (Invalid argument) ==> trinity-child13.log <== [child13:21627] [32] add_key(_type=0x437a15, _description=0x7f37ee466001, _payload=0x7f37ee466000, plen=0x40000, ringid=0xfffffffffffffffe) = -1 (Invalid argument) [child13:21627] [33] mknodat(dfd=243, filename="/proc/553/task/553/attr/fscreate", mode=0xf63d64273f4d42, dev=-10) = -1 (Operation not permitted) [child13:21627] [34] mq_timedreceive(mqdes=328, u_msg_ptr=0x7f37ee466000, msg_len=0xd1d1d1, u_msg_prio=0x7f37ec06c000, u_abs_timeout=0x7f37ee466000) = -1 (Invalid argument) ==> trinity-child14.log <== [child14:21633] [194] msgsnd(msqid=0x4e251795f0, msgp=0x0, msgsz=8, 
msgflg=0x0) = -1 (Bad address) [child14:21633] [195] ppoll(ufds=0x2c4b530, nfds=7, tsp=0x2e09fc0, sigmask=0x7f37ee465000, sigsetsize=128) = -1 (Invalid argument) [child14:21633] [196] faccessat(dfd=251, filename="/proc/15/task/15/net/stat/nf_conntrack", mode=40) = -1 (Invalid argument) ==> trinity-child15.log <== [child15:21634] [210] timer_gettime(timer_id=1, setting=0x4) = -1 (Invalid argument) [child15:21634] [211] lgetxattr(pathname="/proc/551/setgroups", name=0x4, value=0xc, size=1) = -1 (Bad address) [child15:21634] [212] splice(fd_in=265, off_in=0x7f37ec26c000, fd_out=265, off_out=0x7f37ec26c008, len=1, flags=0x4) = -1 (Bad file descriptor) ==> trinity-child16.log <== [child16:21640] [307] sched_get_priority_max(policy=0x7fffffff08000000) = -1 (Invalid argument) [child16:21640] [308] getcpu(cpup=0x7f37ee465000, nodep=0x7f37ee465004, unused=0x7f37ee465008) = 0 [child16:21640] [309] sendfile(out_fd=367, in_fd=367, offset=0x0, count=0) = -1 (Bad file descriptor) ==> trinity-child17.log <== [child17:21642] [227] llistxattr(pathname="/proc/16/task/16/net/route", list=0x8, size=4) = 0 [child17:21642] [228] msync(start=0x7f37ec66c000, len=16384, flags=0x6) = 0 [child17:21642] [229] [32BIT] execve(name="/proc/1024/net/atm/devices", argv=0x2c32e20, envp=0x2dfb5a0) = 0 ==> trinity-child18.log <== [child18:21644] [55] execve(name="/proc/971/task/971/net/icmp", argv=0x2d25880, envp=0x2b996a0) = -1 (Permission denied) [child18:21644] [56] utime(filename="/proc/974/net/raw", times=0x7f37ee465000) = -1 (Operation not permitted) [child18:21644] [57] umask(mask=0xfffffff7) = 18 ==> trinity-child19.log <== [child19:21651] [208] [32BIT] move_pages(pid=21604, nr_pages=145, pages=0x2e1a3d0, nodes=0x2e1b3e0, status=0x2e1b630, flags=0x2) = -1 (Bad address) [child19:21651] [209] timer_getoverrun(timer_id=-2) = -1 (Invalid argument) [child19:21651] [210] rename(oldname="/proc/14/cmdline", newname="/proc/1041/task/1041/net/can/rcvlist_all") = -1 (No such file or directory) ==> 
trinity-child1.log <== [child1:19992] [1361] lsetxattr(pathname="/proc/1297/task/1297/fdinfo/96", name=0x1, value=0x7f37ee466000, size=0, flags=0x0) = -1 (Bad address) [child1:19992] [1362] setgid(gid=0xf0f0f0f0f) = -1 (Operation not permitted) [child1:19992] [1363] clock_nanosleep(which_clock=0x1, flags=0x0, rqtp=0x7f37ee659000, rmtp=0x7f37ee465000) ==> trinity-child20.log <== [child20:21027] [1852] sendto(fd=244, buff=0x2df7410, len=1, flags=0x800436c8, addr=0x2df7460, addr_len=16) = -1 (Socket operation on non-socket) [child20:21027] [1853] setresgid(rgid=0, egid=0x4848, sgid=0xdddddddd) = -1 (Operation not permitted) [child20:21027] [1854] poll(ufds=0x2e1fb80, nfds=8, timeout_msecs=0) = 6 ==> trinity-child21.log <== [child21:21330] [663] statfs(pathname="/sys/block/md0/queue/scheduler", buf=0x7f37ec66c000) = 0 [child21:21330] [664] sched_setaffinity(pid=1, len=1028, user_mask_ptr=0x7f37ec46c000) = -1 (Operation not permitted) [child21:21330] [665] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0) ==> trinity-child22.log <== [child22:21652] [61] pkey_alloc(flags=0, init_val=0x0) = -1 (No space left on device) [child22:21652] [62] fsync(fd=330) = 0 [child22:21652] [63] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0) ==> trinity-child23.log <== [child23:21655] [237] sync_file_range(fd=244, offset=191, nbytes=1, flags=0xffffffffffffff9) = -1 (Invalid argument) [child23:21655] [238] io_submit(ctx_id=4088, nr=1, iocbpp=0x7f37ee658000) = -1 (Invalid argument) [child23:21655] [239] flistxattr(fd=367, list=0x7f37ee656000, size=8) ==> trinity-child24.log <== [child24:21657] [86] [32BIT] personality(personality=0x3fde3e4912ec) = 0x82604841 [child24:21657] [87] mkdir(pathname="/proc/924/task/924/net/atalk/interface", mode=0) = -1 (File exists) [child24:21657] [88] [32BIT] fchownat(dfd=381, filename="/sys/module/nf_conntrack/parameters/nf_conntrack_helper", user=0x1804802345, group=0xffffffff, flag=-3) = -1 (Invalid argument) ==> trinity-child25.log <== [child25:21658] [157] 
timerfd_settime(ufd=201, flags=0x80800, utmr=0x1, otmr=0x9) = -1 (Bad address) [child25:21658] [158] rt_tgsigqueueinfo(tgid=0xc0da20044837, pid=21330, sig=7099, uinfo=0x7f37ee466000) = -1 (Bad address) [child25:21658] [159] pkey_mprotect(start=0x7f37ec76c000, len=0x79000, prot=0x0, key=0x5120a54c676961) = -1 (Invalid argument) ==> trinity-child26.log <== [child26:21665] [20] fdatasync(fd=217) = -1 (Invalid argument) [child26:21665] [21] fsetxattr(fd=137, name=0x4, value=0x7f37ec26c000, size=-9, flags=0x2) = -1 (Bad address) [child26:21665] [22] open_by_handle_at(mountdirfd=137, handle=0x7f37ee465000, flags=0x5b33c0) = -1 (Operation not permitted) ==> trinity-child27.log <== [child27:21668] [190] preadv(fd=381, vec=0x2e1b3e0, vlen=233, pos_l=0x710702af1dfa, pos_h=0xc976c19d4f) = 0 [child27:21668] [191] [32BIT] mlock(addr=0x7f37ec76c000, len=0xce000) = -1 (Cannot allocate memory) [child27:21668] [192] [32BIT] getsockopt(fd=381, level=0x54100210202640, optname=0x9a9a9a9a9a9a, optval=0x7f37ee466000, optlen=8) = -1 (Socket operation on non-socket) ==> trinity-child28.log <== [child28:1813] [202] sched_setscheduler(pid=0, policy=0x5, param=0x7f37ee465000) = 0 [child28:1813] [203] setresuid(ruid=0x80020293a4aa482, euid=247, suid=0xffffffff81000002) = -1 (Operation not permitted) [child28:1813] [204] epoll_ctl(epfd=332, op=0x3, fd=332, event=0x2e0a150) [child28:1813] [204] newstat(filename="/proc/9/task/9/net/atalk", statbuf=0x8) [child28:1813] [204] newstat(filename="/proc/9/task/9/net/atalk", statbuf=0x8) [child28:1813] [204] statfs(pathname="/proc/989/task/989/attr/prev", buf=0x7f37ee466000) [child28:1813] [204] newstat(filename="/proc/9/task/9/net/atalk", statbuf=0x8) [child28:1813] [204] statfs(pathname="/proc/989/task/989/attr/prev", buf=0x7f37ee466000) [child28:1813] [204] timerfd_create(clockid=0x1, flags=0x800) [child28:1813] [204] newstat(filename="/proc/9/task/9/net/atalk", statbuf=0x8) [child28:1813] [204] statfs(pathname="/proc/989/task/989/attr/prev", 
buf=0x7f37ee466000) [child28:1813] [204] timerfd_create(clockid=0x1, flags=0x800) [child28:1813] [204] process_vm_readv(pid=1, lvec=0x2e21d30, liovcnt=48, rvec=0x2e22040, riovcnt=23, flags=0x0) ==> trinity-child29.log <== = 0 [child29:1329] [68] [32BIT] getitimer(which=0xfffffffb, value=0x7f37ee465000) [child29:1329] [68] setpriority(which=0x8000000000, who=4, niceval=-2) [child29:1329] [68] rt_sigprocmask(how=-9, set=0x0, oset=0x7f37ee466000, sigsetsize=128) = -1 (Invalid argument) [child29:1329] [69] shmat(shmid=0xfffffff9, shmaddr=0x1, shmflg=0x55d10db3) ==> trinity-child2.log <== [child2:21596] [87] accept4(fd=189, upeer_sockaddr=0x2bda970, upeer_addrlen=110, flags=0x80800) = -1 (Bad file descriptor) [child2:21596] [88] access(filename="/proc/1134/task/1134/net/llc/socket", mode=2060) = -1 (Invalid argument) [child2:21596] [89] sendfile(out_fd=189, in_fd=172, offset=0x7f37ee465000, count=4096) = -1 (Bad file descriptor) ==> trinity-child30.log <== [child30:9344] [334] sendmsg(fd=0, msg=0x2c4b530, flags=0x5ba83) = -1 (Socket operation on non-socket) [child30:9344] [335] signalfd(ufd=382, user_mask=0x7f37ee65a000, sizemask=4) = -1 (Invalid argument) [child30:9344] [336] listen(fd=206, backlog=0x4649119809) [child30:9344] [336] io_getevents(ctx_id=0x372b076d67, min_nr=0xfff8, nr=4096, events=0x7f37ec86c000, timeout=0x7f37ee65a000) ==> trinity-child31.log <== [child31:2333] [416] munlockall() = 0 [child31:2333] [417] fgetxattr(fd=189, name=0x0, value=0x7f37ec76c000, size=252) = -1 (Bad file descriptor) [child31:2333] [418] fgetxattr(fd=308, name=0x4, value=0x7f37ec26c000, size=8) [child31:2333] [418] clock_adjtime(which_clock=15, utx=0x1) [child31:2333] [418] clock_adjtime(which_clock=15, utx=0x1) [child31:2333] [418] timer_gettime(timer_id=0xf9f9f9f9f9f9, setting=0x1) [child31:2333] [418] socket(family=32, type=0x80800, protocol=10) [child31:2333] [418] socket(family=32, type=0x80800, protocol=10) [child31:2333] [418] dup2(oldfd=371, newfd=363) [child31:2333] 
[418] socket(family=32, type=0x80800, protocol=10) [child31:2333] [418] dup2(oldfd=371, newfd=363) [child31:2333] [418] unlink(pathname="/proc/1209/task/1209/net/atm/pvc") [child31:2333] [418] openat(dfd=213, filename="/proc/971/task/971/net/atm", flags=0x141a00, mode=712) [child31:2333] [418] timer_gettime(timer_id=0x900040400186, setting=0x4) [child31:2333] [418] getxattr(pathname="/proc/1023/net/stat/ndisc_cache", name=0x7f37ec86c000, value=0x7f37ec86c001, size=259) [child31:2333] [418] getxattr(pathname="/proc/1023/net/stat/ndisc_cache", name=0x7f37ec86c000, value=0x7f37ec86c001, size=259) [child31:2333] [418] open(filename="/sys/module/l2tp_ip", flags=0x80400, mode=211) [child31:2333] [418] umask(mask=0xa10000004d483090) [child31:2333] [418] umask(mask=0xa10000004d483090) [child31:2333] [418] setgroups(gidsetsize=0, grouplist=0x4) [child31:2333] [418] getgroups(gidsetsize=0xd4d4d4d4d4d4, grouplist=0x8) [child31:2333] [418] getgroups(gidsetsize=0xd4d4d4d4d4d4, grouplist=0x8) [child31:2333] [418] io_cancel(ctx_id=0xffffffff00000004, iocb=0x7f37ee465000, result=0x4fc0000000) [child31:2333] [418] getgroups(gidsetsize=0xd4d4d4d4d4d4, grouplist=0x8) [child31:2333] [418] io_cancel(ctx_id=0xffffffff00000004, iocb=0x7f37ee465000, result=0x4fc0000000) [child31:2333] [418] ioperm(from=221, num=0x1786c7a03485c3c0, turn_on=176) [child31:2333] [418] fadvise64(fd=271, offset=0x2afb1db, len=449, advice=0x5) [child31:2333] [418] getpriority(which=0x1, who=2622) [child31:2333] [418] signalfd(ufd=285, user_mask=0x7f37ee466000, sizemask=0) [child31:2333] [418] setresuid(ruid=40, euid=0xffff, suid=-7) ==> trinity-child32.log <== [child32:21672] [199] getresuid(ruid=0x7f37ec26c000, euid=0x7f37ee65a000, suid=0x7f37ee465000) = 0 [child32:21672] [200] getrlimit(resource=0xa, rlim=0x4) = -1 (Bad address) [child32:21672] [201] [32BIT] signalfd(ufd=367, user_mask=0x7f37ec46c000, sizemask=32) ==> trinity-child33.log <== [child33:21677] [231] mq_getsetattr(mqdes=244, 
u_mqstat=0x7f37ee658000, u_omqstat=0x7f37ee658008) = -1 (Invalid argument) [child33:21677] [232] gettid() = 0x54ad [child33:21677] [233] timer_create(which_clock=161, timer_event_spec=0x7f37ec86c000, create_timer_id=0x7f37ec86c008) ==> trinity-child34.log <== [child34:21679] [186] sched_rr_get_interval(pid=0, interval=0x7f37ee466000) = 0 [child34:21679] [187] [32BIT] epoll_pwait(epfd=244, events=0x7f37ee657000, maxevents=0x40300000a, timeout=562) = -1 (Invalid argument) [child34:21679] [188] getpriority(which=0x0, who=1) ==> trinity-child35.log <== [child35:14725] [92] getrandom(buf=0x7f37ee656000, count=0, flags=0x3) = -1 (Invalid argument) [child35:14725] [93] setgid(gid=-4) = -1 (Operation not permitted) [child35:14725] [94] execveat(fd=210, name="/proc/942/net/hidp", argv=0x2bda970, envp=0x2e10190, flags=0x0) = -1 (Operation not permitted) ==> trinity-child36.log <== [child36:21681] [167] sethostname(name=0x7f37ec76c000, len=227) = -1 (Operation not permitted) [child36:21681] [168] munlock(addr=0x7f37ec86c000, len=0x48000) = 0 [child36:21681] [169] execve(name="/proc/980/task/980/net/udplite", argv=0x2e0a1a0, envp=0x2e10190) ==> trinity-child37.log <== [child37:21682] [33] setgid(gid=0x6c6c6c6c) = -1 (Operation not permitted) [child37:21682] [34] syslog(type=0x3, buf=0x7f37ee656000, len=0) = 0 [child37:21682] [35] pwrite64(fd=335, buf=0x2e204b0, count=1220, pos=56) = -1 (Bad file descriptor) ==> trinity-child38.log <== [child38:21689] [107] setpriority(which=0xafafafafafafafaf, who=0xffffffffffff0000, niceval=0x400000000) = -1 (Invalid argument) [child38:21689] [108] newstat(filename="/proc/983/cmdline", statbuf=0x7f37ec66c000) = 0 [child38:21689] [109] mq_getsetattr(mqdes=367, u_mqstat=0x0, u_omqstat=0x0) ==> trinity-child39.log <== [child39:20422] [556] adjtimex(txc_p=0x7f37ee466000) = -1 (Operation not permitted) [child39:20422] [557] ioperm(from=0, num=0xd000, turn_on=55) = -1 (Operation not permitted) [child39:20422] [558] clock_nanosleep(which_clock=0x2, 
flags=0x0, rqtp=0x7f37ec66c000, rmtp=0x7f37ec66c000) ==> trinity-child3.log <== [child3:21597] [146] prlimit64(pid=1, resource=12, new_rlim=0x4, old_rlim=0xc) = -1 (Bad address) [child3:21597] [147] mq_unlink(u_name=0x7f37ec46c000) = -1 (No such file or directory) [child3:21597] [148] [32BIT] rt_sigprocmask(how=0x1616161616161617, set=0x7f37ec76c000, oset=0x7f37ee465000, sigsetsize=128) = -1 (Invalid argument) ==> trinity-child40.log <== [child40:21692] [101] getgid() = 0xfffe [child40:21692] [102] lsetxattr(pathname="/proc/14/task/14/net/udplite6", name=0x7f37ee657000, value=0x7f37ee465000, size=4096, flags=0x0) = -1 (Permission denied) [child40:21692] [103] pwritev2(fd=367, vec=0x2e1a3d0, vlen=92, pos_l=15, pos_h=0xffa4cdcdd0ee0601, flags=0x5) ==> trinity-child41.log <== [child41:3589] [16] sched_get_priority_max(policy=0x34000000000000c3) [child41:3589] [16] timerfd_settime(ufd=308, flags=0x80800, utmr=0x7f37ec66c000, otmr=0x7f37ee466000) [child41:3589] [16] timerfd_settime(ufd=326, flags=0x0, utmr=0x7f37ee465000, otmr=0x7f37ec86c000) = -1 (File exists) [child41:3589] [16] sched_get_priority_max(policy=0x34000000000000c3) [child41:3589] [16] timerfd_settime(ufd=308, flags=0x80800, utmr=0x7f37ec66c000, otmr=0x7f37ee466000) [child41:3589] [16] timerfd_settime(ufd=326, flags=0x0, utmr=0x7f37ee465000, otmr=0x7f37ec86c000) [child41:3589] [16] getresuid(ruid=0x7f37ee467000, euid=0x7f37ee467000, suid=0x7f37ec06c000) [child41:3589] [16] getrusage(who=0x703a9e50fc, ru=0x8) [child41:3589] [16] munlockall() = 0 [child41:3589] [17] chmod(filename="/proc/sys/net/ipv6/route/gc_min_interval", mode=4) ==> trinity-child42.log <== [child42:21696] [165] getrusage(who=-5, ru=0x0) = -1 (Invalid argument) [child42:21696] [166] fremovexattr(fd=168, name=0x7f37ee465000) = -1 (Numerical result out of range) [child42:21696] [167] madvise(start=0x7f37ec06c000, len_in=0xdb000, advice=0x9) = 0 ==> trinity-child43.log <== [child43:21700] [106] readv(fd=244, vec=0x2e1a3d0, vlen=246) = 0 
[child43:21700] [107] swapoff(path="/sys/devices/virtual/tty/tty26/power/runtime_usage") = -1 (Operation not permitted) [child43:21700] [108] linkat(olddfd=244, oldname="/sys/devices/virtual/workqueue/writeback/nice", newdfd=244, newname="/proc/1014/net/nfsfs/volumes", flags=0x0) = -1 (File exists) ==> trinity-child44.log <== [child44:21703] [94] mkdir(pathname="/sys/module/pstore/parameters", mode=0) = -1 (File exists) [child44:21703] [95] sendfile(out_fd=367, in_fd=137, offset=0x1, count=8) = -1 (Bad address) [child44:21703] [96] [32BIT] sync_file_range(fd=367, offset=0x707070707, nbytes=0x8000, flags=0xffffefff) ==> trinity-child45.log <== [child45:21704] [1] fdatasync(fd=381) = -1 (Invalid argument) [child45:21704] [2] munmap(addr=0x7f37ee656000, len=4096) = 0 [child45:21704] [3] pipe2(fildes=0x7f37ec46c000, flags=0x4800) = 0 ==> trinity-child46.log <== [child46:21705] [82] rmdir(pathname="/sys/devices/tracepoint/power/runtime_active_kids") = -1 (Permission denied) [child46:21705] [83] chroot(filename="/proc/15/task/15/net/tcp") = -1 (Not a directory) [child46:21705] [84] [32BIT] ioprio_set(which=-2, who=2048, ioprio=0x646464646464) = -1 (Invalid argument) ==> trinity-child47.log <== [child47:21710] [127] setrlimit(resource=0x7f37ee465000, rlim=16384) = -1 (Bad address) [child47:21710] [128] setns(fd=244, nstype=0x0) = -1 (Invalid argument) [child47:21710] [129] pkey_alloc(flags=0, init_val=0x2) ==> trinity-child48.log <== [child48:21717] [39] access(filename="/proc/10/task/10/net/ip_vs_stats", mode=700) = -1 (Invalid argument) [child48:21717] [40] [32BIT] sigprocmask(how=233, set=0x7f37ec46c000, oset=-9) = -1 (Bad address) [child48:21717] [41] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0) ==> trinity-child49.log <== [child49:21718] [81] sendto(fd=168, buff=0x2e0a540, len=1, flags=0x41110, addr=0x2ba92a0, addr_len=28) = -1 (Socket operation on non-socket) [child49:21718] [82] clock_gettime(which_clock=0x430d4f678edd, tp=0x8) = -1 (Invalid argument) 
[child49:21718] [83] removexattr(pathname="/proc/13/attr/prev", name=0x7f37ec06c000) ==> trinity-child4.log <== [child4:21602] [155] getgroups(gidsetsize=0, grouplist=0x1) = 1 [child4:21602] [156] sendmsg(fd=0, msg=0x2c22c50, flags=0x40010883) = -1 (Socket operation on non-socket) [child4:21602] [157] fchown(fd=330, user=0x20240098000041, group=0x20000) = -1 (Operation not permitted) ==> trinity-child50.log <== [child50:21719] [41] shmctl(shmid=0x4c4c4c4c4c4c4c, cmd=0x2, buf=0x0) = -1 (Invalid argument) [child50:21719] [42] timer_delete(timer_id=0x80000000) = -1 (Invalid argument) [child50:21719] [43] madvise(start=0x7f37ec86c000, len_in=0x20000, advice=0x2) ==> trinity-child51.log <== [child51:21723] [91] msgctl(msqid=0x9f054e8b21, cmd=0x3, buf=0x7f37ee466000) = 1763 [child51:21723] [92] time(tloc=0x7f37ee466000) = 0x584474e5 [child51:21723] [93] timer_gettime(timer_id=1, setting=0x7f37ee466000) ==> trinity-child52.log <== [child52:21728] [70] epoll_ctl(epfd=244, op=0x3, fd=244, event=0x2e0a2f0) = -1 (Operation not permitted) [child52:21728] [71] [32BIT] statfs(pathname="/proc/548/net/atm/vc", buf=0x7f37ec46c000) = -1 (Bad address) [child52:21728] [72] rt_sigaction(sig=52, act=0x7f37ee465000, oact=0x7f37ec06c000, sigsetsize=128) = -1 (Invalid argument) ==> trinity-child53.log <== [child53:21734] [64] lseek(fd=244, offset=0, whence=0x1) = 33 [child53:21734] [65] preadv(fd=244, vec=0x2e10190, vlen=17, pos_l=0x2714319808dd0822, pos_h=128) = 0 [child53:21734] [66] epoll_create1(flags=0x0) = 516 ==> trinity-child54.log <== [child54:21738] [50] ioprio_get(which=0, who=10023) = -1 (Invalid argument) [child54:21738] [51] [32BIT] setregid(rgid=0x1000000000000, egid=0xffff) = -1 (Operation not permitted) [child54:21738] [52] execve(name="/proc/1240/net/rpc/auth.rpcsec.context", argv=0x2e10190, envp=0x2c6bd50) ==> trinity-child55.log <== [child55:21741] [44] recvmsg(fd=381, msg=0x1, flags=0x20000000) = -1 (Socket operation on non-socket) [child55:21741] [45] [32BIT] 
setgroups(gidsetsize=0xf9f9f9f9f9f9f9, grouplist=0x0) = -1 (Operation not permitted) [child55:21741] [46] execveat(fd=168, name="/proc/irq/4/smp_affinity", argv=0x2e10190, envp=0x2e00900, flags=0x100) ==> trinity-child56.log <== [child56:21744] [7] msync(start=0x7f37ee657000, len=0, flags=0x6) = 0 [child56:21744] [8] flock(fd=168, cmd=0xffffffc191f30b0c) = -1 (Invalid argument) [child56:21744] [9] add_key(_type=0x437a15, _description=0x7f37ec06c000, _payload=0x7f37ec06c000, plen=0, ringid=0xfffffffffffffff8) ==> trinity-child57.log <== [child57:21746] [12] [32BIT] mq_open(u_name=0x7f37ec86c000, oflag=8192, mode=2000, u_attr=0x7f37ec86c004) = -1 (Bad address) [child57:21746] [13] getpgid(pid=21226) = 0x52ea [child57:21746] [14] execveat(fd=168, name="/proc/1297/task/1297", argv=0x2bda970, envp=0x2ba92a0, flags=0x0) ==> trinity-child58.log <== [child58:21749] [33] eventfd2(count=4, flags=0x0) = 441 [child58:21749] [34] mknodat(dfd=244, filename="/sys/devices/pci0000:00/0000:00:03.0/net/eth0/queues", mode=0x17104124, dev=0xffffffff0000567f) = -1 (Operation not permitted) [child58:21749] [35] clock_getres(which_clock=158, tp=0x4) = -1 (Invalid argument) ==> trinity-child59.log <== [child59:21754] [21] setfsgid(gid=0x40000000000000) = 0xfffe [child59:21754] [22] getpgrp() = 0 [child59:21754] [23] ppoll(ufds=0x2e0a5d0, nfds=0, tsp=0x2e0a590, sigmask=0x0, sigsetsize=128) ==> trinity-child5.log <== [child5:21604] [287] seccomp(op=0x1, flags=0x0, uargs=0x1) = 0 [child5:21604] [288] rt_sigpending(set=0x7f37ee466000, sigsetsize=0x2323232323232323) = -1 (Invalid argument) [child5:21604] [289] getuid() = 0xfffe ==> trinity-child60.log <== [child60:21758] [0] getsid(pid=1) = 0 [child60:21758] [1] io_cancel(ctx_id=255, iocb=0x7f37ec46c000, result=0x7f37ec46c000) = -1 (Invalid argument) [child60:21758] [2] [32BIT] recvmmsg(fd=367, mmsg=0x1, vlen=4, flags=0x0, timeout=0x7f37ec46c000) ==> trinity-child61.log <== [child61:21760] [5] open(filename="/proc/1240/net/nfsfs", 
flags=0x331702, mode=666) = 444 [child61:21760] [6] recvmmsg(fd=244, mmsg=0x7f37ec66c000, vlen=1, flags=0x0, timeout=0x7f37ee466000) = -1 (Socket operation on non-socket) [child61:21760] [7] swapon(path=".//proc/3/task/3/net/ip_vs_conn/", swap_flags=0x0) ==> trinity-child62.log <== [child62:21428] [668] memfd_create(uname=0x7f37ee465000, flag=0x1) = 546 [child62:21428] [669] newstat(filename="/sys/devices/virtual/net/sit0/statistics/rx_dropped", statbuf=0x8) = -1 (Bad address) [child62:21428] [670] [32BIT] sigpending(set=0x7f37ee467000) = -1 (Bad address) ==> trinity-child63.log <== [child63:21431] [151] getgid() = 0xfffe [child63:21431] [152] set_mempolicy(mode=0xffffffff, nmask=0x0, maxnode=0xbbbbbbbbbbbbbbbb) = -1 (Invalid argument) [child63:21762] [0] ioperm(from=0xb5b5b5b5b5b5b5b6, num=514, turn_on=135) ==> trinity-child64.log <== [child64:21432] [754] utime(filename="/proc/552/net/can/rcvlist_eff", times=0x4) = -1 (Bad address) [child64:21432] [755] fcntl(fd=244, cmd=0x406, arg=244) = 540 [child64:21432] [756] mlock2(start=0x7f37ec06c000, len=0x9d000) = -1 (Cannot allocate memory) ==> trinity-child65.log <== [child65:21435] [140] timer_create(which_clock=0x10000000, timer_event_spec=0x1, create_timer_id=0x7f37ec46c000) = -1 (Invalid argument) [child65:21435] [141] getpeername(fd=152, usockaddr=0x2e0a1a0, usockaddr_len=16) = -1 (Socket operation on non-socket) [child65:21435] [142] ioperm(from=0x54ee6fb85625, num=1, turn_on=128) = -1 (Invalid argument) ==> trinity-child66.log <== [child66:21438] [310] msgctl(msqid=0x12121212121212, cmd=0x1, buf=0x7f37ee465000) = -1 (Invalid argument) [child66:21438] [311] set_robust_list(head=0x7f37ee658000, len=24) = 0 [child66:21438] [312] msgget(key=0x10000000, msgflg=0x80000000000000d1) = -1 (Permission denied) ==> trinity-child67.log <== [child67:21439] [184] pkey_alloc(flags=0, init_val=0x0) = -1 (Invalid argument) [child67:21439] [185] getpgrp() = 0 [child67:21439] [186] [32BIT] getpgrp() = 0 ==> trinity-child68.log <== 
[child68:21443] [90] shmdt(shmaddr=0x7f37ee465000) = 0 [child68:21443] [91] remap_file_pages(start=0x7f37ec86c000, size=4096, prot=0, pgoff=0, flags=0x10000) = 0 [child68:21443] [92] [32BIT] get_mempolicy(policy=0x7f37ee467000, nmask=0x7f37ee467001, maxnode=147, addr=0x7f37ee466000, flags=0x0) = -1 (Invalid argument) ==> trinity-child69.log <== [child69:21444] [283] signalfd(ufd=139, user_mask=0x1, sizemask=8) = -1 (Invalid argument) [child69:21444] [284] utimes(filename="/sys/devices/virtual/tty/tty50/power/runtime_status", utimes=0x7f37ee465000) = -1 (Invalid argument) [child69:21444] [285] fremovexattr(fd=139, name=0x7f37ee465000) = -1 (Permission denied) ==> trinity-child6.log <== [child6:21605] [7] msync(start=0x7f37ee656000, len=0, flags=0x1) = 0 [child6:21605] [8] mq_timedreceive(mqdes=324, u_msg_ptr=0x7f37ec06c000, msg_len=0x41da, u_msg_prio=0x7f37ee656000, u_abs_timeout=0x7f37ee657000) = -1 (Bad file descriptor) [child6:21605] [9] [32BIT] sethostname(name=0x7f37ec76c000, len=0x85b19) = -1 (Operation not permitted) ==> trinity-child70.log <== [child70:21447] [654] inotify_init1(flags=0x80000) = 547 [child70:21447] [655] setns(fd=244, nstype=0x0) = -1 (Invalid argument) [child70:21447] [656] getdents(fd=367, dirent=0x4, count=0) ==> trinity-child71.log <== [child71:21449] [86] fremovexattr(fd=344, name=0x1) = -1 (Bad address) [child71:21449] [87] fadvise64(fd=344, offset=0, len=1, advice=0x2) = 0 [child71:21449] [88] [32BIT] fstat64(fd=285, statbuf=0x7f37ee467000) = -1 (Bad address) ==> trinity-child72.log <== [child72:21138] [88] [32BIT] pread64(fd=210, buf=0x7f37ec06c000, count=1361, pos=0) = -1 (Invalid argument) [child72:21138] [89] clock_settime(which_clock=512, tp=0x7f37ec06c000) = -1 (Invalid argument) [child72:21138] [90] mbind(start=0x7f37ec86c000, len=0xf5000, mode=0x2, nmask=0x1, maxnode=0x8000, flags=0x6) = -1 (Bad address) ==> trinity-child73.log <== [child73:21141] [706] setrlimit(resource=0x7f37ec26c000, rlim=0xfffffffd) = -1 (Bad address) 
[child73:21141] [707] pwrite64(fd=351, buf=0x2e091a0, count=1, pos=0xfffd) = -1 (Bad file descriptor) [child73:21141] [708] time(tloc=0x7f37ee65a000) = 0x584474dd ==> trinity-child74.log <== [child74:21456] [429] setfsgid(gid=-5) = 0xfffe [child74:21456] [430] sethostname(name=0x7f37ec26c000, len=8) = -1 (Operation not permitted) [child74:21456] [431] open(filename="/sys/block/loop2/mq/0/cpu_list", flags=0x333401, mode=666) = -1 (Not a directory) ==> trinity-child75.log <== [child75:21458] [578] accept(fd=316, upeer_sockaddr=0x2e240e0, upeer_addrlen=3729) = -1 (Socket operation on non-socket) [child75:21458] [579] setxattr(pathname="/sys/block/loop5", name=0x7f37ec06c000, value=0x7f37ec86c000, size=8, flags=0x1) = -1 (Permission denied) [child75:21458] [580] sched_setscheduler(pid=0, policy=0x3, param=0x0) = -1 (Invalid argument) ==> trinity-child76.log <== [child76:21461] [546] gettimeofday(tv=0x7f37ee465000, tz=0x7f37ee465004) = -1 (Bad address) [child76:21461] [547] [32BIT] rt_sigaction(sig=32, act=0x7f37ec46c000, oact=0x0, sigsetsize=128) = -1 (Invalid argument) [child76:21461] [548] clock_nanosleep(which_clock=0x2, flags=0x1, rqtp=0x7f37ee656000, rmtp=0x7f37ec06c000) ==> trinity-child77.log <== [child77:21465] [73] swapoff(path="/proc/865/task/865/net/rpc/auth.rpcsec.context") = -1 (Operation not permitted) [child77:21465] [74] sched_getscheduler(pid=21253) = 0 [child77:21465] [75] [32BIT] rt_sigtimedwait(uthese=0xba501c1fc77f1f57, uinfo=0x0, uts=0x7f37ee466000, sigsetsize=0x7f37ee466008) = -1 (Invalid argument) ==> trinity-child78.log <== [child78:21469] [216] mknod(filename="/proc/1023/net/ip_vs_conn", mode=1007, dev=0x1b9008e90080) = -1 (No such file or directory) [child78:21469] [217] geteuid() = 0xfffe [child78:21469] [218] setsockopt(fd=169, level=0xa49a302c26ea, optname=51, optval=0x2e1afd0, optlen=0x5bf09058529076) = -1 (Bad file descriptor) ==> trinity-child79.log <== [child79:21475] [696] setresuid(ruid=-2, euid=43, suid=0x7272727272727272) = -1 
(Operation not permitted) [child79:21475] [697] msync(start=0x7f37ee659000, len=0, flags=0x6) = 0 [child79:21475] [698] io_submit(ctx_id=118, nr=8, iocbpp=0x8) = -1 (Invalid argument) ==> trinity-child7.log <== [child7:21609] [33] prctl(option=22, arg2=2, arg3=0, arg4=0, arg5=0) = -1 (Bad address) [child7:21609] [34] userfaultfd(flags=0x800) = 511 [child7:21609] [35] getuid() = 0xfffe ==> trinity-child80.log <== [child80:21478] [101] settimeofday(tv=0x7f37ec26c000, tz=0x7f37ec46c000) = -1 (Operation not permitted) [child80:21478] [102] getdents(fd=330, dirent=0x0, count=56) = -1 (Not a directory) [child80:21478] [103] epoll_ctl(epfd=330, op=0x2, fd=330, event=0x2e0a390) = -1 (Invalid argument) ==> trinity-child81.log <== [child81:21482] [33] [32BIT] move_pages(pid=0, nr_pages=304, pages=0x2e19fc0, nodes=0x2e1afd0, status=0x2e1b4a0, flags=0x0) = -1 (Bad address) [child81:21482] [34] getpriority(which=0x1, who=21008) = 1 [child81:21482] [35] sethostname(name=0x7f37ee659000, len=0x6aeaf512) = -1 (Operation not permitted) ==> trinity-child82.log <== [child82:21163] [963] keyctl(cmd=0x2, arg2=0xff00000, arg3=0xc928d000, arg4=0x400000, arg5=1) = -1 (Invalid argument) [child82:21163] [964] [32BIT] uselib(library=0x7f37ec46c000) = -1 (Bad address) [child82:21163] [965] mq_getsetattr(mqdes=226, u_mqstat=0x1, u_omqstat=0x2) = -1 (Bad address) ==> trinity-child83.log <== [child83:21486] [531] getrlimit(resource=0xb, rlim=0x7f37ee465000) = -1 (Bad address) [child83:21486] [532] gettid() = 0x53ee [child83:21486] [533] renameat(olddfd=244, oldname="/proc/1252/net/atalk/socket", newdfd=267, newname=".//proc/721/net/irda/ircomm") = -1 (Not a directory) ==> trinity-child84.log <== [child84:21488] [238] mkdir(pathname="/proc/1161/task/1161/mounts", mode=222) = -1 (File exists) [child84:21488] [239] capget(header=0x67002320, dataptr=0xfffaffb7effdfffe) = -1 (Bad address) [child84:21488] [240] rmdir(pathname="/sys/devices/virtual/misc/ocfs2_control/power/runtime_active_time") = -1 
(Permission denied) ==> trinity-child85.log <== [child85:21489] [151] settimeofday(tv=0x7f37ec76c000, tz=0x7f37ec76c008) = -1 (Invalid argument) [child85:21489] [152] fstatat64(dfd=365, filename=".//proc/551/net/ip_vs_conn", statbuf=0x7f37ec26c000, flag=-2) = -1 (Invalid argument) [child85:21489] [153] read(fd=365, buf=0x7f37ee466000, count=4096) = 0 ==> trinity-child86.log <== [child86:21496] [39] sched_getaffinity(pid=21349, len=217, user_mask_ptr=0x4) = -1 (Invalid argument) [child86:21496] [40] semtimedop(semid=149, tsops=0x7f37ec86c000, nsops=-1, timeout=0x7f37ec86c000) = -1 (Argument list too long) [child86:21496] [41] get_robust_list(pid=0, head_ptr=0x7f37ee465000, len_ptr=0x7f37ec66c000) = 0 ==> trinity-child87.log <== [child87:21500] [517] select(n=0, inp=0x2e12fa0, outp=0x2e13030, exp=0x2e130c0, tvp=0x2e09740) = 0 [child87:21500] [518] readlink(path="/proc/14/comm", buf=0x1, bufsiz=223) = -1 (Invalid argument) [child87:21500] [519] setgid(gid=0) = -1 (Operation not permitted) ==> trinity-child88.log <== [child88:21502] [523] [32BIT] socketcall(call=1, args=0x2e1b390) = -1 (Address family not supported by protocol) [child88:21502] [524] newlstat(filename="/proc/722/task/722/net/rpc/nfs", statbuf=0x7f37ec76c000) = 0 [child88:21502] [525] mq_unlink(u_name=0x7f37ec06c000) ==> trinity-child89.log <== [child89:21507] [132] [32BIT] getresuid(ruid=0x7f37ec06c000, euid=0x7f37ec06c001, suid=0x7f37ec06c005) = -1 (Bad address) [child89:21507] [133] uname(name=0x7f37ec26c000) = 0 [child89:21507] [134] shutdown(fd=213, how=0x0) = -1 (Bad file descriptor) ==> trinity-child8.log <== [child8:21610] [211] mlock2(start=0x7f37ec26c000, len=0x95000) = -1 (Cannot allocate memory) [child8:21610] [212] removexattr(pathname="/sys/module/zswap/parameters/compressor", name=0x7f37ec46c000) = -1 (Permission denied) [child8:21610] [213] socket(family=25, type=2048, protocol=5) = -1 (Address family not supported by protocol) ==> trinity-child90.log <== [child90:21508] [36] [32BIT] 
llistxattr(pathname="/proc/1045/net/protocols", list=0x0, size=-6) = 0 [child90:21508] [37] pipe(fildes=0x7f37ee467000) = 0 [child90:21508] [38] ftruncate(fd=183, length=0xffffffffca1f2000) = -1 (Invalid argument) ==> trinity-child91.log <== [child91:21514] [28] unlink(pathname="/proc/924/net/netstat") = -1 (Permission denied) [child91:21514] [29] pivot_root(new_root=0x7f37ee465000, put_old=0x7f37ec06c000) = -1 (Operation not permitted) [child91:21514] [30] sched_setscheduler(pid=21410, policy=0x3, param=0x4) = -1 (Bad address) ==> trinity-child92.log <== [child92:21517] [444] [32BIT] chroot(filename="/proc/980/task/980/net/dev_snmp6/sit0") = -1 (Not a directory) [child92:21517] [445] syslog(type=0x1, buf=0x7f37ec26c000, len=0x48000) = -1 (Operation not permitted) [child92:21517] [446] getrandom(buf=0x7f37ec76c000, count=0x2a000, flags=0x1) = -1 (Invalid argument) ==> trinity-child93.log <== [child93:21518] [477] waitid(which=0xffd9, upid=1, infop=0x7f37ec06c000, options=0x100000, ru=0x7f37ec06c000) = -1 (Invalid argument) [child93:21518] [478] fchmod(fd=168, mode=0) = -1 (Operation not permitted) [child93:21518] [479] clock_nanosleep(which_clock=0x2, flags=0x1, rqtp=0x7f37ec76c000, rmtp=0x7f37ee466000) ==> trinity-child94.log <== [child94:20875] [1515] access(filename="/proc/567/net/atm/pvc", mode=5000) = -1 (Invalid argument) [child94:20875] [1516] seccomp(op=0x1, flags=0x1, uargs=0x4) = -1 (Invalid argument) [child94:20875] [1517] [32BIT] getuid() = 0xfffe ==> trinity-child95.log <== [child95:21520] [94] userfaultfd(flags=0x0) = 516 [child95:21520] [95] eventfd(count=153) = 517 [child95:21520] [96] shmat(shmid=0x4000000, shmaddr=0x7f37ee466000, shmflg=0x3000000000400000) = -1 (Invalid argument) ==> trinity-child96.log <== [child96:21522] [225] pwritev2(fd=324, vec=0x2e19fc0, vlen=70, pos_l=0x2727272727, pos_h=255, flags=0x2) = -1 (Illegal seek) [child96:21522] [226] setfsuid(uid=-8) = 0xfffe [child96:21522] [227] getdents(fd=324, dirent=0x7f37ec06c000, 
count=4096) = -1 (Not a directory) ==> trinity-child97.log <== [child97:21526] [119] [32BIT] statfs64(pathname="/proc/545/task/545/net/can/rcvlist_all", sz=0x8500000000000088) = -1 (Invalid argument) [child97:21526] [120] pipe2(fildes=0x7f37ee465000, flags=0x0) = 0 [child97:21526] [121] utimes(filename="/sys/devices/virtual/tty/ircomm4", utimes=0x0) = -1 (Permission denied) ==> trinity-child98.log <== [child98:21529] [163] setresgid(rgid=20, egid=256, sgid=0xffffffffffff5252) = -1 (Operation not permitted) [child98:21529] [164] llistxattr(pathname="/proc/986/net/dev_snmp6/ip_vti0", list=0x0, size=4) = 0 [child98:21529] [165] getcpu(cpup=0x7f37ee659000, nodep=0x7f37ee659008, unused=0x7f37ee659009) = 0 ==> trinity-child99.log <== [child99:21532] [223] sigaltstack(uss=0x7f37ec86c000, uoss=0x7f37ec86c008, regs=0x7f37ee465000) = -1 (Invalid argument) [child99:21532] [224] vmsplice(fd=320, iov=0x2e1afd0, nr_segs=1016, flags=0xb) = -1 (Bad file descriptor) [child99:21532] [225] [32BIT] setfsgid16(gid=0x2209024001252024) = 0xfffe ==> trinity-child9.log <== [child9:21616] [294] getuid() = 0xfffe [child9:21616] [295] fstatfs(fd=244, buf=0x8) = -1 (Bad address) [child9:21616] [296] mlock2(start=0x7f37ec76c000, len=0xe000) = 0 Vegard ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-12-05 11:10 ` Vegard Nossum @ 2016-12-05 17:09 ` Vegard Nossum 2016-12-05 17:21 ` Dave Jones 2016-12-05 17:55 ` Linus Torvalds 0 siblings, 2 replies; 118+ messages in thread From: Vegard Nossum @ 2016-12-05 17:09 UTC (permalink / raw) To: Dave Jones, Chris Mason, Linus Torvalds, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 5 December 2016 at 12:10, Vegard Nossum <vegard.nossum@gmail.com> wrote: > On 5 December 2016 at 00:04, Vegard Nossum <vegard.nossum@gmail.com> wrote: >> FWIW I hit this as well: >> >> BUG: unable to handle kernel paging request at ffffffff81ff08b7 >> IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30 >> CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: G B 4.9.0-rc7+ #217 > [...] > >> I think you can rule out btrfs in any case, probably block layer as >> well, since it looks like this comes from shmem. > > I should rather say that the VM runs on a 9p root filesystem and it > doesn't use/mount any block devices or disk-based filesystems. > > I have all the trinity logs for the crash if that's useful. I tried a > couple of runs with just the (at the time) in-progress syscalls but it > didn't turn up anything interesting. Otherwise it seems like a lot of > data to go through by hand. I've hit this another 7 times in the past ~3 hours. Three times the address being dereferenced has pointed to iov_iter_init+0xaf (even across a kernel rebuild), three times it has pointed to put_prev_entity+0x55, once to 0x800000008, and twice to 0x292. The fact that it would hit even one of those more than once across runs is pretty suspicious to me, although the ones that point to iov_iter_init and put_prev_entity point to "random" instructions in the sense that they are neither entry points nor return addresses. 
shmem_fault() was always on the stack, but it came from different syscalls: add_key(), newuname(), pipe2(), newstat(), fstat(), clock_settime(), mount(), etc. I also got this warning which is related: ------------[ cut here ]------------ WARNING: CPU: 9 PID: 25045 at lib/list_debug.c:59 __list_del_entry+0x14f/0x1d0 list_del corruption. prev->next should be ffff88014bdc79e8, but was ffff88014bfbfc60 Kernel panic - not syncing: panic_on_warn set ... CPU: 9 PID: 25045 Comm: trinity-c22 Not tainted 4.9.0-rc7+ #219 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 ffff88014bdc7700 ffffffff81fb0861 ffffffff83e74b60 ffff88014bdc77d8 ffffffff84006c00 ffffffff847103e0 ffff88014bdc77c8 ffffffff81515244 0000000041b58ab3 ffffffff844e21c2 ffffffff81515061 ffffffff00000054 Call Trace: [<ffffffff81fb0861>] dump_stack+0x83/0xb2 [<ffffffff81515244>] panic+0x1e3/0x3ad [<ffffffff81515061>] ? percpu_up_read_preempt_enable.constprop.45+0xcb/0xcb [<ffffffff82016f7f>] ? __list_del_entry+0x14f/0x1d0 [<ffffffff812708bf>] __warn+0x1bf/0x1e0 [<ffffffff8135f2d2>] ? __lock_acquire.isra.32+0xc2/0x1a30 [<ffffffff8127098c>] warn_slowpath_fmt+0xac/0xd0 [<ffffffff812708e0>] ? __warn+0x1e0/0x1e0 [<ffffffff813530c0>] ? finish_wait+0xb0/0x180 [<ffffffff82016f7f>] __list_del_entry+0x14f/0x1d0 [<ffffffff813530c0>] ? finish_wait+0xb0/0x180 [<ffffffff813530cb>] finish_wait+0xbb/0x180 [<ffffffff81576227>] shmem_fault+0x4c7/0x6b0 [<ffffffff81574743>] ? shmem_getpage_gfp+0x673/0x1c90 [<ffffffff81575d60>] ? shmem_getpage_gfp+0x1c90/0x1c90 [<ffffffff81352150>] ? wake_atomic_t_function+0x210/0x210 [<ffffffff815ad316>] __do_fault+0x206/0x410 [<ffffffff815ad110>] ? do_page_mkwrite+0x320/0x320 [<ffffffff815b80ac>] ? handle_mm_fault+0x1cc/0x2a60 [<ffffffff815b8fd7>] handle_mm_fault+0x10f7/0x2a60 [<ffffffff815b8012>] ? handle_mm_fault+0x132/0x2a60 [<ffffffff81310a7f>] ? thread_group_cputime+0x49f/0x6e0 [<ffffffff815b7ee0>] ? __pmd_alloc+0x370/0x370 [<ffffffff81310a9c>] ? 
thread_group_cputime+0x4bc/0x6e0 [<ffffffff81310d2d>] ? thread_group_cputime_adjusted+0x6d/0xe0 [<ffffffff81237170>] ? __do_page_fault+0x220/0x9f0 [<ffffffff815cba10>] ? find_vma+0x30/0x150 [<ffffffff812373a2>] __do_page_fault+0x452/0x9f0 [<ffffffff81237bf5>] trace_do_page_fault+0x1e5/0x3a0 [<ffffffff8122a007>] do_async_page_fault+0x27/0xa0 [<ffffffff83c97618>] async_page_fault+0x28/0x30 [<ffffffff81fdec7c>] ? copy_user_generic_string+0x2c/0x40 [<ffffffff812b0303>] ? SyS_times+0x93/0x110 [<ffffffff812b0270>] ? do_sys_times+0x2b0/0x2b0 [<ffffffff812b0270>] ? do_sys_times+0x2b0/0x2b0 [<ffffffff8100524f>] do_syscall_64+0x1af/0x4d0 [<ffffffff83c96534>] entry_SYSCALL64_slow_path+0x25/0x25 ------------[ cut here ]------------ The warning shows that it made it past the list_empty_careful() check in finish_wait() but then bugs out on the &wait->task_list dereference. Anything stick out? Vegard ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-12-05 17:09 ` Vegard Nossum @ 2016-12-05 17:21 ` Dave Jones 2016-12-05 17:55 ` Linus Torvalds 1 sibling, 0 replies; 118+ messages in thread From: Dave Jones @ 2016-12-05 17:21 UTC (permalink / raw) To: Vegard Nossum Cc: Chris Mason, Linus Torvalds, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Mon, Dec 05, 2016 at 06:09:29PM +0100, Vegard Nossum wrote: > On 5 December 2016 at 12:10, Vegard Nossum <vegard.nossum@gmail.com> wrote: > > On 5 December 2016 at 00:04, Vegard Nossum <vegard.nossum@gmail.com> wrote: > >> FWIW I hit this as well: > >> > >> BUG: unable to handle kernel paging request at ffffffff81ff08b7 > >> IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30 > >> CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: G B 4.9.0-rc7+ #217 > > [...] > > > >> I think you can rule out btrfs in any case, probably block layer as > >> well, since it looks like this comes from shmem. > > > > I should rather say that the VM runs on a 9p root filesystem and it > > doesn't use/mount any block devices or disk-based filesystems. > > > > I have all the trinity logs for the crash if that's useful. I tried a > > couple of runs with just the (at the time) in-progress syscalls but it > > didn't turn up anything interesting. Otherwise it seems like a lot of > > data to go through by hand. > > I've hit this another 7 times in the past ~3 hours. > > Three times the address being dereferenced has pointed to > iov_iter_init+0xaf (even across a kernel rebuild), three times it has > pointed to put_prev_entity+0x55, once to 0x800000008, and twice to > 0x292. The fact that it would hit even one of those more than once > across runs is pretty suspicious to me, although the ones that point > to iov_iter_init and put_prev_entity point to "random" instructions in > the sense that they are neither entry points nor return addresses. 
> > shmem_fault() was always on the stack, but it came from different > syscalls: add_key(), newuname(), pipe2(), newstat(), fstat(), > clock_settime(), mount(), etc. > ------------[ cut here ]------------ > The warning shows that it made it past the list_empty_careful() check > in finish_wait() but then bugs out on the &wait->task_list > dereference. I just pushed out the ftrace changes I made to Trinity that might help you gather more clues. Right now it's hardcoded to dump a trace to /boot/trace.txt when it detects the kernel has become tainted. Before a trinity run, I run this as root..

#!/bin/sh
cd /sys/kernel/debug/tracing/
echo 10000 > buffer_size_kb
echo function >> current_tracer
for i in $(cat /home/davej/blacklist-symbols)
do
  echo $i >> set_ftrace_notrace
done
echo 1 >> tracing_on

blacklist-symbols is the more noisy stuff that pollutes traces. Right now I use these: https://paste.fedoraproject.org/499794/14809582/ You may need to add some more. (I'll get around to making all this scripting go away, and have trinity just set this stuff up itself eventually) Oh, and if you're not running as root, you might need a diff like below so that trinity can stop the trace when it detects tainting. Dave

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8696ce6bf2f6..2d6c97e871e0 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7217,7 +7217,7 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
 	trace_create_file("trace_clock", 0644, d_tracer, tr,
 			  &trace_clock_fops);
 
-	trace_create_file("tracing_on", 0644, d_tracer,
+	trace_create_file("tracing_on", 0666, d_tracer,
 			  tr, &rb_simple_fops);
 
 	create_trace_options_dir(tr);

^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-12-05 17:09 ` Vegard Nossum 2016-12-05 17:21 ` Dave Jones @ 2016-12-05 17:55 ` Linus Torvalds 2016-12-05 19:11 ` Vegard Nossum 1 sibling, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-12-05 17:55 UTC (permalink / raw) To: Vegard Nossum Cc: Dave Jones, Chris Mason, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner [-- Attachment #1: Type: text/plain, Size: 1597 bytes --] On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum <vegard.nossum@gmail.com> wrote: > > The warning shows that it made it past the list_empty_careful() check > in finish_wait() but then bugs out on the &wait->task_list > dereference. > > Anything stick out? I hate that shmem waitqueue garbage. It's really subtle. I think the problem is that "wake_up_all()" in shmem_fallocate() doesn't necessarily wake up everything. It wakes up TASK_NORMAL - which does include TASK_UNINTERRUPTIBLE, but doesn't actually mean "everything on the list". I think that what happens is that the waiters somehow move from TASK_UNINTERRUPTIBLE to TASK_RUNNING early, and this means that wake_up_all() will ignore them, leave them on the list, and now that list on stack is no longer empty at the end. And the way *THAT* can happen is that the task is on some *other* waitqueue as well, and that other waitqueue wakes it up. That's not impossible, you can certainly have people on wait-queues that still take faults. Or somebody just uses a directed wake_up_process() or something. Since you apparently can recreate this fairly easily, how about trying this stupid patch? NOTE! This is entirely untested. I may have screwed this up entirely. 
And add the warning to see if this actually ever triggers (and because I'd like to see the callchain when it does, to see if it's another waitqueue somewhere or what..) Linus

[-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 469 bytes --]

diff --git a/mm/shmem.c b/mm/shmem.c
index 166ebf5d2bce..a80148b43476 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2665,6 +2665,8 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 			spin_lock(&inode->i_lock);
 			inode->i_private = NULL;
 			wake_up_all(&shmem_falloc_waitq);
+			if (WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.task_list)))
+				list_del(&shmem_falloc_waitq.task_list);
 			spin_unlock(&inode->i_lock);
 			error = 0;
 			goto out;

^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-12-05 17:55 ` Linus Torvalds @ 2016-12-05 19:11 ` Vegard Nossum 2016-12-05 20:10 ` Linus Torvalds 2016-12-05 20:10 ` Vegard Nossum 0 siblings, 2 replies; 118+ messages in thread From: Vegard Nossum @ 2016-12-05 19:11 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Chris Mason, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner [-- Attachment #1: Type: text/plain, Size: 2740 bytes --] On 5 December 2016 at 18:55, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum <vegard.nossum@gmail.com> wrote: >> >> The warning shows that it made it past the list_empty_careful() check >> in finish_wait() but then bugs out on the &wait->task_list >> dereference. >> >> Anything stick out? > > I hate that shmem waitqueue garbage. It's really subtle. > > I think the problem is that "wake_up_all()" in shmem_fallocate() > doesn't necessarily wake up everything. It wakes up TASK_NORMAL - > which does include TASK_UNINTERRUPTIBLE, but doesn't actually mean > "everything on the list". > > I think that what happens is that the waiters somehow move from > TASK_UNINTERRUPTIBLE to TASK_RUNNING early, and this means that > wake_up_all() will ignore them, leave them on the list, and now that > list on stack is no longer empty at the end. > > And the way *THAT* can happen is that the task is on some *other* > waitqueue as well, and that other waiqueue wakes it up. That's not > impossible, you can certainly have people on wait-queues that still > take faults. > > Or somebody just uses a directed wake_up_process() or something. > > Since you apparently can recreate this fairly easily, how about trying > this stupid patch? > > NOTE! This is entirely untested. I may have screwed this up entirely. 
> You get the idea, though - just remove the wait queue head from the > list - the list entries stay around, but nothing points to the stack > entry (that we're going to free) any more. > > And add the warning to see if this actually ever triggers (and because > I'd like to see the callchain when it does, to see if it's another > waitqueue somewhere or what..) ------------[ cut here ]------------ WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0 Kernel panic - not syncing: panic_on_warn set ... CPU: 22 PID: 14012 Comm: trinity-c73 Not tainted 4.9.0-rc7+ #220 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 ffff8801e32af970 ffffffff81fb08c1 ffffffff83e74b60 ffff8801e32afa48 ffffffff83ed7600 ffffffff847103e0 ffff8801e32afa38 ffffffff81515244 0000000041b58ab3 ffffffff844e21da ffffffff81515061 ffffffff8151591e Call Trace: [<ffffffff81fb08c1>] dump_stack+0x83/0xb2 [<ffffffff81515244>] panic+0x1e3/0x3ad [<ffffffff812708bf>] __warn+0x1bf/0x1e0 [<ffffffff81270aac>] warn_slowpath_null+0x2c/0x40 [<ffffffff8157aef7>] shmem_fallocate+0x9a7/0xac0 [<ffffffff8167c6c0>] vfs_fallocate+0x350/0x620 [<ffffffff815ee5c2>] SyS_madvise+0x432/0x1290 [<ffffffff8100524f>] do_syscall_64+0x1af/0x4d0 [<ffffffff83c965b4>] entry_SYSCALL64_slow_path+0x25/0x25 ------------[ cut here ]------------ Attached a full log. Vegard [-- Attachment #2: 0.txt.gz --] [-- Type: application/x-gzip, Size: 18919 bytes --] ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-12-05 19:11 ` Vegard Nossum @ 2016-12-05 20:10 ` Linus Torvalds 2016-12-05 20:35 ` Linus Torvalds 2016-12-05 20:10 ` Vegard Nossum 1 sibling, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-12-05 20:10 UTC (permalink / raw) To: Vegard Nossum Cc: Dave Jones, Chris Mason, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner [-- Attachment #1: Type: text/plain, Size: 906 bytes --] On Mon, Dec 5, 2016 at 11:11 AM, Vegard Nossum <vegard.nossum@gmail.com> wrote: > > ------------[ cut here ]------------ > WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0 Ok, good. So that's confirmed as the cause of this problem. And the call chain that I wanted is obviously completely uninteresting, because it's the call chain on the other side (the page fault side) that would show the nested wake queue behavior. I was just being stupid about it. I wonder if we have any other places where we just blithely assume that "wake_up_all()" will actually empty the whole wait queue. It's _usually_ true, but as noted, nested waiting does happen. Anyway, can you try this patch instead? It should actually cause the wake_up_all() to always remove all entries, and thus the WARN_ON() should no longer happen (and I removed the "list_del()" hackery). Linus

[-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 1440 bytes --]

diff --git a/mm/shmem.c b/mm/shmem.c
index 166ebf5d2bce..17beb44e9f4f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1848,6 +1848,19 @@ alloc_nohuge:	page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
 	return error;
 }
 
+/*
+ * This is like autoremove_wake_function, but it removes the wait queue
+ * entry unconditionally - even if something else had already woken the
+ * target.
+ */
+static int synchronous_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	int ret = default_wake_function(wait, mode, sync, key);
+	list_del_init(&wait->task_list);
+	return ret;
+}
+
+
 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vma->vm_file);
@@ -1883,7 +1896,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 			    vmf->pgoff >= shmem_falloc->start &&
 			    vmf->pgoff < shmem_falloc->next) {
 				wait_queue_head_t *shmem_falloc_waitq;
-				DEFINE_WAIT(shmem_fault_wait);
+				DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
 
 				ret = VM_FAULT_NOPAGE;
 				if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
@@ -2665,6 +2678,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 			spin_lock(&inode->i_lock);
 			inode->i_private = NULL;
 			wake_up_all(&shmem_falloc_waitq);
+			WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.task_list));
 			spin_unlock(&inode->i_lock);
 			error = 0;
 			goto out;

^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-12-05 20:10 ` Linus Torvalds @ 2016-12-05 20:35 ` Linus Torvalds 2016-12-05 21:33 ` Vegard Nossum 2016-12-06 8:16 ` Peter Zijlstra 0 siblings, 2 replies; 118+ messages in thread From: Linus Torvalds @ 2016-12-05 20:35 UTC (permalink / raw) To: Vegard Nossum, Ingo Molnar, Peter Zijlstra Cc: Dave Jones, Chris Mason, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner [-- Attachment #1: Type: text/plain, Size: 1664 bytes --] Adding the scheduler people to the participants list, and re-attaching the patch, because while this patch is internal to the VM code, the issue itself is not. There might well be other cases where somebody goes "wake_up_all()" will wake everybody up, so I can put the wait queue head on the stack, and then after I've woken people up I can return". Ingo/PeterZ: the reason that does *not* work is that "wake_up_all()" does make sure that everybody is woken up, but the usual autoremove wake function only removes the wakeup entry if the process was woken up by that particular wakeup. If something else had woken it up, the entry remains on the list, and the waker in this case returned with the wait head still containing entries. Which is deadly when the wait queue head is on the stack. So I'm wondering if we should make that "synchronous_wake_function()" available generally, and maybe introduce a DEFINE_WAIT_SYNC() helper that uses it. Of course, I'm really hoping that this shmem.c use is the _only_ such case. But I doubt it. Comments? Note for Ingo and Peter: this patch has not been tested at all. But Vegard did test an earlier patch of mine that just verified that yes, the issue really was that wait queue entries remained on the wait queue head just as we were about to return and free it. Linus On Mon, Dec 5, 2016 at 12:10 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Anyway, can you try this patch instead? 
It should actually cause the > wake_up_all() to always remove all entries, and thus the WARN_ON() > should no longer happen (and I removed the "list_del()" hackery). > > Linus

[-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 1440 bytes --]

diff --git a/mm/shmem.c b/mm/shmem.c
index 166ebf5d2bce..17beb44e9f4f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1848,6 +1848,19 @@ alloc_nohuge:	page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
 	return error;
 }
 
+/*
+ * This is like autoremove_wake_function, but it removes the wait queue
+ * entry unconditionally - even if something else had already woken the
+ * target.
+ */
+static int synchronous_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	int ret = default_wake_function(wait, mode, sync, key);
+	list_del_init(&wait->task_list);
+	return ret;
+}
+
+
 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vma->vm_file);
@@ -1883,7 +1896,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 			    vmf->pgoff >= shmem_falloc->start &&
 			    vmf->pgoff < shmem_falloc->next) {
 				wait_queue_head_t *shmem_falloc_waitq;
-				DEFINE_WAIT(shmem_fault_wait);
+				DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
 
 				ret = VM_FAULT_NOPAGE;
 				if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
@@ -2665,6 +2678,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 			spin_lock(&inode->i_lock);
 			inode->i_private = NULL;
 			wake_up_all(&shmem_falloc_waitq);
+			WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.task_list));
 			spin_unlock(&inode->i_lock);
 			error = 0;
 			goto out;

^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-12-05 20:35 ` Linus Torvalds @ 2016-12-05 21:33 ` Vegard Nossum 2016-12-06 8:42 ` Vegard Nossum 2016-12-06 8:16 ` Peter Zijlstra 1 sibling, 1 reply; 118+ messages in thread From: Vegard Nossum @ 2016-12-05 21:33 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Peter Zijlstra, Dave Jones, Chris Mason, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 5 December 2016 at 21:35, Linus Torvalds <torvalds@linux-foundation.org> wrote: > Note for Ingo and Peter: this patch has not been tested at all. But > Vegard did test an earlier patch of mine that just verified that yes, > the issue really was that wait queue entries remained on the wait > queue head just as we were about to return and free it. The second patch has been running for 1h+ without any problems of any kind. I should typically have seen 2 crashes by now. I'll let it run overnight to be sure. Vegard ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-12-05 21:33 ` Vegard Nossum @ 2016-12-06 8:42 ` Vegard Nossum 0 siblings, 0 replies; 118+ messages in thread From: Vegard Nossum @ 2016-12-06 8:42 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Peter Zijlstra, Dave Jones, Chris Mason, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 5 December 2016 at 22:33, Vegard Nossum <vegard.nossum@gmail.com> wrote: > On 5 December 2016 at 21:35, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> Note for Ingo and Peter: this patch has not been tested at all. But >> Vegard did test an earlier patch of mine that just verified that yes, >> the issue really was that wait queue entries remained on the wait >> queue head just as we were about to return and free it. > > The second patch has been running for 1h+ without any problems of any > kind. I should typically have seen 2 crashes by now. I'll let it run > overnight to be sure. Alright, so nearly 12 hours later I don't see either the new warning or the original crash at all, so feel free to add: Tested-by: Vegard Nossum <vegard.nossum@oracle.com>. That said, my 8 VMs had all panicked in some way due to OOMs (which is new since v4.8), although some got page allocation stalls for >20s and died because "khugepaged blocked for more than 120 seconds", others got "Out of memory and no killable processes". Vegard ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-12-05 20:35 ` Linus Torvalds 2016-12-05 21:33 ` Vegard Nossum @ 2016-12-06 8:16 ` Peter Zijlstra 2016-12-06 8:36 ` Ingo Molnar 2016-12-06 16:33 ` Linus Torvalds 1 sibling, 2 replies; 118+ messages in thread From: Peter Zijlstra @ 2016-12-06 8:16 UTC (permalink / raw) To: Linus Torvalds Cc: Vegard Nossum, Ingo Molnar, Dave Jones, Chris Mason, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Mon, Dec 05, 2016 at 12:35:52PM -0800, Linus Torvalds wrote: > Adding the scheduler people to the participants list, and re-attaching > the patch, because while this patch is internal to the VM code, the > issue itself is not. > > There might well be other cases where somebody goes "wake_up_all()" > will wake everybody up, so I can put the wait queue head on the stack, > and then after I've woken people up I can return". > > Ingo/PeterZ: the reason that does *not* work is that "wake_up_all()" > does make sure that everybody is woken up, but the usual autoremove > wake function only removes the wakeup entry if the process was woken > up by that particular wakeup. If something else had woken it up, the > entry remains on the list, and the waker in this case returned with > the wait head still containing entries. > > Which is deadly when the wait queue head is on the stack. Yes, very much so. > So I'm wondering if we should make that "synchronous_wake_function()" > available generally, and maybe introduce a DEFINE_WAIT_SYNC() helper > that uses it. We could also do some debug code that tracks the ONSTACK ness and warns on autoremove. Something like the below, equally untested. > > Of course, I'm really hoping that this shmem.c use is the _only_ such > case. But I doubt it. 
$ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l 28 --- diff --git a/include/linux/wait.h b/include/linux/wait.h index 2408e8d5c05c..199faaa89847 100644 --- a/include/linux/wait.h +++ b/include/linux/wait.h @@ -39,6 +39,9 @@ struct wait_bit_queue { struct __wait_queue_head { spinlock_t lock; struct list_head task_list; +#ifdef CONFIG_DEBUG_WAITQUEUE + int onstack; +#endif }; typedef struct __wait_queue_head wait_queue_head_t; @@ -56,9 +59,18 @@ struct task_struct; #define DECLARE_WAITQUEUE(name, tsk) \ wait_queue_t name = __WAITQUEUE_INITIALIZER(name, tsk) -#define __WAIT_QUEUE_HEAD_INITIALIZER(name) { \ +#ifdef CONFIG_DEBUG_WAITQUEUE +#define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack), +#else +#define ___WAIT_QUEUE_ONSTACK(onstack) +#endif + +#define ___WAIT_QUEUE_HEAD_INITIALIZER(name, onstack) { \ .lock = __SPIN_LOCK_UNLOCKED(name.lock), \ - .task_list = { &(name).task_list, &(name).task_list } } + .task_list = { &(name).task_list, &(name).task_list }, \ + ___WAIT_QUEUE_ONSTACK(onstack) } + +#define __WAIT_QUEUE_HEAD_INITIALIZER(name) ___WAIT_QUEUE_HEAD_INITIALIZER(name, 0) #define DECLARE_WAIT_QUEUE_HEAD(name) \ wait_queue_head_t name = __WAIT_QUEUE_HEAD_INITIALIZER(name) @@ -80,11 +92,12 @@ extern void __init_waitqueue_head(wait_queue_head_t *q, const char *name, struct #ifdef CONFIG_LOCKDEP # define __WAIT_QUEUE_HEAD_INIT_ONSTACK(name) \ - ({ init_waitqueue_head(&name); name; }) + ({ init_waitqueue_head(&name); (name).onstack = 1; name; }) # define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \ wait_queue_head_t name = __WAIT_QUEUE_HEAD_INIT_ONSTACK(name) #else -# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) DECLARE_WAIT_QUEUE_HEAD(name) +# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \ + wait_queue_head_t name = ___WAIT_QUEUE_HEAD_INITIALIZER(name, 1) #endif static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p) diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c index 9453efe9b25a..746d00117d08 100644 --- 
a/kernel/sched/wait.c +++ b/kernel/sched/wait.c @@ -156,6 +156,13 @@ void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive) } EXPORT_SYMBOL_GPL(__wake_up_sync); /* For internal use only */ +static inline void prepare_debug(wait_queue_head_t *q, wait_queue_t *wait) +{ +#ifdef CONFIG_DEBUG_WAITQUEUE + WARN_ON_ONCE(q->onstack && wait->func == autoremove_wake_function) +#endif +} + /* * Note: we use "set_current_state()" _after_ the wait-queue add, * because we need a memory barrier there on SMP, so that any @@ -178,6 +185,7 @@ prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state) if (list_empty(&wait->task_list)) __add_wait_queue(q, wait); set_current_state(state); + prepare_debug(q, wait); spin_unlock_irqrestore(&q->lock, flags); } EXPORT_SYMBOL(prepare_to_wait); @@ -192,6 +200,7 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state) if (list_empty(&wait->task_list)) __add_wait_queue_tail(q, wait); set_current_state(state); + prepare_debug(q, wait); spin_unlock_irqrestore(&q->lock, flags); } EXPORT_SYMBOL(prepare_to_wait_exclusive); @@ -235,6 +244,7 @@ long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state) } set_current_state(state); } + prepare_debug(q, wait); spin_unlock_irqrestore(&q->lock, flags); return ret; diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 9bb7d825ba14..af2ef22a5b2b 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1235,6 +1235,14 @@ config DEBUG_PI_LIST If unsure, say N. +config DEBUG_WAITQUEUE + bool "Debug waitqueue" + depends on DEBUG_KERNEL + help + Enable this to do sanity checking on waitqueue usage + + If unsure, say N. + config DEBUG_SG bool "Debug SG table operations" depends on DEBUG_KERNEL ^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-12-06 8:16 ` Peter Zijlstra @ 2016-12-06 8:36 ` Ingo Molnar 2016-12-06 16:33 ` Linus Torvalds 1 sibling, 0 replies; 118+ messages in thread From: Ingo Molnar @ 2016-12-06 8:36 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Vegard Nossum, Dave Jones, Chris Mason, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner * Peter Zijlstra <peterz@infradead.org> wrote: > $ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l > 28 This debug facility looks sensible. A couple of minor suggestions: > --- a/include/linux/wait.h > +++ b/include/linux/wait.h > @@ -39,6 +39,9 @@ struct wait_bit_queue { > struct __wait_queue_head { > spinlock_t lock; > struct list_head task_list; > +#ifdef CONFIG_DEBUG_WAITQUEUE > + int onstack; > +#endif The structure will pack better in the debug-enabled case if 'onstack' is next to 'lock', as spinlock_t is 4 bytes on many architectures. > -#define __WAIT_QUEUE_HEAD_INITIALIZER(name) { \ > +#ifdef CONFIG_DEBUG_WAITQUEUE > +#define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack), > +#else > +#define ___WAIT_QUEUE_ONSTACK(onstack) > +#endif Please help readers by showing the internal structure of the definition: #ifdef CONFIG_DEBUG_WAITQUEUE # define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack), #else # define ___WAIT_QUEUE_ONSTACK(onstack) #endif > +static inline void prepare_debug(wait_queue_head_t *q, wait_queue_t *wait) > +{ > +#ifdef CONFIG_DEBUG_WAITQUEUE > + WARN_ON_ONCE(q->onstack && wait->func == autoremove_wake_function) > +#endif > +} I'd name this debug_waitqueue_check() or such - as the 'prepare' is a bit misleadig (we don't prepare debugging, we do the debug check here). > +config DEBUG_WAITQUEUE > + bool "Debug waitqueue" > + depends on DEBUG_KERNEL I'd name it DEBUG_SCHED_WAITQUEUE=y and I'd also make it depend on CONFIG_DEBUG_SCHED. LGTM otherwise! 
Thanks, Ingo ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-12-06 8:16 ` Peter Zijlstra 2016-12-06 8:36 ` Ingo Molnar @ 2016-12-06 16:33 ` Linus Torvalds 1 sibling, 0 replies; 118+ messages in thread From: Linus Torvalds @ 2016-12-06 16:33 UTC (permalink / raw) To: Peter Zijlstra Cc: Vegard Nossum, Ingo Molnar, Dave Jones, Chris Mason, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Tue, Dec 6, 2016 at 12:16 AM, Peter Zijlstra <peterz@infradead.org> wrote: >> >> Of course, I'm really hoping that this shmem.c use is the _only_ such >> case. But I doubt it. > > $ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l > 28 Hmm. Most of them seem to be ok, because they use "wait_event()", which will always remove itself from the wait-queue. And they do it from the place that allocates the wait-queue. IOW, the mm/shmem.c case really was fairly special, because it just did "prepare_to_wait()", and then did a finish_wait() - and not in the thread that allocated it on the stack. So it's really that "some _other_ thread allocated the waitqueue on the stack, and now we're doing a wait on it" that is bad. So the normal pattern seems to be: - allocate wq on the stack - pass it on to a waker - wait for it and that's ok, because as part of "wait for it" we will also be cleaning things up. The reason mm/shmem.c was buggy was that it did - allocate wq on the stack - pass it on to somebody else to wait for - wake it up and *that* is buggy, because it's the waiter, not the waker, that normally cleans things up. Is there some good way to find this kind of pattern automatically, I wonder.... Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-12-05 19:11 ` Vegard Nossum 2016-12-05 20:10 ` Linus Torvalds @ 2016-12-05 20:10 ` Vegard Nossum 1 sibling, 0 replies; 118+ messages in thread From: Vegard Nossum @ 2016-12-05 20:10 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Chris Mason, Jens Axboe, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 5 December 2016 at 20:11, Vegard Nossum <vegard.nossum@gmail.com> wrote: > On 5 December 2016 at 18:55, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum <vegard.nossum@gmail.com> wrote: >> Since you apparently can recreate this fairly easily, how about trying >> this stupid patch? >> >> NOTE! This is entirely untested. I may have screwed this up entirely. >> You get the idea, though - just remove the wait queue head from the >> list - the list entries stay around, but nothing points to the stack >> entry (that we're going to free) any more. >> >> And add the warning to see if this actually ever triggers (and because >> I'd like to see the callchain when it does, to see if it's another >> waitqueue somewhere or what..) > > ------------[ cut here ]------------ > WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0 > Kernel panic - not syncing: panic_on_warn set ... So I noticed that panic_on_warn just after sending the email and I've been waiting for it it to trigger again. The warning has triggered twice more without panic_on_warn set and I haven't seen any crash yet. Vegard ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-12-04 23:04 ` bio linked list corruption Vegard Nossum 2016-12-05 11:10 ` Vegard Nossum @ 2016-12-05 18:11 ` Andy Lutomirski 2016-12-05 18:25 ` Linus Torvalds 2016-12-05 18:26 ` Vegard Nossum 1 sibling, 2 replies; 118+ messages in thread From: Andy Lutomirski @ 2016-12-05 18:11 UTC (permalink / raw) To: Vegard Nossum, Borislav Petkov Cc: Dave Jones, Chris Mason, Linus Torvalds, Jens Axboe, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Sun, Dec 4, 2016 at 3:04 PM, Vegard Nossum <vegard.nossum@gmail.com> wrote: > On 23 November 2016 at 20:58, Dave Jones <davej@codemonkey.org.uk> wrote: >> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: >> >> > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 >> > trace from just before this happened. Does this shed any light ? >> > >> > https://codemonkey.org.uk/junk/trace.txt >> >> crap, I just noticed the timestamps in the trace come from quite a bit >> later. I'll tweak the code to do the taint checking/ftrace stop after >> every syscall, that should narrow the window some more. > > FWIW I hit this as well: > > BUG: unable to handle kernel paging request at ffffffff81ff08b7 We really ought to improve this message. If nothing else, it should say whether it was a read, a write, or an instruction fetch. > IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30 > PGD 461e067 PUD 461f063 > PMD 1e001e1 Too lazy to manually decode this right now, but I don't think it matters. > Oops: 0003 [#1] PREEMPT SMP KASAN Note this is SMP, but that just means CONFIG_SMP=y. Vegard, how many CPUs does your kernel think you have? > RIP: 0010:[<ffffffff8135f2ea>] [<ffffffff8135f2ea>] > __lock_acquire.isra.32+0xda/0x1a30 > RSP: 0018:ffff8801bab8f730 EFLAGS: 00010082 > RAX: ffffffff81ff071f RBX: 0000000000000000 RCX: 0000000000000000 RAX points to kernel text. 
> Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85 > c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <3e> > ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00 2b: 3e ff 80 98 01 00 00 incl %ds:*0x198(%rax) <-- trapping instruction That's very strange. I think this is: atomic_inc((atomic_t *)&class->ops); but my kernel contains: 3cb4: f0 ff 80 98 01 00 00 lock incl 0x198(%rax) So your kernel has been smp-alternatived. That 3e comes from alternatives_smp_unlock. If you're running on SMP with UP alternatives, things will break. What's your kernel command line? Can we have your entire kernel log from boot? Adding Borislav, since he's the guru for this code. ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-12-05 18:11 ` Andy Lutomirski @ 2016-12-05 18:25 ` Linus Torvalds 2016-12-05 18:26 ` Vegard Nossum 1 sibling, 0 replies; 118+ messages in thread From: Linus Torvalds @ 2016-12-05 18:25 UTC (permalink / raw) To: Andy Lutomirski Cc: Vegard Nossum, Borislav Petkov, Dave Jones, Chris Mason, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Mon, Dec 5, 2016 at 10:11 AM, Andy Lutomirski <luto@kernel.org> wrote: > > So your kernel has been smp-alternatived. That 3e comes from > alternatives_smp_unlock. If you're running on SMP with UP > alternatives, things will break. I'm assuming he's just running in a VM with a single CPU. The problem that I pointed out with assuming wake_up_all() actually removes all wait queue entries does not depend on SMP. The race is much more fundamental and long-lived. Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-12-05 18:11 ` Andy Lutomirski 2016-12-05 18:25 ` Linus Torvalds @ 2016-12-05 18:26 ` Vegard Nossum 1 sibling, 0 replies; 118+ messages in thread From: Vegard Nossum @ 2016-12-05 18:26 UTC (permalink / raw) To: Andy Lutomirski Cc: Borislav Petkov, Dave Jones, Chris Mason, Linus Torvalds, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner [-- Attachment #1: Type: text/plain, Size: 4179 bytes --] On 5 December 2016 at 19:11, Andy Lutomirski <luto@kernel.org> wrote: > On Sun, Dec 4, 2016 at 3:04 PM, Vegard Nossum <vegard.nossum@gmail.com> wrote: >> On 23 November 2016 at 20:58, Dave Jones <davej@codemonkey.org.uk> wrote: >>> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: >>> >>> > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 >>> > trace from just before this happened. Does this shed any light ? >>> > >>> > https://codemonkey.org.uk/junk/trace.txt >>> >>> crap, I just noticed the timestamps in the trace come from quite a bit >>> later. I'll tweak the code to do the taint checking/ftrace stop after >>> every syscall, that should narrow the window some more. >> >> FWIW I hit this as well: >> >> BUG: unable to handle kernel paging request at ffffffff81ff08b7 > > We really ought to improve this message. If nothing else, it should > say whether it was a read, a write, or an instruction fetch. > >> IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30 >> PGD 461e067 PUD 461f063 >> PMD 1e001e1 > > Too lazy to manually decode this right now, but I don't think it matters. > >> Oops: 0003 [#1] PREEMPT SMP KASAN > > Note this is SMP, but that just means CONFIG_SMP=y. Vegard, how many > CPUs does your kernel think you have? My first crash was running on a 1-CPU guest (not intentionally, but because of a badly configured qemu). I'm honestly surprised it triggered at all with 1 CPU, but I guess it shows that it's not a true concurrency issue at least! 
> >> RIP: 0010:[<ffffffff8135f2ea>] [<ffffffff8135f2ea>] >> __lock_acquire.isra.32+0xda/0x1a30 >> RSP: 0018:ffff8801bab8f730 EFLAGS: 00010082 >> RAX: ffffffff81ff071f RBX: 0000000000000000 RCX: 0000000000000000 > > RAX points to kernel text. Yes, it's somewhere in the middle of iov_iter_init() -- other crashes also had put_prev_entity(), a null pointer, and some garbage values I couldn't identify. > >> Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85 >> c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <3e> >> ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00 > > 2b: 3e ff 80 98 01 00 00 incl %ds:*0x198(%rax) <-- > trapping instruction > > That's very strange. I think this is: > > atomic_inc((atomic_t *)&class->ops); > > but my kernel contains: > > 3cb4: f0 ff 80 98 01 00 00 lock incl 0x198(%rax) > > So your kernel has been smp-alternatived. That 3e comes from > alternatives_smp_unlock. If you're running on SMP with UP > alternatives, things will break. Yes, indeed. It was running on 1 CPU by mistake and still triggered the bug. The crashes started really pouring in once I got my qemu fixed. Just to reassure you, here's another crash which shows it's using the correct instruction on an actual multicore guest: BUG: unable to handle kernel paging request at 00000003030001de IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30 PGD 183fd2067 PUD 0 Oops: 0002 [#1] PREEMPT SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) CPU: 23 PID: 9584 Comm: trinity-c104 Not tainted 4.9.0-rc7+ #219 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 task: ffff880189f68000 task.stack: ffff88017fe50000 RIP: 0010:[<ffffffff8135f2ea>] [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30 RSP: 0018:ffff88017fe575e0 EFLAGS: 00010002 RAX: 0000000303000046 RBX: 0000000000000000 RCX: 0000000000000000 [...] 
Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85 c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <f0> ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00 RIP [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30 RSP <ffff88017fe575e0> CR2: 00000003030001de ---[ end trace 2846425104eb6141 ]--- Kernel panic - not syncing: Fatal exception ------------[ cut here ]------------ 2b:* f0 ff 80 98 01 00 00 lock incl 0x198(%rax) <-- trapping instruction > What's your kernel command line? Can we have your entire kernel log from boot? Just in case you still want this, I've attached the boot log for the "true SMP" guest above. Vegard [-- Attachment #2: 5.txt.gz --] [-- Type: application/x-gzip, Size: 17767 bytes --] ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 23:03 ` Jens Axboe 2016-10-26 23:07 ` Dave Jones 2016-10-26 23:08 ` Linus Torvalds @ 2016-10-26 23:19 ` Chris Mason 2016-10-26 23:21 ` Jens Axboe 2016-10-27 6:33 ` Christoph Hellwig 3 siblings, 1 reply; 118+ messages in thread From: Chris Mason @ 2016-10-26 23:19 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Dave Jones, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote: >On 10/26/2016 04:58 PM, Linus Torvalds wrote: >>On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds >><torvalds@linux-foundation.org> wrote: >>> >>>Dave: it might be a good idea to split that "WARN_ON_ONCE()" in >>>blk_mq_merge_queue_io() into two >> >>I did that myself too, since Dave sees this during boot. >> >>But I'm not getting the warning ;( >> >>Dave gets it with ext4, and thats' what I have too, so I'm not sure >>what the required trigger would be. > >Actually, I think I see what might trigger it. You are on nvme, iirc, >and that has a deep queue. Dave, are you testing on a sata drive or >something similar with a shallower queue depth? If we end up sleeping >for a request, I think we could trigger data->ctx being different. > >Dave, can you hit the warnings with this? Totally untested... 
Confirmed, totally untested ;) Don't try this one at home folks (working this out with Jens offlist) BUG: unable to handle kernel paging request at 0000000002411200 IP: [<ffffffff819acff2>] _raw_spin_lock+0x22/0x40 PGD 12840a067 PUD 128446067 PMD 0 Oops: 0002 [#1] PREEMPT SMP Modules linked in: virtio_blk(+) CPU: 4 PID: 125 Comm: modprobe Not tainted 4.9.0-rc2-00041-g811d54d-dirty #320 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014 task: ffff88013849aac0 task.stack: ffff8801293d8000 RIP: 0010:[<ffffffff819acff2>] [<ffffffff819acff2>] _raw_spin_lock+0x22/0x40 RSP: 0018:ffff8801293db278 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000002411200 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff88013a5c1048 RDI: 0000000000000000 RBP: ffff8801293db288 R08: 0000000000000005 R09: ffff880128449380 R10: 0000000000000000 R11: 0000000000000008 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000076 R15: ffff8801293b6a80 FS: 00007f1a2a9cdb40(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000002411200 CR3: 000000013a5d1000 CR4: 00000000000406e0 Stack: ffff8801293db2d0 ffff880128488000 ffff8801293db348 ffffffff814debff 00ff8801293db2c8 ffff8801293db338 ffff8801284888c0 ffff8801284888b8 000060fec00004f9 0000000002411200 ffff880128f810c0 ffff880128f810c0 Call Trace: [<ffffffff814debff>] blk_sq_make_request+0x34f/0x580 [<ffffffff8116b005>] ? mempool_alloc_slab+0x15/0x20 [<ffffffff814d5444>] generic_make_request+0x104/0x200 [<ffffffff814d55a5>] submit_bio+0x65/0x130 [<ffffffff8122a06e>] submit_bh_wbc+0x16e/0x210 [<ffffffff8122a123>] submit_bh+0x13/0x20 [<ffffffff8122b075>] block_read_full_page+0x205/0x3d0 [<ffffffff8122cf00>] ? I_BDEV+0x20/0x20 [<ffffffff8117a1fe>] ? lru_cache_add+0xe/0x10 [<ffffffff81167502>] ? add_to_page_cache_lru+0x92/0xf0 [<ffffffff81166c41>] ?
__page_cache_alloc+0xd1/0xe0 [<ffffffff8122df38>] blkdev_readpage+0x18/0x20 [<ffffffff8116ada6>] do_read_cache_page+0x1c6/0x380 [<ffffffff8122df20>] ? blkdev_writepages+0x10/0x10 [<ffffffff811c6662>] ? alloc_pages_current+0xb2/0x1c0 [<ffffffff8116af92>] read_cache_page+0x12/0x20 [<ffffffff814e6b11>] read_dev_sector+0x31/0xb0 [<ffffffff814eb31d>] read_lba+0xbd/0x130 [<ffffffff814eb682>] find_valid_gpt+0xa2/0x580 [<ffffffff814ebb60>] ? find_valid_gpt+0x580/0x580 [<ffffffff814ebbc7>] efi_partition+0x67/0x3d0 [<ffffffff81509cfa>] ? vsnprintf+0x2aa/0x470 [<ffffffff81509f64>] ? snprintf+0x34/0x40 [<ffffffff814ebb60>] ? find_valid_gpt+0x580/0x580 [<ffffffff814e8f46>] check_partition+0x106/0x1e0 [<ffffffff814e741c>] rescan_partitions+0x8c/0x270 [<ffffffff8122ef98>] __blkdev_get+0x328/0x3f0 [<ffffffff8122f0b4>] blkdev_get+0x54/0x320 [<ffffffff8120be7a>] ? unlock_new_inode+0x5a/0x80 [<ffffffff8122dc0f>] ? bdget+0xff/0x110 [<ffffffff814e4d16>] device_add_disk+0x3c6/0x450 [<ffffffff8151970a>] ? ioread8+0x1a/0x40 [<ffffffff815bc68e>] ? vp_get+0x4e/0x70 [<ffffffffa0001540>] virtblk_probe+0x460/0x708 [virtio_blk] [<ffffffff815bc556>] ? vp_finalize_features+0x36/0x50 [<ffffffff815b8c82>] virtio_dev_probe+0x132/0x1e0 [<ffffffff81619709>] driver_probe_device+0x1a9/0x2d0 [<ffffffff819aa9e4>] ? mutex_lock+0x24/0x50 [<ffffffff816198ed>] __driver_attach+0xbd/0xc0 [<ffffffff81619830>] ? driver_probe_device+0x2d0/0x2d0 [<ffffffff81619830>] ? driver_probe_device+0x2d0/0x2d0 [<ffffffff816178aa>] bus_for_each_dev+0x8a/0xb0 [<ffffffff8161920e>] driver_attach+0x1e/0x20 [<ffffffff81618bf6>] bus_add_driver+0x1b6/0x230 [<ffffffff8161a200>] driver_register+0x60/0xe0 [<ffffffff815b8f50>] register_virtio_driver+0x20/0x40 [<ffffffffa0004057>] init+0x57/0x81 [virtio_blk] [<ffffffffa0004000>] ? 0xffffffffa0004000 [<ffffffffa0004000>] ? 0xffffffffa0004000 [<ffffffff810003a6>] do_one_initcall+0x46/0x150 [<ffffffff810e923a>] do_init_module+0x6a/0x210 [<ffffffff811b32b7>] ? 
vfree+0x37/0x90 [<ffffffff810ebe68>] load_module+0x1638/0x1860 [<ffffffff810e83f0>] ? do_free_init+0x30/0x30 [<ffffffff811f6da4>] ? kernel_read_file_from_fd+0x54/0x90 [<ffffffff810ec152>] SYSC_finit_module+0xc2/0xd0 [<ffffffff810ec16e>] SyS_finit_module+0xe/0x10 [<ffffffff819ad1a0>] entry_SYSCALL_64_fastpath+0x13/0x94 Code: 89 df e8 a2 52 70 ff eb e6 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 89 fb bf 01 00 00 00 e8 95 53 6e ff 31 c0 ba 01 00 00 00 <f0> 0f b1 13 85 c0 75 07 48 83 c4 08 5b c9 c3 89 c6 48 89 df e8 RIP [<ffffffff819acff2>] _raw_spin_lock+0x22/0x40 RSP <ffff8801293db278> CR2: 0000000002411200 ---[ end trace e8cb117e64947621 ]--- Kernel panic - not syncing: Fatal exception Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception -chris ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 23:19 ` Chris Mason @ 2016-10-26 23:21 ` Jens Axboe 0 siblings, 0 replies; 118+ messages in thread From: Jens Axboe @ 2016-10-26 23:21 UTC (permalink / raw) To: Chris Mason, Linus Torvalds, Dave Jones, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 10/26/2016 05:19 PM, Chris Mason wrote: > On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote: >> On 10/26/2016 04:58 PM, Linus Torvalds wrote: >>> On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds >>> <torvalds@linux-foundation.org> wrote: >>>> >>>> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in >>>> blk_mq_merge_queue_io() into two >>> >>> I did that myself too, since Dave sees this during boot. >>> >>> But I'm not getting the warning ;( >>> >>> Dave gets it with ext4, and thats' what I have too, so I'm not sure >>> what the required trigger would be. >> >> Actually, I think I see what might trigger it. You are on nvme, iirc, >> and that has a deep queue. Dave, are you testing on a sata drive or >> something similar with a shallower queue depth? If we end up sleeping >> for a request, I think we could trigger data->ctx being different. >> >> Dave, can you hit the warnings with this? Totally untested... > > Confirmed, totally untested ;) Don't try this one at home folks > (working this out with Jens offlist) Ahem, I did say untested! The one I just sent in reply to Linus should be better. Though that one is completely untested as well... -- Jens Axboe ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 23:03 ` Jens Axboe ` (2 preceding siblings ...) 2016-10-26 23:19 ` Chris Mason @ 2016-10-27 6:33 ` Christoph Hellwig 2016-10-27 16:34 ` Linus Torvalds 3 siblings, 1 reply; 118+ messages in thread From: Christoph Hellwig @ 2016-10-27 6:33 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Dave Jones, Chris Mason, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner > Dave, can you hit the warnings with this? Totally untested... Can we just kill off the unhelpful blk_map_ctx structure, e.g.: diff --git a/block/blk-mq.c b/block/blk-mq.c index ddc2eed..d74a74a 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1190,21 +1190,15 @@ static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx, } } -struct blk_map_ctx { - struct blk_mq_hw_ctx *hctx; - struct blk_mq_ctx *ctx; -}; - static struct request *blk_mq_map_request(struct request_queue *q, struct bio *bio, - struct blk_map_ctx *data) + struct blk_mq_alloc_data *data) { struct blk_mq_hw_ctx *hctx; struct blk_mq_ctx *ctx; struct request *rq; int op = bio_data_dir(bio); int op_flags = 0; - struct blk_mq_alloc_data alloc_data; blk_queue_enter_live(q); ctx = blk_mq_get_ctx(q); @@ -1214,12 +1208,10 @@ static struct request *blk_mq_map_request(struct request_queue *q, op_flags |= REQ_SYNC; trace_block_getrq(q, bio, op); - blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx); - rq = __blk_mq_alloc_request(&alloc_data, op, op_flags); + blk_mq_set_alloc_data(data, q, 0, ctx, hctx); + rq = __blk_mq_alloc_request(data, op, op_flags); - hctx->queued++; - data->hctx = hctx; - data->ctx = ctx; + data->hctx->queued++; return rq; } @@ -1267,7 +1259,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) { const int is_sync = rw_is_sync(bio_op(bio), bio->bi_opf); const int is_flush_fua = bio->bi_opf & (REQ_PREFLUSH | REQ_FUA); - struct blk_map_ctx data; + struct blk_mq_alloc_data data; struct 
request *rq; unsigned int request_count = 0; struct blk_plug *plug; @@ -1363,7 +1355,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) const int is_flush_fua = bio->bi_opf & (REQ_PREFLUSH | REQ_FUA); struct blk_plug *plug; unsigned int request_count = 0; - struct blk_map_ctx data; + struct blk_mq_alloc_data data; struct request *rq; blk_qc_t cookie; ^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-27 6:33 ` Christoph Hellwig @ 2016-10-27 16:34 ` Linus Torvalds 2016-10-27 16:36 ` Jens Axboe 0 siblings, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-10-27 16:34 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Dave Jones, Chris Mason, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 11:33 PM, Christoph Hellwig <hch@infradead.org> wrote: >> Dave, can you hit the warnings with this? Totally untested... > > Can we just kill off the unhelpful blk_map_ctx structure, e.g.: Yeah, I found that hard to read too. The difference between blk_map_ctx and blk_mq_alloc_data is minimal, might as well just use the latter everywhere. But please separate that patch from the patch that fixes the ctx list corruption. And Jens - can I have at least the corruption fix asap? Due to KS travel, I'm going to do rc3 on Saturday, and I'd like to have the fix in at least a day before that.. Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-27 16:34 ` Linus Torvalds @ 2016-10-27 16:36 ` Jens Axboe 0 siblings, 0 replies; 118+ messages in thread From: Jens Axboe @ 2016-10-27 16:36 UTC (permalink / raw) To: Linus Torvalds, Christoph Hellwig Cc: Dave Jones, Chris Mason, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 10/27/2016 10:34 AM, Linus Torvalds wrote: > On Wed, Oct 26, 2016 at 11:33 PM, Christoph Hellwig <hch@infradead.org> wrote: >>> Dave, can you hit the warnings with this? Totally untested... >> >> Can we just kill off the unhelpful blk_map_ctx structure, e.g.: > > Yeah, I found that hard to read too. The difference between > blk_map_ctx and blk_mq_alloc_data is minimal, might as well just use > the latter everywhere. > > But please separate that patch from the patch that fixes the ctx list > corruption. Agree > And Jens - can I have at least the corruption fix asap? Due to KS > travel, I'm going to do rc3 on Saturday, and I'd like to have the fix > in at least a day before that.. Just committed half an hour ago with the acks etc, I'll send it to you shortly. -- Jens Axboe ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 22:51 ` Linus Torvalds 2016-10-26 22:55 ` Jens Axboe 2016-10-26 22:58 ` Linus Torvalds @ 2016-10-26 23:01 ` Dave Jones 2016-10-26 23:05 ` Jens Axboe 2 siblings, 1 reply; 118+ messages in thread From: Dave Jones @ 2016-10-26 23:01 UTC (permalink / raw) To: Linus Torvalds Cc: Chris Mason, Andy Lutomirski, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 03:51:01PM -0700, Linus Torvalds wrote: > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > blk_mq_merge_queue_io() into two, since right now it can trigger both > for the > > blk_mq_bio_to_request(rq, bio); > > path _and_ for the > > if (!blk_mq_attempt_merge(q, ctx, bio)) { > blk_mq_bio_to_request(rq, bio); > goto insert_rq; > > path. If you split it into two: one before that "insert_rq:" label, > and one before the "goto insert_rq" thing, then we could see if it is > just one of the blk_mq_merge_queue_io() cases (or both) that is > broken.. It's the latter of the two. [ 12.302392] WARNING: CPU: 3 PID: 272 at block/blk-mq.c:1191 blk_sq_make_request+0x320/0x4d0 Dave ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 23:01 ` Dave Jones @ 2016-10-26 23:05 ` Jens Axboe 0 siblings, 0 replies; 118+ messages in thread From: Jens Axboe @ 2016-10-26 23:05 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Chris Mason, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 10/26/2016 05:01 PM, Dave Jones wrote: > On Wed, Oct 26, 2016 at 03:51:01PM -0700, Linus Torvalds wrote: > > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > > blk_mq_merge_queue_io() into two, since right now it can trigger both > > for the > > > > blk_mq_bio_to_request(rq, bio); > > > > path _and_ for the > > > > if (!blk_mq_attempt_merge(q, ctx, bio)) { > > blk_mq_bio_to_request(rq, bio); > > goto insert_rq; > > > > path. If you split it into two: one before that "insert_rq:" label, > > and one before the "goto insert_rq" thing, then we could see if it is > > just one of the blk_mq_merge_queue_io() cases (or both) that is > > broken.. > > It's the latter of the two. > > [ 12.302392] WARNING: CPU: 3 PID: 272 at block/blk-mq.c:1191 > blk_sq_make_request+0x320/0x4d0 I would expect that - so normal request, since we know merging is on. I just sent out a patch, can you give that a whirl? Keep Linus' debugging patch, and apply it on top. Below as well, for reference. diff --git a/block/blk-mq.c b/block/blk-mq.c index ddc2eed64771..80a9c45a9235 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1217,9 +1217,7 @@ static struct request *blk_mq_map_request(struct request_queue *q, blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx); rq = __blk_mq_alloc_request(&alloc_data, op, op_flags); - hctx->queued++; - data->hctx = hctx; - data->ctx = ctx; + data->hctx->queued++; return rq; } -- Jens Axboe ^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 22:40 ` Dave Jones 2016-10-26 22:51 ` Linus Torvalds @ 2016-10-26 22:52 ` Jens Axboe 1 sibling, 0 replies; 118+ messages in thread From: Jens Axboe @ 2016-10-26 22:52 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Chris Mason, Andy Lutomirski, Andy Lutomirski, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On 10/26/2016 04:40 PM, Dave Jones wrote: > On Wed, Oct 26, 2016 at 03:21:53PM -0700, Linus Torvalds wrote: > > > Could you try the attached patch? It adds a couple of sanity tests: > > > > - a number of tests to verify that 'rq->queuelist' isn't already on > > some queue when it is added to a queue > > > > - one test to verify that rq->mq_ctx is the same ctx that we have locked. > > > > I may be completely full of shit, and this patch may be pure garbage > > or "obviously will never trigger", but humor me. > > I gave it a shot too for shits & giggles. > This falls out during boot. > > [ 9.244030] EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: (null) > [ 9.271391] ------------[ cut here ]------------ > [ 9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 blk_sq_make_request+0x465/0x4a0 > [ 9.285613] CPU: 0 PID: 1 Comm: init Not tainted 4.9.0-rc2-think+ #4 Very odd, don't immediately see how that can happen. For testing, can you try and add the below patch? Just curious if that fixes the list corruption. Thing is, I don't see how ->mq_ctx and ctx are different in this path, but I can debug that on the side. 
diff --git a/block/blk-mq.c b/block/blk-mq.c index ddc2eed64771..73b9462aa21f 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1165,9 +1165,10 @@ static inline bool hctx_allow_merges(struct blk_mq_hw_ctx *hctx) } static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx, - struct blk_mq_ctx *ctx, struct request *rq, struct bio *bio) { + struct blk_mq_ctx *ctx = rq->mq_ctx; + if (!hctx_allow_merges(hctx) || !bio_mergeable(bio)) { blk_mq_bio_to_request(rq, bio); spin_lock(&ctx->lock); @@ -1338,7 +1339,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) goto done; } - if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) { + if (!blk_mq_merge_queue_io(data.hctx, rq, bio)) { /* * For a SYNC request, send it to the hardware immediately. For * an ASYNC request, just ensure that we run it later on. The @@ -1416,7 +1417,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) return cookie; } - if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) { + if (!blk_mq_merge_queue_io(data.hctx, rq, bio)) { /* * For a SYNC request, send it to the hardware immediately. For * an ASYNC request, just ensure that we run it later on. The -- Jens Axboe ^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 20:00 ` Chris Mason 2016-10-26 21:52 ` Chris Mason @ 2016-10-26 22:07 ` Linus Torvalds 2016-10-26 22:54 ` Chris Mason 1 sibling, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-10-26 22:07 UTC (permalink / raw) To: Chris Mason Cc: Dave Jones, Andy Lutomirski, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason <clm@fb.com> wrote: > > Today I turned off every CONFIG_DEBUG_* except for list debugging, and > ran dbench 2048: > > [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 __list_add+0xbe/0xd0 > [ 2759.119652] list_add corruption. prev->next should be next (ffffe8ffffc80308), but was ffffc90000ccfb88. (prev=ffff880128522380). > [ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk > [ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 4.9.0-rc1-15246-g4ce9206-dirty #317 > [ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014 > [ 2759.125077] ffffc9000f6fb868 ffffffff814fe4ff ffffffff8151cb5e ffffc9000f6fb8c8 > [ 2759.125077] ffffc9000f6fb8c8 0000000000000000 ffffc9000f6fb8b8 ffffffff81064bbf > [ 2759.127444] ffff880128523680 0000002139968000 ffff880138b7a4a0 ffff880128523540 > [ 2759.127444] Call Trace: > [ 2759.127444] [<ffffffff814fe4ff>] dump_stack+0x53/0x74 > [ 2759.127444] [<ffffffff8151cb5e>] ? __list_add+0xbe/0xd0 > [ 2759.127444] [<ffffffff81064bbf>] __warn+0xff/0x120 > [ 2759.127444] [<ffffffff81064c99>] warn_slowpath_fmt+0x49/0x50 > [ 2759.127444] [<ffffffff8151cb5e>] __list_add+0xbe/0xd0 > [ 2759.127444] [<ffffffff814df338>] blk_sq_make_request+0x388/0x580 Ok, that's definitely the same one that Dave started out seeing. 
The fact that it is that reliable - two different machines, two very different loads (dbench looks nothing like trinity) really makes me think that maybe the problem really is in the block plugging after all. It very much does not smell like random stack corruption. It's simply not random enough. And I just noticed something: I originally thought that this is the "list_add_tail()" to the plug - which is the only "list_add()" variant in that function. But that never made sense, because the whole "but was" isn't a stack address, and "next" in "list_add_tail()" is basically fixed, and would have to be the stack. But I now notice that there's actually another "list_add()" variant there, and it's the one from __blk_mq_insert_request() that gets inlined into blk_mq_insert_request(), which then gets inlined into blk_mq_make_request(). And that actually makes some sense just looking at the offsets too: blk_sq_make_request+0x388/0x580 so it's somewhat at the end of blk_sq_make_request(). So it's not unlikely. And there it makes perfect sense that the "next should be" value is *not* on the stack. Chris, if you built with debug info, you can try ./scripts/faddr2line /boot/vmlinux blk_sq_make_request+0x388 to get what line that blk_sq_make_request+0x388 address actually is. I think it's the list_add_tail(&rq->queuelist, &ctx->rq_list); in __blk_mq_insert_req_list() (when it's inlined from blk_sq_make_request(), "at_head" will be false. So it smells like "&ctx->rq_list" might be corrupt. And that actually seems much more likely than the "plug" list, because while the plug is entirely thread-local (and thus shouldn't have any races), the ctx->rq_list very much is not. Jens? For example, should we have a BUG_ON(ctx != rq->mq_ctx); in blk_mq_merge_queue_io()? Because it locks ctx->lock, but then __blk_mq_insert_request() will insert things onto the queues of rq->mq_ctx. blk_mq_insert_requests() has similar issues, but there has that BUG_ON(). 
The locking there really is *very* messy. All the lockers do spin_lock(&ctx->lock); ... spin_unlock(&ctx->lock); but then __blk_mq_insert_request() and __blk_mq_insert_req_list don't act on "ctx", but on "ctx = rq->mq_ctx", so if you ever get those wrong, you're completely dead. Now, I'm not seeing why they'd be wrong, and why they'd be associated with the VMAP_STACK thing, but it could just be an unlucky timing thing. Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 22:07 ` Linus Torvalds @ 2016-10-26 22:54 ` Chris Mason 0 siblings, 0 replies; 118+ messages in thread From: Chris Mason @ 2016-10-26 22:54 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Andy Lutomirski, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel, Dave Chinner On Wed, Oct 26, 2016 at 03:07:10PM -0700, Linus Torvalds wrote: >On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason <clm@fb.com> wrote: >> >> Today I turned off every CONFIG_DEBUG_* except for list debugging, and >> ran dbench 2048: >> >> [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 __list_add+0xbe/0xd0 >> [ 2759.119652] list_add corruption. prev->next should be next (ffffe8ffffc80308), but was ffffc90000ccfb88. (prev=ffff880128522380). >> [ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk >> [ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 4.9.0-rc1-15246-g4ce9206-dirty #317 >> [ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014 >> [ 2759.125077] ffffc9000f6fb868 ffffffff814fe4ff ffffffff8151cb5e ffffc9000f6fb8c8 >> [ 2759.125077] ffffc9000f6fb8c8 0000000000000000 ffffc9000f6fb8b8 ffffffff81064bbf >> [ 2759.127444] ffff880128523680 0000002139968000 ffff880138b7a4a0 ffff880128523540 >> [ 2759.127444] Call Trace: >> [ 2759.127444] [<ffffffff814fe4ff>] dump_stack+0x53/0x74 >> [ 2759.127444] [<ffffffff8151cb5e>] ? __list_add+0xbe/0xd0 >> [ 2759.127444] [<ffffffff81064bbf>] __warn+0xff/0x120 >> [ 2759.127444] [<ffffffff81064c99>] warn_slowpath_fmt+0x49/0x50 >> [ 2759.127444] [<ffffffff8151cb5e>] __list_add+0xbe/0xd0 >> [ 2759.127444] [<ffffffff814df338>] blk_sq_make_request+0x388/0x580 > >Ok, that's definitely the same one that Dave started out seeing. 
> >The fact that it is that reliable - two different machines, two very >different loads (dbench looks nothing like trinity) really makes me >think that maybe the problem really is in the block plugging after >all. > >It very much does not smell like random stack corruption. It's simply >not random enough. Agreed. I'd feel better if I could trigger this outside of btrfs, even though Dave Chinner hit something very similar on xfs. I'll peel off another test machine for a long XFS run. > >And I just noticed something: I originally thought that this is the >"list_add_tail()" to the plug - which is the only "list_add()" variant >in that function. > >But that never made sense, because the whole "but was" isn't a stack >address, and "next" in "list_add_tail()" is basically fixed, and would >have to be the stack. > >But I now notice that there's actually another "list_add()" variant >there, and it's the one from __blk_mq_insert_request() that gets >inlined into blk_mq_insert_request(), which then gets inlined into >blk_mq_make_request(). > >And that actually makes some sense just looking at the offsets too: > > blk_sq_make_request+0x388/0x580 > >so it's somewhat at the end of blk_sq_make_request(). So it's not unlikely. > >And there it makes perfect sense that the "next should be" value is >*not* on the stack. > >Chris, if you built with debug info, you can try > > ./scripts/faddr2line /boot/vmlinux blk_sq_make_request+0x388 > >to get what line that blk_sq_make_request+0x388 address actually is. I >think it's the > > list_add_tail(&rq->queuelist, &ctx->rq_list); > >in __blk_mq_insert_req_list() (when it's inlined from >blk_sq_make_request(), "at_head" will be false. > >So it smells like "&ctx->rq_list" might be corrupt. 
I'm running your current git here, so these line numbers should line up for you: blk_sq_make_request+0x388/0x578: __blk_mq_insert_request at block/blk-mq.c:1049 (inlined by) blk_mq_merge_queue_io at block/blk-mq.c:1175 (inlined by) blk_sq_make_request at block/blk-mq.c:1419 The fsync path in the WARN doesn't have any plugs that I can see, so it's not surprising that we're not in the plugging path. I'm here: if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) { /* * For a SYNC request, send it to the hardware immediately. For * an ASYNC request, just ensure that we run it later on. The * latter allows for merging opportunities and more efficient * dispatching. */ I'll try the debugging patch you sent in the other email. -chris ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-26 0:27 ` Dave Jones 2016-10-26 1:33 ` Linus Torvalds @ 2016-10-27 5:41 ` Dave Chinner 2016-10-27 17:23 ` Dave Jones 1 sibling, 1 reply; 118+ messages in thread From: Dave Chinner @ 2016-10-27 5:41 UTC (permalink / raw) To: Dave Jones, Chris Mason, Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Tue, Oct 25, 2016 at 08:27:52PM -0400, Dave Jones wrote: > DaveC: Do these look like real problems, or is this more "looks like > random memory corruption" ? It's been a while since I did some stress > testing on XFS, so these might not be new.. > > XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2938 > ------------[ cut here ]------------ > kernel BUG at fs/xfs/xfs_message.c:113! > invalid opcode: 0000 [#1] PREEMPT SMP > CPU: 1 PID: 6227 Comm: trinity-c9 Not tainted 4.9.0-rc1-think+ #6 > task: ffff8804f4658040 task.stack: ffff88050568c000 > RIP: 0010:[<ffffffffa02d3e2b>] > [<ffffffffa02d3e2b>] assfail+0x1b/0x20 [xfs] > RSP: 0000:ffff88050568f9e8 EFLAGS: 00010282 > RAX: 00000000ffffffea RBX: 0000000000000046 RCX: 0000000000000001 > RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffa02fe34d > RBP: ffff88050568f9e8 R08: 0000000000000000 R09: 0000000000000000 > R10: 000000000000000a R11: f000000000000000 R12: ffff88050568fb44 > R13: 00000000000000f3 R14: ffff8804f292bf88 R15: 000ffffffffe0046 > FS: 00007fe2ddfdfb40(0000) GS:ffff88050a000000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007fe2dbabd000 CR3: 00000004f461f000 CR4: 00000000001406e0 > Stack: > ffff88050568fa88 ffffffffa027ccee fffffffffffffff9 ffff8804f16fd8b0 > 0000000000003ffa 0000000000000032 ffff8804f292bf40 0000000000004976 > 000ffffffffe0008 00000000000004fd ffff880400000000 0000000000005107 > Call Trace: > [<ffffffffa027ccee>] xfs_bmap_add_extent_hole_delay+0x54e/0x620 [xfs] > [<ffffffffa027f2d4>] 
xfs_bmapi_reserve_delalloc+0x2b4/0x400 [xfs] > [<ffffffffa02cadd7>] xfs_file_iomap_begin_delay.isra.12+0x247/0x3c0 [xfs] > [<ffffffffa02cb0d1>] xfs_file_iomap_begin+0x181/0x270 [xfs] > [<ffffffff8122e573>] iomap_apply+0x53/0x100 > [<ffffffff8122e68b>] iomap_file_buffered_write+0x6b/0x90 All this iomap code is new, so it's entirely possible that this is a new bug. The assert failure is indicative that the delalloc extent's metadata reservation grew when we expected it to stay the same or shrink. > XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: fs/xfs/xfs_trans.c, line: 309 ... > [<ffffffffa057abe1>] xfs_trans_mod_sb+0x241/0x280 [xfs] > [<ffffffffa0548eff>] xfs_ag_resv_alloc_extent+0x4f/0xc0 [xfs] > [<ffffffffa050aa3d>] xfs_alloc_ag_vextent+0x23d/0x300 [xfs] > [<ffffffffa050bb1b>] xfs_alloc_vextent+0x5fb/0x6d0 [xfs] > [<ffffffffa051c1b4>] xfs_bmap_btalloc+0x304/0x8e0 [xfs] > [<ffffffffa054648e>] ? xfs_iext_bno_to_ext+0xee/0x170 [xfs] > [<ffffffffa051c8db>] xfs_bmap_alloc+0x2b/0x40 [xfs] > [<ffffffffa051dc30>] xfs_bmapi_write+0x640/0x1210 [xfs] > [<ffffffffa0569326>] xfs_iomap_write_allocate+0x166/0x350 [xfs] > [<ffffffffa05540b0>] xfs_map_blocks+0x1b0/0x260 [xfs] > [<ffffffffa0554beb>] xfs_do_writepage+0x23b/0x730 [xfs] And that's indicative of a delalloc metadata reservation being too small and so we're allocating unreserved blocks. Different symptoms, same underlying cause, I think. I see the latter assert from time to time in my testing, but it's not common (maybe once a month) and I've never been able to track it down. However, it doesn't affect production systems unless they hit ENOSPC hard enough that it causes the critical reserve pool to be exhausted and so the allocation fails. 
That's extremely rare - usually takes several hundred processes all trying to write as hard as they can concurrently and to all slip through the ENOSPC detection without the correct metadata reservations and all require multiple metadata blocks to be allocated during writeback... If you've got a way to trigger it quickly and reliably, that would be helpful... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-27 5:41 ` Dave Chinner @ 2016-10-27 17:23 ` Dave Jones 0 siblings, 0 replies; 118+ messages in thread From: Dave Jones @ 2016-10-27 17:23 UTC (permalink / raw) To: Dave Chinner Cc: Chris Mason, Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Thu, Oct 27, 2016 at 04:41:33PM +1100, Dave Chinner wrote: > And that's indicative of a delalloc metadata reservation being > being too small and so we're allocating unreserved blocks. > > Different symptoms, same underlying cause, I think. > > I see the latter assert from time to time in my testing, but it's > not common (maybe once a month) and I've never been able to track it > down. However, it doesn't affect production systems unless they hit > ENOSPC hard enough that it causes the critical reserve pool to be > exhausted iand so the allocation fails. That's extremely rare - > usually takes a several hundred processes all trying to write as had > as they can concurrently and to all slip through the ENOSPC > detection without the correct metadata reservations and all require > multiple metadata blocks to be allocated durign writeback... > > If you've got a way to trigger it quickly and reliably, that would > be helpful... Seems pretty quickly reproducable for me in some shape or form. Run trinity with --enable-fds=testfile and create enough children to create a fair bit of contention, (for me -C64 seems a good fit on spinning rust, but if you're running on shiny nvme you might have to pump it up a bit). Dave ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-24 4:40 ` Dave Jones 2016-10-24 13:42 ` Chris Mason @ 2016-10-24 20:06 ` Andy Lutomirski 2016-10-24 20:46 ` Linus Torvalds 1 sibling, 1 reply; 118+ messages in thread From: Andy Lutomirski @ 2016-10-24 20:06 UTC (permalink / raw) To: Dave Jones, Chris Mason, Andy Lutomirski, Andy Lutomirski, Linus Torvalds, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Sun, Oct 23, 2016 at 9:40 PM, Dave Jones <davej@codemonkey.org.uk> wrote: > On Sun, Oct 23, 2016 at 05:32:21PM -0400, Chris Mason wrote: > > > > > > On 10/22/2016 11:20 AM, Dave Jones wrote: > > > On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote: > > > > > > > > It could be worth trying this, too: > > > > > > > > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8 > > > > > > > > > > It occurred to me that the current code is a little bit fragile. > > > > > > > > It's been nearly 24hrs with the above changes, and it's been pretty much > > > > silent the whole time. > > > > > > > > The only thing of note over that time period has been a btrfs lockdep > > > > warning that's been around for a while, and occasional btrfs checksum > > > > failures, which I've been seeing for a while, but seem to have gotten > > > > worse since 4.8. > > > > > > > > I'm pretty confident in the disk being ok in this machine, so I think > > > > the checksum warnings are bogus. Chris suggested they may be the result > > > > of memory corruption, but there's little else going on. > > > > > > The only interesting thing last nights run was this.. 
> > > > > > BUG: Bad page state in process kworker/u8:1 pfn:4e2b70 > > > page:ffffea00138adc00 count:0 mapcount:0 mapping:ffff88046e9fc2e0 index:0xdf0 > > > flags: 0x400000000000000c(referenced|uptodate) > > > page dumped because: non-NULL mapping > > > CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11 > > > Workqueue: writeback wb_workfn (flush-btrfs-2) > > > > Well crud, we're back to wondering if this is Btrfs or the stack > > corruption. Since the pagevecs are on the stack and this is a new > > crash, my guess is you'll be able to trigger it on xfs/ext4 too. But we > > should make sure. > > Here's an interesting one from today, pointing the finger at xattrs again. > > > [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC This is an unhandled kernel page fault. The string "Oops" is so helpful :-/ > [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11 > [69943.463510] task: ffff8804f8dd3740 task.stack: ffffc9000b108000 > [69943.468077] RIP: 0010:[<ffffffff810c3f6b>] > [69943.472704] [<ffffffff810c3f6b>] __lock_acquire.isra.32+0x6b/0x8c0 > [69943.477489] RSP: 0018:ffffc9000b10b9e8 EFLAGS: 00010086 > [69943.482368] RAX: ffffffff81789b90 RBX: ffff8804f8dd3740 RCX: 0000000000000000 > [69943.487410] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 > [69943.492515] RBP: ffffc9000b10ba18 R08: 0000000000000001 R09: 0000000000000000 > [69943.497666] R10: 0000000000000001 R11: 00003f9cfa7f4e73 R12: 0000000000000000 > [69943.502880] R13: 0000000000000000 R14: ffffc9000af7bd48 R15: ffff8804f8dd3740 > [69943.508163] FS: 00007f64904a2b40(0000) GS:ffff880507a00000(0000) knlGS:0000000000000000 > [69943.513591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [69943.518917] CR2: ffffffff81789d28 CR3: 00000004a8f16000 CR4: 00000000001406e0 > [69943.524253] DR0: 00007f5b97fd4000 DR1: 0000000000000000 DR2: 0000000000000000 > [69943.529488] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 > 
[69943.534771] Stack: > [69943.540023] ffff880507bd74c0 > [69943.545317] ffff8804f8dd3740 0000000000000046 0000000000000286[69943.545456] ffffc9000af7bd08 > [69943.550930] 0000000000000100 ffffc9000b10ba50 ffffffff810c4b68[69943.551069] ffffffff810ba40c > [69943.556657] ffff880400000000 0000000000000000 ffffc9000af7bd48[69943.556796] Call Trace: > [69943.562465] [<ffffffff810c4b68>] lock_acquire+0x58/0x70 > [69943.568354] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70 > [69943.574306] [<ffffffff8178fef2>] _raw_spin_lock_irqsave+0x42/0x80 > [69943.580335] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70 > [69943.586237] [<ffffffff810ba40c>] finish_wait+0x3c/0x70 > [69943.591992] [<ffffffff81169727>] shmem_fault+0x167/0x1b0 > [69943.597807] [<ffffffff810ba6c0>] ? prepare_to_wait_event+0x100/0x100 > [69943.603741] [<ffffffff8117b46d>] __do_fault+0x6d/0x1b0 > [69943.609743] [<ffffffff8117f168>] handle_mm_fault+0xc58/0x1170 > [69943.615822] [<ffffffff8117e553>] ? handle_mm_fault+0x43/0x1170 > [69943.621971] [<ffffffff81044982>] __do_page_fault+0x172/0x4e0 > [69943.628184] [<ffffffff81044d10>] do_page_fault+0x20/0x70 > [69943.634449] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20 > [69943.640784] [<ffffffff81791f3f>] page_fault+0x1f/0x30 > [69943.647170] [<ffffffff8133d69c>] ? strncpy_from_user+0x5c/0x170 > [69943.653480] [<ffffffff8133d686>] ? strncpy_from_user+0x46/0x170 > [69943.659632] [<ffffffff811f22a7>] setxattr+0x57/0x170 > [69943.665846] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20 > [69943.672172] [<ffffffff810c1f09>] ? get_lock_stats+0x19/0x50 > [69943.678558] [<ffffffff810a58f6>] ? sched_clock_cpu+0xb6/0xd0 > [69943.685007] [<ffffffff810c40cf>] ? __lock_acquire.isra.32+0x1cf/0x8c0 > [69943.691542] [<ffffffff8132a8b3>] ? __this_cpu_preempt_check+0x13/0x20 > [69943.698130] [<ffffffff8109b9bc>] ? preempt_count_add+0x7c/0xc0 > [69943.704791] [<ffffffff811ecda1>] ? 
__mnt_want_write+0x61/0x90 > [69943.711519] [<ffffffff811f2638>] SyS_fsetxattr+0x78/0xa0 > [69943.718300] [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 > [69943.724949] [<ffffffff81790a4b>] entry_SYSCALL64_slow_path+0x25/0x25 > [69943.731521] Code: > [69943.738124] 00 83 fe 01 0f 86 0e 03 00 00 31 d2 4c 89 f7 44 89 45 d0 89 4d d4 e8 75 e7 ff ff 8b 4d d4 48 85 c0 44 8b 45 d0 0f 84 d8 02 00 00 <f0> ff 80 98 01 00 00 8b 15 e0 21 8f 01 45 8b 8f 50 08 00 00 85 That's lock incl 0x198(%rax). I think this is: atomic_inc((atomic_t *)&class->ops); I suppose this could be stack corruption at work, but after a fair amount of staring, I still haven't found anything in the vmap_stack code that would cause stack corruption. ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-24 20:06 ` Andy Lutomirski @ 2016-10-24 20:46 ` Linus Torvalds 2016-10-24 21:17 ` Linus Torvalds 2016-10-24 22:42 ` Andy Lutomirski 0 siblings, 2 replies; 118+ messages in thread From: Linus Torvalds @ 2016-10-24 20:46 UTC (permalink / raw) To: Andy Lutomirski Cc: Dave Jones, Chris Mason, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Mon, Oct 24, 2016 at 1:06 PM, Andy Lutomirski <luto@amacapital.net> wrote: >> >> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC > > This is an unhandled kernel page fault. The string "Oops" is so helpful :-/ I think there was a line above it that DaveJ just didn't include. > >> [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11 >> [69943.463510] task: ffff8804f8dd3740 task.stack: ffffc9000b108000 >> [69943.468077] RIP: 0010:[<ffffffff810c3f6b>] >> [69943.472704] [<ffffffff810c3f6b>] __lock_acquire.isra.32+0x6b/0x8c0 >> [69943.477489] RSP: 0018:ffffc9000b10b9e8 EFLAGS: 00010086 >> [69943.482368] RAX: ffffffff81789b90 RBX: ffff8804f8dd3740 RCX: 0000000000000000 >> [69943.487410] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 >> [69943.492515] RBP: ffffc9000b10ba18 R08: 0000000000000001 R09: 0000000000000000 >> [69943.497666] R10: 0000000000000001 R11: 00003f9cfa7f4e73 R12: 0000000000000000 >> [69943.502880] R13: 0000000000000000 R14: ffffc9000af7bd48 R15: ffff8804f8dd3740 >> [69943.508163] FS: 00007f64904a2b40(0000) GS:ffff880507a00000(0000) knlGS:0000000000000000 >> [69943.513591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [69943.518917] CR2: ffffffff81789d28 CR3: 00000004a8f16000 CR4: 00000000001406e0 >> [69943.524253] DR0: 00007f5b97fd4000 DR1: 0000000000000000 DR2: 0000000000000000 >> [69943.529488] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 >> [69943.534771] Stack: >> [69943.540023] ffff880507bd74c0 >> [69943.545317] ffff8804f8dd3740 0000000000000046 
0000000000000286[69943.545456] ffffc9000af7bd08 >> [69943.550930] 0000000000000100 ffffc9000b10ba50 ffffffff810c4b68[69943.551069] ffffffff810ba40c >> [69943.556657] ffff880400000000 0000000000000000 ffffc9000af7bd48[69943.556796] Call Trace: >> [69943.562465] [<ffffffff810c4b68>] lock_acquire+0x58/0x70 >> [69943.568354] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70 >> [69943.574306] [<ffffffff8178fef2>] _raw_spin_lock_irqsave+0x42/0x80 >> [69943.580335] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70 >> [69943.586237] [<ffffffff810ba40c>] finish_wait+0x3c/0x70 >> [69943.591992] [<ffffffff81169727>] shmem_fault+0x167/0x1b0 >> [69943.597807] [<ffffffff810ba6c0>] ? prepare_to_wait_event+0x100/0x100 >> [69943.603741] [<ffffffff8117b46d>] __do_fault+0x6d/0x1b0 >> [69943.609743] [<ffffffff8117f168>] handle_mm_fault+0xc58/0x1170 >> [69943.615822] [<ffffffff8117e553>] ? handle_mm_fault+0x43/0x1170 >> [69943.621971] [<ffffffff81044982>] __do_page_fault+0x172/0x4e0 >> [69943.628184] [<ffffffff81044d10>] do_page_fault+0x20/0x70 >> [69943.634449] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20 >> [69943.640784] [<ffffffff81791f3f>] page_fault+0x1f/0x30 >> [69943.647170] [<ffffffff8133d69c>] ? strncpy_from_user+0x5c/0x170 >> [69943.653480] [<ffffffff8133d686>] ? strncpy_from_user+0x46/0x170 >> [69943.659632] [<ffffffff811f22a7>] setxattr+0x57/0x170 >> [69943.665846] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20 >> [69943.672172] [<ffffffff810c1f09>] ? get_lock_stats+0x19/0x50 >> [69943.678558] [<ffffffff810a58f6>] ? sched_clock_cpu+0xb6/0xd0 >> [69943.685007] [<ffffffff810c40cf>] ? __lock_acquire.isra.32+0x1cf/0x8c0 >> [69943.691542] [<ffffffff8132a8b3>] ? __this_cpu_preempt_check+0x13/0x20 >> [69943.698130] [<ffffffff8109b9bc>] ? preempt_count_add+0x7c/0xc0 >> [69943.704791] [<ffffffff811ecda1>] ? 
__mnt_want_write+0x61/0x90 >> [69943.711519] [<ffffffff811f2638>] SyS_fsetxattr+0x78/0xa0 >> [69943.718300] [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 >> [69943.724949] [<ffffffff81790a4b>] entry_SYSCALL64_slow_path+0x25/0x25 >> [69943.731521] Code: >> [69943.738124] 00 83 fe 01 0f 86 0e 03 00 00 31 d2 4c 89 f7 44 89 45 d0 89 4d d4 e8 75 e7 ff ff 8b 4d d4 48 85 c0 44 8b 45 d0 0f 84 d8 02 00 00 <f0> ff 80 98 01 00 00 8b 15 e0 21 8f 01 45 8b 8f 50 08 00 00 85 > > That's lock incl 0x198(%rax). I think this is: > > atomic_inc((atomic_t *)&class->ops); > > I suppose this could be stack corruption at work, but after a fair > amount of staring, I still haven't found anything in the vmap_stack > code that would cause stack corruption. Well, it is intriguing that what faults is this: finish_wait(shmem_falloc_waitq, &shmem_fault_wait); where 'shmem_fault_wait' is a on-stack wait queue. So it really looks very much like stack corruption. What strikes me is that "finish_wait()" does this optimistic "has my entry been removed" without holding the waitqueue lock (and uses list_empty_careful() to make sure it does that "safely"). It has that big comment too: /* * shmem_falloc_waitq points into the shmem_fallocate() * stack of the hole-punching task: shmem_falloc_waitq * is usually invalid by the time we reach here, but * finish_wait() does not dereference it in that case; * though i_lock needed lest racing with wake_up_all(). */ the stack it comes from is the wait queue head from shmem_fallocate(), which will do "wake_up_all()" under the inode lock. On the face of it, the inode lock should make that safe and serialize everything. And yes, finish_wait() does not touch the unsafe stuff if the wait-queue (in the local stack) is empty, which wake_up_all() *should* have guaranteed. 
It's just a regular wait-queue entry (which is what DEFINE_WAIT() sets up), so it uses the normal autoremove_wake_function() that removes things on successful wakeup: int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key) { int ret = default_wake_function(wait, mode, sync, key); if (ret) list_del_init(&wait->task_list); return ret; } So the only issue is "did default_wake_function() return true"? That's try_to_wake_up(TASK_NORMAL, 0), and I note that it can return zero (and thus *not* remove the entry - leaving the invalid entry there) if if (!(p->state & state)) goto out; but "prepare_to_wait()" (which also ran with the inode->i_lock held, and also takes the wait-queue lock) did set p->state to TASK_UNINTERRUPTIBLE. So this is all some really subtle code, but I'm not seeing that it would be wrong. Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-24 20:46 ` Linus Torvalds @ 2016-10-24 21:17 ` Linus Torvalds 2016-10-24 21:50 ` Linus Torvalds 2016-10-24 22:42 ` Andy Lutomirski 1 sibling, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-10-24 21:17 UTC (permalink / raw) To: Andy Lutomirski Cc: Dave Jones, Chris Mason, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Mon, Oct 24, 2016 at 1:46 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So this is all some really subtle code, but I'm not seeing that it > would be wrong. Ahh... Except maybe.. The vmalloc/vfree code itself is a bit scary. In particular, we have a rather insane model of TLB flushing. We leave the virtual area on a lazy purge-list, and we delay flushing the TLB and actually freeing the virtual memory for it so that we can batch things up. But we've free'd the physical pages that are *mapped* by that area when we do the vfree(). So there can be stale TLB entries that point to pages that have gotten re-used. They shouldn't matter, because nothing should be writing to those pages, but it strikes me that this may also be hurting the DEBUG_PAGEALLOC thing. Maybe we're not getting the page faults that we *should* be getting, and there are hidden reads and writes to those pages that already got free'd. There was some nasty reason why we did that insane thing. I think it was just that there are a few high-frequency vmalloc/vfree users and the TLB flushing was killing some performance. But it does strike me that we are playing very fast and loose with the TLB on the vmalloc area. So maybe all the new VMAP code is fine, and it's really vmalloc/vfree that has been subtly broken but nobody has ever cared before? Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-24 21:17 ` Linus Torvalds @ 2016-10-24 21:50 ` Linus Torvalds 2016-10-24 22:02 ` Chris Mason 0 siblings, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-10-24 21:50 UTC (permalink / raw) To: Andy Lutomirski Cc: Dave Jones, Chris Mason, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Mon, Oct 24, 2016 at 2:17 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > The vmalloc/vfree code itself is a bit scary. In particular, we have a > rather insane model of TLB flushing. We leave the virtual area on a > lazy purge-list, and we delay flushing the TLB and actually freeing > the virtual memory for it so that we can batch things up. Never mind. If DaveJ is running with DEBUG_PAGEALLOC, then the code in vmap_debug_free_range() should have forced a synchronous TLB flush for vmalloc ranges too, so that doesn't explain it either. Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-24 21:50 ` Linus Torvalds @ 2016-10-24 22:02 ` Chris Mason 0 siblings, 0 replies; 118+ messages in thread From: Chris Mason @ 2016-10-24 22:02 UTC (permalink / raw) To: Linus Torvalds, Andy Lutomirski Cc: Dave Jones, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On 10/24/2016 05:50 PM, Linus Torvalds wrote: > On Mon, Oct 24, 2016 at 2:17 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> The vmalloc/vfree code itself is a bit scary. In particular, we have a >> rather insane model of TLB flushing. We leave the virtual area on a >> lazy purge-list, and we delay flushing the TLB and actually freeing >> the virtual memory for it so that we can batch things up. > > Never mind. If DaveJ is running with DEBUG_PAGEALLOC, then the code in > vmap_debug_free_range() should have forced a synchronous TLB flush fro > vmalloc ranges too, so that doesn't explain it either. > My big fear here is that we're just triggering an old stack corruption more reliably. I wonder if we can make it happen much faster by restricting the stacks to a static list and cycling through them in lifo fashion? -chris ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-24 20:46 ` Linus Torvalds 2016-10-24 21:17 ` Linus Torvalds @ 2016-10-24 22:42 ` Andy Lutomirski 2016-10-25 0:00 ` Linus Torvalds 1 sibling, 1 reply; 118+ messages in thread From: Andy Lutomirski @ 2016-10-24 22:42 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Chris Mason, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Mon, Oct 24, 2016 at 1:46 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, Oct 24, 2016 at 1:06 PM, Andy Lutomirski <luto@amacapital.net> wrote: >>> >>> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC >> >> This is an unhandled kernel page fault. The string "Oops" is so helpful :-/ > > I think there was a line above it that DaveJ just didn't include. > >> >>> [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11 >>> [69943.463510] task: ffff8804f8dd3740 task.stack: ffffc9000b108000 >>> [69943.468077] RIP: 0010:[<ffffffff810c3f6b>] >>> [69943.472704] [<ffffffff810c3f6b>] __lock_acquire.isra.32+0x6b/0x8c0 >>> [69943.477489] RSP: 0018:ffffc9000b10b9e8 EFLAGS: 00010086 >>> [69943.482368] RAX: ffffffff81789b90 RBX: ffff8804f8dd3740 RCX: 0000000000000000 >>> [69943.487410] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 >>> [69943.492515] RBP: ffffc9000b10ba18 R08: 0000000000000001 R09: 0000000000000000 >>> [69943.497666] R10: 0000000000000001 R11: 00003f9cfa7f4e73 R12: 0000000000000000 >>> [69943.502880] R13: 0000000000000000 R14: ffffc9000af7bd48 R15: ffff8804f8dd3740 >>> [69943.508163] FS: 00007f64904a2b40(0000) GS:ffff880507a00000(0000) knlGS:0000000000000000 >>> [69943.513591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [69943.518917] CR2: ffffffff81789d28 CR3: 00000004a8f16000 CR4: 00000000001406e0 >>> [69943.524253] DR0: 00007f5b97fd4000 DR1: 0000000000000000 DR2: 0000000000000000 >>> [69943.529488] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 
>>> [69943.534771] Stack: >>> [69943.540023] ffff880507bd74c0 >>> [69943.545317] ffff8804f8dd3740 0000000000000046 0000000000000286[69943.545456] ffffc9000af7bd08 >>> [69943.550930] 0000000000000100 ffffc9000b10ba50 ffffffff810c4b68[69943.551069] ffffffff810ba40c >>> [69943.556657] ffff880400000000 0000000000000000 ffffc9000af7bd48[69943.556796] Call Trace: >>> [69943.562465] [<ffffffff810c4b68>] lock_acquire+0x58/0x70 >>> [69943.568354] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70 >>> [69943.574306] [<ffffffff8178fef2>] _raw_spin_lock_irqsave+0x42/0x80 >>> [69943.580335] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70 >>> [69943.586237] [<ffffffff810ba40c>] finish_wait+0x3c/0x70 >>> [69943.591992] [<ffffffff81169727>] shmem_fault+0x167/0x1b0 >>> [69943.597807] [<ffffffff810ba6c0>] ? prepare_to_wait_event+0x100/0x100 >>> [69943.603741] [<ffffffff8117b46d>] __do_fault+0x6d/0x1b0 >>> [69943.609743] [<ffffffff8117f168>] handle_mm_fault+0xc58/0x1170 >>> [69943.615822] [<ffffffff8117e553>] ? handle_mm_fault+0x43/0x1170 >>> [69943.621971] [<ffffffff81044982>] __do_page_fault+0x172/0x4e0 >>> [69943.628184] [<ffffffff81044d10>] do_page_fault+0x20/0x70 >>> [69943.634449] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20 >>> [69943.640784] [<ffffffff81791f3f>] page_fault+0x1f/0x30 >>> [69943.647170] [<ffffffff8133d69c>] ? strncpy_from_user+0x5c/0x170 >>> [69943.653480] [<ffffffff8133d686>] ? strncpy_from_user+0x46/0x170 >>> [69943.659632] [<ffffffff811f22a7>] setxattr+0x57/0x170 >>> [69943.665846] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20 >>> [69943.672172] [<ffffffff810c1f09>] ? get_lock_stats+0x19/0x50 >>> [69943.678558] [<ffffffff810a58f6>] ? sched_clock_cpu+0xb6/0xd0 >>> [69943.685007] [<ffffffff810c40cf>] ? __lock_acquire.isra.32+0x1cf/0x8c0 >>> [69943.691542] [<ffffffff8132a8b3>] ? __this_cpu_preempt_check+0x13/0x20 >>> [69943.698130] [<ffffffff8109b9bc>] ? preempt_count_add+0x7c/0xc0 >>> [69943.704791] [<ffffffff811ecda1>] ? 
__mnt_want_write+0x61/0x90 >>> [69943.711519] [<ffffffff811f2638>] SyS_fsetxattr+0x78/0xa0 >>> [69943.718300] [<ffffffff8100255c>] do_syscall_64+0x5c/0x170 >>> [69943.724949] [<ffffffff81790a4b>] entry_SYSCALL64_slow_path+0x25/0x25 >>> [69943.731521] Code: >>> [69943.738124] 00 83 fe 01 0f 86 0e 03 00 00 31 d2 4c 89 f7 44 89 45 d0 89 4d d4 e8 75 e7 ff ff 8b 4d d4 48 85 c0 44 8b 45 d0 0f 84 d8 02 00 00 <f0> ff 80 98 01 00 00 8b 15 e0 21 8f 01 45 8b 8f 50 08 00 00 85 >> >> That's lock incl 0x198(%rax). I think this is: >> >> atomic_inc((atomic_t *)&class->ops); >> >> I suppose this could be stack corruption at work, but after a fair >> amount of staring, I still haven't found anything in the vmap_stack >> code that would cause stack corruption. > > Well, it is intriguing that what faults is this: > > finish_wait(shmem_falloc_waitq, &shmem_fault_wait); > > where 'shmem_fault_wait' is a on-stack wait queue. So it really looks > very much like stack corruption. > > What strikes me is that "finish_wait()" does this optimistic "has my > entry been removed" without holding the waitqueue lock (and uses > list_empty_careful() to make sure it does that "safely"). > > It has that big comment too: > > /* > * shmem_falloc_waitq points into the shmem_fallocate() > * stack of the hole-punching task: shmem_falloc_waitq > * is usually invalid by the time we reach here, but > * finish_wait() does not dereference it in that case; > * though i_lock needed lest racing with wake_up_all(). > */ > > the stack it comes from is the wait queue head from shmem_fallocate(), > which will do "wake_up_all()" under the inode lock. Here's my theory: I think you're looking at the right code but the wrong stack. shmem_fault_wait is fine, but shmem_fault_waitq looks really dicey. Consider: fallocate calls wake_up_all(), which calls autoremove_wait_function(). That wakes up the shmem_fault thread. 
Suppose that the shmem_fault thread gets moving really quickly (presumably because it never went to sleep in the first place -- suppose it hadn't made it to schedule(), so schedule() turns into a no-op). It calls finish_wait() *before* autoremove_wake_function() does the list_del_init(). finish_wait() gets to the list_empty_careful() call and it returns true. Now the fallocate thread catches up and *exits*. Dave's test makes a new thread that reuses the stack (the vmap area or the backing store). Now the shmem_fault thread continues on its merry way and takes q->lock. But oh crap, q->lock is pointing at some random spot on some other thread's stack. Kaboom! If I'm right, actually blowing up due to this bug would have been very, very unlikely on 4.8 and below, and it has nothing to do with vmap stacks per se -- it was the RCU removal. We used to wait for an RCU grace period before reusing a stack, and there couldn't have been an RCU grace period in the middle of finish_wait(), so we were safer. (Not totally safe, because the thread that called fallocate() could have made it somewhere else that used the same stack depth, but I can see that being less likely.) IOW, I think that autoremove_wake_function() is buggy. It should remove the wait entry, then try to wake the thread, then re-add the wait entry if the wakeup fails. Or maybe it should use some atomic flag in the wait entry or have some other locking to make sure that finish_wait does not touch q->lock if autoremove_wake_function succeeds. --Andy ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-24 22:42 ` Andy Lutomirski @ 2016-10-25 0:00 ` Linus Torvalds 2016-10-25 1:09 ` Andy Lutomirski 0 siblings, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-10-25 0:00 UTC (permalink / raw) To: Andy Lutomirski Cc: Dave Jones, Chris Mason, Andy Lutomirski, Jens Axboe, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Mon, Oct 24, 2016 at 3:42 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > Here's my theory: I think you're looking at the right code but the > wrong stack. shmem_fault_wait is fine, but shmem_fault_waitq looks > really dicey. Hmm. > Consider: > > fallocate calls wake_up_all(), which calls autoremove_wait_function(). > That wakes up the shmem_fault thread. Suppose that the shmem_fault > thread gets moving really quickly (presumably because it never went to > sleep in the first place -- suppose it hadn't made it to schedule(), > so schedule() turns into a no-op). It calls finish_wait() *before* > autoremove_wake_function() does the list_del_init(). finish_wait() > gets to the list_empty_careful() call and it returns true. All of this happens under inode->i_lock, so the different parts are serialized. So if this happens before the wakeup, then finish_wait() will see that the wait-queue entry is not empty (it points to the wait queue head in shmem_falloc_waitq). But then it will just remove itself carefully with list_del_init() under the waitqueue lock, and things are fine. Yes, it uses the waitqueue lock on the other stack, but that stack is still ok since (a) we're serialized by inode->i_lock (b) this code ran before the fallocate thread catches up and exits. In other words, your scenario seems to be dependent on those two threads being able to race. But they both hold inode->i_lock in the critical region you are talking about. > Now the fallocate thread catches up and *exits*. Dave's test makes a > new thread that reuses the stack (the vmap area or the backing store).
> > Now the shmem_fault thread continues on its merry way and takes > q->lock. But oh crap, q->lock is pointing at some random spot on some > other thread's stack. Kaboom! Note that q->lock should be entirely immaterial, since inode->i_lock nests outside of it in all uses. Now, if there is some code that runs *without* the inode->i_lock, then that would be a big bug. But I'm not seeing it. I do agree that some race on some stack data structure could easily be the cause of these issues. And yes, the vmap code obviously starts reusing the stack much earlier, and would trigger problems that would essentially be hidden by the fact that the kernel stack used to stay around not just until exit(), but until the process was reaped. I just think that in this case i_lock really looks like it should serialize things correctly. Or are you seeing something I'm not? Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-25 0:00 ` Linus Torvalds @ 2016-10-25 1:09 ` Andy Lutomirski 0 siblings, 0 replies; 118+ messages in thread From: Andy Lutomirski @ 2016-10-25 1:09 UTC (permalink / raw) To: Linus Torvalds Cc: David Sterba, Al Viro, Dave Jones, Linux Kernel, Jens Axboe, Josef Bacik, Chris Mason, linux-btrfs On Oct 24, 2016 5:00 PM, "Linus Torvalds" <torvalds@linux-foundation.org> wrote: > > On Mon, Oct 24, 2016 at 3:42 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > > Now the fallocate thread catches up and *exits*. Dave's test makes a > > new thread that reuses the stack (the vmap area or the backing store). > > > > Now the shmem_fault thread continues on its merry way and takes > > q->lock. But oh crap, q->lock is pointing at some random spot on some > > other thread's stack. Kaboom! > > Note that q->lock should be entirely immaterial, since inode->i_lock > nests outside of it in all uses. > > Now, if there is some code that runs *without* the inode->i_lock, then > that would be a big bug. > > But I'm not seeing it. > > I do agree that some race on some stack data structure could easily be > the cause of these issues. And yes, the vmap code obviously starts > reusing the stack much earlier, and would trigger problems that would > essentially be hidden by the fact that the kernel stack used to stay > around not just until exit(), but until the process was reaped. > > I just think that in this case i_lock really looks like it should > serialize things correctly. > > Or are you seeing something I'm not? No, I missed that. --Andy ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-18 23:42 ` Chris Mason 2016-10-19 0:10 ` Linus Torvalds @ 2016-10-19 17:09 ` Philipp Hahn 2016-10-19 17:43 ` Linus Torvalds 1 sibling, 1 reply; 118+ messages in thread From: Philipp Hahn @ 2016-10-19 17:09 UTC (permalink / raw) To: Chris Mason, Linus Torvalds, Jens Axboe, Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel Hello, On 19.10.2016 01:42, Chris Mason wrote: > On Tue, Oct 18, 2016 at 04:39:22PM -0700, Linus Torvalds wrote: >> On Tue, Oct 18, 2016 at 4:31 PM, Chris Mason <clm@fb.com> wrote: >>> >>> Jens, not sure if you saw the whole thread. This has triggered bad page >>> state errors, and also corrupted a btrfs list. It hurts me to say, >>> but it >>> might not actually be your fault. >> >> Where is that thread, and what is the "this" that triggers problems? >> >> Looking at the "->mq_list" users, I'm not seeing any changes there in >> the last year or so. So I don't think it's the list itself. > > Seems to be the whole thing: > > http://www.gossamer-threads.com/lists/linux/kernel/2545792 > > My guess is xattr, but I don't have a good reason for that. Nearly a month ago I also reported a "list_add corruption", but with 4.1.6: <http://marc.info/?l=linux-kernel&m=147508265316854&w=2> That server runs Samba4, which also is a heavy user of xattr. Might be that it is related. Philipp Hahn ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-19 17:09 ` Philipp Hahn @ 2016-10-19 17:43 ` Linus Torvalds 2016-10-20 6:52 ` Ingo Molnar 0 siblings, 1 reply; 118+ messages in thread From: Linus Torvalds @ 2016-10-19 17:43 UTC (permalink / raw) To: Philipp Hahn, Thomas Gleixner, Ingo Molnar Cc: Chris Mason, Jens Axboe, Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Wed, Oct 19, 2016 at 10:09 AM, Philipp Hahn <pmhahn@pmhahn.de> wrote: > > Nearly a month ago I reported also a "list_add corruption", but with 4.1.6: > <http://marc.info/?l=linux-kernel&m=147508265316854&w=2> > > That server rungs Samba4, which also is a heavy user of xattr. That one looks very different. In fact, the list that got corrupted for you has since been changed to a hlist (which is *similar* to our doubly-linked list, but has a smaller head and does not allow adding to the end of the list). Also, the "should be" and "was" values are very close, and switched: should be ffffffff81ab3ca8, but was ffffffff81ab3cc8 should be ffffffff81ab3cc8, but was ffffffff81ab3ca8 so it actually looks like it was the same data structure. In particular, it looks like enqueue_timer() ended up racing on adding an entry to one index in the "base->vectors[]" array, while hitting an entry that was pointing to another index near-by. So I don't think it's related. Yours looks like some subtle timer base race. It smells like a locking problem with timers. I'm not seeing what it might be, but it *might* have been fixed by doing the TIMER_MIGRATING bit right in add_timer_on() (commit 22b886dd1018). Adding some timer people just in case, but I don't think your 4.1 report is related. Linus ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-19 17:43 ` Linus Torvalds @ 2016-10-20 6:52 ` Ingo Molnar 2016-10-20 7:17 ` Thomas Gleixner 0 siblings, 1 reply; 118+ messages in thread From: Ingo Molnar @ 2016-10-20 6:52 UTC (permalink / raw) To: Linus Torvalds Cc: Philipp Hahn, Thomas Gleixner, Chris Mason, Jens Axboe, Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, Oct 19, 2016 at 10:09 AM, Philipp Hahn <pmhahn@pmhahn.de> wrote: > > > > Nearly a month ago I reported also a "list_add corruption", but with 4.1.6: > > <http://marc.info/?l=linux-kernel&m=147508265316854&w=2> > > > > That server rungs Samba4, which also is a heavy user of xattr. > > That one looks very different. In fact, the list that got corrupted > for you has since been changed to a hlist (which is *similar* to our > doubly-linked list, but has a smaller head and does not allow adding > to the end of the list). > > Also, the "should be" and "was" values are very close, and switched: > > should be ffffffff81ab3ca8, but was ffffffff81ab3cc8 > should be ffffffff81ab3cc8, but was ffffffff81ab3ca8 > > so it actually looks like it was the same data structure. In > particular, it looks like enqueue_timer() ended up racing on adding an > entry to one index in the "base->vectors[]" array, while hitting an > entry that was pointing to another index near-by. > > So I don't think it's related. Yours looks like some subtle timer base > race. It smells like a locking problem with timers. I'm not seeing > what it might be, but it *might* have been fixed by doing the > TIMER_MIGRATING bit right in add_timer_on() (commit 22b886dd1018). > > Adding some timer people just in case, but I don't think your 4.1 > report is related. 
Side note: in case timer callback related corruption is suspected, a very efficient debugging method is to enable debugobjects tracking+checking: CONFIG_DEBUG_OBJECTS=y CONFIG_DEBUG_OBJECTS_SELFTEST=y CONFIG_DEBUG_OBJECTS_FREE=y CONFIG_DEBUG_OBJECTS_TIMERS=y CONFIG_DEBUG_OBJECTS_WORK=y CONFIG_DEBUG_OBJECTS_RCU_HEAD=y CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER=y CONFIG_DEBUG_KOBJECT_RELEASE=y ( Appending these to the .config and running 'make oldconfig' should enable all of these. ) If the problem is in any of these areas then a debug warning could trigger at a more convenient place than some later 'some unrelated bits got corrupted' warning. Thanks, Ingo ^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: bio linked list corruption. 2016-10-20 6:52 ` Ingo Molnar @ 2016-10-20 7:17 ` Thomas Gleixner 0 siblings, 0 replies; 118+ messages in thread From: Thomas Gleixner @ 2016-10-20 7:17 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Philipp Hahn, Chris Mason, Jens Axboe, Dave Jones, Al Viro, Josef Bacik, David Sterba, linux-btrfs, Linux Kernel On Thu, 20 Oct 2016, Ingo Molnar wrote: > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So I don't think it's related. Yours looks like some subtle timer base > > race. It smells like a locking problem with timers. I'm not seeing > > what it might be, but it *might* have been fixed by doing the > > TIMER_MIGRATING bit right in add_timer_on() (commit 22b886dd1018). 4.1 does not have the newfangled timer wheel. So 22b886dd1018 is irrelevant. > > Adding some timer people just in case, but I don't think your 4.1 > > report is related. This looks like one of the classic issues where an enqueued timer gets freed/reinitialized or otherwise manipulated. So yes, Ingo's recommendation for debugging is the right thing to do. Thanks, tglx ^ permalink raw reply [flat|nested] 118+ messages in thread
end of thread, other threads:[~2016-12-09 21:19 UTC | newest] Thread overview: 118+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-10-11 14:45 btrfs bio linked list corruption Dave Jones 2016-10-11 15:11 ` Al Viro 2016-10-11 15:19 ` Dave Jones 2016-10-11 15:20 ` Chris Mason 2016-10-11 15:49 ` Dave Jones 2016-10-11 15:54 ` Chris Mason 2016-10-11 16:25 ` Dave Jones 2016-10-12 13:47 ` Dave Jones 2016-10-12 14:40 ` Dave Jones 2016-10-12 14:42 ` Chris Mason 2016-10-13 18:16 ` Dave Jones 2016-10-13 21:18 ` Chris Mason 2016-10-13 21:56 ` Dave Jones 2016-10-16 0:42 ` Dave Jones 2016-10-18 1:07 ` Chris Mason 2016-10-18 22:42 ` Dave Jones 2016-10-18 23:12 ` Jens Axboe 2016-10-18 23:31 ` Chris Mason 2016-10-18 23:36 ` Jens Axboe 2016-10-18 23:39 ` Linus Torvalds 2016-10-18 23:42 ` Chris Mason 2016-10-19 0:10 ` Linus Torvalds 2016-10-19 0:19 ` Chris Mason 2016-10-19 0:28 ` Linus Torvalds 2016-10-20 22:48 ` Dave Jones 2016-10-19 1:05 ` Andy Lutomirski 2016-10-20 22:50 ` Dave Jones 2016-10-20 23:01 ` Andy Lutomirski 2016-10-20 23:03 ` Dave Jones 2016-10-20 23:23 ` Andy Lutomirski 2016-10-21 20:02 ` Dave Jones 2016-10-21 20:17 ` Chris Mason 2016-10-21 20:23 ` Dave Jones 2016-10-21 20:38 ` Chris Mason 2016-10-21 20:41 ` Josef Bacik 2016-10-21 21:11 ` Dave Jones 2016-10-22 15:20 ` Dave Jones 2016-10-23 21:32 ` Chris Mason 2016-10-24 4:40 ` Dave Jones 2016-10-24 13:42 ` Chris Mason 2016-10-26 0:27 ` Dave Jones 2016-10-26 1:33 ` Linus Torvalds 2016-10-26 1:39 ` Linus Torvalds 2016-10-26 16:30 ` Dave Jones 2016-10-26 16:48 ` Linus Torvalds 2016-10-26 18:18 ` Dave Jones 2016-10-26 18:42 ` Dave Jones 2016-10-26 19:06 ` Linus Torvalds 2016-10-26 20:00 ` Chris Mason 2016-10-26 21:52 ` Chris Mason 2016-10-26 22:21 ` Linus Torvalds 2016-10-26 22:40 ` Dave Jones 2016-10-26 22:51 ` Linus Torvalds 2016-10-26 22:55 ` Jens Axboe 2016-10-26 22:58 ` Linus Torvalds 2016-10-26 23:03 ` Jens Axboe 2016-10-26 23:07 ` Dave Jones 2016-10-26 23:08 
` Linus Torvalds 2016-10-26 23:20 ` Jens Axboe 2016-10-26 23:38 ` Chris Mason 2016-10-26 23:47 ` Dave Jones 2016-10-27 0:00 ` Jens Axboe 2016-10-27 13:33 ` Chris Mason 2016-10-31 18:55 ` Dave Jones 2016-10-31 19:35 ` Linus Torvalds 2016-10-31 19:44 ` Chris Mason 2016-11-06 16:55 ` btrfs btree_ctree_super fault Dave Jones 2016-11-08 14:59 ` Dave Jones 2016-11-08 15:08 ` Chris Mason 2016-11-10 14:35 ` Dave Jones 2016-11-10 15:27 ` Chris Mason 2016-11-23 19:34 ` bio linked list corruption Dave Jones 2016-11-23 19:58 ` Dave Jones 2016-12-01 15:32 ` btrfs_destroy_inode warn (outstanding extents) Dave Jones 2016-12-03 16:48 ` Dave Jones 2016-12-07 16:15 ` Dave Jones 2016-12-09 21:12 ` Steven Rostedt 2016-12-04 23:04 ` bio linked list corruption Vegard Nossum 2016-12-05 11:10 ` Vegard Nossum 2016-12-05 17:09 ` Vegard Nossum 2016-12-05 17:21 ` Dave Jones 2016-12-05 17:55 ` Linus Torvalds 2016-12-05 19:11 ` Vegard Nossum 2016-12-05 20:10 ` Linus Torvalds 2016-12-05 20:35 ` Linus Torvalds 2016-12-05 21:33 ` Vegard Nossum 2016-12-06 8:42 ` Vegard Nossum 2016-12-06 8:16 ` Peter Zijlstra 2016-12-06 8:36 ` Ingo Molnar 2016-12-06 16:33 ` Linus Torvalds 2016-12-05 20:10 ` Vegard Nossum 2016-12-05 18:11 ` Andy Lutomirski 2016-12-05 18:25 ` Linus Torvalds 2016-12-05 18:26 ` Vegard Nossum 2016-10-26 23:19 ` Chris Mason 2016-10-26 23:21 ` Jens Axboe 2016-10-27 6:33 ` Christoph Hellwig 2016-10-27 16:34 ` Linus Torvalds 2016-10-27 16:36 ` Jens Axboe 2016-10-26 23:01 ` Dave Jones 2016-10-26 23:05 ` Jens Axboe 2016-10-26 22:52 ` Jens Axboe 2016-10-26 22:07 ` Linus Torvalds 2016-10-26 22:54 ` Chris Mason 2016-10-27 5:41 ` Dave Chinner 2016-10-27 17:23 ` Dave Jones 2016-10-24 20:06 ` Andy Lutomirski 2016-10-24 20:46 ` Linus Torvalds 2016-10-24 21:17 ` Linus Torvalds 2016-10-24 21:50 ` Linus Torvalds 2016-10-24 22:02 ` Chris Mason 2016-10-24 22:42 ` Andy Lutomirski 2016-10-25 0:00 ` Linus Torvalds 2016-10-25 1:09 ` Andy Lutomirski 2016-10-19 17:09 ` Philipp Hahn 2016-10-19 17:43 ` Linus 
Torvalds 2016-10-20 6:52 ` Ingo Molnar 2016-10-20 7:17 ` Thomas Gleixner