* Linux-next parallel cp workload hang
@ 2016-05-18  1:46 Xiong Zhou
  2016-05-18  5:56 ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Xiong Zhou @ 2016-05-18  1:46 UTC (permalink / raw)
  To: linux-next, viro; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 5438 bytes --]

Hi,

A parallel cp workload (xfstests generic/273) hangs as shown below.
It's reproducible with a small chance, less than 1/100 I think.

I have hit this in the linux-next 20160504, 0506 and 0510 trees, testing
on xfs with a loop or block device. Ext4 survived several rounds
of testing.

The linux-next 20160510 tree hung within 500 rounds of testing several
times. The same tree with the vfs parallel lookup patchset reverted
survived 900 rounds of testing. The reverted commits are attached.

Bisecting in this patchset identified this commit:

3b0a3c1ac1598722fc289da19219d14f2a37b31f is the first bad commit
commit 3b0a3c1ac1598722fc289da19219d14f2a37b31f
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Wed Apr 20 23:42:46 2016 -0400

    simple local filesystems: switch to ->iterate_shared()

    no changes needed (XFS isn't simple, but it has the same parallelism
    in the interesting parts exercised from CXFS).

With this commit reverted on top of the linux-next 0510 tree, 5000+ rounds
of testing passed.

Although 2000 rounds of testing had been conducted before the good/bad
verdict, I'm not 100 percent sure about all this, since it's
so hard to hit, and I am not that lucky..

The bisect log and full blocked-state process dump log are also attached.

Furthermore, this was first hit when testing fs dax on nvdimm;
however, it's reproducible without the dax mount option, and also
reproducible on a loop device, it just seems harder to hit.

Thanks,
Xiong

[0.771475] INFO: task cp:49033 blocked for more than 120 seconds.
[0.794263] Not tainted 4.6.0-rc6-next-20160504 #5
[0.812515] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[0.841801] cp D ffff880b4e977928 0 49033 49014 0x00000080
[0.868923]  ffff880b4e977928 ffff880ba275d380 ffff880b8d712b80 ffff880b4e978000
[0.897504]  7fffffffffffffff 0000000000000002 0000000000000000 ffff880b8d712b80
[0.925234]  ffff880b4e977940 ffffffff816cbc25 ffff88035a1dabb0 ffff880b4e9779e8
[0.953237] Call Trace:
[0.958314]  [<ffffffff816cbc25>] schedule+0x35/0x80
[0.974854]  [<ffffffff816ceb71>] schedule_timeout+0x231/0x2d0
[0.995728]  [<ffffffff810d2e2d>] ? down_trylock+0x2d/0x40
[1.015351]  [<ffffffffa03fa6c2>] ? xfs_iext_bno_to_ext+0xa2/0x190 [xfs]
[1.040182]  [<ffffffff816cdb89>] __down_common+0xaa/0x104
[1.059021]  [<ffffffffa04076d2>] ? _xfs_buf_find+0x162/0x340 [xfs]
[1.081357]  [<ffffffff816cdc00>] __down+0x1d/0x1f
[1.097166]  [<ffffffff810d2e81>] down+0x41/0x50
[1.112869]  [<ffffffffa04074bc>] xfs_buf_lock+0x3c/0xf0 [xfs]
[1.134504]  [<ffffffffa04076d2>] _xfs_buf_find+0x162/0x340 [xfs]
[1.156871]  [<ffffffffa04078da>] xfs_buf_get_map+0x2a/0x270 [xfs]
[1.180010]  [<ffffffffa04084ed>] xfs_buf_read_map+0x2d/0x180 [xfs]
[1.203538]  [<ffffffffa0435f21>] xfs_trans_read_buf_map+0xf1/0x300 [xfs]
[1.229194]  [<ffffffffa03e75b1>] xfs_da_read_buf+0xd1/0x100 [xfs]
[1.251948]  [<ffffffffa03ed5e6>] xfs_dir3_data_read+0x26/0x60 [xfs]
[1.275736]  [<ffffffffa040923e>] xfs_dir2_leaf_readbuf.isra.12+0x1be/0x4a0 [xfs]
[1.305094]  [<ffffffff816cdc92>] ? down_read+0x12/0x30
[1.323787]  [<ffffffffa04187a4>] ? xfs_ilock+0xe4/0x110 [xfs]
[1.345114]  [<ffffffffa040965b>] xfs_dir2_leaf_getdents+0x13b/0x3d0 [xfs]
[1.371818]  [<ffffffffa0409f26>] xfs_readdir+0x1a6/0x1c0 [xfs]
[1.393471]  [<ffffffffa040bcbb>] xfs_file_readdir+0x2b/0x30 [xfs]
[1.416874]  [<ffffffff8122cec3>] iterate_dir+0x173/0x190
[1.436709]  [<ffffffff81003276>] ? do_audit_syscall_entry+0x66/0x70
[1.460951]  [<ffffffff8122d338>] SyS_getdents+0x98/0x120
[1.480566]  [<ffffffff8122cee0>] ? iterate_dir+0x190/0x190
[1.500909]  [<ffffffff81003b12>] do_syscall_64+0x62/0x110
[1.520847]  [<ffffffff816cf9a1>] entry_SYSCALL64_slow_path+0x25/0x25
[1.545372] INFO: task cp:49040 blocked for more than 120 seconds.
[1.568933] Not tainted 4.6.0-rc6-next-20160504 #5
[1.587943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1.618544] cp D ffff880b91463b00 0 49040 49016 0x00000080
[1.645502]  ffff880b91463b00 ffff880464d5c140 ffff88029b475700 ffff880b91464000
[1.674145]  ffff880411c42610 ffffffffffffffff ffff880411c42628 ffff8802c10bc610
[1.702834]  ffff880b91463b18 ffffffff816cbc25 ffff88029b475700 ffff880b91463b88
[1.731501] Call Trace:
[1.736866]  [<ffffffff816cbc25>] schedule+0x35/0x80
[1.754119]  [<ffffffff816ce3c2>] rwsem_down_read_failed+0xf2/0x140
[1.777411]  [<ffffffffa0418800>] ? xfs_ilock_data_map_shared+0x30/0x40 [xfs]
[1.805090]  [<ffffffff81355e08>] call_rwsem_down_read_failed+0x18/0x30
[1.830482]  [<ffffffff816cdca0>] down_read+0x20/0x30
[1.848505]  [<ffffffffa04187a4>] xfs_ilock+0xe4/0x110 [xfs]
[1.869293]  [<ffffffffa0418800>] xfs_ilock_data_map_shared+0x30/0x40 [xfs]
[1.896775]  [<ffffffffa040bc60>] xfs_dir_open+0x30/0x60 [xfs]
[1.917882]  [<ffffffff8121603f>] do_dentry_open+0x20f/0x320
[1.938919]  [<ffffffffa040bc30>] ? xfs_file_mmap+0x50/0x50 [xfs]
[1.961532]  [<ffffffff81217487>] vfs_open+0x57/0x60
[1.978945]  [<ffffffff81226f65>] path_openat+0x325/0x14e0
[1.999273]  [<ffffffff81228533>] ? putname+0x53/0x60
[2.017695]  [<ffffffff81229581>] do_filp_open+0x91/0x100
[2.036893]  [<ffffffff81237866>] ? __alloc_fd+0x46/0x180
[2.055479]  [<ffffffff812177e4>] do_sys_open+0x124/0x210
[2.073783]  [<ffffffff8113110b>] ? __audit_syscall_exit+0x1db/0x260
[2.096426]  [<ffffffff81217904>] SyS_openat+0x14/0x20
[2.113690]  [<ffffffff81003b12>] do_syscall_64+0x62/0x110
[2.132417]  [<ffffffff816cf9a1>] entry_SYSCALL64_slow_path+0x25/0x25

[-- Attachment #2: g273-block-dumps.tar.gz --]
[-- Type: application/gzip, Size: 102883 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Linux-next parallel cp workload hang
  2016-05-18  1:46 Linux-next parallel cp workload hang Xiong Zhou
@ 2016-05-18  5:56 ` Dave Chinner
  2016-05-18  8:31   ` Xiong Zhou
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2016-05-18  5:56 UTC (permalink / raw)
  To: Xiong Zhou; +Cc: linux-next, viro, linux-kernel, linux-fsdevel

On Wed, May 18, 2016 at 09:46:15AM +0800, Xiong Zhou wrote:
> Hi,
>
> A parallel cp workload (xfstests generic/273) hangs as shown below.
> It's reproducible with a small chance, less than 1/100 I think.
>
> I have hit this in the linux-next 20160504, 0506 and 0510 trees, testing
> on xfs with a loop or block device. Ext4 survived several rounds
> of testing.
>
> The linux-next 20160510 tree hung within 500 rounds of testing several
> times. The same tree with the vfs parallel lookup patchset reverted
> survived 900 rounds of testing. The reverted commits are attached.

What hardware?

> Bisecting in this patchset identified this commit:
>
> 3b0a3c1ac1598722fc289da19219d14f2a37b31f is the first bad commit
> commit 3b0a3c1ac1598722fc289da19219d14f2a37b31f
> Author: Al Viro <viro@zeniv.linux.org.uk>
> Date:   Wed Apr 20 23:42:46 2016 -0400
>
>     simple local filesystems: switch to ->iterate_shared()
>
>     no changes needed (XFS isn't simple, but it has the same parallelism
>     in the interesting parts exercised from CXFS).
>
> With this commit reverted on top of the linux-next 0510 tree, 5000+ rounds
> of testing passed.
>
> Although 2000 rounds of testing had been conducted before the good/bad
> verdict, I'm not 100 percent sure about all this, since it's
> so hard to hit, and I am not that lucky..
>
> The bisect log and full blocked-state process dump log are also attached.
>
> Furthermore, this was first hit when testing fs dax on nvdimm;
> however, it's reproducible without the dax mount option, and also
> reproducible on a loop device, it just seems harder to hit.
>
> Thanks,
> Xiong
>
> [0.771475] INFO: task cp:49033 blocked for more than 120 seconds.
> [0.794263] Not tainted 4.6.0-rc6-next-20160504 #5
> [0.812515] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [0.841801] cp D ffff880b4e977928 0 49033 49014 0x00000080
> [0.868923]  ffff880b4e977928 ffff880ba275d380 ffff880b8d712b80 ffff880b4e978000
> [0.897504]  7fffffffffffffff 0000000000000002 0000000000000000 ffff880b8d712b80
> [0.925234]  ffff880b4e977940 ffffffff816cbc25 ffff88035a1dabb0 ffff880b4e9779e8
> [0.953237] Call Trace:
> [0.958314]  [<ffffffff816cbc25>] schedule+0x35/0x80
> [0.974854]  [<ffffffff816ceb71>] schedule_timeout+0x231/0x2d0
> [0.995728]  [<ffffffff810d2e2d>] ? down_trylock+0x2d/0x40
> [1.015351]  [<ffffffffa03fa6c2>] ? xfs_iext_bno_to_ext+0xa2/0x190 [xfs]
> [1.040182]  [<ffffffff816cdb89>] __down_common+0xaa/0x104
> [1.059021]  [<ffffffffa04076d2>] ? _xfs_buf_find+0x162/0x340 [xfs]
> [1.081357]  [<ffffffff816cdc00>] __down+0x1d/0x1f
> [1.097166]  [<ffffffff810d2e81>] down+0x41/0x50
> [1.112869]  [<ffffffffa04074bc>] xfs_buf_lock+0x3c/0xf0 [xfs]
> [1.134504]  [<ffffffffa04076d2>] _xfs_buf_find+0x162/0x340 [xfs]
> [1.156871]  [<ffffffffa04078da>] xfs_buf_get_map+0x2a/0x270 [xfs]

So what's holding that directory data buffer lock? It should only be
held if there is either IO in progress, or a modification of the
buffer in progress that is blocked somewhere else.

> [1.180010]  [<ffffffffa04084ed>] xfs_buf_read_map+0x2d/0x180 [xfs]
> [1.203538]  [<ffffffffa0435f21>] xfs_trans_read_buf_map+0xf1/0x300 [xfs]
> [1.229194]  [<ffffffffa03e75b1>] xfs_da_read_buf+0xd1/0x100 [xfs]
> [1.251948]  [<ffffffffa03ed5e6>] xfs_dir3_data_read+0x26/0x60 [xfs]
> [1.275736]  [<ffffffffa040923e>] xfs_dir2_leaf_readbuf.isra.12+0x1be/0x4a0 [xfs]
> [1.305094]  [<ffffffff816cdc92>] ? down_read+0x12/0x30
> [1.323787]  [<ffffffffa04187a4>] ? xfs_ilock+0xe4/0x110 [xfs]
> [1.345114]  [<ffffffffa040965b>] xfs_dir2_leaf_getdents+0x13b/0x3d0 [xfs]
> [1.371818]  [<ffffffffa0409f26>] xfs_readdir+0x1a6/0x1c0 [xfs]

So we should be holding the ilock in shared mode here...

> [1.393471]  [<ffffffffa040bcbb>] xfs_file_readdir+0x2b/0x30 [xfs]
> [1.416874]  [<ffffffff8122cec3>] iterate_dir+0x173/0x190
> [1.436709]  [<ffffffff81003276>] ? do_audit_syscall_entry+0x66/0x70
> [1.460951]  [<ffffffff8122d338>] SyS_getdents+0x98/0x120
> [1.480566]  [<ffffffff8122cee0>] ? iterate_dir+0x190/0x190
> [1.500909]  [<ffffffff81003b12>] do_syscall_64+0x62/0x110
> [1.520847]  [<ffffffff816cf9a1>] entry_SYSCALL64_slow_path+0x25/0x25
> [1.545372] INFO: task cp:49040 blocked for more than 120 seconds.
> [1.568933] Not tainted 4.6.0-rc6-next-20160504 #5
> [1.587943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [1.618544] cp D ffff880b91463b00 0 49040 49016 0x00000080
> [1.645502]  ffff880b91463b00 ffff880464d5c140 ffff88029b475700 ffff880b91464000
> [1.674145]  ffff880411c42610 ffffffffffffffff ffff880411c42628 ffff8802c10bc610
> [1.702834]  ffff880b91463b18 ffffffff816cbc25 ffff88029b475700 ffff880b91463b88
> [1.731501] Call Trace:
> [1.736866]  [<ffffffff816cbc25>] schedule+0x35/0x80
> [1.754119]  [<ffffffff816ce3c2>] rwsem_down_read_failed+0xf2/0x140
> [1.777411]  [<ffffffffa0418800>] ? xfs_ilock_data_map_shared+0x30/0x40 [xfs]
> [1.805090]  [<ffffffff81355e08>] call_rwsem_down_read_failed+0x18/0x30
> [1.830482]  [<ffffffff816cdca0>] down_read+0x20/0x30
> [1.848505]  [<ffffffffa04187a4>] xfs_ilock+0xe4/0x110 [xfs]
> [1.869293]  [<ffffffffa0418800>] xfs_ilock_data_map_shared+0x30/0x40

And this is an attempt to lock the inode shared, so if that is
failing while there's another shared holder, that means there's an
exclusive waiter queued up (i.e. read held -> write blocked -> read
blocked).

So looking at dump-g273xfs0510:

[ 845.727907] INFO: task cp:40126 blocked for more than 120 seconds.
[ 845.751175] Not tainted 4.6.0-rc7-next-20160510 #9
[ 845.770011] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 845.799967] cp D ffff88046b64bcb8 0 40126 40106 0x00000080
[ 845.826636]  ffff88046b64bcb8 ffffffffa0458d36 ffff8804685c2b80 0000000002408240
[ 845.855232]  ffff88046b64c000 ffff880426b08450 ffff880426b08450 ffff880426b08468
[ 845.883836]  ffff880426b08458 ffff88046b64bcd0 ffffffff816ce9c5 ffff8804685c2b80
[ 845.911854] Call Trace:
[ 845.917076]  [<ffffffffa0458d36>] ? kmem_zone_alloc+0x96/0x120 [xfs]
[ 845.940302]  [<ffffffff816ce9c5>] schedule+0x35/0x80
[ 845.957118]  [<ffffffff816d1534>] rwsem_down_write_failed+0x114/0x230
[ 845.981242]  [<ffffffffa045a294>] ? xlog_grant_head_check+0x54/0x100 [xfs]
[ 846.007673]  [<ffffffffa04492e4>] ? xfs_vn_update_time+0xd4/0x160 [xfs]
[ 846.032009]  [<ffffffff81358217>] call_rwsem_down_write_failed+0x17/0x30
[ 846.056909]  [<ffffffff816d0a7d>] down_write+0x2d/0x40
[ 846.074580]  [<ffffffffa044b78f>] xfs_ilock+0xcf/0x110 [xfs]
[ 846.094681]  [<ffffffffa04492e4>] xfs_vn_update_time+0xd4/0x160 [xfs]
[ 846.118308]  [<ffffffff81238648>] touch_atime+0xa8/0xd0
[ 846.136385]  [<ffffffff8122f132>] iterate_dir+0x112/0x190
[ 846.155995]  [<ffffffff8122f608>] SyS_getdents+0x98/0x120
[ 846.175819]  [<ffffffff8122f1b0>] ? iterate_dir+0x190/0x190
[ 846.195808]  [<ffffffff81003b12>] do_syscall_64+0x62/0x110
[ 846.215019]  [<ffffffff816d2761>] entry_SYSCALL64_slow_path+0x25/0x25

Timestamp updates are transactional, requiring exclusive access to
the directory ilock. That's why we have a train-smash of down_read()
calls blocking in these dumps - they are queued up behind this call.
This is expected under workloads where a couple of hundred threads
are all reading the same directory and trying to open files in it.

So why isn't the current read holder making progress? Why is it
blocked on a directory data buffer lock when the only accesses
should be read only?
There's nothing in either of the stack traces that tells me this. I
suspect that there may be some kind of buffer readahead vs buffer
read interaction that results in leaking a buffer lock, but it is
not obvious what that may be yet.

Can you reproduce this with CONFIG_XFS_DEBUG=y set? If you can, and
it doesn't trigger any warnings or asserts, can you then try to
reproduce it while tracing the following events:

	xfs_buf_lock
	xfs_buf_lock_done
	xfs_buf_trylock
	xfs_buf_unlock

So we might be able to see if there's an unexpected buffer
locking/state pattern occurring when the hang occurs?

Also, if you run on slower storage, does the hang get harder or
easier to hit?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
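The rwsem queueing behaviour described above - one exclusive waiter turning every later shared attempt into a blocked waiter - can be sketched in miniature. This is a hypothetical toy model in Python (the names `FairRWSem`, `try_down_read`, etc. are invented for illustration), not the kernel's rwsem implementation:

```python
import threading

class FairRWSem:
    """Toy fair rwsem: a single queued writer blocks all *new* readers,
    which is the queueing property behind the down_read() train-smash."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._readers = 0
        self._writers_waiting = 0

    def try_down_read(self):
        # A new reader must wait if any writer is queued, even though
        # only readers currently hold the lock.
        with self._mutex:
            if self._writers_waiting:
                return False
            self._readers += 1
            return True

    def up_read(self):
        with self._mutex:
            self._readers -= 1

    def queue_writer(self):
        # A writer (think xfs_vn_update_time wanting ilock(excl))
        # arrives and queues behind the current readers.
        with self._mutex:
            self._writers_waiting += 1

ilock = FairRWSem()
assert ilock.try_down_read()      # readdir takes ilock(shared): fine
assert ilock.try_down_read()      # a second shared holder: still fine
ilock.queue_writer()              # timestamp update queues for excl
assert not ilock.try_down_read()  # now every new shared attempt blocks
```

The point of the sketch is that "shared" does not mean "always grantable": once an exclusive waiter is queued for fairness, a lock holder that re-requests shared access can itself be blocked.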
* Re: Linux-next parallel cp workload hang
  2016-05-18  5:56 ` Dave Chinner
@ 2016-05-18  8:31   ` Xiong Zhou
  2016-05-18  9:54     ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Xiong Zhou @ 2016-05-18  8:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Xiong Zhou, linux-next, viro, linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 9125 bytes --]

Hi,

On Wed, May 18, 2016 at 03:56:34PM +1000, Dave Chinner wrote:
> On Wed, May 18, 2016 at 09:46:15AM +0800, Xiong Zhou wrote:
> > Hi,
> >
> > A parallel cp workload (xfstests generic/273) hangs as shown below.
> > It's reproducible with a small chance, less than 1/100 I think.
> >
> > I have hit this in the linux-next 20160504, 0506 and 0510 trees, testing
> > on xfs with a loop or block device. Ext4 survived several rounds
> > of testing.
> >
> > The linux-next 20160510 tree hung within 500 rounds of testing several
> > times. The same tree with the vfs parallel lookup patchset reverted
> > survived 900 rounds of testing. The reverted commits are attached.
>
> What hardware?

An HP prototype host.

> > Bisecting in this patchset identified this commit:
> >
> > 3b0a3c1ac1598722fc289da19219d14f2a37b31f is the first bad commit
> > commit 3b0a3c1ac1598722fc289da19219d14f2a37b31f
> > Author: Al Viro <viro@zeniv.linux.org.uk>
> > Date:   Wed Apr 20 23:42:46 2016 -0400
> >
> >     simple local filesystems: switch to ->iterate_shared()
> >
> >     no changes needed (XFS isn't simple, but it has the same parallelism
> >     in the interesting parts exercised from CXFS).
> >
> > With this commit reverted on top of the linux-next 0510 tree, 5000+ rounds
> > of testing passed.
> >
> > Although 2000 rounds of testing had been conducted before the good/bad
> > verdict, I'm not 100 percent sure about all this, since it's
> > so hard to hit, and I am not that lucky..
> >
> > The bisect log and full blocked-state process dump log are also attached.
> > > > Furthermore, this was first hit when testing fs dax on nvdimm, > > however it's reproducible without dax mount option, and also > > reproducible on loop device, just seems harder to hit. > > > > Thanks, > > Xiong > > > > [0.771475] INFO: task cp:49033 blocked for more than 120 seconds. > > [0.794263] Not tainted 4.6.0-rc6-next-20160504 #5 > > [0.812515] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > > this message. > > [0.841801] cp D ffff880b4e977928 0 49033 49014 > > 0x00000080 > > [0.868923] ffff880b4e977928 ffff880ba275d380 ffff880b8d712b80 > > ffff880b4e978000 > > [0.897504] 7fffffffffffffff 0000000000000002 0000000000000000 > > ffff880b8d712b80 > > [0.925234] ffff880b4e977940 ffffffff816cbc25 ffff88035a1dabb0 > > ffff880b4e9779e8 > > [0.953237] Call Trace: > > [0.958314] [<ffffffff816cbc25>] schedule+0x35/0x80 > > [0.974854] [<ffffffff816ceb71>] schedule_timeout+0x231/0x2d0 > > [0.995728] [<ffffffff810d2e2d>] ? down_trylock+0x2d/0x40 > > [1.015351] [<ffffffffa03fa6c2>] ? xfs_iext_bno_to_ext+0xa2/0x190 [xfs] > > [1.040182] [<ffffffff816cdb89>] __down_common+0xaa/0x104 > > [1.059021] [<ffffffffa04076d2>] ? _xfs_buf_find+0x162/0x340 [xfs] > > [1.081357] [<ffffffff816cdc00>] __down+0x1d/0x1f > > [1.097166] [<ffffffff810d2e81>] down+0x41/0x50 > > [1.112869] [<ffffffffa04074bc>] xfs_buf_lock+0x3c/0xf0 [xfs] > > [1.134504] [<ffffffffa04076d2>] _xfs_buf_find+0x162/0x340 [xfs] > > [1.156871] [<ffffffffa04078da>] xfs_buf_get_map+0x2a/0x270 [xfs] > > So what's holding that directory data buffer lock? It should only be > held if there is either IO in progress, or a modification of the > buffer in progress that is blocked somewhere else. 
> > > [1.180010] [<ffffffffa04084ed>] xfs_buf_read_map+0x2d/0x180 [xfs] > > [1.203538] [<ffffffffa0435f21>] xfs_trans_read_buf_map+0xf1/0x300 [xfs] > > [1.229194] [<ffffffffa03e75b1>] xfs_da_read_buf+0xd1/0x100 [xfs] > > [1.251948] [<ffffffffa03ed5e6>] xfs_dir3_data_read+0x26/0x60 [xfs] > > [1.275736] [<ffffffffa040923e>] xfs_dir2_leaf_readbuf.isra.12+0x1be/0x4a0 [xfs] > > [1.305094] [<ffffffff816cdc92>] ? down_read+0x12/0x30 > > [1.323787] [<ffffffffa04187a4>] ? xfs_ilock+0xe4/0x110 [xfs] > > [1.345114] [<ffffffffa040965b>] xfs_dir2_leaf_getdents+0x13b/0x3d0 [xfs] > > [1.371818] [<ffffffffa0409f26>] xfs_readdir+0x1a6/0x1c0 [xfs] > > So we should be holding the ilock in shared mode here... > > > [1.393471] [<ffffffffa040bcbb>] xfs_file_readdir+0x2b/0x30 [xfs] > > [1.416874] [<ffffffff8122cec3>] iterate_dir+0x173/0x190 > > [1.436709] [<ffffffff81003276>] ? do_audit_syscall_entry+0x66/0x70 > > [1.460951] [<ffffffff8122d338>] SyS_getdents+0x98/0x120 > > [1.480566] [<ffffffff8122cee0>] ? iterate_dir+0x190/0x190 > > [1.500909] [<ffffffff81003b12>] do_syscall_64+0x62/0x110 > > [1.520847] [<ffffffff816cf9a1>] entry_SYSCALL64_slow_path+0x25/0x25 > > [1.545372] INFO: task cp:49040 blocked for more than 120 seconds. > > [1.568933] Not tainted 4.6.0-rc6-next-20160504 #5 > > [1.587943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > > this message. > > [1.618544] cp D ffff880b91463b00 0 49040 49016 > > 0x00000080 > > [1.645502] ffff880b91463b00 ffff880464d5c140 ffff88029b475700 > > ffff880b91464000 > > [1.674145] ffff880411c42610 ffffffffffffffff ffff880411c42628 > > ffff8802c10bc610 > > [1.702834] ffff880b91463b18 ffffffff816cbc25 ffff88029b475700 > > ffff880b91463b88 > > [1.731501] Call Trace: > > [1.736866] [<ffffffff816cbc25>] schedule+0x35/0x80 > > [1.754119] [<ffffffff816ce3c2>] rwsem_down_read_failed+0xf2/0x140 > > [1.777411] [<ffffffffa0418800>] ? 
xfs_ilock_data_map_shared+0x30/0x40 > > [xfs] > > [1.805090] [<ffffffff81355e08>] call_rwsem_down_read_failed+0x18/0x30 > > [1.830482] [<ffffffff816cdca0>] down_read+0x20/0x30 > > [1.848505] [<ffffffffa04187a4>] xfs_ilock+0xe4/0x110 [xfs] > > [1.869293] [<ffffffffa0418800>] xfs_ilock_data_map_shared+0x30/0x40 > > And it this is an attempt to lock the inode shared, so if that is > failing while there's another shared holder, than means there's an > exclusive waiter queued up (i.e. read iheld -> write blocked -> read > blocked). > > > So looking at dump-g273xfs0510: > > [ 845.727907] INFO: task cp:40126 blocked for more than 120 seconds. > [ 845.751175] Not tainted 4.6.0-rc7-next-20160510 #9 > [ 845.770011] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 845.799967] cp D ffff88046b64bcb8 0 40126 40106 0x00000080 > [ 845.826636] ffff88046b64bcb8 ffffffffa0458d36 ffff8804685c2b80 0000000002408240 > [ 845.855232] ffff88046b64c000 ffff880426b08450 ffff880426b08450 ffff880426b08468 > [ 845.883836] ffff880426b08458 ffff88046b64bcd0 ffffffff816ce9c5 ffff8804685c2b80 > [ 845.911854] Call Trace: > [ 845.917076] [<ffffffffa0458d36>] ? kmem_zone_alloc+0x96/0x120 [xfs] > [ 845.940302] [<ffffffff816ce9c5>] schedule+0x35/0x80 > [ 845.957118] [<ffffffff816d1534>] rwsem_down_write_failed+0x114/0x230 > [ 845.981242] [<ffffffffa045a294>] ? xlog_grant_head_check+0x54/0x100 [xfs] > [ 846.007673] [<ffffffffa04492e4>] ? xfs_vn_update_time+0xd4/0x160 [xfs] > [ 846.032009] [<ffffffff81358217>] call_rwsem_down_write_failed+0x17/0x30 > [ 846.056909] [<ffffffff816d0a7d>] down_write+0x2d/0x40 > [ 846.074580] [<ffffffffa044b78f>] xfs_ilock+0xcf/0x110 [xfs] > [ 846.094681] [<ffffffffa04492e4>] xfs_vn_update_time+0xd4/0x160 [xfs] > [ 846.118308] [<ffffffff81238648>] touch_atime+0xa8/0xd0 > [ 846.136385] [<ffffffff8122f132>] iterate_dir+0x112/0x190 > [ 846.155995] [<ffffffff8122f608>] SyS_getdents+0x98/0x120 > [ 846.175819] [<ffffffff8122f1b0>] ? 
iterate_dir+0x190/0x190
> [ 846.195808]  [<ffffffff81003b12>] do_syscall_64+0x62/0x110
> [ 846.215019]  [<ffffffff816d2761>] entry_SYSCALL64_slow_path+0x25/0x25
>
> Timestamp updates are transactional, requiring exclusive access to
> the directory ilock. That's why we have a train-smash of down_read()
> calls blocking in these dumps - they are queued up behind this call.
> This is expected under workloads where a couple of hundred threads
> are all reading the same directory and trying to open files in it.
>
> So why isn't the current read holder making progress? Why is it
> blocked on a directory data buffer lock when the only accesses
> should be read only?
>
> There's nothing in either of the stack traces that tells me this. I
> suspect that there may be some kind of buffer readahead vs buffer
> read interaction that results in leaking a buffer lock, but it is
> not obvious what that may be yet.
>
> Can you reproduce this with CONFIG_XFS_DEBUG=y set? If you can, and
> it doesn't trigger any warnings or asserts, can you then try to
> reproduce it while tracing the following events:
>
> 	xfs_buf_lock
> 	xfs_buf_lock_done
> 	xfs_buf_trylock
> 	xfs_buf_unlock
>
> So we might be able to see if there's an unexpected buffer
> locking/state pattern occurring when the hang occurs?

Yes, I've reproduced this with both CONFIG_XFS_DEBUG=y and the tracers
on. There is some trace output after the hang persists for a while.
The tracing output, dmesg, block stack dump and config are attached.

BTW, reproduced on the mainline tree this time, at commit 0b7962a.

> Also, if you run on slower storage, does the hang get harder or
> easier to hit?

Harder, for me. From easiest to hardest: nvdimm device with the dax
mount option, then nvdimm device without the dax mount option, then
loop device.

Thanks,
Xiong

> Cheers,
>
> Dave.
> -- > Dave Chinner > david@fromorbit.com [-- Attachment #2: g273-mainline.tar.gz --] [-- Type: application/gzip, Size: 139695 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux-next parallel cp workload hang 2016-05-18 8:31 ` Xiong Zhou @ 2016-05-18 9:54 ` Dave Chinner 2016-05-18 11:46 ` Xiong Zhou 0 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2016-05-18 9:54 UTC (permalink / raw) To: Xiong Zhou; +Cc: linux-next, viro, linux-kernel, linux-fsdevel On Wed, May 18, 2016 at 04:31:50PM +0800, Xiong Zhou wrote: > Hi, > > On Wed, May 18, 2016 at 03:56:34PM +1000, Dave Chinner wrote: > > On Wed, May 18, 2016 at 09:46:15AM +0800, Xiong Zhou wrote: > > > Hi, > > > > > > Parallel cp workload (xfstests generic/273) hangs like blow. > > > It's reproducible with a small chance, less the 1/100 i think. > > > > > > Have hit this in linux-next 20160504 0506 0510 trees, testing on > > > xfs with loop or block device. Ext4 survived several rounds > > > of testing. > > > > > > Linux next 20160510 tree hangs within 500 rounds testing several > > > times. The same tree with vfs parallel lookup patchset reverted > > > survived 900 rounds testing. Reverted commits are attached. > > > What hardware? > > A HP prototype host. description? cpus, memory, etc? I want to have some idea of what hardware I need to reproduce this... xfs_info from the scratch filesystem would also be handy. > > Can you reproduce this with CONFIG_XFS_DEBUG=y set? if you can, and > > it doesn't trigger any warnings or asserts, can you then try to > > reproduce it while tracing the following events: > > > > xfs_buf_lock > > xfs_buf_lock_done > > xfs_buf_trylock > > xfs_buf_unlock > > > > So we might be able to see if there's an unexpected buffer > > locking/state pattern occurring when the hang occurs? > > Yes, i've reproduced this with both CONFIG_XFS_DEBUG=y and the tracers > on. There are some trace output after hang for a while. I'm not actually interested in the trace after the hang - I'm interested in what happened leading up to the hang. 
The output you've given me tells me that the directory block at offset
is locked, but nothing in the trace tells me what locked it.

Can I suggest using trace-cmd to record the events, then when the
test hangs kill the check process so that trace-cmd terminates and
gathers the events. Then dump the report to a text file and attach
that?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
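A trace-cmd invocation along the lines Dave suggests might look like this. This is a command sketch, not a tested recipe: the xfstests location and the exact event names available depend on the kernel build, and `trace-cmd record`/`trace-cmd report` need root and a mounted tracefs:

```shell
# Record only the four buffer-lock events while the test runs
# (assumes xfstests is installed in /var/lib/xfstests; adjust to taste).
cd /var/lib/xfstests
trace-cmd record -e xfs:xfs_buf_lock -e xfs:xfs_buf_lock_done \
                 -e xfs:xfs_buf_trylock -e xfs:xfs_buf_unlock \
                 ./check generic/273

# When the test hangs, kill the check process; trace-cmd then writes
# trace.dat. Turn it into a text report to attach to the thread:
trace-cmd report > g273-trace-report.txt
```

Killing the workload rather than trace-cmd itself is the key step: it lets trace-cmd shut down cleanly and flush the per-CPU buffers that cover the lead-up to the hang.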
* Re: Linux-next parallel cp workload hang 2016-05-18 9:54 ` Dave Chinner @ 2016-05-18 11:46 ` Xiong Zhou 2016-05-18 14:17 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Xiong Zhou @ 2016-05-18 11:46 UTC (permalink / raw) To: Dave Chinner; +Cc: Xiong Zhou, linux-next, viro, linux-kernel, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 3796 bytes --] Hi, On Wed, May 18, 2016 at 07:54:09PM +1000, Dave Chinner wrote: > On Wed, May 18, 2016 at 04:31:50PM +0800, Xiong Zhou wrote: > > Hi, > > > > On Wed, May 18, 2016 at 03:56:34PM +1000, Dave Chinner wrote: > > > On Wed, May 18, 2016 at 09:46:15AM +0800, Xiong Zhou wrote: > > > > Hi, > > > > > > > > Parallel cp workload (xfstests generic/273) hangs like blow. > > > > It's reproducible with a small chance, less the 1/100 i think. > > > > > > > > Have hit this in linux-next 20160504 0506 0510 trees, testing on > > > > xfs with loop or block device. Ext4 survived several rounds > > > > of testing. > > > > > > > > Linux next 20160510 tree hangs within 500 rounds testing several > > > > times. The same tree with vfs parallel lookup patchset reverted > > > > survived 900 rounds testing. Reverted commits are attached. > > > > What hardware? > > > > A HP prototype host. > > description? cpus, memory, etc? I want to have some idea of what > hardware I need to reproduce this... 
#lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Stepping:              2
CPU MHz:               2596.918
BogoMIPS:              5208.33
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47

#free -m
              total        used        free      shared  buff/cache   available
Mem:          31782         623       27907           9        3251       30491
Swap:         10239           0       10239

> xfs_info from the scratch filesystem would also be handy.

meta-data=/dev/pmem1             isize=256    agcount=4, agsize=131072 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=524288, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

> > > Can you reproduce this with CONFIG_XFS_DEBUG=y set? If you can, and
> > > it doesn't trigger any warnings or asserts, can you then try to
> > > reproduce it while tracing the following events:
> > >
> > > 	xfs_buf_lock
> > > 	xfs_buf_lock_done
> > > 	xfs_buf_trylock
> > > 	xfs_buf_unlock
> > >
> > > So we might be able to see if there's an unexpected buffer
> > > locking/state pattern occurring when the hang occurs?
> >
> > Yes, I've reproduced this with both CONFIG_XFS_DEBUG=y and the tracers
> > on. There is some trace output after the hang persists for a while.
>
> I'm not actually interested in the trace after the hang - I'm
> interested in what happened leading up to the hang. The output
> you've given me tells me that the directory block at offset is locked
> but nothing in the trace tells me what locked it.
> Can I suggest using trace-cmd to record the events, then when the
> test hangs kill the check process so that trace-cmd terminates and
> gathers the events. Then dump the report to a text file and attach
> that?

Sure. The trace report, dmesg, and ps axjf after Ctrl+C are attached.

Thanks for the instructions and patience.

Xiong

> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

[-- Attachment #2: g273-trace-report.tar.gz --]
[-- Type: application/gzip, Size: 244506 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Linux-next parallel cp workload hang
  2016-05-18 11:46     ` Xiong Zhou
@ 2016-05-18 14:17       ` Dave Chinner
  2016-05-18 23:02         ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2016-05-18 14:17 UTC (permalink / raw)
  To: Xiong Zhou; +Cc: linux-next, viro, linux-kernel, linux-fsdevel

On Wed, May 18, 2016 at 07:46:17PM +0800, Xiong Zhou wrote:
> On Wed, May 18, 2016 at 07:54:09PM +1000, Dave Chinner wrote:
> > On Wed, May 18, 2016 at 04:31:50PM +0800, Xiong Zhou wrote:
> > > Hi,
> > >
> > > On Wed, May 18, 2016 at 03:56:34PM +1000, Dave Chinner wrote:
> > > > On Wed, May 18, 2016 at 09:46:15AM +0800, Xiong Zhou wrote:
> > > > > Hi,
> > > > >
> > > > > A parallel cp workload (xfstests generic/273) hangs as shown below.
> > > > > It's reproducible with a small chance, less than 1/100 I think.
> > > > >
> > > > > I have hit this in the linux-next 20160504, 0506 and 0510 trees,
> > > > > testing on xfs with a loop or block device. Ext4 survived several
> > > > > rounds of testing.
> > > > >
> > > > > The linux-next 20160510 tree hung within 500 rounds of testing
> > > > > several times. The same tree with the vfs parallel lookup patchset
> > > > > reverted survived 900 rounds of testing. The reverted commits are
> > > > > attached.

Ok, this is trivial to reproduce. Al - I've hit this 9 times out of
ten running it on a 4p VM with a pair of 4GB ram disks using all
the current upstream default mkfs and mount configurations.

On the tenth attempt I got the tracing to capture what I needed to
see - process 7340 was the last xfs_buf_lock_done trace without an
unlock trace, and that process had this trace:

	schedule
	rwsem_down_read_failed
	call_rwsem_down_read_failed
	down_read
	xfs_ilock
	xfs_ilock_data_map_shared
	xfs_dir2_leaf_getdents
	xfs_readdir
	xfs_file_readdir
	iterate_dir
	SyS_getdents
	entry_SYSCALL_64_fastpath

Which means it's holding a buffer lock while trying to get the
ilock(shared).
That's never going to end well - I'm now wondering why lockdep
hasn't been all over this lock order inversion....

Essentially, it's a three-process deadlock involving
shared/exclusive barriers and inverted lock orders. It's a
pre-existing problem with buffer mapping lock orders, nothing to do
with the VFS parallelisation code.

process 1               process 2               process 3
---------               ---------               ---------
readdir
iolock(shared)
  get_leaf_dents
    iterate entries
      ilock(shared)
      map, lock and read buffer
      iunlock(shared)
      process entries in buffer
      .....
                        readdir
                        iolock(shared)
                          get_leaf_dents
                            iterate entries
                              ilock(shared)
                              map, lock buffer
                              <blocks>
                                                finish ->iterate_shared
                                                file_accessed()
                                                  ->update_time
                                                    start transaction
                                                    ilock(excl)
                                                    <blocks>
      .....
      finishes processing buffer
      get next buffer
        ilock(shared)
        <blocks>

And that's the deadlock.

Now I know what the problem is I can say that process 2 - the
transactional timestamp update - is the reason the readdir
operations are blocking like this. And I know why CXFS never hit
this - it doesn't use the VFS paths, so the VFS calls to update
timestamps don't exist during concurrent readdir operations on the
CXFS metadata server. Hence process 2 doesn't exist and no exclusive
barriers are put in amongst the shared locking....

Patch below should fix the deadlock.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: concurrent readdir hangs on data buffer locks

From: Dave Chinner <dchinner@redhat.com>

There's a three-process deadlock involving shared/exclusive barriers
and inverted lock orders in the directory readdir implementation.
It's a pre-existing problem with lock ordering, exposed by the VFS
parallelisation code.

process 1               process 2               process 3
---------               ---------               ---------
readdir
iolock(shared)
  get_leaf_dents
    iterate entries
      ilock(shared)
      map, lock and read buffer
      iunlock(shared)
      process entries in buffer
      .....
                        readdir
                        iolock(shared)
                          get_leaf_dents
                            iterate entries
                              ilock(shared)
                              map, lock buffer
                              <blocks>
                                                finish ->iterate_shared
                                                file_accessed()
                                                  ->update_time
                                                    start transaction
                                                    ilock(excl)
                                                    <blocks>
      .....
      finishes processing buffer
      get next buffer
        ilock(shared)
        <blocks>

And that's the deadlock.

Fix this by dropping the current buffer lock in process 1 before
trying to map the next buffer. This means we keep the lock order of
ilock -> buffer lock intact and hence will allow process 3 to make
progress and drop its ilock(shared) once it is done.

Reported-by: Xiong Zhou <xzhou@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_dir2_readdir.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 93b3ab0..21501dc 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -273,10 +273,11 @@ xfs_dir2_leaf_readbuf(
 	size_t			bufsize,
 	struct xfs_dir2_leaf_map_info *mip,
 	xfs_dir2_off_t		*curoff,
-	struct xfs_buf		**bpp)
+	struct xfs_buf		**bpp,
+	bool			trim_map)
 {
 	struct xfs_inode	*dp = args->dp;
-	struct xfs_buf		*bp = *bpp;
+	struct xfs_buf		*bp = NULL;
 	struct xfs_bmbt_irec	*map = mip->map;
 	struct blk_plug		plug;
 	int			error = 0;
@@ -286,13 +287,10 @@ xfs_dir2_leaf_readbuf(
 	struct xfs_da_geometry	*geo = args->geo;
 
 	/*
-	 * If we have a buffer, we need to release it and
-	 * take it out of the mapping.
+	 * If the caller just finished processing a buffer, it will tell us
+	 * we need to trim that block out of the mapping now it is done.
 	 */
-
-	if (bp) {
-		xfs_trans_brelse(NULL, bp);
-		bp = NULL;
+	if (trim_map) {
 		mip->map_blocks -= geo->fsbcount;
 		/*
 		 * Loop to get rid of the extents for the
@@ -533,10 +531,17 @@ xfs_dir2_leaf_getdents(
 		 */
 		if (!bp || ptr >= (char *)bp->b_addr + geo->blksize) {
 			int	lock_mode;
+			bool	trim_map = false;
+
+			if (bp) {
+				xfs_trans_brelse(NULL, bp);
+				bp = NULL;
+				trim_map = true;
+			}
 
 			lock_mode = xfs_ilock_data_map_shared(dp);
 			error = xfs_dir2_leaf_readbuf(args, bufsize, map_info,
-						      &curoff, &bp);
+						      &curoff, &bp, trim_map);
 			xfs_iunlock(dp, lock_mode);
 			if (error || !map_info->map_valid)
 				break;
* Re: Linux-next parallel cp workload hang
  2016-05-18 14:17 ` Dave Chinner
@ 2016-05-18 23:02 ` Dave Chinner
  2016-05-19  6:22 ` Xiong Zhou
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2016-05-18 23:02 UTC (permalink / raw)
To: Xiong Zhou; +Cc: linux-next, viro, linux-kernel, linux-fsdevel

On Thu, May 19, 2016 at 12:17:26AM +1000, Dave Chinner wrote:
> Patch below should fix the deadlock.

The test has been running for several hours without failure using
this patch, so I'd say this fixes the problem...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: Linux-next parallel cp workload hang
  2016-05-18 23:02 ` Dave Chinner
@ 2016-05-19  6:22 ` Xiong Zhou
  0 siblings, 0 replies; 9+ messages in thread
From: Xiong Zhou @ 2016-05-19 6:22 UTC (permalink / raw)
To: Dave Chinner; +Cc: Xiong Zhou, linux-next, viro, linux-kernel, linux-fsdevel

On Thu, May 19, 2016 at 09:02:31AM +1000, Dave Chinner wrote:
> On Thu, May 19, 2016 at 12:17:26AM +1000, Dave Chinner wrote:
> > Patch below should fix the deadlock.
>
> The test has been running for several hours without failure using
> this patch, so I'd say this fixes the problem...

Yes, the same for me.

Thanks.

Xiong

> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
[parent not found: <20160518013802.GA4679@ZX.nay.redhat.com>]
* Re: Linux-next parallel cp workload hang
       [not found] <20160518013802.GA4679@ZX.nay.redhat.com>
@ 2016-05-18  2:06 ` Al Viro
  0 siblings, 0 replies; 9+ messages in thread
From: Al Viro @ 2016-05-18 2:06 UTC (permalink / raw)
To: Xiong Zhou; +Cc: linux-next, linux-kernel, linux-fsdevel, Dave Chinner

[dchinner Cc'd]

On Wed, May 18, 2016 at 09:38:02AM +0800, Xiong Zhou wrote:
> Hi,
>
> Parallel cp workload (xfstests generic/273) hangs like below.
> It's reproducible with a small chance, less than 1/100 I think.
>
> Have hit this in linux-next 20160504 0506 0510 trees, testing on
> xfs with loop or block device. Ext4 survived several rounds
> of testing.
>
> Linux next 20160510 tree hangs within 500 rounds testing several
> times. The same tree with vfs parallel lookup patchset reverted
> survived 900 rounds testing. Reverted commits are attached.
>
> Bisecting in this patchset identified this commit:
>
> 3b0a3c1ac1598722fc289da19219d14f2a37b31f is the first bad commit
> commit 3b0a3c1ac1598722fc289da19219d14f2a37b31f
> Author: Al Viro <viro@zeniv.linux.org.uk>
> Date:   Wed Apr 20 23:42:46 2016 -0400
>
>     simple local filesystems: switch to ->iterate_shared()
>
>     no changes needed (XFS isn't simple, but it has the same parallelism
>     in the interesting parts exercised from CXFS).
>
> With this commit reverted on top of Linux next 0510 tree, 5000+ rounds
> of testing passed.

Joy... So much for CXFS testing, probably... All that commit changes
for XFS is allowing xfs_readdir() to be called in parallel for
different struct file. We can always revert the XFS part, of course,
but it would be interesting to find out what happens to be dependent
on external exclusion there.
end of thread, other threads:[~2016-05-19  6:22 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-18  1:46 Linux-next parallel cp workload hang Xiong Zhou
2016-05-18  5:56 ` Dave Chinner
2016-05-18  8:31   ` Xiong Zhou
2016-05-18  9:54     ` Dave Chinner
2016-05-18 11:46       ` Xiong Zhou
2016-05-18 14:17         ` Dave Chinner
2016-05-18 23:02           ` Dave Chinner
2016-05-19  6:22             ` Xiong Zhou
     [not found] <20160518013802.GA4679@ZX.nay.redhat.com>
2016-05-18  2:06 ` Al Viro