* generic/269 hangs on latest upstream kernel
@ 2020-02-11  8:14 Yang Xu
  2020-02-12 10:54 ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Yang Xu @ 2020-02-11  8:14 UTC (permalink / raw)
To: Jan Kara, Theodore Ts'o; +Cc: fstests

Hi

Since xfstests added support for rename2, this case (generic/269) has
reported a filesystem inconsistency problem with ext4 on my system
(4.18.0-32.el8.x86_64). When I test generic/269 (ext4) on a 5.6.0-rc1
kernel, it hangs.
----------------------------------------------
dmesg as below:
[   76.506753] run fstests generic/269 at 2020-02-11 05:53:44
[   76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode. Opts: acl, user_xattr
[  100.912511] device virbr0-nic left promiscuous mode
[  100.912520] virbr0: port 1(virbr0-nic) entered disabled state
[  246.801561] INFO: task dd:17284 blocked for more than 122 seconds.
[  246.801564]       Not tainted 5.6.0-rc1 #41
[  246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.801566] dd              D    0 17284  16931 0x00000080
[  246.801568] Call Trace:
[  246.801584]  ? __schedule+0x251/0x690
[  246.801586]  schedule+0x40/0xb0
[  246.801588]  wb_wait_for_completion+0x52/0x80
[  246.801591]  ? finish_wait+0x80/0x80
[  246.801592]  __writeback_inodes_sb_nr+0xaa/0xd0
[  246.801593]  try_to_writeback_inodes_sb+0x3c/0x50
[  246.801609]  ext4_nonda_switch+0x7b/0x80 [ext4]
[  246.801618]  ext4_da_write_begin+0x6f/0x480 [ext4]
[  246.801621]  generic_perform_write+0xf4/0x1b0
[  246.801628]  ext4_buffered_write_iter+0x8d/0x120 [ext4]
[  246.801634]  ext4_file_write_iter+0x6e/0x700 [ext4]
[  246.801636]  new_sync_write+0x12d/0x1d0
[  246.801638]  vfs_write+0xa5/0x1a0
[  246.801640]  ksys_write+0x59/0xd0
[  246.801643]  do_syscall_64+0x55/0x1b0
[  246.801645]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  246.801646] RIP: 0033:0x7fe9ec947b28
[  246.801650] Code: Bad RIP value.
----------------------------------------------

Does anyone also meet this problem?
Best Regards
Yang Xu

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: generic/269 hangs on latest upstream kernel
  2020-02-11  8:14 generic/269 hangs on latest upstream kernel Yang Xu
@ 2020-02-12 10:54 ` Jan Kara
  2020-02-13  8:49   ` Yang Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2020-02-12 10:54 UTC (permalink / raw)
To: Yang Xu; +Cc: Jan Kara, Theodore Ts'o, fstests

Hello!

On Tue 11-02-20 16:14:35, Yang Xu wrote:
> Since xfstests added support for rename2, this case (generic/269) has
> reported a filesystem inconsistency problem with ext4 on my system
> (4.18.0-32.el8.x86_64).

I don't remember seeing this in my testing... It might be specific to that
RHEL kernel.

> When I test generic/269 (ext4) on a 5.6.0-rc1 kernel, it hangs.
> ----------------------------------------------
> dmesg as below:
> [...]
> [  246.801561] INFO: task dd:17284 blocked for more than 122 seconds.
> [  246.801564]       Not tainted 5.6.0-rc1 #41
> [  246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  246.801566] dd              D    0 17284  16931 0x00000080
> [  246.801568] Call Trace:
> [  246.801584]  ? __schedule+0x251/0x690
> [  246.801586]  schedule+0x40/0xb0
> [  246.801588]  wb_wait_for_completion+0x52/0x80
> [  246.801591]  ? finish_wait+0x80/0x80
> [  246.801592]  __writeback_inodes_sb_nr+0xaa/0xd0
> [  246.801593]  try_to_writeback_inodes_sb+0x3c/0x50

Interesting. Does the hang resolve eventually, or is the machine hung
permanently? If the hang is permanent, can you do:

	echo w >/proc/sysrq-trigger

and send us the stack traces from dmesg? Thanks!

								Honza

> [  246.801609]  ext4_nonda_switch+0x7b/0x80 [ext4]
> [  246.801618]  ext4_da_write_begin+0x6f/0x480 [ext4]
> [  246.801621]  generic_perform_write+0xf4/0x1b0
> [  246.801628]  ext4_buffered_write_iter+0x8d/0x120 [ext4]
> [  246.801634]  ext4_file_write_iter+0x6e/0x700 [ext4]
> [  246.801636]  new_sync_write+0x12d/0x1d0
> [  246.801638]  vfs_write+0xa5/0x1a0
> [  246.801640]  ksys_write+0x59/0xd0
> [  246.801643]  do_syscall_64+0x55/0x1b0
> [  246.801645]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  246.801646] RIP: 0033:0x7fe9ec947b28
> [  246.801650] Code: Bad RIP value.
> ----------------------------------------------
>
> Does anyone also meet this problem?
>
> Best Regards
> Yang Xu

--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply [flat|nested] 19+ messages in thread
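The debugging steps Jan asks for can be collected into a small transcript. This is a sketch for readers following along, run as root on the hung machine; the log path is an illustrative assumption, not from the thread.

```shell
# Sketch: dump stacks of all blocked (D-state) tasks via magic SysRq.
# Held in a variable and printed as a transcript so the intended
# commands are visible; run them as root on the affected machine.
cmds='echo 1 > /proc/sys/kernel/sysrq     # make sure magic SysRq is enabled
echo w > /proc/sysrq-trigger            # "w" dumps stacks of blocked (D-state) tasks
dmesg > /tmp/blocked-tasks.log          # collect the stack traces to send'
printf '%s\n' "$cmds"
```

The `w` command only reports tasks in uninterruptible sleep, which is exactly what a permanent hang produces.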
* Re: generic/269 hangs on latest upstream kernel
  2020-02-12 10:54 ` Jan Kara
@ 2020-02-13  8:49   ` Yang Xu
  2020-02-13 17:08     ` Theodore Y. Ts'o
  2020-02-13 21:10     ` Jan Kara
  0 siblings, 2 replies; 19+ messages in thread
From: Yang Xu @ 2020-02-13  8:49 UTC (permalink / raw)
To: Jan Kara; +Cc: Theodore Ts'o, fstests

on 2020/02/12 18:54, Jan Kara wrote:
> Hello!
>
> On Tue 11-02-20 16:14:35, Yang Xu wrote:
>> Since xfstests added support for rename2, this case (generic/269) has
>> reported a filesystem inconsistency problem with ext4 on my system
>> (4.18.0-32.el8.x86_64).
>
> I don't remember seeing this in my testing... It might be specific to that
> RHEL kernel.

Agree.

>> When I test generic/269 (ext4) on a 5.6.0-rc1 kernel, it hangs.
>> [...]
>> [  246.801592]  __writeback_inodes_sb_nr+0xaa/0xd0
>> [  246.801593]  try_to_writeback_inodes_sb+0x3c/0x50
>
> Interesting. Does the hang resolve eventually, or is the machine hung
> permanently? If the hang is permanent, can you do:
>
> 	echo w >/proc/sysrq-trigger
>
> and send us the stack traces from dmesg? Thanks!

Yes, the hang is permanent; log as below:

[  959.451423] fsstress        D    0 20094  20033 0x00000080
[  959.451424] Call Trace:
[  959.451425]  ? __schedule+0x251/0x690
[  959.451426]  schedule+0x40/0xb0
[  959.451428]  schedule_preempt_disabled+0xa/0x10
[  959.451429]  __mutex_lock.isra.8+0x2b5/0x4a0
[  959.451430]  ? __check_object_size+0x162/0x173
[  959.451431]  lock_rename+0x28/0xb0
[  959.451433]  do_renameat2+0x2a9/0x530
[  959.451434]  __x64_sys_renameat2+0x20/0x30
[  959.451436]  do_syscall_64+0x55/0x1b0
[  959.451436]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[  959.453023] dd              D    0 21645  19793 0x00004080
[  959.453024] Call Trace:
[  959.453026]  ? __schedule+0x251/0x690
[  959.453027]  ? __wake_up_common_lock+0x87/0xc0
[  959.453028]  schedule+0x40/0xb0
[  959.453030]  jbd2_log_wait_commit+0xac/0x120 [jbd2]
[  959.453032]  ? finish_wait+0x80/0x80
[  959.453034]  jbd2_log_do_checkpoint+0x383/0x3f0 [jbd2]
[  959.453036]  __jbd2_log_wait_for_space+0x66/0x190 [jbd2]
[  959.453038]  add_transaction_credits+0x27d/0x290 [jbd2]
[  959.453040]  ? blk_mq_make_request+0x289/0x5d0
[  959.453042]  start_this_handle+0x10a/0x510 [jbd2]
[  959.453043]  ? _cond_resched+0x15/0x30
[  959.453045]  jbd2__journal_start+0xea/0x1f0 [jbd2]
[  959.453051]  ? ext4_writepages+0x518/0xd90 [ext4]
[  959.453057]  __ext4_journal_start_sb+0x6e/0x130 [ext4]
[  959.453063]  ext4_writepages+0x518/0xd90 [ext4]
[  959.453065]  ? do_writepages+0x41/0xd0
[  959.453070]  ? ext4_mark_inode_dirty+0x1f0/0x1f0 [ext4]
[  959.453072]  do_writepages+0x41/0xd0
[  959.453073]  ? iomap_write_begin+0x4c0/0x4c0
[  959.453188]  ? xfs_iunlock+0xf3/0x100 [xfs]
[  959.453189]  __filemap_fdatawrite_range+0xcb/0x100
[  959.453191]  ? __raw_spin_unlock+0x5/0x10
[  959.453198]  ext4_release_file+0x6c/0xa0 [ext4]
[  959.453200]  __fput+0xbe/0x250
[  959.453201]  task_work_run+0x84/0xa0
[  959.453203]  exit_to_usermode_loop+0xc8/0xd0
[  959.453204]  do_syscall_64+0x1a5/0x1b0
[  959.453205]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  959.453206] RIP: 0033:0x7f368a22f1a8

Best Regards
Yang Xu

> 								Honza
>
> [...]

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: generic/269 hangs on latest upstream kernel
  2020-02-13  8:49   ` Yang Xu
@ 2020-02-13 17:08     ` Theodore Y. Ts'o
  2020-02-14  1:14       ` Yang Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Y. Ts'o @ 2020-02-13 17:08 UTC (permalink / raw)
To: Yang Xu; +Cc: Jan Kara, fstests

On Thu, Feb 13, 2020 at 04:49:21PM +0800, Yang Xu wrote:
> on 2020/02/12 18:54, Jan Kara wrote:
> > On Tue 11-02-20 16:14:35, Yang Xu wrote:
> > > Since xfstests added support for rename2, this case (generic/269) has
> > > reported a filesystem inconsistency problem with ext4 on my system
> > > (4.18.0-32.el8.x86_64).
> >
> > I don't remember seeing this in my testing... It might be specific to that
> > RHEL kernel.
>
> Agree.

So were you able to reproduce this on a 5.6.0-rc1 kernel or not?

If you are, can you send the .config that you used, in case it's
configuration specific?

					- Ted

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: generic/269 hangs on latest upstream kernel
  2020-02-13 17:08     ` Theodore Y. Ts'o
@ 2020-02-14  1:14       ` Yang Xu
  2020-02-14 14:05         ` Theodore Y. Ts'o
  0 siblings, 1 reply; 19+ messages in thread
From: Yang Xu @ 2020-02-14  1:14 UTC (permalink / raw)
To: Theodore Y. Ts'o; +Cc: Jan Kara, fstests

on 2020/02/14 1:08, Theodore Y. Ts'o wrote:
> So were you able to reproduce this on a 5.6.0-rc1 kernel or not?

No. I didn't reproduce the filesystem inconsistency on the 5.6.0-rc1
kernel, but the 5.6.0-rc1 kernel does hang on my KVM machine when running
generic/269.

Best Regards
Yang Xu

> If you are, can you send the .config that you used, in case it's
> configuration specific?
>
> 					- Ted

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: generic/269 hangs on latest upstream kernel
  2020-02-14  1:14       ` Yang Xu
@ 2020-02-14 14:05         ` Theodore Y. Ts'o
       [not found]           ` <7adf16bf-d527-1c25-1a24-b4d5e4d757c4@cn.fujitsu.com>
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Y. Ts'o @ 2020-02-14 14:05 UTC (permalink / raw)
To: Yang Xu; +Cc: Jan Kara, fstests

On Fri, Feb 14, 2020 at 09:14:33AM +0800, Yang Xu wrote:
> > So were you able to reproduce this on a 5.6.0-rc1 kernel or not?
>
> No. I didn't reproduce it on the 5.6.0-rc1 kernel, but the 5.6.0-rc1
> kernel hangs on my KVM machine when running generic/269.

I'm not able to reproduce the 5.6.0-rc1 hang using kvm-xfstests[1].
Neither have any other ext4 developers, which is why it might be useful
to see if there's something unique in your .config for 5.6.0-rc1.
Could you send us the .config you were using?

[1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md

Cheers,

						- Ted

^ permalink raw reply [flat|nested] 19+ messages in thread
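For reference, a single test is run under kvm-xfstests roughly as below. The `-c 4k` config and test name are the ones used later in this thread; the `--kernel` flag for booting a freshly built kernel is taken from the kvm-xfstests documentation linked above, and the bzImage path is an illustrative assumption. The commands are printed rather than executed so the sketch does not require the tool or a VM.

```shell
# Illustrative kvm-xfstests invocations (printed, not executed):
# run one test with the 4k config, then the same test against a
# locally built kernel image.
cmds='kvm-xfstests -c 4k generic/269
kvm-xfstests --kernel ~/linux/arch/x86/boot/bzImage -c 4k generic/269'
printf '%s\n' "$cmds"
```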
[parent not found: <7adf16bf-d527-1c25-1a24-b4d5e4d757c4@cn.fujitsu.com>]
* Re: generic/269 hangs on latest upstream kernel
       [not found]           ` <7adf16bf-d527-1c25-1a24-b4d5e4d757c4@cn.fujitsu.com>
@ 2020-02-18 14:35             ` Theodore Y. Ts'o
  2020-02-19 10:57               ` Yang Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Y. Ts'o @ 2020-02-18 14:35 UTC (permalink / raw)
To: Yang Xu; +Cc: Jan Kara, fstests

On Tue, Feb 18, 2020 at 11:44:24AM +0800, Yang Xu wrote:
> on 2020/02/14 22:05, Theodore Y. Ts'o wrote:
> > I'm not able to reproduce the 5.6.0-rc1 hang using kvm-xfstests[1].
> > Neither have any other ext4 developers, which is why it might be useful
> > to see if there's something unique in your .config for 5.6.0-rc1.
> > Could you send us the .config you were using?
>
> Sorry for the late reply.
>
> my 5.6.0-rc1 config as attach.

Unfortunately, I just tried using your config with kvm-xfstests, and
it passed without problems. Did you say this was a reliable
reproducer on your system?

% kvm-xfstests -c 4k generic/269
              KERNEL: kernel 5.6.0-rc2-xfstests #1492 SMP Mon Feb 17 23:22:40 EST 2020 x86_64
                CPUS: 2
                 MEM: 1966.03

ext4/4k: 1 tests, 43 seconds
  generic/269  Pass     42s
Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 42s

FSTESTVER: blktests 9e02419 (Thu, 19 Dec 2019 14:45:55 -0800)
FSTESTVER: e2fsprogs v1.45.4-15-g4b4f7b35 (Wed, 9 Oct 2019 20:25:01 -0400)
FSTESTVER: fio fio-3.17 (Mon, 16 Dec 2019 15:48:43 -0700)
FSTESTVER: fsverity v1.0 (Wed, 6 Nov 2019 10:35:02 -0800)
FSTESTVER: ima-evm-utils v1.2 (Fri, 26 Jul 2019 07:42:17 -0400)
FSTESTVER: nvme-cli v1.9-159-g18119bc (Thu, 26 Dec 2019 11:04:01 -0700)
FSTESTVER: quota 9a001cc (Tue, 5 Nov 2019 16:12:59 +0100)
FSTESTVER: util-linux v2.35-19-g95afec771 (Fri, 24 Jan 2020 12:25:35 -0500)
FSTESTVER: xfsprogs v5.4.0 (Fri, 20 Dec 2019 16:47:12 -0500)
FSTESTVER: xfstests linux-v3.8-2652-g002e349c (Fri, 24 Jan 2020 00:49:40 -0500)
FSTESTVER: xfstests-bld 6f10355 (Fri, 24 Jan 2020 12:36:30 -0500)
FSTESTCFG: 4k
FSTESTSET: generic/269
FSTESTOPT: aex

This was run on a Debian testing system, with kvm version:

    QEMU emulator version 4.2.0 (Debian 1:4.2-3)

What about hardware details of your system? How many CPUs, memory,
etc.? And what sort of storage device are you using for kvm? (I'm
using virtio-scsi backed by LVM volumes for the scratch and test
partitions.)

					- Ted

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: generic/269 hangs on latest upstream kernel
  2020-02-18 14:35             ` Theodore Y. Ts'o
@ 2020-02-19 10:57               ` Yang Xu
  0 siblings, 0 replies; 19+ messages in thread
From: Yang Xu @ 2020-02-19 10:57 UTC (permalink / raw)
To: Theodore Y. Ts'o; +Cc: Jan Kara, fstests

on 2020/02/18 22:35, Theodore Y. Ts'o wrote:
> Unfortunately, I just tried using your config with kvm-xfstests, and
> it passed without problems. Did you say this was a reliable
> reproducer on your system?

Yes, 100%. Tomorrow I will try it on other KVM machines.

> This was run on a Debian testing system, with kvm version:
>
>     QEMU emulator version 4.2.0 (Debian 1:4.2-3)

I don't test on Debian. My QEMU version is as below:

# qemu-img --version
qemu-img version 2.12.0 (qemu-kvm-2.12.0-32.el8+1900+70997154)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

> What about hardware details of your system? How many CPUs, memory,
> etc.? And what sort of storage device are you using for kvm? (I'm
> using virtio-scsi backed by LVM volumes for the scratch and test
> partitions.)

4 CPUs and 4G memory (4 NUMA nodes), 4G swap.

The storage devices are as below (device bus: ide, storage format: qcow2):

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/test.qcow2'/>
  <target dev='hdc' bus='ide'/>
  <address type='drive' controller='0' bus='1' target='0' unit='0'/>
</disk>
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/test1.qcow2'/>
  <target dev='hdd' bus='ide'/>
  <address type='drive' controller='0' bus='1' target='0' unit='1'/>
</disk>

Best Regards
Yang Xu

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: generic/269 hangs on latest upstream kernel
  2020-02-13  8:49   ` Yang Xu
  2020-02-13 17:08     ` Theodore Y. Ts'o
@ 2020-02-13 21:10     ` Jan Kara
       [not found]       ` <062ac52c-3a16-22ef-6396-53334ed94783@cn.fujitsu.com>
  1 sibling, 1 reply; 19+ messages in thread
From: Jan Kara @ 2020-02-13 21:10 UTC (permalink / raw)
To: Yang Xu; +Cc: Jan Kara, Theodore Ts'o, fstests

On Thu 13-02-20 16:49:21, Yang Xu wrote:
> > Interesting. Does the hang resolve eventually, or is the machine hung
> > permanently? If the hang is permanent, can you do:
> >
> > 	echo w >/proc/sysrq-trigger
> >
> > and send us the stack traces from dmesg? Thanks!
>
> Yes, the hang is permanent; log as below:
>
> [  959.451423] fsstress        D    0 20094  20033 0x00000080
> [  959.451424] Call Trace:
> [  959.451425]  ? __schedule+0x251/0x690
> [  959.451426]  schedule+0x40/0xb0
> [  959.451428]  schedule_preempt_disabled+0xa/0x10
> [  959.451429]  __mutex_lock.isra.8+0x2b5/0x4a0
> [  959.451430]  ? __check_object_size+0x162/0x173
> [  959.451431]  lock_rename+0x28/0xb0
> [  959.451433]  do_renameat2+0x2a9/0x530
> [  959.451434]  __x64_sys_renameat2+0x20/0x30
> [  959.451436]  do_syscall_64+0x55/0x1b0
> [  959.451436]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [  959.453023] dd              D    0 21645  19793 0x00004080
> [  959.453024] Call Trace:
> [  959.453026]  ? __schedule+0x251/0x690
> [  959.453028]  schedule+0x40/0xb0
> [  959.453030]  jbd2_log_wait_commit+0xac/0x120 [jbd2]
> [  959.453034]  jbd2_log_do_checkpoint+0x383/0x3f0 [jbd2]
> [  959.453036]  __jbd2_log_wait_for_space+0x66/0x190 [jbd2]
> [  959.453038]  add_transaction_credits+0x27d/0x290 [jbd2]
> [  959.453042]  start_this_handle+0x10a/0x510 [jbd2]
> [  959.453045]  jbd2__journal_start+0xea/0x1f0 [jbd2]
> [  959.453057]  __ext4_journal_start_sb+0x6e/0x130 [ext4]
> [  959.453063]  ext4_writepages+0x518/0xd90 [ext4]
> [  959.453072]  do_writepages+0x41/0xd0
> [  959.453189]  __filemap_fdatawrite_range+0xcb/0x100
> [  959.453198]  ext4_release_file+0x6c/0xa0 [ext4]
> [  959.453200]  __fput+0xbe/0x250
> [  959.453201]  task_work_run+0x84/0xa0
> [  959.453203]  exit_to_usermode_loop+0xc8/0xd0
> [  959.453204]  do_syscall_64+0x1a5/0x1b0
> [  959.453205]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  959.453206] RIP: 0033:0x7f368a22f1a8

Is that all that was in dmesg? I'd expect to see other blocked processes
as well - in particular a jbd2 thread that should be doing the transaction
commit, and also some process holding the i_rwsem that fsstress is blocked
on...

								Honza

--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply [flat|nested] 19+ messages in thread
[parent not found: <062ac52c-3a16-22ef-6396-53334ed94783@cn.fujitsu.com>]
* Re: generic/269 hangs on latest upstream kernel
       [not found]       ` <062ac52c-3a16-22ef-6396-53334ed94783@cn.fujitsu.com>
@ 2020-02-14 15:00         ` Jan Kara
  2020-02-18  3:25           ` Yang Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2020-02-14 15:00 UTC (permalink / raw)
To: Yang Xu; +Cc: Jan Kara, Theodore Ts'o, fstests

On Fri 14-02-20 18:24:50, Yang Xu wrote:
> on 2020/02/14 5:10, Jan Kara wrote:
> > Interesting. Does the hang resolve eventually, or is the machine hung
> > permanently? If the hang is permanent, can you do:
> >
> > 	echo w >/proc/sysrq-trigger
> >
> > and send us the stack traces from dmesg? Thanks!
>
> full dmesg as attach

Thanks! So the culprit seems to be:

> [  388.087799] kworker/u12:0   D    0    32      2 0x80004000
> [  388.087803] Workqueue: writeback wb_workfn (flush-8:32)
> [  388.087805] Call Trace:
> [  388.087810]  ? __schedule+0x251/0x690
> [  388.087811]  ? __switch_to_asm+0x34/0x70
> [  388.087812]  ? __switch_to_asm+0x34/0x70
> [  388.087814]  schedule+0x40/0xb0
> [  388.087816]  schedule_timeout+0x20d/0x310
> [  388.087818]  io_schedule_timeout+0x19/0x40
> [  388.087819]  wait_for_completion_io+0x113/0x180
> [  388.087822]  ? wake_up_q+0xa0/0xa0
> [  388.087824]  submit_bio_wait+0x5b/0x80
> [  388.087827]  blkdev_issue_flush+0x81/0xb0
> [  388.087834]  jbd2_cleanup_journal_tail+0x80/0xa0 [jbd2]
> [  388.087837]  jbd2_log_do_checkpoint+0xf4/0x3f0 [jbd2]
> [  388.087840]  __jbd2_log_wait_for_space+0x66/0x190 [jbd2]
> [  388.087843]  ? finish_wait+0x80/0x80
> [  388.087845]  add_transaction_credits+0x27d/0x290 [jbd2]
> [  388.087847]  ? blk_mq_make_request+0x289/0x5d0
> [  388.087849]  start_this_handle+0x10a/0x510 [jbd2]
> [  388.087851]  ? _cond_resched+0x15/0x30
> [  388.087853]  jbd2__journal_start+0xea/0x1f0 [jbd2]
> [  388.087869]  ? ext4_writepages+0x518/0xd90 [ext4]
> [  388.087875]  __ext4_journal_start_sb+0x6e/0x130 [ext4]
> [  388.087883]  ext4_writepages+0x518/0xd90 [ext4]
> [  388.087886]  ? do_writepages+0x41/0xd0
> [  388.087893]  ? ext4_mark_inode_dirty+0x1f0/0x1f0 [ext4]
> [  388.087894]  do_writepages+0x41/0xd0
> [  388.087896]  ? snprintf+0x49/0x60
> [  388.087898]  __writeback_single_inode+0x3d/0x340
> [  388.087899]  writeback_sb_inodes+0x1e5/0x480
> [  388.087901]  wb_writeback+0xfb/0x2f0
> [  388.087902]  wb_workfn+0xf0/0x430
> [  388.087903]  ? __switch_to_asm+0x34/0x70
> [  388.087905]  ? finish_task_switch+0x75/0x250
> [  388.087907]  process_one_work+0x1a7/0x370
> [  388.087909]  worker_thread+0x30/0x380
> [  388.087911]  ? process_one_work+0x370/0x370
> [  388.087912]  kthread+0x10c/0x130
> [  388.087913]  ? kthread_park+0x80/0x80
> [  388.087914]  ret_from_fork+0x35/0x40

This process is actually waiting for IO to complete while holding the
checkpoint_mutex, which holds up everybody else. The question is why the
IO doesn't complete - that's definitely outside of the filesystem. Maybe a
bug in the block layer, a storage driver, or something like that... What
does 'cat /sys/block/<device-with-xfstests>/inflight' show?

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply [flat|nested] 19+ messages in thread
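The inflight file Jan points at contains two whitespace-separated counters, reads then writes. A minimal way to read them is sketched below; a sample string stands in for the real file so the snippet runs anywhere, and `sdc` is the device name from this thread.

```shell
# Parse the contents of /sys/block/<dev>/inflight ("<reads> <writes>").
# On the hung machine you would use:  sample=$(cat /sys/block/sdc/inflight)
sample='0 0'
set -- $sample            # unquoted on purpose: split on whitespace
reads=$1
writes=$2
echo "reads in flight: $reads, writes in flight: $writes"
```

Both counters being zero, as Yang reports below, means the block layer believes no request is outstanding on the device.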
* Re: generic/269 hangs on latest upstream kernel
  2020-02-14 15:00         ` Jan Kara
@ 2020-02-18  3:25           ` Yang Xu
  2020-02-18  8:24             ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Yang Xu @ 2020-02-18  3:25 UTC (permalink / raw)
To: Jan Kara; +Cc: Theodore Ts'o, fstests

on 2020/02/14 23:00, Jan Kara wrote:
> [...]
> This process is actually waiting for IO to complete while holding the
> checkpoint_mutex, which holds up everybody else. The question is why the
> IO doesn't complete - that's definitely outside of the filesystem. Maybe a
> bug in the block layer, a storage driver, or something like that... What
> does 'cat /sys/block/<device-with-xfstests>/inflight' show?

Sorry for the late reply.

This value is 0; it indicates there is no inflight data (but could that
itself be the result of an accounting bug or a storage driver bug?).
Also, it doesn't hang on my physical machine, only on the VM. So what
should I do as a next step (change the storage disk format)?

Best Regards
Yang Xu

> 								Honza

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: generic/269 hangs on lastest upstream kernel 2020-02-18 3:25 ` Yang Xu @ 2020-02-18 8:24 ` Jan Kara 2020-02-18 9:46 ` Yang Xu 0 siblings, 1 reply; 19+ messages in thread From: Jan Kara @ 2020-02-18 8:24 UTC (permalink / raw) To: Yang Xu; +Cc: Jan Kara, Theodore Ts'o, fstests On Tue 18-02-20 11:25:37, Yang Xu wrote: > on 2020/02/14 23:00, Jan Kara wrote: > > On Fri 14-02-20 18:24:50, Yang Xu wrote: > > > on 2020/02/14 5:10, Jan Kara wrote: > > > > On Thu 13-02-20 16:49:21, Yang Xu wrote: > > > > > > > When I test generic/269(ext4) on 5.6.0-rc1 kernel, it hangs. > > > > > > > ---------------------------------------------- > > > > > > > dmesg as below: > > > > > > > 76.506753] run fstests generic/269 at 2020-02-11 05:53:44 > > > > > > > [ 76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode. > > > > > > > Opts: acl, user_xattr > > > > > > > [ 100.912511] device virbr0-nic left promiscuous mode > > > > > > > [ 100.912520] virbr0: port 1(virbr0-nic) entered disabled state > > > > > > > [ 246.801561] INFO: task dd:17284 blocked for more than 122 seconds. > > > > > > > [ 246.801564] Not tainted 5.6.0-rc1 #41 > > > > > > > [ 246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > > > > > > > this mes sage. > > > > > > > [ 246.801566] dd D 0 17284 16931 0x00000080 > > > > > > > [ 246.801568] Call Trace: > > > > > > > [ 246.801584] ? __schedule+0x251/0x690 > > > > > > > [ 246.801586] schedule+0x40/0xb0 > > > > > > > [ 246.801588] wb_wait_for_completion+0x52/0x80 > > > > > > > [ 246.801591] ? finish_wait+0x80/0x80 > > > > > > > [ 246.801592] __writeback_inodes_sb_nr+0xaa/0xd0 > > > > > > > [ 246.801593] try_to_writeback_inodes_sb+0x3c/0x50 > > > > > > > > > > > > Interesting. Does the hang resolve eventually or the machine is hung > > > > > > permanently? If the hang is permanent, can you do: > > > > > > > > > > > > echo w >/proc/sysrq-trigger > > > > > > > > > > > > and send us the stacktraces from dmesg? Thanks! > > > > > Yes. 
the hang is permanent, log as below: > > > full dmesg as attach > > ... > > > > Thanks! So the culprit seems to be: > > > > > [ 388.087799] kworker/u12:0 D 0 32 2 0x80004000 > > > [ 388.087803] Workqueue: writeback wb_workfn (flush-8:32) > > > [ 388.087805] Call Trace: > > > [ 388.087810] ? __schedule+0x251/0x690 > > > [ 388.087811] ? __switch_to_asm+0x34/0x70 > > > [ 388.087812] ? __switch_to_asm+0x34/0x70 > > > [ 388.087814] schedule+0x40/0xb0 > > > [ 388.087816] schedule_timeout+0x20d/0x310 > > > [ 388.087818] io_schedule_timeout+0x19/0x40 > > > [ 388.087819] wait_for_completion_io+0x113/0x180 > > > [ 388.087822] ? wake_up_q+0xa0/0xa0 > > > [ 388.087824] submit_bio_wait+0x5b/0x80 > > > [ 388.087827] blkdev_issue_flush+0x81/0xb0 > > > [ 388.087834] jbd2_cleanup_journal_tail+0x80/0xa0 [jbd2] > > > [ 388.087837] jbd2_log_do_checkpoint+0xf4/0x3f0 [jbd2] > > > [ 388.087840] __jbd2_log_wait_for_space+0x66/0x190 [jbd2] > > > [ 388.087843] ? finish_wait+0x80/0x80 > > > [ 388.087845] add_transaction_credits+0x27d/0x290 [jbd2] > > > [ 388.087847] ? blk_mq_make_request+0x289/0x5d0 > > > [ 388.087849] start_this_handle+0x10a/0x510 [jbd2] > > > [ 388.087851] ? _cond_resched+0x15/0x30 > > > [ 388.087853] jbd2__journal_start+0xea/0x1f0 [jbd2] > > > [ 388.087869] ? ext4_writepages+0x518/0xd90 [ext4] > > > [ 388.087875] __ext4_journal_start_sb+0x6e/0x130 [ext4] > > > [ 388.087883] ext4_writepages+0x518/0xd90 [ext4] > > > [ 388.087886] ? do_writepages+0x41/0xd0 > > > [ 388.087893] ? ext4_mark_inode_dirty+0x1f0/0x1f0 [ext4] > > > [ 388.087894] do_writepages+0x41/0xd0 > > > [ 388.087896] ? snprintf+0x49/0x60 > > > [ 388.087898] __writeback_single_inode+0x3d/0x340 > > > [ 388.087899] writeback_sb_inodes+0x1e5/0x480 > > > [ 388.087901] wb_writeback+0xfb/0x2f0 > > > [ 388.087902] wb_workfn+0xf0/0x430 > > > [ 388.087903] ? __switch_to_asm+0x34/0x70 > > > [ 388.087905] ? 
finish_task_switch+0x75/0x250 > > > [ 388.087907] process_one_work+0x1a7/0x370 > > > [ 388.087909] worker_thread+0x30/0x380 > > > [ 388.087911] ? process_one_work+0x370/0x370 > > > [ 388.087912] kthread+0x10c/0x130 > > > [ 388.087913] ? kthread_park+0x80/0x80 > > > [ 388.087914] ret_from_fork+0x35/0x40 > > This process is actually waiting for IO to complete while holding > > checkpoint_mutex which holds up everybody else. The question is why the IO > > doesn't complete - that's definitely outside of filesystem. Maybe a bug in > > the block layer, storage driver, or something like that... What does > > 'cat /sys/block/<device-with-xfstests>/inflight' show? > Sorry for the late reply. > This value is 0, it represent it doesn't have inflight data(but it may be > counted bug or storage driver bug, is it right?). > Also, it doesn't hang on my physical machine, but only hang on vm. Hum, curious. Just to make sure, did you check sdc (because that appears to be the stuck device)? > So what should I do in next step(change storge disk format)? I'd try a couple of things:
1) If you mount ext4 with the barrier=0 mount option, does the problem go away?
2) Can you run the test and at the same time run 'blktrace -d /dev/sdc' to gather traces? Once the machine is stuck, abort blktrace, process the resulting files with 'blkparse -i sdc' and send the compressed blkparse output here. We should be able to see what was happening with the stuck request in the trace and maybe that will tell us something. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: generic/269 hangs on lastest upstream kernel 2020-02-18 8:24 ` Jan Kara @ 2020-02-18 9:46 ` Yang Xu 2020-02-18 11:03 ` Jan Kara 0 siblings, 1 reply; 19+ messages in thread From: Yang Xu @ 2020-02-18 9:46 UTC (permalink / raw) To: Jan Kara; +Cc: Theodore Ts'o, fstests [-- Attachment #1: Type: text/plain, Size: 5398 bytes --] on 2020/02/18 16:24, Jan Kara wrote: > On Tue 18-02-20 11:25:37, Yang Xu wrote: >> on 2020/02/14 23:00, Jan Kara wrote: >>> On Fri 14-02-20 18:24:50, Yang Xu wrote: >>>> on 2020/02/14 5:10, Jan Kara wrote: >>>>> On Thu 13-02-20 16:49:21, Yang Xu wrote: >>>>>>>> When I test generic/269(ext4) on 5.6.0-rc1 kernel, it hangs. >>>>>>>> ---------------------------------------------- >>>>>>>> dmesg as below: >>>>>>>> 76.506753] run fstests generic/269 at 2020-02-11 05:53:44 >>>>>>>> [ 76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode. >>>>>>>> Opts: acl, user_xattr >>>>>>>> [ 100.912511] device virbr0-nic left promiscuous mode >>>>>>>> [ 100.912520] virbr0: port 1(virbr0-nic) entered disabled state >>>>>>>> [ 246.801561] INFO: task dd:17284 blocked for more than 122 seconds. >>>>>>>> [ 246.801564] Not tainted 5.6.0-rc1 #41 >>>>>>>> [ 246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables >>>>>>>> this mes sage. >>>>>>>> [ 246.801566] dd D 0 17284 16931 0x00000080 >>>>>>>> [ 246.801568] Call Trace: >>>>>>>> [ 246.801584] ? __schedule+0x251/0x690 >>>>>>>> [ 246.801586] schedule+0x40/0xb0 >>>>>>>> [ 246.801588] wb_wait_for_completion+0x52/0x80 >>>>>>>> [ 246.801591] ? finish_wait+0x80/0x80 >>>>>>>> [ 246.801592] __writeback_inodes_sb_nr+0xaa/0xd0 >>>>>>>> [ 246.801593] try_to_writeback_inodes_sb+0x3c/0x50 >>>>>>> >>>>>>> Interesting. Does the hang resolve eventually or the machine is hung >>>>>>> permanently? If the hang is permanent, can you do: >>>>>>> >>>>>>> echo w >/proc/sysrq-trigger >>>>>>> >>>>>>> and send us the stacktraces from dmesg? Thanks! >>>>>> Yes. 
the hang is permanent, log as below: >>>> full dmesg as attach >>> ... >>> >>> Thanks! So the culprit seems to be: >>> >>>> [ 388.087799] kworker/u12:0 D 0 32 2 0x80004000 >>>> [ 388.087803] Workqueue: writeback wb_workfn (flush-8:32) >>>> [ 388.087805] Call Trace: >>>> [ 388.087810] ? __schedule+0x251/0x690 >>>> [ 388.087811] ? __switch_to_asm+0x34/0x70 >>>> [ 388.087812] ? __switch_to_asm+0x34/0x70 >>>> [ 388.087814] schedule+0x40/0xb0 >>>> [ 388.087816] schedule_timeout+0x20d/0x310 >>>> [ 388.087818] io_schedule_timeout+0x19/0x40 >>>> [ 388.087819] wait_for_completion_io+0x113/0x180 >>>> [ 388.087822] ? wake_up_q+0xa0/0xa0 >>>> [ 388.087824] submit_bio_wait+0x5b/0x80 >>>> [ 388.087827] blkdev_issue_flush+0x81/0xb0 >>>> [ 388.087834] jbd2_cleanup_journal_tail+0x80/0xa0 [jbd2] >>>> [ 388.087837] jbd2_log_do_checkpoint+0xf4/0x3f0 [jbd2] >>>> [ 388.087840] __jbd2_log_wait_for_space+0x66/0x190 [jbd2] >>>> [ 388.087843] ? finish_wait+0x80/0x80 >>>> [ 388.087845] add_transaction_credits+0x27d/0x290 [jbd2] >>>> [ 388.087847] ? blk_mq_make_request+0x289/0x5d0 >>>> [ 388.087849] start_this_handle+0x10a/0x510 [jbd2] >>>> [ 388.087851] ? _cond_resched+0x15/0x30 >>>> [ 388.087853] jbd2__journal_start+0xea/0x1f0 [jbd2] >>>> [ 388.087869] ? ext4_writepages+0x518/0xd90 [ext4] >>>> [ 388.087875] __ext4_journal_start_sb+0x6e/0x130 [ext4] >>>> [ 388.087883] ext4_writepages+0x518/0xd90 [ext4] >>>> [ 388.087886] ? do_writepages+0x41/0xd0 >>>> [ 388.087893] ? ext4_mark_inode_dirty+0x1f0/0x1f0 [ext4] >>>> [ 388.087894] do_writepages+0x41/0xd0 >>>> [ 388.087896] ? snprintf+0x49/0x60 >>>> [ 388.087898] __writeback_single_inode+0x3d/0x340 >>>> [ 388.087899] writeback_sb_inodes+0x1e5/0x480 >>>> [ 388.087901] wb_writeback+0xfb/0x2f0 >>>> [ 388.087902] wb_workfn+0xf0/0x430 >>>> [ 388.087903] ? __switch_to_asm+0x34/0x70 >>>> [ 388.087905] ? finish_task_switch+0x75/0x250 >>>> [ 388.087907] process_one_work+0x1a7/0x370 >>>> [ 388.087909] worker_thread+0x30/0x380 >>>> [ 388.087911] ? 
process_one_work+0x370/0x370 >>>> [ 388.087912] kthread+0x10c/0x130 >>>> [ 388.087913] ? kthread_park+0x80/0x80 >>>> [ 388.087914] ret_from_fork+0x35/0x40 >>> >>> This process is actually waiting for IO to complete while holding >>> checkpoint_mutex which holds up everybody else. The question is why the IO >>> doesn't complete - that's definitely outside of filesystem. Maybe a bug in >>> the block layer, storage driver, or something like that... What does >>> 'cat /sys/block/<device-with-xfstests>/inflight' show? >> Sorry for the late reply. >> This value is 0, it represent it doesn't have inflight data(but it may be >> counted bug or storage driver bug, is it right?). >> Also, it doesn't hang on my physical machine, but only hang on vm. > > Hum, curious. Just do make sure, did you check sdc (because that appears to > be the stuck device)? Yes, I check sdc, its value is 0. # cat /sys/block/sdc/inflight 0 0 > >> So what should I do in next step(change storge disk format)? > > I'd try couple of things: > > 1) If you mount ext4 with barrier=0 mount option, does the problem go away? Yes. Use barrier=0, this case doesn't hang, > > 2) Can you run the test and at the same time run 'blktrace -d /dev/sdc' to > gather traces? Once the machine is stuck, abort blktrace, process the > resulting files with 'blkparse -i sdc' and send here compressed blkparse > output. We should be able to see what was happening with the stuck request > in the trace and maybe that will tell us something. The log size is too big(58M) and our emali limit is 5M. 
> > Honza
>

[-- Attachment #2: sdc_blktrace.txt --]
[-- Type: text/plain, Size: 4056 bytes --]

CPU0 (sdc):
 Reads Queued:       1,108,   27,116KiB  Writes Queued:      61,036,  285,888KiB
 Read Dispatches:    1,700,   25,392KiB  Write Dispatches:   14,740,  285,436KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:    1,765,   26,428KiB  Writes Completed:   15,209,  285,448KiB
 Read Merges:            1,       36KiB  Write Merges:       46,135,  185,436KiB
 Read depth:             1               Write depth:             1
 PC Reads Queued:        0,        0KiB  PC Writes Queued:        0,        0KiB
 PC Read Disp.:         16,        2KiB  PC Write Disp.:          0,        0KiB
 PC Reads Req.:          0               PC Writes Req.:          0
 PC Reads Compl.:       16               PC Writes Compl.:        0
 IO unplugs:         2,435               Timer unplugs:       7,863

CPU1 (sdc):
 Reads Queued:       1,005,   24,308KiB  Writes Queued:      61,020,  490,824KiB
 Read Dispatches:    1,767,   27,596KiB  Write Dispatches:   16,107,  439,560KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:    1,672,   24,104KiB  Writes Completed:   16,833,  489,500KiB
 Read Merges:            0,        0KiB  Write Merges:       44,537,  180,228KiB
 Read depth:             1               Write depth:             1
 PC Reads Queued:        0,        0KiB  PC Writes Queued:        0,        0KiB
 PC Read Disp.:         28,        3KiB  PC Write Disp.:          0,        0KiB
 PC Reads Req.:          0               PC Writes Req.:          0
 PC Reads Compl.:       28               PC Writes Compl.:        0
 IO unplugs:         2,268               Timer unplugs:       9,094

CPU2 (sdc):
 Reads Queued:       1,047,   25,337KiB  Writes Queued:      99,999,   21,409MiB
 Read Dispatches:    1,703,   24,013KiB  Write Dispatches:   14,016,   21,436MiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:    1,760,   25,793KiB  Writes Completed:   14,664,   21,410MiB
 Read Merges:            1,        4KiB  Write Merges:       85,721,  343,812KiB
 Read depth:             1               Write depth:             1
 PC Reads Queued:        0,        0KiB  PC Writes Queued:        0,        0KiB
 PC Read Disp.:          1,        0KiB  PC Write Disp.:          0,        0KiB
 PC Reads Req.:          0               PC Writes Req.:          0
 PC Reads Compl.:        1               PC Writes Compl.:        0
 IO unplugs:         2,617               Timer unplugs:       7,643

CPU3 (sdc):
 Reads Queued:       1,090,   29,589KiB  Writes Queued:      57,152,  787,564KiB
 Read Dispatches:    1,728,   29,349KiB  Write Dispatches:   14,100,  812,204KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:    1,700,   30,025KiB  Writes Completed:   14,251,  788,148KiB
 Read Merges:            1,      372KiB  Write Merges:       43,240,  174,028KiB
 Read depth:             1               Write depth:             1
 IO unplugs:         2,461               Timer unplugs:       7,297

Total (sdc):
 Reads Queued:       4,250,  106,350KiB  Writes Queued:     279,207,   22,973MiB
 Read Dispatches:    6,898,  106,350KiB  Write Dispatches:   58,963,   22,973MiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:    6,897,  106,350KiB  Writes Completed:   60,957,   22,973MiB
 Read Merges:            3,      412KiB  Write Merges:      219,633,  883,504KiB
 PC Reads Queued:        0,        0KiB  PC Writes Queued:        0,        0KiB
 PC Read Disp.:         45,        7KiB  PC Write Disp.:          0,        0KiB
 PC Reads Req.:          0               PC Writes Req.:          0
 PC Reads Compl.:       45               PC Writes Compl.:        0
 IO unplugs:         9,781               Timer unplugs:      31,897

Throughput (R/W): 414KiB/s / 89,548KiB/s
Events (sdc): 846,121 entries
Skips: 0 forward (0 - 0.0%)
Input file sdc.blktrace.0 added
Input file sdc.blktrace.1 added
Input file sdc.blktrace.2 added
Input file sdc.blktrace.3 added

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: generic/269 hangs on lastest upstream kernel 2020-02-18 9:46 ` Yang Xu @ 2020-02-18 11:03 ` Jan Kara 2020-02-19 10:09 ` Yang Xu 0 siblings, 1 reply; 19+ messages in thread From: Jan Kara @ 2020-02-18 11:03 UTC (permalink / raw) To: Yang Xu; +Cc: Jan Kara, Theodore Ts'o, fstests On Tue 18-02-20 17:46:54, Yang Xu wrote: > > on 2020/02/18 16:24, Jan Kara wrote: > > On Tue 18-02-20 11:25:37, Yang Xu wrote: > > > on 2020/02/14 23:00, Jan Kara wrote: > > > > On Fri 14-02-20 18:24:50, Yang Xu wrote: > > > > > on 2020/02/14 5:10, Jan Kara wrote: > > > > > > On Thu 13-02-20 16:49:21, Yang Xu wrote: > > > > > > > > > When I test generic/269(ext4) on 5.6.0-rc1 kernel, it hangs. > > > > > > > > > ---------------------------------------------- > > > > > > > > > dmesg as below: > > > > > > > > > 76.506753] run fstests generic/269 at 2020-02-11 05:53:44 > > > > > > > > > [ 76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode. > > > > > > > > > Opts: acl, user_xattr > > > > > > > > > [ 100.912511] device virbr0-nic left promiscuous mode > > > > > > > > > [ 100.912520] virbr0: port 1(virbr0-nic) entered disabled state > > > > > > > > > [ 246.801561] INFO: task dd:17284 blocked for more than 122 seconds. > > > > > > > > > [ 246.801564] Not tainted 5.6.0-rc1 #41 > > > > > > > > > [ 246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > > > > > > > > > this mes sage. > > > > > > > > > [ 246.801566] dd D 0 17284 16931 0x00000080 > > > > > > > > > [ 246.801568] Call Trace: > > > > > > > > > [ 246.801584] ? __schedule+0x251/0x690 > > > > > > > > > [ 246.801586] schedule+0x40/0xb0 > > > > > > > > > [ 246.801588] wb_wait_for_completion+0x52/0x80 > > > > > > > > > [ 246.801591] ? finish_wait+0x80/0x80 > > > > > > > > > [ 246.801592] __writeback_inodes_sb_nr+0xaa/0xd0 > > > > > > > > > [ 246.801593] try_to_writeback_inodes_sb+0x3c/0x50 > > > > > > > > > > > > > > > > Interesting. 
Does the hang resolve eventually or the machine is hung > > > > > > > > permanently? If the hang is permanent, can you do: > > > > > > > > > > > > > > > > echo w >/proc/sysrq-trigger > > > > > > > > > > > > > > > > and send us the stacktraces from dmesg? Thanks! > > > > > > > Yes. the hang is permanent, log as below: > > > > > full dmesg as attach > > > > ... > > > > > > > > Thanks! So the culprit seems to be: > > > > > > > > > [ 388.087799] kworker/u12:0 D 0 32 2 0x80004000 > > > > > [ 388.087803] Workqueue: writeback wb_workfn (flush-8:32) > > > > > [ 388.087805] Call Trace: > > > > > [ 388.087810] ? __schedule+0x251/0x690 > > > > > [ 388.087811] ? __switch_to_asm+0x34/0x70 > > > > > [ 388.087812] ? __switch_to_asm+0x34/0x70 > > > > > [ 388.087814] schedule+0x40/0xb0 > > > > > [ 388.087816] schedule_timeout+0x20d/0x310 > > > > > [ 388.087818] io_schedule_timeout+0x19/0x40 > > > > > [ 388.087819] wait_for_completion_io+0x113/0x180 > > > > > [ 388.087822] ? wake_up_q+0xa0/0xa0 > > > > > [ 388.087824] submit_bio_wait+0x5b/0x80 > > > > > [ 388.087827] blkdev_issue_flush+0x81/0xb0 > > > > > [ 388.087834] jbd2_cleanup_journal_tail+0x80/0xa0 [jbd2] > > > > > [ 388.087837] jbd2_log_do_checkpoint+0xf4/0x3f0 [jbd2] > > > > > [ 388.087840] __jbd2_log_wait_for_space+0x66/0x190 [jbd2] > > > > > [ 388.087843] ? finish_wait+0x80/0x80 > > > > > [ 388.087845] add_transaction_credits+0x27d/0x290 [jbd2] > > > > > [ 388.087847] ? blk_mq_make_request+0x289/0x5d0 > > > > > [ 388.087849] start_this_handle+0x10a/0x510 [jbd2] > > > > > [ 388.087851] ? _cond_resched+0x15/0x30 > > > > > [ 388.087853] jbd2__journal_start+0xea/0x1f0 [jbd2] > > > > > [ 388.087869] ? ext4_writepages+0x518/0xd90 [ext4] > > > > > [ 388.087875] __ext4_journal_start_sb+0x6e/0x130 [ext4] > > > > > [ 388.087883] ext4_writepages+0x518/0xd90 [ext4] > > > > > [ 388.087886] ? do_writepages+0x41/0xd0 > > > > > [ 388.087893] ? 
ext4_mark_inode_dirty+0x1f0/0x1f0 [ext4] > > > > > [ 388.087894] do_writepages+0x41/0xd0 > > > > > [ 388.087896] ? snprintf+0x49/0x60 > > > > > [ 388.087898] __writeback_single_inode+0x3d/0x340 > > > > > [ 388.087899] writeback_sb_inodes+0x1e5/0x480 > > > > > [ 388.087901] wb_writeback+0xfb/0x2f0 > > > > > [ 388.087902] wb_workfn+0xf0/0x430 > > > > > [ 388.087903] ? __switch_to_asm+0x34/0x70 > > > > > [ 388.087905] ? finish_task_switch+0x75/0x250 > > > > > [ 388.087907] process_one_work+0x1a7/0x370 > > > > > [ 388.087909] worker_thread+0x30/0x380 > > > > > [ 388.087911] ? process_one_work+0x370/0x370 > > > > > [ 388.087912] kthread+0x10c/0x130 > > > > > [ 388.087913] ? kthread_park+0x80/0x80 > > > > > [ 388.087914] ret_from_fork+0x35/0x40 > > > > > > > > This process is actually waiting for IO to complete while holding > > > > checkpoint_mutex which holds up everybody else. The question is why the IO > > > > doesn't complete - that's definitely outside of filesystem. Maybe a bug in > > > > the block layer, storage driver, or something like that... What does > > > > 'cat /sys/block/<device-with-xfstests>/inflight' show? > > > Sorry for the late reply. > > > This value is 0, it represent it doesn't have inflight data(but it may be > > > counted bug or storage driver bug, is it right?). > > > Also, it doesn't hang on my physical machine, but only hang on vm. > > > > Hum, curious. Just do make sure, did you check sdc (because that appears to > > be the stuck device)? > Yes, I check sdc, its value is 0. > # cat /sys/block/sdc/inflight > 0 0 OK, thanks! > > > So what should I do in next step(change storge disk format)? > > > > I'd try couple of things: > > > > 1) If you mount ext4 with barrier=0 mount option, does the problem go away? > Yes. Use barrier=0, this case doesn't hang, OK, so there's some problem with how the block layer is handling flush bios... > > 2) Can you run the test and at the same time run 'blktrace -d /dev/sdc' to > > gather traces? 
Once the machine is stuck, abort blktrace, process the > > resulting files with 'blkparse -i sdc' and send here compressed blkparse > > output. We should be able to see what was happening with the stuck request > > in the trace and maybe that will tell us something. > The log size is too big (58M) and our email limit is 5M. OK, can you put the log somewhere for download? Alternatively you could provide only the last, say, 20s of the trace, which should hopefully fit into the limit... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 19+ messages in thread
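Trimming the parsed trace down to its final seconds, as suggested above, can be done with a little awk. A sketch — the helper name and its two-pass approach are illustrative, not from the thread, and it assumes blkparse's default per-event layout, where the fourth field is the event time in seconds:

```shell
# Keep only the last N seconds of a blkparse text dump.
# Assumes the default blkparse event format, e.g.
#   8,32 2 104324 164.814457402 995 Q FWS [fsstress]
# where field 4 is the timestamp in seconds.
trim_trace() {
    infile="$1"; window="$2"
    # Pass 1: find the timestamp of the last event in the file.
    last=$(awk '$4 ~ /^[0-9]+\.[0-9]+$/ { t = $4 } END { print t }' "$infile")
    # Pass 2: keep events within $window seconds of that final timestamp
    # (summary lines without a timestamp in field 4 are dropped).
    awk -v last="$last" -v w="$window" \
        '$4 ~ /^[0-9]+\.[0-9]+$/ && $4 + 0 >= last - w' "$infile"
}
# e.g.: trim_trace sdc.parsed.txt 20 | gzip > sdc-last20s.txt.gz
```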
* Re: generic/269 hangs on lastest upstream kernel 2020-02-18 11:03 ` Jan Kara @ 2020-02-19 10:09 ` Yang Xu [not found] ` <73af3d5c-ca64-3ad3-aee2-1e78ee4fae4a@cn.fujitsu.com> 0 siblings, 1 reply; 19+ messages in thread From: Yang Xu @ 2020-02-19 10:09 UTC (permalink / raw) To: Jan Kara; +Cc: Theodore Ts'o, fstests on 2020/02/18 19:03, Jan Kara wrote: > On Tue 18-02-20 17:46:54, Yang Xu wrote: >> >> on 2020/02/18 16:24, Jan Kara wrote: >>> On Tue 18-02-20 11:25:37, Yang Xu wrote: >>>> on 2020/02/14 23:00, Jan Kara wrote: >>>>> On Fri 14-02-20 18:24:50, Yang Xu wrote: >>>>>> on 2020/02/14 5:10, Jan Kara wrote: >>>>>>> On Thu 13-02-20 16:49:21, Yang Xu wrote: >>>>>>>>>> When I test generic/269(ext4) on 5.6.0-rc1 kernel, it hangs. >>>>>>>>>> ---------------------------------------------- >>>>>>>>>> dmesg as below: >>>>>>>>>> 76.506753] run fstests generic/269 at 2020-02-11 05:53:44 >>>>>>>>>> [ 76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode. >>>>>>>>>> Opts: acl, user_xattr >>>>>>>>>> [ 100.912511] device virbr0-nic left promiscuous mode >>>>>>>>>> [ 100.912520] virbr0: port 1(virbr0-nic) entered disabled state >>>>>>>>>> [ 246.801561] INFO: task dd:17284 blocked for more than 122 seconds. >>>>>>>>>> [ 246.801564] Not tainted 5.6.0-rc1 #41 >>>>>>>>>> [ 246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables >>>>>>>>>> this mes sage. >>>>>>>>>> [ 246.801566] dd D 0 17284 16931 0x00000080 >>>>>>>>>> [ 246.801568] Call Trace: >>>>>>>>>> [ 246.801584] ? __schedule+0x251/0x690 >>>>>>>>>> [ 246.801586] schedule+0x40/0xb0 >>>>>>>>>> [ 246.801588] wb_wait_for_completion+0x52/0x80 >>>>>>>>>> [ 246.801591] ? finish_wait+0x80/0x80 >>>>>>>>>> [ 246.801592] __writeback_inodes_sb_nr+0xaa/0xd0 >>>>>>>>>> [ 246.801593] try_to_writeback_inodes_sb+0x3c/0x50 >>>>>>>>> >>>>>>>>> Interesting. Does the hang resolve eventually or the machine is hung >>>>>>>>> permanently? 
If the hang is permanent, can you do: >>>>>>>>> >>>>>>>>> echo w >/proc/sysrq-trigger >>>>>>>>> >>>>>>>>> and send us the stacktraces from dmesg? Thanks! >>>>>>>> Yes. the hang is permanent, log as below: >>>>>> full dmesg as attach >>>>> ... >>>>> >>>>> Thanks! So the culprit seems to be: >>>>> >>>>>> [ 388.087799] kworker/u12:0 D 0 32 2 0x80004000 >>>>>> [ 388.087803] Workqueue: writeback wb_workfn (flush-8:32) >>>>>> [ 388.087805] Call Trace: >>>>>> [ 388.087810] ? __schedule+0x251/0x690 >>>>>> [ 388.087811] ? __switch_to_asm+0x34/0x70 >>>>>> [ 388.087812] ? __switch_to_asm+0x34/0x70 >>>>>> [ 388.087814] schedule+0x40/0xb0 >>>>>> [ 388.087816] schedule_timeout+0x20d/0x310 >>>>>> [ 388.087818] io_schedule_timeout+0x19/0x40 >>>>>> [ 388.087819] wait_for_completion_io+0x113/0x180 >>>>>> [ 388.087822] ? wake_up_q+0xa0/0xa0 >>>>>> [ 388.087824] submit_bio_wait+0x5b/0x80 >>>>>> [ 388.087827] blkdev_issue_flush+0x81/0xb0 >>>>>> [ 388.087834] jbd2_cleanup_journal_tail+0x80/0xa0 [jbd2] >>>>>> [ 388.087837] jbd2_log_do_checkpoint+0xf4/0x3f0 [jbd2] >>>>>> [ 388.087840] __jbd2_log_wait_for_space+0x66/0x190 [jbd2] >>>>>> [ 388.087843] ? finish_wait+0x80/0x80 >>>>>> [ 388.087845] add_transaction_credits+0x27d/0x290 [jbd2] >>>>>> [ 388.087847] ? blk_mq_make_request+0x289/0x5d0 >>>>>> [ 388.087849] start_this_handle+0x10a/0x510 [jbd2] >>>>>> [ 388.087851] ? _cond_resched+0x15/0x30 >>>>>> [ 388.087853] jbd2__journal_start+0xea/0x1f0 [jbd2] >>>>>> [ 388.087869] ? ext4_writepages+0x518/0xd90 [ext4] >>>>>> [ 388.087875] __ext4_journal_start_sb+0x6e/0x130 [ext4] >>>>>> [ 388.087883] ext4_writepages+0x518/0xd90 [ext4] >>>>>> [ 388.087886] ? do_writepages+0x41/0xd0 >>>>>> [ 388.087893] ? ext4_mark_inode_dirty+0x1f0/0x1f0 [ext4] >>>>>> [ 388.087894] do_writepages+0x41/0xd0 >>>>>> [ 388.087896] ? 
snprintf+0x49/0x60 >>>>>> [ 388.087898] __writeback_single_inode+0x3d/0x340 >>>>>> [ 388.087899] writeback_sb_inodes+0x1e5/0x480 >>>>>> [ 388.087901] wb_writeback+0xfb/0x2f0 >>>>>> [ 388.087902] wb_workfn+0xf0/0x430 >>>>>> [ 388.087903] ? __switch_to_asm+0x34/0x70 >>>>>> [ 388.087905] ? finish_task_switch+0x75/0x250 >>>>>> [ 388.087907] process_one_work+0x1a7/0x370 >>>>>> [ 388.087909] worker_thread+0x30/0x380 >>>>>> [ 388.087911] ? process_one_work+0x370/0x370 >>>>>> [ 388.087912] kthread+0x10c/0x130 >>>>>> [ 388.087913] ? kthread_park+0x80/0x80 >>>>>> [ 388.087914] ret_from_fork+0x35/0x40 >>>>> >>>>> This process is actually waiting for IO to complete while holding >>>>> checkpoint_mutex which holds up everybody else. The question is why the IO >>>>> doesn't complete - that's definitely outside of filesystem. Maybe a bug in >>>>> the block layer, storage driver, or something like that... What does >>>>> 'cat /sys/block/<device-with-xfstests>/inflight' show? >>>> Sorry for the late reply. >>>> This value is 0, it represent it doesn't have inflight data(but it may be >>>> counted bug or storage driver bug, is it right?). >>>> Also, it doesn't hang on my physical machine, but only hang on vm. >>> >>> Hum, curious. Just do make sure, did you check sdc (because that appears to >>> be the stuck device)? >> Yes, I check sdc, its value is 0. >> # cat /sys/block/sdc/inflight >> 0 0 > > OK, thanks! > >>>> So what should I do in next step(change storge disk format)? >>> >>> I'd try couple of things: >>> >>> 1) If you mount ext4 with barrier=0 mount option, does the problem go away? >> Yes. Use barrier=0, this case doesn't hang, > > OK, so there's some problem with how the block layer is handling flush > bios... > >>> 2) Can you run the test and at the same time run 'blktrace -d /dev/sdc' to >>> gather traces? Once the machine is stuck, abort blktrace, process the >>> resulting files with 'blkparse -i sdc' and send here compressed blkparse >>> output. 
We should be able to see what was happening with the stuck request >>> in the trace and maybe that will tell us something. >> The log size is too big (58M) and our email limit is 5M. > > OK, can you put the log somewhere for download? Alternatively you could > provide only last say 20s of the trace which should hopefully fit into the > limit... OK. I will use the split command and send it to you in private to avoid too much noise. Best Regards Yang Xu > > Honza > ^ permalink raw reply [flat|nested] 19+ messages in thread
[parent not found: <73af3d5c-ca64-3ad3-aee2-1e78ee4fae4a@cn.fujitsu.com>]
* Re: generic/269 hangs on latest upstream kernel [not found] ` <73af3d5c-ca64-3ad3-aee2-1e78ee4fae4a@cn.fujitsu.com> @ 2020-02-19 12:43 ` Jan Kara 2020-02-19 15:20 ` Theodore Y. Ts'o 0 siblings, 1 reply; 19+ messages in thread From: Jan Kara @ 2020-02-19 12:43 UTC (permalink / raw) To: Yang Xu; +Cc: Jan Kara, Theodore Ts'o, fstests On Wed 19-02-20 18:42:36, Yang Xu wrote: > on 2020/02/19 18:09, Yang Xu wrote: > > > > > 1) If you mount ext4 with barrier=0 mount option, does the > > > > > problem go away? > > > > Yes. Use barrier=0, this case doesn't hang, > > > OK, so there's some problem with how the block layer is handling flush > > > bios... > > > > > 2) Can you run the test and at the same time run 'blktrace > > > > -d /dev/sdc' to > > > > > gather traces? Once the machine is stuck, abort blktrace, process the > > > > > resulting files with 'blkparse -i sdc' and send here > > > > > compressed blkparse > > > > > output. We should be able to see what was happening with the > > > > > stuck request > > > > > in the trace and maybe that will tell us something. > > > > The log size is too big(58M) and our emali limit is 5M. > > > > > > OK, can you put the log somewhere for download? Alternatively you could > > > provide only last say 20s of the trace which should hopefully fit > > > into the > > > limit... > > Ok. I will use split command and send you in private to avoid much noise. > log as attach. Thanks for the log. So the reason for the hang is clearly visible at the end of the log:

8,32 2 104324 164.814457402 995 Q FWS [fsstress]
8,32 2 104325 164.814458088 995 G FWS [fsstress]
8,32 2 104326 164.814460957 739 D FN [kworker/2:1H]

This means the fsstress command has queued a cache flush request (from blkdev_issue_flush()); this has been dispatched to the driver (the 'D' event) but it has never been completed by the driver, and so blkdev_issue_flush() never returns.
To debug this further, you probably need to start looking into what happens with the request inside QEMU. There's not much I can help you with at this point since I'm not an expert there. Do you use an image file as a backing store or a raw partition? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 19+ messages in thread
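The lifecycle Jan reads off in the excerpt above — flush queued (Q), request allocated (G), dispatched to the driver (D), but never completed (C) — can also be checked mechanically over the whole dump. A sketch, with the caveat that the helper is illustrative and assumes blkparse's default event format, where field 6 is the action and field 7 the RWBS flags (an 'F' there marks a flush, as in "FWS" or "FN"):

```shell
# Tally dispatched vs completed flush requests in a blkparse dump.
# A dispatch count that stays ahead of the completion count points at
# a flush stuck in the driver.
flush_balance() {
    awk '$7 ~ /F/ && $6 == "D" { d++ }
         $7 ~ /F/ && $6 == "C" { c++ }
         END { printf "flush dispatches: %d, completions: %d\n", d + 0, c + 0 }' "$1"
}
# e.g.: flush_balance sdc.parsed.txt
```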
* Re: generic/269 hangs on latest upstream kernel 2020-02-19 12:43 ` Jan Kara @ 2020-02-19 15:20 ` Theodore Y. Ts'o 2020-02-20 1:35 ` Yang Xu 0 siblings, 1 reply; 19+ messages in thread From: Theodore Y. Ts'o @ 2020-02-19 15:20 UTC (permalink / raw) To: Jan Kara; +Cc: Yang Xu, fstests On Wed, Feb 19, 2020 at 01:43:24PM +0100, Jan Kara wrote: > This means, fsstress command has queued cache flush request (from > blkdev_issue_flush()), this has been dispatched to the driver ('D' event) > but it has never been completed by the driver and so blkdev_issue_flush() > never returns. > > To debug this further, you probably need to start looking into what happens > with the request inside QEMU. There's not much I can help you with at this > point since I'm not an expert there. Do you use image file as a backing store > or a raw partition? This is looking more and more like a Qemu bug or a host OS issue. Yang is using a two-year-old Qemu (qemu-kvm-2.12.0-32.el8+1900+70997154) and is using a QCOW backing file. It could also be a host kernel bug, although that's less likely. Yang, any chance you could upgrade to a newer version of Qemu and see if the problem goes away? If you have a RHEL support contract, perhaps you could file a support request with the Red Hat help desk? - Ted ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: generic/269 hangs on latest upstream kernel 2020-02-19 15:20 ` Theodore Y. Ts'o @ 2020-02-20 1:35 ` Yang Xu 2020-02-25 6:03 ` Yang Xu 0 siblings, 1 reply; 19+ messages in thread From: Yang Xu @ 2020-02-20 1:35 UTC (permalink / raw) To: Theodore Y. Ts'o, Jan Kara; +Cc: fstests on 2020/02/19 23:20, Theodore Y. Ts'o wrote: > On Wed, Feb 19, 2020 at 01:43:24PM +0100, Jan Kara wrote: >> This means, fsstress command has queued cache flush request (from >> blkdev_issue_flush()), this has been dispatched to the driver ('D' event) >> but it has never been completed by the driver and so blkdev_issue_flush() >> never returns. >> >> To debug this further, you probably need to start looking into what happens >> with the request inside QEMU. There's not much I can help you with at this >> point since I'm not an expert there. Do you use image file as a backing store >> or a raw partition? > > This is looking more and more like a Qemu bug or a host OS issue. > Yang is using a two year old Qemu (qemu-kvm-2.12.0-32.el8+1900+70997154) > and is using a QCOW backing file. It also could be a host kernel bug, > although that's less likely. > > Yang, any chance you could have a chance to upgrade to a newer version > of Qemu, and see if the problem goes away? If you have a RHEL support > contract, perhaps you could file a support request with the Red Hat help desk? Of course, I will update to the latest QEMU and test. (I don't have their support.) > > - Ted > > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: generic/269 hangs on latest upstream kernel 2020-02-20 1:35 ` Yang Xu @ 2020-02-25 6:03 ` Yang Xu 0 siblings, 0 replies; 19+ messages in thread From: Yang Xu @ 2020-02-25 6:03 UTC (permalink / raw) To: Theodore Y. Ts'o, Jan Kara; +Cc: fstests on 2020/02/20 9:35, Yang Xu wrote: > > > on 2020/02/19 23:20, Theodore Y. Ts'o wrote: >> On Wed, Feb 19, 2020 at 01:43:24PM +0100, Jan Kara wrote: >>> This means, fsstress command has queued cache flush request (from >>> blkdev_issue_flush()), this has been dispatched to the driver ('D' >>> event) >>> but it has never been completed by the driver and so >>> blkdev_issue_flush() >>> never returns. >>> >>> To debug this further, you probably need to start looking into what >>> happens >>> with the request inside QEMU. There's not much I can help you with at >>> this >>> point since I'm not an expert there. Do you use image file as a >>> backing store >>> or a raw partition? >> >> This is looking more and more like a Qemu bug or a host OS issue. >> Yang is using a two year old Qemu (qemu-kvm-2.12.0-32.el8+1900+70997154) >> and is using a QCOW backing file. It also could be a host kernel bug, >> although that's less likely. >> >> Yang, any chance you could have a chance to upgrade to a newer version >> of Qemu, and see if the problem goes away? If you have a RHEL support >> contract, perhaps you could file a support request with the Red Hat >> help desk? > Of course, I will update lastest qemu to test.(I don't have their support). Hi Ted, Jan I have updated to the latest QEMU (4.2.0) and it still hangs. But I think it is not a filesystem problem: it only fails when the disk uses the IDE bus; when I use the SCSI bus, it passes. Thanks for your analysis and help. Best Regards Yang Xu >> >> - Ted >> >> > > ^ permalink raw reply [flat|nested] 19+ messages in thread
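For anyone trying to reproduce the difference Yang found, the same image can be attached over either bus from the QEMU command line. A sketch of just the disk-related options — disk.qcow2 is a placeholder path, and the rest of the invocation (machine type, boot device, etc.) is omitted:

```shell
# IDE-attached disk (the configuration that hung):
qemu-system-x86_64 -enable-kvm -m 2048 \
    -drive file=disk.qcow2,format=qcow2,if=ide

# virtio-scsi-attached disk (the configuration that passed):
qemu-system-x86_64 -enable-kvm -m 2048 \
    -device virtio-scsi-pci,id=scsi0 \
    -drive file=disk.qcow2,format=qcow2,if=none,id=drive0 \
    -device scsi-hd,drive=drive0,bus=scsi0.0
```

That the hang follows the IDE emulation rather than the filesystem is consistent with the earlier blktrace finding of a flush dispatched to the driver but never completed.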
end of thread, other threads:[~2020-02-25 6:04 UTC | newest] Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-02-11 8:14 generic/269 hangs on latest upstream kernel Yang Xu 2020-02-12 10:54 ` Jan Kara 2020-02-13 8:49 ` Yang Xu 2020-02-13 17:08 ` Theodore Y. Ts'o 2020-02-14 1:14 ` Yang Xu 2020-02-14 14:05 ` Theodore Y. Ts'o [not found] ` <7adf16bf-d527-1c25-1a24-b4d5e4d757c4@cn.fujitsu.com> 2020-02-18 14:35 ` Theodore Y. Ts'o 2020-02-19 10:57 ` Yang Xu 2020-02-13 21:10 ` Jan Kara [not found] ` <062ac52c-3a16-22ef-6396-53334ed94783@cn.fujitsu.com> 2020-02-14 15:00 ` Jan Kara 2020-02-18 3:25 ` Yang Xu 2020-02-18 8:24 ` Jan Kara 2020-02-18 9:46 ` Yang Xu 2020-02-18 11:03 ` Jan Kara 2020-02-19 10:09 ` Yang Xu [not found] ` <73af3d5c-ca64-3ad3-aee2-1e78ee4fae4a@cn.fujitsu.com> 2020-02-19 12:43 ` Jan Kara 2020-02-19 15:20 ` Theodore Y. Ts'o 2020-02-20 1:35 ` Yang Xu 2020-02-25 6:03 ` Yang Xu