fstests.vger.kernel.org archive mirror
* generic/269 hangs on latest upstream kernel
@ 2020-02-11  8:14 Yang Xu
  2020-02-12 10:54 ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Yang Xu @ 2020-02-11  8:14 UTC (permalink / raw)
  To: Jan Kara, Theodore Ts'o; +Cc: fstests

Hi

Since xfstests added support for rename2, this case (generic/269) reports a 
filesystem inconsistency problem with ext4 on my 
system (4.18.0-32.el8.x86_64).

When I test generic/269 (ext4) on a 5.6.0-rc1 kernel, it hangs.
----------------------------------------------
dmesg as below:
[   76.506753] run fstests generic/269 at 2020-02-11 05:53:44
[   76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode. Opts: acl,user_xattr
[  100.912511] device virbr0-nic left promiscuous mode
[  100.912520] virbr0: port 1(virbr0-nic) entered disabled state
[  246.801561] INFO: task dd:17284 blocked for more than 122 seconds.
[  246.801564]       Not tainted 5.6.0-rc1 #41
[  246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.801566] dd              D    0 17284  16931 0x00000080
[  246.801568] Call Trace:
[  246.801584]  ? __schedule+0x251/0x690
[  246.801586]  schedule+0x40/0xb0
[  246.801588]  wb_wait_for_completion+0x52/0x80
[  246.801591]  ? finish_wait+0x80/0x80
[  246.801592]  __writeback_inodes_sb_nr+0xaa/0xd0
[  246.801593]  try_to_writeback_inodes_sb+0x3c/0x50
[  246.801609]  ext4_nonda_switch+0x7b/0x80 [ext4]
[  246.801618]  ext4_da_write_begin+0x6f/0x480 [ext4]
[  246.801621]  generic_perform_write+0xf4/0x1b0
[  246.801628]  ext4_buffered_write_iter+0x8d/0x120 [ext4]
[  246.801634]  ext4_file_write_iter+0x6e/0x700 [ext4]
[  246.801636]  new_sync_write+0x12d/0x1d0
[  246.801638]  vfs_write+0xa5/0x1a0
[  246.801640]  ksys_write+0x59/0xd0
[  246.801643]  do_syscall_64+0x55/0x1b0
[  246.801645]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  246.801646] RIP: 0033:0x7fe9ec947b28
[  246.801650] Code: Bad RIP value.
----------------------------------------------
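
For reference, the test is driven through the standard xfstests ./check
harness; a minimal configuration along these lines is enough to run it
(the device paths below are only examples, not my exact layout):

# local.config for an ext4 run
export FSTYP=ext4
export TEST_DEV=/dev/sdb
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/sdc
export SCRATCH_MNT=/mnt/scratch

# run only this test
./check generic/269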

Has anyone else run into this problem?

Best Regards
Yang Xu




* Re: generic/269 hangs on latest upstream kernel
  2020-02-11  8:14 generic/269 hangs on latest upstream kernel Yang Xu
@ 2020-02-12 10:54 ` Jan Kara
  2020-02-13  8:49   ` Yang Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2020-02-12 10:54 UTC (permalink / raw)
  To: Yang Xu; +Cc: Jan Kara, Theodore Ts'o, fstests

Hello!

On Tue 11-02-20 16:14:35, Yang Xu wrote:
> Since xfstests support rename2, this case(generic/269) reports filesystem
> inconsistent problem with ext4 on my system(4.18.0-32.el8.x86_64).

I don't remember seeing this in my testing... It might be specific to that
RHEL kernel.

> When I test generic/269(ext4) on 5.6.0-rc1 kernel, it hangs.
> ----------------------------------------------
> dmesg as below:
>    76.506753] run fstests generic/269 at 2020-02-11 05:53:44
> [   76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode.
> Opts: acl,                           user_xattr
> [  100.912511] device virbr0-nic left promiscuous mode
> [  100.912520] virbr0: port 1(virbr0-nic) entered disabled state
> [  246.801561] INFO: task dd:17284 blocked for more than 122 seconds.
> [  246.801564]       Not tainted 5.6.0-rc1 #41
> [  246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this mes                           sage.
> [  246.801566] dd              D    0 17284  16931 0x00000080
> [  246.801568] Call Trace:
> [  246.801584]  ? __schedule+0x251/0x690
> [  246.801586]  schedule+0x40/0xb0
> [  246.801588]  wb_wait_for_completion+0x52/0x80
> [  246.801591]  ? finish_wait+0x80/0x80
> [  246.801592]  __writeback_inodes_sb_nr+0xaa/0xd0
> [  246.801593]  try_to_writeback_inodes_sb+0x3c/0x50

Interesting. Does the hang resolve eventually, or is the machine hung
permanently? If the hang is permanent, can you do:

echo w >/proc/sysrq-trigger

and send us the stacktraces from dmesg? Thanks!
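
Something along these lines should be enough to capture it (the first
command is only needed if sysrq happens to be restricted on your system):

echo 1 > /proc/sys/kernel/sysrq   # enable sysrq if restricted
echo w > /proc/sysrq-trigger      # dump blocked (D state) tasks
dmesg > hung-tasks.log            # save the stacktraces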

								Honza

> [  246.801609]  ext4_nonda_switch+0x7b/0x80 [ext4]
> [  246.801618]  ext4_da_write_begin+0x6f/0x480 [ext4]
> [  246.801621]  generic_perform_write+0xf4/0x1b0
> [  246.801628]  ext4_buffered_write_iter+0x8d/0x120 [ext4]
> [  246.801634]  ext4_file_write_iter+0x6e/0x700 [ext4]
> [  246.801636]  new_sync_write+0x12d/0x1d0
> [  246.801638]  vfs_write+0xa5/0x1a0
> [  246.801640]  ksys_write+0x59/0xd0
> [  246.801643]  do_syscall_64+0x55/0x1b0
> [  246.801645]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  246.801646] RIP: 0033:0x7fe9ec947b28
> [  246.801650] Code: Bad RIP value.
> ----------------------------------------------
> 
> Does anyone also meet this problem?
> 
> Best Regards
> Yang Xu
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: generic/269 hangs on latest upstream kernel
  2020-02-12 10:54 ` Jan Kara
@ 2020-02-13  8:49   ` Yang Xu
  2020-02-13 17:08     ` Theodore Y. Ts'o
  2020-02-13 21:10     ` Jan Kara
  0 siblings, 2 replies; 19+ messages in thread
From: Yang Xu @ 2020-02-13  8:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: Theodore Ts'o, fstests



on 2020/02/12 18:54, Jan Kara wrote:
> Hello!
> 
> On Tue 11-02-20 16:14:35, Yang Xu wrote:
>> Since xfstests support rename2, this case(generic/269) reports filesystem
>> inconsistent problem with ext4 on my system(4.18.0-32.el8.x86_64).
> 
> I don't remember seeing this in my testing... It might be specific to that
> RHEL kernel.
Agree.
> 
>> When I test generic/269(ext4) on 5.6.0-rc1 kernel, it hangs.
>> ----------------------------------------------
>> dmesg as below:
>>     76.506753] run fstests generic/269 at 2020-02-11 05:53:44
>> [   76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode.
>> Opts: acl,                           user_xattr
>> [  100.912511] device virbr0-nic left promiscuous mode
>> [  100.912520] virbr0: port 1(virbr0-nic) entered disabled state
>> [  246.801561] INFO: task dd:17284 blocked for more than 122 seconds.
>> [  246.801564]       Not tainted 5.6.0-rc1 #41
>> [  246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this mes                           sage.
>> [  246.801566] dd              D    0 17284  16931 0x00000080
>> [  246.801568] Call Trace:
>> [  246.801584]  ? __schedule+0x251/0x690
>> [  246.801586]  schedule+0x40/0xb0
>> [  246.801588]  wb_wait_for_completion+0x52/0x80
>> [  246.801591]  ? finish_wait+0x80/0x80
>> [  246.801592]  __writeback_inodes_sb_nr+0xaa/0xd0
>> [  246.801593]  try_to_writeback_inodes_sb+0x3c/0x50
> 
> Interesting. Does the hang resolve eventually or the machine is hung
> permanently? If the hang is permanent, can you do:
> 
> echo w >/proc/sysrq-trigger
> 
> and send us the stacktraces from dmesg? Thanks!
Yes, the hang is permanent; the log is below:

[  959.451423] fsstress        D    0 20094  20033 0x00000080
[  959.451424] Call Trace:
[  959.451425]  ? __schedule+0x251/0x690
[  959.451426]  schedule+0x40/0xb0
[  959.451428]  schedule_preempt_disabled+0xa/0x10
[  959.451429]  __mutex_lock.isra.8+0x2b5/0x4a0
[  959.451430]  ? __check_object_size+0x162/0x173
[  959.451431]  lock_rename+0x28/0xb0
[  959.451433]  do_renameat2+0x2a9/0x530
[  959.451434]  __x64_sys_renameat2+0x20/0x30
[  959.451436]  do_syscall_64+0x55/0x1b0
[  959.451436]  entry_SYSCALL_64_after_hwframe+0x44/0xa9


[  959.453023] dd              D    0 21645  19793 0x00004080
[  959.453024] Call Trace:
[  959.453026]  ? __schedule+0x251/0x690
[  959.453027]  ? __wake_up_common_lock+0x87/0xc0
[  959.453028]  schedule+0x40/0xb0
[  959.453030]  jbd2_log_wait_commit+0xac/0x120 [jbd2]
[  959.453032]  ? finish_wait+0x80/0x80
[  959.453034]  jbd2_log_do_checkpoint+0x383/0x3f0 [jbd2]
[  959.453036]  __jbd2_log_wait_for_space+0x66/0x190 [jbd2]
[  959.453038]  add_transaction_credits+0x27d/0x290 [jbd2]
[  959.453040]  ? blk_mq_make_request+0x289/0x5d0
[  959.453042]  start_this_handle+0x10a/0x510 [jbd2]
[  959.453043]  ? _cond_resched+0x15/0x30
[  959.453045]  jbd2__journal_start+0xea/0x1f0 [jbd2]
[  959.453051]  ? ext4_writepages+0x518/0xd90 [ext4]
[  959.453057]  __ext4_journal_start_sb+0x6e/0x130 [ext4]
[  959.453063]  ext4_writepages+0x518/0xd90 [ext4]
[  959.453065]  ? do_writepages+0x41/0xd0
[  959.453070]  ? ext4_mark_inode_dirty+0x1f0/0x1f0 [ext4]
[  959.453072]  do_writepages+0x41/0xd0
[  959.453073]  ? iomap_write_begin+0x4c0/0x4c0
[  959.453188]  ? xfs_iunlock+0xf3/0x100 [xfs]
[  959.453189]  __filemap_fdatawrite_range+0xcb/0x100
[  959.453191]  ? __raw_spin_unlock+0x5/0x10
[  959.453198]  ext4_release_file+0x6c/0xa0 [ext4]
[  959.453200]  __fput+0xbe/0x250
[  959.453201]  task_work_run+0x84/0xa0
[  959.453203]  exit_to_usermode_loop+0xc8/0xd0
[  959.453204]  do_syscall_64+0x1a5/0x1b0
[  959.453205]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  959.453206] RIP: 0033:0x7f368a22f1a8

Best Regards
Yang Xu

> 
> 								Honza
> 
>> [  246.801609]  ext4_nonda_switch+0x7b/0x80 [ext4]
>> [  246.801618]  ext4_da_write_begin+0x6f/0x480 [ext4]
>> [  246.801621]  generic_perform_write+0xf4/0x1b0
>> [  246.801628]  ext4_buffered_write_iter+0x8d/0x120 [ext4]
>> [  246.801634]  ext4_file_write_iter+0x6e/0x700 [ext4]
>> [  246.801636]  new_sync_write+0x12d/0x1d0
>> [  246.801638]  vfs_write+0xa5/0x1a0
>> [  246.801640]  ksys_write+0x59/0xd0
>> [  246.801643]  do_syscall_64+0x55/0x1b0
>> [  246.801645]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  246.801646] RIP: 0033:0x7fe9ec947b28
>> [  246.801650] Code: Bad RIP value.
>> ----------------------------------------------
>>
>> Does anyone also meet this problem?
>>
>> Best Regards
>> Yang Xu
>>
>>




* Re: generic/269 hangs on latest upstream kernel
  2020-02-13  8:49   ` Yang Xu
@ 2020-02-13 17:08     ` Theodore Y. Ts'o
  2020-02-14  1:14       ` Yang Xu
  2020-02-13 21:10     ` Jan Kara
  1 sibling, 1 reply; 19+ messages in thread
From: Theodore Y. Ts'o @ 2020-02-13 17:08 UTC (permalink / raw)
  To: Yang Xu; +Cc: Jan Kara, fstests

On Thu, Feb 13, 2020 at 04:49:21PM +0800, Yang Xu wrote:
> 
> 
> on 2020/02/12 18:54, Jan Kara wrote:
> > Hello!
> > 
> > On Tue 11-02-20 16:14:35, Yang Xu wrote:
> > > Since xfstests support rename2, this case(generic/269) reports filesystem
> > > inconsistent problem with ext4 on my system(4.18.0-32.el8.x86_64).
> > 
> > I don't remember seeing this in my testing... It might be specific to that
> > RHEL kernel.
>
> Agree.

So were you able to reproduce this on a 5.6.0-rc1 kernel or not?

If you are, can you send the .config that you used, in case it's
configuration specific?
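
(A couple of standard ways to grab it, depending on how the kernel was
built:)

zcat /proc/config.gz > config-5.6.0-rc1   # needs CONFIG_IKCONFIG_PROC=y
# or just copy .config from your kernel build tree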

					- Ted


* Re: generic/269 hangs on latest upstream kernel
  2020-02-13  8:49   ` Yang Xu
  2020-02-13 17:08     ` Theodore Y. Ts'o
@ 2020-02-13 21:10     ` Jan Kara
       [not found]       ` <062ac52c-3a16-22ef-6396-53334ed94783@cn.fujitsu.com>
  1 sibling, 1 reply; 19+ messages in thread
From: Jan Kara @ 2020-02-13 21:10 UTC (permalink / raw)
  To: Yang Xu; +Cc: Jan Kara, Theodore Ts'o, fstests

On Thu 13-02-20 16:49:21, Yang Xu wrote:
> > > When I test generic/269(ext4) on 5.6.0-rc1 kernel, it hangs.
> > > ----------------------------------------------
> > > dmesg as below:
> > >     76.506753] run fstests generic/269 at 2020-02-11 05:53:44
> > > [   76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode.
> > > Opts: acl,                           user_xattr
> > > [  100.912511] device virbr0-nic left promiscuous mode
> > > [  100.912520] virbr0: port 1(virbr0-nic) entered disabled state
> > > [  246.801561] INFO: task dd:17284 blocked for more than 122 seconds.
> > > [  246.801564]       Not tainted 5.6.0-rc1 #41
> > > [  246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > > this mes                           sage.
> > > [  246.801566] dd              D    0 17284  16931 0x00000080
> > > [  246.801568] Call Trace:
> > > [  246.801584]  ? __schedule+0x251/0x690
> > > [  246.801586]  schedule+0x40/0xb0
> > > [  246.801588]  wb_wait_for_completion+0x52/0x80
> > > [  246.801591]  ? finish_wait+0x80/0x80
> > > [  246.801592]  __writeback_inodes_sb_nr+0xaa/0xd0
> > > [  246.801593]  try_to_writeback_inodes_sb+0x3c/0x50
> > 
> > Interesting. Does the hang resolve eventually or the machine is hung
> > permanently? If the hang is permanent, can you do:
> > 
> > echo w >/proc/sysrq-trigger
> > 
> > and send us the stacktraces from dmesg? Thanks!
> Yes. the hang is permanent, log as below:
> 
> [  959.451423] fsstress        D    0 20094  20033 0x00000080
> [  959.451424] Call Trace:
> [  959.451425]  ? __schedule+0x251/0x690
> [  959.451426]  schedule+0x40/0xb0
> [  959.451428]  schedule_preempt_disabled+0xa/0x10
> [  959.451429]  __mutex_lock.isra.8+0x2b5/0x4a0
> [  959.451430]  ? __check_object_size+0x162/0x173
> [  959.451431]  lock_rename+0x28/0xb0
> [  959.451433]  do_renameat2+0x2a9/0x530
> [  959.451434]  __x64_sys_renameat2+0x20/0x30
> [  959.451436]  do_syscall_64+0x55/0x1b0
> [  959.451436]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> 
> [  959.453023] dd              D    0 21645  19793 0x00004080
> [  959.453024] Call Trace:
> [  959.453026]  ? __schedule+0x251/0x690
> [  959.453027]  ? __wake_up_common_lock+0x87/0xc0
> [  959.453028]  schedule+0x40/0xb0
> [  959.453030]  jbd2_log_wait_commit+0xac/0x120 [jbd2]
> [  959.453032]  ? finish_wait+0x80/0x80
> [  959.453034]  jbd2_log_do_checkpoint+0x383/0x3f0 [jbd2]
> [  959.453036]  __jbd2_log_wait_for_space+0x66/0x190 [jbd2]
> [  959.453038]  add_transaction_credits+0x27d/0x290 [jbd2]
> [  959.453040]  ? blk_mq_make_request+0x289/0x5d0
> [  959.453042]  start_this_handle+0x10a/0x510 [jbd2]
> [  959.453043]  ? _cond_resched+0x15/0x30
> [  959.453045]  jbd2__journal_start+0xea/0x1f0 [jbd2]
> [  959.453051]  ? ext4_writepages+0x518/0xd90 [ext4]
> [  959.453057]  __ext4_journal_start_sb+0x6e/0x130 [ext4]
> [  959.453063]  ext4_writepages+0x518/0xd90 [ext4]
> [  959.453065]  ? do_writepages+0x41/0xd0
> [  959.453070]  ? ext4_mark_inode_dirty+0x1f0/0x1f0 [ext4]
> [  959.453072]  do_writepages+0x41/0xd0
> [  959.453073]  ? iomap_write_begin+0x4c0/0x4c0
> [  959.453188]  ? xfs_iunlock+0xf3/0x100 [xfs]
> [  959.453189]  __filemap_fdatawrite_range+0xcb/0x100
> [  959.453191]  ? __raw_spin_unlock+0x5/0x10
> [  959.453198]  ext4_release_file+0x6c/0xa0 [ext4]
> [  959.453200]  __fput+0xbe/0x250
> [  959.453201]  task_work_run+0x84/0xa0
> [  959.453203]  exit_to_usermode_loop+0xc8/0xd0
> [  959.453204]  do_syscall_64+0x1a5/0x1b0
> [  959.453205]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  959.453206] RIP: 0033:0x7f368a22f1a8

Is that all that was in dmesg? I'd expect to see other blocked processes as
well - in particular a jbd2 thread that should be doing the transaction
commit, and also whichever process is holding the i_rwsem that fsstress is
blocked on...

								Honza
> > > [  246.801609]  ext4_nonda_switch+0x7b/0x80 [ext4]
> > > [  246.801618]  ext4_da_write_begin+0x6f/0x480 [ext4]
> > > [  246.801621]  generic_perform_write+0xf4/0x1b0
> > > [  246.801628]  ext4_buffered_write_iter+0x8d/0x120 [ext4]
> > > [  246.801634]  ext4_file_write_iter+0x6e/0x700 [ext4]
> > > [  246.801636]  new_sync_write+0x12d/0x1d0
> > > [  246.801638]  vfs_write+0xa5/0x1a0
> > > [  246.801640]  ksys_write+0x59/0xd0
> > > [  246.801643]  do_syscall_64+0x55/0x1b0
> > > [  246.801645]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > [  246.801646] RIP: 0033:0x7fe9ec947b28
> > > [  246.801650] Code: Bad RIP value.
> > > ----------------------------------------------
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: generic/269 hangs on latest upstream kernel
  2020-02-13 17:08     ` Theodore Y. Ts'o
@ 2020-02-14  1:14       ` Yang Xu
  2020-02-14 14:05         ` Theodore Y. Ts'o
  0 siblings, 1 reply; 19+ messages in thread
From: Yang Xu @ 2020-02-14  1:14 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, fstests



on 2020/02/14 1:08, Theodore Y. Ts'o wrote:
> On Thu, Feb 13, 2020 at 04:49:21PM +0800, Yang Xu wrote:
>>
>>
>> on 2020/02/12 18:54, Jan Kara wrote:
>>> Hello!
>>>
>>> On Tue 11-02-20 16:14:35, Yang Xu wrote:
>>>> Since xfstests support rename2, this case(generic/269) reports filesystem
>>>> inconsistent problem with ext4 on my system(4.18.0-32.el8.x86_64).
>>>
>>> I don't remember seeing this in my testing... It might be specific to that
>>> RHEL kernel.
>>
>> Agree.
> 
> So were you able to reproduce this on a 5.6.0-rc1 kernel or not?
No. I can't reproduce the inconsistency on a 5.6.0-rc1 kernel, but the 
5.6.0-rc1 kernel does hang on my KVM machine when running generic/269.

Best Regards
Yang Xu
> 
> If you are, can you send the .config that you used, in case it's
> configuration specific?
> 
> 					- Ted
> 
> 




* Re: generic/269 hangs on latest upstream kernel
  2020-02-14  1:14       ` Yang Xu
@ 2020-02-14 14:05         ` Theodore Y. Ts'o
       [not found]           ` <7adf16bf-d527-1c25-1a24-b4d5e4d757c4@cn.fujitsu.com>
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Y. Ts'o @ 2020-02-14 14:05 UTC (permalink / raw)
  To: Yang Xu; +Cc: Jan Kara, fstests

On Fri, Feb 14, 2020 at 09:14:33AM +0800, Yang Xu wrote:
> > 
> > So were you able to reproduce this on a 5.6.0-rc1 kernel or not?
> No. I don't reproduce it on 5.6.0-rc1 kernel, but  5.6.0-rc1 kernel hang on
> my KVM machine when run generic/269.

I'm not able to reproduce the 5.6.0-rc1 hang using kvm-xfstests[1].
Neither have any other ext4 developers, which is why it might be useful
to see if there's something unique in your .config for 5.6.0-rc1.
Could you send us the .config you were using?

[1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md

Cheers,

						- Ted
						


* Re: generic/269 hangs on latest upstream kernel
       [not found]       ` <062ac52c-3a16-22ef-6396-53334ed94783@cn.fujitsu.com>
@ 2020-02-14 15:00         ` Jan Kara
  2020-02-18  3:25           ` Yang Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2020-02-14 15:00 UTC (permalink / raw)
  To: Yang Xu; +Cc: Jan Kara, Theodore Ts'o, fstests

On Fri 14-02-20 18:24:50, Yang Xu wrote:
> on 2020/02/14 5:10, Jan Kara wrote:
> > On Thu 13-02-20 16:49:21, Yang Xu wrote:
> > > > > When I test generic/269(ext4) on 5.6.0-rc1 kernel, it hangs.
> > > > > ----------------------------------------------
> > > > > dmesg as below:
> > > > >      76.506753] run fstests generic/269 at 2020-02-11 05:53:44
> > > > > [   76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode.
> > > > > Opts: acl,                           user_xattr
> > > > > [  100.912511] device virbr0-nic left promiscuous mode
> > > > > [  100.912520] virbr0: port 1(virbr0-nic) entered disabled state
> > > > > [  246.801561] INFO: task dd:17284 blocked for more than 122 seconds.
> > > > > [  246.801564]       Not tainted 5.6.0-rc1 #41
> > > > > [  246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > > > > this mes                           sage.
> > > > > [  246.801566] dd              D    0 17284  16931 0x00000080
> > > > > [  246.801568] Call Trace:
> > > > > [  246.801584]  ? __schedule+0x251/0x690
> > > > > [  246.801586]  schedule+0x40/0xb0
> > > > > [  246.801588]  wb_wait_for_completion+0x52/0x80
> > > > > [  246.801591]  ? finish_wait+0x80/0x80
> > > > > [  246.801592]  __writeback_inodes_sb_nr+0xaa/0xd0
> > > > > [  246.801593]  try_to_writeback_inodes_sb+0x3c/0x50
> > > > 
> > > > Interesting. Does the hang resolve eventually or the machine is hung
> > > > permanently? If the hang is permanent, can you do:
> > > > 
> > > > echo w >/proc/sysrq-trigger
> > > > 
> > > > and send us the stacktraces from dmesg? Thanks!
> > > Yes. the hang is permanent, log as below:
> full dmesg as attach
...

Thanks! So the culprit seems to be:

> [  388.087799] kworker/u12:0   D    0    32      2 0x80004000
> [  388.087803] Workqueue: writeback wb_workfn (flush-8:32)
> [  388.087805] Call Trace:
> [  388.087810]  ? __schedule+0x251/0x690
> [  388.087811]  ? __switch_to_asm+0x34/0x70
> [  388.087812]  ? __switch_to_asm+0x34/0x70
> [  388.087814]  schedule+0x40/0xb0
> [  388.087816]  schedule_timeout+0x20d/0x310
> [  388.087818]  io_schedule_timeout+0x19/0x40
> [  388.087819]  wait_for_completion_io+0x113/0x180
> [  388.087822]  ? wake_up_q+0xa0/0xa0
> [  388.087824]  submit_bio_wait+0x5b/0x80
> [  388.087827]  blkdev_issue_flush+0x81/0xb0
> [  388.087834]  jbd2_cleanup_journal_tail+0x80/0xa0 [jbd2]
> [  388.087837]  jbd2_log_do_checkpoint+0xf4/0x3f0 [jbd2]
> [  388.087840]  __jbd2_log_wait_for_space+0x66/0x190 [jbd2]
> [  388.087843]  ? finish_wait+0x80/0x80
> [  388.087845]  add_transaction_credits+0x27d/0x290 [jbd2]
> [  388.087847]  ? blk_mq_make_request+0x289/0x5d0
> [  388.087849]  start_this_handle+0x10a/0x510 [jbd2]
> [  388.087851]  ? _cond_resched+0x15/0x30
> [  388.087853]  jbd2__journal_start+0xea/0x1f0 [jbd2]
> [  388.087869]  ? ext4_writepages+0x518/0xd90 [ext4]
> [  388.087875]  __ext4_journal_start_sb+0x6e/0x130 [ext4]
> [  388.087883]  ext4_writepages+0x518/0xd90 [ext4]
> [  388.087886]  ? do_writepages+0x41/0xd0
> [  388.087893]  ? ext4_mark_inode_dirty+0x1f0/0x1f0 [ext4]
> [  388.087894]  do_writepages+0x41/0xd0
> [  388.087896]  ? snprintf+0x49/0x60
> [  388.087898]  __writeback_single_inode+0x3d/0x340
> [  388.087899]  writeback_sb_inodes+0x1e5/0x480
> [  388.087901]  wb_writeback+0xfb/0x2f0
> [  388.087902]  wb_workfn+0xf0/0x430
> [  388.087903]  ? __switch_to_asm+0x34/0x70
> [  388.087905]  ? finish_task_switch+0x75/0x250
> [  388.087907]  process_one_work+0x1a7/0x370
> [  388.087909]  worker_thread+0x30/0x380
> [  388.087911]  ? process_one_work+0x370/0x370
> [  388.087912]  kthread+0x10c/0x130
> [  388.087913]  ? kthread_park+0x80/0x80
> [  388.087914]  ret_from_fork+0x35/0x40

This process is actually waiting for IO to complete while holding
checkpoint_mutex, which holds up everybody else. The question is why the IO
doesn't complete - that's definitely outside of the filesystem. Maybe a bug
in the block layer, the storage driver, or something like that... What does
'cat /sys/block/<device-with-xfstests>/inflight' show?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: generic/269 hangs on latest upstream kernel
  2020-02-14 15:00         ` Jan Kara
@ 2020-02-18  3:25           ` Yang Xu
  2020-02-18  8:24             ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Yang Xu @ 2020-02-18  3:25 UTC (permalink / raw)
  To: Jan Kara; +Cc: Theodore Ts'o, fstests



on 2020/02/14 23:00, Jan Kara wrote:
> On Fri 14-02-20 18:24:50, Yang Xu wrote:
>> on 2020/02/14 5:10, Jan Kara wrote:
>>> On Thu 13-02-20 16:49:21, Yang Xu wrote:
>>>>>> When I test generic/269(ext4) on 5.6.0-rc1 kernel, it hangs.
>>>>>> ----------------------------------------------
>>>>>> dmesg as below:
>>>>>>       76.506753] run fstests generic/269 at 2020-02-11 05:53:44
>>>>>> [   76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode.
>>>>>> Opts: acl,                           user_xattr
>>>>>> [  100.912511] device virbr0-nic left promiscuous mode
>>>>>> [  100.912520] virbr0: port 1(virbr0-nic) entered disabled state
>>>>>> [  246.801561] INFO: task dd:17284 blocked for more than 122 seconds.
>>>>>> [  246.801564]       Not tainted 5.6.0-rc1 #41
>>>>>> [  246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>>>>>> this mes                           sage.
>>>>>> [  246.801566] dd              D    0 17284  16931 0x00000080
>>>>>> [  246.801568] Call Trace:
>>>>>> [  246.801584]  ? __schedule+0x251/0x690
>>>>>> [  246.801586]  schedule+0x40/0xb0
>>>>>> [  246.801588]  wb_wait_for_completion+0x52/0x80
>>>>>> [  246.801591]  ? finish_wait+0x80/0x80
>>>>>> [  246.801592]  __writeback_inodes_sb_nr+0xaa/0xd0
>>>>>> [  246.801593]  try_to_writeback_inodes_sb+0x3c/0x50
>>>>>
>>>>> Interesting. Does the hang resolve eventually or the machine is hung
>>>>> permanently? If the hang is permanent, can you do:
>>>>>
>>>>> echo w >/proc/sysrq-trigger
>>>>>
>>>>> and send us the stacktraces from dmesg? Thanks!
>>>> Yes. the hang is permanent, log as below:
>> full dmesg as attach
> ...
> 
> Thanks! So the culprit seems to be:
> 
>> [  388.087799] kworker/u12:0   D    0    32      2 0x80004000
>> [  388.087803] Workqueue: writeback wb_workfn (flush-8:32)
>> [  388.087805] Call Trace:
>> [  388.087810]  ? __schedule+0x251/0x690
>> [  388.087811]  ? __switch_to_asm+0x34/0x70
>> [  388.087812]  ? __switch_to_asm+0x34/0x70
>> [  388.087814]  schedule+0x40/0xb0
>> [  388.087816]  schedule_timeout+0x20d/0x310
>> [  388.087818]  io_schedule_timeout+0x19/0x40
>> [  388.087819]  wait_for_completion_io+0x113/0x180
>> [  388.087822]  ? wake_up_q+0xa0/0xa0
>> [  388.087824]  submit_bio_wait+0x5b/0x80
>> [  388.087827]  blkdev_issue_flush+0x81/0xb0
>> [  388.087834]  jbd2_cleanup_journal_tail+0x80/0xa0 [jbd2]
>> [  388.087837]  jbd2_log_do_checkpoint+0xf4/0x3f0 [jbd2]
>> [  388.087840]  __jbd2_log_wait_for_space+0x66/0x190 [jbd2]
>> [  388.087843]  ? finish_wait+0x80/0x80
>> [  388.087845]  add_transaction_credits+0x27d/0x290 [jbd2]
>> [  388.087847]  ? blk_mq_make_request+0x289/0x5d0
>> [  388.087849]  start_this_handle+0x10a/0x510 [jbd2]
>> [  388.087851]  ? _cond_resched+0x15/0x30
>> [  388.087853]  jbd2__journal_start+0xea/0x1f0 [jbd2]
>> [  388.087869]  ? ext4_writepages+0x518/0xd90 [ext4]
>> [  388.087875]  __ext4_journal_start_sb+0x6e/0x130 [ext4]
>> [  388.087883]  ext4_writepages+0x518/0xd90 [ext4]
>> [  388.087886]  ? do_writepages+0x41/0xd0
>> [  388.087893]  ? ext4_mark_inode_dirty+0x1f0/0x1f0 [ext4]
>> [  388.087894]  do_writepages+0x41/0xd0
>> [  388.087896]  ? snprintf+0x49/0x60
>> [  388.087898]  __writeback_single_inode+0x3d/0x340
>> [  388.087899]  writeback_sb_inodes+0x1e5/0x480
>> [  388.087901]  wb_writeback+0xfb/0x2f0
>> [  388.087902]  wb_workfn+0xf0/0x430
>> [  388.087903]  ? __switch_to_asm+0x34/0x70
>> [  388.087905]  ? finish_task_switch+0x75/0x250
>> [  388.087907]  process_one_work+0x1a7/0x370
>> [  388.087909]  worker_thread+0x30/0x380
>> [  388.087911]  ? process_one_work+0x370/0x370
>> [  388.087912]  kthread+0x10c/0x130
>> [  388.087913]  ? kthread_park+0x80/0x80
>> [  388.087914]  ret_from_fork+0x35/0x40
> 
> This process is actually waiting for IO to complete while holding
> checkpoint_mutex which holds up everybody else. The question is why the IO
> doesn't complete - that's definitely outside of filesystem. Maybe a bug in
> the block layer, storage driver, or something like that... What does
> 'cat /sys/block/<device-with-xfstests>/inflight' show?
Sorry for the late reply.
This value is 0, which means there is no inflight I/O (but that could itself 
be an accounting bug or a storage driver bug, right?).
Also, it doesn't hang on my physical machine, only on the VM.
So what should I do as the next step (change the storage disk format)?

Best Regards
Yang Xu
> 
> 								Honza
> 




* Re: generic/269 hangs on latest upstream kernel
  2020-02-18  3:25           ` Yang Xu
@ 2020-02-18  8:24             ` Jan Kara
  2020-02-18  9:46               ` Yang Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2020-02-18  8:24 UTC (permalink / raw)
  To: Yang Xu; +Cc: Jan Kara, Theodore Ts'o, fstests

On Tue 18-02-20 11:25:37, Yang Xu wrote:
> on 2020/02/14 23:00, Jan Kara wrote:
> > On Fri 14-02-20 18:24:50, Yang Xu wrote:
> > > on 2020/02/14 5:10, Jan Kara wrote:
> > > > On Thu 13-02-20 16:49:21, Yang Xu wrote:
> > > > > > > When I test generic/269(ext4) on 5.6.0-rc1 kernel, it hangs.
> > > > > > > ----------------------------------------------
> > > > > > > dmesg as below:
> > > > > > >       76.506753] run fstests generic/269 at 2020-02-11 05:53:44
> > > > > > > [   76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode.
> > > > > > > Opts: acl,                           user_xattr
> > > > > > > [  100.912511] device virbr0-nic left promiscuous mode
> > > > > > > [  100.912520] virbr0: port 1(virbr0-nic) entered disabled state
> > > > > > > [  246.801561] INFO: task dd:17284 blocked for more than 122 seconds.
> > > > > > > [  246.801564]       Not tainted 5.6.0-rc1 #41
> > > > > > > [  246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > > > > > > this mes                           sage.
> > > > > > > [  246.801566] dd              D    0 17284  16931 0x00000080
> > > > > > > [  246.801568] Call Trace:
> > > > > > > [  246.801584]  ? __schedule+0x251/0x690
> > > > > > > [  246.801586]  schedule+0x40/0xb0
> > > > > > > [  246.801588]  wb_wait_for_completion+0x52/0x80
> > > > > > > [  246.801591]  ? finish_wait+0x80/0x80
> > > > > > > [  246.801592]  __writeback_inodes_sb_nr+0xaa/0xd0
> > > > > > > [  246.801593]  try_to_writeback_inodes_sb+0x3c/0x50
> > > > > > 
> > > > > > Interesting. Does the hang resolve eventually or the machine is hung
> > > > > > permanently? If the hang is permanent, can you do:
> > > > > > 
> > > > > > echo w >/proc/sysrq-trigger
> > > > > > 
> > > > > > and send us the stacktraces from dmesg? Thanks!
> > > > > Yes. the hang is permanent, log as below:
> > > full dmesg as attach
> > ...
> > 
> > Thanks! So the culprit seems to be:
> > 
> > > [  388.087799] kworker/u12:0   D    0    32      2 0x80004000
> > > [  388.087803] Workqueue: writeback wb_workfn (flush-8:32)
> > > [  388.087805] Call Trace:
> > > [  388.087810]  ? __schedule+0x251/0x690
> > > [  388.087811]  ? __switch_to_asm+0x34/0x70
> > > [  388.087812]  ? __switch_to_asm+0x34/0x70
> > > [  388.087814]  schedule+0x40/0xb0
> > > [  388.087816]  schedule_timeout+0x20d/0x310
> > > [  388.087818]  io_schedule_timeout+0x19/0x40
> > > [  388.087819]  wait_for_completion_io+0x113/0x180
> > > [  388.087822]  ? wake_up_q+0xa0/0xa0
> > > [  388.087824]  submit_bio_wait+0x5b/0x80
> > > [  388.087827]  blkdev_issue_flush+0x81/0xb0
> > > [  388.087834]  jbd2_cleanup_journal_tail+0x80/0xa0 [jbd2]
> > > [  388.087837]  jbd2_log_do_checkpoint+0xf4/0x3f0 [jbd2]
> > > [  388.087840]  __jbd2_log_wait_for_space+0x66/0x190 [jbd2]
> > > [  388.087843]  ? finish_wait+0x80/0x80
> > > [  388.087845]  add_transaction_credits+0x27d/0x290 [jbd2]
> > > [  388.087847]  ? blk_mq_make_request+0x289/0x5d0
> > > [  388.087849]  start_this_handle+0x10a/0x510 [jbd2]
> > > [  388.087851]  ? _cond_resched+0x15/0x30
> > > [  388.087853]  jbd2__journal_start+0xea/0x1f0 [jbd2]
> > > [  388.087869]  ? ext4_writepages+0x518/0xd90 [ext4]
> > > [  388.087875]  __ext4_journal_start_sb+0x6e/0x130 [ext4]
> > > [  388.087883]  ext4_writepages+0x518/0xd90 [ext4]
> > > [  388.087886]  ? do_writepages+0x41/0xd0
> > > [  388.087893]  ? ext4_mark_inode_dirty+0x1f0/0x1f0 [ext4]
> > > [  388.087894]  do_writepages+0x41/0xd0
> > > [  388.087896]  ? snprintf+0x49/0x60
> > > [  388.087898]  __writeback_single_inode+0x3d/0x340
> > > [  388.087899]  writeback_sb_inodes+0x1e5/0x480
> > > [  388.087901]  wb_writeback+0xfb/0x2f0
> > > [  388.087902]  wb_workfn+0xf0/0x430
> > > [  388.087903]  ? __switch_to_asm+0x34/0x70
> > > [  388.087905]  ? finish_task_switch+0x75/0x250
> > > [  388.087907]  process_one_work+0x1a7/0x370
> > > [  388.087909]  worker_thread+0x30/0x380
> > > [  388.087911]  ? process_one_work+0x370/0x370
> > > [  388.087912]  kthread+0x10c/0x130
> > > [  388.087913]  ? kthread_park+0x80/0x80
> > > [  388.087914]  ret_from_fork+0x35/0x40
> > 
> > This process is actually waiting for IO to complete while holding
> > checkpoint_mutex which holds up everybody else. The question is why the IO
> > doesn't complete - that's definitely outside of filesystem. Maybe a bug in
> > the block layer, storage driver, or something like that... What does
> > 'cat /sys/block/<device-with-xfstests>/inflight' show?
> Sorry for the late reply.
> This value is 0, it represent it doesn't have inflight data(but it may be
> counted bug or storage driver bug, is it right?).
> Also, it doesn't hang on my physical machine, but only hang on vm.

Hum, curious. Just to make sure, did you check sdc (because that appears to
be the stuck device)?

> So what should I do in next step(change storge disk format)?

I'd try a couple of things:

1) If you mount ext4 with barrier=0 mount option, does the problem go away?

2) Can you run the test and at the same time run 'blktrace -d /dev/sdc' to
gather traces? Once the machine is stuck, abort blktrace, process the
resulting files with 'blkparse -i sdc', and send the compressed blkparse
output here. We should be able to see what was happening with the stuck
request in the trace and maybe that will tell us something.
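
Concretely, assuming the scratch device is /dev/sdc (as in your logs) and
you drive the test with ./check, the two experiments could look something
like this:

# 1) retest with journal barriers disabled; MOUNT_OPTIONS should cover
#    the scratch mount, or use a plain 'mount -o barrier=0' for a manual run
export MOUNT_OPTIONS="-o barrier=0"
./check generic/269

# 2) trace the device while the test runs with barriers enabled again
unset MOUNT_OPTIONS
blktrace -d /dev/sdc -o sdc &
./check generic/269               # wait for the hang
kill -INT %1                      # stop blktrace once it hangs
blkparse -i sdc > sdc-parse.txt
gzip sdc-parse.txt                # compress before sending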

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: generic/269 hangs on latest upstream kernel
  2020-02-18  8:24             ` Jan Kara
@ 2020-02-18  9:46               ` Yang Xu
  2020-02-18 11:03                 ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Yang Xu @ 2020-02-18  9:46 UTC (permalink / raw)
  To: Jan Kara; +Cc: Theodore Ts'o, fstests

[-- Attachment #1: Type: text/plain, Size: 5398 bytes --]


on 2020/02/18 16:24, Jan Kara wrote:
> On Tue 18-02-20 11:25:37, Yang Xu wrote:
>> on 2020/02/14 23:00, Jan Kara wrote:
>>> On Fri 14-02-20 18:24:50, Yang Xu wrote:
>>>> on 2020/02/14 5:10, Jan Kara wrote:
>>>>> On Thu 13-02-20 16:49:21, Yang Xu wrote:
>>>>>>>> When I test generic/269(ext4) on 5.6.0-rc1 kernel, it hangs.
>>>>>>>> ----------------------------------------------
>>>>>>>> dmesg as below:
>>>>>>>>        76.506753] run fstests generic/269 at 2020-02-11 05:53:44
>>>>>>>> [   76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode.
>>>>>>>> Opts: acl,                           user_xattr
>>>>>>>> [  100.912511] device virbr0-nic left promiscuous mode
>>>>>>>> [  100.912520] virbr0: port 1(virbr0-nic) entered disabled state
>>>>>>>> [  246.801561] INFO: task dd:17284 blocked for more than 122 seconds.
>>>>>>>> [  246.801564]       Not tainted 5.6.0-rc1 #41
>>>>>>>> [  246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>>>>>>>> this mes                           sage.
>>>>>>>> [  246.801566] dd              D    0 17284  16931 0x00000080
>>>>>>>> [  246.801568] Call Trace:
>>>>>>>> [  246.801584]  ? __schedule+0x251/0x690
>>>>>>>> [  246.801586]  schedule+0x40/0xb0
>>>>>>>> [  246.801588]  wb_wait_for_completion+0x52/0x80
>>>>>>>> [  246.801591]  ? finish_wait+0x80/0x80
>>>>>>>> [  246.801592]  __writeback_inodes_sb_nr+0xaa/0xd0
>>>>>>>> [  246.801593]  try_to_writeback_inodes_sb+0x3c/0x50
>>>>>>>
>>>>>>> Interesting. Does the hang resolve eventually or the machine is hung
>>>>>>> permanently? If the hang is permanent, can you do:
>>>>>>>
>>>>>>> echo w >/proc/sysrq-trigger
>>>>>>>
>>>>>>> and send us the stacktraces from dmesg? Thanks!
>>>>>> Yes. the hang is permanent, log as below:
>>>> full dmesg as attach
>>> ...
>>>
>>> Thanks! So the culprit seems to be:
>>>
>>>> [  388.087799] kworker/u12:0   D    0    32      2 0x80004000
>>>> [  388.087803] Workqueue: writeback wb_workfn (flush-8:32)
>>>> [  388.087805] Call Trace:
>>>> [  388.087810]  ? __schedule+0x251/0x690
>>>> [  388.087811]  ? __switch_to_asm+0x34/0x70
>>>> [  388.087812]  ? __switch_to_asm+0x34/0x70
>>>> [  388.087814]  schedule+0x40/0xb0
>>>> [  388.087816]  schedule_timeout+0x20d/0x310
>>>> [  388.087818]  io_schedule_timeout+0x19/0x40
>>>> [  388.087819]  wait_for_completion_io+0x113/0x180
>>>> [  388.087822]  ? wake_up_q+0xa0/0xa0
>>>> [  388.087824]  submit_bio_wait+0x5b/0x80
>>>> [  388.087827]  blkdev_issue_flush+0x81/0xb0
>>>> [  388.087834]  jbd2_cleanup_journal_tail+0x80/0xa0 [jbd2]
>>>> [  388.087837]  jbd2_log_do_checkpoint+0xf4/0x3f0 [jbd2]
>>>> [  388.087840]  __jbd2_log_wait_for_space+0x66/0x190 [jbd2]
>>>> [  388.087843]  ? finish_wait+0x80/0x80
>>>> [  388.087845]  add_transaction_credits+0x27d/0x290 [jbd2]
>>>> [  388.087847]  ? blk_mq_make_request+0x289/0x5d0
>>>> [  388.087849]  start_this_handle+0x10a/0x510 [jbd2]
>>>> [  388.087851]  ? _cond_resched+0x15/0x30
>>>> [  388.087853]  jbd2__journal_start+0xea/0x1f0 [jbd2]
>>>> [  388.087869]  ? ext4_writepages+0x518/0xd90 [ext4]
>>>> [  388.087875]  __ext4_journal_start_sb+0x6e/0x130 [ext4]
>>>> [  388.087883]  ext4_writepages+0x518/0xd90 [ext4]
>>>> [  388.087886]  ? do_writepages+0x41/0xd0
>>>> [  388.087893]  ? ext4_mark_inode_dirty+0x1f0/0x1f0 [ext4]
>>>> [  388.087894]  do_writepages+0x41/0xd0
>>>> [  388.087896]  ? snprintf+0x49/0x60
>>>> [  388.087898]  __writeback_single_inode+0x3d/0x340
>>>> [  388.087899]  writeback_sb_inodes+0x1e5/0x480
>>>> [  388.087901]  wb_writeback+0xfb/0x2f0
>>>> [  388.087902]  wb_workfn+0xf0/0x430
>>>> [  388.087903]  ? __switch_to_asm+0x34/0x70
>>>> [  388.087905]  ? finish_task_switch+0x75/0x250
>>>> [  388.087907]  process_one_work+0x1a7/0x370
>>>> [  388.087909]  worker_thread+0x30/0x380
>>>> [  388.087911]  ? process_one_work+0x370/0x370
>>>> [  388.087912]  kthread+0x10c/0x130
>>>> [  388.087913]  ? kthread_park+0x80/0x80
>>>> [  388.087914]  ret_from_fork+0x35/0x40
>>>
>>> This process is actually waiting for IO to complete while holding
>>> checkpoint_mutex which holds up everybody else. The question is why the IO
>>> doesn't complete - that's definitely outside of filesystem. Maybe a bug in
>>> the block layer, storage driver, or something like that... What does
>>> 'cat /sys/block/<device-with-xfstests>/inflight' show?
>> Sorry for the late reply.
>> This value is 0, it represent it doesn't have inflight data(but it may be
>> counted bug or storage driver bug, is it right?).
>> Also, it doesn't hang on my physical machine, but only hang on vm.
> 
> Hum, curious. Just do make sure, did you check sdc (because that appears to
> be the stuck device)?
Yes, I checked sdc; its value is 0.
# cat /sys/block/sdc/inflight
        0        0

> 
>> So what should I do in next step(change storge disk format)?
> 
> I'd try couple of things:
> 
> 1) If you mount ext4 with barrier=0 mount option, does the problem go away?
Yes. With barrier=0, this case doesn't hang.
> 
> 2) Can you run the test and at the same time run 'blktrace -d /dev/sdc' to
> gather traces? Once the machine is stuck, abort blktrace, process the
> resulting files with 'blkparse -i sdc' and send here compressed blkparse
> output. We should be able to see what was happening with the stuck request
> in the trace and maybe that will tell us something.
The log size is too big (58M) and our email limit is 5M.
> 
> 								Honza
> 



[-- Attachment #2: sdc_blktrace.txt --]
[-- Type: text/plain, Size: 4056 bytes --]


CPU0 (sdc):
 Reads Queued:       1,108,   27,116KiB  Writes Queued:      61,036,  285,888KiB
 Read Dispatches:    1,700,   25,392KiB  Write Dispatches:   14,740,  285,436KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:    1,765,   26,428KiB  Writes Completed:   15,209,  285,448KiB
 Read Merges:            1,       36KiB  Write Merges:       46,135,  185,436KiB
 Read depth:             1               Write depth:             1
 PC Reads Queued:        0,        0KiB  PC Writes Queued:        0,        0KiB
 PC Read Disp.:         16,        2KiB  PC Write Disp.:          0,        0KiB
 PC Reads Req.:          0               PC Writes Req.:          0
 PC Reads Compl.:       16               PC Writes Compl.:        0
 IO unplugs:         2,435               Timer unplugs:       7,863
CPU1 (sdc):
 Reads Queued:       1,005,   24,308KiB  Writes Queued:      61,020,  490,824KiB
 Read Dispatches:    1,767,   27,596KiB  Write Dispatches:   16,107,  439,560KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:    1,672,   24,104KiB  Writes Completed:   16,833,  489,500KiB
 Read Merges:            0,        0KiB  Write Merges:       44,537,  180,228KiB
 Read depth:             1               Write depth:             1
 PC Reads Queued:        0,        0KiB  PC Writes Queued:        0,        0KiB
 PC Read Disp.:         28,        3KiB  PC Write Disp.:          0,        0KiB
 PC Reads Req.:          0               PC Writes Req.:          0
 PC Reads Compl.:       28               PC Writes Compl.:        0
 IO unplugs:         2,268               Timer unplugs:       9,094
CPU2 (sdc):
 Reads Queued:       1,047,   25,337KiB  Writes Queued:      99,999,   21,409MiB
 Read Dispatches:    1,703,   24,013KiB  Write Dispatches:   14,016,   21,436MiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:    1,760,   25,793KiB  Writes Completed:   14,664,   21,410MiB
 Read Merges:            1,        4KiB  Write Merges:       85,721,  343,812KiB
 Read depth:             1               Write depth:             1
 PC Reads Queued:        0,        0KiB  PC Writes Queued:        0,        0KiB
 PC Read Disp.:          1,        0KiB  PC Write Disp.:          0,        0KiB
 PC Reads Req.:          0               PC Writes Req.:          0
 PC Reads Compl.:        1               PC Writes Compl.:        0
 IO unplugs:         2,617               Timer unplugs:       7,643
CPU3 (sdc):
 Reads Queued:       1,090,   29,589KiB  Writes Queued:      57,152,  787,564KiB
 Read Dispatches:    1,728,   29,349KiB  Write Dispatches:   14,100,  812,204KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:    1,700,   30,025KiB  Writes Completed:   14,251,  788,148KiB
 Read Merges:            1,      372KiB  Write Merges:       43,240,  174,028KiB
 Read depth:             1               Write depth:             1
 IO unplugs:         2,461               Timer unplugs:       7,297

Total (sdc):
 Reads Queued:       4,250,  106,350KiB  Writes Queued:     279,207,   22,973MiB
 Read Dispatches:    6,898,  106,350KiB  Write Dispatches:   58,963,   22,973MiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:    6,897,  106,350KiB  Writes Completed:   60,957,   22,973MiB
 Read Merges:            3,      412KiB  Write Merges:      219,633,  883,504KiB
 PC Reads Queued:        0,        0KiB  PC Writes Queued:        0,        0KiB
 PC Read Disp.:         45,        7KiB  PC Write Disp.:          0,        0KiB
 PC Reads Req.:          0               PC Writes Req.:          0
 PC Reads Compl.:       45               PC Writes Compl.:        0
 IO unplugs:         9,781               Timer unplugs:      31,897

Throughput (R/W): 414KiB/s / 89,548KiB/s
Events (sdc): 846,121 entries
Skips: 0 forward (0 -   0.0%)
Input file sdc.blktrace.0 added
Input file sdc.blktrace.1 added
Input file sdc.blktrace.2 added
Input file sdc.blktrace.3 added


* Re: generic/269 hangs on latest upstream kernel
  2020-02-18  9:46               ` Yang Xu
@ 2020-02-18 11:03                 ` Jan Kara
  2020-02-19 10:09                   ` Yang Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2020-02-18 11:03 UTC (permalink / raw)
  To: Yang Xu; +Cc: Jan Kara, Theodore Ts'o, fstests

On Tue 18-02-20 17:46:54, Yang Xu wrote:
> 
> on 2020/02/18 16:24, Jan Kara wrote:
> > On Tue 18-02-20 11:25:37, Yang Xu wrote:
> > > on 2020/02/14 23:00, Jan Kara wrote:
> > > > On Fri 14-02-20 18:24:50, Yang Xu wrote:
> > > > > on 2020/02/14 5:10, Jan Kara wrote:
> > > > > > On Thu 13-02-20 16:49:21, Yang Xu wrote:
> > > > > > > > > When I test generic/269(ext4) on 5.6.0-rc1 kernel, it hangs.
> > > > > > > > > ----------------------------------------------
> > > > > > > > > dmesg as below:
> > > > > > > > >        76.506753] run fstests generic/269 at 2020-02-11 05:53:44
> > > > > > > > > [   76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode.
> > > > > > > > > Opts: acl,                           user_xattr
> > > > > > > > > [  100.912511] device virbr0-nic left promiscuous mode
> > > > > > > > > [  100.912520] virbr0: port 1(virbr0-nic) entered disabled state
> > > > > > > > > [  246.801561] INFO: task dd:17284 blocked for more than 122 seconds.
> > > > > > > > > [  246.801564]       Not tainted 5.6.0-rc1 #41
> > > > > > > > > [  246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > > > > > > > > this mes                           sage.
> > > > > > > > > [  246.801566] dd              D    0 17284  16931 0x00000080
> > > > > > > > > [  246.801568] Call Trace:
> > > > > > > > > [  246.801584]  ? __schedule+0x251/0x690
> > > > > > > > > [  246.801586]  schedule+0x40/0xb0
> > > > > > > > > [  246.801588]  wb_wait_for_completion+0x52/0x80
> > > > > > > > > [  246.801591]  ? finish_wait+0x80/0x80
> > > > > > > > > [  246.801592]  __writeback_inodes_sb_nr+0xaa/0xd0
> > > > > > > > > [  246.801593]  try_to_writeback_inodes_sb+0x3c/0x50
> > > > > > > > 
> > > > > > > > Interesting. Does the hang resolve eventually or the machine is hung
> > > > > > > > permanently? If the hang is permanent, can you do:
> > > > > > > > 
> > > > > > > > echo w >/proc/sysrq-trigger
> > > > > > > > 
> > > > > > > > and send us the stacktraces from dmesg? Thanks!
> > > > > > > Yes. the hang is permanent, log as below:
> > > > > full dmesg as attach
> > > > ...
> > > > 
> > > > Thanks! So the culprit seems to be:
> > > > 
> > > > > [  388.087799] kworker/u12:0   D    0    32      2 0x80004000
> > > > > [  388.087803] Workqueue: writeback wb_workfn (flush-8:32)
> > > > > [  388.087805] Call Trace:
> > > > > [  388.087810]  ? __schedule+0x251/0x690
> > > > > [  388.087811]  ? __switch_to_asm+0x34/0x70
> > > > > [  388.087812]  ? __switch_to_asm+0x34/0x70
> > > > > [  388.087814]  schedule+0x40/0xb0
> > > > > [  388.087816]  schedule_timeout+0x20d/0x310
> > > > > [  388.087818]  io_schedule_timeout+0x19/0x40
> > > > > [  388.087819]  wait_for_completion_io+0x113/0x180
> > > > > [  388.087822]  ? wake_up_q+0xa0/0xa0
> > > > > [  388.087824]  submit_bio_wait+0x5b/0x80
> > > > > [  388.087827]  blkdev_issue_flush+0x81/0xb0
> > > > > [  388.087834]  jbd2_cleanup_journal_tail+0x80/0xa0 [jbd2]
> > > > > [  388.087837]  jbd2_log_do_checkpoint+0xf4/0x3f0 [jbd2]
> > > > > [  388.087840]  __jbd2_log_wait_for_space+0x66/0x190 [jbd2]
> > > > > [  388.087843]  ? finish_wait+0x80/0x80
> > > > > [  388.087845]  add_transaction_credits+0x27d/0x290 [jbd2]
> > > > > [  388.087847]  ? blk_mq_make_request+0x289/0x5d0
> > > > > [  388.087849]  start_this_handle+0x10a/0x510 [jbd2]
> > > > > [  388.087851]  ? _cond_resched+0x15/0x30
> > > > > [  388.087853]  jbd2__journal_start+0xea/0x1f0 [jbd2]
> > > > > [  388.087869]  ? ext4_writepages+0x518/0xd90 [ext4]
> > > > > [  388.087875]  __ext4_journal_start_sb+0x6e/0x130 [ext4]
> > > > > [  388.087883]  ext4_writepages+0x518/0xd90 [ext4]
> > > > > [  388.087886]  ? do_writepages+0x41/0xd0
> > > > > [  388.087893]  ? ext4_mark_inode_dirty+0x1f0/0x1f0 [ext4]
> > > > > [  388.087894]  do_writepages+0x41/0xd0
> > > > > [  388.087896]  ? snprintf+0x49/0x60
> > > > > [  388.087898]  __writeback_single_inode+0x3d/0x340
> > > > > [  388.087899]  writeback_sb_inodes+0x1e5/0x480
> > > > > [  388.087901]  wb_writeback+0xfb/0x2f0
> > > > > [  388.087902]  wb_workfn+0xf0/0x430
> > > > > [  388.087903]  ? __switch_to_asm+0x34/0x70
> > > > > [  388.087905]  ? finish_task_switch+0x75/0x250
> > > > > [  388.087907]  process_one_work+0x1a7/0x370
> > > > > [  388.087909]  worker_thread+0x30/0x380
> > > > > [  388.087911]  ? process_one_work+0x370/0x370
> > > > > [  388.087912]  kthread+0x10c/0x130
> > > > > [  388.087913]  ? kthread_park+0x80/0x80
> > > > > [  388.087914]  ret_from_fork+0x35/0x40
> > > > 
> > > > This process is actually waiting for IO to complete while holding
> > > > checkpoint_mutex which holds up everybody else. The question is why the IO
> > > > doesn't complete - that's definitely outside of filesystem. Maybe a bug in
> > > > the block layer, storage driver, or something like that... What does
> > > > 'cat /sys/block/<device-with-xfstests>/inflight' show?
> > > Sorry for the late reply.
> > > This value is 0, it represent it doesn't have inflight data(but it may be
> > > counted bug or storage driver bug, is it right?).
> > > Also, it doesn't hang on my physical machine, but only hang on vm.
> > 
> > Hum, curious. Just do make sure, did you check sdc (because that appears to
> > be the stuck device)?
> Yes, I check sdc, its value is 0.
> # cat /sys/block/sdc/inflight
>        0        0

OK, thanks!

> > > So what should I do in next step(change storge disk format)?
> > 
> > I'd try couple of things:
> > 
> > 1) If you mount ext4 with barrier=0 mount option, does the problem go away?
> Yes. Use barrier=0, this case doesn't hang,

OK, so there's some problem with how the block layer is handling flush
bios...

> > 2) Can you run the test and at the same time run 'blktrace -d /dev/sdc' to
> > gather traces? Once the machine is stuck, abort blktrace, process the
> > resulting files with 'blkparse -i sdc' and send here compressed blkparse
> > output. We should be able to see what was happening with the stuck request
> > in the trace and maybe that will tell us something.
> The log size is too big(58M) and our emali limit is 5M.

OK, can you put the log somewhere for download? Alternatively, you could
provide only the last, say, 20s of the trace, which should hopefully fit
within the limit...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: generic/269 hangs on latest upstream kernel
       [not found]           ` <7adf16bf-d527-1c25-1a24-b4d5e4d757c4@cn.fujitsu.com>
@ 2020-02-18 14:35             ` Theodore Y. Ts'o
  2020-02-19 10:57               ` Yang Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Y. Ts'o @ 2020-02-18 14:35 UTC (permalink / raw)
  To: Yang Xu; +Cc: Jan Kara, fstests

On Tue, Feb 18, 2020 at 11:44:24AM +0800, Yang Xu wrote:
> 
> 
> on 2020/02/14 22:05, Theodore Y. Ts'o wrote:
> > On Fri, Feb 14, 2020 at 09:14:33AM +0800, Yang Xu wrote:
> > > > 
> > > > So were you able to reproduce this on a 5.6.0-rc1 kernel or not?
> > > No. I don't reproduce it on 5.6.0-rc1 kernel, but  5.6.0-rc1 kernel hang on
> > > my KVM machine when run generic/269.
> > 
> > I'm not able to reproduce the 5.6.0-rc1 hang using kvm-xfstests[1].
> > Neither has any other ext4 developers, which is why it might be useful
> > to see if there's something unique in your .config for 5.6.0-rc1.
> > Could you send us the .config you were using?
> > 
> Sorry for the late reply.
> 
> my 5.6.0-rc1 config as attach.

Unfortunately, I just tried using your config with kvm-xfstests, and
it passed without problems.  Did you say this was a reliable
reproducer on your system?

% kvm-xfstests -c 4k generic/269
KERNEL:    kernel 5.6.0-rc2-xfstests #1492 SMP Mon Feb 17 23:22:40 EST 2020 x86_64
CPUS:      2
MEM:       1966.03

ext4/4k: 1 tests, 43 seconds
  generic/269  Pass     42s
Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 42s

FSTESTVER: blktests 9e02419 (Thu, 19 Dec 2019 14:45:55 -0800)
FSTESTVER: e2fsprogs v1.45.4-15-g4b4f7b35 (Wed, 9 Oct 2019 20:25:01 -0400)
FSTESTVER: fio  fio-3.17 (Mon, 16 Dec 2019 15:48:43 -0700)
FSTESTVER: fsverity v1.0 (Wed, 6 Nov 2019 10:35:02 -0800)
FSTESTVER: ima-evm-utils v1.2 (Fri, 26 Jul 2019 07:42:17 -0400)
FSTESTVER: nvme-cli v1.9-159-g18119bc (Thu, 26 Dec 2019 11:04:01 -0700)
FSTESTVER: quota  9a001cc (Tue, 5 Nov 2019 16:12:59 +0100)
FSTESTVER: util-linux v2.35-19-g95afec771 (Fri, 24 Jan 2020 12:25:35 -0500)
FSTESTVER: xfsprogs v5.4.0 (Fri, 20 Dec 2019 16:47:12 -0500)
FSTESTVER: xfstests linux-v3.8-2652-g002e349c (Fri, 24 Jan 2020 00:49:40 -0500)
FSTESTVER: xfstests-bld 6f10355 (Fri, 24 Jan 2020 12:36:30 -0500)
FSTESTCFG: 4k
FSTESTSET: generic/269
FSTESTOPT: aex

This was run on a Debian testing system, with kvm version:

QEMU emulator version 4.2.0 (Debian 1:4.2-3)

What about the hardware details of your system?  How many CPUs, how much
memory, etc.?  And what sort of storage device are you using for kvm?  (I'm
using virtio-scsi backed by LVM volumes for the scratch and test
partitions.)
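
Output from something like the following would cover most of that (the
virsh command only applies if the guest is managed by libvirt):

lscpu                      # CPU count / topology inside the guest
free -h                    # memory
virsh dumpxml <guestname>  # host-side view of the disk bus and backing storage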

						- Ted


* Re: generic/269 hangs on latest upstream kernel
  2020-02-18 11:03                 ` Jan Kara
@ 2020-02-19 10:09                   ` Yang Xu
       [not found]                     ` <73af3d5c-ca64-3ad3-aee2-1e78ee4fae4a@cn.fujitsu.com>
  0 siblings, 1 reply; 19+ messages in thread
From: Yang Xu @ 2020-02-19 10:09 UTC (permalink / raw)
  To: Jan Kara; +Cc: Theodore Ts'o, fstests



on 2020/02/18 19:03, Jan Kara wrote:
> On Tue 18-02-20 17:46:54, Yang Xu wrote:
>>
>> on 2020/02/18 16:24, Jan Kara wrote:
>>> On Tue 18-02-20 11:25:37, Yang Xu wrote:
>>>> on 2020/02/14 23:00, Jan Kara wrote:
>>>>> On Fri 14-02-20 18:24:50, Yang Xu wrote:
>>>>>> on 2020/02/14 5:10, Jan Kara wrote:
>>>>>>> On Thu 13-02-20 16:49:21, Yang Xu wrote:
>>>>>>>>>> When I test generic/269(ext4) on 5.6.0-rc1 kernel, it hangs.
>>>>>>>>>> ----------------------------------------------
>>>>>>>>>> dmesg as below:
>>>>>>>>>>         76.506753] run fstests generic/269 at 2020-02-11 05:53:44
>>>>>>>>>> [   76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode.
>>>>>>>>>> Opts: acl,                           user_xattr
>>>>>>>>>> [  100.912511] device virbr0-nic left promiscuous mode
>>>>>>>>>> [  100.912520] virbr0: port 1(virbr0-nic) entered disabled state
>>>>>>>>>> [  246.801561] INFO: task dd:17284 blocked for more than 122 seconds.
>>>>>>>>>> [  246.801564]       Not tainted 5.6.0-rc1 #41
>>>>>>>>>> [  246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>>>>>>>>>> this mes                           sage.
>>>>>>>>>> [  246.801566] dd              D    0 17284  16931 0x00000080
>>>>>>>>>> [  246.801568] Call Trace:
>>>>>>>>>> [  246.801584]  ? __schedule+0x251/0x690
>>>>>>>>>> [  246.801586]  schedule+0x40/0xb0
>>>>>>>>>> [  246.801588]  wb_wait_for_completion+0x52/0x80
>>>>>>>>>> [  246.801591]  ? finish_wait+0x80/0x80
>>>>>>>>>> [  246.801592]  __writeback_inodes_sb_nr+0xaa/0xd0
>>>>>>>>>> [  246.801593]  try_to_writeback_inodes_sb+0x3c/0x50
>>>>>>>>>
>>>>>>>>> Interesting. Does the hang resolve eventually or the machine is hung
>>>>>>>>> permanently? If the hang is permanent, can you do:
>>>>>>>>>
>>>>>>>>> echo w >/proc/sysrq-trigger
>>>>>>>>>
>>>>>>>>> and send us the stacktraces from dmesg? Thanks!
>>>>>>>> Yes. the hang is permanent, log as below:
>>>>>> full dmesg as attach
>>>>> ...
>>>>>
>>>>> Thanks! So the culprit seems to be:
>>>>>
>>>>>> [  388.087799] kworker/u12:0   D    0    32      2 0x80004000
>>>>>> [  388.087803] Workqueue: writeback wb_workfn (flush-8:32)
>>>>>> [  388.087805] Call Trace:
>>>>>> [  388.087810]  ? __schedule+0x251/0x690
>>>>>> [  388.087811]  ? __switch_to_asm+0x34/0x70
>>>>>> [  388.087812]  ? __switch_to_asm+0x34/0x70
>>>>>> [  388.087814]  schedule+0x40/0xb0
>>>>>> [  388.087816]  schedule_timeout+0x20d/0x310
>>>>>> [  388.087818]  io_schedule_timeout+0x19/0x40
>>>>>> [  388.087819]  wait_for_completion_io+0x113/0x180
>>>>>> [  388.087822]  ? wake_up_q+0xa0/0xa0
>>>>>> [  388.087824]  submit_bio_wait+0x5b/0x80
>>>>>> [  388.087827]  blkdev_issue_flush+0x81/0xb0
>>>>>> [  388.087834]  jbd2_cleanup_journal_tail+0x80/0xa0 [jbd2]
>>>>>> [  388.087837]  jbd2_log_do_checkpoint+0xf4/0x3f0 [jbd2]
>>>>>> [  388.087840]  __jbd2_log_wait_for_space+0x66/0x190 [jbd2]
>>>>>> [  388.087843]  ? finish_wait+0x80/0x80
>>>>>> [  388.087845]  add_transaction_credits+0x27d/0x290 [jbd2]
>>>>>> [  388.087847]  ? blk_mq_make_request+0x289/0x5d0
>>>>>> [  388.087849]  start_this_handle+0x10a/0x510 [jbd2]
>>>>>> [  388.087851]  ? _cond_resched+0x15/0x30
>>>>>> [  388.087853]  jbd2__journal_start+0xea/0x1f0 [jbd2]
>>>>>> [  388.087869]  ? ext4_writepages+0x518/0xd90 [ext4]
>>>>>> [  388.087875]  __ext4_journal_start_sb+0x6e/0x130 [ext4]
>>>>>> [  388.087883]  ext4_writepages+0x518/0xd90 [ext4]
>>>>>> [  388.087886]  ? do_writepages+0x41/0xd0
>>>>>> [  388.087893]  ? ext4_mark_inode_dirty+0x1f0/0x1f0 [ext4]
>>>>>> [  388.087894]  do_writepages+0x41/0xd0
>>>>>> [  388.087896]  ? snprintf+0x49/0x60
>>>>>> [  388.087898]  __writeback_single_inode+0x3d/0x340
>>>>>> [  388.087899]  writeback_sb_inodes+0x1e5/0x480
>>>>>> [  388.087901]  wb_writeback+0xfb/0x2f0
>>>>>> [  388.087902]  wb_workfn+0xf0/0x430
>>>>>> [  388.087903]  ? __switch_to_asm+0x34/0x70
>>>>>> [  388.087905]  ? finish_task_switch+0x75/0x250
>>>>>> [  388.087907]  process_one_work+0x1a7/0x370
>>>>>> [  388.087909]  worker_thread+0x30/0x380
>>>>>> [  388.087911]  ? process_one_work+0x370/0x370
>>>>>> [  388.087912]  kthread+0x10c/0x130
>>>>>> [  388.087913]  ? kthread_park+0x80/0x80
>>>>>> [  388.087914]  ret_from_fork+0x35/0x40
>>>>>
>>>>> This process is actually waiting for IO to complete while holding
>>>>> checkpoint_mutex which holds up everybody else. The question is why the IO
>>>>> doesn't complete - that's definitely outside of filesystem. Maybe a bug in
>>>>> the block layer, storage driver, or something like that... What does
>>>>> 'cat /sys/block/<device-with-xfstests>/inflight' show?
>>>> Sorry for the late reply.
>>>> This value is 0, which means there is no inflight data (though it could be
>>>> an accounting bug or a storage driver bug, right?).
>>>> Also, it doesn't hang on my physical machine; it only hangs on the VM.
>>>
>>> Hum, curious. Just to make sure, did you check sdc (because that appears to
>>> be the stuck device)?
>> Yes, I checked sdc; its value is 0.
>> # cat /sys/block/sdc/inflight
>>         0        0
> 
> OK, thanks!
> 
>>>> So what should I do next (change the storage disk format)?
>>>
>>> I'd try a couple of things:
>>>
>>> 1) If you mount ext4 with barrier=0 mount option, does the problem go away?
>> Yes. With barrier=0, this case doesn't hang.
> 
> OK, so there's some problem with how the block layer is handling flush
> bios...
> 
>>> 2) Can you run the test and at the same time run 'blktrace -d /dev/sdc' to
>>> gather traces? Once the machine is stuck, abort blktrace, process the
>>> resulting files with 'blkparse -i sdc' and send here compressed blkparse
>>> output. We should be able to see what was happening with the stuck request
>>> in the trace and maybe that will tell us something.
>> The log size is too big (58M) and our email limit is 5M.
> 
> OK, can you put the log somewhere for download? Alternatively you could
> provide only the last, say, 20s of the trace, which should hopefully fit
> into the limit...
OK. I will use the split command and send it to you in private to avoid too much noise.
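
Roughly like this (the 4M chunk size is only a guess to stay under the 5M
mail limit):

    # run alongside the test, then stop blktrace once the hang is reproduced
    blktrace -d /dev/sdc
    # merge the per-CPU sdc.blktrace.* files into one readable trace
    blkparse -i sdc > sdc-blkparse.txt
    # compress, then split into pieces small enough to mail
    gzip -9 sdc-blkparse.txt
    split -b 4M sdc-blkparse.txt.gz sdc-blkparse.txt.gz.part-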

Best Regards
Yang Xu
> 
> 								Honza
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: generic/269 hangs on lastest upstream kernel
  2020-02-18 14:35             ` Theodore Y. Ts'o
@ 2020-02-19 10:57               ` Yang Xu
  0 siblings, 0 replies; 19+ messages in thread
From: Yang Xu @ 2020-02-19 10:57 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, fstests



on 2020/02/18 22:35, Theodore Y. Ts'o wrote:
> On Tue, Feb 18, 2020 at 11:44:24AM +0800, Yang Xu wrote:
>>
>>
>> on 2020/02/14 22:05, Theodore Y. Ts'o wrote:
>>> On Fri, Feb 14, 2020 at 09:14:33AM +0800, Yang Xu wrote:
>>>>>
>>>>> So were you able to reproduce this on a 5.6.0-rc1 kernel or not?
>>>> No. I can't reproduce it on the 5.6.0-rc1 kernel, but the 5.6.0-rc1 kernel
>>>> hangs on my KVM machine when running generic/269.
>>>
>>> I'm not able to reproduce the 5.6.0-rc1 hang using kvm-xfstests[1].
>>> Neither have any other ext4 developers, which is why it might be useful
>>> to see if there's something unique in your .config for 5.6.0-rc1.
>>> Could you send us the .config you were using?
>>>
>> Sorry for the late reply.
>>
>> My 5.6.0-rc1 config is attached.
> 
> Unfortunately, I just tried using your config with kvm-xfstests, and
> it passed without problems.  Did you say this was a reliable
> reproducer on your system?

Yes, 100%. Tomorrow I will try it on other kvm machines.
> 
> % kvm-xfstests -c 4k generic/269
> KERNEL:    kernel 5.6.0-rc2-xfstests #1492 SMP Mon Feb 17 23:22:40 EST 2020 x86_64
> CPUS:      2
> MEM:       1966.03
> 
> ext4/4k: 1 tests, 43 seconds
>    generic/269  Pass     42s
> Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 42s
> 
> FSTESTVER: blktests 9e02419 (Thu, 19 Dec 2019 14:45:55 -0800)
> FSTESTVER: e2fsprogs v1.45.4-15-g4b4f7b35 (Wed, 9 Oct 2019 20:25:01 -0400)
> FSTESTVER: fio  fio-3.17 (Mon, 16 Dec 2019 15:48:43 -0700)
> FSTESTVER: fsverity v1.0 (Wed, 6 Nov 2019 10:35:02 -0800)
> FSTESTVER: ima-evm-utils v1.2 (Fri, 26 Jul 2019 07:42:17 -0400)
> FSTESTVER: nvme-cli v1.9-159-g18119bc (Thu, 26 Dec 2019 11:04:01 -0700)
> FSTESTVER: quota  9a001cc (Tue, 5 Nov 2019 16:12:59 +0100)
> FSTESTVER: util-linux v2.35-19-g95afec771 (Fri, 24 Jan 2020 12:25:35 -0500)
> FSTESTVER: xfsprogs v5.4.0 (Fri, 20 Dec 2019 16:47:12 -0500)
> FSTESTVER: xfstests linux-v3.8-2652-g002e349c (Fri, 24 Jan 2020 00:49:40 -0500)
> FSTESTVER: xfstests-bld 6f10355 (Fri, 24 Jan 2020 12:36:30 -0500)
> FSTESTCFG: 4k
> FSTESTSET: generic/269
> FSTESTOPT: aex
> 
> This was run on a Debian testing system, with kvm version:
> 
> QEMU emulator version 4.2.0 (Debian 1:4.2-3)
> 
I don't test on Debian. My QEMU version is as below:
# qemu-img --version
qemu-img version 2.12.0 (qemu-kvm-2.12.0-32.el8+1900+70997154)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

> What about hardware details of your system?  How many CPU's, memory,
> etc.?  And what sort of storage device are you using for kvm?  (I'm
> using virtio-scsi backed by LVM volumes for the scratch and test
> partitions.)
4 CPUs and 4G memory (4 NUMA nodes), 4G swap. The storage device is as below
(device bus: IDE, storage format: qcow2):

     <disk type='file' device='disk'>
       <driver name='qemu' type='qcow2'/>
       <source file='/var/lib/libvirt/images/test.qcow2'/>
       <target dev='hdc' bus='ide'/>
       <address type='drive' controller='0' bus='1' target='0' unit='0'/>
     </disk>
     <disk type='file' device='disk'>
       <driver name='qemu' type='qcow2'/>
       <source file='/var/lib/libvirt/images/test1.qcow2'/>
       <target dev='hdd' bus='ide'/>
       <address type='drive' controller='0' bus='1' target='0' unit='1'/>
     </disk>


Best Regards
Yang Xu
> 
> 						- Ted
> 
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: generic/269 hangs on lastest upstream kernel
       [not found]                     ` <73af3d5c-ca64-3ad3-aee2-1e78ee4fae4a@cn.fujitsu.com>
@ 2020-02-19 12:43                       ` Jan Kara
  2020-02-19 15:20                         ` Theodore Y. Ts'o
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2020-02-19 12:43 UTC (permalink / raw)
  To: Yang Xu; +Cc: Jan Kara, Theodore Ts'o, fstests

On Wed 19-02-20 18:42:36, Yang Xu wrote:
> on 2020/02/19 18:09, Yang Xu wrote:
> > > > > 1) If you mount ext4 with barrier=0 mount option, does the
> > > > > problem go away?
> > > > Yes. With barrier=0, this case doesn't hang.
> > > 
> > > OK, so there's some problem with how the block layer is handling flush
> > > bios...
> > > 
> > > > > 2) Can you run the test and at the same time run 'blktrace
> > > > > -d /dev/sdc' to
> > > > > gather traces? Once the machine is stuck, abort blktrace, process the
> > > > > resulting files with 'blkparse -i sdc' and send here
> > > > > compressed blkparse
> > > > > output. We should be able to see what was happening with the
> > > > > stuck request
> > > > > in the trace and maybe that will tell us something.
> > > > The log size is too big (58M) and our email limit is 5M.
> > > 
> > > OK, can you put the log somewhere for download? Alternatively you could
> > > provide only the last, say, 20s of the trace, which should hopefully
> > > fit into the limit...
> > OK. I will use the split command and send it to you in private to avoid too much noise.
> The log is attached.

Thanks for the log. So the reason for the hang is clearly visible at the
end of the log:

  8,32   2   104324   164.814457402   995  Q FWS [fsstress]
  8,32   2   104325   164.814458088   995  G FWS [fsstress]
  8,32   2   104326   164.814460957   739  D  FN [kworker/2:1H]

This means the fsstress command has queued a cache flush request (from
blkdev_issue_flush()); the request has been dispatched to the driver (the 'D'
event) but has never been completed by the driver, so blkdev_issue_flush()
never returns.

To debug this further, you probably need to start looking into what happens
with the request inside QEMU. There's not much I can help you with at this
point since I'm not an expert there. Do you use an image file as a backing
store or a raw partition?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: generic/269 hangs on lastest upstream kernel
  2020-02-19 12:43                       ` Jan Kara
@ 2020-02-19 15:20                         ` Theodore Y. Ts'o
  2020-02-20  1:35                           ` Yang Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Y. Ts'o @ 2020-02-19 15:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: Yang Xu, fstests

On Wed, Feb 19, 2020 at 01:43:24PM +0100, Jan Kara wrote:
> This means the fsstress command has queued a cache flush request (from
> blkdev_issue_flush()); the request has been dispatched to the driver (the 'D'
> event) but has never been completed by the driver, so blkdev_issue_flush()
> never returns.
> 
> To debug this further, you probably need to start looking into what happens
> with the request inside QEMU. There's not much I can help you with at this
> point since I'm not an expert there. Do you use an image file as a backing
> store or a raw partition?

This is looking more and more like a Qemu bug or a host OS issue.
Yang is using a two-year-old Qemu (qemu-kvm-2.12.0-32.el8+1900+70997154)
with a QCOW backing file.  It could also be a host kernel bug,
although that's less likely.

Yang, any chance you could upgrade to a newer version of Qemu and see if
the problem goes away?  If you have a RHEL support
contract, perhaps you could file a support request with the Red Hat help desk?

	  	      	    	   	   	   - Ted

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: generic/269 hangs on lastest upstream kernel
  2020-02-19 15:20                         ` Theodore Y. Ts'o
@ 2020-02-20  1:35                           ` Yang Xu
  2020-02-25  6:03                             ` Yang Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Yang Xu @ 2020-02-20  1:35 UTC (permalink / raw)
  To: Theodore Y. Ts'o, Jan Kara; +Cc: fstests



on 2020/02/19 23:20, Theodore Y. Ts'o wrote:
> On Wed, Feb 19, 2020 at 01:43:24PM +0100, Jan Kara wrote:
>> This means the fsstress command has queued a cache flush request (from
>> blkdev_issue_flush()); the request has been dispatched to the driver (the
>> 'D' event) but has never been completed by the driver, so
>> blkdev_issue_flush() never returns.
>>
>> To debug this further, you probably need to start looking into what happens
>> with the request inside QEMU. There's not much I can help you with at this
>> point since I'm not an expert there. Do you use an image file as a backing
>> store or a raw partition?
> 
> This is looking more and more like a Qemu bug or a host OS issue.
> Yang is using a two-year-old Qemu (qemu-kvm-2.12.0-32.el8+1900+70997154)
> with a QCOW backing file.  It could also be a host kernel bug,
> although that's less likely.
> 
> Yang, any chance you could upgrade to a newer version of Qemu and see if
> the problem goes away?  If you have a RHEL support contract, perhaps you
> could file a support request with the Red Hat help desk?
Of course, I will update to the latest QEMU and test. (I don't have their support.)
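
Probably just by building 4.2.0 from source, roughly like this (the exact
steps are a guess, I have not run them yet):

    wget https://download.qemu.org/qemu-4.2.0.tar.xz
    tar xJf qemu-4.2.0.tar.xz
    cd qemu-4.2.0
    ./configure --target-list=x86_64-softmmu
    make -j$(nproc) && make install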
> 
> 	  	      	    	   	   	   - Ted
> 
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: generic/269 hangs on lastest upstream kernel
  2020-02-20  1:35                           ` Yang Xu
@ 2020-02-25  6:03                             ` Yang Xu
  0 siblings, 0 replies; 19+ messages in thread
From: Yang Xu @ 2020-02-25  6:03 UTC (permalink / raw)
  To: Theodore Y. Ts'o, Jan Kara; +Cc: fstests



on 2020/02/20 9:35, Yang Xu wrote:
> 
> 
> on 2020/02/19 23:20, Theodore Y. Ts'o wrote:
>> On Wed, Feb 19, 2020 at 01:43:24PM +0100, Jan Kara wrote:
>>> This means the fsstress command has queued a cache flush request (from
>>> blkdev_issue_flush()); the request has been dispatched to the driver (the
>>> 'D' event) but has never been completed by the driver, so
>>> blkdev_issue_flush() never returns.
>>>
>>> To debug this further, you probably need to start looking into what
>>> happens with the request inside QEMU. There's not much I can help you with
>>> at this point since I'm not an expert there. Do you use an image file as a
>>> backing store or a raw partition?
>>
>> This is looking more and more like a Qemu bug or a host OS issue.
>> Yang is using a two-year-old Qemu (qemu-kvm-2.12.0-32.el8+1900+70997154)
>> with a QCOW backing file.  It could also be a host kernel bug,
>> although that's less likely.
>>
>> Yang, any chance you could upgrade to a newer version of Qemu and see if
>> the problem goes away?  If you have a RHEL support contract, perhaps you
>> could file a support request with the Red Hat help desk?
> Of course, I will update to the latest QEMU and test. (I don't have their support.)
Hi Ted, Jan
     I have updated to the latest QEMU (4.2.0) and it still hangs. But I think
it is not a filesystem problem, because it only fails when the disk uses the
IDE bus; when I use the SCSI bus, it passes. Thanks for your analysis and help.
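
For reference, the working disk definition only changes the target bus (a
rough sketch; the controller/address details are omitted and the device name
is just an example):

     <disk type='file' device='disk'>
       <driver name='qemu' type='qcow2'/>
       <source file='/var/lib/libvirt/images/test.qcow2'/>
       <target dev='sdc' bus='scsi'/>
     </disk>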

Best Regards
Yang Xu
>>
>>                                              - Ted
>>
>>
> 
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2020-02-25  6:04 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-11  8:14 generic/269 hangs on lastest upstream kernel Yang Xu
2020-02-12 10:54 ` Jan Kara
2020-02-13  8:49   ` Yang Xu
2020-02-13 17:08     ` Theodore Y. Ts'o
2020-02-14  1:14       ` Yang Xu
2020-02-14 14:05         ` Theodore Y. Ts'o
     [not found]           ` <7adf16bf-d527-1c25-1a24-b4d5e4d757c4@cn.fujitsu.com>
2020-02-18 14:35             ` Theodore Y. Ts'o
2020-02-19 10:57               ` Yang Xu
2020-02-13 21:10     ` Jan Kara
     [not found]       ` <062ac52c-3a16-22ef-6396-53334ed94783@cn.fujitsu.com>
2020-02-14 15:00         ` Jan Kara
2020-02-18  3:25           ` Yang Xu
2020-02-18  8:24             ` Jan Kara
2020-02-18  9:46               ` Yang Xu
2020-02-18 11:03                 ` Jan Kara
2020-02-19 10:09                   ` Yang Xu
     [not found]                     ` <73af3d5c-ca64-3ad3-aee2-1e78ee4fae4a@cn.fujitsu.com>
2020-02-19 12:43                       ` Jan Kara
2020-02-19 15:20                         ` Theodore Y. Ts'o
2020-02-20  1:35                           ` Yang Xu
2020-02-25  6:03                             ` Yang Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).