on 2020/02/18 16:24, Jan Kara wrote:
> On Tue 18-02-20 11:25:37, Yang Xu wrote:
>> on 2020/02/14 23:00, Jan Kara wrote:
>>> On Fri 14-02-20 18:24:50, Yang Xu wrote:
>>>> on 2020/02/14 5:10, Jan Kara wrote:
>>>>> On Thu 13-02-20 16:49:21, Yang Xu wrote:
>>>>>>>> When I test generic/269 (ext4) on a 5.6.0-rc1 kernel, it hangs.
>>>>>>>> ----------------------------------------------
>>>>>>>> dmesg as below:
>>>>>>>> [   76.506753] run fstests generic/269 at 2020-02-11 05:53:44
>>>>>>>> [   76.955667] EXT4-fs (sdc): mounted filesystem with ordered data mode. Opts: acl, user_xattr
>>>>>>>> [  100.912511] device virbr0-nic left promiscuous mode
>>>>>>>> [  100.912520] virbr0: port 1(virbr0-nic) entered disabled state
>>>>>>>> [  246.801561] INFO: task dd:17284 blocked for more than 122 seconds.
>>>>>>>> [  246.801564]       Not tainted 5.6.0-rc1 #41
>>>>>>>> [  246.801565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>> [  246.801566] dd              D    0 17284  16931 0x00000080
>>>>>>>> [  246.801568] Call Trace:
>>>>>>>> [  246.801584]  ? __schedule+0x251/0x690
>>>>>>>> [  246.801586]  schedule+0x40/0xb0
>>>>>>>> [  246.801588]  wb_wait_for_completion+0x52/0x80
>>>>>>>> [  246.801591]  ? finish_wait+0x80/0x80
>>>>>>>> [  246.801592]  __writeback_inodes_sb_nr+0xaa/0xd0
>>>>>>>> [  246.801593]  try_to_writeback_inodes_sb+0x3c/0x50
>>>>>>>
>>>>>>> Interesting. Does the hang resolve eventually or is the machine hung
>>>>>>> permanently? If the hang is permanent, can you do:
>>>>>>>
>>>>>>>   echo w >/proc/sysrq-trigger
>>>>>>>
>>>>>>> and send us the stacktraces from dmesg? Thanks!
>>>>>> Yes, the hang is permanent, log as below:
>>>> full dmesg as attachment
>>> ...
>>>
>>> Thanks! So the culprit seems to be:
>>>
>>>> [  388.087799] kworker/u12:0   D    0    32      2 0x80004000
>>>> [  388.087803] Workqueue: writeback wb_workfn (flush-8:32)
>>>> [  388.087805] Call Trace:
>>>> [  388.087810]  ? __schedule+0x251/0x690
>>>> [  388.087811]  ? __switch_to_asm+0x34/0x70
>>>> [  388.087812]  ? __switch_to_asm+0x34/0x70
>>>> [  388.087814]  schedule+0x40/0xb0
>>>> [  388.087816]  schedule_timeout+0x20d/0x310
>>>> [  388.087818]  io_schedule_timeout+0x19/0x40
>>>> [  388.087819]  wait_for_completion_io+0x113/0x180
>>>> [  388.087822]  ? wake_up_q+0xa0/0xa0
>>>> [  388.087824]  submit_bio_wait+0x5b/0x80
>>>> [  388.087827]  blkdev_issue_flush+0x81/0xb0
>>>> [  388.087834]  jbd2_cleanup_journal_tail+0x80/0xa0 [jbd2]
>>>> [  388.087837]  jbd2_log_do_checkpoint+0xf4/0x3f0 [jbd2]
>>>> [  388.087840]  __jbd2_log_wait_for_space+0x66/0x190 [jbd2]
>>>> [  388.087843]  ? finish_wait+0x80/0x80
>>>> [  388.087845]  add_transaction_credits+0x27d/0x290 [jbd2]
>>>> [  388.087847]  ? blk_mq_make_request+0x289/0x5d0
>>>> [  388.087849]  start_this_handle+0x10a/0x510 [jbd2]
>>>> [  388.087851]  ? _cond_resched+0x15/0x30
>>>> [  388.087853]  jbd2__journal_start+0xea/0x1f0 [jbd2]
>>>> [  388.087869]  ? ext4_writepages+0x518/0xd90 [ext4]
>>>> [  388.087875]  __ext4_journal_start_sb+0x6e/0x130 [ext4]
>>>> [  388.087883]  ext4_writepages+0x518/0xd90 [ext4]
>>>> [  388.087886]  ? do_writepages+0x41/0xd0
>>>> [  388.087893]  ? ext4_mark_inode_dirty+0x1f0/0x1f0 [ext4]
>>>> [  388.087894]  do_writepages+0x41/0xd0
>>>> [  388.087896]  ? snprintf+0x49/0x60
>>>> [  388.087898]  __writeback_single_inode+0x3d/0x340
>>>> [  388.087899]  writeback_sb_inodes+0x1e5/0x480
>>>> [  388.087901]  wb_writeback+0xfb/0x2f0
>>>> [  388.087902]  wb_workfn+0xf0/0x430
>>>> [  388.087903]  ? __switch_to_asm+0x34/0x70
>>>> [  388.087905]  ? finish_task_switch+0x75/0x250
>>>> [  388.087907]  process_one_work+0x1a7/0x370
>>>> [  388.087909]  worker_thread+0x30/0x380
>>>> [  388.087911]  ? process_one_work+0x370/0x370
>>>> [  388.087912]  kthread+0x10c/0x130
>>>> [  388.087913]  ? kthread_park+0x80/0x80
>>>> [  388.087914]  ret_from_fork+0x35/0x40
>>>
>>> This process is actually waiting for IO to complete while holding the
>>> checkpoint_mutex, which holds up everybody else. The question is why the IO
>>> doesn't complete - that's definitely outside of the filesystem. Maybe a bug
>>> in the block layer, storage driver, or something like that... What does
>>> 'cat /sys/block/<device>/inflight' show?
>> Sorry for the late reply.
>> This value is 0, which means there is no inflight data (but it could be a
>> counting bug or a storage driver bug, right?).
>> Also, it doesn't hang on my physical machine; it only hangs on a vm.
>
> Hum, curious. Just to make sure, did you check sdc (because that appears to
> be the stuck device)?
Yes, I checked sdc; its value is 0.

# cat /sys/block/sdc/inflight
0 0

>
>> So what should I do as the next step (change the storage disk format)?
>
> I'd try a couple of things:
>
> 1) If you mount ext4 with the barrier=0 mount option, does the problem go away?
Yes. With barrier=0, this case doesn't hang.

>
> 2) Can you run the test and at the same time run 'blktrace -d /dev/sdc' to
> gather traces? Once the machine is stuck, abort blktrace, process the
> resulting files with 'blkparse -i sdc' and send here the compressed blkparse
> output. We should be able to see what was happening with the stuck request
> in the trace and maybe that will tell us something.
The log size is too big (58M) and our email limit is 5M.

>
> 								Honza
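
If it helps to get the trace through anyway, here is a rough sketch of how I
could compress and split the blkparse output to stay under the 5M mail limit
(the file names below are assumptions based on the default blktrace/blkparse
naming, i.e. per-CPU sdc.blktrace.* files in the current directory):

  # merge the per-CPU trace files into one text dump and compress it
  blkparse -i sdc -o sdc.blkparse.txt
  xz -9 sdc.blkparse.txt

  # split the compressed archive into pieces below the mail limit
  split -b 4M sdc.blkparse.txt.xz sdc.blkparse.txt.xz.part_

  # receiving side: reassemble and decompress
  cat sdc.blkparse.txt.xz.part_* > sdc.blkparse.txt.xz
  xz -d sdc.blkparse.txt.xz

Would sending the trace split like this (or uploading it somewhere) work for you?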