linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* btrfs: hanging processes - race condition?
@ 2010-06-10 17:41 Jerome Ibanes
  2010-06-11  1:12 ` Shaohua Li
  0 siblings, 1 reply; 12+ messages in thread
From: Jerome Ibanes @ 2010-06-10 17:41 UTC (permalink / raw)
  To: linux-btrfs

List,

I ran into a hang issue (race condition: CPU usage is high while the server is 
idle, meaning that btrfs is hanging, and iowait is high as well) running 
2.6.34 on debian/lenny on an x86_64 server (dual Opteron 275 w/ 16GB RAM). 
The btrfs filesystem lives on 18x300GB SCSI spindles, configured as RAID-0, 
as shown below:

Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
         Total devices 18 FS bytes used 2.94TB
         devid    5 size 279.39GB used 208.33GB path /dev/cciss/c1d0
         devid   17 size 279.39GB used 208.34GB path /dev/cciss/c1d8
         devid   16 size 279.39GB used 209.33GB path /dev/cciss/c1d7
         devid    4 size 279.39GB used 208.33GB path /dev/cciss/c0d4
         devid    1 size 279.39GB used 233.72GB path /dev/cciss/c0d1
         devid   13 size 279.39GB used 208.33GB path /dev/cciss/c1d4
         devid    8 size 279.39GB used 208.33GB path /dev/cciss/c1d11
         devid   12 size 279.39GB used 208.33GB path /dev/cciss/c1d3
         devid    3 size 279.39GB used 208.33GB path /dev/cciss/c0d3
         devid    9 size 279.39GB used 208.33GB path /dev/cciss/c1d12
         devid    6 size 279.39GB used 208.33GB path /dev/cciss/c1d1
         devid   11 size 279.39GB used 208.33GB path /dev/cciss/c1d2
         devid   14 size 279.39GB used 208.33GB path /dev/cciss/c1d5
         devid    2 size 279.39GB used 233.70GB path /dev/cciss/c0d2
         devid   15 size 279.39GB used 209.33GB path /dev/cciss/c1d6
         devid   10 size 279.39GB used 208.33GB path /dev/cciss/c1d13
         devid    7 size 279.39GB used 208.33GB path /dev/cciss/c1d10
         devid   18 size 279.39GB used 208.34GB path /dev/cciss/c1d9
Btrfs v0.19-16-g075587c-dirty

The filesystem, mounted at /mnt/btrfs, is hanging: no existing or new 
process can access it, yet 'df' still displays the disk usage (3TB out 
of 5TB). The disks appear to be physically healthy. Please note that a 
significant number of files, between 20 and 30 million, were placed on 
this filesystem.

The relevant kernel messages are displayed below:

INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-submit- D 000000010042e12f     0  4220      2 0x00000000
  ffff8803e584ac70 0000000000000046 0000000000004000 0000000000011680
  ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 0000000000011680
  0000000000000001 ffff8803ff99d250 ffffffff8149f020 0000000081150ab0
Call Trace:
  [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
  [<ffffffff811470be>] ? get_request_wait+0xab/0x140
  [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
  [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97
  [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27
  [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc
  [<ffffffff81145b16>] ? generic_make_request+0x207/0x268
  [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2
  [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 [btrfs]
  [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f [btrfs]
  [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs]
  [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs]
  [<ffffffff81040331>] ? kthread+0x79/0x81
  [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
  [<ffffffff810402b8>] ? kthread+0x0/0x81
  [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10
INFO: task btrfs-transacti:4230 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-transac D 000000010042e1cc     0  4230      2 0x00000000
  ffff8803e544d300 0000000000000046 0000000000004000 0000000000011680
  ffff8803f3531fd8 ffff8803f3531fd8 ffff8803e544d300 0000000000011680
  ffff8803fe488240 00000000000004c1 ffff8803ff8d7340 0000000381147502
Call Trace:
  [<ffffffff810b4153>] ? sync_buffer+0x0/0x3f
  [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
  [<ffffffff810b418e>] ? sync_buffer+0x3b/0x3f
  [<ffffffff81308fba>] ? __wait_on_bit+0x41/0x70
  [<ffffffff810b4153>] ? sync_buffer+0x0/0x3f
  [<ffffffff81309054>] ? out_of_line_wait_on_bit+0x6b/0x77
  [<ffffffff81040722>] ? wake_bit_function+0x0/0x23
  [<ffffffffa0186635>] ? write_dev_supers+0xf3/0x225 [btrfs]
  [<ffffffffa018693b>] ? write_all_supers+0x1d4/0x22c [btrfs]
  [<ffffffffa01898a1>] ? btrfs_commit_transaction+0x4fe/0x5e1 [btrfs]
  [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
  [<ffffffffa0185628>] ? transaction_kthread+0x16b/0x1fd [btrfs]
  [<ffffffffa01854bd>] ? transaction_kthread+0x0/0x1fd [btrfs]
  [<ffffffff81040331>] ? kthread+0x79/0x81
  [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
  [<ffffffff810402b8>] ? kthread+0x0/0x81
  [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10
INFO: task tar:31615 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tar           D 000000010042dee1     0 31615   4269 0x00000000
  ffff8803ffa74d70 0000000000000082 0000000000004000 0000000000011680
  ffff88010046dfd8 ffff88010046dfd8 ffff8803ffa74d70 0000000000011680
  ffff880361cdd480 0000000161cdd480 ffff8803ff8becf0 0000000200005fff
Call Trace:
  [<ffffffff8106d2af>] ? sync_page+0x0/0x45
  [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
  [<ffffffff8106d2f0>] ? sync_page+0x41/0x45
  [<ffffffff81308fba>] ? __wait_on_bit+0x41/0x70
  [<ffffffff8106d483>] ? wait_on_page_bit+0x6b/0x71
  [<ffffffff81040722>] ? wake_bit_function+0x0/0x23
  [<ffffffffa0192917>] ? prepare_pages+0xe0/0x244 [btrfs]
  [<ffffffffa017e85a>] ? btrfs_check_data_free_space+0x69/0x206 [btrfs]
  [<ffffffffa0192f6a>] ? btrfs_file_write+0x405/0x711 [btrfs]
  [<ffffffff811ab83d>] ? tty_write+0x213/0x22e
  [<ffffffff810951b8>] ? vfs_write+0xad/0x149
  [<ffffffff81095310>] ? sys_write+0x45/0x6e
  [<ffffffff810028eb>] ? system_call_fastpath+0x16/0x1b



Jerome J. Ibanes

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: btrfs: hanging processes - race condition?
  2010-06-10 17:41 btrfs: hanging processes - race condition? Jerome Ibanes
@ 2010-06-11  1:12 ` Shaohua Li
  2010-06-11  2:32   ` Yan, Zheng 
  0 siblings, 1 reply; 12+ messages in thread
From: Shaohua Li @ 2010-06-11  1:12 UTC (permalink / raw)
  To: Jerome Ibanes; +Cc: linux-btrfs, yanmin.zhang, tim.c.chen

On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
> List,
> 
> I ran into a hang issue (race condition: cpu is high when the server is
> idle, meaning that btrfs is hanging, and IOwait is high as well) running
> 2.6.34 on debian/lenny on a x86_64 server (dual Opteron 275 w/ 16GB ram).
> The btrfs filesystem live on 18x300GB scsi spindles, configured as Raid-0,
> as shown below:
> 
> Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
>          Total devices 18 FS bytes used 2.94TB
>          devid    5 size 279.39GB used 208.33GB path /dev/cciss/c1d0
>          devid   17 size 279.39GB used 208.34GB path /dev/cciss/c1d8
>          devid   16 size 279.39GB used 209.33GB path /dev/cciss/c1d7
>          devid    4 size 279.39GB used 208.33GB path /dev/cciss/c0d4
>          devid    1 size 279.39GB used 233.72GB path /dev/cciss/c0d1
>          devid   13 size 279.39GB used 208.33GB path /dev/cciss/c1d4
>          devid    8 size 279.39GB used 208.33GB path /dev/cciss/c1d11
>          devid   12 size 279.39GB used 208.33GB path /dev/cciss/c1d3
>          devid    3 size 279.39GB used 208.33GB path /dev/cciss/c0d3
>          devid    9 size 279.39GB used 208.33GB path /dev/cciss/c1d12
>          devid    6 size 279.39GB used 208.33GB path /dev/cciss/c1d1
>          devid   11 size 279.39GB used 208.33GB path /dev/cciss/c1d2
>          devid   14 size 279.39GB used 208.33GB path /dev/cciss/c1d5
>          devid    2 size 279.39GB used 233.70GB path /dev/cciss/c0d2
>          devid   15 size 279.39GB used 209.33GB path /dev/cciss/c1d6
>          devid   10 size 279.39GB used 208.33GB path /dev/cciss/c1d13
>          devid    7 size 279.39GB used 208.33GB path /dev/cciss/c1d10
>          devid   18 size 279.39GB used 208.34GB path /dev/cciss/c1d9
> Btrfs v0.19-16-g075587c-dirty
> 
> The filesystem, mounted in /mnt/btrfs is hanging, no existing or new
> process can access it, however 'df' still displays the disk usage (3TB out
> of 5). The disks appear to be physically healthy. Please note that a
> significant number of files were placed on this filesystem, between 20 and
> 30 million files.
> 
> The relevant kernel messages are displayed below:
> 
> INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> btrfs-submit- D 000000010042e12f     0  4220      2 0x00000000
>   ffff8803e584ac70 0000000000000046 0000000000004000 0000000000011680
>   ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 0000000000011680
>   0000000000000001 ffff8803ff99d250 ffffffff8149f020 0000000081150ab0
> Call Trace:
>   [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
>   [<ffffffff811470be>] ? get_request_wait+0xab/0x140
>   [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
>   [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97
>   [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27
>   [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc
>   [<ffffffff81145b16>] ? generic_make_request+0x207/0x268
>   [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2
>   [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 [btrfs]
>   [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f [btrfs]
>   [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs]
>   [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs]
>   [<ffffffff81040331>] ? kthread+0x79/0x81
>   [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
>   [<ffffffff810402b8>] ? kthread+0x0/0x81
>   [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10
This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
This is reproducible in our setup.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: btrfs: hanging processes - race condition?
  2010-06-11  1:12 ` Shaohua Li
@ 2010-06-11  2:32   ` Yan, Zheng 
  2010-06-13  6:50     ` Shaohua Li
  2010-06-14 13:26     ` Chris Mason
  0 siblings, 2 replies; 12+ messages in thread
From: Yan, Zheng  @ 2010-06-11  2:32 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Jerome Ibanes, linux-btrfs, yanmin.zhang, tim.c.chen

On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
>> List,
>>
>> I ran into a hang issue (race condition: CPU usage is high while the server is
>> idle, meaning that btrfs is hanging, and iowait is high as well) running
>> 2.6.34 on debian/lenny on an x86_64 server (dual Opteron 275 w/ 16GB RAM).
>> The btrfs filesystem lives on 18x300GB SCSI spindles, configured as RAID-0.
>>
>> [device list and btrfs-submit-0 backtrace trimmed; quoted in full above]
> This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
> This is reproducible in our setup.

I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
The code in the first do-while loop in btrfs_commit_transaction()
sets the current process to TASK_UNINTERRUPTIBLE, then calls
btrfs_start_delalloc_inodes(), btrfs_wait_ordered_extents() and
btrfs_run_ordered_operations(). All of these functions may call
cond_resched().
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: btrfs: hanging processes - race condition?
  2010-06-11  2:32   ` Yan, Zheng 
@ 2010-06-13  6:50     ` Shaohua Li
  2010-06-14 13:28       ` Chris Mason
  2010-06-14 13:26     ` Chris Mason
  1 sibling, 1 reply; 12+ messages in thread
From: Shaohua Li @ 2010-06-13  6:50 UTC (permalink / raw)
  To: Yan, Zheng , chris.mason
  Cc: Jerome Ibanes, linux-btrfs, Zhang, Yanmin, Chen, Tim C

On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng  wrote:
> On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
> >> [original report, device list and backtrace trimmed; quoted in full above]
> > This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
> > This is reproducible in our setup.
>
> I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
> The code in the first do-while loop in btrfs_commit_transaction()
> sets the current process to TASK_UNINTERRUPTIBLE, then calls
> btrfs_start_delalloc_inodes(), btrfs_wait_ordered_extents() and
> btrfs_run_ordered_operations(). All of these functions may call
> cond_resched().
Hi,
When I test random writes, I see a lot of threads jump into btree_writepages(),
do nothing, and IO throughput drops to zero for some time. It looks like there
is a livelock. See the code of btree_writepages():
	if (wbc->sync_mode == WB_SYNC_NONE) {
		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
		u64 num_dirty;
		unsigned long thresh = 32 * 1024 * 1024;

		if (wbc->for_kupdate)
			return 0;

		/* this is a bit racy, but that's ok */
		num_dirty = root->fs_info->dirty_metadata_bytes;
>>>>>>		if (num_dirty < thresh)
			return 0;
	}
The marked line is quite intrusive. In my test, the livelock is caused by the
thresh check while dirty_metadata_bytes < 32MB; without that check, I can't
reproduce the livelock. Not sure if this is related to the hang.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: btrfs: hanging processes - race condition?
  2010-06-11  2:32   ` Yan, Zheng 
  2010-06-13  6:50     ` Shaohua Li
@ 2010-06-14 13:26     ` Chris Mason
  1 sibling, 0 replies; 12+ messages in thread
From: Chris Mason @ 2010-06-14 13:26 UTC (permalink / raw)
  To: Yan, Zheng 
  Cc: Shaohua Li, Jerome Ibanes, linux-btrfs, yanmin.zhang, tim.c.chen

On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng  wrote:
> On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
> >> List,
> >>
> >> I ran into a hang issue (race condition: cpu is high when the server is
> >> idle, meaning that btrfs is hanging, and IOwait is high as well) running
> >> 2.6.34 on debian/lenny on a x86_64 server (dual Opteron 275 w/ 16GB ram).
> >> The btrfs filesystem lives on 18x300GB scsi spindles, configured as Raid-0,
> >> as shown below:
> >>
> >> Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
> >>          Total devices 18 FS bytes used 2.94TB
> >>          devid    5 size 279.39GB used 208.33GB path /dev/cciss/c1d0
> >>          devid   17 size 279.39GB used 208.34GB path /dev/cciss/c1d8
> >>          devid   16 size 279.39GB used 209.33GB path /dev/cciss/c1d7
> >>          devid    4 size 279.39GB used 208.33GB path /dev/cciss/c0d4
> >>          devid    1 size 279.39GB used 233.72GB path /dev/cciss/c0d1
> >>          devid   13 size 279.39GB used 208.33GB path /dev/cciss/c1d4
> >>          devid    8 size 279.39GB used 208.33GB path /dev/cciss/c1d11
> >>          devid   12 size 279.39GB used 208.33GB path /dev/cciss/c1d3
> >>          devid    3 size 279.39GB used 208.33GB path /dev/cciss/c0d3
> >>          devid    9 size 279.39GB used 208.33GB path /dev/cciss/c1d12
> >>          devid    6 size 279.39GB used 208.33GB path /dev/cciss/c1d1
> >>          devid   11 size 279.39GB used 208.33GB path /dev/cciss/c1d2
> >>          devid   14 size 279.39GB used 208.33GB path /dev/cciss/c1d5
> >>          devid    2 size 279.39GB used 233.70GB path /dev/cciss/c0d2
> >>          devid   15 size 279.39GB used 209.33GB path /dev/cciss/c1d6
> >>          devid   10 size 279.39GB used 208.33GB path /dev/cciss/c1d13
> >>          devid    7 size 279.39GB used 208.33GB path /dev/cciss/c1d10
> >>          devid   18 size 279.39GB used 208.34GB path /dev/cciss/c1d9
> >> Btrfs v0.19-16-g075587c-dirty
> >>
> >> The filesystem, mounted at /mnt/btrfs, is hanging: no existing or new
> >> process can access it; however, 'df' still displays the disk usage (3TB out
> >> of 5). The disks appear to be physically healthy. Please note that a
> >> significant number of files were placed on this filesystem, between 20 and
> >> 30 million files.
> >>
> >> The relevant kernel messages are displayed below:
> >>
> >> INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds.
> >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> btrfs-submit- D 000000010042e12f     0  4220      2 0x00000000
> >>   ffff8803e584ac70 0000000000000046 0000000000004000 0000000000011680
> >>   ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 0000000000011680
> >>   0000000000000001 ffff8803ff99d250 ffffffff8149f020 0000000081150ab0
> >> Call Trace:
> >>   [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
> >>   [<ffffffff811470be>] ? get_request_wait+0xab/0x140
> >>   [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
> >>   [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97
> >>   [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27
> >>   [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc
> >>   [<ffffffff81145b16>] ? generic_make_request+0x207/0x268
> >>   [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2
> >>   [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 [btrfs]
> >>   [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f [btrfs]
> >>   [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs]
> >>   [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs]
> >>   [<ffffffff81040331>] ? kthread+0x79/0x81
> >>   [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
> >>   [<ffffffff810402b8>] ? kthread+0x0/0x81
> >>   [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10
> > This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
> > This is reproducible in our setup.
>
> I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
> The code in the first do-while loop in btrfs_commit_transaction
> sets the current process to TASK_UNINTERRUPTIBLE state, then calls
> btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
> btrfs_run_ordered_operations(). All of these functions may call
> cond_resched().

The TASK_UNINTERRUPTIBLE problem was fixed with 2.6.35-rc1.  You can
find the changes in the master branch of the btrfs-unstable repo.

-chris


* Re: btrfs: hanging processes - race condition?
  2010-06-13  6:50     ` Shaohua Li
@ 2010-06-14 13:28       ` Chris Mason
  2010-06-14 18:12         ` Jerome Ibanes
  2010-06-17  1:41         ` Shaohua Li
  0 siblings, 2 replies; 12+ messages in thread
From: Chris Mason @ 2010-06-14 13:28 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Yan, Zheng , Jerome Ibanes, linux-btrfs, Zhang, Yanmin, Chen, Tim C

On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote:
> On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng  wrote:
> > On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > > On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
> > >> List,
> > >>
> > >> I ran into a hang issue (race condition: cpu is high when the server is
> > >> idle, meaning that btrfs is hanging, and IOwait is high as well) running
> > >> 2.6.34 on debian/lenny on a x86_64 server (dual Opteron 275 w/ 16GB ram).
> > >> The btrfs filesystem lives on 18x300GB scsi spindles, configured as Raid-0,
> > >> as shown below:
> > >>
> > >> Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
> > >>          Total devices 18 FS bytes used 2.94TB
> > >>          devid    5 size 279.39GB used 208.33GB path /dev/cciss/c1d0
> > >>          devid   17 size 279.39GB used 208.34GB path /dev/cciss/c1d8
> > >>          devid   16 size 279.39GB used 209.33GB path /dev/cciss/c1d7
> > >>          devid    4 size 279.39GB used 208.33GB path /dev/cciss/c0d4
> > >>          devid    1 size 279.39GB used 233.72GB path /dev/cciss/c0d1
> > >>          devid   13 size 279.39GB used 208.33GB path /dev/cciss/c1d4
> > >>          devid    8 size 279.39GB used 208.33GB path /dev/cciss/c1d11
> > >>          devid   12 size 279.39GB used 208.33GB path /dev/cciss/c1d3
> > >>          devid    3 size 279.39GB used 208.33GB path /dev/cciss/c0d3
> > >>          devid    9 size 279.39GB used 208.33GB path /dev/cciss/c1d12
> > >>          devid    6 size 279.39GB used 208.33GB path /dev/cciss/c1d1
> > >>          devid   11 size 279.39GB used 208.33GB path /dev/cciss/c1d2
> > >>          devid   14 size 279.39GB used 208.33GB path /dev/cciss/c1d5
> > >>          devid    2 size 279.39GB used 233.70GB path /dev/cciss/c0d2
> > >>          devid   15 size 279.39GB used 209.33GB path /dev/cciss/c1d6
> > >>          devid   10 size 279.39GB used 208.33GB path /dev/cciss/c1d13
> > >>          devid    7 size 279.39GB used 208.33GB path /dev/cciss/c1d10
> > >>          devid   18 size 279.39GB used 208.34GB path /dev/cciss/c1d9
> > >> Btrfs v0.19-16-g075587c-dirty
> > >>
> > >> The filesystem, mounted at /mnt/btrfs, is hanging: no existing or new
> > >> process can access it; however, 'df' still displays the disk usage (3TB out
> > >> of 5). The disks appear to be physically healthy. Please note that a
> > >> significant number of files were placed on this filesystem, between 20 and
> > >> 30 million files.
> > >>
> > >> The relevant kernel messages are displayed below:
> > >>
> > >> INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds.
> > >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > >> btrfs-submit- D 000000010042e12f     0  4220      2 0x00000000
> > >>   ffff8803e584ac70 0000000000000046 0000000000004000 0000000000011680
> > >>   ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 0000000000011680
> > >>   0000000000000001 ffff8803ff99d250 ffffffff8149f020 0000000081150ab0
> > >> Call Trace:
> > >>   [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
> > >>   [<ffffffff811470be>] ? get_request_wait+0xab/0x140
> > >>   [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
> > >>   [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97
> > >>   [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27
> > >>   [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc
> > >>   [<ffffffff81145b16>] ? generic_make_request+0x207/0x268
> > >>   [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2
> > >>   [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 [btrfs]
> > >>   [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f [btrfs]
> > >>   [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs]
> > >>   [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs]
> > >>   [<ffffffff81040331>] ? kthread+0x79/0x81
> > >>   [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
> > >>   [<ffffffff810402b8>] ? kthread+0x0/0x81
> > >>   [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10
> > > This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
> > > This is reproducible in our setup.
> >
> > I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
> > The code in the first do-while loop in btrfs_commit_transaction
> > sets the current process to TASK_UNINTERRUPTIBLE state, then calls
> > btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
> > btrfs_run_ordered_operations(). All of these functions may call
> > cond_resched().
> Hi,
> When I test random write, I saw a lot of threads jump into btree_writepages()
> and do nothing, and IO throughput is zero for some time. Looks like there is a
> live lock. See the code of btree_writepages():
> 	if (wbc->sync_mode == WB_SYNC_NONE) {
> 		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
> 		u64 num_dirty;
> 		unsigned long thresh = 32 * 1024 * 1024;
>
> 		if (wbc->for_kupdate)
> 			return 0;
>
> 		/* this is a bit racy, but that's ok */
> 		num_dirty = root->fs_info->dirty_metadata_bytes;
> >>>>>>		if (num_dirty < thresh)
> 			return 0;
> 	}
> The marked line is quite intrusive. In my test, the live lock is caused by the
> thresh check: dirty_metadata_bytes stays below 32M. Without it, I can't see the
> live lock. Not sure if this is related to the hang.

How much ram do you have?  The goal of the check is to avoid writing
metadata blocks because once we write them we have to do more IO to cow
them again if they are changed later.

It shouldn't be looping hard in btrfs there, what was the workload?

-chris


* Re: btrfs: hanging processes - race condition?
  2010-06-14 13:28       ` Chris Mason
@ 2010-06-14 18:12         ` Jerome Ibanes
  2010-06-14 19:08           ` Chris Mason
  2010-06-17  1:41         ` Shaohua Li
  1 sibling, 1 reply; 12+ messages in thread
From: Jerome Ibanes @ 2010-06-14 18:12 UTC (permalink / raw)
  To: Chris Mason
  Cc: Shaohua Li, Yan, Zheng ,
	Jerome Ibanes, linux-btrfs, Zhang, Yanmin, Chen, Tim C

On Mon, 14 Jun 2010, Chris Mason wrote:

> On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote:
>> On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng  wrote:
>>> On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>>>> On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
>>>>> List,
>>>>>
>>>>> I ran into a hang issue (race condition: cpu is high when the server is
>>>>> idle, meaning that btrfs is hanging, and IOwait is high as well) running
>>>>> 2.6.34 on debian/lenny on a x86_64 server (dual Opteron 275 w/ 16GB ram).
>>>>> The btrfs filesystem lives on 18x300GB scsi spindles, configured as Raid-0,
>>>>> as shown below:
>>>>>
>>>>> Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
>>>>>          Total devices 18 FS bytes used 2.94TB
>>>>>          devid    5 size 279.39GB used 208.33GB path /dev/cciss/c1d0
>>>>>          devid   17 size 279.39GB used 208.34GB path /dev/cciss/c1d8
>>>>>          devid   16 size 279.39GB used 209.33GB path /dev/cciss/c1d7
>>>>>          devid    4 size 279.39GB used 208.33GB path /dev/cciss/c0d4
>>>>>          devid    1 size 279.39GB used 233.72GB path /dev/cciss/c0d1
>>>>>          devid   13 size 279.39GB used 208.33GB path /dev/cciss/c1d4
>>>>>          devid    8 size 279.39GB used 208.33GB path /dev/cciss/c1d11
>>>>>          devid   12 size 279.39GB used 208.33GB path /dev/cciss/c1d3
>>>>>          devid    3 size 279.39GB used 208.33GB path /dev/cciss/c0d3
>>>>>          devid    9 size 279.39GB used 208.33GB path /dev/cciss/c1d12
>>>>>          devid    6 size 279.39GB used 208.33GB path /dev/cciss/c1d1
>>>>>          devid   11 size 279.39GB used 208.33GB path /dev/cciss/c1d2
>>>>>          devid   14 size 279.39GB used 208.33GB path /dev/cciss/c1d5
>>>>>          devid    2 size 279.39GB used 233.70GB path /dev/cciss/c0d2
>>>>>          devid   15 size 279.39GB used 209.33GB path /dev/cciss/c1d6
>>>>>          devid   10 size 279.39GB used 208.33GB path /dev/cciss/c1d13
>>>>>          devid    7 size 279.39GB used 208.33GB path /dev/cciss/c1d10
>>>>>          devid   18 size 279.39GB used 208.34GB path /dev/cciss/c1d9
>>>>> Btrfs v0.19-16-g075587c-dirty
>>>>>
>>>>> The filesystem, mounted at /mnt/btrfs, is hanging: no existing or new
>>>>> process can access it; however, 'df' still displays the disk usage (3TB out
>>>>> of 5). The disks appear to be physically healthy. Please note that a
>>>>> significant number of files were placed on this filesystem, between 20 and
>>>>> 30 million files.
>>>>>
>>>>> The relevant kernel messages are displayed below:
>>>>>
>>>>> INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds.
>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> btrfs-submit- D 000000010042e12f     0  4220      2 0x00000000
>>>>>   ffff8803e584ac70 0000000000000046 0000000000004000 0000000000011680
>>>>>   ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 0000000000011680
>>>>>   0000000000000001 ffff8803ff99d250 ffffffff8149f020 0000000081150ab0
>>>>> Call Trace:
>>>>>   [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
>>>>>   [<ffffffff811470be>] ? get_request_wait+0xab/0x140
>>>>>   [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
>>>>>   [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97
>>>>>   [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27
>>>>>   [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc
>>>>>   [<ffffffff81145b16>] ? generic_make_request+0x207/0x268
>>>>>   [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2
>>>>>   [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 [btrfs]
>>>>>   [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f [btrfs]
>>>>>   [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs]
>>>>>   [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs]
>>>>>   [<ffffffff81040331>] ? kthread+0x79/0x81
>>>>>   [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
>>>>>   [<ffffffff810402b8>] ? kthread+0x0/0x81
>>>>>   [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10
>>>> This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
>>>> This is reproducible in our setup.
>>>
>>> I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
>>> The code in the first do-while loop in btrfs_commit_transaction
>>> sets the current process to TASK_UNINTERRUPTIBLE state, then calls
>>> btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
>>> btrfs_run_ordered_operations(). All of these functions may call
>>> cond_resched().
>> Hi,
>> When I test random write, I saw a lot of threads jump into btree_writepages()
>> and do nothing, and IO throughput is zero for some time. Looks like there is a
>> live lock. See the code of btree_writepages():
>> 	if (wbc->sync_mode == WB_SYNC_NONE) {
>> 		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
>> 		u64 num_dirty;
>> 		unsigned long thresh = 32 * 1024 * 1024;
>>
>> 		if (wbc->for_kupdate)
>> 			return 0;
>>
>> 		/* this is a bit racy, but that's ok */
>> 		num_dirty = root->fs_info->dirty_metadata_bytes;
>>>>>>>> 		if (num_dirty < thresh)
>> 			return 0;
>> 	}
>> The marked line is quite intrusive. In my test, the live lock is caused by the
>> thresh check: dirty_metadata_bytes stays below 32M. Without it, I can't see the
>> live lock. Not sure if this is related to the hang.
>
> How much ram do you have?  The goal of the check is to avoid writing
> metadata blocks because once we write them we have to do more IO to cow
> them again if they are changed later.

This server has 16GB of ram on a x86_64 (dual opteron 275, ecc memory).

> It shouldn't be looping hard in btrfs there, what was the workload?

The workload was the extraction of large tarballs (one at a time, about 
300+ files extracted per second from a single tarball, which is pretty 
good). As you might expect, the disks were tested (read and write) for 
physical errors before I reported this bug.


Jerome J. Ibanes


* Re: btrfs: hanging processes - race condition?
  2010-06-14 18:12         ` Jerome Ibanes
@ 2010-06-14 19:08           ` Chris Mason
  2010-06-14 19:13             ` Jerome Ibanes
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Mason @ 2010-06-14 19:08 UTC (permalink / raw)
  To: Jerome Ibanes
  Cc: Shaohua Li, Yan, Zheng , linux-btrfs, Zhang, Yanmin, Chen, Tim C

On Mon, Jun 14, 2010 at 11:12:53AM -0700, Jerome Ibanes wrote:
> On Mon, 14 Jun 2010, Chris Mason wrote:
>
> >On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote:
> >>On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng  wrote:
> >>>On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >>>>On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
> >>>>>List,
> >>>>>
> >>>>>I ran into a hang issue (race condition: cpu is high when the server is
> >>>>>idle, meaning that btrfs is hanging, and IOwait is high as well) running
> >>>>>2.6.34 on debian/lenny on a x86_64 server (dual Opteron 275 w/ 16GB ram).
> >>>>>The btrfs filesystem lives on 18x300GB scsi spindles, configured as Raid-0,
> >>>>>as shown below:
> >>>>>
> >>>>>Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
> >>>>>          Total devices 18 FS bytes used 2.94TB
> >>>>>          devid    5 size 279.39GB used 208.33GB path /dev/cciss/c1d0
> >>>>>          devid   17 size 279.39GB used 208.34GB path /dev/cciss/c1d8
> >>>>>          devid   16 size 279.39GB used 209.33GB path /dev/cciss/c1d7
> >>>>>          devid    4 size 279.39GB used 208.33GB path /dev/cciss/c0d4
> >>>>>          devid    1 size 279.39GB used 233.72GB path /dev/cciss/c0d1
> >>>>>          devid   13 size 279.39GB used 208.33GB path /dev/cciss/c1d4
> >>>>>          devid    8 size 279.39GB used 208.33GB path /dev/cciss/c1d11
> >>>>>          devid   12 size 279.39GB used 208.33GB path /dev/cciss/c1d3
> >>>>>          devid    3 size 279.39GB used 208.33GB path /dev/cciss/c0d3
> >>>>>          devid    9 size 279.39GB used 208.33GB path /dev/cciss/c1d12
> >>>>>          devid    6 size 279.39GB used 208.33GB path /dev/cciss/c1d1
> >>>>>          devid   11 size 279.39GB used 208.33GB path /dev/cciss/c1d2
> >>>>>          devid   14 size 279.39GB used 208.33GB path /dev/cciss/c1d5
> >>>>>          devid    2 size 279.39GB used 233.70GB path /dev/cciss/c0d2
> >>>>>          devid   15 size 279.39GB used 209.33GB path /dev/cciss/c1d6
> >>>>>          devid   10 size 279.39GB used 208.33GB path /dev/cciss/c1d13
> >>>>>          devid    7 size 279.39GB used 208.33GB path /dev/cciss/c1d10
> >>>>>          devid   18 size 279.39GB used 208.34GB path /dev/cciss/c1d9
> >>>>>Btrfs v0.19-16-g075587c-dirty
> >>>>>
> >>>>>The filesystem, mounted at /mnt/btrfs, is hanging: no existing or new
> >>>>>process can access it; however, 'df' still displays the disk usage (3TB out
> >>>>>of 5). The disks appear to be physically healthy. Please note that a
> >>>>>significant number of files were placed on this filesystem, between 20 and
> >>>>>30 million files.
> >>>>>
> >>>>>The relevant kernel messages are displayed below:
> >>>>>
> >>>>>INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds.
> >>>>>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>>>btrfs-submit- D 000000010042e12f     0  4220      2 0x00000000
> >>>>>  ffff8803e584ac70 0000000000000046 0000000000004000 0000000000011680
> >>>>>  ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 0000000000011680
> >>>>>  0000000000000001 ffff8803ff99d250 ffffffff8149f020 0000000081150ab0
> >>>>>Call Trace:
> >>>>>  [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
> >>>>>  [<ffffffff811470be>] ? get_request_wait+0xab/0x140
> >>>>>  [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
> >>>>>  [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97
> >>>>>  [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27
> >>>>>  [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc
> >>>>>  [<ffffffff81145b16>] ? generic_make_request+0x207/0x268
> >>>>>  [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2
> >>>>>  [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 [btrfs]
> >>>>>  [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f [btrfs]
> >>>>>  [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs]
> >>>>>  [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs]
> >>>>>  [<ffffffff81040331>] ? kthread+0x79/0x81
> >>>>>  [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
> >>>>>  [<ffffffff810402b8>] ? kthread+0x0/0x81
> >>>>>  [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10
> >>>>This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
> >>>>This is reproducible in our setup.
> >>>
> >>>I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
> >>>The code in the first do-while loop in btrfs_commit_transaction
> >>>sets the current process to TASK_UNINTERRUPTIBLE state, then calls
> >>>btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
> >>>btrfs_run_ordered_operations(). All of these functions may call
> >>>cond_resched().
> >>Hi,
> >>When I test random write, I saw a lot of threads jump into btree_writepages()
> >>and do nothing, and IO throughput is zero for some time. Looks like there is a
> >>live lock. See the code of btree_writepages():
> >>	if (wbc->sync_mode == WB_SYNC_NONE) {
> >>		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
> >>		u64 num_dirty;
> >>		unsigned long thresh = 32 * 1024 * 1024;
> >>
> >>		if (wbc->for_kupdate)
> >>			return 0;
> >>
> >>		/* this is a bit racy, but that's ok */
> >>		num_dirty = root->fs_info->dirty_metadata_bytes;
> >>>>>>>>		if (num_dirty < thresh)
> >>			return 0;
> >>	}
> >>The marked line is quite intrusive. In my test, the live lock is caused by the
> >>thresh check: dirty_metadata_bytes stays below 32M. Without it, I can't see the
> >>live lock. Not sure if this is related to the hang.
> >
> >How much ram do you have?  The goal of the check is to avoid writing
> >metadata blocks because once we write them we have to do more IO to cow
> >them again if they are changed later.
>
> This server has 16GB of ram on a x86_64 (dual opteron 275, ecc memory).
>
> >It shouldn't be looping hard in btrfs there, what was the workload?
>
> The workload was the extraction of large tarballs (one at a time, about
> 300+ files extracted per second from a single tarball, which is pretty
> good). As you might expect, the disks were tested (read and write) for
> physical errors before I reported this bug.

I think Zheng is right and this one will get fixed by the latest code.
The spinning writepage part should be a different problem.

-chris


* Re: btrfs: hanging processes - race condition?
  2010-06-14 19:08           ` Chris Mason
@ 2010-06-14 19:13             ` Jerome Ibanes
  2010-06-16 18:12               ` Jerome Ibanes
  0 siblings, 1 reply; 12+ messages in thread
From: Jerome Ibanes @ 2010-06-14 19:13 UTC (permalink / raw)
  To: Chris Mason
  Cc: Shaohua Li, Yan, Zheng , linux-btrfs, Zhang, Yanmin, Chen, Tim C

> On Mon, Jun 14, 2010 at 11:12:53AM -0700, Jerome Ibanes wrote:
>> On Mon, 14 Jun 2010, Chris Mason wrote:
>>
>>> On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote:
>>>> On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng  wrote:
>>>>> On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>>>>>> On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
>>>>>>> List,
>>>>>>>
>>>>>>> I ran into a hang issue (race condition: cpu is high when the server is
>>>>>>> idle, meaning that btrfs is hanging, and IOwait is high as well) running
>>>>>>> 2.6.34 on debian/lenny on a x86_64 server (dual Opteron 275 w/ 16GB ram).
>>>>>>> The btrfs filesystem lives on 18x300GB scsi spindles, configured as Raid-0,
>>>>>>> as shown below:
>>>>>>>
>>>>>>> Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
>>>>>>>          Total devices 18 FS bytes used 2.94TB
>>>>>>>          devid    5 size 279.39GB used 208.33GB path /dev/cciss/c1d0
>>>>>>>          devid   17 size 279.39GB used 208.34GB path /dev/cciss/c1d8
>>>>>>>          devid   16 size 279.39GB used 209.33GB path /dev/cciss/c1d7
>>>>>>>          devid    4 size 279.39GB used 208.33GB path /dev/cciss/c0d4
>>>>>>>          devid    1 size 279.39GB used 233.72GB path /dev/cciss/c0d1
>>>>>>>          devid   13 size 279.39GB used 208.33GB path /dev/cciss/c1d4
>>>>>>>          devid    8 size 279.39GB used 208.33GB path /dev/cciss/c1d11
>>>>>>>          devid   12 size 279.39GB used 208.33GB path /dev/cciss/c1d3
>>>>>>>          devid    3 size 279.39GB used 208.33GB path /dev/cciss/c0d3
>>>>>>>          devid    9 size 279.39GB used 208.33GB path /dev/cciss/c1d12
>>>>>>>          devid    6 size 279.39GB used 208.33GB path /dev/cciss/c1d1
>>>>>>>          devid   11 size 279.39GB used 208.33GB path /dev/cciss/c1d2
>>>>>>>          devid   14 size 279.39GB used 208.33GB path /dev/cciss/c1d5
>>>>>>>          devid    2 size 279.39GB used 233.70GB path /dev/cciss/c0d2
>>>>>>>          devid   15 size 279.39GB used 209.33GB path /dev/cciss/c1d6
>>>>>>>          devid   10 size 279.39GB used 208.33GB path /dev/cciss/c1d13
>>>>>>>          devid    7 size 279.39GB used 208.33GB path /dev/cciss/c1d10
>>>>>>>          devid   18 size 279.39GB used 208.34GB path /dev/cciss/c1d9
>>>>>>> Btrfs v0.19-16-g075587c-dirty
>>>>>>>
>>>>>>> The filesystem, mounted at /mnt/btrfs, is hanging: no existing or new
>>>>>>> process can access it; however, 'df' still displays the disk usage (3TB out
>>>>>>> of 5). The disks appear to be physically healthy. Please note that a
>>>>>>> significant number of files were placed on this filesystem, between 20 and
>>>>>>> 30 million files.
>>>>>>>
>>>>>>> The relevant kernel messages are displayed below:
>>>>>>>
>>>>>>> INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds.
>>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>> btrfs-submit- D 000000010042e12f     0  4220      2 0x00000000
>>>>>>>   ffff8803e584ac70 0000000000000046 0000000000004000 0000000000011680
>>>>>>>   ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 0000000000011680
>>>>>>>   0000000000000001 ffff8803ff99d250 ffffffff8149f020 0000000081150ab0
>>>>>>> Call Trace:
>>>>>>>   [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
>>>>>>>   [<ffffffff811470be>] ? get_request_wait+0xab/0x140
>>>>>>>   [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
>>>>>>>   [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97
>>>>>>>   [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27
>>>>>>>   [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc
>>>>>>>   [<ffffffff81145b16>] ? generic_make_request+0x207/0x268
>>>>>>>   [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2
>>>>>>>   [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 [btrfs]
>>>>>>>   [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f [btrfs]
>>>>>>>   [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs]
>>>>>>>   [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs]
>>>>>>>   [<ffffffff81040331>] ? kthread+0x79/0x81
>>>>>>>   [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
>>>>>>>   [<ffffffff810402b8>] ? kthread+0x0/0x81
>>>>>>>   [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10
>>>>>> This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
>>>>>> This is reproducible in our setup.
>>>>>
>>>>> I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
>>>>> The code in the first do-while loop in btrfs_commit_transaction
>>>>> sets the current process to TASK_UNINTERRUPTIBLE state, then calls
>>>>> btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
>>>>> btrfs_run_ordered_operations(). All of these functions may call
>>>>> cond_resched().
>>>> Hi,
>>>> When I test random write, I saw a lot of threads jump into btree_writepages()
>>>> and do nothing, and IO throughput is zero for some time. Looks like there is a
>>>> live lock. See the code of btree_writepages():
>>>> 	if (wbc->sync_mode == WB_SYNC_NONE) {
>>>> 		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
>>>> 		u64 num_dirty;
>>>> 		unsigned long thresh = 32 * 1024 * 1024;
>>>>
>>>> 		if (wbc->for_kupdate)
>>>> 			return 0;
>>>>
>>>> 		/* this is a bit racy, but that's ok */
>>>> 		num_dirty = root->fs_info->dirty_metadata_bytes;
>>>>>>>>>> 		if (num_dirty < thresh)
>>>> 			return 0;
>>>> 	}
>>>> The marked line is quite intrusive. In my test, the live lock is caused by the
>>>> thresh check: dirty_metadata_bytes stays below 32M. Without it, I can't see the
>>>> live lock. Not sure if this is related to the hang.
>>>
>>> How much ram do you have?  The goal of the check is to avoid writing
>>> metadata blocks because once we write them we have to do more IO to cow
>>> them again if they are changed later.
>>
>> This server has 16GB of ram on a x86_64 (dual opteron 275, ecc memory).
>>
>>> It shouldn't be looping hard in btrfs there, what was the workload?
>>
>> The workload was the extraction of large tarballs (one at a time,
>> about 300+ files extracted per second from a single tarball, which is
>> pretty good). As you might expect, the disks were tested (read and
>> write) for physical errors before I reported this bug.
>
> I think Zheng is right and this one will get fixed by the latest code.
> The spinning writepage part should be a different problem.

I'm trying to repro with 2.6.35-rc3; expect results within 24 hours.


Jerome J. Ibanes


* Re: btrfs: hanging processes - race condition?
  2010-06-14 19:13             ` Jerome Ibanes
@ 2010-06-16 18:12               ` Jerome Ibanes
  0 siblings, 0 replies; 12+ messages in thread
From: Jerome Ibanes @ 2010-06-16 18:12 UTC (permalink / raw)
  To: chris.mason, shaohua.li, yanzheng, linux-btrfs, yanmin.zhang, tim.c.chen
  Cc: Nicolas Michal

On Mon, 14 Jun 2010, Jerome Ibanes wrote:

>>  On Mon, Jun 14, 2010 at 11:12:53AM -0700, Jerome Ibanes wrote:
>> >  On Mon, 14 Jun 2010, Chris Mason wrote:
>> > 
>> > >  On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote:
>> > > >  On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng  wrote:
>> > > > >  On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> 
>> > > > >  wrote:
>> > > > > >  On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
>> > > > > > >  List,
>> > > > > > > 
>> > > > > > >  I ran into a hang issue (race condition: cpu is high when the 
>> > > > > > >  server is
>> > > > > > >  idle, meaning that btrfs is hanging, and IOwait is high as 
>> > > > > > >  well) running
>> > > > > > >  2.6.34 on debian/lenny on a x86_64 server (dual Opteron 275 w/ 
>> > > > > > >  16GB ram).
>> > > > > > >  The btrfs filesystem lives on 18x300GB scsi spindles, 
>> > > > > > >  configured as Raid-0,
>> > > > > > >  as shown below:
>> > > > > > > 
>> > > > > > >  Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
>> > > > > > >                    Total devices 18 FS bytes used 2.94TB
>> > > > > > >                    devid    5 size 279.39GB used 208.33GB path 
>> > > > > > >           /dev/cciss/c1d0
>> > > > > > >                    devid   17 size 279.39GB used 208.34GB path 
>> > > > > > >           /dev/cciss/c1d8
>> > > > > > >                    devid   16 size 279.39GB used 209.33GB path 
>> > > > > > >           /dev/cciss/c1d7
>> > > > > > >                    devid    4 size 279.39GB used 208.33GB path 
>> > > > > > >           /dev/cciss/c0d4
>> > > > > > >                    devid    1 size 279.39GB used 233.72GB path 
>> > > > > > >           /dev/cciss/c0d1
>> > > > > > >                    devid   13 size 279.39GB used 208.33GB path 
>> > > > > > >           /dev/cciss/c1d4
>> > > > > > >                    devid    8 size 279.39GB used 208.33GB path 
>> > > > > > >           /dev/cciss/c1d11
>> > > > > > >                    devid   12 size 279.39GB used 208.33GB path 
>> > > > > > >           /dev/cciss/c1d3
>> > > > > > >                    devid    3 size 279.39GB used 208.33GB path 
>> > > > > > >           /dev/cciss/c0d3
>> > > > > > >                    devid    9 size 279.39GB used 208.33GB path 
>> > > > > > >           /dev/cciss/c1d12
>> > > > > > >                    devid    6 size 279.39GB used 208.33GB path 
>> > > > > > >           /dev/cciss/c1d1
>> > > > > > >                    devid   11 size 279.39GB used 208.33GB path 
>> > > > > > >           /dev/cciss/c1d2
>> > > > > > >                    devid   14 size 279.39GB used 208.33GB path 
>> > > > > > >           /dev/cciss/c1d5
>> > > > > > >                    devid    2 size 279.39GB used 233.70GB path 
>> > > > > > >           /dev/cciss/c0d2
>> > > > > > >                    devid   15 size 279.39GB used 209.33GB path 
>> > > > > > >           /dev/cciss/c1d6
>> > > > > > >                    devid   10 size 279.39GB used 208.33GB path 
>> > > > > > >           /dev/cciss/c1d13
>> > > > > > >                    devid    7 size 279.39GB used 208.33GB path 
>> > > > > > >           /dev/cciss/c1d10
>> > > > > > >                    devid   18 size 279.39GB used 208.34GB path 
>> > > > > > >           /dev/cciss/c1d9
>> > > > > > >  Btrfs v0.19-16-g075587c-dirty
>> > > > > > > 
>> > > > > > >  The filesystem, mounted in /mnt/btrfs is hanging, no existing 
>> > > > > > >  or new
>> > > > > > >  process can access it, however 'df' still displays the disk 
>> > > > > > >  usage (3TB out
>> > > > > > >  of 5). The disks appear to be physically healthy. Please note 
>> > > > > > >  that a
>> > > > > > >  significant number of files were placed on this filesystem, 
>> > > > > > >  between 20 and
>> > > > > > >  30 million files.
>> > > > > > > 
>> > > > > > >  The relevant kernel messages are displayed below:
>> > > > > > > 
>> > > > > > >  INFO: task btrfs-submit-0:4220 blocked for more than 120 
>> > > > > > >  seconds.
>> > > > > > >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> > > > > > >  this message.
>> > > > > > >  btrfs-submit- D 000000010042e12f     0  4220      2 0x00000000
>> > > > > > >      ffff8803e584ac70 0000000000000046 0000000000004000 
>> > > > > > >    0000000000011680
>> > > > > > >      ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 
>> > > > > > >    0000000000011680
>> > > > > > >      0000000000000001 ffff8803ff99d250 ffffffff8149f020 
>> > > > > > >    0000000081150ab0
>> > > > > > >  Call Trace:
>> > > > > > >      [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
>> > > > > > >      [<ffffffff811470be>] ? get_request_wait+0xab/0x140
>> > > > > > >      [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
>> > > > > > >      [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97
>> > > > > > >      [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27
>> > > > > > >      [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc
>> > > > > > >      [<ffffffff81145b16>] ? generic_make_request+0x207/0x268
>> > > > > > >      [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2
>> > > > > > >      [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 
>> > > > > > >    [btrfs]
>> > > > > > >      [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f 
>> > > > > > >    [btrfs]
>> > > > > > >      [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs]
>> > > > > > >      [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs]
>> > > > > > >      [<ffffffff81040331>] ? kthread+0x79/0x81
>> > > > > > >      [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
>> > > > > > >      [<ffffffff810402b8>] ? kthread+0x0/0x81
>> > > > > > >      [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10
>> > > > > >  This looks like the issue we saw too, 
>> > > > > >  http://lkml.org/lkml/2010/6/8/375.
>> > > > > >  This is reproducible in our setup.
>> > > > > 
>> > > > >  I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
>> > > > >  The code in the first do-while loop in btrfs_commit_transaction
>> > > > >  sets the current process to TASK_UNINTERRUPTIBLE state, then calls
>> > > > >  btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
>> > > > >  btrfs_run_ordered_operations(). All of these functions may call
>> > > > >  cond_resched().
>> > > >  Hi,
>> > > >  When I test random write, I saw a lot of threads jump into 
>> > > >  btree_writepages()
>> > > >  and do nothing, and I/O throughput is zero for some time. Looks like
>> > > >  there is a live lock. See the code of btree_writepages():
>> > > >   if (wbc->sync_mode == WB_SYNC_NONE) {
>> > > >    struct btrfs_root *root = BTRFS_I(mapping->host)->root;
>> > > >    u64 num_dirty;
>> > > >    unsigned long thresh = 32 * 1024 * 1024;
>> > > > 
>> > > >    if (wbc->for_kupdate)
>> > > >     return 0;
>> > > > 
>> > > >    /* this is a bit racy, but that's ok */
>> > > >    num_dirty = root->fs_info->dirty_metadata_bytes;
>> > > > > > > > > >    if (num_dirty < thresh)
>> > > >   		return 0;
>> > > >  	}
>> > > >  The marked line is quite intrusive. In my test, the live lock is
>> > > >  caused by the thresh check: dirty_metadata_bytes stays below 32MB, so
>> > > >  the function always returns early. Without that check, I can't see the
>> > > >  live lock. Not sure if this is related to the hang.
>> > > 
>> > >  How much ram do you have?  The goal of the check is to avoid writing
>> > >  metadata blocks because once we write them we have to do more IO to 
>> > >  cow
>> > >  them again if they are changed later.
>> > 
>> >  This server has 16GB of ram on an x86_64 (dual Opteron 275, ECC memory).
>> > 
>> > >  It shouldn't be looping hard in btrfs there, what was the workload?
>> > 
>> >  The workload was the extraction of large tarballs (one at a time,
>> >  about 300+ files extracted per second from a single tarball, which is
>> >  pretty good). As you might expect, the disks were tested (read and
>> >  write) for physical errors before I reported this bug.
>>
>>  I think Zheng is right and this one will get fixed by the latest code.
>>  The spinning writepage part should be a different problem.
>
> I'm trying to reproduce this with 2.6.35-rc3; expect results within 24 hours.

I can no longer reproduce this issue (after 48 hours of filesystem stress) 
under 2.6.35-rc3 on a filesystem with over 50 million files. I will reopen 
this thread should it recur.


Jerome J. Ibanes

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: btrfs: hanging processes - race condition?
  2010-06-14 13:28       ` Chris Mason
  2010-06-14 18:12         ` Jerome Ibanes
@ 2010-06-17  1:41         ` Shaohua Li
  2010-06-18  0:57           ` Shaohua Li
  1 sibling, 1 reply; 12+ messages in thread
From: Shaohua Li @ 2010-06-17  1:41 UTC (permalink / raw)
  To: Chris Mason, Yan, Zheng , Jerome Ibanes, linux-btrfs, 

On Mon, Jun 14, 2010 at 09:28:29PM +0800, Chris Mason wrote:
> On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote:
> > On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng  wrote:
> > > > On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > > > On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
> > > >> List,
> > > >>
> > > >> I ran into a hang issue (race condition: cpu is high when the server is
> > > >> idle, meaning that btrfs is hanging, and IOwait is high as well) running
> > > >> 2.6.34 on debian/lenny on an x86_64 server (dual Opteron 275 w/ 16GB ram).
> > > >> The btrfs filesystem lives on 18x300GB scsi spindles, configured as Raid-0,
> > > >> as shown below:
> > > >>
> > > >> Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
> > > >>          Total devices 18 FS bytes used 2.94TB
> > > >>          devid    5 size 279.39GB used 208.33GB path /dev/cciss/c1d0
> > > >>          devid   17 size 279.39GB used 208.34GB path /dev/cciss/c1d8
> > > >>          devid   16 size 279.39GB used 209.33GB path /dev/cciss/c1d7
> > > >>          devid    4 size 279.39GB used 208.33GB path /dev/cciss/c0d4
> > > >>          devid    1 size 279.39GB used 233.72GB path /dev/cciss/c0d1
> > > >>          devid   13 size 279.39GB used 208.33GB path /dev/cciss/c1d4
> > > >>          devid    8 size 279.39GB used 208.33GB path /dev/cciss/c1d11
> > > >>          devid   12 size 279.39GB used 208.33GB path /dev/cciss/c1d3
> > > >>          devid    3 size 279.39GB used 208.33GB path /dev/cciss/c0d3
> > > >>          devid    9 size 279.39GB used 208.33GB path /dev/cciss/c1d12
> > > >>          devid    6 size 279.39GB used 208.33GB path /dev/cciss/c1d1
> > > >>          devid   11 size 279.39GB used 208.33GB path /dev/cciss/c1d2
> > > >>          devid   14 size 279.39GB used 208.33GB path /dev/cciss/c1d5
> > > >>          devid    2 size 279.39GB used 233.70GB path /dev/cciss/c0d2
> > > >>          devid   15 size 279.39GB used 209.33GB path /dev/cciss/c1d6
> > > >>          devid   10 size 279.39GB used 208.33GB path /dev/cciss/c1d13
> > > >>          devid    7 size 279.39GB used 208.33GB path /dev/cciss/c1d10
> > > >>          devid   18 size 279.39GB used 208.34GB path /dev/cciss/c1d9
> > > >> Btrfs v0.19-16-g075587c-dirty
> > > >>
> > > >> The filesystem, mounted in /mnt/btrfs is hanging, no existing or new
> > > >> process can access it, however 'df' still displays the disk usage (3TB out
> > > >> of 5). The disks appear to be physically healthy. Please note that a
> > > >> significant number of files were placed on this filesystem, between 20 and
> > > >> 30 million files.
> > > >>
> > > >> The relevant kernel messages are displayed below:
> > > >>
> > > >> INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds.
> > > >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > >> btrfs-submit- D 000000010042e12f     0  4220      2 0x00000000
> > > >>   ffff8803e584ac70 0000000000000046 0000000000004000 0000000000011680
> > > >>   ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 0000000000011680
> > > >>   0000000000000001 ffff8803ff99d250 ffffffff8149f020 0000000081150ab0
> > > >> Call Trace:
> > > >>   [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
> > > >>   [<ffffffff811470be>] ? get_request_wait+0xab/0x140
> > > >>   [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
> > > >>   [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97
> > > >>   [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27
> > > >>   [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc
> > > >>   [<ffffffff81145b16>] ? generic_make_request+0x207/0x268
> > > >>   [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2
> > > >>   [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 [btrfs]
> > > >>   [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f [btrfs]
> > > >>   [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs]
> > > >>   [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs]
> > > >>   [<ffffffff81040331>] ? kthread+0x79/0x81
> > > >>   [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
> > > >>   [<ffffffff810402b8>] ? kthread+0x0/0x81
> > > >>   [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10
> > > > This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
> > > > This is reproducible in our setup.
> > >
> > > I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
> > > The code in the first do-while loop in btrfs_commit_transaction
> > > sets the current process to TASK_UNINTERRUPTIBLE state, then calls
> > > btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
> > > btrfs_run_ordered_operations(). All of these functions may call
> > > cond_resched().
> > Hi,
> > When I test random write, I saw a lot of threads jump into btree_writepages()
> > and do nothing, and I/O throughput is zero for some time. Looks like there is a
> > live lock. See the code of btree_writepages():
> > 	if (wbc->sync_mode == WB_SYNC_NONE) {
> > 		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
> > 		u64 num_dirty;
> > 		unsigned long thresh = 32 * 1024 * 1024;
> >
> > 		if (wbc->for_kupdate)
> > 			return 0;
> >
> > 		/* this is a bit racy, but that's ok */
> > 		num_dirty = root->fs_info->dirty_metadata_bytes;
> > >>>>>>		if (num_dirty < thresh)
> > 			return 0;
> > 	}
> > The marked line is quite intrusive. In my test, the live lock is caused by the
> > thresh check: dirty_metadata_bytes stays below 32MB, so the function always
> > returns early. Without that check, I can't see the live lock. Not sure if this
> > is related to the hang.
>
> How much ram do you have?  The goal of the check is to avoid writing
> metadata blocks because once we write them we have to do more IO to cow
> them again if they are changed later.
>
> It shouldn't be looping hard in btrfs there, what was the workload?
This is a fio randomwrite. Yep, I limited memory to a small size (~500M),
because that makes it easy for me to reproduce a 'xxx blocked for more than
120 seconds' issue. I can understand that low memory could be an issue, but
this still looks intrusive, right?

The issue Yanmin reported is under 2.6.35-rc1, so it might not be the
'TASK_UNINTERRUPTIBLE' issue, but we will try -rc3 too.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: btrfs: hanging processes - race condition?
  2010-06-17  1:41         ` Shaohua Li
@ 2010-06-18  0:57           ` Shaohua Li
  0 siblings, 0 replies; 12+ messages in thread
From: Shaohua Li @ 2010-06-18  0:57 UTC (permalink / raw)
  To: Chris Mason, Yan, Zheng , Jerome Ibanes, linux-btrfs, 

On Thu, Jun 17, 2010 at 09:41:18AM +0800, Shaohua Li wrote:
> On Mon, Jun 14, 2010 at 09:28:29PM +0800, Chris Mason wrote:
> > On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote:
> > > On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng  wrote:
> > > > > On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > > > > On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
> > > > >> List,
> > > > >>
> > > > >> I ran into a hang issue (race condition: cpu is high when the server is
> > > > >> idle, meaning that btrfs is hanging, and IOwait is high as well) running
> > > > >> 2.6.34 on debian/lenny on an x86_64 server (dual Opteron 275 w/ 16GB ram).
> > > > >> The btrfs filesystem lives on 18x300GB scsi spindles, configured as Raid-0,
> > > > >> as shown below:
> > > > >>
> > > > >> Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
> > > > >>          Total devices 18 FS bytes used 2.94TB
> > > > >>          devid    5 size 279.39GB used 208.33GB path /dev/cciss/c1d0
> > > > >>          devid   17 size 279.39GB used 208.34GB path /dev/cciss/c1d8
> > > > >>          devid   16 size 279.39GB used 209.33GB path /dev/cciss/c1d7
> > > > >>          devid    4 size 279.39GB used 208.33GB path /dev/cciss/c0d4
> > > > >>          devid    1 size 279.39GB used 233.72GB path /dev/cciss/c0d1
> > > > >>          devid   13 size 279.39GB used 208.33GB path /dev/cciss/c1d4
> > > > >>          devid    8 size 279.39GB used 208.33GB path /dev/cciss/c1d11
> > > > >>          devid   12 size 279.39GB used 208.33GB path /dev/cciss/c1d3
> > > > >>          devid    3 size 279.39GB used 208.33GB path /dev/cciss/c0d3
> > > > >>          devid    9 size 279.39GB used 208.33GB path /dev/cciss/c1d12
> > > > >>          devid    6 size 279.39GB used 208.33GB path /dev/cciss/c1d1
> > > > >>          devid   11 size 279.39GB used 208.33GB path /dev/cciss/c1d2
> > > > >>          devid   14 size 279.39GB used 208.33GB path /dev/cciss/c1d5
> > > > >>          devid    2 size 279.39GB used 233.70GB path /dev/cciss/c0d2
> > > > >>          devid   15 size 279.39GB used 209.33GB path /dev/cciss/c1d6
> > > > >>          devid   10 size 279.39GB used 208.33GB path /dev/cciss/c1d13
> > > > >>          devid    7 size 279.39GB used 208.33GB path /dev/cciss/c1d10
> > > > >>          devid   18 size 279.39GB used 208.34GB path /dev/cciss/c1d9
> > > > >> Btrfs v0.19-16-g075587c-dirty
> > > > >>
> > > > >> The filesystem, mounted in /mnt/btrfs is hanging, no existing or new
> > > > >> process can access it, however 'df' still displays the disk usage (3TB out
> > > > >> of 5). The disks appear to be physically healthy. Please note that a
> > > > >> significant number of files were placed on this filesystem, between 20 and
> > > > >> 30 million files.
> > > > >>
> > > > >> The relevant kernel messages are displayed below:
> > > > >>
> > > > >> INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds.
> > > > >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > >> btrfs-submit- D 000000010042e12f     0  4220      2 0x00000000
> > > > >>   ffff8803e584ac70 0000000000000046 0000000000004000 0000000000011680
> > > > >>   ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 0000000000011680
> > > > >>   0000000000000001 ffff8803ff99d250 ffffffff8149f020 0000000081150ab0
> > > > >> Call Trace:
> > > > >>   [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
> > > > >>   [<ffffffff811470be>] ? get_request_wait+0xab/0x140
> > > > >>   [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
> > > > >>   [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97
> > > > >>   [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27
> > > > >>   [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc
> > > > >>   [<ffffffff81145b16>] ? generic_make_request+0x207/0x268
> > > > >>   [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2
> > > > >>   [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 [btrfs]
> > > > >>   [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f [btrfs]
> > > > >>   [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs]
> > > > >>   [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs]
> > > > >>   [<ffffffff81040331>] ? kthread+0x79/0x81
> > > > >>   [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
> > > > >>   [<ffffffff810402b8>] ? kthread+0x0/0x81
> > > > >>   [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10
> > > > > This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
> > > > > This is reproducible in our setup.
> > > >
> > > > I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
> > > > The code in the first do-while loop in btrfs_commit_transaction
> > > > sets the current process to TASK_UNINTERRUPTIBLE state, then calls
> > > > btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
> > > > btrfs_run_ordered_operations(). All of these functions may call
> > > > cond_resched().
> > > Hi,
> > > When I test random write, I saw a lot of threads jump into btree_writepages()
> > > and do nothing, and I/O throughput is zero for some time. Looks like there is a
> > > live lock. See the code of btree_writepages():
> > > 	if (wbc->sync_mode == WB_SYNC_NONE) {
> > > 		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
> > > 		u64 num_dirty;
> > > 		unsigned long thresh = 32 * 1024 * 1024;
> > >
> > > 		if (wbc->for_kupdate)
> > > 			return 0;
> > >
> > > 		/* this is a bit racy, but that's ok */
> > > 		num_dirty = root->fs_info->dirty_metadata_bytes;
> > > >>>>>>		if (num_dirty < thresh)
> > > 			return 0;
> > > 	}
> > > The marked line is quite intrusive. In my test, the live lock is caused by the
> > > thresh check: dirty_metadata_bytes stays below 32MB, so the function always
> > > returns early. Without that check, I can't see the live lock. Not sure if this
> > > is related to the hang.
> >
> > How much ram do you have?  The goal of the check is to avoid writing
> > metadata blocks because once we write them we have to do more IO to cow
> > them again if they are changed later.
> >
> > It shouldn't be looping hard in btrfs there, what was the workload?
> This is a fio randomwrite. Yep, I limited memory to a small size (~500M),
> because that makes it easy for me to reproduce a 'xxx blocked for more than
> 120 seconds' issue. I can understand that low memory could be an issue, but
> this still looks intrusive, right?
>
> The issue Yanmin reported is under 2.6.35-rc1, so it might not be the
> 'TASK_UNINTERRUPTIBLE' issue, but we will try -rc3 too.
I still get the message below with 2.6.35-rc3. The system is still running,
because my fio test finished even with the message.

INFO: task flush-btrfs-134:14144 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-btrfs-1 D 0000000100346d5c  4480 14144      2 0x00000000
 ffff88016fd51530 0000000000000046 0000000000000001 ffff880100000000
 ffff88023ef0f100 0000000000013ac0 0000000000013ac0 0000000000004000
 0000000000013ac0 ffff88018be2dfd8 ffff88016fd51530 ffff88018be2dfd8
Call Trace:
 [<ffffffff8124be61>] ? wait_block_group_cache_progress+0xc0/0xe4
 [<ffffffff81052977>] ? autoremove_wake_function+0x0/0x2a
 [<ffffffff81052977>] ? autoremove_wake_function+0x0/0x2a
 [<ffffffff8124f956>] ? find_free_extent+0x694/0x9c4
 [<ffffffff8124fd53>] ? btrfs_reserve_extent+0xcd/0x189
 [<ffffffff81262803>] ? cow_file_range+0x19e/0x2fc
 [<ffffffff81262fe0>] ? run_delalloc_range+0xa7/0x393
 [<ffffffff8127832e>] ? test_range_bit+0x2b/0x127
 [<ffffffff8127b44e>] ? find_lock_delalloc_range+0x1af/0x1d1
 [<ffffffff8127b655>] ? __extent_writepage+0x1e5/0x61f
 [<ffffffff812c1fac>] ? prio_tree_next+0x1c0/0x221
 [<ffffffff812bf3c8>] ? cpumask_any_but+0x28/0x37
 [<ffffffff810b4af7>] ? page_mkclean+0x120/0x148
 [<ffffffff8127bed2>] ? extent_write_cache_pages.clone.0+0x15e/0x26c
 [<ffffffff8127c0db>] ? extent_writepages+0x41/0x5a
 [<ffffffff81264319>] ? btrfs_get_extent+0x0/0x798
 [<ffffffff810e6fcb>] ? writeback_single_inode+0xd1/0x2e8
 [<ffffffff810e7d02>] ? writeback_inodes_wb+0x40d/0x51f
 [<ffffffff810e7f47>] ? wb_writeback+0x133/0x1b2
 [<ffffffff810e81ae>] ? wb_do_writeback+0x148/0x15e
 [<ffffffff810e81fe>] ? bdi_writeback_task+0x3a/0x113
 [<ffffffff81052897>] ? bit_waitqueue+0x14/0xa4
 [<ffffffff810a9b19>] ? bdi_start_fn+0x0/0xc2
 [<ffffffff810a9b7c>] ? bdi_start_fn+0x63/0xc2
 [<ffffffff810a9b19>] ? bdi_start_fn+0x0/0xc2
 [<ffffffff81052525>] ? kthread+0x75/0x7d
 [<ffffffff81003654>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff810524b0>] ? kthread+0x0/0x7d
 [<ffffffff81003650>] ? kernel_thread_helper+0x0/0x10

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2010-06-18  0:57 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-10 17:41 btrfs: hanging processes - race condition? Jerome Ibanes
2010-06-11  1:12 ` Shaohua Li
2010-06-11  2:32   ` Yan, Zheng 
2010-06-13  6:50     ` Shaohua Li
2010-06-14 13:28       ` Chris Mason
2010-06-14 18:12         ` Jerome Ibanes
2010-06-14 19:08           ` Chris Mason
2010-06-14 19:13             ` Jerome Ibanes
2010-06-16 18:12               ` Jerome Ibanes
2010-06-17  1:41         ` Shaohua Li
2010-06-18  0:57           ` Shaohua Li
2010-06-14 13:26     ` Chris Mason

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).