* btrfs: hanging processes - race condition?
@ 2010-06-10 17:41 Jerome Ibanes
  2010-06-11  1:12 ` Shaohua Li
  0 siblings, 1 reply; 12+ messages in thread

From: Jerome Ibanes @ 2010-06-10 17:41 UTC
To: linux-btrfs

List,

I ran into a hang issue (race condition: CPU and iowait are both high while
the server is idle, meaning that btrfs is hanging) running 2.6.34 on
debian/lenny on an x86_64 server (dual Opteron 275 w/ 16GB RAM). The btrfs
filesystem lives on 18x300GB SCSI spindles, configured as RAID-0, as shown
below:

Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
        Total devices 18 FS bytes used 2.94TB
        devid    5 size 279.39GB used 208.33GB path /dev/cciss/c1d0
        devid   17 size 279.39GB used 208.34GB path /dev/cciss/c1d8
        devid   16 size 279.39GB used 209.33GB path /dev/cciss/c1d7
        devid    4 size 279.39GB used 208.33GB path /dev/cciss/c0d4
        devid    1 size 279.39GB used 233.72GB path /dev/cciss/c0d1
        devid   13 size 279.39GB used 208.33GB path /dev/cciss/c1d4
        devid    8 size 279.39GB used 208.33GB path /dev/cciss/c1d11
        devid   12 size 279.39GB used 208.33GB path /dev/cciss/c1d3
        devid    3 size 279.39GB used 208.33GB path /dev/cciss/c0d3
        devid    9 size 279.39GB used 208.33GB path /dev/cciss/c1d12
        devid    6 size 279.39GB used 208.33GB path /dev/cciss/c1d1
        devid   11 size 279.39GB used 208.33GB path /dev/cciss/c1d2
        devid   14 size 279.39GB used 208.33GB path /dev/cciss/c1d5
        devid    2 size 279.39GB used 233.70GB path /dev/cciss/c0d2
        devid   15 size 279.39GB used 209.33GB path /dev/cciss/c1d6
        devid   10 size 279.39GB used 208.33GB path /dev/cciss/c1d13
        devid    7 size 279.39GB used 208.33GB path /dev/cciss/c1d10
        devid   18 size 279.39GB used 208.34GB path /dev/cciss/c1d9
Btrfs v0.19-16-g075587c-dirty

The filesystem, mounted on /mnt/btrfs, is hanging: no existing or new
process can access it, although 'df' still displays the disk usage (3TB out
of 5). The disks appear to be physically healthy.
Please note that a significant number of files were placed on this
filesystem, between 20 and 30 million files.

The relevant kernel messages are displayed below:

INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-submit- D 000000010042e12f     0  4220      2 0x00000000
 ffff8803e584ac70 0000000000000046 0000000000004000 0000000000011680
 ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 0000000000011680
 0000000000000001 ffff8803ff99d250 ffffffff8149f020 0000000081150ab0
Call Trace:
 [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
 [<ffffffff811470be>] ? get_request_wait+0xab/0x140
 [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97
 [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27
 [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc
 [<ffffffff81145b16>] ? generic_make_request+0x207/0x268
 [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2
 [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 [btrfs]
 [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f [btrfs]
 [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs]
 [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs]
 [<ffffffff81040331>] ? kthread+0x79/0x81
 [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff810402b8>] ? kthread+0x0/0x81
 [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10
INFO: task btrfs-transacti:4230 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-transac D 000000010042e1cc     0  4230      2 0x00000000
 ffff8803e544d300 0000000000000046 0000000000004000 0000000000011680
 ffff8803f3531fd8 ffff8803f3531fd8 ffff8803e544d300 0000000000011680
 ffff8803fe488240 00000000000004c1 ffff8803ff8d7340 0000000381147502
Call Trace:
 [<ffffffff810b4153>] ? sync_buffer+0x0/0x3f
 [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
 [<ffffffff810b418e>] ? sync_buffer+0x3b/0x3f
 [<ffffffff81308fba>] ? __wait_on_bit+0x41/0x70
 [<ffffffff810b4153>] ? sync_buffer+0x0/0x3f
 [<ffffffff81309054>] ? out_of_line_wait_on_bit+0x6b/0x77
 [<ffffffff81040722>] ? wake_bit_function+0x0/0x23
 [<ffffffffa0186635>] ? write_dev_supers+0xf3/0x225 [btrfs]
 [<ffffffffa018693b>] ? write_all_supers+0x1d4/0x22c [btrfs]
 [<ffffffffa01898a1>] ? btrfs_commit_transaction+0x4fe/0x5e1 [btrfs]
 [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffffa0185628>] ? transaction_kthread+0x16b/0x1fd [btrfs]
 [<ffffffffa01854bd>] ? transaction_kthread+0x0/0x1fd [btrfs]
 [<ffffffff81040331>] ? kthread+0x79/0x81
 [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff810402b8>] ? kthread+0x0/0x81
 [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10
INFO: task tar:31615 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tar           D 000000010042dee1     0 31615   4269 0x00000000
 ffff8803ffa74d70 0000000000000082 0000000000004000 0000000000011680
 ffff88010046dfd8 ffff88010046dfd8 ffff8803ffa74d70 0000000000011680
 ffff880361cdd480 0000000161cdd480 ffff8803ff8becf0 0000000200005fff
Call Trace:
 [<ffffffff8106d2af>] ? sync_page+0x0/0x45
 [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
 [<ffffffff8106d2f0>] ? sync_page+0x41/0x45
 [<ffffffff81308fba>] ? __wait_on_bit+0x41/0x70
 [<ffffffff8106d483>] ? wait_on_page_bit+0x6b/0x71
 [<ffffffff81040722>] ? wake_bit_function+0x0/0x23
 [<ffffffffa0192917>] ? prepare_pages+0xe0/0x244 [btrfs]
 [<ffffffffa017e85a>] ? btrfs_check_data_free_space+0x69/0x206 [btrfs]
 [<ffffffffa0192f6a>] ? btrfs_file_write+0x405/0x711 [btrfs]
 [<ffffffff811ab83d>] ? tty_write+0x213/0x22e
 [<ffffffff810951b8>] ? vfs_write+0xad/0x149
 [<ffffffff81095310>] ? sys_write+0x45/0x6e
 [<ffffffff810028eb>] ? system_call_fastpath+0x16/0x1b

[The same three hung-task reports (btrfs-submit-0:4220, btrfs-transacti:4230
and tar:31615) repeat twice more with identical backtraces.]

Jerome J. Ibanes
* Re: btrfs: hanging processes - race condition?
  2010-06-10 17:41 btrfs: hanging processes - race condition? Jerome Ibanes
@ 2010-06-11  1:12 ` Shaohua Li
  2010-06-11  2:32   ` Yan, Zheng
  0 siblings, 1 reply; 12+ messages in thread

From: Shaohua Li @ 2010-06-11 1:12 UTC
To: Jerome Ibanes; +Cc: linux-btrfs, yanmin.zhang, tim.c.chen

On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
> List,
>
> I ran into a hang issue (race condition: cpu is high when the server is
> idle, meaning that btrfs is hanging, and IOwait is high as well) running
> 2.6.34 on debian/lenny on a x86_64 server (dual Opteron 275 w/ 16GB ram).
> The btrfs filesystem live on 18x300GB scsi spindles, configured as Raid-0
> [filesystem listing and btrfs-submit-0 backtrace quoted in full above]

This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
This is reproducible in our setup.

Thanks,
Shaohua
* Re: btrfs: hanging processes - race condition?
  2010-06-11  1:12 ` Shaohua Li
@ 2010-06-11  2:32   ` Yan, Zheng
  2010-06-13  6:50     ` Shaohua Li
  2010-06-14 13:26     ` Chris Mason
  0 siblings, 2 replies; 12+ messages in thread

From: Yan, Zheng @ 2010-06-11 2:32 UTC
To: Shaohua Li; +Cc: Jerome Ibanes, linux-btrfs, yanmin.zhang, tim.c.chen

On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
>> [original report and backtraces quoted in full above]
> This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
> This is reproducible in our setup.

I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
The code in the first do-while loop in btrfs_commit_transaction
sets the current process to TASK_UNINTERRUPTIBLE state, then calls
btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
btrfs_run_ordered_operations(). All of these functions may call
cond_resched().

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
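[Editorial note: Yan's scenario can be sketched as a condensed, kernel-style
fragment. This is illustrative pseudocode only, not the actual
btrfs_commit_transaction code; names and structure are simplified.]

```c
/* Pseudocode sketch of the suspected pattern (not the real btrfs code) */
do {
        /* The wait protocol: mark ourselves as a sleeper *before*
         * re-checking the condition, so a concurrent wakeup between
         * the check and schedule() is not lost. */
        prepare_to_wait(&cur_trans->writer_wait, &wait,
                        TASK_UNINTERRUPTIBLE);

        /* Yan's observation: each of these helpers may call
         * cond_resched() (and may block internally). Sleeping or
         * yielding here, while the task state is already
         * TASK_UNINTERRUPTIBLE, breaks the protocol above: the task
         * can be scheduled out in D state early, or its state can be
         * clobbered, so the later schedule()/wakeup handshake is no
         * longer reliable. */
        btrfs_run_ordered_operations(root, 1);
        btrfs_start_delalloc_inodes(root);
        btrfs_wait_ordered_extents(root, 1);

        schedule();
        finish_wait(&cur_trans->writer_wait, &wait);
} while (writers_still_active(cur_trans));
```

This matches the symptom in the hung-task reports: kernel threads parked in
D state with no one left to wake them.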
* Re: btrfs: hanging processes - race condition?
  2010-06-11  2:32   ` Yan, Zheng
@ 2010-06-13  6:50     ` Shaohua Li
  2010-06-14 13:28       ` Chris Mason
  0 siblings, 1 reply; 12+ messages in thread

From: Shaohua Li @ 2010-06-13 6:50 UTC
To: Yan, Zheng, chris.mason
Cc: Jerome Ibanes, linux-btrfs, Zhang, Yanmin, Chen, Tim C

On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng wrote:
> On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > [original report and backtraces quoted in full above]
> > This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
> > This is reproducible in our setup.
>
> I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
> The code in the first do-while loop in btrfs_commit_transaction
> sets the current process to TASK_UNINTERRUPTIBLE state, then calls
> btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
> btrfs_run_ordered_operations(). All of these functions may call
> cond_resched().

Hi,
When I test random write, I saw a lot of threads jump into
btree_writepages() and do nothing, and IO throughput is zero for some
time. Looks like there is a live lock. See the code of btree_writepages():

	if (wbc->sync_mode == WB_SYNC_NONE) {
		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
		u64 num_dirty;
		unsigned long thresh = 32 * 1024 * 1024;

		if (wbc->for_kupdate)
			return 0;

		/* this is a bit racy, but that's ok */
		num_dirty = root->fs_info->dirty_metadata_bytes;
>>>>>>		if (num_dirty < thresh)
			return 0;
	}

The marked line is quite intrusive. In my test, the live lock is caused
by the thresh check: dirty_metadata_bytes < 32M. Without it, I can't see
the live lock. Not sure if this is related to the hang.

Thanks,
Shaohua
* Re: btrfs: hanging processes - race condition?
  2010-06-13  6:50     ` Shaohua Li
@ 2010-06-14 13:28       ` Chris Mason
  2010-06-14 18:12         ` Jerome Ibanes
  2010-06-17  1:41         ` Shaohua Li
  0 siblings, 2 replies; 12+ messages in thread

From: Chris Mason @ 2010-06-14 13:28 UTC
To: Shaohua Li
Cc: Yan, Zheng, Jerome Ibanes, linux-btrfs, Zhang, Yanmin, Chen, Tim C

On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote:
> On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng wrote:
> > [original report, backtraces and analysis quoted in full above]
>
> Hi,
> When I test random write, I saw a lot of threads jump into
> btree_writepages() and do nothing, and IO throughput is zero for some
> time. Looks like there is a live lock. See the code of btree_writepages():
> [btree_writepages() excerpt quoted above]
> The marked line is quite intrusive. In my test, the live lock is caused
> by the thresh check: dirty_metadata_bytes < 32M. Without it, I can't see
> the live lock. Not sure if this is related to the hang.

How much ram do you have? The goal of the check is to avoid writing
metadata blocks because once we write them we have to do more IO to cow
them again if they are changed later.

It shouldn't be looping hard in btrfs there, what was the workload?

-chris
* Re: btrfs: hanging processes - race condition? 2010-06-14 13:28 ` Chris Mason @ 2010-06-14 18:12 ` Jerome Ibanes 2010-06-14 19:08 ` Chris Mason 2010-06-17 1:41 ` Shaohua Li 1 sibling, 1 reply; 12+ messages in thread From: Jerome Ibanes @ 2010-06-14 18:12 UTC (permalink / raw) To: Chris Mason Cc: Shaohua Li, Yan, Zheng , Jerome Ibanes, linux-btrfs, Zhang, Yanmin, Chen, Tim C [-- Attachment #1: Type: TEXT/PLAIN, Size: 6012 bytes --] On Mon, 14 Jun 2010, Chris Mason wrote: > On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote: >> On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng wrote: >>> On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote: >>>> On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote: >>>>> List, >>>>> >>>>> I ran into a hang issue (race condition: cpu is high when the server is >>>>> idle, meaning that btrfs is hanging, and IOwait is high as well) running >>>>> 2.6.34 on debian/lenny on a x86_64 server (dual Opteron 275 w/ 16GB ram). 
>>>>> The btrfs filesystem live on 18x300GB scsi spindles, configured as Raid-0, >>>>> as shown below: >>>>> >>>>> Label: none uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52 >>>>> Total devices 18 FS bytes used 2.94TB >>>>> devid 5 size 279.39GB used 208.33GB path /dev/cciss/c1d0 >>>>> devid 17 size 279.39GB used 208.34GB path /dev/cciss/c1d8 >>>>> devid 16 size 279.39GB used 209.33GB path /dev/cciss/c1d7 >>>>> devid 4 size 279.39GB used 208.33GB path /dev/cciss/c0d4 >>>>> devid 1 size 279.39GB used 233.72GB path /dev/cciss/c0d1 >>>>> devid 13 size 279.39GB used 208.33GB path /dev/cciss/c1d4 >>>>> devid 8 size 279.39GB used 208.33GB path /dev/cciss/c1d11 >>>>> devid 12 size 279.39GB used 208.33GB path /dev/cciss/c1d3 >>>>> devid 3 size 279.39GB used 208.33GB path /dev/cciss/c0d3 >>>>> devid 9 size 279.39GB used 208.33GB path /dev/cciss/c1d12 >>>>> devid 6 size 279.39GB used 208.33GB path /dev/cciss/c1d1 >>>>> devid 11 size 279.39GB used 208.33GB path /dev/cciss/c1d2 >>>>> devid 14 size 279.39GB used 208.33GB path /dev/cciss/c1d5 >>>>> devid 2 size 279.39GB used 233.70GB path /dev/cciss/c0d2 >>>>> devid 15 size 279.39GB used 209.33GB path /dev/cciss/c1d6 >>>>> devid 10 size 279.39GB used 208.33GB path /dev/cciss/c1d13 >>>>> devid 7 size 279.39GB used 208.33GB path /dev/cciss/c1d10 >>>>> devid 18 size 279.39GB used 208.34GB path /dev/cciss/c1d9 >>>>> Btrfs v0.19-16-g075587c-dirty >>>>> >>>>> The filesystem, mounted in /mnt/btrfs is hanging, no existing or new >>>>> process can access it, however 'df' still displays the disk usage (3TB out >>>>> of 5). The disks appear to be physically healthy. Please note that a >>>>> significant number of files were placed on this filesystem, between 20 and >>>>> 30 million files. >>>>> >>>>> The relevant kernel messages are displayed below: >>>>> >>>>> INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds. >>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
>>>>> btrfs-submit- D 000000010042e12f 0 4220 2 0x00000000 >>>>> ffff8803e584ac70 0000000000000046 0000000000004000 0000000000011680 >>>>> ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 0000000000011680 >>>>> 0000000000000001 ffff8803ff99d250 ffffffff8149f020 0000000081150ab0 >>>>> Call Trace: >>>>> [<ffffffff813089f3>] ? io_schedule+0x71/0xb1 >>>>> [<ffffffff811470be>] ? get_request_wait+0xab/0x140 >>>>> [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e >>>>> [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97 >>>>> [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27 >>>>> [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc >>>>> [<ffffffff81145b16>] ? generic_make_request+0x207/0x268 >>>>> [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2 >>>>> [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 [btrfs] >>>>> [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f [btrfs] >>>>> [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs] >>>>> [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs] >>>>> [<ffffffff81040331>] ? kthread+0x79/0x81 >>>>> [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10 >>>>> [<ffffffff810402b8>] ? kthread+0x0/0x81 >>>>> [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10 >>>> This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375. >>>> This is reproduceable in our setup. >>> >>> I think I know the cause of http://lkml.org/lkml/2010/6/8/375. >>> The code in the first do-while loop in btrfs_commit_transaction >>> set current process to TASK_UNINTERRUPTIBLE state, then calls >>> btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and >>> btrfs_run_ordered_operations(). All of these function may call >>> cond_resched(). >> Hi, >> When I test random write, I saw a lot of threads jump into btree_writepages() >> and do noting and io throughput is zero for some time. Looks like there is a >> live lock. 
See the code of btree_writepages(): >> if (wbc->sync_mode == WB_SYNC_NONE) { >> struct btrfs_root *root = BTRFS_I(mapping->host)->root; >> u64 num_dirty; >> unsigned long thresh = 32 * 1024 * 1024; >> >> if (wbc->for_kupdate) >> return 0; >> >> /* this is a bit racy, but that's ok */ >> num_dirty = root->fs_info->dirty_metadata_bytes; >>>>>>>> if (num_dirty < thresh) >> return 0; >> } >> The marked line is quite intrusive. In my test, the live lock is caused by the thresh >> check. The dirty_metadata_bytes < 32M. Without it, I can't see the live lock. Not >> sure if this is related to the hang. > > How much ram do you have? The goal of the check is to avoid writing > metadata blocks because once we write them we have to do more IO to cow > them again if they are changed later. This server has 16GB of ram on a x86_64 (dual opteron 275, ecc memory). > It shouldn't be looping hard in btrfs there, what was the workload? The workload was the extraction of large tarballs (one at a time, about 300+ files extracted per second from a single tarball, which is pretty good). As you might expect, the disks were tested (read and write) for physical errors before I reported this bug. Jerome J. Ibanes ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: btrfs: hanging processes - race condition? 2010-06-14 18:12 ` Jerome Ibanes @ 2010-06-14 19:08 ` Chris Mason 2010-06-14 19:13 ` Jerome Ibanes 0 siblings, 1 reply; 12+ messages in thread From: Chris Mason @ 2010-06-14 19:08 UTC (permalink / raw) To: Jerome Ibanes Cc: Shaohua Li, Yan, Zheng , linux-btrfs, Zhang, Yanmin, Chen, Tim C On Mon, Jun 14, 2010 at 11:12:53AM -0700, Jerome Ibanes wrote: > On Mon, 14 Jun 2010, Chris Mason wrote: > > >On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote: > >>On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng wrote: > >>>On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >>>>On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote: > >>>>>List, > >>>>> > >>>>>I ran into a hang issue (race condition: cpu is high when the server is > >>>>>idle, meaning that btrfs is hanging, and IOwait is high as well) running > >>>>>2.6.34 on debian/lenny on a x86_64 server (dual Opteron 275 w/ 16GB ram). > >>>>>The btrfs filesystem live on 18x300GB scsi spindles, configured as Raid-0, > >>>>>as shown below: > >>>>> > >>>>>Label: none uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52 > >>>>>Total devices 18 FS bytes used 2.94TB > >>>>>devid 5 size 279.39GB used 208.33GB path /dev/cciss/c1d0 > >>>>>devid 17 size 279.39GB used 208.34GB path /dev/cciss/c1d8 > >>>>>devid 16 size 279.39GB used 209.33GB path /dev/cciss/c1d7 > >>>>>devid 4 size 279.39GB used 208.33GB path /dev/cciss/c0d4 > >>>>>devid 1 size 279.39GB used 233.72GB path /dev/cciss/c0d1 > >>>>>devid 13 size 279.39GB used 208.33GB path /dev/cciss/c1d4 > >>>>>devid 8 size 279.39GB used 208.33GB path /dev/cciss/c1d11 > >>>>>devid 12 size 279.39GB used 208.33GB path /dev/cciss/c1d3 > >>>>>devid 3 size 279.39GB used 208.33GB path /dev/cciss/c0d3 > >>>>>devid 9 size 279.39GB used 208.33GB path /dev/cciss/c1d12 > >>>>>devid 6 size 279.39GB used 208.33GB path /dev/cciss/c1d1 > >>>>>devid 11 size 279.39GB used 208.33GB path /dev/cciss/c1d2 > >>>>>devid 14 size 279.39GB used 208.33GB path /dev/cciss/c1d5 > >>>>>devid 2 size 279.39GB used 233.70GB path /dev/cciss/c0d2 > >>>>>devid 15 size 279.39GB used 209.33GB path /dev/cciss/c1d6 > >>>>>devid 10 size 279.39GB used 208.33GB path /dev/cciss/c1d13 > >>>>>devid 7 size 279.39GB used 208.33GB path /dev/cciss/c1d10 > >>>>>devid 18 size 279.39GB used 208.34GB path /dev/cciss/c1d9 > >>>>>Btrfs v0.19-16-g075587c-dirty > >>>>> > >>>>>The filesystem, mounted in /mnt/btrfs is hanging, no existing or new > >>>>>process can access it, however 'df' still displays the disk usage (3TB out > >>>>>of 5). The disks appear to be physically healthy. Please note that a > >>>>>significant number of files were placed on this filesystem, between 20 and > >>>>>30 million files. > >>>>> > >>>>>The relevant kernel messages are displayed below: > >>>>> > >>>>>INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds. > >>>>>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > >>>>>btrfs-submit- D 000000010042e12f 0 4220 2 0x00000000 > >>>>>ffff8803e584ac70 0000000000000046 0000000000004000 0000000000011680 > >>>>>ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 0000000000011680 > >>>>>0000000000000001 ffff8803ff99d250 ffffffff8149f020 0000000081150ab0 > >>>>>Call Trace: > >>>>>[<ffffffff813089f3>] ? io_schedule+0x71/0xb1 > >>>>>[<ffffffff811470be>] ? get_request_wait+0xab/0x140 > >>>>>[<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e > >>>>>[<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97 > >>>>>[<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27 > >>>>>[<ffffffff81147429>] ? __make_request+0x2d6/0x3fc > >>>>>[<ffffffff81145b16>] ? generic_make_request+0x207/0x268 > >>>>>[<ffffffff81145c12>] ? submit_bio+0x9b/0xa2 > >>>>>[<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 [btrfs] > >>>>>[<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f [btrfs] > >>>>>[<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs] > >>>>>[<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs] > >>>>>[<ffffffff81040331>] ? kthread+0x79/0x81 > >>>>>[<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10 > >>>>>[<ffffffff810402b8>] ? kthread+0x0/0x81 > >>>>>[<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10 > >>>>This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375. > >>>>This is reproduceable in our setup. > >>> > >>>I think I know the cause of http://lkml.org/lkml/2010/6/8/375. > >>>The code in the first do-while loop in btrfs_commit_transaction > >>>set current process to TASK_UNINTERRUPTIBLE state, then calls > >>>btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and > >>>btrfs_run_ordered_operations(). All of these function may call > >>>cond_resched(). > >>Hi, > >>When I test random write, I saw a lot of threads jump into btree_writepages() > >>and do noting and io throughput is zero for some time. Looks like there is a > >>live lock. See the code of btree_writepages(): > >> if (wbc->sync_mode == WB_SYNC_NONE) { > >> struct btrfs_root *root = BTRFS_I(mapping->host)->root; > >> u64 num_dirty; > >> unsigned long thresh = 32 * 1024 * 1024; > >> > >> if (wbc->for_kupdate) > >> return 0; > >> > >> /* this is a bit racy, but that's ok */ > >> num_dirty = root->fs_info->dirty_metadata_bytes; > >>>>>>>> if (num_dirty < thresh) > >> return 0; > >> } > >>The marked line is quite intrusive. In my test, the live lock is caused by the thresh > >>check. The dirty_metadata_bytes < 32M. Without it, I can't see the live lock. Not > >>sure if this is related to the hang. > > > >How much ram do you have? The goal of the check is to avoid writing > >metadata blocks because once we write them we have to do more IO to cow > >them again if they are changed later. > > This server has 16GB of ram on a x86_64 (dual opteron 275, ecc memory). > > >It shouldn't be looping hard in btrfs there, what was the workload? > > The workload was the extraction of large tarballs (one at the time, > about 300+ files extracted by second from a single tarball, which is > pretty good), as you might expect, the disks were tested (read and > write) for physical errors before I report this bug. I think Zheng is right and this one will get fixed by the latest code. The spinning writepage part should be a different problem. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: btrfs: hanging processes - race condition? 2010-06-14 19:08 ` Chris Mason @ 2010-06-14 19:13 ` Jerome Ibanes 2010-06-16 18:12 ` Jerome Ibanes 0 siblings, 1 reply; 12+ messages in thread From: Jerome Ibanes @ 2010-06-14 19:13 UTC (permalink / raw) To: Chris Mason Cc: Shaohua Li, Yan, Zheng , linux-btrfs, Zhang, Yanmin, Chen, Tim C [-- Attachment #1: Type: TEXT/PLAIN, Size: 6510 bytes --] > On Mon, Jun 14, 2010 at 11:12:53AM -0700, Jerome Ibanes wrote: >> On Mon, 14 Jun 2010, Chris Mason wrote: >> >>> On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote: >>>> On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng wrote: >>>>> On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote: >>>>>> On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote: >>>>>>> List, >>>>>>> >>>>>>> I ran into a hang issue (race condition: cpu is high when the server is >>>>>>> idle, meaning that btrfs is hanging, and IOwait is high as well) running >>>>>>> 2.6.34 on debian/lenny on a x86_64 server (dual Opteron 275 w/ 16GB ram). 
>>>>>>> The btrfs filesystem live on 18x300GB scsi spindles, configured as Raid-0, >>>>>>> as shown below: >>>>>>> >>>>>>> Label: none uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52 >>>>>>> Total devices 18 FS bytes used 2.94TB >>>>>>> devid 5 size 279.39GB used 208.33GB path /dev/cciss/c1d0 >>>>>>> devid 17 size 279.39GB used 208.34GB path /dev/cciss/c1d8 >>>>>>> devid 16 size 279.39GB used 209.33GB path /dev/cciss/c1d7 >>>>>>> devid 4 size 279.39GB used 208.33GB path /dev/cciss/c0d4 >>>>>>> devid 1 size 279.39GB used 233.72GB path /dev/cciss/c0d1 >>>>>>> devid 13 size 279.39GB used 208.33GB path /dev/cciss/c1d4 >>>>>>> devid 8 size 279.39GB used 208.33GB path /dev/cciss/c1d11 >>>>>>> devid 12 size 279.39GB used 208.33GB path /dev/cciss/c1d3 >>>>>>> devid 3 size 279.39GB used 208.33GB path /dev/cciss/c0d3 >>>>>>> devid 9 size 279.39GB used 208.33GB path /dev/cciss/c1d12 >>>>>>> devid 6 size 279.39GB used 208.33GB path /dev/cciss/c1d1 >>>>>>> devid 11 size 279.39GB used 208.33GB path /dev/cciss/c1d2 >>>>>>> devid 14 size 279.39GB used 208.33GB path /dev/cciss/c1d5 >>>>>>> devid 2 size 279.39GB used 233.70GB path /dev/cciss/c0d2 >>>>>>> devid 15 size 279.39GB used 209.33GB path /dev/cciss/c1d6 >>>>>>> devid 10 size 279.39GB used 208.33GB path /dev/cciss/c1d13 >>>>>>> devid 7 size 279.39GB used 208.33GB path /dev/cciss/c1d10 >>>>>>> devid 18 size 279.39GB used 208.34GB path /dev/cciss/c1d9 >>>>>>> Btrfs v0.19-16-g075587c-dirty >>>>>>> >>>>>>> The filesystem, mounted in /mnt/btrfs is hanging, no existing or new >>>>>>> process can access it, however 'df' still displays the disk usage (3TB out >>>>>>> of 5). The disks appear to be physically healthy. Please note that a >>>>>>> significant number of files were placed on this filesystem, between 20 and >>>>>>> 30 million files. >>>>>>> >>>>>>> The relevant kernel messages are displayed below: >>>>>>> >>>>>>> INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds. 
>>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>>>>> btrfs-submit- D 000000010042e12f 0 4220 2 0x00000000 >>>>>>> ffff8803e584ac70 0000000000000046 0000000000004000 0000000000011680 >>>>>>> ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 0000000000011680 >>>>>>> 0000000000000001 ffff8803ff99d250 ffffffff8149f020 0000000081150ab0 >>>>>>> Call Trace: >>>>>>> [<ffffffff813089f3>] ? io_schedule+0x71/0xb1 >>>>>>> [<ffffffff811470be>] ? get_request_wait+0xab/0x140 >>>>>>> [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e >>>>>>> [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97 >>>>>>> [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27 >>>>>>> [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc >>>>>>> [<ffffffff81145b16>] ? generic_make_request+0x207/0x268 >>>>>>> [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2 >>>>>>> [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 [btrfs] >>>>>>> [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f [btrfs] >>>>>>> [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs] >>>>>>> [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs] >>>>>>> [<ffffffff81040331>] ? kthread+0x79/0x81 >>>>>>> [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10 >>>>>>> [<ffffffff810402b8>] ? kthread+0x0/0x81 >>>>>>> [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10 >>>>>> This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375. >>>>>> This is reproduceable in our setup. >>>>> >>>>> I think I know the cause of http://lkml.org/lkml/2010/6/8/375. >>>>> The code in the first do-while loop in btrfs_commit_transaction >>>>> set current process to TASK_UNINTERRUPTIBLE state, then calls >>>>> btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and >>>>> btrfs_run_ordered_operations(). All of these function may call >>>>> cond_resched(). >>>> Hi, >>>> When I test random write, I saw a lot of threads jump into btree_writepages() >>>> and do noting and io throughput is zero for some time. 
Looks like there is a >>>> live lock. See the code of btree_writepages(): >>>> if (wbc->sync_mode == WB_SYNC_NONE) { >>>> struct btrfs_root *root = BTRFS_I(mapping->host)->root; >>>> u64 num_dirty; >>>> unsigned long thresh = 32 * 1024 * 1024; >>>> >>>> if (wbc->for_kupdate) >>>> return 0; >>>> >>>> /* this is a bit racy, but that's ok */ >>>> num_dirty = root->fs_info->dirty_metadata_bytes; >>>>>>>>>> if (num_dirty < thresh) >>>> return 0; >>>> } >>>> The marked line is quite intrusive. In my test, the live lock is caused by the thresh >>>> check. The dirty_metadata_bytes < 32M. Without it, I can't see the live lock. Not >>>> sure if this is related to the hang. >>> >>> How much ram do you have? The goal of the check is to avoid writing >>> metadata blocks because once we write them we have to do more IO to cow >>> them again if they are changed later. >> >> This server has 16GB of ram on a x86_64 (dual opteron 275, ecc memory). >> >>> It shouldn't be looping hard in btrfs there, what was the workload? >> >> The workload was the extraction of large tarballs (one at the time, >> about 300+ files extracted by second from a single tarball, which is >> pretty good), as you might expect, the disks were tested (read and >> write) for physical errors before I report this bug. > > I think Zheng is right and this one will get fixed by the latest code. > The spinning writepage part should be a different problem. I'm trying to repro with 2.6.35-rc3, expect results within 24 hours. Jerome J. Ibanes ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: btrfs: hanging processes - race condition? 2010-06-14 19:13 ` Jerome Ibanes @ 2010-06-16 18:12 ` Jerome Ibanes 0 siblings, 0 replies; 12+ messages in thread From: Jerome Ibanes @ 2010-06-16 18:12 UTC (permalink / raw) To: chris.mason, shaohua.li, yanzheng, linux-btrfs, yanmin.zhang, tim.c.chen Cc: Nicolas Michal [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed, Size: 8536 bytes --] On Mon, 14 Jun 2010, Jerome Ibanes wrote: >> On Mon, Jun 14, 2010 at 11:12:53AM -0700, Jerome Ibanes wrote: >> > On Mon, 14 Jun 2010, Chris Mason wrote: >> > >> > > On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote: >> > > > On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng wrote: >> > > > > On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> >> > > > > wrote: >> > > > > > On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote: >> > > > > > > List, >> > > > > > > >> > > > > > > I ran into a hang issue (race condition: cpu is high when the >> > > > > > > server is >> > > > > > > idle, meaning that btrfs is hanging, and IOwait is high as >> > > > > > > well) running >> > > > > > > 2.6.34 on debian/lenny on a x86_64 server (dual Opteron 275 w/ >> > > > > > > 16GB ram). 
>> > > > > > > The btrfs filesystem live on 18x300GB scsi spindles, >> > > > > > > configured as Raid-0, >> > > > > > > as shown below: >> > > > > > > >> > > > > > > Label: none uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52 >> > > > > > > Total devices 18 FS bytes used 2.94TB >> > > > > > > devid 5 size 279.39GB used 208.33GB path >> > > > > > > /dev/cciss/c1d0 >> > > > > > > devid 17 size 279.39GB used 208.34GB path >> > > > > > > /dev/cciss/c1d8 >> > > > > > > devid 16 size 279.39GB used 209.33GB path >> > > > > > > /dev/cciss/c1d7 >> > > > > > > devid 4 size 279.39GB used 208.33GB path >> > > > > > > /dev/cciss/c0d4 >> > > > > > > devid 1 size 279.39GB used 233.72GB path >> > > > > > > /dev/cciss/c0d1 >> > > > > > > devid 13 size 279.39GB used 208.33GB path >> > > > > > > /dev/cciss/c1d4 >> > > > > > > devid 8 size 279.39GB used 208.33GB path >> > > > > > > /dev/cciss/c1d11 >> > > > > > > devid 12 size 279.39GB used 208.33GB path >> > > > > > > /dev/cciss/c1d3 >> > > > > > > devid 3 size 279.39GB used 208.33GB path >> > > > > > > /dev/cciss/c0d3 >> > > > > > > devid 9 size 279.39GB used 208.33GB path >> > > > > > > /dev/cciss/c1d12 >> > > > > > > devid 6 size 279.39GB used 208.33GB path >> > > > > > > /dev/cciss/c1d1 >> > > > > > > devid 11 size 279.39GB used 208.33GB path >> > > > > > > /dev/cciss/c1d2 >> > > > > > > devid 14 size 279.39GB used 208.33GB path >> > > > > > > /dev/cciss/c1d5 >> > > > > > > devid 2 size 279.39GB used 233.70GB path >> > > > > > > /dev/cciss/c0d2 >> > > > > > > devid 15 size 279.39GB used 209.33GB path >> > > > > > > /dev/cciss/c1d6 >> > > > > > > devid 10 size 279.39GB used 208.33GB path >> > > > > > > /dev/cciss/c1d13 >> > > > > > > devid 7 size 279.39GB used 208.33GB path >> > > > > > > /dev/cciss/c1d10 >> > > > > > > devid 18 size 279.39GB used 208.34GB path >> > > > > > > /dev/cciss/c1d9 >> > > > > > > Btrfs v0.19-16-g075587c-dirty >> > > > > > > >> > > > > > > The filesystem, mounted in /mnt/btrfs is hanging, no existing >> > > > 
> > > or new >> > > > > > > process can access it, however 'df' still displays the disk >> > > > > > > usage (3TB out >> > > > > > > of 5). The disks appear to be physically healthy. Please note >> > > > > > > that a >> > > > > > > significant number of files were placed on this filesystem, >> > > > > > > between 20 and >> > > > > > > 30 million files. >> > > > > > > >> > > > > > > The relevant kernel messages are displayed below: >> > > > > > > >> > > > > > > INFO: task btrfs-submit-0:4220 blocked for more than 120 >> > > > > > > seconds. >> > > > > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables >> > > > > > > this message. >> > > > > > > btrfs-submit- D 000000010042e12f 0 4220 2 0x00000000 >> > > > > > > ffff8803e584ac70 0000000000000046 0000000000004000 >> > > > > > > 0000000000011680 >> > > > > > > ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 >> > > > > > > 0000000000011680 >> > > > > > > 0000000000000001 ffff8803ff99d250 ffffffff8149f020 >> > > > > > > 0000000081150ab0 >> > > > > > > Call Trace: >> > > > > > > [<ffffffff813089f3>] ? io_schedule+0x71/0xb1 >> > > > > > > [<ffffffff811470be>] ? get_request_wait+0xab/0x140 >> > > > > > > [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e >> > > > > > > [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97 >> > > > > > > [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27 >> > > > > > > [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc >> > > > > > > [<ffffffff81145b16>] ? generic_make_request+0x207/0x268 >> > > > > > > [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2 >> > > > > > > [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 >> > > > > > > [btrfs] >> > > > > > > [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f >> > > > > > > [btrfs] >> > > > > > > [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs] >> > > > > > > [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs] >> > > > > > > [<ffffffff81040331>] ? kthread+0x79/0x81 >> > > > > > > [<ffffffff81003674>] ? 
kernel_thread_helper+0x4/0x10 >> > > > > > > [<ffffffff810402b8>] ? kthread+0x0/0x81 >> > > > > > > [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10 >> > > > > > This looks like the issue we saw too, >> > > > > > http://lkml.org/lkml/2010/6/8/375. >> > > > > > This is reproduceable in our setup. >> > > > > >> > > > > I think I know the cause of http://lkml.org/lkml/2010/6/8/375. >> > > > > The code in the first do-while loop in btrfs_commit_transaction >> > > > > set current process to TASK_UNINTERRUPTIBLE state, then calls >> > > > > btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and >> > > > > btrfs_run_ordered_operations(). All of these function may call >> > > > > cond_resched(). >> > > > Hi, >> > > > When I test random write, I saw a lot of threads jump into >> > > > btree_writepages() >> > > > and do noting and io throughput is zero for some time. Looks like >> > > > there is a >> > > > live lock. See the code of btree_writepages(): >> > > > if (wbc->sync_mode == WB_SYNC_NONE) { >> > > > struct btrfs_root *root = BTRFS_I(mapping->host)->root; >> > > > u64 num_dirty; >> > > > unsigned long thresh = 32 * 1024 * 1024; >> > > > >> > > > if (wbc->for_kupdate) >> > > > return 0; >> > > > >> > > > /* this is a bit racy, but that's ok */ >> > > > num_dirty = root->fs_info->dirty_metadata_bytes; >> > > > > > > > > > if (num_dirty < thresh) >> > > > return 0; >> > > > } >> > > > The marked line is quite intrusive. In my test, the live lock is >> > > > caused by the thresh >> > > > check. The dirty_metadata_bytes < 32M. Without it, I can't see the >> > > > live lock. Not >> > > > sure if this is related to the hang. >> > > >> > > How much ram do you have? The goal of the check is to avoid writing >> > > metadata blocks because once we write them we have to do more IO to >> > > cow >> > > them again if they are changed later. >> > >> > This server has 16GB of ram on a x86_64 (dual opteron 275, ecc memory). 
>> > >> > > It shouldn't be looping hard in btrfs there, what was the workload? >> > >> > The workload was the extraction of large tarballs (one at the time, >> > about 300+ files extracted by second from a single tarball, which is >> > pretty good), as you might expect, the disks were tested (read and >> > write) for physical errors before I report this bug. >> >> I think Zheng is right and this one will get fixed by the latest code. >> The spinning writepage part should be a different problem. > > I'm trying to repro with 2.6.35-rc3, expect results within 24 hours. I can no longer repro this issue (after 48 hours of filesystem stress) under 2.6.35-rc3 on a filesystem with over 50 million files. I will reopen this thread should this reoccur. Jerome J. Ibanes ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: btrfs: hanging processes - race condition? 2010-06-14 13:28 ` Chris Mason 2010-06-14 18:12 ` Jerome Ibanes @ 2010-06-17 1:41 ` Shaohua Li 2010-06-18 0:57 ` Shaohua Li 1 sibling, 1 reply; 12+ messages in thread From: Shaohua Li @ 2010-06-17 1:41 UTC (permalink / raw) To: Chris Mason, Yan, Zheng, Jerome Ibanes, linux-btrfs On Mon, Jun 14, 2010 at 09:28:29PM +0800, Chris Mason wrote: > On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote: > > On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng wrote: > > > On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > > > On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote: > > > >> [Jerome's original hang report, device list, and btrfs-submit-0 backtrace snipped; quoted in full at the top of the thread] > > > > This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375. This is reproducible in our setup. > > > I think I know the cause of http://lkml.org/lkml/2010/6/8/375. The code in the first do-while loop in btrfs_commit_transaction sets the current process to the TASK_UNINTERRUPTIBLE state, then calls btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and btrfs_run_ordered_operations(). All of these functions may call cond_resched(). > > Hi, when I test random write, I saw a lot of threads jump into btree_writepages() and do nothing, and IO throughput is zero for some time. Looks like there is a live lock. 
See the code of btree_writepages():
> > 	if (wbc->sync_mode == WB_SYNC_NONE) {
> > 		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
> > 		u64 num_dirty;
> > 		unsigned long thresh = 32 * 1024 * 1024;
> >
> > 		if (wbc->for_kupdate)
> > 			return 0;
> >
> > 		/* this is a bit racy, but that's ok */
> > 		num_dirty = root->fs_info->dirty_metadata_bytes;
> > >>>>>>	if (num_dirty < thresh)
> > 			return 0;
> > 	}
> > The marked line is quite intrusive. In my test, the live lock is caused by the thresh check. The dirty_metadata_bytes < 32M. Without it, I can't see the live lock. Not sure if this is related to the hang.
> How much ram do you have? The goal of the check is to avoid writing metadata blocks because once we write them we have to do more IO to cow them again if they are changed later.
> It shouldn't be looping hard in btrfs there, what was the workload?
This is a fio random-write job. Yep, I limited memory to a small size (~500M), because that makes it easy for me to reproduce a 'xxx blocked for more than 120 seconds' issue. I can understand that small memory could be an issue, but this still looks intrusive, right?

The issue Yanmin reported is under 2.6.35-rc1, so it might not be the 'TASK_UNINTERRUPTIBLE' issue, but we will try -rc3 too.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: btrfs: hanging processes - race condition? 2010-06-17 1:41 ` Shaohua Li @ 2010-06-18 0:57 ` Shaohua Li 0 siblings, 0 replies; 12+ messages in thread From: Shaohua Li @ 2010-06-18 0:57 UTC (permalink / raw) To: Chris Mason, Yan, Zheng, Jerome Ibanes, linux-btrfs On Thu, Jun 17, 2010 at 09:41:18AM +0800, Shaohua Li wrote: > [previous message quoted in full, including the original report, the btree_writepages() discussion, and Chris's reply; snipped] > The issue Yanmin reported is under 2.6.35-rc1, so it might not be the 'TASK_UNINTERRUPTIBLE' issue, but we will try -rc3 too. I still get the message below with 2.6.35-rc3. The system is still running, because my fio test finished even with the message. INFO: task flush-btrfs-134:14144 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
flush-btrfs-1 D 0000000100346d5c  4480 14144      2 0x00000000
 ffff88016fd51530 0000000000000046 0000000000000001 ffff880100000000
 ffff88023ef0f100 0000000000013ac0 0000000000013ac0 0000000000004000
 0000000000013ac0 ffff88018be2dfd8 ffff88016fd51530 ffff88018be2dfd8
Call Trace:
 [<ffffffff8124be61>] ? wait_block_group_cache_progress+0xc0/0xe4
 [<ffffffff81052977>] ? autoremove_wake_function+0x0/0x2a
 [<ffffffff81052977>] ? autoremove_wake_function+0x0/0x2a
 [<ffffffff8124f956>] ? find_free_extent+0x694/0x9c4
 [<ffffffff8124fd53>] ? btrfs_reserve_extent+0xcd/0x189
 [<ffffffff81262803>] ? cow_file_range+0x19e/0x2fc
 [<ffffffff81262fe0>] ? run_delalloc_range+0xa7/0x393
 [<ffffffff8127832e>] ? test_range_bit+0x2b/0x127
 [<ffffffff8127b44e>] ? find_lock_delalloc_range+0x1af/0x1d1
 [<ffffffff8127b655>] ? __extent_writepage+0x1e5/0x61f
 [<ffffffff812c1fac>] ? prio_tree_next+0x1c0/0x221
 [<ffffffff812bf3c8>] ? cpumask_any_but+0x28/0x37
 [<ffffffff810b4af7>] ? page_mkclean+0x120/0x148
 [<ffffffff8127bed2>] ? extent_write_cache_pages.clone.0+0x15e/0x26c
 [<ffffffff8127c0db>] ? extent_writepages+0x41/0x5a
 [<ffffffff81264319>] ? btrfs_get_extent+0x0/0x798
 [<ffffffff810e6fcb>] ? writeback_single_inode+0xd1/0x2e8
 [<ffffffff810e7d02>] ? writeback_inodes_wb+0x40d/0x51f
 [<ffffffff810e7f47>] ? wb_writeback+0x133/0x1b2
 [<ffffffff810e81ae>] ? wb_do_writeback+0x148/0x15e
 [<ffffffff810e81fe>] ? bdi_writeback_task+0x3a/0x113
 [<ffffffff81052897>] ? bit_waitqueue+0x14/0xa4
 [<ffffffff810a9b19>] ? bdi_start_fn+0x0/0xc2
 [<ffffffff810a9b7c>] ? bdi_start_fn+0x63/0xc2
 [<ffffffff810a9b19>] ? bdi_start_fn+0x0/0xc2
 [<ffffffff81052525>] ? kthread+0x75/0x7d
 [<ffffffff81003654>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff810524b0>] ? kthread+0x0/0x7d
 [<ffffffff81003650>] ? kernel_thread_helper+0x0/0x10
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: btrfs: hanging processes - race condition? 2010-06-11 2:32 ` Yan, Zheng 2010-06-13 6:50 ` Shaohua Li @ 2010-06-14 13:26 ` Chris Mason 1 sibling, 0 replies; 12+ messages in thread From: Chris Mason @ 2010-06-14 13:26 UTC (permalink / raw) To: Yan, Zheng Cc: Shaohua Li, Jerome Ibanes, linux-btrfs, yanmin.zhang, tim.c.chen On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng wrote: > On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote: > >> [Jerome's original hang report, device list, and btrfs-submit-0 backtrace snipped; quoted in full at the top of the thread] > > This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375. This is reproducible in our setup. > I think I know the cause of http://lkml.org/lkml/2010/6/8/375. The code in the first do-while loop in btrfs_commit_transaction sets the current process to the TASK_UNINTERRUPTIBLE state, then calls btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and btrfs_run_ordered_operations(). All of these functions may call cond_resched(). The TASK_UNINTERRUPTIBLE problem was fixed with 2.6.35-rc1. You can find the changes in the master branch of the btrfs-unstable repo. -chris ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2010-06-18 0:57 UTC | newest] Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2010-06-10 17:41 btrfs: hanging processes - race condition? Jerome Ibanes 2010-06-11 1:12 ` Shaohua Li 2010-06-11 2:32 ` Yan, Zheng 2010-06-13 6:50 ` Shaohua Li 2010-06-14 13:28 ` Chris Mason 2010-06-14 18:12 ` Jerome Ibanes 2010-06-14 19:08 ` Chris Mason 2010-06-14 19:13 ` Jerome Ibanes 2010-06-16 18:12 ` Jerome Ibanes 2010-06-17 1:41 ` Shaohua Li 2010-06-18 0:57 ` Shaohua Li 2010-06-14 13:26 ` Chris Mason
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).