All of lore.kernel.org
 help / color / mirror / Atom feed
* CFQ: async queue blocks the whole system
@ 2011-06-09 10:49 Tao Ma
  2011-06-09 14:14 ` Vivek Goyal
  0 siblings, 1 reply; 33+ messages in thread
From: Tao Ma @ 2011-06-09 10:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jens Axboe, Vivek Goyal, Tao Ma

Hi Jens and Vivek,
	We are current running some heavy ext4 metadata test,
and we found a very severe problem for CFQ. Please correct me if
my statement below is wrong.

CFQ only has an async queue for every priority of every class and
these queues have a very low serving priority, so if the system
has a large number of sync reads, these queues will be delayed a
lot of time. As a result, the flushers will be blocked, then the
journal and finally our applications[1].

I have tried to let jbd/2 to use WRITE_SYNC so that they can checkpoint
in time and the patches are sent. But today we found another similar
block in kswapd which make me think that maybe CFQ should be changed
somehow so that all these callers can benefit from it.

So is there any way to let the async queue work timely or at least
is there any deadline for async queue to finish an request in time
even in case there are many reads?

btw, We have tested deadline scheduler and it seems to work in our test.

[1] the message we get from one system:
INFO: task flush-8:0:2950 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-8:0       D ffff88062bfde738     0  2950      2 0x00000000
 ffff88062b137820 0000000000000046 ffff88062b137750 ffffffff812b7bc3
 ffff88032cddc000 ffff88062bfde380 ffff88032d3d8840 0000000c2be37400
 000000002be37601 0000000000000006 ffff88062b137760 ffffffff811c242e
Call Trace:
 [<ffffffff812b7bc3>] ? scsi_request_fn+0x345/0x3df
 [<ffffffff811c242e>] ? __blk_run_queue+0x1a/0x1c
 [<ffffffff811c57cc>] ? queue_unplugged+0x77/0x8e
 [<ffffffff813dbe67>] io_schedule+0x47/0x61
 [<ffffffff811c512c>] get_request_wait+0xe0/0x152
 [<ffffffff81062ed0>] ? list_del_init+0x21/0x21
 [<ffffffff811c1c21>] ? elv_merge+0xa0/0xb5
 [<ffffffff811c7cdd>] __make_request+0x185/0x2a8
 [<ffffffff811c4536>] generic_make_request+0x246/0x323
 [<ffffffff810c880b>] ? mempool_alloc_slab+0x16/0x18
 [<ffffffff810c8a7d>] ? mempool_alloc+0x31/0xf4
 [<ffffffff811c5bc2>] submit_bio+0xe2/0x101
 [<ffffffff81136c88>] ? bio_alloc_bioset+0x4d/0xc5
 [<ffffffff810dc846>] ? inc_zone_page_state+0x25/0x28
 [<ffffffff811326f7>] submit_bh+0x105/0x129
 [<ffffffff81134b74>] __block_write_full_page+0x218/0x31d
 [<ffffffff811356ea>] ? __set_page_dirty_buffers+0xac/0xac
 [<ffffffff81138aea>] ? blkdev_get_blocks+0xa6/0xa6
 [<ffffffff811356ea>] ? __set_page_dirty_buffers+0xac/0xac
 [<ffffffff81138aea>] ? blkdev_get_blocks+0xa6/0xa6
 [<ffffffff81134d02>] block_write_full_page_endio+0x89/0x95
 [<ffffffff81134d23>] block_write_full_page+0x15/0x17
 [<ffffffff81137cc6>] blkdev_writepage+0x18/0x1a
 [<ffffffff810cdf48>] __writepage+0x17/0x30
 [<ffffffff810ce571>] write_cache_pages+0x251/0x361
 [<ffffffff810cdf31>] ? page_mapping+0x35/0x35
 [<ffffffff810ce6c9>] generic_writepages+0x48/0x63
 [<ffffffff810ce705>] do_writepages+0x21/0x2a
 [<ffffffff8112c296>] writeback_single_inode+0xb1/0x1a8
 [<ffffffff8112c7fb>] writeback_sb_inodes+0xb5/0x12f
 [<ffffffff8112cbfa>] writeback_inodes_wb+0x111/0x121
 [<ffffffff8112cdd3>] wb_writeback+0x1c9/0x2ce
 [<ffffffff81053b6a>] ? lock_timer_base+0x2b/0x4f
 [<ffffffff8112d00c>] wb_do_writeback+0x134/0x1a3
 [<ffffffff8112df71>] bdi_writeback_thread+0x89/0x1b4
 [<ffffffff8112dee8>] ? perf_trace_writeback_class+0xa6/0xa6
 [<ffffffff81062966>] kthread+0x72/0x7a
 [<ffffffff813e4a04>] kernel_thread_helper+0x4/0x10
 [<ffffffff810628f4>] ? kthread_bind+0x67/0x67
 [<ffffffff813e4a00>] ? gs_change+0x13/0x13
INFO: task jbd2/sda12-8:3435 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/sda12-8    D ffff88062c2fabb8     0  3435      2 0x00000000
 ffff88061f6c9d30 0000000000000046 0000000000000000 0000000000000000
 0000000000000000 ffff88062c2fa800 ffff88032d238400 00000001000024b4
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff81062f38>] ? spin_unlock_irqrestore+0xe/0x10
 [<ffffffffa00fa718>] jbd2_journal_commit_transaction+0x254/0x14a4 [jbd2]
 [<ffffffff81036d7a>] ? need_resched+0x23/0x2d
 [<ffffffff81062ed0>] ? list_del_init+0x21/0x21
 [<ffffffff81053b6a>] ? lock_timer_base+0x2b/0x4f
 [<ffffffff81053b3d>] ? spin_unlock_irqrestore+0xe/0x10
 [<ffffffff81053c09>] ? try_to_del_timer_sync+0x7b/0x89
 [<ffffffffa01001a5>] ? jbd2_journal_start_commit+0x72/0x72 [jbd2]
 [<ffffffffa01002c9>] kjournald2+0x124/0x381 [jbd2]
 [<ffffffff81062ed0>] ? list_del_init+0x21/0x21
 [<ffffffff81062966>] kthread+0x72/0x7a
 [<ffffffff813e4a04>] kernel_thread_helper+0x4/0x10
 [<ffffffff810628f4>] ? kthread_bind+0x67/0x67
 [<ffffffff813e4a00>] ? gs_change+0x13/0x13
INFO: task attr_set:3832 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
attr_set        D ffff8806157f8538     0  3832      1 0x00000000
 ffff880615565b28 0000000000000086 0000000000000001 0000000000000007
 0000000000000000 ffff8806157f8180 ffffffff8180b020 0000000000000000
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8103926f>] ? hrtick_update+0x32/0x34
 [<ffffffff81042360>] ? dequeue_task_fair+0x15c/0x169
 [<ffffffff81062f38>] ? spin_unlock_irqrestore+0xe/0x10
 [<ffffffffa00f92f7>] start_this_handle+0x2f5/0x564 [jbd2]
 [<ffffffff81062ed0>] ? list_del_init+0x21/0x21
 [<ffffffffa00f973c>] jbd2__journal_start+0xa5/0xd2 [jbd2]
 [<ffffffffa00f977c>] jbd2_journal_start+0x13/0x15 [jbd2]
 [<ffffffffa012f1fc>] ext4_journal_start_sb+0x11a/0x129 [ext4]
 [<ffffffffa0119b56>] ? ext4_file_open+0x15b/0x181 [ext4]
 [<ffffffffa014c1de>] ext4_xattr_set+0x69/0xe2 [ext4]
 [<ffffffffa014c96e>] ext4_xattr_user_set+0x43/0x49 [ext4]
 [<ffffffff811294cb>] generic_setxattr+0x67/0x76
 [<ffffffff81129f7b>] __vfs_setxattr_noperm+0x77/0xdc
 [<ffffffff8112a05c>] vfs_setxattr+0x7c/0x97
 [<ffffffff8112a12c>] setxattr+0xb5/0xe8
 [<ffffffff810fed93>] ? virt_to_head_page+0x29/0x2b
 [<ffffffff810fedb3>] ? virt_to_slab+0x1e/0x2e
 [<ffffffff810ff668>] ? __cache_free+0x44/0x1bf
 [<ffffffff8112a1ca>] sys_fsetxattr+0x6b/0x91
 [<ffffffff813e38c2>] system_call_fastpath+0x16/0x1b

[2] kswapd is blocked.
[<ffffffff814ae453>] io_schedule+0x73/0xc0
[682201.029914]  [<ffffffff81230cea>] get_request_wait+0xca/0x160
[682201.030236]  [<ffffffff8108d520>] ? autoremove_wake_function+0x0/0x40
[682201.030602]  [<ffffffff81228077>] ? elv_merge+0x37/0x1c0
[682201.030880]  [<ffffffff81230e13>] __make_request+0x93/0x4b0
[682201.031511]  [<ffffffff8122f1b9>] generic_make_request+0x1b9/0x3c0
[682201.031863]  [<ffffffff810d6a3d>] ? rcu_start_gp+0xfd/0x1e0
[682201.032195]  [<ffffffff8122f439>] submit_bio+0x79/0x120
[682201.032472]  [<ffffffff8118f0b9>] submit_bh+0xf9/0x150
[682201.032741]  [<ffffffff8119211e>] __block_write_full_page+0x1ae/0x320
[682201.033093]  [<ffffffff811900a0>] ? end_buffer_async_write+0x0/0x160
[682201.033457]  [<ffffffffa02483b0>] ? noalloc_get_block_write+0x0/0x60 [ext4]
[682201.033777]  [<ffffffff811900a0>] ? end_buffer_async_write+0x0/0x160
[682201.034079]  [<ffffffff81192366>] block_write_full_page_endio+0xd6/0x120
[682201.034413]  [<ffffffffa02483b0>] ? noalloc_get_block_write+0x0/0x60 [ext4]
[682201.034727]  [<ffffffff811923c5>] block_write_full_page+0x15/0x20
[682201.035063]  [<ffffffffa0244d9e>] ext4_writepage+0x28e/0x340 [ext4]
[682201.035509]  [<ffffffff8111d61d>] shrink_zone+0x116d/0x1480
[682201.035792]  [<ffffffff8111e25c>] kswapd+0x60c/0x800
[682201.036049]  [<ffffffff8111b5d0>] ? isolate_pages_global+0x0/0x3e0
[682201.036397]  [<ffffffff814adcfa>] ? thread_return+0x4e/0x734
[682201.036745]  [<ffffffff8108d520>] ? autoremove_wake_function+0x0/0x40
[682201.037055]  [<ffffffff8111dc50>] ? kswapd+0x0/0x800
[682201.037359]  [<ffffffff8108d3a6>] kthread+0x96/0xa0
[682201.037671]  [<ffffffff8101408a>] child_rip+0xa/0x20
[682201.038115]  [<ffffffff8108d310>] ? kthread+0x0/0xa0
[682201.038421]  [<ffffffff81014080>] ? child_rip+0x0/0x20

Regards,
Tao

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-09 10:49 CFQ: async queue blocks the whole system Tao Ma
@ 2011-06-09 14:14 ` Vivek Goyal
  2011-06-09 14:34   ` Jens Axboe
  2011-06-09 14:47   ` Tao Ma
  0 siblings, 2 replies; 33+ messages in thread
From: Vivek Goyal @ 2011-06-09 14:14 UTC (permalink / raw)
  To: Tao Ma; +Cc: linux-kernel, Jens Axboe

On Thu, Jun 09, 2011 at 06:49:37PM +0800, Tao Ma wrote:
> Hi Jens and Vivek,
> 	We are current running some heavy ext4 metadata test,
> and we found a very severe problem for CFQ. Please correct me if
> my statement below is wrong.
> 
> CFQ only has an async queue for every priority of every class and
> these queues have a very low serving priority, so if the system
> has a large number of sync reads, these queues will be delayed a
> lot of time. As a result, the flushers will be blocked, then the
> journal and finally our applications[1].
> 
> I have tried to let jbd/2 to use WRITE_SYNC so that they can checkpoint
> in time and the patches are sent. But today we found another similar
> block in kswapd which make me think that maybe CFQ should be changed
> somehow so that all these callers can benefit from it.
> 
> So is there any way to let the async queue work timely or at least
> is there any deadline for async queue to finish an request in time
> even in case there are many reads?
> 
> btw, We have tested deadline scheduler and it seems to work in our test.
> 
> [1] the message we get from one system:
> INFO: task flush-8:0:2950 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> flush-8:0       D ffff88062bfde738     0  2950      2 0x00000000
>  ffff88062b137820 0000000000000046 ffff88062b137750 ffffffff812b7bc3
>  ffff88032cddc000 ffff88062bfde380 ffff88032d3d8840 0000000c2be37400
>  000000002be37601 0000000000000006 ffff88062b137760 ffffffff811c242e
> Call Trace:
>  [<ffffffff812b7bc3>] ? scsi_request_fn+0x345/0x3df
>  [<ffffffff811c242e>] ? __blk_run_queue+0x1a/0x1c
>  [<ffffffff811c57cc>] ? queue_unplugged+0x77/0x8e
>  [<ffffffff813dbe67>] io_schedule+0x47/0x61
>  [<ffffffff811c512c>] get_request_wait+0xe0/0x152

Ok, so flush slept on trying to get a "request" allocated on request 
queue. That means all the ASYNC request descriptors are already consumed
and we are not making progress with ASYNc requests.

A relatively recent patch allowed sync queues to always preempt async queues
and schedule sync workload instead of async. This had the potential to
starve async queues and looks like that's what we are running into.

commit f8ae6e3eb8251be32c6e913393d9f8d9e0609489
Author: Shaohua Li <shaohua.li@intel.com>
Date:   Fri Jan 14 08:41:02 2011 +0100

    block cfq: make queue preempt work for queues from different workload

Do you have few seconds of blktrace. I just wanted to verify that this
is what we are running into. 

Thanks
Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-09 14:14 ` Vivek Goyal
@ 2011-06-09 14:34   ` Jens Axboe
  2011-06-09 14:47   ` Tao Ma
  1 sibling, 0 replies; 33+ messages in thread
From: Jens Axboe @ 2011-06-09 14:34 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Tao Ma, linux-kernel

On 2011-06-09 16:14, Vivek Goyal wrote:
> On Thu, Jun 09, 2011 at 06:49:37PM +0800, Tao Ma wrote:
>> Hi Jens and Vivek,
>> 	We are current running some heavy ext4 metadata test,
>> and we found a very severe problem for CFQ. Please correct me if
>> my statement below is wrong.
>>
>> CFQ only has an async queue for every priority of every class and
>> these queues have a very low serving priority, so if the system
>> has a large number of sync reads, these queues will be delayed a
>> lot of time. As a result, the flushers will be blocked, then the
>> journal and finally our applications[1].
>>
>> I have tried to let jbd/2 to use WRITE_SYNC so that they can checkpoint
>> in time and the patches are sent. But today we found another similar
>> block in kswapd which make me think that maybe CFQ should be changed
>> somehow so that all these callers can benefit from it.
>>
>> So is there any way to let the async queue work timely or at least
>> is there any deadline for async queue to finish an request in time
>> even in case there are many reads?
>>
>> btw, We have tested deadline scheduler and it seems to work in our test.
>>
>> [1] the message we get from one system:
>> INFO: task flush-8:0:2950 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> flush-8:0       D ffff88062bfde738     0  2950      2 0x00000000
>>  ffff88062b137820 0000000000000046 ffff88062b137750 ffffffff812b7bc3
>>  ffff88032cddc000 ffff88062bfde380 ffff88032d3d8840 0000000c2be37400
>>  000000002be37601 0000000000000006 ffff88062b137760 ffffffff811c242e
>> Call Trace:
>>  [<ffffffff812b7bc3>] ? scsi_request_fn+0x345/0x3df
>>  [<ffffffff811c242e>] ? __blk_run_queue+0x1a/0x1c
>>  [<ffffffff811c57cc>] ? queue_unplugged+0x77/0x8e
>>  [<ffffffff813dbe67>] io_schedule+0x47/0x61
>>  [<ffffffff811c512c>] get_request_wait+0xe0/0x152
> 
> Ok, so flush slept on trying to get a "request" allocated on request 
> queue. That means all the ASYNC request descriptors are already consumed
> and we are not making progress with ASYNc requests.
> 
> A relatively recent patch allowed sync queues to always preempt async queues
> and schedule sync workload instead of async. This had the potential to
> starve async queues and looks like that's what we are running into.
> 
> commit f8ae6e3eb8251be32c6e913393d9f8d9e0609489
> Author: Shaohua Li <shaohua.li@intel.com>
> Date:   Fri Jan 14 08:41:02 2011 +0100
> 
>     block cfq: make queue preempt work for queues from different workload
> 
> Do you have few seconds of blktrace. I just wanted to verify that this
> is what we are running into. 

That's a good first step. Tao Ma, is this a known regression or is that
unknown?

On vacation this week, I'll look into as soon as I get back.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-09 14:14 ` Vivek Goyal
  2011-06-09 14:34   ` Jens Axboe
@ 2011-06-09 14:47   ` Tao Ma
  2011-06-09 15:37     ` Vivek Goyal
  1 sibling, 1 reply; 33+ messages in thread
From: Tao Ma @ 2011-06-09 14:47 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, Jens Axboe

Hi Vivek,
	Thanks for the quick response.
On 06/09/2011 10:14 PM, Vivek Goyal wrote:
> On Thu, Jun 09, 2011 at 06:49:37PM +0800, Tao Ma wrote:
>> Hi Jens and Vivek,
>> 	We are current running some heavy ext4 metadata test,
>> and we found a very severe problem for CFQ. Please correct me if
>> my statement below is wrong.
>>
>> CFQ only has an async queue for every priority of every class and
>> these queues have a very low serving priority, so if the system
>> has a large number of sync reads, these queues will be delayed a
>> lot of time. As a result, the flushers will be blocked, then the
>> journal and finally our applications[1].
>>
>> I have tried to let jbd/2 to use WRITE_SYNC so that they can checkpoint
>> in time and the patches are sent. But today we found another similar
>> block in kswapd which make me think that maybe CFQ should be changed
>> somehow so that all these callers can benefit from it.
>>
>> So is there any way to let the async queue work timely or at least
>> is there any deadline for async queue to finish an request in time
>> even in case there are many reads?
>>
>> btw, We have tested deadline scheduler and it seems to work in our test.
>>
>> [1] the message we get from one system:
>> INFO: task flush-8:0:2950 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> flush-8:0       D ffff88062bfde738     0  2950      2 0x00000000
>>  ffff88062b137820 0000000000000046 ffff88062b137750 ffffffff812b7bc3
>>  ffff88032cddc000 ffff88062bfde380 ffff88032d3d8840 0000000c2be37400
>>  000000002be37601 0000000000000006 ffff88062b137760 ffffffff811c242e
>> Call Trace:
>>  [<ffffffff812b7bc3>] ? scsi_request_fn+0x345/0x3df
>>  [<ffffffff811c242e>] ? __blk_run_queue+0x1a/0x1c
>>  [<ffffffff811c57cc>] ? queue_unplugged+0x77/0x8e
>>  [<ffffffff813dbe67>] io_schedule+0x47/0x61
>>  [<ffffffff811c512c>] get_request_wait+0xe0/0x152
> 
> Ok, so flush slept on trying to get a "request" allocated on request 
> queue. That means all the ASYNC request descriptors are already consumed
> and we are not making progress with ASYNc requests.
> 
> A relatively recent patch allowed sync queues to always preempt async queues
> and schedule sync workload instead of async. This had the potential to
> starve async queues and looks like that's what we are running into.
> 
> commit f8ae6e3eb8251be32c6e913393d9f8d9e0609489
> Author: Shaohua Li <shaohua.li@intel.com>
> Date:   Fri Jan 14 08:41:02 2011 +0100
> 
>     block cfq: make queue preempt work for queues from different workload
> 
> Do you have few seconds of blktrace. I just wanted to verify that this
> is what we are running into. 
We are using the latest kernel, so the patch is already there. :(

You are right that all the requests have been allocated and the flusher
is waiting for requests to be available. But the root cause is that in
heavy sync read, the async queue in cfq is delayed too much. I have
added some traces in the cfq codes path and after several investigation,
I found several interesting things and tried to improve it. But I am not
sure whether it is bug or it is designed intentionally.

1. In cfq_dispatch_requests we select a sync queue to serve, but if the
queue has too much requests in flight, the cfq_slice_used_soon may be
true and the cfqq isn't allowed to send and will waste some timeslice.
Then why choose this cfqq? Why not choose a qualified one?

2. async queue isn't allowed to be sent if there is some sync request in
fly, but as now most of the devices has a greater depth, should we
improve it somehow? I guess queue_depth should be a valid number maybe?

3. Even there is no sync i/o, the async queue isn't allowed to send too
much requests because of the check in cfq_may_dispatch "Async queues
must wait a bit before being allowed dispatch", so in my test the async
queue has several chances to be selected, but it is only allowed
todispatch one request at a time. It is really amazing.

4. We have nr_requests = 128 for the async queues, but have so many
limitations for it to be carried out, and once the processes get the
chances, it will batch the requests, and in my tests, the total async
requests can be accumulated to more than 180, So it takes a really long
time for the process to wait for the requests to be lower than 127 again
to request more I/O and causes the livelock. So the number also be
improved somehow? Maybe it should be changed dynamically? anyway, if the
async queue can be dispatched in time and in batches in cfq, it
shouldn't be a problem.

btw, I uses a SAS which has a queue_depth=128 and the D2C time is very
small, so the whole system throughout is really really low.

Regards,
Tao

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-09 14:47   ` Tao Ma
@ 2011-06-09 15:37     ` Vivek Goyal
  2011-06-09 15:44       ` Tao Ma
  2011-06-10  1:19       ` Shaohua Li
  0 siblings, 2 replies; 33+ messages in thread
From: Vivek Goyal @ 2011-06-09 15:37 UTC (permalink / raw)
  To: Tao Ma; +Cc: linux-kernel, Jens Axboe

On Thu, Jun 09, 2011 at 10:47:43PM +0800, Tao Ma wrote:
> Hi Vivek,
> 	Thanks for the quick response.
> On 06/09/2011 10:14 PM, Vivek Goyal wrote:
> > On Thu, Jun 09, 2011 at 06:49:37PM +0800, Tao Ma wrote:
> >> Hi Jens and Vivek,
> >> 	We are current running some heavy ext4 metadata test,
> >> and we found a very severe problem for CFQ. Please correct me if
> >> my statement below is wrong.
> >>
> >> CFQ only has an async queue for every priority of every class and
> >> these queues have a very low serving priority, so if the system
> >> has a large number of sync reads, these queues will be delayed a
> >> lot of time. As a result, the flushers will be blocked, then the
> >> journal and finally our applications[1].
> >>
> >> I have tried to let jbd/2 to use WRITE_SYNC so that they can checkpoint
> >> in time and the patches are sent. But today we found another similar
> >> block in kswapd which make me think that maybe CFQ should be changed
> >> somehow so that all these callers can benefit from it.
> >>
> >> So is there any way to let the async queue work timely or at least
> >> is there any deadline for async queue to finish an request in time
> >> even in case there are many reads?
> >>
> >> btw, We have tested deadline scheduler and it seems to work in our test.
> >>
> >> [1] the message we get from one system:
> >> INFO: task flush-8:0:2950 blocked for more than 120 seconds.
> >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> flush-8:0       D ffff88062bfde738     0  2950      2 0x00000000
> >>  ffff88062b137820 0000000000000046 ffff88062b137750 ffffffff812b7bc3
> >>  ffff88032cddc000 ffff88062bfde380 ffff88032d3d8840 0000000c2be37400
> >>  000000002be37601 0000000000000006 ffff88062b137760 ffffffff811c242e
> >> Call Trace:
> >>  [<ffffffff812b7bc3>] ? scsi_request_fn+0x345/0x3df
> >>  [<ffffffff811c242e>] ? __blk_run_queue+0x1a/0x1c
> >>  [<ffffffff811c57cc>] ? queue_unplugged+0x77/0x8e
> >>  [<ffffffff813dbe67>] io_schedule+0x47/0x61
> >>  [<ffffffff811c512c>] get_request_wait+0xe0/0x152
> > 
> > Ok, so flush slept on trying to get a "request" allocated on request 
> > queue. That means all the ASYNC request descriptors are already consumed
> > and we are not making progress with ASYNc requests.
> > 
> > A relatively recent patch allowed sync queues to always preempt async queues
> > and schedule sync workload instead of async. This had the potential to
> > starve async queues and looks like that's what we are running into.
> > 
> > commit f8ae6e3eb8251be32c6e913393d9f8d9e0609489
> > Author: Shaohua Li <shaohua.li@intel.com>
> > Date:   Fri Jan 14 08:41:02 2011 +0100
> > 
> >     block cfq: make queue preempt work for queues from different workload
> > 
> > Do you have few seconds of blktrace. I just wanted to verify that this
> > is what we are running into. 
> We are using the latest kernel, so the patch is already there. :(
> 
> You are right that all the requests have been allocated and the flusher
> is waiting for requests to be available. But the root cause is that in
> heavy sync read, the async queue in cfq is delayed too much. I have
> added some traces in the cfq codes path and after several investigation,
> I found several interesting things and tried to improve it. But I am not
> sure whether it is bug or it is designed intentionally.
> 
> 1. In cfq_dispatch_requests we select a sync queue to serve, but if the
> queue has too much requests in flight, the cfq_slice_used_soon may be
> true and the cfqq isn't allowed to send and will waste some timeslice.
> Then why choose this cfqq? Why not choose a qualified one?

CFQ in general tries not to drive too deep a queue depth in an effort
to improve latencies. CFQ is generally recommened for slow SATA drives
and dispatching too many requests from a single queue can only serve to
increase the latency.

> 
> 2. async queue isn't allowed to be sent if there is some sync request in
> fly, but as now most of the devices has a greater depth, should we
> improve it somehow? I guess queue_depth should be a valid number maybe?

We seem to be running this batching thing in cfq_may_dispatch() where
we drain sync requests before async is dispatched and vice-a-versa. 
I am not sure how does this batching thing helps. I think Jens should
be a better person to comment on that.

I ran a fio job with few readers and few writers. I do see that few times
we have schedule ASYNC workload/queue but did not dispatch a request
from that. And reason being that there are sync requests in flight. And
by the time sync requests finish, async queue gets preempted.

So async queue does it scheduled but never gets a chance to dispatch
a request because there was sync IO in flight.

If there is no major advantage of draining sync requests before async
is dispatched, I think this should be an easy fix.
 
> 
> 3. Even there is no sync i/o, the async queue isn't allowed to send too
> much requests because of the check in cfq_may_dispatch "Async queues
> must wait a bit before being allowed dispatch", so in my test the async
> queue has several chances to be selected, but it is only allowed
> todispatch one request at a time. It is really amazing.

Again heavily loaded to improve sync latencies. Say you have queue
depth of 128 and you fill that all with async requests because right
now there is no sync request around. Then a sync request comes in.
We don't have a way to give it a priority and it might happen that
it gets executed after 128 async requests have finished (driver and
drive dependent though).

So in an attempt to improve sync latencies we don't drive too
high queue depths.

Its latency vs throughput tradeoff.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-09 15:37     ` Vivek Goyal
@ 2011-06-09 15:44       ` Tao Ma
  2011-06-09 18:27         ` Vivek Goyal
  2011-06-10  1:19       ` Shaohua Li
  1 sibling, 1 reply; 33+ messages in thread
From: Tao Ma @ 2011-06-09 15:44 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, Jens Axboe

On 06/09/2011 11:37 PM, Vivek Goyal wrote:
> On Thu, Jun 09, 2011 at 10:47:43PM +0800, Tao Ma wrote:
>> Hi Vivek,
>> 	Thanks for the quick response.
>> On 06/09/2011 10:14 PM, Vivek Goyal wrote:
>>> On Thu, Jun 09, 2011 at 06:49:37PM +0800, Tao Ma wrote:
>>>> Hi Jens and Vivek,
>>>> 	We are current running some heavy ext4 metadata test,
>>>> and we found a very severe problem for CFQ. Please correct me if
>>>> my statement below is wrong.
>>>>
>>>> CFQ only has an async queue for every priority of every class and
>>>> these queues have a very low serving priority, so if the system
>>>> has a large number of sync reads, these queues will be delayed a
>>>> lot of time. As a result, the flushers will be blocked, then the
>>>> journal and finally our applications[1].
>>>>
>>>> I have tried to let jbd/2 to use WRITE_SYNC so that they can checkpoint
>>>> in time and the patches are sent. But today we found another similar
>>>> block in kswapd which make me think that maybe CFQ should be changed
>>>> somehow so that all these callers can benefit from it.
>>>>
>>>> So is there any way to let the async queue work timely or at least
>>>> is there any deadline for async queue to finish an request in time
>>>> even in case there are many reads?
>>>>
>>>> btw, We have tested deadline scheduler and it seems to work in our test.
>>>>
>>>> [1] the message we get from one system:
>>>> INFO: task flush-8:0:2950 blocked for more than 120 seconds.
>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> flush-8:0       D ffff88062bfde738     0  2950      2 0x00000000
>>>>  ffff88062b137820 0000000000000046 ffff88062b137750 ffffffff812b7bc3
>>>>  ffff88032cddc000 ffff88062bfde380 ffff88032d3d8840 0000000c2be37400
>>>>  000000002be37601 0000000000000006 ffff88062b137760 ffffffff811c242e
>>>> Call Trace:
>>>>  [<ffffffff812b7bc3>] ? scsi_request_fn+0x345/0x3df
>>>>  [<ffffffff811c242e>] ? __blk_run_queue+0x1a/0x1c
>>>>  [<ffffffff811c57cc>] ? queue_unplugged+0x77/0x8e
>>>>  [<ffffffff813dbe67>] io_schedule+0x47/0x61
>>>>  [<ffffffff811c512c>] get_request_wait+0xe0/0x152
>>>
>>> Ok, so flush slept on trying to get a "request" allocated on request 
>>> queue. That means all the ASYNC request descriptors are already consumed
>>> and we are not making progress with ASYNc requests.
>>>
>>> A relatively recent patch allowed sync queues to always preempt async queues
>>> and schedule sync workload instead of async. This had the potential to
>>> starve async queues and looks like that's what we are running into.
>>>
>>> commit f8ae6e3eb8251be32c6e913393d9f8d9e0609489
>>> Author: Shaohua Li <shaohua.li@intel.com>
>>> Date:   Fri Jan 14 08:41:02 2011 +0100
>>>
>>>     block cfq: make queue preempt work for queues from different workload
>>>
>>> Do you have few seconds of blktrace. I just wanted to verify that this
>>> is what we are running into. 
>> We are using the latest kernel, so the patch is already there. :(
>>
>> You are right that all the requests have been allocated and the flusher
>> is waiting for requests to be available. But the root cause is that in
>> heavy sync read, the async queue in cfq is delayed too much. I have
>> added some traces in the cfq codes path and after several investigation,
>> I found several interesting things and tried to improve it. But I am not
>> sure whether it is bug or it is designed intentionally.
>>
>> 1. In cfq_dispatch_requests we select a sync queue to serve, but if the
>> queue has too much requests in flight, the cfq_slice_used_soon may be
>> true and the cfqq isn't allowed to send and will waste some timeslice.
>> Then why choose this cfqq? Why not choose a qualified one?
> 
> CFQ in general tries not to drive too deep a queue depth in an effort
> to improve latencies. CFQ is generally recommened for slow SATA drives
> and dispatching too many requests from a single queue can only serve to
> increase the latency.
ok, so do you mean that for a fast drive, cfq isn't recommended and
deadline is always prefered? ;) We have a SAS with queue_depth=128, so
it should be a fast drive I guess. :)
> 
>>
>> 2. async queue isn't allowed to be sent if there is some sync request in
>> fly, but as now most of the devices has a greater depth, should we
>> improve it somehow? I guess queue_depth should be a valid number maybe?
> 
> We seem to be running this batching thing in cfq_may_dispatch() where
> we drain sync requests before async is dispatched and vice-a-versa. 
> I am not sure how does this batching thing helps. I think Jens should
> be a better person to comment on that.
> 
> I ran a fio job with few readers and few writers. I do see that few times
> we have schedule ASYNC workload/queue but did not dispatch a request
> from that. And reason being that there are sync requests in flight. And
> by the time sync requests finish, async queue gets preempted.
> 
> So async queue does it scheduled but never gets a chance to dispatch
> a request because there was sync IO in flight.
yeah, that's one thing I found in my test.
> 
> If there is no major advantage of draining sync requests before async
> is dispatched, I think this should be an easy fix.
>  
>>
>> 3. Even there is no sync i/o, the async queue isn't allowed to send too
>> much requests because of the check in cfq_may_dispatch "Async queues
>> must wait a bit before being allowed dispatch", so in my test the async
>> queue has several chances to be selected, but it is only allowed
>> todispatch one request at a time. It is really amazing.
> 
> Again heavily loaded to improve sync latencies. Say you have queue
> depth of 128 and you fill that all with async requests because right
> now there is no sync request around. Then a sync request comes in.
> We don't have a way to give it a priority and it might happen that
> it gets executed after 128 async requests have finished (driver and
> drive dependent though).
> 
> So in an attempt to improve sync latencies we don't drive too
> high queue depths.
> 
> Its latency vs throughput tradeoff.
ok, so it seems that all these are designed, not a bug. Thanks for the
clarification.

btw, reverting the patch doesn't work. I can still get the livelock.

Regards,
Tao

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-09 15:44       ` Tao Ma
@ 2011-06-09 18:27         ` Vivek Goyal
  2011-06-10  5:48           ` Tao Ma
  0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-06-09 18:27 UTC (permalink / raw)
  To: Tao Ma; +Cc: linux-kernel, Jens Axboe

On Thu, Jun 09, 2011 at 11:44:21PM +0800, Tao Ma wrote:

[..]
> > CFQ in general tries not to drive too deep a queue depth in an effort
> > to improve latencies. CFQ is generally recommened for slow SATA drives
> > and dispatching too many requests from a single queue can only serve to
> > increase the latency.
> ok, so do you mean that for a fast drive, cfq isn't recommended and
> deadline is always prefered? ;) We have a SAS with queue_depth=128, so
> it should be a fast drive I guess. :)

I think in general that has been true in my experience. SSDs are still
ok with CFQ because that sets nonrotational flag and cuts down on 
idling. But if it is a rotational media which can handle multiple
parallel requests at a time you might have better throughput results
with deadline.

[..]
> > Its latency vs throughput tradeoff.
> ok, so it seems that all these are designed, not a bug. Thanks for the
> clarification.
> 
> btw, reverting the patch doesn't work. I can still get the livelock.

Can you give following patch a try and see if it helps. On my system this
does allow CFQ to dispatch some writes once in a while.

thanks
Vivek

---
 block/cfq-iosched.c |   25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c	2011-06-09 11:44:40.000000000 -0400
+++ linux-2.6/block/cfq-iosched.c	2011-06-09 14:04:01.036983301 -0400
@@ -303,6 +303,8 @@ struct cfq_data {
 
 	/* Number of groups which are on blkcg->blkg_list */
 	unsigned int nr_blkcg_linked_grps;
+
+	unsigned long last_async_dispatched;
 };
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
@@ -2063,6 +2065,10 @@ static void cfq_dispatch_insert(struct r
 
 	cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
 	cfqq->nr_sectors += blk_rq_sectors(rq);
+
+	if (!cfq_cfqq_sync(cfqq))
+		cfqd->last_async_dispatched = jiffies;
+
 	cfq_blkiocg_update_dispatch_stats(&cfqq->cfqg->blkg, blk_rq_bytes(rq),
 					rq_data_dir(rq), rq_is_sync(rq));
 }
@@ -3315,8 +3321,25 @@ cfq_should_preempt(struct cfq_data *cfqd
 	 * if the new request is sync, but the currently running queue is
 	 * not, let the sync request have priority.
 	 */
-	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
+	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq)) {
+		unsigned long async_delay = 0;
+
+		async_delay = jiffies - cfqd->last_async_dispatched;
+
+		/*
+		 * CFQ is heavily loaded in favor of sync queues and that
+ 		 * can lead to starvation of async queues. If it has been
+ 		 * too long since last async request was dispatched, don't
+ 		 * preempt async queue
+ 		 *
+ 		 * Once we have per group async queues, this will need
+ 		 * modification.
+ 		 */
+		if (async_delay > 2 * HZ)
+			return false;
+
 		return true;
+	}
 
 	if (new_cfqq->cfqg != cfqq->cfqg)
 		return false;

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-09 15:37     ` Vivek Goyal
  2011-06-09 15:44       ` Tao Ma
@ 2011-06-10  1:19       ` Shaohua Li
  2011-06-10  1:34         ` Shaohua Li
  2011-06-10  9:17         ` Vivek Goyal
  1 sibling, 2 replies; 33+ messages in thread
From: Shaohua Li @ 2011-06-10  1:19 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Tao Ma, linux-kernel, Jens Axboe

2011/6/9 Vivek Goyal <vgoyal@redhat.com>:
> On Thu, Jun 09, 2011 at 10:47:43PM +0800, Tao Ma wrote:
>> Hi Vivek,
>>       Thanks for the quick response.
>> On 06/09/2011 10:14 PM, Vivek Goyal wrote:
>> > On Thu, Jun 09, 2011 at 06:49:37PM +0800, Tao Ma wrote:
>> >> Hi Jens and Vivek,
>> >>    We are current running some heavy ext4 metadata test,
>> >> and we found a very severe problem for CFQ. Please correct me if
>> >> my statement below is wrong.
>> >>
>> >> CFQ only has an async queue for every priority of every class and
>> >> these queues have a very low serving priority, so if the system
>> >> has a large number of sync reads, these queues will be delayed a
>> >> lot of time. As a result, the flushers will be blocked, then the
>> >> journal and finally our applications[1].
>> >>
>> >> I have tried to let jbd/2 to use WRITE_SYNC so that they can checkpoint
>> >> in time and the patches are sent. But today we found another similar
>> >> block in kswapd which make me think that maybe CFQ should be changed
>> >> somehow so that all these callers can benefit from it.
>> >>
>> >> So is there any way to let the async queue work timely or at least
>> >> is there any deadline for async queue to finish an request in time
>> >> even in case there are many reads?
>> >>
>> >> btw, We have tested deadline scheduler and it seems to work in our test.
>> >>
>> >> [1] the message we get from one system:
>> >> INFO: task flush-8:0:2950 blocked for more than 120 seconds.
>> >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> >> flush-8:0       D ffff88062bfde738     0  2950      2 0x00000000
>> >>  ffff88062b137820 0000000000000046 ffff88062b137750 ffffffff812b7bc3
>> >>  ffff88032cddc000 ffff88062bfde380 ffff88032d3d8840 0000000c2be37400
>> >>  000000002be37601 0000000000000006 ffff88062b137760 ffffffff811c242e
>> >> Call Trace:
>> >>  [<ffffffff812b7bc3>] ? scsi_request_fn+0x345/0x3df
>> >>  [<ffffffff811c242e>] ? __blk_run_queue+0x1a/0x1c
>> >>  [<ffffffff811c57cc>] ? queue_unplugged+0x77/0x8e
>> >>  [<ffffffff813dbe67>] io_schedule+0x47/0x61
>> >>  [<ffffffff811c512c>] get_request_wait+0xe0/0x152
>> >
>> > Ok, so flush slept on trying to get a "request" allocated on request
>> > queue. That means all the ASYNC request descriptors are already consumed
>> > and we are not making progress with ASYNc requests.
>> >
>> > A relatively recent patch allowed sync queues to always preempt async queues
>> > and schedule sync workload instead of async. This had the potential to
>> > starve async queues and looks like that's what we are running into.
>> >
>> > commit f8ae6e3eb8251be32c6e913393d9f8d9e0609489
>> > Author: Shaohua Li <shaohua.li@intel.com>
>> > Date:   Fri Jan 14 08:41:02 2011 +0100
>> >
>> >     block cfq: make queue preempt work for queues from different workload
>> >
>> > Do you have few seconds of blktrace. I just wanted to verify that this
>> > is what we are running into.
>> We are using the latest kernel, so the patch is already there. :(
>>
>> You are right that all the requests have been allocated and the flusher
>> is waiting for requests to be available. But the root cause is that in
>> heavy sync read, the async queue in cfq is delayed too much. I have
>> added some traces in the cfq codes path and after several investigation,
>> I found several interesting things and tried to improve it. But I am not
>> sure whether it is bug or it is designed intentionally.
>>
>> 1. In cfq_dispatch_requests we select a sync queue to serve, but if the
>> queue has too much requests in flight, the cfq_slice_used_soon may be
>> true and the cfqq isn't allowed to send and will waste some timeslice.
>> Then why choose this cfqq? Why not choose a qualified one?
>
> CFQ in general tries not to drive too deep a queue depth in an effort
> to improve latencies. CFQ is generally recommened for slow SATA drives
> and dispatching too many requests from a single queue can only serve to
> increase the latency.
>
>>
>> 2. async queue isn't allowed to be sent if there is some sync request in
>> fly, but as now most of the devices has a greater depth, should we
>> improve it somehow? I guess queue_depth should be a valid number maybe?
>
> We seem to be running this batching thing in cfq_may_dispatch() where
> we drain sync requests before async is dispatched and vice-a-versa.
> I am not sure how does this batching thing helps. I think Jens should
> be a better person to comment on that.
>
> I ran a fio job with few readers and few writers. I do see that few times
> we have schedule ASYNC workload/queue but did not dispatch a request
> from that. And reason being that there are sync requests in flight. And
> by the time sync requests finish, async queue gets preempted.
>
> So async queue does it scheduled but never gets a chance to dispatch
> a request because there was sync IO in flight.
>
> If there is no major advantage of draining sync requests before async
> is dispatched, I think this should be an easy fix.
I thought this is to avoid sync latency if we switch from an async
queue to sync queue later.

>> 3. Even there is no sync i/o, the async queue isn't allowed to send too
>> much requests because of the check in cfq_may_dispatch "Async queues
>> must wait a bit before being allowed dispatch", so in my test the async
>> queue has several chances to be selected, but it is only allowed
>> todispatch one request at a time. It is really amazing.
>
> Again heavily loaded to improve sync latencies. Say you have queue
> depth of 128 and you fill that all with async requests because right
> now there is no sync request around. Then a sync request comes in.
> We don't have a way to give it a priority and it might happen that
> it gets executed after 128 async requests have finished (driver and
> drive dependent though).
>
> So in an attempt to improve sync latencies we don't drive too
> high queue depths.
>
> Its latency vs throughput tradeoff.
The current cfq do be able to stave async queue, because we want to give small
latency to sync queue.
I agree we should do something to improve async starvation, but the
problem is how
long async queue slice should be. A sd card I tested has very high
latency for write. A 4k write can take > 300ms. Just dispatching a
singe write can dramatically impact
read throughput. Even in modern SSD, read is several times faster than write.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10  1:19       ` Shaohua Li
@ 2011-06-10  1:34         ` Shaohua Li
  2011-06-10  2:06           ` Tao Ma
  2011-06-10  9:17         ` Vivek Goyal
  1 sibling, 1 reply; 33+ messages in thread
From: Shaohua Li @ 2011-06-10  1:34 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Tao Ma, linux-kernel, Jens Axboe

2011/6/10 Shaohua Li <shaohua.li@intel.com>:
> 2011/6/9 Vivek Goyal <vgoyal@redhat.com>:
>> On Thu, Jun 09, 2011 at 10:47:43PM +0800, Tao Ma wrote:
>>> Hi Vivek,
>>>       Thanks for the quick response.
>>> On 06/09/2011 10:14 PM, Vivek Goyal wrote:
>>> > On Thu, Jun 09, 2011 at 06:49:37PM +0800, Tao Ma wrote:
>>> >> Hi Jens and Vivek,
>>> >>    We are current running some heavy ext4 metadata test,
>>> >> and we found a very severe problem for CFQ. Please correct me if
>>> >> my statement below is wrong.
>>> >>
>>> >> CFQ only has an async queue for every priority of every class and
>>> >> these queues have a very low serving priority, so if the system
>>> >> has a large number of sync reads, these queues will be delayed a
>>> >> lot of time. As a result, the flushers will be blocked, then the
>>> >> journal and finally our applications[1].
>>> >>
>>> >> I have tried to let jbd/2 to use WRITE_SYNC so that they can checkpoint
>>> >> in time and the patches are sent. But today we found another similar
>>> >> block in kswapd which make me think that maybe CFQ should be changed
>>> >> somehow so that all these callers can benefit from it.
>>> >>
>>> >> So is there any way to let the async queue work timely or at least
>>> >> is there any deadline for async queue to finish an request in time
>>> >> even in case there are many reads?
>>> >>
>>> >> btw, We have tested deadline scheduler and it seems to work in our test.
>>> >>
>>> >> [1] the message we get from one system:
>>> >> INFO: task flush-8:0:2950 blocked for more than 120 seconds.
>>> >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> >> flush-8:0       D ffff88062bfde738     0  2950      2 0x00000000
>>> >>  ffff88062b137820 0000000000000046 ffff88062b137750 ffffffff812b7bc3
>>> >>  ffff88032cddc000 ffff88062bfde380 ffff88032d3d8840 0000000c2be37400
>>> >>  000000002be37601 0000000000000006 ffff88062b137760 ffffffff811c242e
>>> >> Call Trace:
>>> >>  [<ffffffff812b7bc3>] ? scsi_request_fn+0x345/0x3df
>>> >>  [<ffffffff811c242e>] ? __blk_run_queue+0x1a/0x1c
>>> >>  [<ffffffff811c57cc>] ? queue_unplugged+0x77/0x8e
>>> >>  [<ffffffff813dbe67>] io_schedule+0x47/0x61
>>> >>  [<ffffffff811c512c>] get_request_wait+0xe0/0x152
>>> >
>>> > Ok, so flush slept on trying to get a "request" allocated on request
>>> > queue. That means all the ASYNC request descriptors are already consumed
>>> > and we are not making progress with ASYNc requests.
>>> >
>>> > A relatively recent patch allowed sync queues to always preempt async queues
>>> > and schedule sync workload instead of async. This had the potential to
>>> > starve async queues and looks like that's what we are running into.
>>> >
>>> > commit f8ae6e3eb8251be32c6e913393d9f8d9e0609489
>>> > Author: Shaohua Li <shaohua.li@intel.com>
>>> > Date:   Fri Jan 14 08:41:02 2011 +0100
>>> >
>>> >     block cfq: make queue preempt work for queues from different workload
>>> >
>>> > Do you have few seconds of blktrace. I just wanted to verify that this
>>> > is what we are running into.
>>> We are using the latest kernel, so the patch is already there. :(
>>>
>>> You are right that all the requests have been allocated and the flusher
>>> is waiting for requests to be available. But the root cause is that in
>>> heavy sync read, the async queue in cfq is delayed too much. I have
>>> added some traces in the cfq codes path and after several investigation,
>>> I found several interesting things and tried to improve it. But I am not
>>> sure whether it is bug or it is designed intentionally.
>>>
>>> 1. In cfq_dispatch_requests we select a sync queue to serve, but if the
>>> queue has too much requests in flight, the cfq_slice_used_soon may be
>>> true and the cfqq isn't allowed to send and will waste some timeslice.
>>> Then why choose this cfqq? Why not choose a qualified one?
>>
>> CFQ in general tries not to drive too deep a queue depth in an effort
>> to improve latencies. CFQ is generally recommened for slow SATA drives
>> and dispatching too many requests from a single queue can only serve to
>> increase the latency.
>>
>>>
>>> 2. async queue isn't allowed to be sent if there is some sync request in
>>> fly, but as now most of the devices has a greater depth, should we
>>> improve it somehow? I guess queue_depth should be a valid number maybe?
>>
>> We seem to be running this batching thing in cfq_may_dispatch() where
>> we drain sync requests before async is dispatched and vice-a-versa.
>> I am not sure how does this batching thing helps. I think Jens should
>> be a better person to comment on that.
>>
>> I ran a fio job with few readers and few writers. I do see that few times
>> we have schedule ASYNC workload/queue but did not dispatch a request
>> from that. And reason being that there are sync requests in flight. And
>> by the time sync requests finish, async queue gets preempted.
>>
>> So async queue does it scheduled but never gets a chance to dispatch
>> a request because there was sync IO in flight.
>>
>> If there is no major advantage of draining sync requests before async
>> is dispatched, I think this should be an easy fix.
> I thought this is to avoid sync latency if we switch from an async
> queue to sync queue later.
>
>>> 3. Even there is no sync i/o, the async queue isn't allowed to send too
>>> much requests because of the check in cfq_may_dispatch "Async queues
>>> must wait a bit before being allowed dispatch", so in my test the async
>>> queue has several chances to be selected, but it is only allowed
>>> todispatch one request at a time. It is really amazing.
>>
>> Again heavily loaded to improve sync latencies. Say you have queue
>> depth of 128 and you fill that all with async requests because right
>> now there is no sync request around. Then a sync request comes in.
>> We don't have a way to give it a priority and it might happen that
>> it gets executed after 128 async requests have finished (driver and
>> drive dependent though).
>>
>> So in an attempt to improve sync latencies we don't drive too
>> high queue depths.
>>
>> Its latency vs throughput tradeoff.
> The current cfq do be able to stave async queue, because we want to give small
> latency to sync queue.
> I agree we should do something to improve async starvation, but the
> problem is how
> long async queue slice should be. A sd card I tested has very high
> latency for write. A 4k write can take > 300ms. Just dispatching a
> singe write can dramatically impact
> read throughput. Even in modern SSD, read is several times faster than write.
My previous experiment is if a queue is preempted, it will not be
preempted at the second time. This can improve something, but can't
resolve the problem completely.
I thought we can't completely solve this issue if we give high
priority to sync queue,
async queue is unavoidable to be able starved.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10  1:34         ` Shaohua Li
@ 2011-06-10  2:06           ` Tao Ma
  2011-06-10  2:35             ` Shaohua Li
  0 siblings, 1 reply; 33+ messages in thread
From: Tao Ma @ 2011-06-10  2:06 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Vivek Goyal, linux-kernel, Jens Axboe

On 06/10/2011 09:34 AM, Shaohua Li wrote:
> 2011/6/10 Shaohua Li <shaohua.li@intel.com>:
>> 2011/6/9 Vivek Goyal <vgoyal@redhat.com>:
>>> On Thu, Jun 09, 2011 at 10:47:43PM +0800, Tao Ma wrote:
>>>> Hi Vivek,
>>>>       Thanks for the quick response.
>>>> On 06/09/2011 10:14 PM, Vivek Goyal wrote:
>>>>> On Thu, Jun 09, 2011 at 06:49:37PM +0800, Tao Ma wrote:
>>>>>> Hi Jens and Vivek,
>>>>>>    We are current running some heavy ext4 metadata test,
>>>>>> and we found a very severe problem for CFQ. Please correct me if
>>>>>> my statement below is wrong.
>>>>>>
>>>>>> CFQ only has an async queue for every priority of every class and
>>>>>> these queues have a very low serving priority, so if the system
>>>>>> has a large number of sync reads, these queues will be delayed a
>>>>>> lot of time. As a result, the flushers will be blocked, then the
>>>>>> journal and finally our applications[1].
>>>>>>
>>>>>> I have tried to let jbd/2 to use WRITE_SYNC so that they can checkpoint
>>>>>> in time and the patches are sent. But today we found another similar
>>>>>> block in kswapd which make me think that maybe CFQ should be changed
>>>>>> somehow so that all these callers can benefit from it.
>>>>>>
>>>>>> So is there any way to let the async queue work timely or at least
>>>>>> is there any deadline for async queue to finish an request in time
>>>>>> even in case there are many reads?
>>>>>>
>>>>>> btw, We have tested deadline scheduler and it seems to work in our test.
>>>>>>
>>>>>> [1] the message we get from one system:
>>>>>> INFO: task flush-8:0:2950 blocked for more than 120 seconds.
>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>> flush-8:0       D ffff88062bfde738     0  2950      2 0x00000000
>>>>>>  ffff88062b137820 0000000000000046 ffff88062b137750 ffffffff812b7bc3
>>>>>>  ffff88032cddc000 ffff88062bfde380 ffff88032d3d8840 0000000c2be37400
>>>>>>  000000002be37601 0000000000000006 ffff88062b137760 ffffffff811c242e
>>>>>> Call Trace:
>>>>>>  [<ffffffff812b7bc3>] ? scsi_request_fn+0x345/0x3df
>>>>>>  [<ffffffff811c242e>] ? __blk_run_queue+0x1a/0x1c
>>>>>>  [<ffffffff811c57cc>] ? queue_unplugged+0x77/0x8e
>>>>>>  [<ffffffff813dbe67>] io_schedule+0x47/0x61
>>>>>>  [<ffffffff811c512c>] get_request_wait+0xe0/0x152
>>>>>
>>>>> Ok, so flush slept on trying to get a "request" allocated on request
>>>>> queue. That means all the ASYNC request descriptors are already consumed
>>>>> and we are not making progress with ASYNc requests.
>>>>>
>>>>> A relatively recent patch allowed sync queues to always preempt async queues
>>>>> and schedule sync workload instead of async. This had the potential to
>>>>> starve async queues and looks like that's what we are running into.
>>>>>
>>>>> commit f8ae6e3eb8251be32c6e913393d9f8d9e0609489
>>>>> Author: Shaohua Li <shaohua.li@intel.com>
>>>>> Date:   Fri Jan 14 08:41:02 2011 +0100
>>>>>
>>>>>     block cfq: make queue preempt work for queues from different workload
>>>>>
>>>>> Do you have few seconds of blktrace. I just wanted to verify that this
>>>>> is what we are running into.
>>>> We are using the latest kernel, so the patch is already there. :(
>>>>
>>>> You are right that all the requests have been allocated and the flusher
>>>> is waiting for requests to be available. But the root cause is that in
>>>> heavy sync read, the async queue in cfq is delayed too much. I have
>>>> added some traces in the cfq codes path and after several investigation,
>>>> I found several interesting things and tried to improve it. But I am not
>>>> sure whether it is bug or it is designed intentionally.
>>>>
>>>> 1. In cfq_dispatch_requests we select a sync queue to serve, but if the
>>>> queue has too much requests in flight, the cfq_slice_used_soon may be
>>>> true and the cfqq isn't allowed to send and will waste some timeslice.
>>>> Then why choose this cfqq? Why not choose a qualified one?
>>>
>>> CFQ in general tries not to drive too deep a queue depth in an effort
>>> to improve latencies. CFQ is generally recommened for slow SATA drives
>>> and dispatching too many requests from a single queue can only serve to
>>> increase the latency.
>>>
>>>>
>>>> 2. async queue isn't allowed to be sent if there is some sync request in
>>>> fly, but as now most of the devices has a greater depth, should we
>>>> improve it somehow? I guess queue_depth should be a valid number maybe?
>>>
>>> We seem to be running this batching thing in cfq_may_dispatch() where
>>> we drain sync requests before async is dispatched and vice-a-versa.
>>> I am not sure how does this batching thing helps. I think Jens should
>>> be a better person to comment on that.
>>>
>>> I ran a fio job with few readers and few writers. I do see that few times
>>> we have schedule ASYNC workload/queue but did not dispatch a request
>>> from that. And reason being that there are sync requests in flight. And
>>> by the time sync requests finish, async queue gets preempted.
>>>
>>> So async queue does it scheduled but never gets a chance to dispatch
>>> a request because there was sync IO in flight.
>>>
>>> If there is no major advantage of draining sync requests before async
>>> is dispatched, I think this should be an easy fix.
>> I thought this is to avoid sync latency if we switch from an async
>> queue to sync queue later.
>>
>>>> 3. Even there is no sync i/o, the async queue isn't allowed to send too
>>>> much requests because of the check in cfq_may_dispatch "Async queues
>>>> must wait a bit before being allowed dispatch", so in my test the async
>>>> queue has several chances to be selected, but it is only allowed
>>>> todispatch one request at a time. It is really amazing.
>>>
>>> Again heavily loaded to improve sync latencies. Say you have queue
>>> depth of 128 and you fill that all with async requests because right
>>> now there is no sync request around. Then a sync request comes in.
>>> We don't have a way to give it a priority and it might happen that
>>> it gets executed after 128 async requests have finished (driver and
>>> drive dependent though).
>>>
>>> So in an attempt to improve sync latencies we don't drive too
>>> high queue depths.
>>>
>>> Its latency vs throughput tradeoff.
>> The current cfq do be able to stave async queue, because we want to give small
>> latency to sync queue.
>> I agree we should do something to improve async starvation, but the
>> problem is how
>> long async queue slice should be. A sd card I tested has very high
>> latency for write. A 4k write can take > 300ms. Just dispatching a
>> singe write can dramatically impact
>> read throughput. Even in modern SSD, read is several times faster than write.
> My previous experiment is if a queue is preempted, it will not be
> preempted at the second time. This can improve something, but can't
> resolve the problem completely.
> I thought we can't completely solve this issue if we give high
> priority to sync queue,
> async queue is unavoidable to be able starved.
I am fine that async queue tends to be starved, but the problem is that
it is been starved too long. And although it is asynced, some sync
process are waiting for it, like jbd2 in my test. So CFQ should provide
us with at least some deadline. In my tests sometimes flusher is blocked
for more than 500 seconds before it can be waken up in get_request_wait.
 It is really amazing for me.

Regards,
Tao

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10  2:06           ` Tao Ma
@ 2011-06-10  2:35             ` Shaohua Li
  2011-06-10  3:02               ` Tao Ma
  0 siblings, 1 reply; 33+ messages in thread
From: Shaohua Li @ 2011-06-10  2:35 UTC (permalink / raw)
  To: Tao Ma; +Cc: Vivek Goyal, linux-kernel, Jens Axboe

2011/6/10 Tao Ma <tm@tao.ma>:
> On 06/10/2011 09:34 AM, Shaohua Li wrote:
>> 2011/6/10 Shaohua Li <shaohua.li@intel.com>:
>>> 2011/6/9 Vivek Goyal <vgoyal@redhat.com>:
>>>> On Thu, Jun 09, 2011 at 10:47:43PM +0800, Tao Ma wrote:
>>>>> Hi Vivek,
>>>>>       Thanks for the quick response.
>>>>> On 06/09/2011 10:14 PM, Vivek Goyal wrote:
>>>>>> On Thu, Jun 09, 2011 at 06:49:37PM +0800, Tao Ma wrote:
>>>>>>> Hi Jens and Vivek,
>>>>>>>    We are current running some heavy ext4 metadata test,
>>>>>>> and we found a very severe problem for CFQ. Please correct me if
>>>>>>> my statement below is wrong.
>>>>>>>
>>>>>>> CFQ only has an async queue for every priority of every class and
>>>>>>> these queues have a very low serving priority, so if the system
>>>>>>> has a large number of sync reads, these queues will be delayed a
>>>>>>> lot of time. As a result, the flushers will be blocked, then the
>>>>>>> journal and finally our applications[1].
>>>>>>>
>>>>>>> I have tried to let jbd/2 to use WRITE_SYNC so that they can checkpoint
>>>>>>> in time and the patches are sent. But today we found another similar
>>>>>>> block in kswapd which make me think that maybe CFQ should be changed
>>>>>>> somehow so that all these callers can benefit from it.
>>>>>>>
>>>>>>> So is there any way to let the async queue work timely or at least
>>>>>>> is there any deadline for async queue to finish an request in time
>>>>>>> even in case there are many reads?
>>>>>>>
>>>>>>> btw, We have tested deadline scheduler and it seems to work in our test.
>>>>>>>
>>>>>>> [1] the message we get from one system:
>>>>>>> INFO: task flush-8:0:2950 blocked for more than 120 seconds.
>>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>> flush-8:0       D ffff88062bfde738     0  2950      2 0x00000000
>>>>>>>  ffff88062b137820 0000000000000046 ffff88062b137750 ffffffff812b7bc3
>>>>>>>  ffff88032cddc000 ffff88062bfde380 ffff88032d3d8840 0000000c2be37400
>>>>>>>  000000002be37601 0000000000000006 ffff88062b137760 ffffffff811c242e
>>>>>>> Call Trace:
>>>>>>>  [<ffffffff812b7bc3>] ? scsi_request_fn+0x345/0x3df
>>>>>>>  [<ffffffff811c242e>] ? __blk_run_queue+0x1a/0x1c
>>>>>>>  [<ffffffff811c57cc>] ? queue_unplugged+0x77/0x8e
>>>>>>>  [<ffffffff813dbe67>] io_schedule+0x47/0x61
>>>>>>>  [<ffffffff811c512c>] get_request_wait+0xe0/0x152
>>>>>>
>>>>>> Ok, so flush slept on trying to get a "request" allocated on request
>>>>>> queue. That means all the ASYNC request descriptors are already consumed
>>>>>> and we are not making progress with ASYNc requests.
>>>>>>
>>>>>> A relatively recent patch allowed sync queues to always preempt async queues
>>>>>> and schedule sync workload instead of async. This had the potential to
>>>>>> starve async queues and looks like that's what we are running into.
>>>>>>
>>>>>> commit f8ae6e3eb8251be32c6e913393d9f8d9e0609489
>>>>>> Author: Shaohua Li <shaohua.li@intel.com>
>>>>>> Date:   Fri Jan 14 08:41:02 2011 +0100
>>>>>>
>>>>>>     block cfq: make queue preempt work for queues from different workload
>>>>>>
>>>>>> Do you have few seconds of blktrace. I just wanted to verify that this
>>>>>> is what we are running into.
>>>>> We are using the latest kernel, so the patch is already there. :(
>>>>>
>>>>> You are right that all the requests have been allocated and the flusher
>>>>> is waiting for requests to be available. But the root cause is that in
>>>>> heavy sync read, the async queue in cfq is delayed too much. I have
>>>>> added some traces in the cfq codes path and after several investigation,
>>>>> I found several interesting things and tried to improve it. But I am not
>>>>> sure whether it is bug or it is designed intentionally.
>>>>>
>>>>> 1. In cfq_dispatch_requests we select a sync queue to serve, but if the
>>>>> queue has too much requests in flight, the cfq_slice_used_soon may be
>>>>> true and the cfqq isn't allowed to send and will waste some timeslice.
>>>>> Then why choose this cfqq? Why not choose a qualified one?
>>>>
>>>> CFQ in general tries not to drive too deep a queue depth in an effort
>>>> to improve latencies. CFQ is generally recommened for slow SATA drives
>>>> and dispatching too many requests from a single queue can only serve to
>>>> increase the latency.
>>>>
>>>>>
>>>>> 2. async queue isn't allowed to be sent if there is some sync request in
>>>>> fly, but as now most of the devices has a greater depth, should we
>>>>> improve it somehow? I guess queue_depth should be a valid number maybe?
>>>>
>>>> We seem to be running this batching thing in cfq_may_dispatch() where
>>>> we drain sync requests before async is dispatched and vice-a-versa.
>>>> I am not sure how does this batching thing helps. I think Jens should
>>>> be a better person to comment on that.
>>>>
>>>> I ran a fio job with few readers and few writers. I do see that few times
>>>> we have schedule ASYNC workload/queue but did not dispatch a request
>>>> from that. And reason being that there are sync requests in flight. And
>>>> by the time sync requests finish, async queue gets preempted.
>>>>
>>>> So async queue does it scheduled but never gets a chance to dispatch
>>>> a request because there was sync IO in flight.
>>>>
>>>> If there is no major advantage of draining sync requests before async
>>>> is dispatched, I think this should be an easy fix.
>>> I thought this is to avoid sync latency if we switch from an async
>>> queue to sync queue later.
>>>
>>>>> 3. Even there is no sync i/o, the async queue isn't allowed to send too
>>>>> much requests because of the check in cfq_may_dispatch "Async queues
>>>>> must wait a bit before being allowed dispatch", so in my test the async
>>>>> queue has several chances to be selected, but it is only allowed
>>>>> todispatch one request at a time. It is really amazing.
>>>>
>>>> Again heavily loaded to improve sync latencies. Say you have queue
>>>> depth of 128 and you fill that all with async requests because right
>>>> now there is no sync request around. Then a sync request comes in.
>>>> We don't have a way to give it a priority and it might happen that
>>>> it gets executed after 128 async requests have finished (driver and
>>>> drive dependent though).
>>>>
>>>> So in an attempt to improve sync latencies we don't drive too
>>>> high queue depths.
>>>>
>>>> Its latency vs throughput tradeoff.
>>> The current cfq do be able to stave async queue, because we want to give small
>>> latency to sync queue.
>>> I agree we should do something to improve async starvation, but the
>>> problem is how
>>> long async queue slice should be. A sd card I tested has very high
>>> latency for write. A 4k write can take > 300ms. Just dispatching a
>>> singe write can dramatically impact
>>> read throughput. Even in modern SSD, read is several times faster than write.
>> My previous experiment is if a queue is preempted, it will not be
>> preempted at the second time. This can improve something, but can't
>> resolve the problem completely.
>> I thought we can't completely solve this issue if we give high
>> priority to sync queue,
>> async queue is unavoidable to be able starved.
> I am fine that async queue tends to be starved, but the problem is that
> it is been starved too long. And although it is asynced, some sync
> process are waiting for it, like jbd2 in my test. So CFQ should provide
> us with at least some deadline. In my tests sometimes flusher is blocked
> for more than 500 seconds before it can be waken up in get_request_wait.
>  It is really amazing for me.
I don't understand. If your sync thread is waiting for the async one,
the sync thread will not deliver request, and then async thread will
run then, why there is starvation? maybe you can give more details of
your workload.
I don't think we can give a deadline for async request, because we
still want to give sync high priority. We can give async some slices,
so for a workload of small number of async requests and large number
sync requests, we don't starve async too much. But for a workload with
large number of sync/async requests, async will be starved for sure
and we can't solve this in cfq.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10  2:35             ` Shaohua Li
@ 2011-06-10  3:02               ` Tao Ma
  2011-06-10  9:20                 ` Vivek Goyal
  0 siblings, 1 reply; 33+ messages in thread
From: Tao Ma @ 2011-06-10  3:02 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Vivek Goyal, linux-kernel, Jens Axboe

On 06/10/2011 10:35 AM, Shaohua Li wrote:
> 2011/6/10 Tao Ma <tm@tao.ma>:
>> On 06/10/2011 09:34 AM, Shaohua Li wrote:
>>> 2011/6/10 Shaohua Li <shaohua.li@intel.com>:
>>>> 2011/6/9 Vivek Goyal <vgoyal@redhat.com>:
>>>>> On Thu, Jun 09, 2011 at 10:47:43PM +0800, Tao Ma wrote:
>>>>>> Hi Vivek,
>>>>>>       Thanks for the quick response.
>>>>>> On 06/09/2011 10:14 PM, Vivek Goyal wrote:
>>>>>>> On Thu, Jun 09, 2011 at 06:49:37PM +0800, Tao Ma wrote:
>>>>>>>> Hi Jens and Vivek,
>>>>>>>>    We are current running some heavy ext4 metadata test,
>>>>>>>> and we found a very severe problem for CFQ. Please correct me if
>>>>>>>> my statement below is wrong.
>>>>>>>>
>>>>>>>> CFQ only has an async queue for every priority of every class and
>>>>>>>> these queues have a very low serving priority, so if the system
>>>>>>>> has a large number of sync reads, these queues will be delayed a
>>>>>>>> lot of time. As a result, the flushers will be blocked, then the
>>>>>>>> journal and finally our applications[1].
>>>>>>>>
>>>>>>>> I have tried to let jbd/2 to use WRITE_SYNC so that they can checkpoint
>>>>>>>> in time and the patches are sent. But today we found another similar
>>>>>>>> block in kswapd which make me think that maybe CFQ should be changed
>>>>>>>> somehow so that all these callers can benefit from it.
>>>>>>>>
>>>>>>>> So is there any way to let the async queue work timely or at least
>>>>>>>> is there any deadline for async queue to finish an request in time
>>>>>>>> even in case there are many reads?
>>>>>>>>
>>>>>>>> btw, We have tested deadline scheduler and it seems to work in our test.
>>>>>>>>
>>>>>>>> [1] the message we get from one system:
>>>>>>>> INFO: task flush-8:0:2950 blocked for more than 120 seconds.
>>>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>> flush-8:0       D ffff88062bfde738     0  2950      2 0x00000000
>>>>>>>>  ffff88062b137820 0000000000000046 ffff88062b137750 ffffffff812b7bc3
>>>>>>>>  ffff88032cddc000 ffff88062bfde380 ffff88032d3d8840 0000000c2be37400
>>>>>>>>  000000002be37601 0000000000000006 ffff88062b137760 ffffffff811c242e
>>>>>>>> Call Trace:
>>>>>>>>  [<ffffffff812b7bc3>] ? scsi_request_fn+0x345/0x3df
>>>>>>>>  [<ffffffff811c242e>] ? __blk_run_queue+0x1a/0x1c
>>>>>>>>  [<ffffffff811c57cc>] ? queue_unplugged+0x77/0x8e
>>>>>>>>  [<ffffffff813dbe67>] io_schedule+0x47/0x61
>>>>>>>>  [<ffffffff811c512c>] get_request_wait+0xe0/0x152
>>>>>>>
>>>>>>> Ok, so flush slept on trying to get a "request" allocated on request
>>>>>>> queue. That means all the ASYNC request descriptors are already consumed
>>>>>>> and we are not making progress with ASYNc requests.
>>>>>>>
>>>>>>> A relatively recent patch allowed sync queues to always preempt async queues
>>>>>>> and schedule sync workload instead of async. This had the potential to
>>>>>>> starve async queues and looks like that's what we are running into.
>>>>>>>
>>>>>>> commit f8ae6e3eb8251be32c6e913393d9f8d9e0609489
>>>>>>> Author: Shaohua Li <shaohua.li@intel.com>
>>>>>>> Date:   Fri Jan 14 08:41:02 2011 +0100
>>>>>>>
>>>>>>>     block cfq: make queue preempt work for queues from different workload
>>>>>>>
>>>>>>> Do you have few seconds of blktrace. I just wanted to verify that this
>>>>>>> is what we are running into.
>>>>>> We are using the latest kernel, so the patch is already there. :(
>>>>>>
>>>>>> You are right that all the requests have been allocated and the flusher
>>>>>> is waiting for requests to be available. But the root cause is that in
>>>>>> heavy sync read, the async queue in cfq is delayed too much. I have
>>>>>> added some traces in the cfq codes path and after several investigation,
>>>>>> I found several interesting things and tried to improve it. But I am not
>>>>>> sure whether it is bug or it is designed intentionally.
>>>>>>
>>>>>> 1. In cfq_dispatch_requests we select a sync queue to serve, but if the
>>>>>> queue has too much requests in flight, the cfq_slice_used_soon may be
>>>>>> true and the cfqq isn't allowed to send and will waste some timeslice.
>>>>>> Then why choose this cfqq? Why not choose a qualified one?
>>>>>
>>>>> CFQ in general tries not to drive too deep a queue depth in an effort
>>>>> to improve latencies. CFQ is generally recommened for slow SATA drives
>>>>> and dispatching too many requests from a single queue can only serve to
>>>>> increase the latency.
>>>>>
>>>>>>
>>>>>> 2. async queue isn't allowed to be sent if there is some sync request in
>>>>>> fly, but as now most of the devices has a greater depth, should we
>>>>>> improve it somehow? I guess queue_depth should be a valid number maybe?
>>>>>
>>>>> We seem to be running this batching thing in cfq_may_dispatch() where
>>>>> we drain sync requests before async is dispatched and vice-a-versa.
>>>>> I am not sure how does this batching thing helps. I think Jens should
>>>>> be a better person to comment on that.
>>>>>
>>>>> I ran a fio job with few readers and few writers. I do see that few times
>>>>> we have schedule ASYNC workload/queue but did not dispatch a request
>>>>> from that. And reason being that there are sync requests in flight. And
>>>>> by the time sync requests finish, async queue gets preempted.
>>>>>
>>>>> So async queue does it scheduled but never gets a chance to dispatch
>>>>> a request because there was sync IO in flight.
>>>>>
>>>>> If there is no major advantage of draining sync requests before async
>>>>> is dispatched, I think this should be an easy fix.
>>>> I thought this is to avoid sync latency if we switch from an async
>>>> queue to sync queue later.
>>>>
>>>>>> 3. Even there is no sync i/o, the async queue isn't allowed to send too
>>>>>> much requests because of the check in cfq_may_dispatch "Async queues
>>>>>> must wait a bit before being allowed dispatch", so in my test the async
>>>>>> queue has several chances to be selected, but it is only allowed
>>>>>> todispatch one request at a time. It is really amazing.
>>>>>
>>>>> Again heavily loaded to improve sync latencies. Say you have queue
>>>>> depth of 128 and you fill that all with async requests because right
>>>>> now there is no sync request around. Then a sync request comes in.
>>>>> We don't have a way to give it a priority and it might happen that
>>>>> it gets executed after 128 async requests have finished (driver and
>>>>> drive dependent though).
>>>>>
>>>>> So in an attempt to improve sync latencies we don't drive too
>>>>> high queue depths.
>>>>>
>>>>> Its latency vs throughput tradeoff.
>>>> The current cfq do be able to stave async queue, because we want to give small
>>>> latency to sync queue.
>>>> I agree we should do something to improve async starvation, but the
>>>> problem is how
>>>> long async queue slice should be. A sd card I tested has very high
>>>> latency for write. A 4k write can take > 300ms. Just dispatching a
>>>> singe write can dramatically impact
>>>> read throughput. Even in modern SSD, read is several times faster than write.
>>> My previous experiment is if a queue is preempted, it will not be
>>> preempted at the second time. This can improve something, but can't
>>> resolve the problem completely.
>>> I thought we can't completely solve this issue if we give high
>>> priority to sync queue,
>>> async queue is unavoidable to be able starved.
>> I am fine that async queue tends to be starved, but the problem is that
>> it is been starved too long. And although it is asynced, some sync
>> process are waiting for it, like jbd2 in my test. So CFQ should provide
>> us with at least some deadline. In my tests sometimes flusher is blocked
>> for more than 500 seconds before it can be waken up in get_request_wait.
>>  It is really amazing for me.
> I don't understand. If your sync thread is waiting for the async one,
> the sync thread will not deliver request, and then async thread will
> run then, why there is starvation? maybe you can give more details of
> your workload.
ok, my word may be misleading. ;) The real problem here is when the
flusher uses async write to send the request, the buffer is locked. But
jbd2 needs to lock the buffer also in some cases, so it is blocked.
> I don't think we can give a deadline for async request, because we
> still want to give sync high priority. We can give async some slices,
> so for a workload of small number of async requests and large number
> sync requests, we don't starve async too much. But for a workload with
> large number of sync/async requests, async will be starved for sure
> and we can't solve this in cfq.
OK, so if you guys thinks a 500 seconds wait is good for an async write
to complete, fine, then we have to switch to deadline.

Regards,
Tao

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-09 18:27         ` Vivek Goyal
@ 2011-06-10  5:48           ` Tao Ma
  2011-06-10  9:14             ` Vivek Goyal
  0 siblings, 1 reply; 33+ messages in thread
From: Tao Ma @ 2011-06-10  5:48 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, Jens Axboe

On 06/10/2011 02:27 AM, Vivek Goyal wrote:
> On Thu, Jun 09, 2011 at 11:44:21PM +0800, Tao Ma wrote:
> 
> [..]
>>> CFQ in general tries not to drive too deep a queue depth in an effort
>>> to improve latencies. CFQ is generally recommened for slow SATA drives
>>> and dispatching too many requests from a single queue can only serve to
>>> increase the latency.
>> ok, so do you mean that for a fast drive, cfq isn't recommended and
>> deadline is always prefered? ;) We have a SAS with queue_depth=128, so
>> it should be a fast drive I guess. :)
> 
> I think in general that has been true in my experience. SSDs are still
> ok with CFQ because that sets nonrotational flag and cuts down on 
> idling. But if it is a rotational media which can handle multiple
> parallel requests at a time you might have better throughput results
> with deadline.
Thank you for the advice.
> 
> [..]
>>> Its latency vs throughput tradeoff.
>> ok, so it seems that all these are designed, not a bug. Thanks for the
>> clarification.
>>
>> btw, reverting the patch doesn't work. I can still get the livelock.
> 
> Can you give following patch a try and see if it helps. On my system this
> does allow CFQ to dispatch some writes once in a while.
Sorry, this patch doesn't work in my test.

Regards,
Tao
> 
> thanks
> Vivek
> 
> ---
>  block/cfq-iosched.c |   25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c	2011-06-09 11:44:40.000000000 -0400
> +++ linux-2.6/block/cfq-iosched.c	2011-06-09 14:04:01.036983301 -0400
> @@ -303,6 +303,8 @@ struct cfq_data {
>  
>  	/* Number of groups which are on blkcg->blkg_list */
>  	unsigned int nr_blkcg_linked_grps;
> +
> +	unsigned long last_async_dispatched;
>  };
>  
>  static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
> @@ -2063,6 +2065,10 @@ static void cfq_dispatch_insert(struct r
>  
>  	cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
>  	cfqq->nr_sectors += blk_rq_sectors(rq);
> +
> +	if (!cfq_cfqq_sync(cfqq))
> +		cfqd->last_async_dispatched = jiffies;
> +
>  	cfq_blkiocg_update_dispatch_stats(&cfqq->cfqg->blkg, blk_rq_bytes(rq),
>  					rq_data_dir(rq), rq_is_sync(rq));
>  }
> @@ -3315,8 +3321,25 @@ cfq_should_preempt(struct cfq_data *cfqd
>  	 * if the new request is sync, but the currently running queue is
>  	 * not, let the sync request have priority.
>  	 */
> -	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
> +	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq)) {
> +		unsigned long async_delay = 0;
> +
> +		async_delay = jiffies - cfqd->last_async_dispatched;
> +
> +		/*
> +		 * CFQ is heavily loaded in favor of sync queues and that
> + 		 * can lead to starvation of async queues. If it has been
> + 		 * too long since last async request was dispatched, don't
> + 		 * preempt async queue
> + 		 *
> + 		 * Once we have per group async queues, this will need
> + 		 * modification.
> + 		 */
> +		if (async_delay > 2 * HZ)
> +			return false;
> +
>  		return true;
> +	}
>  
>  	if (new_cfqq->cfqg != cfqq->cfqg)
>  		return false;


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10  5:48           ` Tao Ma
@ 2011-06-10  9:14             ` Vivek Goyal
  2011-06-10 10:00               ` Tao Ma
  0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-06-10  9:14 UTC (permalink / raw)
  To: Tao Ma; +Cc: linux-kernel, Jens Axboe

On Fri, Jun 10, 2011 at 01:48:37PM +0800, Tao Ma wrote:

[..]
> >> btw, reverting the patch doesn't work. I can still get the livelock.

What test exactly you are running. I am primarily interested in whether
you still get the hung task timeout warning where a writer is waiting on
get_request_wait() for more than 120 secods or not.

Livelock might be a different problem and for which Christoph provided
a patch for XFS.

> > 
> > Can you give following patch a try and see if it helps. On my system this
> > does allow CFQ to dispatch some writes once in a while.
> Sorry, this patch doesn't work in my test.

Can you give me backtrace of say 15 seconds each with and without patch.
I think now we must be dispatching some writes, that's a different thing
that writer still sleeps more than 120 seconds because there are  way
too many readers.

May be we need to look into show workload tree scheduling takes place and
tweak that logic a bit.

Looking at backtraces should help.

On my system with XFS filesystem I ran 32 readers and 16 buffered writers
with fio for 180 seconds. Without the patch I was getting hung task
timeout warning and with the patch I stopped getting that. I also ran
the blktrace and saw that roughly in 4 seconds we got to dispatch a write.
Which is much better than complete write starving.

So basically blktrace will help here.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10  1:19       ` Shaohua Li
  2011-06-10  1:34         ` Shaohua Li
@ 2011-06-10  9:17         ` Vivek Goyal
  2011-06-10  9:20           ` Jens Axboe
  1 sibling, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-06-10  9:17 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tao Ma, linux-kernel, Jens Axboe

On Fri, Jun 10, 2011 at 09:19:12AM +0800, Shaohua Li wrote:

[..]
> > If there is no major advantage of draining sync requests before async
> > is dispatched, I think this should be an easy fix.
> I thought this is to avoid sync latency if we switch from an async
> queue to sync queue later.

Is it about the sync request latency which has already been dispatched? I 
really wish that driver and disk should do some prioritazation for reads
here and CFQ does not have to jump through hoops like drain sync requests
before async requests are dispatched.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10  3:02               ` Tao Ma
@ 2011-06-10  9:20                 ` Vivek Goyal
  2011-06-10  9:21                   ` Jens Axboe
  2011-06-13  1:03                   ` Shaohua Li
  0 siblings, 2 replies; 33+ messages in thread
From: Vivek Goyal @ 2011-06-10  9:20 UTC (permalink / raw)
  To: Tao Ma; +Cc: Shaohua Li, linux-kernel, Jens Axboe

On Fri, Jun 10, 2011 at 11:02:11AM +0800, Tao Ma wrote:

[..]
> > I don't think we can give a deadline for async request, because we
> > still want to give sync high priority. We can give async some slices,
> > so for a workload of small number of async requests and large number
> > sync requests, we don't starve async too much. But for a workload with
> > large number of sync/async requests, async will be starved for sure
> > and we can't solve this in cfq.
> OK, so if you guys thinks a 500 seconds wait is good for an async write
> to complete, fine, then we have to switch to deadline.

I don't think that starving WRITES completely is a good idea. Especially
given the fact that you were not able to dispatch WRITES for 500 seconds.
This needs fixing.

Its not about giving hard deadline to WRITES, but making sure we don't
starve them completely and they also make some progress.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10  9:17         ` Vivek Goyal
@ 2011-06-10  9:20           ` Jens Axboe
  2011-06-10  9:29             ` Vivek Goyal
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2011-06-10  9:20 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Shaohua Li, Tao Ma, linux-kernel

On 2011-06-10 11:17, Vivek Goyal wrote:
> On Fri, Jun 10, 2011 at 09:19:12AM +0800, Shaohua Li wrote:
> 
> [..]
>>> If there is no major advantage of draining sync requests before async
>>> is dispatched, I think this should be an easy fix.
>> I thought this is to avoid sync latency if we switch from an async
>> queue to sync queue later.
> 
> Is it about the sync request latency which has already been dispatched? I 
> really wish that driver and disk should do some prioritazation for reads
> here and CFQ does not have to jump through hoops like drain sync requests
> before async requests are dispatched.

That would never work. Are you suggesting putting that logic in all
drivers? Or relying on hardware to get the fairness right? Not going to
happen.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10  9:20                 ` Vivek Goyal
@ 2011-06-10  9:21                   ` Jens Axboe
  2011-06-13  1:03                   ` Shaohua Li
  1 sibling, 0 replies; 33+ messages in thread
From: Jens Axboe @ 2011-06-10  9:21 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Tao Ma, Shaohua Li, linux-kernel

On 2011-06-10 11:20, Vivek Goyal wrote:
> On Fri, Jun 10, 2011 at 11:02:11AM +0800, Tao Ma wrote:
> 
> [..]
>>> I don't think we can give a deadline for async request, because we
>>> still want to give sync high priority. We can give async some slices,
>>> so for a workload of small number of async requests and large number
>>> sync requests, we don't starve async too much. But for a workload with
>>> large number of sync/async requests, async will be starved for sure
>>> and we can't solve this in cfq.
>> OK, so if you guys thinks a 500 seconds wait is good for an async write
>> to complete, fine, then we have to switch to deadline.
> 
> I don't think that starving WRITES completely is a good idea. Especially
> given the fact that you were not able to dispatch WRITES for 500 seconds.
> This needs fixing.
> 
> Its not about giving hard deadline to WRITES, but making sure we don't
> starve them completely and they also make some progress.

Agree, we need to have some sort of forward progress guarantee at least.
Starving buffered writes indefinitely is surely a BUG. And it must be a
regression from not that long ago.

Trying to catch up with this thread...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10  9:20           ` Jens Axboe
@ 2011-06-10  9:29             ` Vivek Goyal
  2011-06-10  9:31               ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-06-10  9:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Shaohua Li, Tao Ma, linux-kernel

On Fri, Jun 10, 2011 at 11:20:43AM +0200, Jens Axboe wrote:
> On 2011-06-10 11:17, Vivek Goyal wrote:
> > On Fri, Jun 10, 2011 at 09:19:12AM +0800, Shaohua Li wrote:
> > 
> > [..]
> >>> If there is no major advantage of draining sync requests before async
> >>> is dispatched, I think this should be an easy fix.
> >> I thought this is to avoid sync latency if we switch from an async
> >> queue to sync queue later.
> > 
> > Is it about the sync request latency which has already been dispatched? I 
> > really wish that driver and disk should do some prioritazation for reads
> > here and CFQ does not have to jump through hoops like drain sync requests
> > before async requests are dispatched.
> 
> That would never work. Are you suggesting putting that logic in all
> drivers? Or relying on hardware to get the fairness right? Not going to
> happen.

I was hoping that hardware does some prioritization. Well, in this case
even if hardware maintains FIFO behavior it should be good enough.

But I would not claim anything in this regard as I have never experimented
with it and have no idea that how sync latencies are impacted if we don't
drain the queue before dispathing WRITEs.

I was just wondering that with current generation hardware is it bad
enough that we need to keep this logic around?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10  9:29             ` Vivek Goyal
@ 2011-06-10  9:31               ` Jens Axboe
  0 siblings, 0 replies; 33+ messages in thread
From: Jens Axboe @ 2011-06-10  9:31 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Shaohua Li, Tao Ma, linux-kernel

On 2011-06-10 11:29, Vivek Goyal wrote:
> On Fri, Jun 10, 2011 at 11:20:43AM +0200, Jens Axboe wrote:
>> On 2011-06-10 11:17, Vivek Goyal wrote:
>>> On Fri, Jun 10, 2011 at 09:19:12AM +0800, Shaohua Li wrote:
>>>
>>> [..]
>>>>> If there is no major advantage of draining sync requests before async
>>>>> is dispatched, I think this should be an easy fix.
>>>> I thought this is to avoid sync latency if we switch from an async
>>>> queue to sync queue later.
>>>
>>> Is it about the sync request latency which has already been dispatched? I 
>>> really wish that driver and disk should do some prioritazation for reads
>>> here and CFQ does not have to jump through hoops like drain sync requests
>>> before async requests are dispatched.
>>
>> That would never work. Are you suggesting putting that logic in all
>> drivers? Or relying on hardware to get the fairness right? Not going to
>> happen.
> 
> I was hoping that hardware does some prioritization. Well, in this case
> even if hardware maintains FIFO behavior it should be good enough.
> 
> But I would not claim anything in this regard as I have never experimented
> with it and have no idea that how sync latencies are impacted if we don't
> drain the queue before dispathing WRITEs.
> 
> I was just wondering that with current generation hardware is it bad
> enough that we need to keep this logic around?

The logic for draining around sync/async switch isn't that old, in fact
it dates only back to the last round of interactiveness fury. So yes,
it's definitely needed. It made a big difference on hardware that was
just out-of-the-store back then. I think you are putting way too much
faith into the sanity and goals of companies making this hardware.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10  9:14             ` Vivek Goyal
@ 2011-06-10 10:00               ` Tao Ma
  2011-06-10 15:44                 ` Vivek Goyal
  0 siblings, 1 reply; 33+ messages in thread
From: Tao Ma @ 2011-06-10 10:00 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, Jens Axboe

On 06/10/2011 05:14 PM, Vivek Goyal wrote:
> On Fri, Jun 10, 2011 at 01:48:37PM +0800, Tao Ma wrote:
> 
> [..]
>>>> btw, reverting the patch doesn't work. I can still get the livelock.
> 
> What test exactly you are running. I am primarily interested in whether
> you still get the hung task timeout warning where a writer is waiting on
> get_request_wait() for more than 120 secods or not.
> 
> Livelock might be a different problem and for which Christoph provided
> a patch for XFS.
> 
>>>
>>> Can you give following patch a try and see if it helps. On my system this
>>> does allow CFQ to dispatch some writes once in a while.
>> Sorry, this patch doesn't work in my test.
> 
> Can you give me backtrace of say 15 seconds each with and without patch.
> I think now we must be dispatching some writes, that's a different thing
> that writer still sleeps more than 120 seconds because there are  way
> too many readers.
> 
> May be we need to look into show workload tree scheduling takes place and
> tweak that logic a bit.
OK, our test cases can be downloaded for free. ;)
svn co http://code.taobao.org/svn/dirbench/trunk/meta_test/press/set_vs_get
Modify run.sh to be fit for your need. Normally within 10 mins, you will
get the livelock. We have a SAS disk with 15000 RPMs.

btw, you have to mount the volume on /test since the test program are
not that clever. :)

Regards,
Tao

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10 10:00               ` Tao Ma
@ 2011-06-10 15:44                 ` Vivek Goyal
  2011-06-11  7:24                   ` Tao Ma
                                     ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Vivek Goyal @ 2011-06-10 15:44 UTC (permalink / raw)
  To: Tao Ma; +Cc: linux-kernel, Jens Axboe

On Fri, Jun 10, 2011 at 06:00:37PM +0800, Tao Ma wrote:
> On 06/10/2011 05:14 PM, Vivek Goyal wrote:
> > On Fri, Jun 10, 2011 at 01:48:37PM +0800, Tao Ma wrote:
> > 
> > [..]
> >>>> btw, reverting the patch doesn't work. I can still get the livelock.
> > 
> > What test exactly you are running. I am primarily interested in whether
> > you still get the hung task timeout warning where a writer is waiting on
> > get_request_wait() for more than 120 secods or not.
> > 
> > Livelock might be a different problem and for which Christoph provided
> > a patch for XFS.
> > 
> >>>
> >>> Can you give following patch a try and see if it helps. On my system this
> >>> does allow CFQ to dispatch some writes once in a while.
> >> Sorry, this patch doesn't work in my test.
> > 
> > Can you give me backtrace of say 15 seconds each with and without patch.
> > I think now we must be dispatching some writes, that's a different thing
> > that writer still sleeps more than 120 seconds because there are  way
> > too many readers.
> > 
> > May be we need to look into show workload tree scheduling takes place and
> > tweak that logic a bit.
> OK, our test cases can be downloaded for free. ;)
> svn co http://code.taobao.org/svn/dirbench/trunk/meta_test/press/set_vs_get
> Modify run.sh to be fit for your need. Normally within 10 mins, you will
> get the livelock. We have a SAS disk with 15000 RPMs.
> 
> btw, you have to mount the volume on /test since the test program are
> not that clever. :)

Thanks for the test program. System keeps on working, that's a different
thing that writes might not make lot of progress. 

What do you mean by livelock in your case. How do you define that?

Couple of times I did see hung_task warning with your test. And I also
saw that we have most of the time starved WRITES but one in a while we
will dispatch some writes.

Having said that I will still admit that current logic can completely
starve async writes if there are sufficient number of readers. I can
reproduce this simply by launching lots of readers and bunch of writers
using fio.

So I have written another patch, where I don't allow preemption of 
async queue if it waiting for sync requests to drain and has not
dispatched any request after having been scheduled.

This atleast makes sure that writes are not starved. But that does not
mean that whole bunch of async writes are dispatched. In presence of
lots of read activity, expect 1 write to be dispatched every few seconds.

Please give this patch a try and if it still does not work, please upload
some bltraces while test is running.

You can also run iostat on disk and should be able to see that with
the patch you are dispatching writes more often than before.

Thanks
Vivek

---
 block/cfq-iosched.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c	2011-06-10 09:13:01.000000000 -0400
+++ linux-2.6/block/cfq-iosched.c	2011-06-10 10:02:31.850831735 -0400
@@ -3315,8 +3315,15 @@ cfq_should_preempt(struct cfq_data *cfqd
 	 * if the new request is sync, but the currently running queue is
 	 * not, let the sync request have priority.
 	 */
-	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
+	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq)) {
+		/*
+		 * Allow atleast one dispatch otherwise this can repeat
+		 * and writes can be starved completely
+		 */
+		if (!cfqq->slice_dispatch)
+			return false;
 		return true;
+	}
 
 	if (new_cfqq->cfqg != cfqq->cfqg)
 		return false;


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10 15:44                 ` Vivek Goyal
@ 2011-06-11  7:24                   ` Tao Ma
  2011-06-13 10:08                   ` Tao Ma
  2011-06-17  3:04                   ` Tao Ma
  2 siblings, 0 replies; 33+ messages in thread
From: Tao Ma @ 2011-06-11  7:24 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, Jens Axboe

On 06/10/2011 11:44 PM, Vivek Goyal wrote:
> On Fri, Jun 10, 2011 at 06:00:37PM +0800, Tao Ma wrote:
>> On 06/10/2011 05:14 PM, Vivek Goyal wrote:
>>> On Fri, Jun 10, 2011 at 01:48:37PM +0800, Tao Ma wrote:
>>>
>>> [..]
>>>>>> btw, reverting the patch doesn't work. I can still get the livelock.
>>>
>>> What test exactly you are running. I am primarily interested in whether
>>> you still get the hung task timeout warning where a writer is waiting on
>>> get_request_wait() for more than 120 secods or not.
>>>
>>> Livelock might be a different problem and for which Christoph provided
>>> a patch for XFS.
>>>
>>>>>
>>>>> Can you give following patch a try and see if it helps. On my system this
>>>>> does allow CFQ to dispatch some writes once in a while.
>>>> Sorry, this patch doesn't work in my test.
>>>
>>> Can you give me backtrace of say 15 seconds each with and without patch.
>>> I think now we must be dispatching some writes, that's a different thing
>>> that writer still sleeps more than 120 seconds because there are  way
>>> too many readers.
>>>
>>> May be we need to look into show workload tree scheduling takes place and
>>> tweak that logic a bit.
>> OK, our test cases can be downloaded for free. ;)
>> svn co http://code.taobao.org/svn/dirbench/trunk/meta_test/press/set_vs_get
>> Modify run.sh to be fit for your need. Normally within 10 mins, you will
>> get the livelock. We have a SAS disk with 15000 RPMs.
>>
>> btw, you have to mount the volume on /test since the test program are
>> not that clever. :)
> 
> Thanks for the test program. System keeps on working, that's a different
> thing that writes might not make lot of progress. 
> 
> What do you mean by livelock in your case. How do you define that?
I mean there are many tasks hang like a deadlock, but actually they can
get out finally.
> 
> Couple of times I did see hung_task warning with your test. And I also
> saw that we have most of the time starved WRITES but one in a while we
> will dispatch some writes.
yeah, that is what we get.
> 
> Having said that I will still admit that current logic can completely
> starve async writes if there are sufficient number of readers. I can
> reproduce this simply by launching lots of readers and bunch of writers
> using fio.
yeah, as I have said before, we found some async writes are delayed for
more than 500 seconds.
> 
> So I have written another patch, where I don't allow preemption of 
> async queue if it waiting for sync requests to drain and has not
> dispatched any request after having been scheduled.
> 
> This atleast makes sure that writes are not starved. But that does not
> mean that whole bunch of async writes are dispatched. In presence of
> lots of read activity, expect 1 write to be dispatched every few seconds.
> 
> Please give this patch a try and if it still does not work, please upload
> some bltraces while test is running.
OK, I will try it and return back with the test result. Great thanks for
your test and patch.

Tao
> 
> You can also run iostat on disk and should be able to see that with
> the patch you are dispatching writes more often than before.
> 
> Thanks
> Vivek
> 
> ---
>  block/cfq-iosched.c |    9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c	2011-06-10 09:13:01.000000000 -0400
> +++ linux-2.6/block/cfq-iosched.c	2011-06-10 10:02:31.850831735 -0400
> @@ -3315,8 +3315,15 @@ cfq_should_preempt(struct cfq_data *cfqd
>  	 * if the new request is sync, but the currently running queue is
>  	 * not, let the sync request have priority.
>  	 */
> -	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
> +	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq)) {
> +		/*
> +		 * Allow atleast one dispatch otherwise this can repeat
> +		 * and writes can be starved completely
> +		 */
> +		if (!cfqq->slice_dispatch)
> +			return false;
>  		return true;
> +	}
>  
>  	if (new_cfqq->cfqg != cfqq->cfqg)
>  		return false;
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10  9:20                 ` Vivek Goyal
  2011-06-10  9:21                   ` Jens Axboe
@ 2011-06-13  1:03                   ` Shaohua Li
  1 sibling, 0 replies; 33+ messages in thread
From: Shaohua Li @ 2011-06-13  1:03 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Tao Ma, linux-kernel, Jens Axboe

On Fri, 2011-06-10 at 17:20 +0800, Vivek Goyal wrote:
> On Fri, Jun 10, 2011 at 11:02:11AM +0800, Tao Ma wrote:
> 
> [..]
> > > I don't think we can give a deadline for async request, because we
> > > still want to give sync high priority. We can give async some slices,
> > > so for a workload of small number of async requests and large number
> > > sync requests, we don't starve async too much. But for a workload with
> > > large number of sync/async requests, async will be starved for sure
> > > and we can't solve this in cfq.
> > OK, so if you guys thinks a 500 seconds wait is good for an async write
> > to complete, fine, then we have to switch to deadline.
> 
> I don't think that starving WRITES completely is a good idea. Especially
> given the fact that you were not able to dispatch WRITES for 500 seconds.
> This needs fixing.
> 
> Its not about giving hard deadline to WRITES, but making sure we don't
> starve them completely and they also make some progress.
Sure, I have no objection to avoid write starvation for a light write
workload. We definitely should do something for such workload. Your
patch is a good start.
But for a heavy write workload (for example, Ma's test), CFQ never can
completely avoid write starvation. In such workload, write queue has
only slow progress to handle requests.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10 15:44                 ` Vivek Goyal
  2011-06-11  7:24                   ` Tao Ma
@ 2011-06-13 10:08                   ` Tao Ma
  2011-06-13 21:41                     ` Vivek Goyal
  2011-06-17  3:04                   ` Tao Ma
  2 siblings, 1 reply; 33+ messages in thread
From: Tao Ma @ 2011-06-13 10:08 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, Jens Axboe

On 06/10/2011 11:44 PM, Vivek Goyal wrote:
> On Fri, Jun 10, 2011 at 06:00:37PM +0800, Tao Ma wrote:
>> On 06/10/2011 05:14 PM, Vivek Goyal wrote:
>>> On Fri, Jun 10, 2011 at 01:48:37PM +0800, Tao Ma wrote:
>>>
>>> [..]
>>>>>> btw, reverting the patch doesn't work. I can still get the livelock.
>>>
>>> What test exactly you are running. I am primarily interested in whether
>>> you still get the hung task timeout warning where a writer is waiting on
>>> get_request_wait() for more than 120 secods or not.
>>>
>>> Livelock might be a different problem and for which Christoph provided
>>> a patch for XFS.
>>>
>>>>>
>>>>> Can you give following patch a try and see if it helps. On my system this
>>>>> does allow CFQ to dispatch some writes once in a while.
>>>> Sorry, this patch doesn't work in my test.
>>>
>>> Can you give me backtrace of say 15 seconds each with and without patch.
>>> I think now we must be dispatching some writes, that's a different thing
>>> that writer still sleeps more than 120 seconds because there are  way
>>> too many readers.
>>>
>>> May be we need to look into show workload tree scheduling takes place and
>>> tweak that logic a bit.
>> OK, our test cases can be downloaded for free. ;)
>> svn co http://code.taobao.org/svn/dirbench/trunk/meta_test/press/set_vs_get
>> Modify run.sh to be fit for your need. Normally within 10 mins, you will
>> get the livelock. We have a SAS disk with 15000 RPMs.
>>
>> btw, you have to mount the volume on /test since the test program are
>> not that clever. :)
> 
> Thanks for the test program. System keeps on working, that's a different
> thing that writes might not make lot of progress. 
> 
> What do you mean by livelock in your case. How do you define that?
> 
> Couple of times I did see hung_task warning with your test. And I also
> saw that we have most of the time starved WRITES but one in a while we
> will dispatch some writes.
> 
> Having said that I will still admit that current logic can completely
> starve async writes if there are sufficient number of readers. I can
> reproduce this simply by launching lots of readers and bunch of writers
> using fio.
> 
> So I have written another patch, where I don't allow preemption of 
> async queue if it waiting for sync requests to drain and has not
> dispatched any request after having been scheduled.
> 
> This atleast makes sure that writes are not starved. But that does not
> mean that whole bunch of async writes are dispatched. In presence of
> lots of read activity, expect 1 write to be dispatched every few seconds.
> 
> Please give this patch a try and if it still does not work, please upload
> some bltraces while test is running.
> 
> You can also run iostat on disk and should be able to see that with
> the patch you are dispatching writes more often than before.
Sorry, the patch doesn't work.

I used trace event to capture all the blktraces since it doesn't
interfere with the tests, hope it helps.

Please downloaded it from http://blog.coly.li/tmp/blktrace.tar.bz2

Thanks
Tao
> 
> Thanks
> Vivek
> 
> ---
>  block/cfq-iosched.c |    9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c	2011-06-10 09:13:01.000000000 -0400
> +++ linux-2.6/block/cfq-iosched.c	2011-06-10 10:02:31.850831735 -0400
> @@ -3315,8 +3315,15 @@ cfq_should_preempt(struct cfq_data *cfqd
>  	 * if the new request is sync, but the currently running queue is
>  	 * not, let the sync request have priority.
>  	 */
> -	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
> +	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq)) {
> +		/*
> +		 * Allow atleast one dispatch otherwise this can repeat
> +		 * and writes can be starved completely
> +		 */
> +		if (!cfqq->slice_dispatch)
> +			return false;
>  		return true;
> +	}
>  
>  	if (new_cfqq->cfqg != cfqq->cfqg)
>  		return false;
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-13 10:08                   ` Tao Ma
@ 2011-06-13 21:41                     ` Vivek Goyal
  2011-06-14  7:03                       ` Tao Ma
  0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-06-13 21:41 UTC (permalink / raw)
  To: Tao Ma; +Cc: linux-kernel, Jens Axboe

On Mon, Jun 13, 2011 at 06:08:40PM +0800, Tao Ma wrote:

[..]
> > You can also run iostat on disk and should be able to see that with
> > the patch you are dispatching writes more often than before.
> Sorry, the patch doesn't work.
> 
> I used trace event to capture all the blktraces since it doesn't
> interfere with the tests, hope it helps.

Actually I was looking for CFQ traces. This seems to be generic block
layer trace points. May be you can use "blktrace -d /dev/<device>"
and then blkparse. It also gives the aggregate view which is helpful.

> 
> Please downloaded it from http://blog.coly.li/tmp/blktrace.tar.bz2

What concerns me is following.

5255.521353: block_rq_issue: 8,0 W 0 () 571137153 + 8 [attr_set]
5578.863871: block_rq_issue: 8,0 W 0 () 512950473 + 48 [kworker/0:1]

IIUC, we dispatched second write more than 300 seconds after dispatching
1 write. What happened in between. We should have dispatched more writes.

CFQ traces might give better idea in terms of whether wl_type for async
queues was scheduled or not at all.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-13 21:41                     ` Vivek Goyal
@ 2011-06-14  7:03                       ` Tao Ma
  2011-06-14 13:30                         ` Vivek Goyal
  0 siblings, 1 reply; 33+ messages in thread
From: Tao Ma @ 2011-06-14  7:03 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, Jens Axboe

Hi Vivek,
On 06/14/2011 05:41 AM, Vivek Goyal wrote:
> On Mon, Jun 13, 2011 at 06:08:40PM +0800, Tao Ma wrote:
> 
> [..]
>>> You can also run iostat on disk and should be able to see that with
>>> the patch you are dispatching writes more often than before.
>> Sorry, the patch doesn't work.
>>
>> I used trace event to capture all the blktraces since it doesn't
>> interfere with the tests, hope it helps.
> 
> Actually I was looking for CFQ traces. This seems to be generic block
> layer trace points. May be you can use "blktrace -d /dev/<device>"
> and then blkparse. It also gives the aggregate view which is helpful.
> 
>>
>> Please downloaded it from http://blog.coly.li/tmp/blktrace.tar.bz2
> 
> What concerns me is following.
> 
> 5255.521353: block_rq_issue: 8,0 W 0 () 571137153 + 8 [attr_set]
> 5578.863871: block_rq_issue: 8,0 W 0 () 512950473 + 48 [kworker/0:1]
> 
> IIUC, we dispatched second write more than 300 seconds after dispatching
> 1 write. What happened in between. We should have dispatched more writes.
> 
> CFQ traces might give better idea in terms of whether wl_type for async
> queues was scheduled or not at all.
I tried several times today, but it looks like that if I enable
blktrace, the hung_task will not show up in the message. So do you think
the blktrace at that time is still useful? If yes, I can capture 1
minute for you. Thanks.

Regards,
Tao

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-14  7:03                       ` Tao Ma
@ 2011-06-14 13:30                         ` Vivek Goyal
  2011-06-14 15:42                           ` Tao Ma
  0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-06-14 13:30 UTC (permalink / raw)
  To: Tao Ma; +Cc: linux-kernel, Jens Axboe

On Tue, Jun 14, 2011 at 03:03:24PM +0800, Tao Ma wrote:
> Hi Vivek,
> On 06/14/2011 05:41 AM, Vivek Goyal wrote:
> > On Mon, Jun 13, 2011 at 06:08:40PM +0800, Tao Ma wrote:
> > 
> > [..]
> >>> You can also run iostat on disk and should be able to see that with
> >>> the patch you are dispatching writes more often than before.
> >> Sorry, the patch doesn't work.
> >>
> >> I used trace event to capture all the blktraces since it doesn't
> >> interfere with the tests, hope it helps.
> > 
> > Actually I was looking for CFQ traces. This seems to be generic block
> > layer trace points. May be you can use "blktrace -d /dev/<device>"
> > and then blkparse. It also gives the aggregate view which is helpful.
> > 
> >>
> >> Please downloaded it from http://blog.coly.li/tmp/blktrace.tar.bz2
> > 
> > What concerns me is following.
> > 
> > 5255.521353: block_rq_issue: 8,0 W 0 () 571137153 + 8 [attr_set]
> > 5578.863871: block_rq_issue: 8,0 W 0 () 512950473 + 48 [kworker/0:1]
> > 
> > IIUC, we dispatched second write more than 300 seconds after dispatching
> > 1 write. What happened in between. We should have dispatched more writes.
> > 
> > CFQ traces might give better idea in terms of whether wl_type for async
> > queues was scheduled or not at all.
> I tried several times today, but it looks like that if I enable
> blktrace, the hung_task will not show up in the message. So do you think
> the blktrace at that time is still useful? If yes, I can capture 1
> minute for you. Thanks.

Capturing 1 min output will also be good. 

You can do one more thing. Mount block IO controller. It has the stats for
sync and async dispatch (blkio.io_serviced or blkio.io_service_bytes). You
can write a simple script to read and print these files every few seconds.
That will also tell whether CFQ is dispatching async requests for the 
said device regularly or not.

So both blktrace and blkio controller stat will help.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-14 13:30                         ` Vivek Goyal
@ 2011-06-14 15:42                           ` Tao Ma
  2011-06-14 21:14                             ` Vivek Goyal
  0 siblings, 1 reply; 33+ messages in thread
From: Tao Ma @ 2011-06-14 15:42 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, Jens Axboe

On 06/14/2011 09:30 PM, Vivek Goyal wrote:
> On Tue, Jun 14, 2011 at 03:03:24PM +0800, Tao Ma wrote:
>> Hi Vivek,
>> On 06/14/2011 05:41 AM, Vivek Goyal wrote:
>>> On Mon, Jun 13, 2011 at 06:08:40PM +0800, Tao Ma wrote:
>>>
>>> [..]
>>>>> You can also run iostat on disk and should be able to see that with
>>>>> the patch you are dispatching writes more often than before.
>>>> Sorry, the patch doesn't work.
>>>>
>>>> I used trace event to capture all the blktraces since it doesn't
>>>> interfere with the tests, hope it helps.
>>>
>>> Actually I was looking for CFQ traces. This seems to be generic block
>>> layer trace points. May be you can use "blktrace -d /dev/<device>"
>>> and then blkparse. It also gives the aggregate view which is helpful.
>>>
>>>>
>>>> Please downloaded it from http://blog.coly.li/tmp/blktrace.tar.bz2
>>>
>>> What concerns me is following.
>>>
>>> 5255.521353: block_rq_issue: 8,0 W 0 () 571137153 + 8 [attr_set]
>>> 5578.863871: block_rq_issue: 8,0 W 0 () 512950473 + 48 [kworker/0:1]
>>>
>>> IIUC, we dispatched second write more than 300 seconds after dispatching
>>> 1 write. What happened in between. We should have dispatched more writes.
>>>
>>> CFQ traces might give better idea in terms of whether wl_type for async
>>> queues was scheduled or not at all.
>> I tried several times today, but it looks like that if I enable
>> blktrace, the hung_task will not show up in the message. So do you think
>> the blktrace at that time is still useful? If yes, I can capture 1
>> minute for you. Thanks.
> 
> Capturing 1 min output will also be good. 
OK, I captured 2 mins blkparse log before the hung. You can downloaded
it from http://blog.coly.li/tmp/blkparse.tar.bz2

> 
> You can do one more thing. Mount block IO controller. It has the stats for
> sync and async dispatch (blkio.io_serviced or blkio.io_service_bytes). You
> can write a simple script to read and print these files every few seconds.
> That will also tell whether CFQ is dispatching async requests for the 
> said device regularly or not.
OK, I will try block IO controller tomorrow to see whether we can find
some useful info. Anyway, thanks for the diagnose.

Thanks
Tao

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-14 15:42                           ` Tao Ma
@ 2011-06-14 21:14                             ` Vivek Goyal
  0 siblings, 0 replies; 33+ messages in thread
From: Vivek Goyal @ 2011-06-14 21:14 UTC (permalink / raw)
  To: Tao Ma; +Cc: linux-kernel, Jens Axboe

On Tue, Jun 14, 2011 at 11:42:15PM +0800, Tao Ma wrote:
> On 06/14/2011 09:30 PM, Vivek Goyal wrote:
> > On Tue, Jun 14, 2011 at 03:03:24PM +0800, Tao Ma wrote:
> >> Hi Vivek,
> >> On 06/14/2011 05:41 AM, Vivek Goyal wrote:
> >>> On Mon, Jun 13, 2011 at 06:08:40PM +0800, Tao Ma wrote:
> >>>
> >>> [..]
> >>>>> You can also run iostat on disk and should be able to see that with
> >>>>> the patch you are dispatching writes more often than before.
> >>>> Sorry, the patch doesn't work.
> >>>>
> >>>> I used trace event to capture all the blktraces since it doesn't
> >>>> interfere with the tests, hope it helps.
> >>>
> >>> Actually I was looking for CFQ traces. This seems to be generic block
> >>> layer trace points. May be you can use "blktrace -d /dev/<device>"
> >>> and then blkparse. It also gives the aggregate view which is helpful.
> >>>
> >>>>
> >>>> Please downloaded it from http://blog.coly.li/tmp/blktrace.tar.bz2
> >>>
> >>> What concerns me is following.
> >>>
> >>> 5255.521353: block_rq_issue: 8,0 W 0 () 571137153 + 8 [attr_set]
> >>> 5578.863871: block_rq_issue: 8,0 W 0 () 512950473 + 48 [kworker/0:1]
> >>>
> >>> IIUC, we dispatched second write more than 300 seconds after dispatching
> >>> 1 write. What happened in between. We should have dispatched more writes.
> >>>
> >>> CFQ traces might give better idea in terms of whether wl_type for async
> >>> queues was scheduled or not at all.
> >> I tried several times today, but it looks like that if I enable
> >> blktrace, the hung_task will not show up in the message. So do you think
> >> the blktrace at that time is still useful? If yes, I can capture 1
> >> minute for you. Thanks.
> > 
> > Capturing 1 min output will also be good. 
> OK, I captured 2 mins blkparse log before the hung. You can downloaded
> it from http://blog.coly.li/tmp/blkparse.tar.bz2

Thanks. I looked at this log and looks like now we are not starving
WRITES.

I did grep on the logs.

grep -e "wl_type:0" -e "cfq3068A / sl_used" blkparse.log | async-dispatch

And I see that now async WRITES are being dispatched at regular interval
and we are not seeing long delays (atleast in this log).

A sample output is as follows. What I am expecting from the patch is that
it will avoid the starvation of async queues in presence of lots of
writers. That's a different thing that one might not be able to push
enough WRITES in 120 seconds window and one can still get hung task
timeout message. 

 5.135877740     0  m   N cfq3068A / set_active wl_prio:0 wl_type:0
 5.231137776     0  m   N cfq3068A / sl_used=95 disp=1 charge=95 iops=0 sect=16
13.311745653     0  m   N cfq3068A / set_active wl_prio:0 wl_type:0
13.373046196     0  m   N cfq3068A / sl_used=1 disp=16 charge=1 iops=0 sect=136
18.097413421     0  m   N cfq3068A / set_active wl_prio:0 wl_type:0
18.097466598     0  m   N cfq3068A / sl_used=1 disp=3 charge=1 iops=0 sect=32
18.119371182     0  m   N cfq3068A / set_active wl_prio:0 wl_type:0
18.159420999     0  m   N cfq3068A / sl_used=40 disp=1592 charge=40 iops=0 sect=14360
18.159424767     0  m   N cfq3068A / set_active wl_prio:0 wl_type:0
18.199409182     0  m   N cfq3068A / sl_used=40 disp=1646 charge=40 iops=0 sect=13584
18.199414996     0  m   N cfq3068A / set_active wl_prio:0 wl_type:0
18.239374395     0  m   N cfq3068A / sl_used=40 disp=1678 charge=40 iops=0 sect=13872
18.239378182     0  m   N cfq3068A / set_active wl_prio:0 wl_type:0
18.254531670     0  m   N cfq3068A / sl_used=15 disp=603 charge=15 iops=0 sect=4920
27.580961230     0  m   N cfq3068A / set_active wl_prio:0 wl_type:0
27.653340897     0  m   N cfq3068A / sl_used=72 disp=16 charge=72 iops=0 sect=128

Thanks
Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-10 15:44                 ` Vivek Goyal
  2011-06-11  7:24                   ` Tao Ma
  2011-06-13 10:08                   ` Tao Ma
@ 2011-06-17  3:04                   ` Tao Ma
  2011-06-17 12:50                     ` Vivek Goyal
  2 siblings, 1 reply; 33+ messages in thread
From: Tao Ma @ 2011-06-17  3:04 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, Jens Axboe

Hi Vivek,
On 06/10/2011 11:44 PM, Vivek Goyal wrote:
> On Fri, Jun 10, 2011 at 06:00:37PM +0800, Tao Ma wrote:
>> On 06/10/2011 05:14 PM, Vivek Goyal wrote:
>>> On Fri, Jun 10, 2011 at 01:48:37PM +0800, Tao Ma wrote:
>>>
>>> [..]
>>>>>> btw, reverting the patch doesn't work. I can still get the livelock.
>>>
>>> What test exactly you are running. I am primarily interested in whether
>>> you still get the hung task timeout warning where a writer is waiting on
>>> get_request_wait() for more than 120 secods or not.
>>>
>>> Livelock might be a different problem and for which Christoph provided
>>> a patch for XFS.
>>>
>>>>>
>>>>> Can you give following patch a try and see if it helps. On my system this
>>>>> does allow CFQ to dispatch some writes once in a while.
>>>> Sorry, this patch doesn't work in my test.
>>>
>>> Can you give me backtrace of say 15 seconds each with and without patch.
>>> I think now we must be dispatching some writes, that's a different thing
>>> that writer still sleeps more than 120 seconds because there are  way
>>> too many readers.
>>>
>>> May be we need to look into show workload tree scheduling takes place and
>>> tweak that logic a bit.
>> OK, our test cases can be downloaded for free. ;)
>> svn co http://code.taobao.org/svn/dirbench/trunk/meta_test/press/set_vs_get
>> Modify run.sh to be fit for your need. Normally within 10 mins, you will
>> get the livelock. We have a SAS disk with 15000 RPMs.
>>
>> btw, you have to mount the volume on /test since the test program are
>> not that clever. :)
> 
> Thanks for the test program. System keeps on working, that's a different
> thing that writes might not make lot of progress. 
> 
> What do you mean by livelock in your case. How do you define that?
> 
> Couple of times I did see hung_task warning with your test. And I also
> saw that we have most of the time starved WRITES but one in a while we
> will dispatch some writes.
> 
> Having said that I will still admit that current logic can completely
> starve async writes if there are sufficient number of readers. I can
> reproduce this simply by launching lots of readers and bunch of writers
> using fio.
> 
> So I have written another patch, where I don't allow preemption of 
> async queue if it waiting for sync requests to drain and has not
> dispatched any request after having been scheduled.
> 
> This atleast makes sure that writes are not starved. But that does not
> mean that whole bunch of async writes are dispatched. In presence of
> lots of read activity, expect 1 write to be dispatched every few seconds.
> 
> Please give this patch a try and if it still does not work, please upload
> some bltraces while test is running.
> 
> You can also run iostat on disk and should be able to see that with
> the patch you are dispatching writes more often than before.
I am testing your patch heavily these days.
With this patch, the workload is better to survive. But in some our test
machines we can still find the hung task. After we tune slice_idle to 0,
it is OK now. So do you think this tuning is valid?

btw do you think the patch is the final version? We have some plans of
carrying it in our product system to see whether it works.

Regards,
Tao
> 
> Thanks
> Vivek
> 
> ---
>  block/cfq-iosched.c |    9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c	2011-06-10 09:13:01.000000000 -0400
> +++ linux-2.6/block/cfq-iosched.c	2011-06-10 10:02:31.850831735 -0400
> @@ -3315,8 +3315,15 @@ cfq_should_preempt(struct cfq_data *cfqd
>  	 * if the new request is sync, but the currently running queue is
>  	 * not, let the sync request have priority.
>  	 */
> -	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
> +	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq)) {
> +		/*
> +		 * Allow atleast one dispatch otherwise this can repeat
> +		 * and writes can be starved completely
> +		 */
> +		if (!cfqq->slice_dispatch)
> +			return false;
>  		return true;
> +	}
>  
>  	if (new_cfqq->cfqg != cfqq->cfqg)
>  		return false;
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-17  3:04                   ` Tao Ma
@ 2011-06-17 12:50                     ` Vivek Goyal
  2011-06-17 14:34                       ` Tao Ma
  0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-06-17 12:50 UTC (permalink / raw)
  To: Tao Ma; +Cc: linux-kernel, Jens Axboe

On Fri, Jun 17, 2011 at 11:04:51AM +0800, Tao Ma wrote:
> Hi Vivek,
> On 06/10/2011 11:44 PM, Vivek Goyal wrote:
> > On Fri, Jun 10, 2011 at 06:00:37PM +0800, Tao Ma wrote:
> >> On 06/10/2011 05:14 PM, Vivek Goyal wrote:
> >>> On Fri, Jun 10, 2011 at 01:48:37PM +0800, Tao Ma wrote:
> >>>
> >>> [..]
> >>>>>> btw, reverting the patch doesn't work. I can still get the livelock.
> >>>
> >>> What test exactly you are running. I am primarily interested in whether
> >>> you still get the hung task timeout warning where a writer is waiting on
> >>> get_request_wait() for more than 120 secods or not.
> >>>
> >>> Livelock might be a different problem and for which Christoph provided
> >>> a patch for XFS.
> >>>
> >>>>>
> >>>>> Can you give following patch a try and see if it helps. On my system this
> >>>>> does allow CFQ to dispatch some writes once in a while.
> >>>> Sorry, this patch doesn't work in my test.
> >>>
> >>> Can you give me backtrace of say 15 seconds each with and without patch.
> >>> I think now we must be dispatching some writes, that's a different thing
> >>> that writer still sleeps more than 120 seconds because there are  way
> >>> too many readers.
> >>>
> >>> May be we need to look into show workload tree scheduling takes place and
> >>> tweak that logic a bit.
> >> OK, our test cases can be downloaded for free. ;)
> >> svn co http://code.taobao.org/svn/dirbench/trunk/meta_test/press/set_vs_get
> >> Modify run.sh to be fit for your need. Normally within 10 mins, you will
> >> get the livelock. We have a SAS disk with 15000 RPMs.
> >>
> >> btw, you have to mount the volume on /test since the test program are
> >> not that clever. :)
> > 
> > Thanks for the test program. System keeps on working, that's a different
> > thing that writes might not make lot of progress. 
> > 
> > What do you mean by livelock in your case. How do you define that?
> > 
> > Couple of times I did see hung_task warning with your test. And I also
> > saw that we have most of the time starved WRITES but one in a while we
> > will dispatch some writes.
> > 
> > Having said that I will still admit that current logic can completely
> > starve async writes if there are sufficient number of readers. I can
> > reproduce this simply by launching lots of readers and bunch of writers
> > using fio.
> > 
> > So I have written another patch, where I don't allow preemption of 
> > async queue if it waiting for sync requests to drain and has not
> > dispatched any request after having been scheduled.
> > 
> > This atleast makes sure that writes are not starved. But that does not
> > mean that whole bunch of async writes are dispatched. In presence of
> > lots of read activity, expect 1 write to be dispatched every few seconds.
> > 
> > Please give this patch a try and if it still does not work, please upload
> > some bltraces while test is running.
> > 
> > You can also run iostat on disk and should be able to see that with
> > the patch you are dispatching writes more often than before.
> I am testing your patch heavily these days.
> With this patch, the workload is better to survive. But in some our test
> machines we can still find the hung task. After we tune slice_idle to 0,
> it is OK now. So do you think this tuning is valid?

By slice_idle=0 you turn off the idling and that's the core of the CFQ.
So practically you have more deadline like behavior.

> 
> btw do you think the patch is the final version? We have some plans of
> carrying it in our product system to see whether it works.

If this patch is helping, I will do some testing with single reader and
multiple writers and see how badly does it impact reader in that case.
If it is not too bad, may be it is reasonable to include this patch.

Thanks
Vivek

> 
> Regards,
> Tao
> > 
> > Thanks
> > Vivek
> > 
> > ---
> >  block/cfq-iosched.c |    9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > Index: linux-2.6/block/cfq-iosched.c
> > ===================================================================
> > --- linux-2.6.orig/block/cfq-iosched.c	2011-06-10 09:13:01.000000000 -0400
> > +++ linux-2.6/block/cfq-iosched.c	2011-06-10 10:02:31.850831735 -0400
> > @@ -3315,8 +3315,15 @@ cfq_should_preempt(struct cfq_data *cfqd
> >  	 * if the new request is sync, but the currently running queue is
> >  	 * not, let the sync request have priority.
> >  	 */
> > -	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
> > +	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq)) {
> > +		/*
> > +		 * Allow atleast one dispatch otherwise this can repeat
> > +		 * and writes can be starved completely
> > +		 */
> > +		if (!cfqq->slice_dispatch)
> > +			return false;
> >  		return true;
> > +	}
> >  
> >  	if (new_cfqq->cfqg != cfqq->cfqg)
> >  		return false;
> > 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CFQ: async queue blocks the whole system
  2011-06-17 12:50                     ` Vivek Goyal
@ 2011-06-17 14:34                       ` Tao Ma
  0 siblings, 0 replies; 33+ messages in thread
From: Tao Ma @ 2011-06-17 14:34 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, Jens Axboe

On 06/17/2011 08:50 PM, Vivek Goyal wrote:
> On Fri, Jun 17, 2011 at 11:04:51AM +0800, Tao Ma wrote:
>> Hi Vivek,
>> On 06/10/2011 11:44 PM, Vivek Goyal wrote:
>>> On Fri, Jun 10, 2011 at 06:00:37PM +0800, Tao Ma wrote:
>>>> On 06/10/2011 05:14 PM, Vivek Goyal wrote:
>>>>> On Fri, Jun 10, 2011 at 01:48:37PM +0800, Tao Ma wrote:
>>>>>
>>>>> [..]
>>>>>>>> btw, reverting the patch doesn't work. I can still get the livelock.
>>>>>
>>>>> What test exactly you are running. I am primarily interested in whether
>>>>> you still get the hung task timeout warning where a writer is waiting on
>>>>> get_request_wait() for more than 120 secods or not.
>>>>>
>>>>> Livelock might be a different problem and for which Christoph provided
>>>>> a patch for XFS.
>>>>>
>>>>>>>
>>>>>>> Can you give following patch a try and see if it helps. On my system this
>>>>>>> does allow CFQ to dispatch some writes once in a while.
>>>>>> Sorry, this patch doesn't work in my test.
>>>>>
>>>>> Can you give me backtrace of say 15 seconds each with and without patch.
>>>>> I think now we must be dispatching some writes, that's a different thing
>>>>> that writer still sleeps more than 120 seconds because there are  way
>>>>> too many readers.
>>>>>
>>>>> May be we need to look into show workload tree scheduling takes place and
>>>>> tweak that logic a bit.
>>>> OK, our test cases can be downloaded for free. ;)
>>>> svn co http://code.taobao.org/svn/dirbench/trunk/meta_test/press/set_vs_get
>>>> Modify run.sh to be fit for your need. Normally within 10 mins, you will
>>>> get the livelock. We have a SAS disk with 15000 RPMs.
>>>>
>>>> btw, you have to mount the volume on /test since the test program are
>>>> not that clever. :)
>>>
>>> Thanks for the test program. System keeps on working, that's a different
>>> thing that writes might not make lot of progress. 
>>>
>>> What do you mean by livelock in your case. How do you define that?
>>>
>>> Couple of times I did see hung_task warning with your test. And I also
>>> saw that we have most of the time starved WRITES but one in a while we
>>> will dispatch some writes.
>>>
>>> Having said that I will still admit that current logic can completely
>>> starve async writes if there are sufficient number of readers. I can
>>> reproduce this simply by launching lots of readers and bunch of writers
>>> using fio.
>>>
>>> So I have written another patch, where I don't allow preemption of 
>>> async queue if it waiting for sync requests to drain and has not
>>> dispatched any request after having been scheduled.
>>>
>>> This atleast makes sure that writes are not starved. But that does not
>>> mean that whole bunch of async writes are dispatched. In presence of
>>> lots of read activity, expect 1 write to be dispatched every few seconds.
>>>
>>> Please give this patch a try and if it still does not work, please upload
>>> some bltraces while test is running.
>>>
>>> You can also run iostat on disk and should be able to see that with
>>> the patch you are dispatching writes more often than before.
>> I am testing your patch heavily these days.
>> With this patch, the workload is better to survive. But in some our test
>> machines we can still find the hung task. After we tune slice_idle to 0,
>> it is OK now. So do you think this tuning is valid?
> 
> By slice_idle=0 you turn off the idling and that's the core of the CFQ.
> So practically you have more deadline like behavior.
Sure, but as our original test without this patch doesn't survive even
in slice_idle = 0, so I guess this patch does improve somehow in our
test. ;)
> 
>>
>> btw do you think the patch is the final version? We have some plans of
>> carrying it in our product system to see whether it works.
> 
> If this patch is helping, I will do some testing with single reader and
> multiple writers and see how badly does it impact reader in that case.
> If it is not too bad, may be it is reasonable to include this patch.
cool, btw you can add my Reported-and-Tested-by.

Thanks
Tao
> 
> Thanks
> Vivek
> 
>>
>> Regards,
>> Tao
>>>
>>> Thanks
>>> Vivek
>>>
>>> ---
>>>  block/cfq-iosched.c |    9 ++++++++-
>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> Index: linux-2.6/block/cfq-iosched.c
>>> ===================================================================
>>> --- linux-2.6.orig/block/cfq-iosched.c	2011-06-10 09:13:01.000000000 -0400
>>> +++ linux-2.6/block/cfq-iosched.c	2011-06-10 10:02:31.850831735 -0400
>>> @@ -3315,8 +3315,15 @@ cfq_should_preempt(struct cfq_data *cfqd
>>>  	 * if the new request is sync, but the currently running queue is
>>>  	 * not, let the sync request have priority.
>>>  	 */
>>> -	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
>>> +	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq)) {
>>> +		/*
>>> +		 * Allow atleast one dispatch otherwise this can repeat
>>> +		 * and writes can be starved completely
>>> +		 */
>>> +		if (!cfqq->slice_dispatch)
>>> +			return false;
>>>  		return true;
>>> +	}
>>>  
>>>  	if (new_cfqq->cfqg != cfqq->cfqg)
>>>  		return false;
>>>


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2011-06-17 14:34 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-09 10:49 CFQ: async queue blocks the whole system Tao Ma
2011-06-09 14:14 ` Vivek Goyal
2011-06-09 14:34   ` Jens Axboe
2011-06-09 14:47   ` Tao Ma
2011-06-09 15:37     ` Vivek Goyal
2011-06-09 15:44       ` Tao Ma
2011-06-09 18:27         ` Vivek Goyal
2011-06-10  5:48           ` Tao Ma
2011-06-10  9:14             ` Vivek Goyal
2011-06-10 10:00               ` Tao Ma
2011-06-10 15:44                 ` Vivek Goyal
2011-06-11  7:24                   ` Tao Ma
2011-06-13 10:08                   ` Tao Ma
2011-06-13 21:41                     ` Vivek Goyal
2011-06-14  7:03                       ` Tao Ma
2011-06-14 13:30                         ` Vivek Goyal
2011-06-14 15:42                           ` Tao Ma
2011-06-14 21:14                             ` Vivek Goyal
2011-06-17  3:04                   ` Tao Ma
2011-06-17 12:50                     ` Vivek Goyal
2011-06-17 14:34                       ` Tao Ma
2011-06-10  1:19       ` Shaohua Li
2011-06-10  1:34         ` Shaohua Li
2011-06-10  2:06           ` Tao Ma
2011-06-10  2:35             ` Shaohua Li
2011-06-10  3:02               ` Tao Ma
2011-06-10  9:20                 ` Vivek Goyal
2011-06-10  9:21                   ` Jens Axboe
2011-06-13  1:03                   ` Shaohua Li
2011-06-10  9:17         ` Vivek Goyal
2011-06-10  9:20           ` Jens Axboe
2011-06-10  9:29             ` Vivek Goyal
2011-06-10  9:31               ` Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.