* [PATCH] cfq: fix starvation of asynchronous writes
@ 2016-09-23  0:59 Glauber Costa
  2016-09-23  2:10 ` Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Glauber Costa @ 2016-09-23  0:59 UTC (permalink / raw)
  To: linux-block; +Cc: Glauber Costa, Jens Axboe, linux-kernel

While debugging timeouts happening in my application workload (ScyllaDB), I have
observed calls to open() taking a long time, ranging anywhere from 2 seconds -
already enough to time out my application - to more than 30 seconds.

The problem seems to happen because XFS may block on pending metadata updates
under certain circumstances, and that's confirmed with the following backtrace
taken by the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
high "Dispatch wait" times, in the tens-of-seconds range and consistent with
the open() times I saw in that run.

Still from the blktrace output, after searching a bit we can identify the
request that wasn't dispatched:

  8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
  8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
  8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
  8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
  <==== 'I' means Inserted (into the IO scheduler) ===================================>
  8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
  <==== Only 15s later the CFQ scheduler dispatches the request ======================>
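
For reference, the value in parentheses on the 'D' line is the 'I'-to-'D' gap
in nanoseconds, i.e. the time this request sat in the scheduler. A trivial
stand-alone check of that arithmetic (values copied from the trace above):

    #include <stdio.h>

    int main(void)
    {
            double inserted   = 81.092496118;  /* 'I' timestamp, seconds */
            double dispatched = 96.718761435;  /* 'D' timestamp, seconds */
            long long wait_ns = 15626265317LL; /* value on the 'D' line  */

            /* Both lines should print roughly 15.6 seconds */
            printf("I->D gap:      %.3f s\n", dispatched - inserted);
            printf("reported wait: %.3f s\n", wait_ns / 1e9);
            return 0;
    }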

As we can see above, in this particular example CFQ took 15 seconds to dispatch
this request. Going back to the full trace, we can see that the xfsaild queue
had plenty of opportunity to run, and it was selected as the active queue many
times. It would just always be preempted by something else (example):

  8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
  8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
  8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
  8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
  8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
  8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
  8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0

where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
IO dispatchers.

The requests preempting the xfsaild queue are synchronous requests. That's a
characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
While it can be argued that preempting ASYNC requests in favor of SYNC is part
of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
goal.

Moreover, unless I am misunderstanding something, that breaks the expectation
set by the "fifo_expire_async" tunable, which in my system is set to the
default.
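
For reference, my reading of that tunable: every async request is stamped at
insert time with a deadline of roughly now + fifo_expire_async, and
cfq_check_fifo() - visible in the first hunk of the patch below - reports the
request once the clock passes that stamp. A minimal user-space sketch of that
bookkeeping, with simplified names rather than the kernel's own structures,
and assuming the default of 250ms for the async expiry:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* Assumed default of the fifo_expire_async tunable (milliseconds in sysfs). */
    #define FIFO_EXPIRE_ASYNC_NS (250ULL * 1000 * 1000)

    struct request_stub {
            uint64_t fifo_time;   /* deadline stamped when the request is queued */
    };

    static uint64_t now_ns(void)
    {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    /* Mirrors (inverted) the "ktime_get_ns() < rq->fifo_time" test in cfq_check_fifo(). */
    static bool fifo_expired(const struct request_stub *rq)
    {
            return now_ns() >= rq->fifo_time;
    }

    int main(void)
    {
            struct request_stub rq = { .fifo_time = now_ns() + FIFO_EXPIRE_ASYNC_NS };

            /* Freshly queued: not yet expired. 250ms from now it would be. */
            printf("expired right after insert: %d\n", fifo_expired(&rq));
            return 0;
    }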

Looking at the code, it seems to me that the issue is that after we make
an async queue active, there is no guarantee that it will execute any request.

When the queue itself tests cfq_may_dispatch(), it can bail if it sees SYNC
requests in flight. An incoming request from another queue can also preempt it
in such a situation before we have the chance to execute anything (as seen in
the trace above).
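
To make that failure mode concrete, here is a small user-space sketch of the
decision the async queue keeps losing, and of what the must_dispatch override
introduced below changes. Function and field names are simplifications for
illustration, not the kernel's exact API:

    #include <stdbool.h>
    #include <stdio.h>

    struct cfqq_stub {
            bool sync;            /* is this a sync queue? */
            bool must_dispatch;   /* set once a fifo-expired request is found */
    };

    /*
     * Simplified version of the gate the async queue keeps failing: with sync
     * IO permanently in flight, an async queue never gets to dispatch. The
     * must_dispatch override added by this patch short-circuits that check.
     */
    static bool may_dispatch(const struct cfqq_stub *q, int sync_in_flight)
    {
            if (q->must_dispatch)
                    return true;
            if (!q->sync && sync_in_flight > 0)
                    return false;
            return true;
    }

    int main(void)
    {
            struct cfqq_stub xfsaild = { .sync = false, .must_dispatch = false };

            /* A steady stream of sync IO keeps this false indefinitely... */
            printf("without override: %d\n", may_dispatch(&xfsaild, 1));

            /* ...until a fifo-expired request marks the queue must_dispatch. */
            xfsaild.must_dispatch = true;
            printf("with override:    %d\n", may_dispatch(&xfsaild, 1));
            return 0;
    }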

This patch sets the must_dispatch flag if we notice that we have requests
that are already fifo_expired. This flag is always cleared in
cfq_dispatch_requests() after cfq_dispatch_request() returns, so it won't pin
the queue for subsequent requests (unless they are themselves expired).

Care is taken during preempt to still allow rt requests to preempt us
regardless.
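
The rt case needs no extra code here because, as far as I can tell, the
real-time class check runs before the modified sync-vs-async condition in
cfq_should_preempt(), so ordering alone preserves it. An illustrative
user-space sketch of that ordering (simplified names, not the full kernel
function):

    #include <stdbool.h>
    #include <stdio.h>

    struct cfqq_stub {
            bool rt;              /* real-time scheduling class */
            bool sync;
            bool must_dispatch;
    };

    static bool should_preempt(const struct cfqq_stub *newq,
                               const struct cfqq_stub *active, bool new_rq_sync)
    {
            /* rt beats non-rt, evaluated before anything else below */
            if (newq->rt && !active->rt)
                    return true;
            /* sync preempts async, unless the async queue is must_dispatch */
            if (new_rq_sync && !active->sync && !active->must_dispatch)
                    return true;
            return false;
    }

    int main(void)
    {
            struct cfqq_stub xfsaild = { .rt = false, .sync = false,
                                         .must_dispatch = true };
            struct cfqq_stub scylla  = { .rt = false, .sync = true };
            struct cfqq_stub rt_q    = { .rt = true,  .sync = true };

            printf("sync queue preempts marked queue: %d\n",
                   should_preempt(&scylla, &xfsaild, true));   /* prints 0 */
            printf("rt queue preempts marked queue:   %d\n",
                   should_preempt(&rt_q, &xfsaild, true));     /* prints 1 */
            return 0;
    }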

Testing my workload with this patch applied produces much better results.
From the application side I see no timeouts, and the open() latency histogram
generated by systemtap looks much better, with the worst outlier at 131ms:

Latency histogram of xfs_buf_lock acquisition (microseconds):
 value |-------------------------------------------------- count
     0 |                                                     11
     1 |@@@@                                                161
     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
     4 |@                                                    54
     8 |                                                     36
    16 |                                                      7
    32 |                                                      0
    64 |                                                      0
       ~
  1024 |                                                      0
  2048 |                                                      0
  4096 |                                                      1
  8192 |                                                      1
 16384 |                                                      2
 32768 |                                                      0
 65536 |                                                      0
131072 |                                                      1
262144 |                                                      0
524288 |                                                      0

Signed-off-by: Glauber Costa <glauber@scylladb.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: linux-block@vger.kernel.org
CC: linux-kernel@vger.kernel.org

Signed-off-by: Glauber Costa <glauber@scylladb.com>
---
 block/cfq-iosched.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index cc2f6db..5e24d88 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3042,7 +3042,6 @@ static struct request *cfq_check_fifo(struct cfq_queue *cfqq)
 	if (ktime_get_ns() < rq->fifo_time)
 		rq = NULL;
 
-	cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
 	return rq;
 }
 
@@ -3420,6 +3419,9 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	unsigned int max_dispatch;
 
+	if (cfq_cfqq_must_dispatch(cfqq))
+		return true;
+
 	/*
 	 * Drain async requests before we start sync IO
 	 */
@@ -3511,15 +3513,20 @@ static bool cfq_dispatch_request(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 
 	BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
 
+	rq = cfq_check_fifo(cfqq);
+	if (rq)
+		cfq_mark_cfqq_must_dispatch(cfqq);
+
 	if (!cfq_may_dispatch(cfqd, cfqq))
 		return false;
 
 	/*
 	 * follow expired path, else get first next available
 	 */
-	rq = cfq_check_fifo(cfqq);
 	if (!rq)
 		rq = cfqq->next_rq;
+	else
+		cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
 
 	/*
 	 * insert request into driver dispatch list
@@ -3989,7 +3996,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	 * if the new request is sync, but the currently running queue is
 	 * not, let the sync request have priority.
 	 */
-	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
+	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
 		return true;
 
 	/*
-- 
2.5.5

* Re: [PATCH] cfq: fix starvation of asynchronous writes
  2016-09-23  0:59 [PATCH] cfq: fix starvation of asynchronous writes Glauber Costa
@ 2016-09-23  2:10 ` Jens Axboe
  2016-09-23 16:11   ` Jens Axboe
  2016-09-23 10:22   ` Paolo Valente
  2016-10-20 18:08 ` Fwd: " Glauber Costa
  2 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2016-09-23  2:10 UTC (permalink / raw)
  To: Glauber Costa, linux-block; +Cc: linux-kernel

On 09/22/2016 06:59 PM, Glauber Costa wrote:
> While debugging timeouts happening in my application workload (ScyllaDB), I have
> observed calls to open() taking a long time, ranging everywhere from 2 seconds -
> the first ones that are enough to time out my application - to more than 30
> seconds.
>
> The problem seems to happen because XFS may block on pending metadata updates
> under certain circumstances, and that's confirmed with the following backtrace
> taken by the offcputime tool (iovisor/bcc):
>
>     ffffffffb90c57b1 finish_task_switch
>     ffffffffb97dffb5 schedule
>     ffffffffb97e310c schedule_timeout
>     ffffffffb97e1f12 __down
>     ffffffffb90ea821 down
>     ffffffffc046a9dc xfs_buf_lock
>     ffffffffc046abfb _xfs_buf_find
>     ffffffffc046ae4a xfs_buf_get_map
>     ffffffffc046babd xfs_buf_read_map
>     ffffffffc0499931 xfs_trans_read_buf_map
>     ffffffffc044a561 xfs_da_read_buf
>     ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
>     ffffffffc0452b90 xfs_dir2_leaf_lookup_int
>     ffffffffc0452e0f xfs_dir2_leaf_lookup
>     ffffffffc044d9d3 xfs_dir_lookup
>     ffffffffc047d1d9 xfs_lookup
>     ffffffffc0479e53 xfs_vn_lookup
>     ffffffffb925347a path_openat
>     ffffffffb9254a71 do_filp_open
>     ffffffffb9242a94 do_sys_open
>     ffffffffb9242b9e sys_open
>     ffffffffb97e42b2 entry_SYSCALL_64_fastpath
>     00007fb0698162ed [unknown]
>
> Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
> high "Dispatch wait" times, in the tens-of-seconds range and consistent with
> the open() times I saw in that run.
>
> Still from the blktrace output, after searching a bit we can identify the
> request that wasn't dispatched:
>
>   8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
>   8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
>   8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
>   8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
>   <==== 'I' means Inserted (into the IO scheduler) ===================================>
>   8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
>   <==== Only 15s later the CFQ scheduler dispatches the request ======================>
>
> As we can see above, in this particular example CFQ took 15 seconds to dispatch
> this request. Going back to the full trace, we can see that the xfsaild queue
> had plenty of opportunity to run, and it was selected as the active queue many
> times. It would just always be preempted by something else (example):
>
>   8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
>   8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
>   8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
>   8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
>   8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
>   8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
>   8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
>
> where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
> IO dispatchers.
>
> The requests preempting the xfsaild queue are synchronous requests. That's a
> characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
> While it can be argued that preempting ASYNC requests in favor of SYNC is part
> of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
> goal.
>
> Moreover, unless I am misunderstanding something, that breaks the expectation
> set by the "fifo_expire_async" tunable, which in my system is set to the
> default.
>
> Looking at the code, it seems to me that the issue is that after we make
> an async queue active, there is no guarantee that it will execute any request.
>
> When the queue itself tests cfq_may_dispatch(), it can bail if it sees SYNC
> requests in flight. An incoming request from another queue can also preempt it
> in such a situation before we have the chance to execute anything (as seen in the
> trace above).
>
> This patch sets the must_dispatch flag if we notice that we have requests
> that are already fifo_expired. This flag is always cleared after
> cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
> the queue for subsequent requests (unless they are themselves expired)
>
> Care is taken during preempt to still allow rt requests to preempt us
> regardless.
>
> Testing my workload with this patch applied produces much better results.
> From the application side I see no timeouts, and the open() latency histogram
> generated by systemtap looks much better, with the worst outlier at 131ms:

Good analysis and the fix looks nice and clean. I'll take a closer look 
tomorrow. Thanks!

-- 
Jens Axboe

* Re: [PATCH] cfq: fix starvation of asynchronous writes
@ 2016-09-23 10:22   ` Paolo Valente
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Valente @ 2016-09-23 10:22 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-block, Jens Axboe, linux-kernel, Mark Brown, Ulf Hansson,
	Linus Walleij


> Il giorno 23 set 2016, alle ore 02:59, Glauber Costa <glauber@scylladb.com> ha scritto:
> 
> While debugging timeouts happening in my application workload (ScyllaDB), I have
> observed calls to open() taking a long time, ranging everywhere from 2 seconds -
> the first ones that are enough to time out my application - to more than 30
> seconds.
> 
> The problem seems to happen because XFS may block on pending metadata updates
> under certain circumstances, and that's confirmed with the following backtrace
> taken by the offcputime tool (iovisor/bcc):
> 
>    ffffffffb90c57b1 finish_task_switch
>    ffffffffb97dffb5 schedule
>    ffffffffb97e310c schedule_timeout
>    ffffffffb97e1f12 __down
>    ffffffffb90ea821 down
>    ffffffffc046a9dc xfs_buf_lock
>    ffffffffc046abfb _xfs_buf_find
>    ffffffffc046ae4a xfs_buf_get_map
>    ffffffffc046babd xfs_buf_read_map
>    ffffffffc0499931 xfs_trans_read_buf_map
>    ffffffffc044a561 xfs_da_read_buf
>    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
>    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
>    ffffffffc0452e0f xfs_dir2_leaf_lookup
>    ffffffffc044d9d3 xfs_dir_lookup
>    ffffffffc047d1d9 xfs_lookup
>    ffffffffc0479e53 xfs_vn_lookup
>    ffffffffb925347a path_openat
>    ffffffffb9254a71 do_filp_open
>    ffffffffb9242a94 do_sys_open
>    ffffffffb9242b9e sys_open
>    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
>    00007fb0698162ed [unknown]
> 
> Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
> high "Dispatch wait" times, in the tens-of-seconds range and consistent with
> the open() times I saw in that run.
> 
> Still from the blktrace output, after searching a bit we can identify the
> request that wasn't dispatched:
> 
>  8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
>  8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
>  8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
>  8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
>  <==== 'I' means Inserted (into the IO scheduler) ===================================>
>  8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
>  <==== Only 15s later the CFQ scheduler dispatches the request ======================>
> 
> As we can see above, in this particular example CFQ took 15 seconds to dispatch
> this request. Going back to the full trace, we can see that the xfsaild queue
> had plenty of opportunity to run, and it was selected as the active queue many
> times. It would just always be preempted by something else (example):
> 
>  8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
>  8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
>  8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
>  8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
>  8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
>  8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
>  8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
> 
> where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
> IO dispatchers.
> 
> The requests preempting the xfsaild queue are synchronous requests. That's a
> characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
> While it can be argued that preempting ASYNC requests in favor of SYNC is part
> of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
> goal.
> 
> Moreover, unless I am misunderstanding something, that breaks the expectation
> set by the "fifo_expire_async" tunable, which in my system is set to the
> default.
> 
> Looking at the code, it seems to me that the issue is that after we make
> an async queue active, there is no guarantee that it will execute any request.
> 
> When the queue itself tests cfq_may_dispatch(), it can bail if it sees SYNC
> requests in flight. An incoming request from another queue can also preempt it
> in such a situation before we have the chance to execute anything (as seen in the
> trace above).
> 

The BFQ scheduler should not suffer from this issue, because it does
not consider this condition at all for preemption.  Even better, BFQ
performs preemption as a function of just one condition, which is
formally proven not to break any request-delay guarantee.

Would you be willing to give it a try?  It might solve or at least
mitigate the specific latency problem addressed by your patch, and at
the same time provide many other benefits, in terms of low latency and
high throughput.

If you are willing to try it, then in [1] you can find the last BFQ
development branch, rebased against 4.8-rc7.  These commits add BFQ as an
extra scheduler.

Otherwise you can find BFQ for 4.7 here [2], and BFQ for 4.6 here [3].

If you prefer patches, then you can find them in [4] for 4.7, or in
[5] for 4.6.

Finally, if you need details about BFQ, have a look at [6], or just ask!

Thanks,
Paolo

[1] https://github.com/linusw/linux-bfq/tree/bfq-v8
[2] https://github.com/linusw/linux-bfq/tree/bfq-v8-v4.7
[3] https://github.com/linusw/linux-bfq/tree/bfq-v8-v4.6
[4] http://algogroup.unimore.it/people/paolo/disk_sched/patches/4.7.0-v8r3/
[5] http://algogroup.unimore.it/people/paolo/disk_sched/patches/4.6.0-v8r3/
[6] http://algogroup.unimore.it/people/paolo/disk_sched/

> This patch sets the must_dispatch flag if we notice that we have requests
> that are already fifo_expired. This flag is always cleared after
> cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
> the queue for subsequent requests (unless they are themselves expired)
> 
> Care is taken during preempt to still allow rt requests to preempt us
> regardless.
> 
> Testing my workload with this patch applied produces much better results.
> From the application side I see no timeouts, and the open() latency histogram
> generated by systemtap looks much better, with the worst outlier at 131ms:
> 



> Latency histogram of xfs_buf_lock acquisition (microseconds):
> value |-------------------------------------------------- count
>     0 |                                                     11
>     1 |@@@@                                                161
>     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
>     4 |@                                                    54
>     8 |                                                     36
>    16 |                                                      7
>    32 |                                                      0
>    64 |                                                      0
>       ~
>  1024 |                                                      0
>  2048 |                                                      0
>  4096 |                                                      1
>  8192 |                                                      1
> 16384 |                                                      2
> 32768 |                                                      0
> 65536 |                                                      0
> 131072 |                                                      1
> 262144 |                                                      0
> 524288 |                                                      0
> 
> Signed-off-by: Glauber Costa <glauber@scylladb.com>
> CC: Jens Axboe <axboe@kernel.dk>
> CC: linux-block@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
> 
> Signed-off-by: Glauber Costa <glauber@scylladb.com>
> ---
> block/cfq-iosched.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index cc2f6db..5e24d88 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -3042,7 +3042,6 @@ static struct request *cfq_check_fifo(struct cfq_queue *cfqq)
> 	if (ktime_get_ns() < rq->fifo_time)
> 		rq = NULL;
> 
> -	cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
> 	return rq;
> }
> 
> @@ -3420,6 +3419,9 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> {
> 	unsigned int max_dispatch;
> 
> +	if (cfq_cfqq_must_dispatch(cfqq))
> +		return true;
> +
> 	/*
> 	 * Drain async requests before we start sync IO
> 	 */
> @@ -3511,15 +3513,20 @@ static bool cfq_dispatch_request(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> 
> 	BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
> 
> +	rq = cfq_check_fifo(cfqq);
> +	if (rq)
> +		cfq_mark_cfqq_must_dispatch(cfqq);
> +
> 	if (!cfq_may_dispatch(cfqd, cfqq))
> 		return false;
> 
> 	/*
> 	 * follow expired path, else get first next available
> 	 */
> -	rq = cfq_check_fifo(cfqq);
> 	if (!rq)
> 		rq = cfqq->next_rq;
> +	else
> +		cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
> 
> 	/*
> 	 * insert request into driver dispatch list
> @@ -3989,7 +3996,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
> 	 * if the new request is sync, but the currently running queue is
> 	 * not, let the sync request have priority.
> 	 */
> -	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
> +	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
> 		return true;
> 
> 	/*
> -- 
> 2.5.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/

* Re: [PATCH] cfq: fix starvation of asynchronous writes
  2016-09-23 10:22   ` Paolo Valente
  (?)
@ 2016-09-23 11:28   ` Glauber Costa
  2016-09-23 12:19     ` Glauber Costa
  -1 siblings, 1 reply; 12+ messages in thread
From: Glauber Costa @ 2016-09-23 11:28 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Linus Walleij, Jens Axboe, linux-kernel, Mark Brown, linux-block,
	Ulf Hansson

On Sep 23, 2016 6:22 AM, "Paolo Valente" <paolo.valente@unimore.it> wrote:
>
>
> > Il giorno 23 set 2016, alle ore 02:59, Glauber Costa <
glauber@scylladb.com> ha scritto:
> >
> > While debugging timeouts happening in my application workload
(ScyllaDB), I have
> > observed calls to open() taking a long time, ranging everywhere from 2
seconds -
> > the first ones that are enough to time out my application - to more
than 30
> > seconds.
> >
> > The problem seems to happen because XFS may block on pending metadata
updates
> > under certain circumstances, and that's confirmed with the following
backtrace
> > taken by the offcputime tool (iovisor/bcc):
> >
> >    ffffffffb90c57b1 finish_task_switch
> >    ffffffffb97dffb5 schedule
> >    ffffffffb97e310c schedule_timeout
> >    ffffffffb97e1f12 __down
> >    ffffffffb90ea821 down
> >    ffffffffc046a9dc xfs_buf_lock
> >    ffffffffc046abfb _xfs_buf_find
> >    ffffffffc046ae4a xfs_buf_get_map
> >    ffffffffc046babd xfs_buf_read_map
> >    ffffffffc0499931 xfs_trans_read_buf_map
> >    ffffffffc044a561 xfs_da_read_buf
> >    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
> >    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
> >    ffffffffc0452e0f xfs_dir2_leaf_lookup
> >    ffffffffc044d9d3 xfs_dir_lookup
> >    ffffffffc047d1d9 xfs_lookup
> >    ffffffffc0479e53 xfs_vn_lookup
> >    ffffffffb925347a path_openat
> >    ffffffffb9254a71 do_filp_open
> >    ffffffffb9242a94 do_sys_open
> >    ffffffffb9242b9e sys_open
> >    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
> >    00007fb0698162ed [unknown]
> >
> > Inspecting my run with blktrace, I can see that the xfsaild kthread
exhibits very
> > high "Dispatch wait" times, in the tens-of-seconds range and
consistent with
> > the open() times I saw in that run.
> >
> > Still from the blktrace output, after searching a bit we can identify
the
> > request that wasn't dispatched:
> >
> >  8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1)
141696240
> >  8,0   11      153    81.092472889   804  Q  WM 141698288 + 8
[xfsaild/sda1]
> >  8,0   11      154    81.092473207   804  G  WM 141698288 + 8
[xfsaild/sda1]
> >  8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (
 22911) [xfsaild/sda1]
> >  <==== 'I' means Inserted (into the IO scheduler)
===================================>
> >  8,0    0   289372    96.718761435     0  D  WM 141698288 + 8
(15626265317) [swapper/0]
> >  <==== Only 15s later the CFQ scheduler dispatches the request
======================>
> >
> > As we can see above, in this particular example CFQ took 15 seconds to
dispatch
> > this request. Going back to the full trace, we can see that the xfsaild
queue
> > had plenty of opportunity to run, and it was selected as the active
queue many
> > times. It would just always be preempted by something else (example):
> >
> >  8,0    1        0    81.117912979     0  m   N cfq1618SN /
insert_request
> >  8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
> >  8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
> >  8,0    1        0    81.117914398     0  m   N cfq767A  / slice
expired t=1
> >  8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
> >  8,0    1        0    81.117915340     0  m   N / served: vt=1948520448
min_vt=1948520448
> >  8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1
disp=0 charge=0 iops=1 sect=0
> >
> > where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the
ScyllaDB
> > IO dispatchers.
> >
> > The requests preempting the xfsaild queue are synchronous requests.
That's a
> > characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT
requests.
> > While it can be argued that preempting ASYNC requests in favor of SYNC
is part
> > of the CFQ logic, I don't believe that doing so for 15+ seconds is
anyone's
> > goal.
> >
> > Moreover, unless I am misunderstanding something, that breaks the
expectation
> > set by the "fifo_expire_async" tunable, which in my system is set to the
> > default.
> >
> > Looking at the code, it seems to me that the issue is that after we make
> > an async queue active, there is no guarantee that it will execute any
request.
> >
> > When the queue itself tests cfq_may_dispatch(), it can bail if it
sees SYNC
> > requests in flight. An incoming request from another queue can also
preempt it
> > in such a situation before we have the chance to execute anything (as
seen in the
> > trace above).
> >
>
> The BFQ scheduler should not suffer from this issue, because it does
> not consider this condition at all for preemption.  Even better, BFQ
> performs preemption as a function of just one condition, which is
> formally proven not to break any request-delay guarantee.
>
> Would you be willing to give it a try?  It might solve or at least
> mitigate the specific latency problem addressed by your patch, and at
> the same time provide many other benefits, in terms of low latency and
> high throughput.

We are definitely interested in trying different elevators that could
benefit our workload. Especially because right now we don't really have a
default recommendation.

But just to be clear, I saw very early on that the problem didn't happen
with noop or deadline, so if this was just a characteristic of the CFQ I
would have stopped there.

I do believe however, as I have said in my commitlog message, that this is
just a bug. And no bug shall remain unfixed in my presence.

>
> If you are willing to try it, then in [1] you can find the last BFQ
> development branch, rebased against 4.8-rc7.  These commits add BFQ as an
> extra scheduler.
>
> Otherwise you can find BFQ for 4.7 here [2], and BFQ for 4.6 here [3].
>
> If you prefer patches, then you can find them in [4] for 4.7, or in
> [5] for 4.6.
>
> Finally, if you need details about BFQ, have a look at [6], or just ask!
>
> Thanks,
> Paolo
>
> [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
> [2] https://github.com/linusw/linux-bfq/tree/bfq-v8-v4.7
> [3] https://github.com/linusw/linux-bfq/tree/bfq-v8-v4.6
> [4]
http://algogroup.unimore.it/people/paolo/disk_sched/patches/4.7.0-v8r3/
> [5]
http://algogroup.unimore.it/people/paolo/disk_sched/patches/4.6.0-v8r3/
> [6] http://algogroup.unimore.it/people/paolo/disk_sched/
>
> > This patch sets the must_dispatch flag if we notice that we have
requests
> > that are already fifo_expired. This flag is always cleared after
> > cfq_dispatch_request() returns from cfq_dispatch_requests(), so it
won't pin
> > the queue for subsequent requests (unless they are themselves expired)
> >
> > Care is taken during preempt to still allow rt requests to preempt us
> > regardless.
> >
> > Testing my workload with this patch applied produces much better
results.
> > From the application side I see no timeouts, and the open() latency
histogram
> > generated by systemtap looks much better, with the worst outlier at
131ms:
> >
>
>
>
> > Latency histogram of xfs_buf_lock acquisition (microseconds):
> > value |-------------------------------------------------- count
> >     0 |                                                     11
> >     1 |@@@@                                                161
> >     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
> >     4 |@                                                    54
> >     8 |                                                     36
> >    16 |                                                      7
> >    32 |                                                      0
> >    64 |                                                      0
> >       ~
> >  1024 |                                                      0
> >  2048 |                                                      0
> >  4096 |                                                      1
> >  8192 |                                                      1
> > 16384 |                                                      2
> > 32768 |                                                      0
> > 65536 |                                                      0
> > 131072 |                                                      1
> > 262144 |                                                      0
> > 524288 |                                                      0
> >
> > Signed-off-by: Glauber Costa <glauber@scylladb.com>
> > CC: Jens Axboe <axboe@kernel.dk>
> > CC: linux-block@vger.kernel.org
> > CC: linux-kernel@vger.kernel.org
> >
> > Signed-off-by: Glauber Costa <glauber@scylladb.com>
> > ---
> > block/cfq-iosched.c | 13 ++++++++++---
> > 1 file changed, 10 insertions(+), 3 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index cc2f6db..5e24d88 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -3042,7 +3042,6 @@ static struct request *cfq_check_fifo(struct
cfq_queue *cfqq)
> >       if (ktime_get_ns() < rq->fifo_time)
> >               rq = NULL;
> >
> > -     cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
> >       return rq;
> > }
> >
> > @@ -3420,6 +3419,9 @@ static bool cfq_may_dispatch(struct cfq_data
*cfqd, struct cfq_queue *cfqq)
> > {
> >       unsigned int max_dispatch;
> >
> > +     if (cfq_cfqq_must_dispatch(cfqq))
> > +             return true;
> > +
> >       /*
> >        * Drain async requests before we start sync IO
> >        */
> > @@ -3511,15 +3513,20 @@ static bool cfq_dispatch_request(struct
cfq_data *cfqd, struct cfq_queue *cfqq)
> >
> >       BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
> >
> > +     rq = cfq_check_fifo(cfqq);
> > +     if (rq)
> > +             cfq_mark_cfqq_must_dispatch(cfqq);
> > +
> >       if (!cfq_may_dispatch(cfqd, cfqq))
> >               return false;
> >
> >       /*
> >        * follow expired path, else get first next available
> >        */
> > -     rq = cfq_check_fifo(cfqq);
> >       if (!rq)
> >               rq = cfqq->next_rq;
> > +     else
> > +             cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
> >
> >       /*
> >        * insert request into driver dispatch list
> > @@ -3989,7 +3996,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct
cfq_queue *new_cfqq,
> >        * if the new request is sync, but the currently running queue is
> >        * not, let the sync request have priority.
> >        */
> > -     if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
> > +     if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq) &&
!cfq_cfqq_must_dispatch(cfqq))
> >               return true;
> >
> >       /*
> > --
> > 2.5.5
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-block"
in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> --
> Paolo Valente
> Algogroup
> Dipartimento di Scienze Fisiche, Informatiche e Matematiche
> Via Campi 213/B
> 41125 Modena - Italy
> http://algogroup.unimore.it/people/paolo/
>
>
>
>
>

* Re: [PATCH] cfq: fix starvation of asynchronous writes
  2016-09-23 11:28   ` Glauber Costa
@ 2016-09-23 12:19     ` Glauber Costa
  0 siblings, 0 replies; 12+ messages in thread
From: Glauber Costa @ 2016-09-23 12:19 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Linus Walleij, Jens Axboe, linux-kernel, Mark Brown, linux-block,
	Ulf Hansson

On Fri, Sep 23, 2016 at 7:28 AM, Glauber Costa <glauber@scylladb.com> wrote:
> On Sep 23, 2016 6:22 AM, "Paolo Valente" <paolo.valente@unimore.it> wrote:
>>
>>
>> > Il giorno 23 set 2016, alle ore 02:59, Glauber Costa
>> > <glauber@scylladb.com> ha scritto:
>> >
>> > While debugging timeouts happening in my application workload
>> > (ScyllaDB), I have
>> > observed calls to open() taking a long time, ranging everywhere from 2
>> > seconds -
>> > the first ones that are enough to time out my application - to more than
>> > 30
>> > seconds.
>> >
>> > The problem seems to happen because XFS may block on pending metadata
>> > updates
>> > under certain circumstances, and that's confirmed with the following
>> > backtrace
>> > taken by the offcputime tool (iovisor/bcc):
>> >
>> >    ffffffffb90c57b1 finish_task_switch
>> >    ffffffffb97dffb5 schedule
>> >    ffffffffb97e310c schedule_timeout
>> >    ffffffffb97e1f12 __down
>> >    ffffffffb90ea821 down
>> >    ffffffffc046a9dc xfs_buf_lock
>> >    ffffffffc046abfb _xfs_buf_find
>> >    ffffffffc046ae4a xfs_buf_get_map
>> >    ffffffffc046babd xfs_buf_read_map
>> >    ffffffffc0499931 xfs_trans_read_buf_map
>> >    ffffffffc044a561 xfs_da_read_buf
>> >    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
>> >    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
>> >    ffffffffc0452e0f xfs_dir2_leaf_lookup
>> >    ffffffffc044d9d3 xfs_dir_lookup
>> >    ffffffffc047d1d9 xfs_lookup
>> >    ffffffffc0479e53 xfs_vn_lookup
>> >    ffffffffb925347a path_openat
>> >    ffffffffb9254a71 do_filp_open
>> >    ffffffffb9242a94 do_sys_open
>> >    ffffffffb9242b9e sys_open
>> >    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
>> >    00007fb0698162ed [unknown]
>> >
>> > Inspecting my run with blktrace, I can see that the xfsaild kthread
>> > exhibits very
>> > high "Dispatch wait" times, in the tens-of-seconds range and
>> > consistent with
>> > the open() times I saw in that run.
>> >
>> > Still from the blktrace output, after searching a bit we can identify
>> > the
>> > request that wasn't dispatched:
>> >
>> >  8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1)
>> > 141696240
>> >  8,0   11      153    81.092472889   804  Q  WM 141698288 + 8
>> > [xfsaild/sda1]
>> >  8,0   11      154    81.092473207   804  G  WM 141698288 + 8
>> > [xfsaild/sda1]
>> >  8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911)
>> > [xfsaild/sda1]
>> >  <==== 'I' means Inserted (into the IO scheduler)
>> > ===================================>
>> >  8,0    0   289372    96.718761435     0  D  WM 141698288 + 8
>> > (15626265317) [swapper/0]
>> >  <==== Only 15s later the CFQ scheduler dispatches the request
>> > ======================>
>> >
>> > As we can see above, in this particular example CFQ took 15 seconds to
>> > dispatch
>> > this request. Going back to the full trace, we can see that the xfsaild
>> > queue
>> > had plenty of opportunity to run, and it was selected as the active
>> > queue many
>> > times. It would just always be preempted by something else (example):
>> >
>> >  8,0    1        0    81.117912979     0  m   N cfq1618SN /
>> > insert_request
>> >  8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
>> >  8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
>> >  8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired
>> > t=1
>> >  8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
>> >  8,0    1        0    81.117915340     0  m   N / served: vt=1948520448
>> > min_vt=1948520448
>> >  8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1
>> > disp=0 charge=0 iops=1 sect=0
>> >
>> > where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the
>> > ScyllaDB
>> > IO dispatchers.
>> >
>> > The requests preempting the xfsaild queue are synchronous requests.
>> > That's a
>> > characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT
>> > requests.
>> > While it can be argued that preempting ASYNC requests in favor of SYNC
>> > is part
>> > of the CFQ logic, I don't believe that doing so for 15+ seconds is
>> > anyone's
>> > goal.
>> >
>> > Moreover, unless I am misunderstanding something, that breaks the
>> > expectation
>> > set by the "fifo_expire_async" tunable, which in my system is set to the
>> > default.
>> >
>> > Looking at the code, it seems to me that the issue is that after we make
>> > an async queue active, there is no guarantee that it will execute any
>> > request.
>> >
>> > When the queue itself tests cfq_may_dispatch(), it can bail if it
>> > sees SYNC
>> > requests in flight. An incoming request from another queue can also
>> > preempt it
>> > in such a situation before we have the chance to execute anything (as seen
>> > in the
>> > trace above).
>> >
>>
>> The BFQ scheduler should not suffer from this issue, because it does
>> not consider this condition at all for preemption.  Even better, BFQ
>> performs preemption as a function of just one condition, which is
>> formally proven not to break any request-delay guarantee.
>>
>> Would you be willing to give it a try?  It might solve or at least
>> mitigate the specific latency problem addressed by your patch, and at
>> the same time provide many other benefits, in terms of low latency and
>> high throughput.
>
> We are definitely interested in trying different elevators that could
> benefit our workload. Especially because right now we don't really have a
> default recommendation.
>
> But just to be clear, I saw very early on that the problem didn't happen
> with noop or deadline, so if this was just a characteristic of the CFQ I
> would have stopped there.
>
> I do believe however, as I have said in my commitlog message, that this is
> just a bug. And no bug shall remain unfixed in my presence.

Sorry, this was sent in HTML. I just got mail from vger.kernel.org saying
the message was refused due to that. I haven't dealt with LKML in ages and
my Gmail phone interface was sending HTML messages by default.

>
>>
>> If you are willing to try it, then in [1] you can find the last BFQ
>> development branch, rebased against 4.8-rc7.  These commits add BFQ as an
>> extra scheduler.
>>
>> Otherwise you can find BFQ for 4.7 here [2], and BFQ for 4.6 here [3].
>>
>> If you prefer patches, then you can find them in [4] for 4.7, or in
>> [5] for 4.6.
>>
>> Finally, if you need details about BFQ, have a look at [6], or just ask!
>>
>> Thanks,
>> Paolo
>>
>> [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
>> [2] https://github.com/linusw/linux-bfq/tree/bfq-v8-v4.7
>> [3] https://github.com/linusw/linux-bfq/tree/bfq-v8-v4.6
>> [4]
>> http://algogroup.unimore.it/people/paolo/disk_sched/patches/4.7.0-v8r3/
>> [5]
>> http://algogroup.unimore.it/people/paolo/disk_sched/patches/4.6.0-v8r3/
>> [6] http://algogroup.unimore.it/people/paolo/disk_sched/
>>
>> > This patch sets the must_dispatch flag if we notice that we have
>> > requests
>> > that are already fifo_expired. This flag is always cleared after
>> > cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't
>> > pin
>> > the queue for subsequent requests (unless they are themselves expired)
>> >
>> > Care is taken during preempt to still allow rt requests to preempt us
>> > regardless.
>> >
>> > Testing my workload with this patch applied produces much better
>> > results.
>> > From the application side I see no timeouts, and the open() latency
>> > histogram
>> > generated by systemtap looks much better, with the worst outlier at
>> > 131ms:
>> >
>>
>>
>>
>> > Latency histogram of xfs_buf_lock acquisition (microseconds):
>> > value |-------------------------------------------------- count
>> >     0 |                                                     11
>> >     1 |@@@@                                                161
>> >     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
>> >     4 |@                                                    54
>> >     8 |                                                     36
>> >    16 |                                                      7
>> >    32 |                                                      0
>> >    64 |                                                      0
>> >       ~
>> >  1024 |                                                      0
>> >  2048 |                                                      0
>> >  4096 |                                                      1
>> >  8192 |                                                      1
>> > 16384 |                                                      2
>> > 32768 |                                                      0
>> > 65536 |                                                      0
>> > 131072 |                                                      1
>> > 262144 |                                                      0
>> > 524288 |                                                      0
>> >
>> > Signed-off-by: Glauber Costa <glauber@scylladb.com>
>> > CC: Jens Axboe <axboe@kernel.dk>
>> > CC: linux-block@vger.kernel.org
>> > CC: linux-kernel@vger.kernel.org
>> >
>> > Signed-off-by: Glauber Costa <glauber@scylladb.com>
>> > ---
>> > block/cfq-iosched.c | 13 ++++++++++---
>> > 1 file changed, 10 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> > index cc2f6db..5e24d88 100644
>> > --- a/block/cfq-iosched.c
>> > +++ b/block/cfq-iosched.c
>> > @@ -3042,7 +3042,6 @@ static struct request *cfq_check_fifo(struct
>> > cfq_queue *cfqq)
>> >       if (ktime_get_ns() < rq->fifo_time)
>> >               rq = NULL;
>> >
>> > -     cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
>> >       return rq;
>> > }
>> >
>> > @@ -3420,6 +3419,9 @@ static bool cfq_may_dispatch(struct cfq_data
>> > *cfqd, struct cfq_queue *cfqq)
>> > {
>> >       unsigned int max_dispatch;
>> >
>> > +     if (cfq_cfqq_must_dispatch(cfqq))
>> > +             return true;
>> > +
>> >       /*
>> >        * Drain async requests before we start sync IO
>> >        */
>> > @@ -3511,15 +3513,20 @@ static bool cfq_dispatch_request(struct cfq_data
>> > *cfqd, struct cfq_queue *cfqq)
>> >
>> >       BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
>> >
>> > +     rq = cfq_check_fifo(cfqq);
>> > +     if (rq)
>> > +             cfq_mark_cfqq_must_dispatch(cfqq);
>> > +
>> >       if (!cfq_may_dispatch(cfqd, cfqq))
>> >               return false;
>> >
>> >       /*
>> >        * follow expired path, else get first next available
>> >        */
>> > -     rq = cfq_check_fifo(cfqq);
>> >       if (!rq)
>> >               rq = cfqq->next_rq;
>> > +     else
>> > +             cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
>> >
>> >       /*
>> >        * insert request into driver dispatch list
>> > @@ -3989,7 +3996,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct
>> > cfq_queue *new_cfqq,
>> >        * if the new request is sync, but the currently running queue is
>> >        * not, let the sync request have priority.
>> >        */
>> > -     if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
>> > +     if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq) &&
>> > !cfq_cfqq_must_dispatch(cfqq))
>> >               return true;
>> >
>> >       /*
>> > --
>> > 2.5.5
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-block"
>> > in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>> --
>> Paolo Valente
>> Algogroup
>> Dipartimento di Scienze Fisiche, Informatiche e Matematiche
>> Via Campi 213/B
>> 41125 Modena - Italy
>> http://algogroup.unimore.it/people/paolo/
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] cfq: fix starvation of asynchronous writes
  2016-09-23  2:10 ` Jens Axboe
@ 2016-09-23 16:11   ` Jens Axboe
  0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2016-09-23 16:11 UTC (permalink / raw)
  To: Glauber Costa, linux-block; +Cc: linux-kernel

On 09/22/2016 08:10 PM, Jens Axboe wrote:
> On 09/22/2016 06:59 PM, Glauber Costa wrote:
>> While debugging timeouts happening in my application workload
>> (ScyllaDB), I have
>> observed calls to open() taking a long time, ranging everywhere from 2
>> seconds -
>> the first ones that are enough to time out my application - to more
>> than 30
>> seconds.
>>
>> The problem seems to happen because XFS may block on pending metadata
>> updates
>> under certain circumstances, and that's confirmed with the following
>> backtrace
>> taken by the offcputime tool (iovisor/bcc):
>>
>>     ffffffffb90c57b1 finish_task_switch
>>     ffffffffb97dffb5 schedule
>>     ffffffffb97e310c schedule_timeout
>>     ffffffffb97e1f12 __down
>>     ffffffffb90ea821 down
>>     ffffffffc046a9dc xfs_buf_lock
>>     ffffffffc046abfb _xfs_buf_find
>>     ffffffffc046ae4a xfs_buf_get_map
>>     ffffffffc046babd xfs_buf_read_map
>>     ffffffffc0499931 xfs_trans_read_buf_map
>>     ffffffffc044a561 xfs_da_read_buf
>>     ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
>>     ffffffffc0452b90 xfs_dir2_leaf_lookup_int
>>     ffffffffc0452e0f xfs_dir2_leaf_lookup
>>     ffffffffc044d9d3 xfs_dir_lookup
>>     ffffffffc047d1d9 xfs_lookup
>>     ffffffffc0479e53 xfs_vn_lookup
>>     ffffffffb925347a path_openat
>>     ffffffffb9254a71 do_filp_open
>>     ffffffffb9242a94 do_sys_open
>>     ffffffffb9242b9e sys_open
>>     ffffffffb97e42b2 entry_SYSCALL_64_fastpath
>>     00007fb0698162ed [unknown]
>>
>> Inspecting my run with blktrace, I can see that the xfsaild kthread
>> exhibits very
>> high "Dispatch wait" times, in the dozens-of-seconds range and
>> consistent with
>> the open() times I have seen in that run.
>>
>> Still from the blktrace output, we can, after searching a bit, identify
>> the
>> request that wasn't dispatched:
>>
>>   8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <-
>> (8,1) 141696240
>>   8,0   11      153    81.092472889   804  Q  WM 141698288 + 8
>> [xfsaild/sda1]
>>   8,0   11      154    81.092473207   804  G  WM 141698288 + 8
>> [xfsaild/sda1]
>>   8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (
>> 22911) [xfsaild/sda1]
>>   <==== 'I' means Inserted (into the IO scheduler)
>> ===================================>
>>   8,0    0   289372    96.718761435     0  D  WM 141698288 + 8
>> (15626265317) [swapper/0]
>>   <==== Only 15s later the CFQ scheduler dispatches the request
>> ======================>
>>
>> As we can see above, in this particular example CFQ took 15 seconds to
>> dispatch
>> this request. Going back to the full trace, we can see that the
>> xfsaild queue
>> had plenty of opportunity to run, and it was selected as the active
>> queue many
>> times. It would just always be preempted by something else (example):
>>
>>   8,0    1        0    81.117912979     0  m   N cfq1618SN /
>> insert_request
>>   8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
>>   8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
>>   8,0    1        0    81.117914398     0  m   N cfq767A  / slice
>> expired t=1
>>   8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
>>   8,0    1        0    81.117915340     0  m   N / served:
>> vt=1948520448 min_vt=1948520448
>>   8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1
>> disp=0 charge=0 iops=1 sect=0
>>
>> where cfq767 is the xfsaild queue and cfq1618 corresponds to one of
>> the ScyllaDB
>> IO dispatchers.
>>
>> The requests preempting the xfsaild queue are synchronous requests.
>> That's a
>> characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT
>> requests.
>> While it can be argued that preempting ASYNC requests in favor of SYNC
>> is part
>> of the CFQ logic, I don't believe that doing so for 15+ seconds is
>> anyone's
>> goal.
>>
>> Moreover, unless I am misunderstanding something, that breaks the
>> expectation
>> set by the "fifo_expire_async" tunable, which in my system is set to the
>> default.
>>
>> Looking at the code, it seems to me that the issue is that after we make
>> an async queue active, there is no guarantee that it will execute any
>> request.
>>
>> When the queue itself checks cfq_may_dispatch(), it can bail if it
>> sees SYNC
>> requests in flight. An incoming request from another queue can also
>> preempt it
>> in such a situation before we have the chance to execute anything (as
>> seen in the
>> trace above).
>>
>> This patch sets the must_dispatch flag if we notice that we have requests
>> that are already fifo_expired. This flag is always cleared after
>> cfq_dispatch_request() returns from cfq_dispatch_requests(), so it
>> won't pin
>> the queue for subsequent requests (unless they are themselves expired).
>>
>> Care is taken during preempt to still allow rt requests to preempt us
>> regardless.
>>
>> Testing my workload with this patch applied produces much better results.
>> From the application side I see no timeouts, and the open() latency
>> histogram
>> generated by systemtap looks much better, with the worst outlier at
>> 131ms:
>
> Good analysis and the fix looks nice and clean. I'll take a closer look
> tomorrow. Thanks!

Looks like a good patch to me. I've queued it up, thanks Glauber.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Fwd: [PATCH] cfq: fix starvation of asynchronous writes
  2016-09-23  0:59 [PATCH] cfq: fix starvation of asynchronous writes Glauber Costa
  2016-09-23  2:10 ` Jens Axboe
  2016-09-23 10:22   ` Paolo Valente
@ 2016-10-20 18:08 ` Glauber Costa
  2016-10-21  7:36   ` Greg KH
  2 siblings, 1 reply; 12+ messages in thread
From: Glauber Costa @ 2016-10-20 18:08 UTC (permalink / raw)
  To: stable

subject: [PATCH] cfq: fix starvation of asynchronous writes
commit id: 3932a86b4b9d1f0b049d64d4591ce58ad18b44ec
why: cfq starvation; can easily trigger with O_DIRECT-heavy workloads.
versions: all that are maintained, if possible. I have verified it
cherry picks cleanly at least until 4.5
reason for not using "CC stable": I just forgot, sorry.


---------- Forwarded message ----------
From: Glauber Costa <glauber@scylladb.com>
Date: Thu, Sep 22, 2016 at 8:59 PM
Subject: [PATCH] cfq: fix starvation of asynchronous writes
To: linux-block@vger.kernel.org
Cc: Glauber Costa <glauber@scylladb.com>, Jens Axboe
<axboe@kernel.dk>, linux-kernel@vger.kernel.org


While debugging timeouts happening in my application workload (ScyllaDB), I have
observed calls to open() taking a long time, ranging everywhere from 2 seconds -
the first ones that are enough to time out my application - to more than 30
seconds.

The problem seems to happen because XFS may block on pending metadata updates
under certain circumstances, and that's confirmed with the following backtrace
taken by the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
high "Dispatch wait" times, in the dozens-of-seconds range and consistent with
the open() times I have seen in that run.

Still from the blktrace output, we can, after searching a bit, identify the
request that wasn't dispatched:

  8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <-
(8,1) 141696240
  8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
  8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
  8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (
22911) [xfsaild/sda1]
  <==== 'I' means Inserted (into the IO scheduler)
===================================>
  8,0    0   289372    96.718761435     0  D  WM 141698288 + 8
(15626265317) [swapper/0]
  <==== Only 15s later the CFQ scheduler dispatches the request
======================>

As we can see above, in this particular example CFQ took 15 seconds to dispatch
this request. Going back to the full trace, we can see that the xfsaild queue
had plenty of opportunity to run, and it was selected as the active queue many
times. It would just always be preempted by something else (example):

  8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
  8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
  8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
  8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
  8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
  8,0    1        0    81.117915340     0  m   N / served:
vt=1948520448 min_vt=1948520448
  8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1
disp=0 charge=0 iops=1 sect=0

where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
IO dispatchers.

The requests preempting the xfsaild queue are synchronous requests. That's a
characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
While it can be argued that preempting ASYNC requests in favor of SYNC is part
of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
goal.

Moreover, unless I am misunderstanding something, that breaks the expectation
set by the "fifo_expire_async" tunable, which in my system is set to the
default.
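
(For reference, the tunable is backed by cfq_fifo_expire[] in block/cfq-iosched.c;
below is a rough sketch of the ns-based defaults of that era, shown only to
illustrate that async requests are meant to expire after roughly 250ms:)

    /* fifo expiry defaults: index 0 = async (~250ms), index 1 = sync (~125ms) */
    static u64 cfq_fifo_expire[2] = { NSEC_PER_SEC / 4, NSEC_PER_SEC / 8 };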

Looking at the code, it seems to me that the issue is that after we make
an async queue active, there is no guarantee that it will execute any request.

When the queue itself checks cfq_may_dispatch(), it can bail if it sees SYNC
requests in flight. An incoming request from another queue can also preempt it
in such a situation before we have the chance to execute anything (as seen in the
trace above).
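
Roughly, that pre-patch bail-out looks like this (a sketch of the relevant check
in cfq_may_dispatch(), reconstructed for illustration; not an exact quote of the
upstream source):

        /*
         * If this is an async queue and we have sync IO in flight,
         * the async queue gives up and dispatches nothing.
         */
        if (cfqd->rq_in_flight[BLK_RW_SYNC] && !cfq_cfqq_sync(cfqq))
                return false;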

This patch sets the must_dispatch flag if we notice that we have requests
that are already fifo_expired. This flag is always cleared after
cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
the queue for subsequent requests (unless they are themselves expired).
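
The clearing happens in already-existing code; a rough sketch of that part of
cfq_dispatch_requests() (reconstructed for illustration, with the flag helpers
generated by the CFQ_CFQQ_FNS() macro):

        /*
         * Dispatch a request from this cfqq, if it is allowed
         */
        if (!cfq_dispatch_request(cfqd, cfqq))
                return 0;

        cfqq->slice_dispatch++;
        /* must_dispatch only ever lives for a single dispatch decision */
        cfq_clear_cfqq_must_dispatch(cfqq);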

Care is taken during preempt to still allow rt requests to preempt us
regardless.
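
That works because the class-based check in cfq_should_preempt() is separate from
the sync-vs-async check this patch gates; a sketch of the two checks side by side
(placement reconstructed for illustration and may differ slightly upstream):

        /*
         * if the new request is sync, but the currently running queue is
         * not, let the sync request have priority.
         */
        if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
                return true;

        /*
         * Allow an RT request in preference to a non-RT request,
         * regardless of must_dispatch.
         */
        if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
                return true;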

Testing my workload with this patch applied produces much better results.
From the application side I see no timeouts, and the open() latency histogram
generated by systemtap looks much better, with the worst outlier at 131ms:

Latency histogram of xfs_buf_lock acquisition (microseconds):
 value |-------------------------------------------------- count
     0 |                                                     11
     1 |@@@@                                                161
     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
     4 |@                                                    54
     8 |                                                     36
    16 |                                                      7
    32 |                                                      0
    64 |                                                      0
       ~
  1024 |                                                      0
  2048 |                                                      0
  4096 |                                                      1
  8192 |                                                      1
 16384 |                                                      2
 32768 |                                                      0
 65536 |                                                      0
131072 |                                                      1
262144 |                                                      0
524288 |                                                      0

Signed-off-by: Glauber Costa <glauber@scylladb.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: linux-block@vger.kernel.org
CC: linux-kernel@vger.kernel.org

Signed-off-by: Glauber Costa <glauber@scylladb.com>
---
 block/cfq-iosched.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index cc2f6db..5e24d88 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3042,7 +3042,6 @@ static struct request *cfq_check_fifo(struct
cfq_queue *cfqq)
        if (ktime_get_ns() < rq->fifo_time)
                rq = NULL;

-       cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
        return rq;
 }

@@ -3420,6 +3419,9 @@ static bool cfq_may_dispatch(struct cfq_data
*cfqd, struct cfq_queue *cfqq)
 {
        unsigned int max_dispatch;

+       if (cfq_cfqq_must_dispatch(cfqq))
+               return true;
+
        /*
         * Drain async requests before we start sync IO
         */
@@ -3511,15 +3513,20 @@ static bool cfq_dispatch_request(struct
cfq_data *cfqd, struct cfq_queue *cfqq)

        BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));

+       rq = cfq_check_fifo(cfqq);
+       if (rq)
+               cfq_mark_cfqq_must_dispatch(cfqq);
+
        if (!cfq_may_dispatch(cfqd, cfqq))
                return false;

        /*
         * follow expired path, else get first next available
         */
-       rq = cfq_check_fifo(cfqq);
        if (!rq)
                rq = cfqq->next_rq;
+       else
+               cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);

        /*
         * insert request into driver dispatch list
@@ -3989,7 +3996,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct
cfq_queue *new_cfqq,
         * if the new request is sync, but the currently running queue is
         * not, let the sync request have priority.
         */
-       if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
+       if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq) &&
!cfq_cfqq_must_dispatch(cfqq))
                return true;

        /*
--
2.5.5

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Fwd: [PATCH] cfq: fix starvation of asynchronous writes
  2016-10-20 18:08 ` Fwd: " Glauber Costa
@ 2016-10-21  7:36   ` Greg KH
  2016-10-21  8:53     ` Holger Hoffstätte
  2016-10-21 13:15     ` Glauber Costa
  0 siblings, 2 replies; 12+ messages in thread
From: Greg KH @ 2016-10-21  7:36 UTC (permalink / raw)
  To: Glauber Costa; +Cc: stable

On Thu, Oct 20, 2016 at 02:08:50PM -0400, Glauber Costa wrote:
> subject: [PATCH] cfq: fix starvation of asynchronous writes
> commit id: 3932a86b4b9d1f0b049d64d4591ce58ad18b44ec
> why: cfq starvation; can easily trigger with O_DIRECT-heavy workloads.
> versions: all that are maintained, if possible. I have verified it
> cherry picks cleanly at least until 4.5
> reason for not using "CC stable": I just forgot, sorry.

It seems to apply to 4.4 as well, is that ok, or should it not go there?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fwd: [PATCH] cfq: fix starvation of asynchronous writes
  2016-10-21  7:36   ` Greg KH
@ 2016-10-21  8:53     ` Holger Hoffstätte
  2016-10-21  9:07       ` Greg KH
  2016-10-21 13:15     ` Glauber Costa
  1 sibling, 1 reply; 12+ messages in thread
From: Holger Hoffstätte @ 2016-10-21  8:53 UTC (permalink / raw)
  To: stable

On Fri, 21 Oct 2016 09:36:34 +0200, Greg KH wrote:

> On Thu, Oct 20, 2016 at 02:08:50PM -0400, Glauber Costa wrote:
>> subject: [PATCH] cfq: fix starvation of asynchronous writes
>> commit id: 3932a86b4b9d1f0b049d64d4591ce58ad18b44ec
>> why: cfq starvation; can easily trigger with O_DIRECT-heavy workloads.
>> versions: all that are maintained, if possible. I have verified it
>> cherry picks cleanly at least until 4.5
>> reason for not using "CC stable": I just forgot, sorry.
> 
> It seems to apply to 4.4 as well, is that ok, or should it not go there?

FWIW I've had it in 4.4 until recently and it worked just fine.
The corner case that is addressed is generic and seems to have been
there quite a while.

-h


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fwd: [PATCH] cfq: fix starvation of asynchronous writes
  2016-10-21  8:53     ` Holger Hoffstätte
@ 2016-10-21  9:07       ` Greg KH
  0 siblings, 0 replies; 12+ messages in thread
From: Greg KH @ 2016-10-21  9:07 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: stable

On Fri, Oct 21, 2016 at 08:53:17AM +0000, Holger Hoffstätte wrote:
> On Fri, 21 Oct 2016 09:36:34 +0200, Greg KH wrote:
> 
> > On Thu, Oct 20, 2016 at 02:08:50PM -0400, Glauber Costa wrote:
> >> subject: [PATCH] cfq: fix starvation of asynchronous writes
> >> commit id: 3932a86b4b9d1f0b049d64d4591ce58ad18b44ec
> >> why: cfq starvation; can easily trigger with O_DIRECT-heavy workloads.
> >> versions: all that are maintained, if possible. I have verified it
> >> cherry picks cleanly at least until 4.5
> >> reason for not using "CC stable": I just forgot, sorry.
> > 
> > It seems to apply to 4.4 as well, is that ok, or should it not go there?
> 
> FWIW I've had it in 4.4 until recently and it worked just fine.
> The corner case that is addressed is generic and seems to have been
> there quite a while.

Thanks, now applied there as well.

greg k-h

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fwd: [PATCH] cfq: fix starvation of asynchronous writes
  2016-10-21  7:36   ` Greg KH
  2016-10-21  8:53     ` Holger Hoffstätte
@ 2016-10-21 13:15     ` Glauber Costa
  1 sibling, 0 replies; 12+ messages in thread
From: Glauber Costa @ 2016-10-21 13:15 UTC (permalink / raw)
  To: Greg KH; +Cc: stable

On Fri, Oct 21, 2016 at 3:36 AM, Greg KH <greg@kroah.com> wrote:
> On Thu, Oct 20, 2016 at 02:08:50PM -0400, Glauber Costa wrote:
>> subject: [PATCH] cfq: fix starvation of asynchronous writes
>> commit id: 3932a86b4b9d1f0b049d64d4591ce58ad18b44ec
>> why: cfq starvation; can easily trigger with O_DIRECT-heavy workloads.
>> versions: all that are maintained, if possible. I have verified it
>> cherry picks cleanly at least until 4.5
>> reason for not using "CC stable": I just forgot, sorry.
>
> It seems to apply to 4.4 as well, is that ok, or should it not go there?
>

As far as I know it affects 4.4 as well.

> thanks,
>
> greg k-h

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-10-21 13:15 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-23  0:59 [PATCH] cfq: fix starvation of asynchronous writes Glauber Costa
2016-09-23  2:10 ` Jens Axboe
2016-09-23 16:11   ` Jens Axboe
2016-09-23 10:22 ` Paolo Valente
2016-09-23 10:22   ` Paolo Valente
2016-09-23 11:28   ` Glauber Costa
2016-09-23 12:19     ` Glauber Costa
2016-10-20 18:08 ` Fwd: " Glauber Costa
2016-10-21  7:36   ` Greg KH
2016-10-21  8:53     ` Holger Hoffstätte
2016-10-21  9:07       ` Greg KH
2016-10-21 13:15     ` Glauber Costa
