On Sep 23, 2016 6:22 AM, "Paolo Valente" wrote:
>
> On 23 Sep 2016, at 02:59, Glauber Costa <glauber@scylladb.com> wrote:
> >
> > While debugging timeouts happening in my application workload (ScyllaDB), I have
> > observed calls to open() taking a long time, ranging anywhere from 2 seconds -
> > the first ones already enough to time out my application - to more than 30
> > seconds.
> >
> > The problem seems to happen because XFS may block on pending metadata updates
> > under certain circumstances, and that's confirmed by the following backtrace
> > taken with the offcputime tool (iovisor/bcc):
> >
> >     ffffffffb90c57b1 finish_task_switch
> >     ffffffffb97dffb5 schedule
> >     ffffffffb97e310c schedule_timeout
> >     ffffffffb97e1f12 __down
> >     ffffffffb90ea821 down
> >     ffffffffc046a9dc xfs_buf_lock
> >     ffffffffc046abfb _xfs_buf_find
> >     ffffffffc046ae4a xfs_buf_get_map
> >     ffffffffc046babd xfs_buf_read_map
> >     ffffffffc0499931 xfs_trans_read_buf_map
> >     ffffffffc044a561 xfs_da_read_buf
> >     ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
> >     ffffffffc0452b90 xfs_dir2_leaf_lookup_int
> >     ffffffffc0452e0f xfs_dir2_leaf_lookup
> >     ffffffffc044d9d3 xfs_dir_lookup
> >     ffffffffc047d1d9 xfs_lookup
> >     ffffffffc0479e53 xfs_vn_lookup
> >     ffffffffb925347a path_openat
> >     ffffffffb9254a71 do_filp_open
> >     ffffffffb9242a94 do_sys_open
> >     ffffffffb9242b9e sys_open
> >     ffffffffb97e42b2 entry_SYSCALL_64_fastpath
> >     00007fb0698162ed [unknown]
> >
> > Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
> > high "Dispatch wait" times, in the tens-of-seconds range and consistent with
> > the open() times I saw in that run.
> >
> > Still from the blktrace output, after searching a bit, we can identify the
> > request that wasn't dispatched:
> >
> >   8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
> >   8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
> >   8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
> >   8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (      22911) [xfsaild/sda1]
> >   <==== 'I' means Inserted (into the IO scheduler) ===================================>
> >   8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
> >   <==== Only 15s later the CFQ scheduler dispatches the request ======================>
> >
> > As we can see above, in this particular example CFQ took 15 seconds to dispatch
> > this request. Going back to the full trace, we can see that the xfsaild queue
> > had plenty of opportunity to run, and it was selected as the active queue many
> > times. It would just always be preempted by something else (example):
> >
> >   8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
> >   8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
> >   8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
> >   8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
> >   8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
> >   8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
> >   8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
> >
> > where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
> > IO dispatchers.
> >
> > The requests preempting the xfsaild queue are synchronous requests. That's a
> > characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
> > While it can be argued that preempting ASYNC requests in favor of SYNC is part
> > of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
> > goal.
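> >
> > For reference, the check that fires here is the sync-over-async test in
> > cfq_should_preempt(), visible in the last hunk of the patch below: a sync
> > request preempts the active queue whenever that queue is async, and nothing
> > bounds how often that can happen:
> >
> >         /* a sync request always wins over an active async queue */
> >         if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
> >                 return true;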
> >
> > Moreover, unless I am misunderstanding something, that breaks the expectation
> > set by the "fifo_expire_async" tunable, which in my system is set to the
> > default.
> >
> > Looking at the code, it seems to me that the issue is that after we make
> > an async queue active, there is no guarantee that it will execute any request.
> >
> > When the queue tests whether it may dispatch, in cfq_may_dispatch(), it can
> > bail if it sees SYNC requests in flight. An incoming request from another
> > queue can also preempt it in such a situation before we have the chance to
> > execute anything (as seen in the trace above).
> >
> The BFQ scheduler should not suffer from this issue, because it does
> not consider this condition at all for preemption. Even better, BFQ
> performs preemption as a function of just one condition, which is
> formally proven not to break any request-delay guarantee.
>
> Would you be willing to give it a try? It might solve or at least
> mitigate the specific latency problem addressed by your patch, and at
> the same time provide many other benefits, in terms of low latency and
> high throughput.

We are definitely interested in trying different elevators that could benefit
our workload, especially because right now we don't really have a default
recommendation. But just to be clear, I saw very early on that the problem
didn't happen with noop or deadline, so if this were just a characteristic of
CFQ I would have stopped there. I do believe, however, as I said in my commit
log message, that this is just a bug. And no bug shall remain unfixed in my
presence.

> If you are willing to try it, then in [1] you can find the latest BFQ
> development branch, rebased against 4.8-rc7. These commits add BFQ as an
> extra scheduler.
>
> Otherwise you can find BFQ for 4.7 here [2], and BFQ for 4.6 here [3].
>
> If you prefer patches, then you can find them in [4] for 4.7, or in
> [5] for 4.6.
>
> Finally, if you need details about BFQ, have a look at [6], or just ask!
>
> Thanks,
> Paolo
>
> [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
> [2] https://github.com/linusw/linux-bfq/tree/bfq-v8-v4.7
> [3] https://github.com/linusw/linux-bfq/tree/bfq-v8-v4.6
> [4] http://algogroup.unimore.it/people/paolo/disk_sched/patches/4.7.0-v8r3/
> [5] http://algogroup.unimore.it/people/paolo/disk_sched/patches/4.6.0-v8r3/
> [6] http://algogroup.unimore.it/people/paolo/disk_sched/
>
> > This patch sets the must_dispatch flag if we notice that we have requests
> > that are already fifo_expired. This flag is always cleared after
> > cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
> > the queue for subsequent requests (unless they are themselves expired).
> >
> > Care is taken during preempt to still allow rt requests to preempt us
> > regardless.
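> >
> > For context, the deadline that cfq_check_fifo() tests is stamped when a
> > request is inserted into its queue. A minimal sketch of that insert-time
> > path, with names as in the 4.8-era cfq-iosched.c (fifo_expire_async and
> > fifo_expire_sync back the two entries of cfq_fifo_expire[]):
> >
> >         /* cfq_insert_request(): deadline = now + fifo_expire_{async,sync} */
> >         rq->fifo_time = ktime_get_ns() + cfqd->cfq_fifo_expire[rq_is_sync(rq)];
> >         list_add_tail(&rq->queuelist, &cfqq->fifo);
> >
> > Once ktime_get_ns() passes rq->fifo_time the request is past its deadline,
> > and that is exactly the condition, evaluated via cfq_check_fifo(), that this
> > patch uses to set must_dispatch.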
> >
> > Testing my workload with this patch applied produces much better results.
> > From the application side I see no timeouts, and the open() latency histogram
> > generated by systemtap looks much better, with the worst outlier at 131ms:
> >
> > Latency histogram of xfs_buf_lock acquisition (microseconds):
> >    value |-------------------------------------------------- count
> >        0 |                                                      11
> >        1 |@@@@                                                 161
> >        2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
> >        4 |@                                                     54
> >        8 |                                                      36
> >       16 |                                                       7
> >       32 |                                                       0
> >       64 |                                                       0
> >        ~
> >     1024 |                                                       0
> >     2048 |                                                       0
> >     4096 |                                                       1
> >     8192 |                                                       1
> >    16384 |                                                       2
> >    32768 |                                                       0
> >    65536 |                                                       0
> >   131072 |                                                       1
> >   262144 |                                                       0
> >   524288 |                                                       0
> >
> > Signed-off-by: Glauber Costa
> > CC: Jens Axboe
> > CC: linux-block@vger.kernel.org
> > CC: linux-kernel@vger.kernel.org
> > ---
> >  block/cfq-iosched.c | 13 ++++++++++---
> >  1 file changed, 10 insertions(+), 3 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index cc2f6db..5e24d88 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -3042,7 +3042,6 @@ static struct request *cfq_check_fifo(struct cfq_queue *cfqq)
> >          if (ktime_get_ns() < rq->fifo_time)
> >                  rq = NULL;
> >
> > -        cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
> >          return rq;
> >  }
> >
> > @@ -3420,6 +3419,9 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> >  {
> >          unsigned int max_dispatch;
> >
> > +        if (cfq_cfqq_must_dispatch(cfqq))
> > +                return true;
> > +
> >          /*
> >           * Drain async requests before we start sync IO
> >           */
> > @@ -3511,15 +3513,20 @@ static bool cfq_dispatch_request(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> >
> >          BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
> >
> > +        rq = cfq_check_fifo(cfqq);
> > +        if (rq)
> > +                cfq_mark_cfqq_must_dispatch(cfqq);
> > +
> >          if (!cfq_may_dispatch(cfqd, cfqq))
> >                  return false;
> >
> >          /*
> >           * follow expired path, else get first next available
> >           */
> > -        rq = cfq_check_fifo(cfqq);
> >          if (!rq)
> >                  rq = cfqq->next_rq;
> > +        else
> > +                cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
> >
> >          /*
> >           * insert request into driver dispatch list
> > @@ -3989,7 +3996,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
> >           * if the new request is sync, but the currently running queue is
> >           * not, let the sync request have priority.
> >           */
> > -        if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
> > +        if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
> >                  return true;
> >
> >          /*
> > --
> > 2.5.5
>
> --
> Paolo Valente
> Algogroup
> Dipartimento di Scienze Fisiche, Informatiche e Matematiche
> Via Campi 213/B
> 41125 Modena - Italy
> http://algogroup.unimore.it/people/paolo/