Subject: Re: [PATCH] cfq: fix starvation of asynchronous writes
To: Glauber Costa, linux-block@vger.kernel.org
References: <1474592399-2790-1-git-send-email-glauber@scylladb.com> <52daacdc-bb2f-3672-d2e8-30c5e376b9d4@kernel.dk>
Cc: linux-kernel@vger.kernel.org
From: Jens Axboe
Date: Fri, 23 Sep 2016 10:11:54 -0600
In-Reply-To: <52daacdc-bb2f-3672-d2e8-30c5e376b9d4@kernel.dk>

On 09/22/2016 08:10 PM, Jens Axboe wrote:
> On 09/22/2016 06:59 PM, Glauber Costa wrote:
>> While debugging timeouts happening in my application workload (ScyllaDB),
>> I have observed calls to open() taking a long time, ranging anywhere from
>> 2 seconds - the first ones that are enough to time out my application -
>> to more than 30 seconds.
>>
>> The problem seems to happen because XFS may block on pending metadata
>> updates under certain circumstances, and that is confirmed by the
>> following backtrace taken with the offcputime tool (iovisor/bcc):
>>
>> ffffffffb90c57b1 finish_task_switch
>> ffffffffb97dffb5 schedule
>> ffffffffb97e310c schedule_timeout
>> ffffffffb97e1f12 __down
>> ffffffffb90ea821 down
>> ffffffffc046a9dc xfs_buf_lock
>> ffffffffc046abfb _xfs_buf_find
>> ffffffffc046ae4a xfs_buf_get_map
>> ffffffffc046babd xfs_buf_read_map
>> ffffffffc0499931 xfs_trans_read_buf_map
>> ffffffffc044a561 xfs_da_read_buf
>> ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
>> ffffffffc0452b90 xfs_dir2_leaf_lookup_int
>> ffffffffc0452e0f xfs_dir2_leaf_lookup
>> ffffffffc044d9d3 xfs_dir_lookup
>> ffffffffc047d1d9 xfs_lookup
>> ffffffffc0479e53 xfs_vn_lookup
>> ffffffffb925347a path_openat
>> ffffffffb9254a71 do_filp_open
>> ffffffffb9242a94 do_sys_open
>> ffffffffb9242b9e sys_open
>> ffffffffb97e42b2 entry_SYSCALL_64_fastpath
>> 00007fb0698162ed [unknown]
>>
>> Inspecting my run with blktrace, I can see that the xfsaild kthread
>> exhibits very high "Dispatch wait" times, in the dozens-of-seconds range
>> and consistent with the open() times I saw in that run.
>>
>> Still from the blktrace output, we can, after searching a bit, identify
>> the request that wasn't dispatched:
>>
>>   8,0   11     152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
>>   8,0   11     153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
>>   8,0   11     154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
>>   8,0   11     206    81.092496118   804  I  WM 141698288 + 8 (22911) [xfsaild/sda1]
>>           <==== 'I' means Inserted (into the IO scheduler) ===================================>
>>   8,0    0  289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
>>           <==== Only 15s later the CFQ scheduler dispatches the request ======================>
>>
>> As we can see above, in this particular example CFQ took 15 seconds to
>> dispatch this request. Going back to the full trace, we can see that the
>> xfsaild queue had plenty of opportunity to run, and it was selected as
>> the active queue many times.
>> It would just always be preempted by something else; for example:
>>
>>   8,0    1    0    81.117912979     0  m   N cfq1618SN / insert_request
>>   8,0    1    0    81.117913419     0  m   N cfq1618SN / add_to_rr
>>   8,0    1    0    81.117914044     0  m   N cfq1618SN / preempt
>>   8,0    1    0    81.117914398     0  m   N cfq767A / slice expired t=1
>>   8,0    1    0    81.117914755     0  m   N cfq767A / resid=40
>>   8,0    1    0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
>>   8,0    1    0    81.117915858     0  m   N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0
>>
>> where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the
>> ScyllaDB IO dispatchers.
>>
>> The requests preempting the xfsaild queue are synchronous requests. That
>> is a characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT
>> requests. While it can be argued that preempting ASYNC requests in favor
>> of SYNC ones is part of the CFQ logic, I don't believe that doing so for
>> 15+ seconds is anyone's goal.
>>
>> Moreover, unless I am misunderstanding something, that breaks the
>> expectation set by the "fifo_expire_async" tunable, which in my system
>> is set to the default.
>>
>> Looking at the code, it seems to me that the issue is that after we make
>> an async queue active, there is no guarantee that it will execute any
>> request.
>>
>> When the queue itself tests cfq_may_dispatch(), it can bail if it sees
>> SYNC requests in flight. An incoming request from another queue can also
>> preempt it in such a situation before we have the chance to execute
>> anything (as seen in the trace above).
>>
>> This patch sets the must_dispatch flag if we notice that we have requests
>> that are already fifo_expired. This flag is always cleared after
>> cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't
>> pin the queue for subsequent requests (unless they are themselves
>> expired).
>>
>> Care is taken during preempt to still allow rt requests to preempt us
>> regardless.
>>
>> Testing my workload with this patch applied produces much better results.
>> From the application side I see no timeouts, and the open() latency
>> histogram generated by systemtap looks much better, with the worst
>> outlier at 131ms:
>
> Good analysis and the fix looks nice and clean. I'll take a closer look
> tomorrow. Thanks!

Looks like a good patch to me. I've queued it up, thanks Glauber.

-- 
Jens Axboe
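For readers browsing the archive without the diff at hand, the approach
described above amounts to roughly the sketch below against
block/cfq-iosched.c. This is an illustration of the idea rather than the
literal patch hunks: the helpers used (cfq_check_fifo(),
cfq_mark_cfqq_must_dispatch(), cfq_cfqq_must_dispatch(),
cfq_may_dispatch(), cfq_dispatch_insert()) are existing CFQ functions of
that era, but the surrounding code is abbreviated and the exact placement
is assumed, not quoted from the applied commit.

/*
 * Illustrative sketch only -- not the applied patch. A FIFO-expired
 * request marks the queue must_dispatch before the usual dispatch checks
 * run, so neither cfq_may_dispatch() nor a later sync preemption can keep
 * starving it past fifo_expire_async.
 */

static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
	/* An expired FIFO request overrides the normal throttling below. */
	if (cfq_cfqq_must_dispatch(cfqq))
		return true;

	/* ... elided: the existing checks, e.g. draining async requests
	 * before starting sync IO, dispatch depth limits, etc. ... */
	return true;
}

static bool cfq_dispatch_request(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
	/* Look at the FIFO before deciding whether dispatch is allowed. */
	struct request *rq = cfq_check_fifo(cfqq);

	if (rq)
		cfq_mark_cfqq_must_dispatch(cfqq);

	if (!cfq_may_dispatch(cfqd, cfqq))
		return false;

	/* Follow the expired path, else pick the next available request. */
	if (!rq)
		rq = cfqq->next_rq;

	cfq_dispatch_insert(cfqd->queue, rq);
	/* ... elided: accounting as before. The caller clears must_dispatch
	 * once this returns, so the flag never pins the queue for requests
	 * that are not themselves expired. */
	return true;
}

/*
 * In cfq_should_preempt(), the "sync request preempts a non-sync queue"
 * test would additionally require !cfq_cfqq_must_dispatch(cfqq), while RT
 * queues keep their separate, higher-priority preemption path -- matching
 * the "rt requests may still preempt us" note in the changelog.
 */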