Re: [PATCH] cfq: fix starvation of asynchronous writes

From: Jens Axboe <axboe@kernel.dk>
To: Glauber Costa <glauber@scylladb.com>, linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH] cfq: fix starvation of asynchronous writes
Date: Fri, 23 Sep 2016 10:11:54 -0600	[thread overview]
Message-ID: <c169ee89-966b-ccf3-a699-ded96ccbe922@kernel.dk> (raw)
In-Reply-To: <52daacdc-bb2f-3672-d2e8-30c5e376b9d4@kernel.dk>

On 09/22/2016 08:10 PM, Jens Axboe wrote:
> On 09/22/2016 06:59 PM, Glauber Costa wrote:
>> While debugging timeouts happening in my application workload
>> (ScyllaDB), I have
>> observed calls to open() taking a long time, ranging everywhere from 2
>> seconds -
>> the first ones that are enough to time out my application - to more
>> than 30
>> seconds.
>>
>> The problem seems to happen because XFS may block on pending metadata
>> updates
>> under certain circumnstances, and that's confirmed with the following
>> backtrace
>> taken by the offcputime tool (iovisor/bcc):
>>
>>     ffffffffb90c57b1 finish_task_switch
>>     ffffffffb97dffb5 schedule
>>     ffffffffb97e310c schedule_timeout
>>     ffffffffb97e1f12 __down
>>     ffffffffb90ea821 down
>>     ffffffffc046a9dc xfs_buf_lock
>>     ffffffffc046abfb _xfs_buf_find
>>     ffffffffc046ae4a xfs_buf_get_map
>>     ffffffffc046babd xfs_buf_read_map
>>     ffffffffc0499931 xfs_trans_read_buf_map
>>     ffffffffc044a561 xfs_da_read_buf
>>     ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
>>     ffffffffc0452b90 xfs_dir2_leaf_lookup_int
>>     ffffffffc0452e0f xfs_dir2_leaf_lookup
>>     ffffffffc044d9d3 xfs_dir_lookup
>>     ffffffffc047d1d9 xfs_lookup
>>     ffffffffc0479e53 xfs_vn_lookup
>>     ffffffffb925347a path_openat
>>     ffffffffb9254a71 do_filp_open
>>     ffffffffb9242a94 do_sys_open
>>     ffffffffb9242b9e sys_open
>>     ffffffffb97e42b2 entry_SYSCALL_64_fastpath
>>     00007fb0698162ed [unknown]
>>
>> Inspecting my run with blktrace, I can see that the xfsaild kthread
>> exhibit very
>> high "Dispatch wait" times, on the dozens of seconds range and
>> consistent with
>> the open() times I have saw in that run.
>>
>> Still from the blktrace output, we can after searching a bit, identify
>> the
>> request that wasn't dispatched:
>>
>>   8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <-
>> (8,1) 141696240
>>   8,0   11      153    81.092472889   804  Q  WM 141698288 + 8
>> [xfsaild/sda1]
>>   8,0   11      154    81.092473207   804  G  WM 141698288 + 8
>> [xfsaild/sda1]
>>   8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (
>> 22911) [xfsaild/sda1]
>>   <==== 'I' means Inserted (into the IO scheduler)
>> ===================================>
>>   8,0    0   289372    96.718761435     0  D  WM 141698288 + 8
>> (15626265317) [swapper/0]
>>   <==== Only 15s later the CFQ scheduler dispatches the request
>> ======================>
>>
>> As we can see above, in this particular example CFQ took 15 seconds to
>> dispatch
>> this request. Going back to the full trace, we can see that the
>> xfsaild queue
>> had plenty of opportunity to run, and it was selected as the active
>> queue many
>> times. It would just always be preempted by something else (example):
>>
>>   8,0    1        0    81.117912979     0  m   N cfq1618SN /
>> insert_request
>>   8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
>>   8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
>>   8,0    1        0    81.117914398     0  m   N cfq767A  / slice
>> expired t=1
>>   8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
>>   8,0    1        0    81.117915340     0  m   N / served:
>> vt=1948520448 min_vt=1948520448
>>   8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1
>> disp=0 charge=0 iops=1 sect=0
>>
>> where cfq767 is the xfsaild queue and cfq1618 corresponds to one of
>> the ScyllaDB
>> IO dispatchers.
>>
>> The requests preempting the xfsaild queue are synchronous requests.
>> That's a
>> characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT
>> requests.
>> While it can be argued that preempting ASYNC requests in favor of SYNC
>> is part
>> of the CFQ logic, I don't believe that doing so for 15+ seconds is
>> anyone's
>> goal.
>>
>> Moreover, unless I am misunderstanding something, that breaks the
>> expectation
>> set by the "fifo_expire_async" tunable, which in my system is set to the
>> default.
>>
>> Looking at the code, it seems to me that the issue is that after we make
>> an async queue active, there is no guarantee that it will execute any
>> request.
>>
>> When the queue itself tests if it cfq_may_dispatch() it can bail if it
>> sees SYNC
>> requests in flight. An incoming request from another queue can also
>> preempt it
>> in such situation before we have the chance to execute anything (as
>> seen in the
>> trace above).
>>
>> This patch sets the must_dispatch flag if we notice that we have requests
>> that are already fifo_expired. This flag is always cleared after
>> cfq_dispatch_request() returns from cfq_dispatch_requests(), so it
>> won't pin
>> the queue for subsequent requests (unless they are themselves expired)
>>
>> Care is taken during preempt to still allow rt requests to preempt us
>> regardless.
>>
>> Testing my workload with this patch applied produces much better results.
>> From the application side I see no timeouts, and the open() latency
>> histogram
>> generated by systemtap looks much better, with the worst outlier at
>> 131ms:
>
> Good analysis and the fix looks nice and clean. I'll take a closer look
> tomorrow. Thanks!

Looks like a good patch to me. I've queued it up, thanks Glauber.

-- 
Jens Axboe