All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: Glauber Costa <glauber@scylladb.com>, linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH] cfq: fix starvation of asynchronous writes
Date: Fri, 23 Sep 2016 10:11:54 -0600	[thread overview]
Message-ID: <c169ee89-966b-ccf3-a699-ded96ccbe922@kernel.dk> (raw)
In-Reply-To: <52daacdc-bb2f-3672-d2e8-30c5e376b9d4@kernel.dk>

On 09/22/2016 08:10 PM, Jens Axboe wrote:
> On 09/22/2016 06:59 PM, Glauber Costa wrote:
>> While debugging timeouts happening in my application workload
>> (ScyllaDB), I have
>> observed calls to open() taking a long time, ranging everywhere from 2
>> seconds -
>> the first ones that are enough to time out my application - to more
>> than 30
>> seconds.
>>
>> The problem seems to happen because XFS may block on pending metadata
>> updates
>> under certain circumnstances, and that's confirmed with the following
>> backtrace
>> taken by the offcputime tool (iovisor/bcc):
>>
>>     ffffffffb90c57b1 finish_task_switch
>>     ffffffffb97dffb5 schedule
>>     ffffffffb97e310c schedule_timeout
>>     ffffffffb97e1f12 __down
>>     ffffffffb90ea821 down
>>     ffffffffc046a9dc xfs_buf_lock
>>     ffffffffc046abfb _xfs_buf_find
>>     ffffffffc046ae4a xfs_buf_get_map
>>     ffffffffc046babd xfs_buf_read_map
>>     ffffffffc0499931 xfs_trans_read_buf_map
>>     ffffffffc044a561 xfs_da_read_buf
>>     ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
>>     ffffffffc0452b90 xfs_dir2_leaf_lookup_int
>>     ffffffffc0452e0f xfs_dir2_leaf_lookup
>>     ffffffffc044d9d3 xfs_dir_lookup
>>     ffffffffc047d1d9 xfs_lookup
>>     ffffffffc0479e53 xfs_vn_lookup
>>     ffffffffb925347a path_openat
>>     ffffffffb9254a71 do_filp_open
>>     ffffffffb9242a94 do_sys_open
>>     ffffffffb9242b9e sys_open
>>     ffffffffb97e42b2 entry_SYSCALL_64_fastpath
>>     00007fb0698162ed [unknown]
>>
>> Inspecting my run with blktrace, I can see that the xfsaild kthread
>> exhibit very
>> high "Dispatch wait" times, on the dozens of seconds range and
>> consistent with
>> the open() times I have saw in that run.
>>
>> Still from the blktrace output, we can after searching a bit, identify
>> the
>> request that wasn't dispatched:
>>
>>   8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <-
>> (8,1) 141696240
>>   8,0   11      153    81.092472889   804  Q  WM 141698288 + 8
>> [xfsaild/sda1]
>>   8,0   11      154    81.092473207   804  G  WM 141698288 + 8
>> [xfsaild/sda1]
>>   8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (
>> 22911) [xfsaild/sda1]
>>   <==== 'I' means Inserted (into the IO scheduler)
>> ===================================>
>>   8,0    0   289372    96.718761435     0  D  WM 141698288 + 8
>> (15626265317) [swapper/0]
>>   <==== Only 15s later the CFQ scheduler dispatches the request
>> ======================>
>>
>> As we can see above, in this particular example CFQ took 15 seconds to
>> dispatch
>> this request. Going back to the full trace, we can see that the
>> xfsaild queue
>> had plenty of opportunity to run, and it was selected as the active
>> queue many
>> times. It would just always be preempted by something else (example):
>>
>>   8,0    1        0    81.117912979     0  m   N cfq1618SN /
>> insert_request
>>   8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
>>   8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
>>   8,0    1        0    81.117914398     0  m   N cfq767A  / slice
>> expired t=1
>>   8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
>>   8,0    1        0    81.117915340     0  m   N / served:
>> vt=1948520448 min_vt=1948520448
>>   8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1
>> disp=0 charge=0 iops=1 sect=0
>>
>> where cfq767 is the xfsaild queue and cfq1618 corresponds to one of
>> the ScyllaDB
>> IO dispatchers.
>>
>> The requests preempting the xfsaild queue are synchronous requests.
>> That's a
>> characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT
>> requests.
>> While it can be argued that preempting ASYNC requests in favor of SYNC
>> is part
>> of the CFQ logic, I don't believe that doing so for 15+ seconds is
>> anyone's
>> goal.
>>
>> Moreover, unless I am misunderstanding something, that breaks the
>> expectation
>> set by the "fifo_expire_async" tunable, which in my system is set to the
>> default.
>>
>> Looking at the code, it seems to me that the issue is that after we make
>> an async queue active, there is no guarantee that it will execute any
>> request.
>>
>> When the queue itself tests if it cfq_may_dispatch() it can bail if it
>> sees SYNC
>> requests in flight. An incoming request from another queue can also
>> preempt it
>> in such situation before we have the chance to execute anything (as
>> seen in the
>> trace above).
>>
>> This patch sets the must_dispatch flag if we notice that we have requests
>> that are already fifo_expired. This flag is always cleared after
>> cfq_dispatch_request() returns from cfq_dispatch_requests(), so it
>> won't pin
>> the queue for subsequent requests (unless they are themselves expired)
>>
>> Care is taken during preempt to still allow rt requests to preempt us
>> regardless.
>>
>> Testing my workload with this patch applied produces much better results.
>> From the application side I see no timeouts, and the open() latency
>> histogram
>> generated by systemtap looks much better, with the worst outlier at
>> 131ms:
>
> Good analysis and the fix looks nice and clean. I'll take a closer look
> tomorrow. Thanks!

Looks like a good patch to me. I've queued it up, thanks Glauber.

-- 
Jens Axboe

  reply	other threads:[~2016-09-23 16:11 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-23  0:59 [PATCH] cfq: fix starvation of asynchronous writes Glauber Costa
2016-09-23  2:10 ` Jens Axboe
2016-09-23 16:11   ` Jens Axboe [this message]
2016-09-23 10:22 ` Paolo Valente
2016-09-23 10:22   ` Paolo Valente
2016-09-23 11:28   ` Glauber Costa
2016-09-23 12:19     ` Glauber Costa
2016-10-20 18:08 ` Fwd: " Glauber Costa
2016-10-21  7:36   ` Greg KH
2016-10-21  8:53     ` Holger Hoffstätte
2016-10-21  9:07       ` Greg KH
2016-10-21 13:15     ` Glauber Costa

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c169ee89-966b-ccf3-a699-ded96ccbe922@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=glauber@scylladb.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.