* Flush requests not going through IO scheduler
@ 2015-11-02 12:20 Jan Kara
  2015-11-02 16:58 ` Jens Axboe
  2015-11-02 17:52 ` Jeff Moyer
  0 siblings, 2 replies; 9+ messages in thread
From: Jan Kara @ 2015-11-02 12:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: LKML, Jeff Moyer

Hello,

when looking into a performance issue, I've noticed one interesting thing
in blktrace data:

  8,0    2        0     1.745149746     0  m   N cfq320SN / dispatch_insert
  8,0    2        0     1.745150258     0  m   N cfq320SN / dispatched a request
  8,0    2        0     1.745150524     0  m   N cfq320SN / activate rq, drv=10
  8,0    2     2893     1.745150644 30477  D  WS 495331192 + 192 [git]
  8,0    1     3678     1.746851310     0  C  WS 495331192 + 192 [0]

We wrote the data for transaction commit here.

  8,0    1        0     1.746863220     0  m   N cfq320SN / complete rqnoidle 1
  8,0    1        0     1.746863801     0  m   N cfq320SN / set_slice=27
  8,0    1        0     1.746864439     0  m   N cfq320SN / arm_idle: 8 group_idle: 0

Currently there is no IO queued from the jbd2 thread, so we idle...

  8,0    1     3679     1.746878424   320  A FWFS 495331384 + 8 <- (8,2) 478543928
  8,0    1     3680     1.746879028   320  Q FWFS 495331384 + 8 [jbd2/sda2-8]
  8,0    1     3681     1.746879673   320  G FWFS 495331384 + 8 [jbd2/sda2-8]
  8,0    1     3682     1.746880227   320  I FWFS 495331384 + 8 [jbd2/sda2-8]

Jbd2 thread now queues the commit block.

  8,0    1        0     1.754263523     0  m   N cfq idle timer fired
  8,0    1        0     1.754264733     0  m   N cfq320SN / slice expired t=0

But it was not dispatched and we just idled until the timer fired. Then we
started dispatching for another queue and got to dispatching the commit
block only much later.

I've looked into the block layer code and the reason for this behavior
(idling when there is in fact IO to dispatch) is the special handling of
flush requests. When a flush request is submitted, we insert it with
ELEVATOR_INSERT_FLUSH and blk_insert_flush() then handles it. That
eventually just ends up doing something along the lines of:

	list_add_tail(&rq->queuelist, &q->queue_head);

So we add the request to the list of requests to dispatch, but we don't
notify the IO scheduler in any way. Thus the IO scheduler won't properly
track the request, won't properly account IO time for it (if I'm right),
etc.

Ideally we should call q->elevator->type->ops.elevator_add_req_fn() to
handle the request, but I'm not sure that wouldn't break some assumptions
of the flush code. But at a minimum, shouldn't we try to dispatch the
request?
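
For concreteness, a minimal sketch contrasting the two insertion paths
(identifiers are from this era's legacy, non-blk-mq request path, i.e.
<linux/blkdev.h> and <linux/elevator.h>; an illustration of the problem,
not a buildable patch):

	/* Normal path: the elevator is notified, so CFQ can track the
	 * request, account time for it, and break out of idling. */
	static void insert_via_elevator(struct request_queue *q,
					struct request *rq)
	{
		q->elevator->type->ops.elevator_add_req_fn(q, rq);
	}

	/* Flush path: blk_insert_flush() ends up doing roughly this --
	 * straight onto the dispatch list, with no elevator notification,
	 * so CFQ never learns that the request exists. */
	static void insert_flush_bypass(struct request_queue *q,
					struct request *rq)
	{
		list_add_tail(&rq->queuelist, &q->queue_head);
	}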

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Flush requests not going through IO scheduler
  2015-11-02 12:20 Flush requests not going through IO scheduler Jan Kara
@ 2015-11-02 16:58 ` Jens Axboe
  2015-11-03 16:41   ` Jan Kara
  2015-11-02 17:52 ` Jeff Moyer
  1 sibling, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2015-11-02 16:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, Jeff Moyer

On 11/02/2015 05:20 AM, Jan Kara wrote:
> Hello,
>
> when looking into a performance issue, I've noticed one interesting thing
> in blktrace data:
>
>    8,0    2        0     1.745149746     0  m   N cfq320SN / dispatch_insert
>    8,0    2        0     1.745150258     0  m   N cfq320SN / dispatched a request
>    8,0    2        0     1.745150524     0  m   N cfq320SN / activate rq, drv=10
>    8,0    2     2893     1.745150644 30477  D  WS 495331192 + 192 [git]
>    8,0    1     3678     1.746851310     0  C  WS 495331192 + 192 [0]
>
> We wrote the data for transaction commit here.
>
>    8,0    1        0     1.746863220     0  m   N cfq320SN / complete rqnoidle 1
>    8,0    1        0     1.746863801     0  m   N cfq320SN / set_slice=27
>    8,0    1        0     1.746864439     0  m   N cfq320SN / arm_idle: 8 group_idle: 0
>
> Currently there is no IO queued from the jbd2 thread, so we idle...
>
>    8,0    1     3679     1.746878424   320  A FWFS 495331384 + 8 <- (8,2) 478543928
>    8,0    1     3680     1.746879028   320  Q FWFS 495331384 + 8 [jbd2/sda2-8]
>    8,0    1     3681     1.746879673   320  G FWFS 495331384 + 8 [jbd2/sda2-8]
>    8,0    1     3682     1.746880227   320  I FWFS 495331384 + 8 [jbd2/sda2-8]
>
> Jbd2 thread now queues the commit block.
>
>    8,0    1        0     1.754263523     0  m   N cfq idle timer fired
>    8,0    1        0     1.754264733     0  m   N cfq320SN / slice expired t=0
>
> But it was not dispatched and we just idled until the timer fired. Then we
> started dispatching for another queue and got to dispatching the commit
> block only much later.
>
> I've looked into the block layer code and the reason for this behavior
> (idling when there is in fact IO to dispatch) is the special handling of
> flush requests. When a flush request is submitted, we insert it with
> ELEVATOR_INSERT_FLUSH and blk_insert_flush() then handles it. That
> eventually just ends up doing something along the lines of:
>
> 	list_add_tail(&rq->queuelist, &q->queue_head);
>
> So we add the request to the list of requests to dispatch, but we don't
> notify the IO scheduler in any way. Thus the IO scheduler won't properly
> track the request, won't properly account IO time for it (if I'm right),
> etc.
>
> Ideally we should call q->elevator->type->ops.elevator_add_req_fn() to
> handle the request, but I'm not sure that wouldn't break some assumptions
> of the flush code. But at a minimum, shouldn't we try to dispatch the
> request?

Certainly, the current behavior is broken. The least intrusive fix would
be to kick off scheduling when we add it to the queue, but the elevator
really should handle it. Are you going to be up for hacking up a fix?

-- 
Jens Axboe



* Re: Flush requests not going through IO scheduler
  2015-11-02 12:20 Flush requests not going through IO scheduler Jan Kara
  2015-11-02 16:58 ` Jens Axboe
@ 2015-11-02 17:52 ` Jeff Moyer
  2015-11-03 14:55   ` Jan Kara
  1 sibling, 1 reply; 9+ messages in thread
From: Jeff Moyer @ 2015-11-02 17:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jens Axboe, LKML

Jan Kara <jack@suse.cz> writes:

> Hello,
>
> when looking into a performance issue, I've noticed one interesting thing
> in blktrace data:

Hi Jan,

Do you have a reproducer I can run?

Thanks,
Jeff


* Re: Flush requests not going through IO scheduler
  2015-11-02 17:52 ` Jeff Moyer
@ 2015-11-03 14:55   ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2015-11-03 14:55 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Jan Kara, Jens Axboe, LKML

On Mon 02-11-15 12:52:26, Jeff Moyer wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > Hello,
> >
> > when looking into a performance issue, I've noticed one interesting thing
> > in blktrace data:
> 
> Hi Jan,
> 
> Do you have a reproducer I can run?

So I don't have a reproducer that would trigger a situation where we are
idling while there is a request queued. However, the following is easy to
trigger:

  8,0   67       43     0.000064884  1958  D  WS 2499459168 + 104 [jbd2/sda5-8]
  8,0   67       44     0.000222208     0  C  WS 2499459168 + 104 [0]
  8,0   67        0     0.000232553     0  m   N cfq1958SN / complete rqnoidle 1
  8,0   67        0     0.000234002     0  m   N cfq1958SN / set_slice=30
  8,0   67        0     0.000234984     0  m   N cfq1958SN / Not idling.  st->count:1
  8,0   67        0     0.000236353     0  m   N cfq1958SN / arm_idle: 2 group_idle: 1
  8,0   67        0     0.000236832     0  m   N cfq schedule dispatch
  8,0   67       45     0.000259492  1958  A FWFS 2499459272 + 8 <- (8,5) 621525192
  8,0   67       46     0.000259994  1958  Q FWFS 2499459272 + 8 [jbd2/sda5-8]
  8,0   67       47     0.000262466  1958  G FWFS 2499459272 + 8 [jbd2/sda5-8]
  8,0   67       48     0.000263164  1958  I FWFS 2499459272 + 8 [jbd2/sda5-8]
  8,0   67       49     0.000389821     0  D  WS 2499459272 + 8 [swapper/67]
  8,0   67       50     0.000439548     0  C  WS 2499459272 + 8 [0]
  8,0   67       51     0.000556711     0  C  WS 2499459272 [0]
  8,0   67        0     0.004387444     0  m   N cfq idle timer fired

Here we dispatch the flush request quickly, but only because we queued a
kick of the queue in the IO completion of the previous request. If jbd2
were 120 us slower in submitting the flush request, it would miss the kick
and the request would end up waiting until the idle timer fires.
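
The "cfq schedule dispatch" message in the trace above comes from a helper
in cfq-iosched.c that looks roughly like this (reconstructed from memory
of this era's code, so take the details as approximate): the completion
path schedules an asynchronous queue kick via kblockd, and a flush
submitted after that work has already run misses the kick:

	static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
	{
		if (cfqd->busy_queues) {
			cfq_log(cfqd, "schedule dispatch");
			kblockd_schedule_work(&cfqd->unplug_work);
		}
	}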

To trigger this you just need rotational storage with a write-back cache,
slice_idle & group_idle enabled, and ext4 mounted on top of it. Any
transaction commit will end up looking like this.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Flush requests not going through IO scheduler
  2015-11-02 16:58 ` Jens Axboe
@ 2015-11-03 16:41   ` Jan Kara
  2015-11-03 16:49     ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2015-11-03 16:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jan Kara, LKML, Jeff Moyer

On Mon 02-11-15 09:58:01, Jens Axboe wrote:
> On 11/02/2015 05:20 AM, Jan Kara wrote:
> >Hello,
> >
> >when looking into a performance issue, I've noticed one interesting thing
> >in blktrace data:
> >
> >   8,0    2        0     1.745149746     0  m   N cfq320SN / dispatch_insert
> >   8,0    2        0     1.745150258     0  m   N cfq320SN / dispatched a request
> >   8,0    2        0     1.745150524     0  m   N cfq320SN / activate rq, drv=10
> >   8,0    2     2893     1.745150644 30477  D  WS 495331192 + 192 [git]
> >   8,0    1     3678     1.746851310     0  C  WS 495331192 + 192 [0]
> >
> >We wrote the data for transaction commit here.
> >
> >   8,0    1        0     1.746863220     0  m   N cfq320SN / complete rqnoidle 1
> >   8,0    1        0     1.746863801     0  m   N cfq320SN / set_slice=27
> >   8,0    1        0     1.746864439     0  m   N cfq320SN / arm_idle: 8 group_idle: 0
> >
> >Currently there is no IO queued from the jbd2 thread, so we idle...
> >
> >   8,0    1     3679     1.746878424   320  A FWFS 495331384 + 8 <- (8,2) 478543928
> >   8,0    1     3680     1.746879028   320  Q FWFS 495331384 + 8 [jbd2/sda2-8]
> >   8,0    1     3681     1.746879673   320  G FWFS 495331384 + 8 [jbd2/sda2-8]
> >   8,0    1     3682     1.746880227   320  I FWFS 495331384 + 8 [jbd2/sda2-8]
> >
> >Jbd2 thread now queues the commit block.
> >
> >   8,0    1        0     1.754263523     0  m   N cfq idle timer fired
> >   8,0    1        0     1.754264733     0  m   N cfq320SN / slice expired t=0
> >
> >But it was not dispatched and we just idled until the timer fired. Then we
> >started dispatching for another queue and got to dispatching the commit
> >block only much later.
> >
> >I've looked into the block layer code and the reason for this behavior
> >(idling when there is in fact IO to dispatch) is the special handling of
> >flush requests. When a flush request is submitted, we insert it with
> >ELEVATOR_INSERT_FLUSH and blk_insert_flush() then handles it. That
> >eventually just ends up doing something along the lines of:
> >
> >	list_add_tail(&rq->queuelist, &q->queue_head);
> >
> >So we add the request to the list of requests to dispatch, but we don't
> >notify the IO scheduler in any way. Thus the IO scheduler won't properly
> >track the request, won't properly account IO time for it (if I'm right),
> >etc.
> >
> >Ideally we should call q->elevator->type->ops.elevator_add_req_fn() to
> >handle the request, but I'm not sure that wouldn't break some assumptions
> >of the flush code. But at a minimum, shouldn't we try to dispatch the
> >request?
> 
> Certainly, the current behavior is broken. The least intrusive fix would
> be to kick off scheduling when we add it to the queue, but the elevator
> really should handle it. Are you going to be up for hacking up a fix?

I have some trouble understanding what you mean exactly. Do you think we
should just call __blk_run_queue() after we add the request to
q->queue_head?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Flush requests not going through IO scheduler
  2015-11-03 16:41   ` Jan Kara
@ 2015-11-03 16:49     ` Jens Axboe
  2015-11-03 17:18       ` Jeff Moyer
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2015-11-03 16:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, Jeff Moyer

On 11/03/2015 09:41 AM, Jan Kara wrote:
> On Mon 02-11-15 09:58:01, Jens Axboe wrote:
>> On 11/02/2015 05:20 AM, Jan Kara wrote:
>>> Hello,
>>>
>>> when looking into a performance issue, I've noticed one interesting thing
>>> in blktrace data:
>>>
>>>    8,0    2        0     1.745149746     0  m   N cfq320SN / dispatch_insert
>>>    8,0    2        0     1.745150258     0  m   N cfq320SN / dispatched a request
>>>    8,0    2        0     1.745150524     0  m   N cfq320SN / activate rq, drv=10
>>>    8,0    2     2893     1.745150644 30477  D  WS 495331192 + 192 [git]
>>>    8,0    1     3678     1.746851310     0  C  WS 495331192 + 192 [0]
>>>
>>> We wrote the data for transaction commit here.
>>>
>>>    8,0    1        0     1.746863220     0  m   N cfq320SN / complete rqnoidle 1
>>>    8,0    1        0     1.746863801     0  m   N cfq320SN / set_slice=27
>>>    8,0    1        0     1.746864439     0  m   N cfq320SN / arm_idle: 8 group_idle: 0
>>>
>>> Currently there is no IO queued from the jbd2 thread, so we idle...
>>>
>>>    8,0    1     3679     1.746878424   320  A FWFS 495331384 + 8 <- (8,2) 478543928
>>>    8,0    1     3680     1.746879028   320  Q FWFS 495331384 + 8 [jbd2/sda2-8]
>>>    8,0    1     3681     1.746879673   320  G FWFS 495331384 + 8 [jbd2/sda2-8]
>>>    8,0    1     3682     1.746880227   320  I FWFS 495331384 + 8 [jbd2/sda2-8]
>>>
>>> Jbd2 thread now queues the commit block.
>>>
>>>    8,0    1        0     1.754263523     0  m   N cfq idle timer fired
>>>    8,0    1        0     1.754264733     0  m   N cfq320SN / slice expired t=0
>>>
>>> But it was not dispatched and we just idled until the timer fired. Then we
>>> started dispatching for another queue and got to dispatching the commit
>>> block only much later.
>>>
>>> I've looked into the block layer code and the reason for this behavior
>>> (idling when there is in fact IO to dispatch) is the special handling of
>>> flush requests. When a flush request is submitted, we insert it with
>>> ELEVATOR_INSERT_FLUSH and blk_insert_flush() then handles it. That
>>> eventually just ends up doing something along the lines of:
>>>
>>> 	list_add_tail(&rq->queuelist, &q->queue_head);
>>>
>>> So we add the request to the list of requests to dispatch, but we don't
>>> notify the IO scheduler in any way. Thus the IO scheduler won't properly
>>> track the request, won't properly account IO time for it (if I'm right),
>>> etc.
>>>
>>> Ideally we should call q->elevator->type->ops.elevator_add_req_fn() to
>>> handle the request, but I'm not sure that wouldn't break some assumptions
>>> of the flush code. But at a minimum, shouldn't we try to dispatch the
>>> request?
>>
>> Certainly, the current behavior is broken. The least intrusive fix would
>> be to kick off scheduling when we add it to the queue, but the elevator
>> really should handle it. Are you going to be up for hacking up a fix?
>
> I have some trouble understanding what you mean exactly. Do you think we
> should just call __blk_run_queue() after we add the request to
> q->queue_head?

No, that won't be enough, as it won't always break out of the idle 
logic. We need to ensure that the new request is noticed, so that CFQ 
knows and can decide to kick off things.
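
Conceptually, something like the following, though this is only a sketch
of the idea - whether the flush state machine tolerates routing its
requests through the elevator is exactly the open question:

	/* Hypothetical helper: let the elevator see the flush request
	 * before kicking the queue, so CFQ's insert path can notice it
	 * and break out of idling. Assumes the queue lock is held. */
	static void flush_insert_and_notify(struct request_queue *q,
					    struct request *rq)
	{
		q->elevator->type->ops.elevator_add_req_fn(q, rq);
		__blk_run_queue(q);
	}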

-- 
Jens Axboe



* Re: Flush requests not going through IO scheduler
  2015-11-03 16:49     ` Jens Axboe
@ 2015-11-03 17:18       ` Jeff Moyer
  2015-11-03 17:24         ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff Moyer @ 2015-11-03 17:18 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jan Kara, LKML

Jens Axboe <axboe@kernel.dk> writes:

>>> Certainly, the current behavior is broken. The least intrusive fix would
>>> be to kick off scheduling when we add it to the queue, but the elevator
>>> really should handle it. Are you going to be up for hacking up a fix?
>>
>> I have some trouble understanding what you mean exactly. Do you think we
>> should just call __blk_run_queue() after we add the request to
>> q->queue_head?
>
> No, that won't be enough, as it won't always break out of the idle
> logic. We need to ensure that the new request is noticed, so that CFQ
> knows and can decide to kick off things.

Hmm?  __blk_run_queue calls the request_fn, which will call
blk_peek_request, which calls __elv_next_request, which will find the
request on queue_head.  Right?

        while (1) {
                if (!list_empty(&q->queue_head)) {
                        rq = list_entry_rq(q->queue_head.next);
                        return rq;

Cheers,
Jeff


* Re: Flush requests not going through IO scheduler
  2015-11-03 17:18       ` Jeff Moyer
@ 2015-11-03 17:24         ` Jens Axboe
  2015-11-12 13:40           ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2015-11-03 17:24 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Jan Kara, LKML

On 11/03/2015 10:18 AM, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
>>>> Certainly, the current behavior is broken. The least intrusive fix would
>>>> be to kick off scheduling when we add it to the queue, but the elevator
>>>> really should handle it. Are you going to be up for hacking up a fix?
>>>
>>> I have some trouble understanding what you mean exactly. Do you think we
>>> should just call __blk_run_queue() after we add the request to
>>> q->queue_head?
>>
>> No, that won't be enough, as it won't always break out of the idle
>> logic. We need to ensure that the new request is noticed, so that CFQ
>> knows and can decide to kick off things.
>
> Hmm?  __blk_run_queue calls the request_fn, which will call
> blk_peek_request, which calls __elv_next_request, which will find the
> request on queue_head.  Right?
>
>          while (1) {
>                  if (!list_empty(&q->queue_head)) {
>                          rq = list_entry_rq(q->queue_head.next);
>                          return rq;

I guess that will bypass the scheduler. Ugh, but that's pretty ugly,
since cfq is still effectively idling. These flush requests really
should go to an internal scheduler list for dispatch.

But as a quick fix, it might be enough to just kick off the queue with
blk_run_queue().
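
Roughly like this, at the spot quoted earlier from blk_insert_flush()
(sketch only; it uses __blk_run_queue() on the assumption that the queue
lock is already held at that point, as it is on the legacy submission
path):

	list_add_tail(&rq->queuelist, &q->queue_head);
	/* Quick-fix idea: kick the queue so the flush request is
	 * dispatched promptly instead of waiting for the idle timer. */
	__blk_run_queue(q);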

-- 
Jens Axboe



* Re: Flush requests not going through IO scheduler
  2015-11-03 17:24         ` Jens Axboe
@ 2015-11-12 13:40           ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2015-11-12 13:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jeff Moyer, Jan Kara, LKML

On Tue 03-11-15 10:24:12, Jens Axboe wrote:
> On 11/03/2015 10:18 AM, Jeff Moyer wrote:
> >Jens Axboe <axboe@kernel.dk> writes:
> >
> >>>>Certainly, the current behavior is broken. The least intrusive fix
> >>>>would be to kick off scheduling when we add it to the queue, but the
> >>>>elevator really should handle it. Are you going to be up for hacking
> >>>>up a fix?
> >>>
> >>>I have some trouble understanding what you mean exactly. Do you think
> >>>we should just call __blk_run_queue() after we add the request to
> >>>q->queue_head?
> >>
> >>No, that won't be enough, as it won't always break out of the idle
> >>logic. We need to ensure that the new request is noticed, so that CFQ
> >>knows and can decide to kick off things.
> >
> >Hmm?  __blk_run_queue calls the request_fn, which will call
> >blk_peek_request, which calls __elv_next_request, which will find the
> >request on queue_head.  Right?
> >
> >         while (1) {
> >                 if (!list_empty(&q->queue_head)) {
> >                         rq = list_entry_rq(q->queue_head.next);
> >                         return rq;
> 
> I guess that will bypass the scheduler. Ugh, but that's pretty ugly,
> since cfq is still effectively idling. These flush requests really
> should go to an internal scheduler list for dispatch.
> 
> But as a quick fix, it might be enough to just kick off the queue
> with blk_run_queue().

So I was looking more into this and in the end tracked it down to being
mostly a blktrace issue. The first thing is: blk_queue_bio() will actually
kick the queue after the flush request is queued, but at that moment only
the request for the initial flush is queued, and that is invisible to
blktrace, so the disk appears idle although it is not. After this request
completes, we queue & dispatch the request with data, which is visible in
blktrace. So in this case requests are dispatched as they should be. The
only question I cannot really answer is why the initial flush is not
visible in the block trace - at least the trace_block_rq_issue()
tracepoint and the corresponding completion should trigger and be
visible... Does anyone have an idea?

Also, blk_insert_flush() can add a request directly to q->queue_head when
no flushing is required. I've sent a patch to make that go through the IO
scheduler, but it is mostly a non-issue, as generic_make_request_checks()
usually removes the FLUSH and FUA flags when they are not needed.
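
For reference, the check in question looks something like this in
generic_make_request_checks() (quoted from memory of this era's
block/blk-core.c, so the details may differ slightly): if the device
advertises no cache-flush capability (empty q->flush_flags), the FLUSH
and FUA bits are simply dropped, and an empty flush completes right away:

	if ((bio->bi_rw & (REQ_FLUSH | REQ_FUA)) && !q->flush_flags) {
		bio->bi_rw &= ~(REQ_FLUSH | REQ_FUA);
		if (!nr_sectors) {	/* nr_sectors == bio_sectors(bio) */
			err = 0;	/* empty flush: nothing to do */
			goto end_io;
		}
	}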

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

