From: Jeff Moyer <jmoyer@redhat.com>
To: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>,
	Dave Chinner <david@fromorbit.com>, Jan Kara <jack@suse.cz>,
	linux-ext4@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>, Jiri Slaby <jslaby@suse.cz>
Subject: Re: Excessive stall times on ext4 in 3.9-rc2
Date: Mon, 22 Apr 2013 18:42:23 -0400	[thread overview]
Message-ID: <x49a9oqmblc.fsf@segfault.boston.devel.redhat.com> (raw)
In-Reply-To: <20130422143846.GA2675@suse.de> (Mel Gorman's message of "Mon, 22 Apr 2013 15:38:46 +0100")

Mel Gorman <mgorman@suse.de> writes:

> (Adding Jeff Moyer to the cc as I'm told he is interested in the blktrace)

Thanks.  I've got a few comments and corrections for you below.

> TLDR: Flusher writes pages very quickly after processes dirty a buffer. Reads
> starve flusher writes.
[snip]

> 3. The blktrace indicates that reads can starve writes from flusher
>
>    While there are people that can look at a blktrace and find problems
>    like they are rain man, I'm more like an ADHD squirrel when looking at
>    a trace.  I wrote a script to look for what unrelated requests completed
>    while a request got stalled for over a second. It seemed like something
>    that a tool should already exist for, but I didn't find one unless btt
>    can give the information somehow.

Care to share that script?
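
(For anyone following along, I'd guess the approach looks roughly like the
sketch below.  To be clear, this is my guess, not Mel's script: it assumes
blkparse's default text output, with the action in the sixth
whitespace-separated field and the RWBS flags in the seventh, and it keys
requests by sector number, which is crude but good enough for eyeballing.)

  #!/usr/bin/env python3
  # Rough sketch: flag requests that took more than a second from queue
  # (Q) to completion (C) and tally the RWBS flags of everything else
  # that completed in that window.  Feed it 'blkparse -i trace' output.
  import sys
  from collections import defaultdict

  THRESHOLD = 1.0                  # seconds
  queued = {}                      # sector -> (queue time, rwbs)
  completions = []                 # (time, sector, rwbs), in trace order

  for line in sys.stdin:
      f = line.split()
      if len(f) < 8 or f[5] not in ('Q', 'C'):
          continue
      ts, rwbs, sector = float(f[3]), f[6], f[7]
      if f[5] == 'Q':
          queued[sector] = (ts, rwbs)
      else:
          completions.append((ts, sector, rwbs))

  for ctime, sector, rwbs in completions:
      if sector not in queued:
          continue
      qtime, qrwbs = queued[sector]
      if ctime - qtime < THRESHOLD:
          continue
      others = defaultdict(int)    # what completed while we waited
      for t, s, r in completions:
          if qtime <= t <= ctime and s != sector:
              others[r] += 1
      print("stalled %s at sector %s for %.3fs" % (qrwbs, sector, ctime - qtime))
      for r in sorted(others, key=others.get):
          print("  %6d %s" % (others[r], r))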

[snip]

> I recognise that the output will have a WTF reaction but the key
> observations to me are
>
> a) a single write request from flusher took over a second to complete
> b) at the time it was queued, it was mostly other writes that were in
>    the queue at the same time
> c) The write request and the parallel writes were all asynchronous write
>    requests
> d) at the time the request completed, there were a LARGE number of
>    other requests queued and completed at the same time.
>
> Of the requests queued and completed in the meantime the breakdown was
>
>      22 RM
>      31 RA
>      82 W
>     445 R
>
> If I'm reading this correctly, it is saying that 22 reads were merged (RM),
> 31 reads were remapped to another device (RA) which is probably reads from
> the dm-crypt partition, 82 were writes (W) which is not far off the number
> of writes that were in the queue and 445 were other reads. The delay was
> dominated by reads that were queued after the write request and completed
> before it.

RM == Read Meta
RA == Read Ahead  (remapping, by the way, does not happen across
                   devices, just into partitions)
W and R you understood correctly.
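
Incidentally, a breakdown like that can be tallied straight from the
blkparse text; a quick sketch (same assumption as above about the default
output format, counting completion events only):

  import sys
  from collections import Counter

  counts = Counter()
  for line in sys.stdin:
      f = line.split()
      if len(f) >= 7 and f[5] == 'C':     # completion events only
          counts[f[6]] += 1               # RWBS: R, RA, RM, W, WS, ...

  for rwbs, n in sorted(counts.items(), key=lambda kv: kv[1]):
      print("%7d %s" % (n, rwbs))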

> That's saying that the 27128th request in the trace took over 7 seconds
> to complete and was an asynchronous write from flusher. The contents of
> the queue are displayed at that time and the breakdown of requests is
>
>      23 WS  [JEM: write sync]
>      86 RM  [JEM: Read Meta]
>     124 RA  [JEM: Read Ahead]
>     442 W
>    1931 R
>
> 7 seconds later when it was completed the breakdown of completed
> requests was
>
>      25 WS
>     114 RM
>     155 RA
>     408 W
>    2457 R
>
> In combination, that confirms for me that asynchronous writes from flusher
> are being starved by reads. When a process requires a buffer that is locked
> by that asynchronous write from flusher, it stalls.
>
>> The thing is, we do want to make ext4 work well with cfq, and
>> prioritizing non-readahead read requests ahead of data writeback does
>> make sense.  The issue is that metadata writes going through
>> the block device could in some cases effectively cause a priority
>> inversion when what had previously been an asynchronous writeback
>> starts blocking a foreground, user-visible process.
>> 
>> At least, that's the theory;
>
> I *think* the data more or less confirms the theory but it'd be nice if
> someone else double checked in case I'm seeing what I want to see
> instead of what is actually there.

Looks sane.  You can also see a lot of "preempt"s in the blkparse
output, which indicates exactly what you're saying.  Any sync request
gets priority over the async requests.

I'll also note that even though your I/O is going all over the place
(D2C is pretty bad, 14ms), most of the time is spent waiting for a
struct request allocation or between Queue and Merge:

==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000001   0.000992259   8.898375882     2300861
Q2G               0.000000843  10.193261239 2064.079501935     1016463 <====
G2I               0.000000461   0.000044702   3.237065090     1015803
Q2M               0.000000101   8.203147238 2064.079367557     1311662
I2D               0.000002012   1.476824812 2064.089774419     1014890
M2D               0.000003283   6.994306138 283.573348664     1284872
D2C               0.000061889   0.014438316   0.857811758     2291996
Q2C               0.000072284  13.363007244 2064.092228625     2292191

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8,  0) |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%
---------- | --------- --------- --------- --------- ---------
   Overall |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%

I'm not sure I believe that max value.  2064 seconds seems a bit high.
Also, Q2M should not be anywhere near that big, so more investigation is
required there.  A quick look over the data doesn't show any such delays
(making me question the tools), but I'll write some code tomorrow to
verify the btt output.
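
FWIW, a first cut at that check could be as simple as pairing each G with
the most recent unmatched Q for the same sector and reporting the worst
gap.  A sketch, with the usual caveats (default blkparse output assumed,
sector used as a crude request key):

  import sys

  pending = {}        # sector -> timestamp of the Q event
  worst = 0.0

  for line in sys.stdin:
      f = line.split()
      if len(f) < 8 or f[5] not in ('Q', 'G'):
          continue
      ts, sector = float(f[3]), f[7]
      if f[5] == 'Q':
          pending[sector] = ts
      elif sector in pending:
          worst = max(worst, ts - pending.pop(sector))

  print("max Q2G seen: %.6f seconds" % worst)

If that comes back nowhere near 2064 seconds, the problem is in btt rather
than the trace.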

Jan, if I were to come up with a way of promoting a particular async
queue to the front of the line, where would I put such a call in the
ext4/jbd2 code to be effective?

Mel, can you reproduce this at will?  Do you have a reproducer that I
could run so I'm not constantly bugging you?

Cheers,
Jeff
