From: Jeff Moyer <jmoyer@redhat.com> To: Mel Gorman <mgorman@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu>, Dave Chinner <david@fromorbit.com>, Jan Kara <jack@suse.cz>, linux-ext4@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>, Linux-MM <linux-mm@kvack.org>, Jiri Slaby <jslaby@suse.cz> Subject: Re: Excessive stall times on ext4 in 3.9-rc2 Date: Mon, 22 Apr 2013 18:42:23 -0400 [thread overview] Message-ID: <x49a9oqmblc.fsf@segfault.boston.devel.redhat.com> (raw) In-Reply-To: <20130422143846.GA2675@suse.de> (Mel Gorman's message of "Mon, 22 Apr 2013 15:38:46 +0100") Mel Gorman <mgorman@suse.de> writes: > (Adding Jeff Moyer to the cc as I'm told he is interested in the blktrace) Thanks. I've got a few comments and corrections for you below. > TLDR: Flusher writes pages very quickly after processes dirty a buffer. Reads > starve flusher writes. [snip] > 3. The blktrace indicates that reads can starve writes from flusher > > While there are people that can look at a blktrace and find problems > like they are rain man, I'm more like an ADHD squirrel when looking at > a trace. I wrote a script to look for what unrelated requests completed > while an request got stalled for over a second. It seemed like something > that a tool shoudl already exist for but I didn't find one unless btt > can give the information somehow. Care to share that script? [snip] > I recognise that the output will have a WTF reaction but the key > observations to me are > > a) a single write request from flusher took over a second to complete > b) at the time it was queued, it was mostly other writes that were in > the queue at the same time > c) The write request and the parallel writes were all asynchronous write > requests > D) at the time the request completed, there were a LARGE number of > other requested queued and completed at the same time. > > Of the requests queued and completed in the meantime the breakdown was > > 22 RM > 31 RA > 82 W > 445 R > > If I'm reading this correctly, it is saying that 22 reads were merged (RM), > 31 reads were remapped to another device (RA) which is probably reads from > the dm-crypt partition, 82 were writes (W) which is not far off the number > of writes that were in the queue and 445 were other reads. The delay was > dominated by reads that were queued after the write request and completed > before it. RM == Read Meta RA == Read Ahead (remapping, by the way, does not happen across devices, just into partitions) W and R you understood correctly. > That's saying that the 27128th request in the trace took over 7 seconds > to complete and was an asynchronous write from flusher. The contents of > the queue are displayed at that time and the breakdown of requests is > > 23 WS [JEM: write sync] > 86 RM [JEM: Read Meta] > 124 RA [JEM: Read Ahead] > 442 W > 1931 R > > 7 seconds later when it was completed the breakdown of completed > requests was > > 25 WS > 114 RM > 155 RA > 408 W > 2457 R > > In combination, that confirms for me that asynchronous writes from flush > are being starved by reads. When a process requires a buffer that is locked > by that asynchronous write from flusher, it stalls. > >> The thing is, we do want to make ext4 work well with cfq, and >> prioritizing non-readahead read requests ahead of data writeback does >> make sense. The issue is with is that metadata writes going through >> the block device could in some cases effectively cause a priority >> inversion when what had previously been an asynchronous writeback >> starts blocking a foreground, user-visible process. >> >> At least, that's the theory; > > I *think* the data more or less confirms the theory but it'd be nice if > someone else double checked in case I'm seeing what I want to see > instead of what is actually there. Looks sane. You can also see a lot of "preempt"s in the blkparse output, which indicates exactly what you're saying. Any sync request gets priority over the async requests. I'll also note that even though your I/O is going all over the place (D2C is pretty bad, 14ms), most of the time is spent waiting for a struct request allocation or between Queue and Merge: ==================== All Devices ==================== ALL MIN AVG MAX N --------------- ------------- ------------- ------------- ----------- Q2Q 0.000000001 0.000992259 8.898375882 2300861 Q2G 0.000000843 10.193261239 2064.079501935 1016463 <==== G2I 0.000000461 0.000044702 3.237065090 1015803 Q2M 0.000000101 8.203147238 2064.079367557 1311662 I2D 0.000002012 1.476824812 2064.089774419 1014890 M2D 0.000003283 6.994306138 283.573348664 1284872 D2C 0.000061889 0.014438316 0.857811758 2291996 Q2C 0.000072284 13.363007244 2064.092228625 2292191 ==================== Device Overhead ==================== DEV | Q2G G2I Q2M I2D D2C ---------- | --------- --------- --------- --------- --------- ( 8, 0) | 33.8259% 0.0001% 35.1275% 4.8932% 0.1080% ---------- | --------- --------- --------- --------- --------- Overall | 33.8259% 0.0001% 35.1275% 4.8932% 0.1080% I'm not sure I believe that max value. 2064 seconds seems a bit high. Also, Q2M should not be anywhere near that big, so more investigation is required there. A quick look over the data doesn't show any such delays (making me question the tools), but I'll write some code tomorrow to verify the btt output. Jan, if I were to come up with a way of promoting a particular async queue to the front of the line, where would I put such a call in the ext4/jbd2 code to be effective? Mel, can you reproduce this at will? Do you have a reproducer that I could run so I'm not constantly bugging you? Cheers, Jeff
WARNING: multiple messages have this Message-ID (diff)
From: Jeff Moyer <jmoyer@redhat.com> To: Mel Gorman <mgorman@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu>, Dave Chinner <david@fromorbit.com>, Jan Kara <jack@suse.cz>, linux-ext4@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>, Linux-MM <linux-mm@kvack.org>, Jiri Slaby <jslaby@suse.cz> Subject: Re: Excessive stall times on ext4 in 3.9-rc2 Date: Mon, 22 Apr 2013 18:42:23 -0400 [thread overview] Message-ID: <x49a9oqmblc.fsf@segfault.boston.devel.redhat.com> (raw) In-Reply-To: <20130422143846.GA2675@suse.de> (Mel Gorman's message of "Mon, 22 Apr 2013 15:38:46 +0100") Mel Gorman <mgorman@suse.de> writes: > (Adding Jeff Moyer to the cc as I'm told he is interested in the blktrace) Thanks. I've got a few comments and corrections for you below. > TLDR: Flusher writes pages very quickly after processes dirty a buffer. Reads > starve flusher writes. [snip] > 3. The blktrace indicates that reads can starve writes from flusher > > While there are people that can look at a blktrace and find problems > like they are rain man, I'm more like an ADHD squirrel when looking at > a trace. I wrote a script to look for what unrelated requests completed > while an request got stalled for over a second. It seemed like something > that a tool shoudl already exist for but I didn't find one unless btt > can give the information somehow. Care to share that script? [snip] > I recognise that the output will have a WTF reaction but the key > observations to me are > > a) a single write request from flusher took over a second to complete > b) at the time it was queued, it was mostly other writes that were in > the queue at the same time > c) The write request and the parallel writes were all asynchronous write > requests > D) at the time the request completed, there were a LARGE number of > other requested queued and completed at the same time. > > Of the requests queued and completed in the meantime the breakdown was > > 22 RM > 31 RA > 82 W > 445 R > > If I'm reading this correctly, it is saying that 22 reads were merged (RM), > 31 reads were remapped to another device (RA) which is probably reads from > the dm-crypt partition, 82 were writes (W) which is not far off the number > of writes that were in the queue and 445 were other reads. The delay was > dominated by reads that were queued after the write request and completed > before it. RM == Read Meta RA == Read Ahead (remapping, by the way, does not happen across devices, just into partitions) W and R you understood correctly. > That's saying that the 27128th request in the trace took over 7 seconds > to complete and was an asynchronous write from flusher. The contents of > the queue are displayed at that time and the breakdown of requests is > > 23 WS [JEM: write sync] > 86 RM [JEM: Read Meta] > 124 RA [JEM: Read Ahead] > 442 W > 1931 R > > 7 seconds later when it was completed the breakdown of completed > requests was > > 25 WS > 114 RM > 155 RA > 408 W > 2457 R > > In combination, that confirms for me that asynchronous writes from flush > are being starved by reads. When a process requires a buffer that is locked > by that asynchronous write from flusher, it stalls. > >> The thing is, we do want to make ext4 work well with cfq, and >> prioritizing non-readahead read requests ahead of data writeback does >> make sense. The issue is with is that metadata writes going through >> the block device could in some cases effectively cause a priority >> inversion when what had previously been an asynchronous writeback >> starts blocking a foreground, user-visible process. >> >> At least, that's the theory; > > I *think* the data more or less confirms the theory but it'd be nice if > someone else double checked in case I'm seeing what I want to see > instead of what is actually there. Looks sane. You can also see a lot of "preempt"s in the blkparse output, which indicates exactly what you're saying. Any sync request gets priority over the async requests. I'll also note that even though your I/O is going all over the place (D2C is pretty bad, 14ms), most of the time is spent waiting for a struct request allocation or between Queue and Merge: ==================== All Devices ==================== ALL MIN AVG MAX N --------------- ------------- ------------- ------------- ----------- Q2Q 0.000000001 0.000992259 8.898375882 2300861 Q2G 0.000000843 10.193261239 2064.079501935 1016463 <==== G2I 0.000000461 0.000044702 3.237065090 1015803 Q2M 0.000000101 8.203147238 2064.079367557 1311662 I2D 0.000002012 1.476824812 2064.089774419 1014890 M2D 0.000003283 6.994306138 283.573348664 1284872 D2C 0.000061889 0.014438316 0.857811758 2291996 Q2C 0.000072284 13.363007244 2064.092228625 2292191 ==================== Device Overhead ==================== DEV | Q2G G2I Q2M I2D D2C ---------- | --------- --------- --------- --------- --------- ( 8, 0) | 33.8259% 0.0001% 35.1275% 4.8932% 0.1080% ---------- | --------- --------- --------- --------- --------- Overall | 33.8259% 0.0001% 35.1275% 4.8932% 0.1080% I'm not sure I believe that max value. 2064 seconds seems a bit high. Also, Q2M should not be anywhere near that big, so more investigation is required there. A quick look over the data doesn't show any such delays (making me question the tools), but I'll write some code tomorrow to verify the btt output. Jan, if I were to come up with a way of promoting a particular async queue to the front of the line, where would I put such a call in the ext4/jbd2 code to be effective? Mel, can you reproduce this at will? Do you have a reproducer that I could run so I'm not constantly bugging you? Cheers, Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2013-04-22 22:44 UTC|newest] Thread overview: 105+ messages / expand[flat|nested] mbox.gz Atom feed top 2013-04-02 14:27 Excessive stall times on ext4 in 3.9-rc2 Mel Gorman 2013-04-02 14:27 ` Mel Gorman 2013-04-02 15:00 ` Jiri Slaby 2013-04-02 15:00 ` Jiri Slaby 2013-04-02 15:03 ` Zheng Liu 2013-04-02 15:03 ` Zheng Liu 2013-04-02 15:15 ` Mel Gorman 2013-04-02 15:15 ` Mel Gorman 2013-04-02 15:06 ` Theodore Ts'o 2013-04-02 15:06 ` Theodore Ts'o 2013-04-02 15:14 ` Theodore Ts'o 2013-04-02 15:14 ` Theodore Ts'o 2013-04-02 18:19 ` Theodore Ts'o 2013-04-02 18:19 ` Theodore Ts'o 2013-04-07 21:59 ` Frank Ch. Eigler 2013-04-07 21:59 ` Frank Ch. Eigler 2013-04-08 8:36 ` Mel Gorman 2013-04-08 8:36 ` Mel Gorman 2013-04-08 10:52 ` Frank Ch. Eigler 2013-04-08 10:52 ` Frank Ch. Eigler 2013-04-08 11:01 ` Theodore Ts'o 2013-04-08 11:01 ` Theodore Ts'o 2013-04-03 10:19 ` Mel Gorman 2013-04-03 10:19 ` Mel Gorman 2013-04-03 12:05 ` Theodore Ts'o 2013-04-03 12:05 ` Theodore Ts'o 2013-04-03 15:15 ` Mel Gorman 2013-04-05 22:18 ` Jiri Slaby 2013-04-05 22:18 ` Jiri Slaby 2013-04-05 23:16 ` Theodore Ts'o 2013-04-05 23:16 ` Theodore Ts'o 2013-04-06 7:29 ` Jiri Slaby 2013-04-06 7:29 ` Jiri Slaby 2013-04-06 7:37 ` Jiri Slaby 2013-04-06 7:37 ` Jiri Slaby 2013-04-06 8:19 ` Jiri Slaby 2013-04-06 13:15 ` Theodore Ts'o 2013-04-06 13:15 ` Theodore Ts'o 2013-04-10 10:56 ` Mel Gorman 2013-04-10 10:56 ` Mel Gorman 2013-04-10 13:12 ` Theodore Ts'o 2013-04-10 13:12 ` Theodore Ts'o 2013-04-11 17:04 ` Mel Gorman 2013-04-11 17:04 ` Mel Gorman 2013-04-11 18:35 ` Theodore Ts'o 2013-04-11 18:35 ` Theodore Ts'o 2013-04-11 21:33 ` Jan Kara 2013-04-11 21:33 ` Jan Kara 2013-04-12 2:57 ` Theodore Ts'o 2013-04-12 2:57 ` Theodore Ts'o 2013-04-12 4:50 ` Dave Chinner 2013-04-12 4:50 ` Dave Chinner 2013-04-12 15:19 ` Theodore Ts'o 2013-04-12 15:19 ` Theodore Ts'o 2013-04-13 1:23 ` Dave Chinner 2013-04-13 1:23 ` Dave Chinner 2013-04-22 14:38 ` Mel Gorman 2013-04-22 14:38 ` Mel Gorman 2013-04-22 22:42 ` Jeff Moyer [this message] 2013-04-22 22:42 ` Jeff Moyer 2013-04-23 0:02 ` Theodore Ts'o 2013-04-23 0:02 ` Theodore Ts'o 2013-04-23 9:31 ` Jan Kara 2013-04-23 9:31 ` Jan Kara 2013-04-23 14:01 ` Mel Gorman 2013-04-23 14:01 ` Mel Gorman 2013-04-24 19:09 ` Jeff Moyer 2013-04-24 19:09 ` Jeff Moyer 2013-04-25 12:21 ` Mel Gorman 2013-04-25 12:21 ` Mel Gorman 2013-04-12 9:47 ` Mel Gorman 2013-04-12 9:47 ` Mel Gorman 2013-04-21 0:05 ` Theodore Ts'o 2013-04-21 0:05 ` Theodore Ts'o 2013-04-21 0:07 ` [PATCH 1/3] ext4: mark all metadata I/O with REQ_META Theodore Ts'o 2013-04-21 0:07 ` Theodore Ts'o 2013-04-21 0:07 ` [PATCH 2/3] buffer: add BH_Prio and BH_Meta flags Theodore Ts'o 2013-04-21 0:07 ` Theodore Ts'o 2013-04-21 0:07 ` [PATCH 3/3] ext4: mark metadata blocks using bh flags Theodore Ts'o 2013-04-21 0:07 ` Theodore Ts'o 2013-04-21 6:09 ` Jiri Slaby 2013-04-21 6:09 ` Jiri Slaby 2013-04-21 6:09 ` Jiri Slaby 2013-04-21 19:55 ` Theodore Ts'o 2013-04-21 19:55 ` Theodore Ts'o 2013-04-21 19:55 ` Theodore Ts'o 2013-04-21 20:48 ` [PATCH 3/3 -v2] " Theodore Ts'o 2013-04-21 20:48 ` Theodore Ts'o 2013-04-21 20:48 ` Theodore Ts'o 2013-04-22 12:06 ` [PATCH 1/3] ext4: mark all metadata I/O with REQ_META Zheng Liu 2013-04-22 12:06 ` Zheng Liu 2013-04-23 15:33 ` Excessive stall times on ext4 in 3.9-rc2 Mel Gorman 2013-04-23 15:33 ` Mel Gorman 2013-04-23 15:50 ` Theodore Ts'o 2013-04-23 15:50 ` Theodore Ts'o 2013-04-23 16:13 ` Mel Gorman 2013-04-23 16:13 ` Mel Gorman 2013-04-12 10:18 ` Tvrtko Ursulin 2013-04-12 10:18 ` Tvrtko Ursulin 2013-04-12 9:45 ` Mel Gorman 2013-04-12 9:45 ` Mel Gorman 2013-04-02 23:16 ` Theodore Ts'o 2013-04-02 23:16 ` Theodore Ts'o 2013-04-03 15:22 ` Mel Gorman 2013-04-03 15:22 ` Mel Gorman
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=x49a9oqmblc.fsf@segfault.boston.devel.redhat.com \ --to=jmoyer@redhat.com \ --cc=david@fromorbit.com \ --cc=jack@suse.cz \ --cc=jslaby@suse.cz \ --cc=linux-ext4@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-mm@kvack.org \ --cc=mgorman@suse.de \ --cc=tytso@mit.edu \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.