From: Rob Landley <rob@landley.net>
To: Mel Gorman <mgorman@suse.de>
Cc: Jan Kara <jack@suse.cz>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Theodore Ts'o" <tytso@mit.edu>,
	"Artem S. Tashkinov" <t.artem@lycos.com>,
	Wu Fengguang <fengguang.wu@intel.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Tue, 19 Nov 2013 11:17:03 -0600
Message-ID: <1384881423.1974.277@driftwood>
In-Reply-To: <20131030120152.GM2400@suse.de> (from mgorman@suse.de on Wed Oct 30 07:01:52 2013)

On 10/30/2013 07:01:52 AM, Mel Gorman wrote:
> We talked about this a few months ago but I still suspect that we will
> have to bite the bullet and tune based on "do not dirty more data than
> it takes N seconds to writeback" using per-bdi writeback estimations.
> It's just not that trivial to implement as the writeback speeds can
> change for a variety of reasons (multiple IO sources, random vs
> sequential etc).

Record "block writes finished this second" into an 8 entry ring buffer,  
with a flag saying "device was partly idle this period" so you can  
ignore those entries. Keep a high water mark, which should converge to  
the device's linear write capacity.
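
A minimal sketch of that bookkeeping, to make the idea concrete. Every name here (wb_sample, wb_estimator, wb_record_period) is made up for illustration and is not an existing kernel interface; it also assumes something else calls it once per second with that period's totals.

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_RING_SIZE 8  /* the "8 entry ring buffer" */

/* Hypothetical per-device state; none of these names exist in the kernel. */
struct wb_sample {
    uint32_t blocks;    /* block writes finished during this one-second period */
    bool partly_idle;   /* device ran out of work at some point in the period */
};

struct wb_estimator {
    struct wb_sample ring[WB_RING_SIZE];
    unsigned int head;      /* slot the next period will be stored in */
    uint32_t high_water;    /* most blocks ever finished in one period */
};

/* Call once per second with the totals for the period that just ended. */
static void wb_record_period(struct wb_estimator *e, uint32_t blocks,
                             bool partly_idle)
{
    e->ring[e->head].blocks = blocks;
    e->ring[e->head].partly_idle = partly_idle;
    e->head = (e->head + 1) % WB_RING_SIZE;

    /* Even a partly idle period is a lower bound on what the device can do. */
    if (blocks > e->high_water)
        e->high_water = blocks;
}
```

The flag stays in the ring so whoever computes the average can skip partly idle periods.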

This gives you recent thrashing speed and max capacity, and some  
weighted average of the two lets you avoid queuing up 10 minutes of  
writes all at once like 3.0 would to a terabyte USB2 disk. (And then  
vim calls sync() and hangs...)

The first tricky bit is the high water mark, but it's not too bad. If  
the device reads and writes at the same rate you can seed it from the  
read rate, but even starting it at just one block should converge  
really fast because A) the round trip time should be well under a  
second, and B) if you're submitting more than one period's worth of  
data (say you can dirty enough to keep the disk busy for 2 seconds),  
then it'll queue up 2 blocks at a time, then 4, then 8, increasing  
exponentially until you hit the device's real limit. (The high water  
mark is measured, so it won't overshoot.)
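
To put a number on "really fast", here is a toy simulation of the ramp-up. It assumes a pessimistic model where the mark can only double once per one-second period, and where the device finishes whatever is queued up to a fixed capacity; wb_ramp_periods is a made-up name.

```c
#include <stdint.h>

/*
 * Simulate the ramp-up from a high water mark of one block.
 * Each period we queue two periods' worth of the current estimate, the
 * device finishes min(queued, capacity), and the mark rises to match.
 * Returns the number of periods until the mark equals the capacity.
 */
static unsigned int wb_ramp_periods(uint32_t capacity)
{
    uint32_t high_water = 1;
    unsigned int periods = 0;

    while (high_water < capacity) {
        uint64_t queued = 2 * (uint64_t)high_water;
        uint32_t done = queued < capacity ? (uint32_t)queued : capacity;

        if (done > high_water)
            high_water = done;
        periods++;
    }
    return periods;
}
```

Under those assumptions a device that can finish 1000 blocks per period is found in 10 periods, and since the round trip is well under a second the real thing should do better than one doubling per period.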

The second tricky bit is weighting the average, but presumably you  
count the high water mark as one entry, add in all the "device did not  
actually go idle during this period" entries, and divide by the number  
of entries considered... Reasonable first guess?
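
Spelled out, that first guess might look like the sketch below. The function name and the parallel-array layout are assumptions made for the example, nothing more.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Average the high water mark (counted as one entry) with every ring
 * entry whose period did not go idle. Returns blocks per period.
 */
static uint32_t wb_estimate(uint32_t high_water, const uint32_t *blocks,
                            const bool *partly_idle, size_t n)
{
    uint64_t sum = high_water;
    uint32_t count = 1;     /* the high water mark itself */

    for (size_t i = 0; i < n; i++) {
        if (partly_idle[i])
            continue;       /* says nothing about a busy device */
        sum += blocks[i];
        count++;
    }
    return (uint32_t)(sum / count);
}
```

With a high water mark of 120 and busy periods of 100 and 80 blocks, the estimate is (120 + 100 + 80) / 3 = 100. If every period went idle, it falls back to the high water mark alone.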

Obvious optimizations: instead of recording the "disk went idle" flag  
in the ring buffer, just don't advance the ring buffer at the end of  
that second, but zero out the entry and re-accumulate it. That way the  
ring buffer should always have 7 seconds of measured activity, even if  
it's not necessarily recent. And of course you don't have to wake  
anything up when there was no I/O, so it's nicely quiescent when the  
system is...
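
A sketch of that variant, again with invented names (wb_ring, wb_account, wb_mark_idle, wb_end_period). It assumes the completion path calls wb_account(), whoever notices the queue draining calls wb_mark_idle(), and a once-per-second tick calls wb_end_period().

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_RING_SIZE 8

/* Hypothetical state: one slot is being accumulated, seven are complete. */
struct wb_ring {
    uint32_t blocks[WB_RING_SIZE];
    unsigned int cur;   /* entry currently being accumulated */
    bool went_idle;     /* queue drained at some point this period */
};

/* Completion path: n more blocks were written. */
static void wb_account(struct wb_ring *r, uint32_t n)
{
    r->blocks[r->cur] += n;
}

/* Called when the device runs out of queued work. */
static void wb_mark_idle(struct wb_ring *r)
{
    r->went_idle = true;
}

/* Once-per-second tick. */
static void wb_end_period(struct wb_ring *r)
{
    /* Keep the entry only if the device stayed busy for the whole period. */
    if (!r->went_idle)
        r->cur = (r->cur + 1) % WB_RING_SIZE;

    /* Either reuse the same slot or recycle the oldest one. */
    r->blocks[r->cur] = 0;
    r->went_idle = false;
}
```

No per-entry flag is needed this way, because an idle period never makes it into the ring.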

Lowering the high water mark after a transient spurious reading (maybe  
clock skew during suspend, a virtualization glitch, or some such, any  
of which could give you a 4 billion block bad reading) is fun. But if  
you always decrement the high water mark by 25% (x-=(x>>2), which  
rounds the result up) each second the disk didn't go idle, and then  
queue up more than one period's worth of data (but no more than say 8  
seconds' worth), such glitches should fix themselves and it'll work its  
way back up or down to a reasonably accurate value. (Keep in mind  
you're averaging the high water mark back down with 7 seconds of  
measured data from the ring buffer. Maybe you can cap the high water  
mark at the sum of all the measured values in the ring buffer as an  
extra check? You're already calculating it to do the average, so...)
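
Roughly, the per-second update with the decay and the extra cap could look like this. wb_decay_high_water is a made-up helper; it assumes the caller passes in the blocks measured for the busy period that just ended and the sum of the ring buffer entries.

```c
#include <stdint.h>

/*
 * Update the high water mark at the end of a period in which the disk
 * did not go idle. Returns the new mark.
 */
static uint32_t wb_decay_high_water(uint32_t high_water, uint32_t measured,
                                    uint64_t ring_sum)
{
    /* Drop 25%; the shift truncates, so the result is rounded up. */
    high_water -= high_water >> 2;

    /* A genuinely fast period pushes the mark straight back up. */
    if (measured > high_water)
        high_water = measured;

    /* Extra check: never exceed everything measured recently. */
    if (high_water > ring_sum)
        high_water = (uint32_t)ring_sum;

    return high_water;
}
```

With the cap, a bogus 4 billion block reading is gone after one update. Without it, the 25% decay still gets there, just over a lot more seconds.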

This is assuming hard drives _themselves_ don't have bufferbloat, but  
http://spritesmods.com/?art=hddhack&f=rss implies they don't, and  
tagged command queueing lets you see through that anyway so your  
"actually committed" numbers could presumably still be accurate if the  
manufacturers aren't totally lying.

Given how far behind I am on my email, I assume somebody's already  
suggested this by now. :)

Rob
