Re: Disabling in-memory write cache for x86-64 in Linux II

From: NeilBrown <neilb@suse.de>
To: "Artem S. Tashkinov" <t.artem@lycos.com>
Cc: david@lang.hm, linux-kernel@vger.kernel.org,
	torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org,
	axboe@kernel.dk, linux-mm@kvack.org
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Sat, 26 Oct 2013 09:11:12 +1100	[thread overview]
Message-ID: <20131026091112.241da260@notabene.brown> (raw)
In-Reply-To: <476525596.14731.1382735024280.JavaMail.mail@webmail11>

[-- Attachment #1: Type: text/plain, Size: 4860 bytes --]

On Fri, 25 Oct 2013 21:03:44 +0000 (UTC) "Artem S. Tashkinov"
<t.artem@lycos.com> wrote:

> Oct 26, 2013 02:44:07 AM, neil wrote:
> On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov"
> >> 
> >> Exactly. And not being able to use applications which show you IO performance
> >> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
> >> my life without being able to see the progress of a copying operation. With the current
> >> dirty cache there's no way to understand how you storage media actually behaves.
> >
> >So fix Midnight Commander.  If you want the copy to be actually finished when
> >it says  it is finished, then it needs to call 'fsync()' at the end.
> 
> This sounds like a very bad joke. How applications are supposed to show and
> calculate an _average_ write speed if there are no kernel calls/ioctls to actually
> make the kernel flush dirty buffers _during_ copying? Actually it's a good way to
> solve this problem in user space - alas, even if such calls are implemented, user
> space will start using them only in 2018 if not further from that.

But there is a way to flush dirty buffers *during* copies.  
  man 2 sync_file_range

if giving precise feedback is is paramount importance to you, then this would
be the interface to use.
> 
> >> 
> >> Per device dirty cache seems like a nice idea, I, for one, would like to disable it
> >> altogether or make it an absolute minimum for things like USB flash drives - because
> >> I don't care about multithreaded performance or delayed allocation on such devices -
> >> I'm interested in my data reaching my USB stick ASAP - because it's how most people
> >> use them.
> >>
> >
> >As has already been said, you can substantially disable  the cache by tuning
> >down various values in /proc/sys/vm/.
> >Have you tried?
> 
> I don't understand who you are replying to. I asked about per device settings, you are
> again referring me to system wide settings - they don't look that good if we're talking
> about a 3MB/sec flash drive and 500MB/sec SSD drive. Besides it makes no sense
> to allocate 20% of physical RAM for things which don't belong to it in the first place.

Sorry, missed the per-device bit.
You could try playing with
  /sys/class/bdi/XX:YY/max_ratio

where XX:YY is the major/minor number of the device, so 8:0 for /dev/sda.
Wind it right down for slow devices and you might get something like what you
want.

> 
> I don't know any other OS which has a similar behaviour.

I don't know about the internal details of any other OS, so I cannot really
comment.

> 
> And like people (including me) have already mentioned, such a huge dirty cache can
> stall their PCs/servers for a considerable amount of time.

Yes.  But this is a different issue.
There are two very different issues that should be kept separate.

One is that when "cp" or similar complete, the data hasn't all be written out
yet.  It typically takes another 30 seconds before the flush will complete.
You seemed to primarily complain about this, so that is what I originally
address.  That is where in the "dirty_*_centisecs" values apply.

The other, quite separate, issue is that Linux will cache more dirty data
than it can write out in a reasonable time.  All the tuning parameters refer
to the amount of data (whether as a percentage of RAM or as a number of
bytes), but what people really care about is a number of seconds.

As you might imagine, estimating how long it will take to write out a certain
amount of data is highly non-trivial.  The relationship between megabytes and
seconds can be non-linear and can change over time.

Caching nothing at all can hurt a lot of workloads.  Caching too much can
obviously hurt too.  Caching "5 seconds" worth of data would be ideal, but
would be incredibly difficult to implement.
It is possible that keeping a sliding estimate of device throughput for each
device would be possible, and using that to automatically adjust the
"max_ratio" value (or some related internal thing) might be a 70% solution.

Certainly it would be an interesting project for someone.

> 
> Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also
> not everyone in this world has an UPS - which means such a huge buffer can lead to a
> serious data loss in case of a power blackout.

I don't have a desk (just a lap), but I use Linux on all my computers and
I've never really noticed the problem.  Maybe I'm just very patient, or maybe
I don't work with large data sets and slow devices.

However I don't think data-loss is really a related issue.  Any process that
cares about data safety *must* use fsync at appropriate places.  This has
always been true.

NeilBrown

> 
> Regards,
> 
> Artem

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]