From: Ric Wheeler <rwheeler@redhat.com>
To: Howard Chu <hyc@symas.com>
Cc: General Discussion of SQLite Database <sqlite-users@sqlite.org>,
David Lang <david@lang.hm>, Vladislav Bolkhovitin <vst@vlnb.net>,
"Theodore Ts'o" <tytso@mit.edu>, Richard Hipp <drh@hwaci.com>,
linux-kernel <linux-kernel@vger.kernel.org>,
linux-fsdevel@vger.kernel.org
Subject: Re: [sqlite] light weight write barriers
Date: Fri, 16 Nov 2012 13:03:02 -0500
Message-ID: <50A67FD6.1030108@redhat.com>
In-Reply-To: <50A661D0.4030200@symas.com>
On 11/16/2012 10:54 AM, Howard Chu wrote:
> Ric Wheeler wrote:
>> On 11/16/2012 10:06 AM, Howard Chu wrote:
>>> David Lang wrote:
>>>> barriers keep getting mentioned because they are an easy concept to understand.
>>>> "do this set of stuff before doing any of this other set of stuff, but I don't
>>>> care when any of this gets done" and they fit well with the requirements of
>>>> the users.
>>>>
>>>> Users readily accept that if the system crashes, they will lose the most
>>>> recent stuff that they did,
>>>
>>> *some* users may accept that. *None* should.
>>>
>>>> but they get annoyed when things get corrupted to the point
>>>> that they lose the entire file.
>>>>
>>>> this includes things like modifying one option and a crash resulting in the
>>>> config file being blank. Yes, you can do the "write to temp file, sync file,
>>>> sync directory, rename file" dance, but the fact that to do so the user
>>>> must sit and wait for the syncs to take place can be a problem. It would be
>>>> far better to be able to say "write to temp file, and after it's on disk,
>>>> rename the file" and not have the user wait. The user doesn't really care if
>>>> the changes hit disk immediately, or several seconds (or even 10s of seconds)
>>>> later, as long as there is not any possibility of the rename hitting disk
>>>> before the file contents.
>>>>
>>>> The fact that this could be implemented in multiple ways in the existing
>>>> hardware does not mean that there need to be multiple ways exposed to
>>>> userspace, it just means that the cost of doing the operation will vary
>>>> depending on the hardware that you have. This also means that if new hardware
>>>> introduces a new way of implementing this, that improvement can be passed on
>>>> to the users without needing application changes.
>>>
>>> There are a couple industry failures here:
>>>
>>> 1) the drive manufacturers sell drives that lie, and consumers accept it
>>> because they don't know better. We programmers, who know better, have failed
>>> to raise a stink and demand that this be fixed.
>>> A) Drives should not lose data on power failure. If a drive accepts a write
>>> request and says "OK, done" then that data should get written to stable
>>> storage, period. Whether it requires capacitors or some other onboard power
>>> supply, or whatever, they should just do it. Keep in mind that today, most of
>>> the difference between enterprise drives and consumer desktop drives is just a
>>> firmware change, the hardware is already identical. Nobody should accept a
>>> product that doesn't offer this guarantee. It's inexcusable.
>>> B) it should go without saying - drives should reliably report back to the
>>> host, when something goes wrong. E.g., if a write request has been accepted,
>>> cached, and reported complete, but then during the actual write an ECC failure
>>> is detected in the cacheline, the drive needs to tell the host "oh by the way,
>>> block XXX didn't actually make it to disk like I told you it did 10ms ago."
>>>
>>> If the entire software industry were to simply state "your shit stinks and
>>> we're not going to take it any more" the hard drive industry would have no
>>> choice but to fix it. And in most cases it would be a zero-cost fix for them.
>>>
>>> Once you have drives that are actually trustworthy, actually reliable (which
>>> doesn't mean they never fail, it only means they tell the truth about
>>> successes or failures), most of these other issues disappear. Most of the need
>>> for barriers disappears.
>>>
>>
>> I think that you are arguing a fairly silly point.
>
> Seems to me that you're arguing that we should accept inferior technology.
> Who's really being silly?

No, just suggesting that you either pay for the expensive stuff or learn how to
use cost-effective, high-capacity storage like the rest of the world.

I don't disagree that having non-volatile write caches would be nice, but
everyone has learned how to deal with volatile write caches at the low end of
the market.

>
>> If you want that behaviour, you have had it for more than a decade - simply
>> disable the write cache on your drive and you are done.
>
> You seem to believe it's nonsensical for someone to want both fast and
> reliable writes, or that it's unreasonable for a storage device to offer the
> same, cheaply. And yet it is clearly trivial to provide all of the above.

I look forward to seeing your products in the market.

Until you have more than "I want" and "I think" on your storage system design
resume, I suggest you spend the money to get the parts with non-volatile write
caches or fix your code.

Ric

>> If you - as a user - want to run faster and use applications that are coded to
>> handle data integrity properly (fsync, fdatasync, etc), leave the write cache
>> enabled and use file system barriers.
>
> Applications aren't supposed to need to worry about such details, that's why
> we have operating systems.
>
> Drives should tell the truth. In event of an error detected after the fact,
> the drive should report the error back to the host. There's nothing
> nonsensical there.
>
> When a drive's cache is enabled, the host should maintain a queue of written
> pages, of a length equal to the size of the drive's cache. If a drive says
> "hey, block XXX failed" the OS can reissue the write from its own queue. No
> muss, no fuss, no performance bottlenecks. This is what Real Computers did
> before the age of VAX Unix.
>
>> Everyone has to trade off cost versus something else and this is a very, very
>> long standing trade off that drive manufacturers have made.
>
> With the cost of storage falling as rapidly as it has in recent years, this is
> a stupid tradeoff.
>
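
For reference, the "write to temp file, sync file, sync directory, rename
file" dance quoted above can be sketched roughly as follows. This is only an
illustration under stated assumptions, not code from anyone in this thread:
the helper name atomic_replace is hypothetical, a POSIX system is assumed
(fsync on a directory descriptor is not available everywhere), and the
directory sync is done after the rename here so that the rename itself is
made durable. Each fsync call is exactly the wait that the quoted text
complains about.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Hypothetical helper: replace `path` with `data` (bytes) so that a crash
    leaves either the complete old contents or the complete new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one
    # file system. Note that mkstemp creates it with mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Block until the file contents are reported on stable storage.
            os.fsync(tmp.fileno())
        # Rename over the old name; on POSIX this replaces it atomically.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the new directory entry is durable as well.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A caller would use it as atomic_replace("app.conf", b"option = 2\n"), where
the file name is again only an example. Whether the data truly survives power
loss still depends on the drive honoring the cache flush, which is the point
under dispute in this thread.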