linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ric Wheeler <rwheeler@redhat.com>
To: Howard Chu <hyc@symas.com>
Cc: General Discussion of SQLite Database <sqlite-users@sqlite.org>,
	David Lang <david@lang.hm>, Vladislav Bolkhovitin <vst@vlnb.net>,
	"Theodore Ts'o" <tytso@mit.edu>, Richard Hipp <drh@hwaci.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [sqlite] light weight write barriers
Date: Fri, 16 Nov 2012 13:03:02 -0500	[thread overview]
Message-ID: <50A67FD6.1030108@redhat.com> (raw)
In-Reply-To: <50A661D0.4030200@symas.com>

On 11/16/2012 10:54 AM, Howard Chu wrote:
> Ric Wheeler wrote:
>> On 11/16/2012 10:06 AM, Howard Chu wrote:
>>> David Lang wrote:
>>>> barriers keep getting mentioned because they are a easy concept to understand.
>>>> "do this set of stuff before doing any of this other set of stuff, but I don't
>>>> care when any of this gets done" and they fit well with the requirements of 
>>>> the
>>>> users.
>>>>
>>>> Users readily accept that if the system crashes, they will loose the most 
>>>> recent
>>>> stuff that they did,
>>>
>>> *some* users may accept that. *None* should.
>>>
>>>> but they get annoyed when things get corrupted to the point
>>>> that they loose the entire file.
>>>>
>>>> this includes things like modifying one option and a crash resulting in the
>>>> config file being blank. Yes, you can do the 'write to temp file, sync file,
>>>> sync directory, rename file" dance, but the fact that to do so the user 
>>>> must sit
>>>> and wait for the syncs to take place can be a problem. It would be far 
>>>> better to
>>>> be able to say "write to temp file, and after it's on disk, rename the 
>>>> file" and
>>>> not have the user wait. The user doesn't really care if the changes hit disk
>>>> immediately, or several seconds (or even 10s of seconds) later, as long as 
>>>> there
>>>> is not any possibility of the rename hitting disk before the file contents.
>>>>
>>>> The fact that this could be implemented in multiple ways in the existing
>>>> hardware does not mean that there need to be multiple ways exposed to 
>>>> userspace,
>>>> it just means that the cost of doing the operation will vary depending on the
>>>> hardware that you have. This also means that if new hardware introduces a new
>>>> way of implementing this, that improvement can be passed on to the users 
>>>> without
>>>> needing application changes.
>>>
>>> There are a couple industry failures here:
>>>
>>> 1) the drive manufacturers sell drives that lie, and consumers accept it
>>> because they don't know better. We programmers, who know better, have failed
>>> to raise a stink and demand that this be fixed.
>>>    A) Drives should not lose data on power failure. If a drive accepts a write
>>> request and says "OK, done" then that data should get written to stable
>>> storage, period. Whether it requires capacitors or some other onboard power
>>> supply, or whatever, they should just do it. Keep in mind that today, most of
>>> the difference between enterprise drives and consumer desktop drives is just a
>>> firmware change, that hardware is already identical. Nobody should accept a
>>> product that doesn't offer this guarantee. It's inexcusable.
>>>    B) it should go without saying - drives should reliably report back to the
>>> host, when something goes wrong. E.g., if a write request has been accepted,
>>> cached, and reported complete, but then during the actual write an ECC failure
>>> is detected in the cacheline, the drive needs to tell the host "oh by the way,
>>> block XXX didn't actually make it to disk like I told you it did 10ms ago."
>>>
>>> If the entire software industry were to simply state "your shit stinks and
>>> we're not going to take it any more" the hard drive industry would have no
>>> choice but to fix it. And in most cases it would be a zero-cost fix for them.
>>>
>>> Once you have drives that are actually trustworthy, actually reliable (which
>>> doesn't mean they never fail, it only means they tell the truth about
>>> successes or failures), most of these other issues disappear. Most of the need
>>> for barriers disappear.
>>>
>>
>> I think that you are arguing a fairly silly point.
>
> Seems to me that you're arguing that we should accept inferior technology. 
> Who's really being silly?

No, just suggesting that you either pay for the expensive stuff or learn how to 
use cost effective, high capacity storage like the rest of the world.

I don't disagree that having non-volatile write caches would be nice, but 
everyone has learned how to deal with volatile write caches at the low end of 
market.

>
>> If you want that behaviour, you have had it for more than a decade - simply
>> disable the write cache on your drive and you are done.
>
> You seem to believe it's nonsensical for someone to want both fast and 
> reliable writes, or that it's unreasonable for a storage device to offer the 
> same, cheaply. And yet it is clearly trivial to provide all of the above.

I look forward to seeing your products in the market.

Until you have more than "I want" and "I think" on your storage system design 
resume, I suggest you spend the money to get the parts with non-volatile write 
caches or fix your code.

Ric


>> If you - as a user - want to run faster and use applications that are coded to
>> handle data integrity properly (fsync, fdatasync, etc), leave the write cache
>> enabled and use file system barriers.
>
> Applications aren't supposed to need to worry about such details, that's why 
> we have operating systems.
>
> Drives should tell the truth. In event of an error detected after the fact, 
> the drive should report the error back to the host. There's nothing 
> nonsensical there.
>
> When a drive's cache is enabled, the host should maintain a queue of written 
> pages, of a length equal to the size of the drive's cache. If a drive says 
> "hey, block XXX failed" the OS can reissue the write from its own queue. No 
> muss, no fuss, no performance bottlenecks. This is what Real Computers did 
> before the age of VAX Unix.
>
>> Everyone has to trade off cost versus something else and this is a very, very
>> long standing trade off that drive manufacturers have made.
>
> With the cost of storage falling as rapidly as it has in recent years, this is 
> a stupid tradeoff.
>


  reply	other threads:[~2012-11-16 18:03 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CALwJ=MzHjAOs4J4kGH6HLdwP8E88StDWyAPVumNg9zCWpS9Tdg@mail.gmail.com>
2012-10-10 17:17 ` light weight write barriers Andi Kleen
2012-10-11 16:32   ` [sqlite] " 杨苏立 Yang Su Li
2012-10-11 17:41     ` Christoph Hellwig
2012-10-23 19:53     ` Vladislav Bolkhovitin
2012-10-24 21:17       ` Nico Williams
2012-10-24 22:03         ` david
2012-10-25  0:20           ` Nico Williams
2012-10-25  1:04             ` david
2012-10-25  5:18               ` Nico Williams
2012-10-25  6:02                 ` Theodore Ts'o
2012-10-25  6:58                   ` david
2012-10-25 14:03                     ` Theodore Ts'o
2012-10-25 18:03                       ` david
2012-10-25 18:29                         ` Theodore Ts'o
2012-11-05 20:03                           ` Pavel Machek
2012-11-05 22:04                             ` Theodore Ts'o
     [not found]                               ` <CALwJ=Mx-uEFLXK2wywekk=0dwrwVFb68wocnH9bjXJmHRsJx3w@mail.gmail.com>
2012-11-05 23:00                                 ` Theodore Ts'o
2012-10-30 23:49                   ` Nico Williams
2012-10-25  5:42           ` Theodore Ts'o
2012-10-25  7:11             ` david
2012-10-27  1:52         ` Vladislav Bolkhovitin
2012-10-25  5:14       ` Theodore Ts'o
2012-10-25 13:03         ` Alan Cox
2012-10-25 13:50           ` Theodore Ts'o
2012-10-27  1:55             ` Vladislav Bolkhovitin
2012-10-27  1:54         ` Vladislav Bolkhovitin
2012-10-27  4:44           ` Theodore Ts'o
2012-10-30 22:22             ` Vladislav Bolkhovitin
2012-10-31  9:54               ` Alan Cox
2012-11-01 20:18                 ` Vladislav Bolkhovitin
2012-11-01 21:24                   ` Alan Cox
2012-11-02  0:15                     ` Vladislav Bolkhovitin
2012-11-02  0:38                     ` Howard Chu
2012-11-02 12:33                       ` Alan Cox
2012-11-13  3:41                         ` Vladislav Bolkhovitin
2012-11-13 17:40                           ` Alan Cox
2012-11-13 19:13                             ` Nico Williams
2012-11-15  1:17                               ` Vladislav Bolkhovitin
2012-11-15 12:07                                 ` David Lang
2012-11-16 15:06                                   ` Howard Chu
2012-11-16 15:31                                     ` Ric Wheeler
2012-11-16 15:54                                       ` Howard Chu
2012-11-16 18:03                                         ` Ric Wheeler [this message]
2012-11-16 19:14                                     ` David Lang
2012-11-17  5:02                                   ` Vladislav Bolkhovitin
     [not found]                                   ` <CABK4GYNGrbes2Yhig4ioh-37OXg6iy6gqb3u8A2P2_dqNpMqoQ@mail.gmail.com>
2012-11-17  5:02                                     ` Vladislav Bolkhovitin
2012-11-15 17:06                                 ` Ryan Johnson
2012-11-15 22:35                                   ` Chris Friesen
2012-11-17  5:02                                     ` Vladislav Bolkhovitin
2012-11-20  1:23                                       ` Vladislav Bolkhovitin
2012-11-26 20:05                                         ` Nico Williams
2012-11-29  2:15                                           ` Vladislav Bolkhovitin
2012-11-15  1:16                             ` Vladislav Bolkhovitin
2012-11-13  3:37                       ` Vladislav Bolkhovitin
     [not found]                       ` <CALwJ=MwtFAz7uby+YzPPp2eBG-y+TUTOu9E9tEJbygDQW+s_tg@mail.gmail.com>
2012-11-13  3:41                         ` Vladislav Bolkhovitin
     [not found]           ` <CABK4GYMmigmi7YM9A5Aga21ZWoMKgUe3eX-AhPzLw9CnYhpcGA@mail.gmail.com>
2012-11-13  3:42             ` Vladislav Bolkhovitin
     [not found]   ` <CALwJ=MyR+nU3zqi3V3JMuEGNwd8FUsw9xLACJvd0HoBv3kRi0w@mail.gmail.com>
2012-10-11 16:38     ` Nico Williams
2012-10-11 16:48       ` Nico Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=50A67FD6.1030108@redhat.com \
    --to=rwheeler@redhat.com \
    --cc=david@lang.hm \
    --cc=drh@hwaci.com \
    --cc=hyc@symas.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sqlite-users@sqlite.org \
    --cc=tytso@mit.edu \
    --cc=vst@vlnb.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).