linux-kernel.vger.kernel.org archive mirror
From: Ric Wheeler <rwheeler@redhat.com>
To: David Rees <drees76@gmail.com>
Cc: Jeff Garzik <jeff@garzik.org>, Theodore Tso <tytso@mit.edu>,
	Jan Kara <jack@suse.cz>, Chris Mason <chris.mason@oracle.com>,
	Ric Wheeler <rwheeler@redhat.com>,
	Linux Kernel Developers List <linux-kernel@vger.kernel.org>,
	Ext4 Developers List <linux-ext4@vger.kernel.org>
Subject: Re: [PATCH 0/3] Ext3 latency improvement patches
Date: Mon, 30 Mar 2009 10:16:51 -0400
Message-ID: <49D0D453.4000307@redhat.com>
In-Reply-To: <72dbd3150903271724n5e7900a5j2486707565cd9d74@mail.gmail.com>

David Rees wrote:
> On Fri, Mar 27, 2009 at 5:14 PM, Jeff Garzik <jeff@garzik.org> wrote:
>   
>> Theodore Tso wrote:
>>     
>>> OTOH, the really big databases will tend to use direct I/O, so they
>>> won't be dirtying the page cache anyway.  So maybe it's not worth the
>>>       
>> Not necessarily...  From what I understand, a lot of the individual
>> low-level components in cloud storage, such as GoogleFS's chunk server[1] do
>> not bypass the page cache, even though they do care about the details of
>> data caching and data consistency.
>>     
>
> PostgreSQL does not use direct I/O, either (except for the
> write-ahead-logs which are written sequentially and only get read
> during database recovery).  I'm sure that most of MySQL's database
> engines don't, either.
>
> -Dave
>   

The high-end, traditional databases like DB2 and Oracle definitely do 
tend to use direct I/O and carefully manage cached vs. uncached pages 
on their own.

They also tend to use database "page sizes" larger than our VM page 
size or FS block size, and they work hard to send large, aligned I/Os 
down to storage in the correct order so they can be fully recoverable 
after a crash (no partially updated DB pages, aka "torn pages").
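
A rough sketch of the alignment side of this, in Python. The sizes 
(16 KiB DB page, 4 KiB FS block) and the helper names are made up for 
illustration and are not taken from any particular database. It only 
shows each DB page going down as one block-aligned write from a 
page-aligned buffer; it does not by itself prevent torn pages, which 
still depends on what the storage stack guarantees. To try it with 
direct I/O you would open the file with os.O_DIRECT on a filesystem 
that supports it.

```python
import mmap
import os

DB_PAGE_SIZE = 16 * 1024   # hypothetical database page size
FS_BLOCK_SIZE = 4 * 1024   # hypothetical filesystem block size


def is_aligned(offset, length, align=FS_BLOCK_SIZE):
    """True if both the file offset and the I/O length are multiples of align."""
    return length > 0 and offset % align == 0 and length % align == 0


def page_offset(page_no, page_size=DB_PAGE_SIZE):
    """File offset at which database page number page_no starts."""
    return page_no * page_size


def write_page(fd, page_no, payload):
    """Write one whole DB page as a single aligned pwrite().

    The payload is zero-padded to DB_PAGE_SIZE. An anonymous mmap is used
    as the buffer because it is page-aligned in memory, which O_DIRECT
    typically requires.
    """
    if len(payload) > DB_PAGE_SIZE:
        raise ValueError("payload larger than one DB page")
    offset = page_offset(page_no)
    if not is_aligned(offset, DB_PAGE_SIZE):
        raise ValueError("DB page is not aligned to the FS block size")
    buf = mmap.mmap(-1, DB_PAGE_SIZE)  # zero-filled, page-aligned buffer
    try:
        buf.write(payload)
        written = os.pwrite(fd, buf, offset)
    finally:
        buf.close()
    if written != DB_PAGE_SIZE:
        # A short write would leave a partially updated page on disk.
        raise OSError("short write: %d of %d bytes" % (written, DB_PAGE_SIZE))
    return written
```

Calling write_page(fd, 2, b"hello") on a file opened with os.open() 
writes the 16 KiB range starting at offset 32768 in one call.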

A lot of the cloud storage people rely on whole files. For example, you 
implement RAID at the file level by breaking your file down into K 
chunks, each one sent over the network to a different machine. Each 
chunk is really a whole file and is sent to disk (hopefully with an 
fsync()!) before ack'ing the transaction. They don't worry about data 
integrity for objects smaller than that chunk size.
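
As a minimal sketch of that "write the chunk, fsync, then ack" 
ordering, in Python: the function names are hypothetical, a local 
directory stands in for each remote machine, and the network transfer 
is left out entirely. The extra fsync() on the directory is my own 
addition so that the new file name is durable too; it is not 
something described above.

```python
import os


def split_into_chunks(data, k):
    """Split data into k contiguous chunks of near-equal size."""
    if k <= 0:
        raise ValueError("k must be positive")
    size = -(-len(data) // k)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(k)]


def store_chunk(directory, name, chunk):
    """Write one chunk as a whole file and return True (the 'ack') only
    after the data has been fsync()ed."""
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(chunk)
        while view:                      # handle short writes
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # data must reach stable storage first
    finally:
        os.close(fd)
    # Assumption: also fsync the directory so the new entry survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return True                          # only now is it safe to ack
```

A caller would run store_chunk() once per chunk returned by 
split_into_chunks() and treat the transaction as complete only when 
every call has returned.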

At least, this is how we did it in Centera - without doing that, you are 
definitely open to data loss.

Ric




