linux-kernel.vger.kernel.org archive mirror
* True  fsync() in Linux (on IDE)
@ 2004-03-18  1:08 Peter Zaitsev
  2004-03-18  6:47 ` Jens Axboe
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zaitsev @ 2004-03-18  1:08 UTC (permalink / raw)
  To: Linux Kernel

Hello,

I'm wondering whether there is any way in Linux to do a proper fsync(),
one that makes sure data is written to the disk.

Currently, on IDE devices, fsync() only flushes data to the drive cache,
which is not enough for the ACID guarantees a database server must give.

One solution is simply to disable the drive write cache, but that seems
to slow down performance way too much.

I would also be happy enough with some global kernel option that would
enable a drive cache flush on fsync :)


Mac OS X also has this "optimization", but at least it provides an
alternative flush method for Database Servers:

fcntl(fd, F_FULLFSYNC, NULL)

can be used instead of fsync() to get true fsync() behavior. 
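
A minimal sketch of how an application might wrap this, assuming it wants
one call that works on both systems. The helper name full_sync() is
hypothetical and not from any library; on platforms that do not define
F_FULLFSYNC it is only as strong as plain fsync(), which is exactly the
limitation described above.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: push the data for fd as far towards the platter
 * as the platform lets us ask for. Returns 0 on success, -1 on error. */
int full_sync(int fd)
{
    assert(fd >= 0);
#ifdef F_FULLFSYNC
    /* Mac OS X: also ask the drive to flush its write cache. */
    if (fcntl(fd, F_FULLFSYNC, NULL) != -1)
        return 0;
    /* The request can fail on some file systems; fall back to fsync(). */
#endif
    /* Elsewhere: plain fsync(), which may stop at the drive cache. */
    return fsync(fd);
}
```

A caller would use full_sync(fd) wherever it currently calls fsync(fd) on
a log or data file.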

-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
  http://www.mysql.com/uc2004/


* Re: True  fsync() in Linux (on IDE)
@ 2004-03-22 13:08 Heikki Tuuri
  2004-03-22 13:23 ` Jens Axboe
  0 siblings, 1 reply; 40+ messages in thread
From: Heikki Tuuri @ 2004-03-22 13:08 UTC (permalink / raw)
  To: linux-kernel

Hi!

I have written the InnoDB backend to MySQL. Some notes on the fsync()
processing problem:

1. It is dangerous for a database if fsync'ed files are physically written
to the disk in an order different from the order in which the fsync's were
called on them. In a power outage this can cause database corruption.

For example, a database must make sure that the log file is written to the
disk at least up to the 'log sequence number' of any data page written to
disk. Thus, we must first write to the log file and call fsync() on it, and
only after that are we allowed to write the data page to a data file and
call fsync() on the data file.
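
The ordering described in this example can be sketched as follows. This is
an illustration only, not InnoDB code: the function names, the offsets and
the single-record interface are assumptions, and the guarantee holds only
if fsync() really reaches the disk rather than the drive cache.

```c
#define _XOPEN_SOURCE 700   /* for pwrite() */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly len bytes at offset off, continuing after short writes. */
static int write_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/* Hypothetical helper: make the log record durable before the data page. */
int log_then_page(int log_fd, const void *rec, size_t rec_len, off_t log_off,
                  int data_fd, const void *page, size_t page_len,
                  off_t page_off)
{
    assert(log_fd >= 0 && data_fd >= 0);

    /* Step 1: write the log record and wait until it is synced. */
    if (write_all(log_fd, rec, rec_len, log_off) != 0 || fsync(log_fd) != 0)
        return -1;

    /* Step 2: only now may the data page be written and synced. */
    if (write_all(data_fd, page, page_len, page_off) != 0 ||
        fsync(data_fd) != 0)
        return -1;

    return 0;
}
```

If the drive reorders the two flushes internally, step 2 can reach the
platter before step 1, which is the corruption case described in point 1.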

2. An 'atomic' file write in the OS does not solve the problem of partially
written database pages in a power outage if the disk drive is not guaranteed
to stay operational long enough to be able to write the whole page
physically to disk. An InnoDB data page is 16 kB, and probably not
guaranteed to be any 'atomic' unit of physical disk writes. However, in
practice, half-written pages (either because of the OS or the disk) seem to
be very rare.

3. Jeffrey Siegal wrote to me that he checked a few disk drives to see
whether they support a cache flush. Some of them did, others did not. If the
disk drive does not support a cache flush, then the only way to do a proper
fsync is to configure it not to cache writes at all. However, in some drives
even the option to disable the write cache may be missing.

Best regards,

Heikki Tuuri
Innobase Oy
http://www.innodb.com

...........
List:       linux-kernel
Subject:    Re: True  fsync() in Linux (on IDE)
From:       Peter Zaitsev <peter () mysql ! com>
Date:       2004-03-20 19:48:23
Message-ID: <1079812102.3182.31.camel () abyss ! local>

On Sat, 2004-03-20 at 02:20, Jamie Lokier wrote:
> Peter Zaitsev wrote:
> > If the file system would guarantee atomicity of write() calls
> > (synchronous would be enough) we could disable it and get good extra
> > performance.
>
> Store an MD5 or SHA digest of the page in the page itself, or elsewhere.
> (Obviously the digest doesn't include the bytes used to store it).
>
> Then partial write errors are always detectable, even if there's a
> hardware failure, so journal writes are effectively atomic.

Jamie,

The problem is not detecting the partial page writes, but dealing with
them. Obviously there is a checksum on the page (it is, however, not
MD5/SHA, which are designed for cryptographic needs), and so page
corruption is detected if it happens for whatever reason.
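
A minimal sketch of this kind of detection, under stated assumptions: the
16 kB page size comes from this thread, but the layout (checksum in the
last four bytes) and the choice of FNV-1a are stand-ins picked for brevity.
This is not InnoDB's actual page format or checksum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DB_PAGE_SIZE 16384u                 /* 16 kB page */
#define DB_BODY_SIZE (DB_PAGE_SIZE - 4u)    /* last 4 bytes hold the sum */

/* FNV-1a, 32 bit: a simple non-cryptographic hash used as a stand-in. */
static uint32_t fnv1a(const unsigned char *p, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Store the checksum of the page body in the page trailer. */
void page_seal(unsigned char *page)
{
    assert(page != NULL);
    uint32_t sum = fnv1a(page, DB_BODY_SIZE);
    memcpy(page + DB_BODY_SIZE, &sum, sizeof sum);
}

/* Return 1 if the stored checksum matches the body, 0 otherwise. */
int page_is_intact(const unsigned char *page)
{
    assert(page != NULL);
    uint32_t stored;
    memcpy(&stored, page + DB_BODY_SIZE, sizeof stored);
    return stored == fnv1a(page, DB_BODY_SIZE);
}
```

A check like page_is_intact() tells the server that a page is damaged, but
not which part of it is old and which part is new.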

The problem is you can't do anything with the page if only an unknown
portion of it was modified.

InnoDB uses a sort of "logical" logging, which just says something like
"delete row #2 from page #123", so if the page is badly corrupted the log
will not help to recover it.

Of course you can log full pages, but this will increase overhead
significantly, especially for small row sizes.

This is why the solution now is to use a long-term "logical" log and a
short-term "physical" log; the latter is used by the background page
writer before it writes pages to their original locations.
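
The short-term "physical" log idea can be sketched roughly like this. It is
a simplified illustration, not the InnoDB implementation: the function
names, the single-slot scratch file and the caller-supplied page offset are
all assumptions. A real system would keep many slots, record where each
page belongs, and verify the scratch copy's checksum before trusting it.

```c
#define _XOPEN_SOURCE 700   /* for pread() and pwrite() */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical page writer: make a full copy durable in a scratch area
 * first, then write the page in place. A short write counts as failure. */
int doublewrite_page(int scratch_fd, int data_fd,
                     const void *page, size_t len, off_t page_off)
{
    assert(scratch_fd >= 0 && data_fd >= 0);

    /* 1. Full page image to the scratch area, synced. */
    if (pwrite(scratch_fd, page, len, 0) != (ssize_t)len ||
        fsync(scratch_fd) != 0)
        return -1;

    /* 2. Page to its original location, synced. If power fails here,
     *    the scratch copy is still complete. */
    if (pwrite(data_fd, page, len, page_off) != (ssize_t)len ||
        fsync(data_fd) != 0)
        return -1;

    return 0;
}

/* Hypothetical recovery step: call this when the in-place page failed
 * its checksum after a crash. buf must hold at least len bytes. */
int restore_from_scratch(int scratch_fd, int data_fd,
                         void *buf, size_t len, off_t page_off)
{
    assert(scratch_fd >= 0 && data_fd >= 0);

    if (pread(scratch_fd, buf, len, 0) != (ssize_t)len)
        return -1;
    if (pwrite(data_fd, buf, len, page_off) != (ssize_t)len ||
        fsync(data_fd) != 0)
        return -1;
    return 0;
}
```

Once the page is whole again, the logical log records can be applied to it
as usual. Like everything else in this thread, the scheme depends on
fsync() reaching the disk in the order the calls were made.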


-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com



end of thread, other threads:[~2004-03-22 20:28 UTC | newest]

Thread overview: 40+ messages
2004-03-18  1:08 True fsync() in Linux (on IDE) Peter Zaitsev
2004-03-18  6:47 ` Jens Axboe
2004-03-18 11:34   ` Matthias Andree
2004-03-18 11:55     ` Jens Axboe
2004-03-18 12:21       ` Matthias Andree
2004-03-18 12:37         ` Jens Axboe
2004-03-18 11:58     ` (no subject) Daniel Czarnecki
2004-03-18 19:44   ` True fsync() in Linux (on IDE) Peter Zaitsev
2004-03-18 19:47     ` Jens Axboe
2004-03-18 20:11       ` Chris Mason
2004-03-18 20:17         ` Peter Zaitsev
2004-03-18 20:33           ` Chris Mason
2004-03-18 20:46             ` Peter Zaitsev
2004-03-18 21:02               ` Chris Mason
2004-03-18 21:09                 ` Peter Zaitsev
2004-03-18 21:19                   ` Chris Mason
2004-03-19  8:05                     ` Hans Reiser
2004-03-19 13:52                       ` Chris Mason
2004-03-19 19:26                         ` Peter Zaitsev
2004-03-19 20:23                           ` Chris Mason
2004-03-19 20:31                             ` Hans Reiser
2004-03-19 20:38                               ` Chris Mason
2004-03-19 20:48                                 ` Hans Reiser
2004-03-19 20:56                                   ` Chris Mason
2004-03-20 11:04                                     ` Hans Reiser
2004-03-19 19:36                         ` Hans Reiser
2004-03-19 19:57                           ` Chris Mason
2004-03-19 20:04                             ` Hans Reiser
2004-03-19 20:15                               ` Chris Mason
2004-03-19 20:06                           ` Peter Zaitsev
2004-03-19 22:03                             ` Matthias Andree
2004-03-20 10:20                             ` Jamie Lokier
2004-03-20 19:48                               ` Peter Zaitsev
2004-03-22 13:08 Heikki Tuuri
2004-03-22 13:23 ` Jens Axboe
2004-03-22 15:17   ` Matthias Andree
2004-03-22 15:35     ` Christoph Hellwig
2004-03-22 19:12     ` Christoffer Hall-Frederiksen
2004-03-22 20:28       ` Matthias Andree
2004-03-22 19:33     ` Hans Reiser
