* [Qemu-devel] Ensuring data is written to disk
@ 2006-08-01  0:11 Armistead, Jason
  2006-08-01 10:17 ` Jamie Lokier
  0 siblings, 1 reply; 13+ messages in thread
From: Armistead, Jason @ 2006-08-01  0:11 UTC (permalink / raw)
  To: Qemu-Devel (E-mail)

I've been following the thread about disk data consistency with some
interest.  Given that many IDE disk drives may choose to hold data in their
write buffers before actually writing it to disk, and given that the
ordering of the writes may not be the same as the OS or application expects,
the only obvious way I can see to overcome this, and ensure the data is
truly written to the physical platters without disabling write caching is to
overwhelm the disk drive with more data than can fit in its internal write
buffer.

So, if you have an IDE disk with an 8MB cache, guess what: send it an 8MB
chunk of random data to write out when you do an fsync().  Better still,
locate this 8MB as close to the middle of the travel of its heads, so that
performance is not affected any more than necessary.  If the drive firmware
uses a LILO or LRU policy to determine when to do its disk writes,
overwhelming its buffers should ensure that the actual data you sent to it
gets written out.

Of course, guessing the disk drive write buffer size and trying not to kill
system I/O performance with all these writes is another question entirely
... sigh !!!


Jason


* Re: [Qemu-devel] Ensuring data is written to disk
  2006-08-01  0:11 [Qemu-devel] Ensuring data is written to disk Armistead, Jason
@ 2006-08-01 10:17 ` Jamie Lokier
  2006-08-01 10:45   ` Jens Axboe
From: Jamie Lokier @ 2006-08-01 10:17 UTC (permalink / raw)
  To: qemu-devel

Armistead, Jason wrote:
> I've been following the thread about disk data consistency with some
> interest.  Given that many IDE disk drives may choose to hold data in their
> write buffers before actually writing it to disk, and given that the
> ordering of the writes may not be the same as the OS or application expects,
> the only obvious way I can see to overcome this, and ensure the data is
> truly written to the physical platters without disabling write caching is to
> overwhelm the disk drive with more data than can fit in its internal write
> buffer.
> 
> So, if you have an IDE disk with an 8MB cache, guess what: send it an 8MB
> chunk of random data to write out when you do an fsync().  Better still,
> locate this 8MB as close to the middle of the travel of its heads, so that
> performance is not affected any more than necessary.  If the drive firmware
> uses a LILO or LRU policy to determine when to do its disk writes,
> overwhelming its buffers should ensure that the actual data you sent to it
> gets written out.

It doesn't work.

I thought that too, for a while, as a way to avoid sending CACHEFLUSH
commands for fs journal ordering when there is a lot of data being
written anyway.

But there is no guarantee that the drive uses a LILO or LRU policy,
and if the firmware is optimised for cache performance over a range of
benchmarks, it won't use those - there are better strategies.

You could write 8MB to the drive, but it could easily pass through the
cache without evicting some of the other data you want written.
_Especially_ if the 8MB is written to an area in the middle of the
head sweep.

> Of course, guessing the disk drive write buffer size and trying not to kill
> system I/O performance with all these writes is another question entirely
> ... sigh !!!

If you just want to evict all data from the drive's cache, and don't
actually have other data to write, there is a CACHEFLUSH command you
can send to the drive which will be more dependable than writing as
much data as the cache size.

-- Jamie


* Re: [Qemu-devel] Ensuring data is written to disk
  2006-08-01 10:17 ` Jamie Lokier
@ 2006-08-01 10:45   ` Jens Axboe
  2006-08-01 14:17     ` Jamie Lokier
From: Jens Axboe @ 2006-08-01 10:45 UTC (permalink / raw)
  To: qemu-devel

On Tue, Aug 01 2006, Jamie Lokier wrote:
> > Of course, guessing the disk drive write buffer size and trying not to kill
> > system I/O performance with all these writes is another question entirely
> > ... sigh !!!
> 
> If you just want to evict all data from the drive's cache, and don't
> actually have other data to write, there is a CACHEFLUSH command you
> can send to the drive which will be more dependable than writing as
> much data as the cache size.

Exactly, and this is what the OS fsync() should do once the drive has
acknowledged that the data has been written (to cache). At least
reiserfs w/barriers on Linux does this.

Random write tricks are worthless, as you cannot make any assumptions
about what the drive firmware will do.

-- 
Jens Axboe


* Re: [Qemu-devel] Ensuring data is written to disk
  2006-08-01 10:45   ` Jens Axboe
@ 2006-08-01 14:17     ` Jamie Lokier
  2006-08-01 19:05       ` Jens Axboe
From: Jamie Lokier @ 2006-08-01 14:17 UTC (permalink / raw)
  To: qemu-devel

Jens Axboe wrote:
> On Tue, Aug 01 2006, Jamie Lokier wrote:
> > > Of course, guessing the disk drive write buffer size and trying not to kill
> > > system I/O performance with all these writes is another question entirely
> > > ... sigh !!!
> > 
> > If you just want to evict all data from the drive's cache, and don't
> > actually have other data to write, there is a CACHEFLUSH command you
> > can send to the drive which will be more dependable than writing as
> > much data as the cache size.
> 
> Exactly, and this is what the OS fsync() should do once the drive has
> acknowledged that the data has been written (to cache). At least
> reiserfs w/barriers on Linux does this.

1. Are you sure this happens, w/ reiserfs on Linux, even if the disk
   is an SATA or SCSI type that supports ordered tagged commands?  My
   understanding is that barriers force an ordering between write
   commands, and that CACHEFLUSH is used only with disks that don't have
   more sophisticated write ordering commands.  Is the data still
   committed to the disk platter before fsync() returns on those?

2. Do you know if ext3 (in ordered mode) w/barriers on Linux does it too,
   for in-place writes which don't modify the inode and therefore don't
   have a journal entry?

On Darwin, fsync() does not issue CACHEFLUSH to the drive.  Instead,
it has an fcntl F_FULLSYNC which does that, which is documented in
Darwin's fsync() page as working with all Darwin's filesystems,
provided the hardware honours CACHEFLUSH or the equivalent.

From what little documentation I've found, on Linux it appears to be
much less predictable.  It seems that some filesystems, with some
kernel versions, and some mount options, on some types of disk, with
some drive settings, will commit data to a platter before fsync()
returns, and others won't.  And an application calling fsync() has no
easy way to find out.  Have I got this wrong?

ps. (An aside question): do you happen to know of a good patch which
implements IDE barriers w/ ext3 on 2.4 kernels?  I found a patch by
googling, but it seemed that the ext3 parts might not be finished, so
I don't trust it.  I've found turning off the IDE write cache makes
writes safe, but with a huge performance cost.

Thanks,
-- Jamie


* Re: [Qemu-devel] Ensuring data is written to disk
  2006-08-01 14:17     ` Jamie Lokier
@ 2006-08-01 19:05       ` Jens Axboe
  2006-08-01 21:50         ` Jamie Lokier
From: Jens Axboe @ 2006-08-01 19:05 UTC (permalink / raw)
  To: qemu-devel

On Tue, Aug 01 2006, Jamie Lokier wrote:
> Jens Axboe wrote:
> > On Tue, Aug 01 2006, Jamie Lokier wrote:
> > > > Of course, guessing the disk drive write buffer size and trying not to kill
> > > > system I/O performance with all these writes is another question entirely
> > > > ... sigh !!!
> > > 
> > > If you just want to evict all data from the drive's cache, and don't
> > > actually have other data to write, there is a CACHEFLUSH command you
> > > can send to the drive which will be more dependable than writing as
> > > much data as the cache size.
> > 
> > Exactly, and this is what the OS fsync() should do once the drive has
> > acknowledged that the data has been written (to cache). At least
> > reiserfs w/barriers on Linux does this.
> 
> 1. Are you sure this happens, w/ reiserfs on Linux, even if the disk
>    is an SATA or SCSI type that supports ordered tagged commands?  My
>    understanding is that barriers force an ordering between write
>    commands, and that CACHEFLUSH is used only with disks that don't have
>    more sophisticated write ordering commands.  Is the data still
>    committed to the disk platter before fsync() returns on those?

No SATA drive supports ordered tags; that is a SCSI-only property.  Barrier
writes are a separate thing; reiserfs probably ties the two together because
it needs to know whether the flush cache command works as expected.  Drives
are funny sometimes...

For SATA you always need at least one cache flush (you need one if you
have the FUA/Forced Unit Access write available, you need two if not).

> 2. Do you know if ext3 (in ordered mode) w/barriers on Linux does it too,
>    for in-place writes which don't modify the inode and therefore don't
>    have a journal entry?

I don't think that it does, however it may have changed. A quick grep
would seem to indicate that it has not changed.

> On Darwin, fsync() does not issue CACHEFLUSH to the drive.  Instead,
> it has an fcntl F_FULLSYNC which does that, which is documented in
> Darwin's fsync() page as working with all Darwin's filesystems,
> provided the hardware honours CACHEFLUSH or the equivalent.

That seems somewhat strange to me, I'd much rather be able to say that
fsync() itself is safe. An added fcntl hack doesn't really help the
applications that already rely on the correct behaviour.

> From what little documentation I've found, on Linux it appears to be
> much less predictable.  It seems that some filesystems, with some
> kernel versions, and some mount options, on some types of disk, with
> some drive settings, will commit data to a platter before fsync()
> returns, and others won't.  And an application calling fsync() has no
> easy way to find out.  Have I got this wrong?

Nope, I'm afraid that is pretty much true... reiser and (it looks like,
just grepped) XFS have the best support for this.  Unfortunately I don't
think the user can actually tell if the OS does the right thing, outside of
running a blktrace and verifying that it actually sends a flush cache
down the queue.

> ps. (An aside question): do you happen to know of a good patch which
> implements IDE barriers w/ ext3 on 2.4 kernels?  I found a patch by
> googling, but it seemed that the ext3 parts might not be finished, so
> I don't trust it.  I've found turning off the IDE write cache makes
> writes safe, but with a huge performance cost.

The hard part (the IDE code) can be grabbed from the SLES8 latest
kernels, I developed and tested the code there. That also has the ext3
bits, IIRC.

-- 
Jens Axboe


* Re: [Qemu-devel] Ensuring data is written to disk
  2006-08-01 19:05       ` Jens Axboe
@ 2006-08-01 21:50         ` Jamie Lokier
  2006-08-02  6:51           ` Jens Axboe
From: Jamie Lokier @ 2006-08-01 21:50 UTC (permalink / raw)
  To: qemu-devel

Jens Axboe wrote:
> > > > If you just want to evict all data from the drive's cache, and don't
> > > > actually have other data to write, there is a CACHEFLUSH command you
> > > > can send to the drive which will be more dependable than writing as
> > > > much data as the cache size.
> > > 
> > > Exactly, and this is what the OS fsync() should do once the drive has
> > > acknowledged that the data has been written (to cache). At least
> > > reiserfs w/barriers on Linux does this.
> > 
> > 1. Are you sure this happens, w/ reiserfs on Linux, even if the disk
> >    is an SATA or SCSI type that supports ordered tagged commands?  My
> >    understanding is that barriers force an ordering between write
> >    commands, and that CACHEFLUSH is used only with disks that don't have
> >    more sophisticated write ordering commands.  Is the data still
> >    committed to the disk platter before fsync() returns on those?
> 
> No SATA drive supports ordered tags, that is a SCSI only property. The
> barrier writes is a separate thing, probably reiser ties the two
> together because it needs to know if the flush cache command works as
> expected. Drives are funny sometimes...
> 
> For SATA you always need at least one cache flush (you need one if you
> have the FUA/Forced Unit Access write available, you need two if not).

Well my question wasn't intended to be specific to ATA (sorry if that
wasn't clear), but a general question about writing to disks on Linux.

And I don't understand your answer.  Are you saying that reiserfs on
Linux (presumably 2.6) commits data (and file metadata) to disk
platters before returning from fsync(), for all types of disk
including PATA, SATA and SCSI?  Or if not, is that a known property of
PATA only, or PATA and SATA only?  (And in all cases, presumably only
"ordinary" controllers can be depended on, not RAID controllers or
USB/Firewire bridges which ignore cache flushes for no good reason).

> > 2. Do you know if ext3 (in ordered mode) w/barriers on Linux does it too,
> >    for in-place writes which don't modify the inode and therefore don't
> >    have a journal entry?
> 
> I don't think that it does, however it may have changed. A quick grep
> would seem to indicate that it has not changed.

Ew.  What do databases do to be reliable then?  Or aren't they, on Linux?

> > On Darwin, fsync() does not issue CACHEFLUSH to the drive.  Instead,
> > it has an fcntl F_FULLSYNC which does that, which is documented in
> > Darwin's fsync() page as working with all Darwin's filesystems,
> > provided the hardware honours CACHEFLUSH or the equivalent.
> 
> That seems somewhat strange to me, I'd much rather be able to say that
> fsync() itself is safe. An added fcntl hack doesn't really help the
> applications that already rely on the correct behaviour.

According to the Darwin fsync(2) man page, it claims Darwin is the
only OS which has a facility to commit the data to disk platters.
(And it claims to do this with IDE, SCSI and FibreChannel.  With
journalling filesystems, it requests the journal to do the commit but
the cache flush still ultimately reaches the disk.  Sounds like a good
implementation to me).

SQLite (a nice open source database) will use F_FULLSYNC on Darwin to
do this, and it appears to add a large performance penalty relative to
using fsync() alone.  People noticed and wondered why.

Other OSes show similar performance as Darwin with fsync() only.

So it looks like the man page is probably accurate: other OSes,
particularly including Linux, don't commit the data reliably to disk
platters when using fsync().

In which case, I'd imagine that's why Darwin has a separate option,
because if Darwin's fsync() was many times slower than all the other
OSes, most people would take that as a sign of a badly performing OS,
rather than understanding the benefits.

> > From what little documentation I've found, on Linux it appears to be
> > much less predictable.  It seems that some filesystems, with some
> > kernel versions, and some mount options, on some types of disk, with
> > some drive settings, will commit data to a platter before fsync()
> > returns, and others won't.  And an application calling fsync() has no
> > easy way to find out.  Have I got this wrong?
> 
> Nope, I'm afraid that is pretty much true... reiser and (it looks like,
> just grepped) XFS has best support for this. Unfortunately I don't think
> the user can actually tell if the OS does the right thing, outside of
> running a blktrace and verifying that it actually sends a flush cache
> down the queue.

Ew.  So what do databases on Linux do?  Or are database commits
unreliable because of this?

> > ps. (An aside question): do you happen to know of a good patch which
> > implements IDE barriers w/ ext3 on 2.4 kernels?  I found a patch by
> > googling, but it seemed that the ext3 parts might not be finished, so
> > I don't trust it.  I've found turning off the IDE write cache makes
> > writes safe, but with a huge performance cost.
> 
> The hard part (the IDE code) can be grabbed from the SLES8 latest
> kernels, I developed and tested the code there. That also has the ext3
> bits, IIRC.

Thanks muchly!  I will definitely take a look at that.  I'm working on
a uClinux project which must use a 2.4 kernel, and performance with
write cache off has been a real problem.  And I've seen fs corruption
after power cycles with write cache on many times, as expected.

It's a shame the ext3 bits don't do fsync() to the platter though. :-/

To reliably commit data to an ext3 file, should we do ioctl(block_dev,
HDIO_SET_WCACHE, 1) on 2.6 kernels on IDE?  (The side effects look to
me like they may create a barrier then flush the cache, even when it's
already enabled, but only on 2.6 kernels).  Or is there a better way?
(I don't see any way to do it on vanilla 2.4 kernels).

Should we change to only reiserfs and expect fsync() to commit data
reliably only with that fs?  I realise this is a lot of difficult
questions, that apply to more than just Qemu...

Still, the answers are relevant to Qemu and reliably emulating a disk
on Linux.  And relevant to most database users, I should think.

Thanks again,
-- Jamie


* Re: [Qemu-devel] Ensuring data is written to disk
  2006-08-01 21:50         ` Jamie Lokier
@ 2006-08-02  6:51           ` Jens Axboe
  2006-08-02 13:28             ` Jamie Lokier
  2006-08-07 13:11             ` R. Armiento
From: Jens Axboe @ 2006-08-02  6:51 UTC (permalink / raw)
  To: qemu-devel

On Tue, Aug 01 2006, Jamie Lokier wrote:
> Jens Axboe wrote:
> > > > > If you just want to evict all data from the drive's cache, and don't
> > > > > actually have other data to write, there is a CACHEFLUSH command you
> > > > > can send to the drive which will be more dependable than writing as
> > > > > much data as the cache size.
> > > > 
> > > > Exactly, and this is what the OS fsync() should do once the drive has
> > > > acknowledged that the data has been written (to cache). At least
> > > > reiserfs w/barriers on Linux does this.
> > > 
> > > 1. Are you sure this happens, w/ reiserfs on Linux, even if the disk
> > >    is an SATA or SCSI type that supports ordered tagged commands?  My
> > >    understanding is that barriers force an ordering between write
> > >    commands, and that CACHEFLUSH is used only with disks that don't have
> > >    more sophisticated write ordering commands.  Is the data still
> > >    committed to the disk platter before fsync() returns on those?
> > 
> > No SATA drive supports ordered tags, that is a SCSI only property. The
> > barrier writes is a separate thing, probably reiser ties the two
> > together because it needs to know if the flush cache command works as
> > expected. Drives are funny sometimes...
> > 
> > For SATA you always need at least one cache flush (you need one if you
> > have the FUA/Forced Unit Access write available, you need two if not).
> 
> Well my question wasn't intended to be specific to ATA (sorry if that
> wasn't clear), but a general question about writing to disks on Linux.
> 
> And I don't understand your answer.  Are you saying that reiserfs on
> Linux (presumably 2.6) commits data (and file metadata) to disk
> platters before returning from fsync(), for all types of disk
> including PATA, SATA and SCSI?  Or if not, is that a known property of
> PATA only, or PATA and SATA only?  (And in all cases, presumably only
> "ordinary" controllers can be depended on, not RAID controllers or
> USB/Firewire bridges which ignore cache flushes for no good reason).

blkdev_issue_flush() is brutal, but it works on SATA/PATA/SCSI. So yes,
it should be reliable.

> > > 2. Do you know if ext3 (in ordered mode) w/barriers on Linux does it too,
> > >    for in-place writes which don't modify the inode and therefore don't
> > >    have a journal entry?
> > 
> > I don't think that it does, however it may have changed. A quick grep
> > would seem to indicate that it has not changed.
> 
> Ew.  What do databases do to be reliable then?  Or aren't they, on Linux?

They probably run on better storage than commodity SATA drives with
write back caching enabled. To my knowledge, Linux is one of the only OSes
that even attempt to fix this.

> > > On Darwin, fsync() does not issue CACHEFLUSH to the drive.  Instead,
> > > it has an fcntl F_FULLSYNC which does that, which is documented in
> > > Darwin's fsync() page as working with all Darwin's filesystems,
> > > provided the hardware honours CACHEFLUSH or the equivalent.
> > 
> > That seems somewhat strange to me, I'd much rather be able to say that
> > fsync() itself is safe. An added fcntl hack doesn't really help the
> > applications that already rely on the correct behaviour.
> 
> According to the Darwin fsync(2) man page, it claims Darwin is the
> only OS which has a facility to commit the data to disk platters.
> (And it claims to do this with IDE, SCSI and FibreChannel.  With
> journalling filesystems, it requests the journal to do the commit but
> the cache flush still ultimately reaches the disk.  Sounds like a good
> implementation to me).

The implementation may be nice, but it's the idea that is appalling to
me. But it sounds like the Darwin man page is out of date, or at least
untrue.

> SQLite (a nice open source database) will use F_FULLSYNC on Darwin to
> do this, and it appears to add a large performance penalty relative to
> using fsync() alone.  People noticed and wondered why.

Disk cache flushes are nasty, they stall everything. But it's still
typically faster than disabling write back caching, so...

> Other OSes show similar performance as Darwin with fsync() only.
> 
> So it looks like the man page is probably accurate: other OSes,
> particularly including Linux, don't commit the data reliably to disk
> platters when using fsync().

How did you reach that conclusion? reiser certainly does it if you have
barriers enabled (which you need anyways to be safe with write back
caching), and with a little investigation we can perhaps conclude that
XFS is safe as well.

> In which case, I'd imagine that's why Darwin has a separate option,
> because if Darwin's fsync() was many times slower than all the other
> OSes, most people would take that as a sign of a badly performing OS,
> rather than understanding the benefits.

That sounds like marketing driven engineering, nice. It requires app
changes, which is pretty silly. I would much rather have a way of just
enabling/disabling full flush on a per-device basis, you could use the
cache type as the default indicator of whether to issue the cache flush
or not. Then let the admin override it, if he wants to run unsafe but
faster.

> > > From what little documentation I've found, on Linux it appears to be
> > > much less predictable.  It seems that some filesystems, with some
> > > kernel versions, and some mount options, on some types of disk, with
> > > some drive settings, will commit data to a platter before fsync()
> > > returns, and others won't.  And an application calling fsync() has no
> > > easy way to find out.  Have I got this wrong?
> > 
> > Nope, I'm afraid that is pretty much true... reiser and (it looks like,
> > just grepped) XFS has best support for this. Unfortunately I don't think
> > the user can actually tell if the OS does the right thing, outside of
> > running a blktrace and verifying that it actually sends a flush cache
> > down the queue.
> 
> Ew.  So what do databases on Linux do?  Or are database commits
> unreliable because of this?

See above.

> > > ps. (An aside question): do you happen to know of a good patch which
> > > implements IDE barriers w/ ext3 on 2.4 kernels?  I found a patch by
> > > googling, but it seemed that the ext3 parts might not be finished, so
> > > I don't trust it.  I've found turning off the IDE write cache makes
> > > writes safe, but with a huge performance cost.
> > 
> > The hard part (the IDE code) can be grabbed from the SLES8 latest
> > kernels, I developed and tested the code there. That also has the ext3
> > bits, IIRC.
> 
> Thanks muchly!  I will definitely take a look at that.  I'm working on
> a uClinux project which must use a 2.4 kernel, and performance with
> write cache off has been a real problem.  And I've seen fs corruption
> after power cycles with write cache on many times, as expected.

No problem.

> It's a shame the ext3 bits don't do fsync() to the platter though. :-/

It really is; apparently none of the ext3 guys care about write back
caching problems. The only guy wanting to help with the ext3 bits was
Andrew. In the reiserfs guys' favor, they have actively been pursuing
solutions to this problem. And XFS recently caught up and should be just
as good on the barrier side; I have yet to verify the fsync() part.

> To reliably commit data to an ext3 file, should we do ioctl(block_dev,
> HDIO_SET_WCACHE, 1) on 2.6 kernels on IDE?  (The side effects look to

Did you mean (..., 0)? And yes, it looks right now like fsync() on ext3
isn't any better than on other OSes, so disabling write back caching
is the safest.

> me like they may create a barrier then flush the cache, even when it's
> already enabled, but only on 2.6 kernels).  Or is there a better way?
> (I don't see any way to do it on vanilla 2.4 kernels).

2.4 vanilla doesn't have barrier support, unfortunately.

> Should we change to only reiserfs and expect fsync() to commit data
> reliably only with that fs?  I realise this is a lot of difficult
> questions, that apply to more than just Qemu...

Yes, reiser is the only one that works reliably across power loss with
write back caching for the journal commits as well as fsync guarantees.

> Still, the answers are relevant to Qemu and reliably emulating a disk
> on Linux.  And relevant to most database users, I should think.

Indeed, it would be nice if someone (whistles) would write up a note
about the current state of things...

-- 
Jens Axboe


* Re: [Qemu-devel] Ensuring data is written to disk
  2006-08-02  6:51           ` Jens Axboe
@ 2006-08-02 13:28             ` Jamie Lokier
  2006-08-02 15:56               ` Bill C. Riemers
  2006-08-07 13:11             ` R. Armiento
From: Jamie Lokier @ 2006-08-02 13:28 UTC (permalink / raw)
  To: qemu-devel

Jens Axboe wrote:
> > > For SATA you always need at least one cache flush (you need one if you
> > > have the FUA/Forced Unit Access write available, you need two if not).
> > 
> > Well my question wasn't intended to be specific to ATA (sorry if that
> > wasn't clear), but a general question about writing to disks on Linux.
> > 
> > And I don't understand your answer.  Are you saying that reiserfs on
> > Linux (presumably 2.6) commits data (and file metadata) to disk
> > platters before returning from fsync(), for all types of disk
> > including PATA, SATA and SCSI?  Or if not, is that a known property of
> > PATA only, or PATA and SATA only?  (And in all cases, presumably only
> > "ordinary" controllers can be depended on, not RAID controllers or
> > USB/Firewire bridges which ignore cache flushes for no good reason).
> 
> blkdev_issue_flush() is brutal, but it works on SATA/PATA/SCSI. So yes,
> it should be reliable.

Ah, thanks.  I've looked at that bit of reiserfs, xfs and ext3 now.

It looks like adding a single call to blkdev_issue_flush() at the end
of ext3_sync_file() would do the trick.  I'm surprised that one-line
patch isn't in there already.
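For illustration, that one-liner would look roughly like this against the
2.6-era interfaces (an untested sketch, not an actual patch; the real
function's journal logic is elided):

```c
/* Sketch only: the change described above, at the end of
 * ext3_sync_file() in fs/ext3/fsync.c (2.6-era signatures). */
static int ext3_sync_file(struct file *file, struct dentry *dentry,
                          int datasync)
{
	struct inode *inode = dentry->d_inode;
	int ret = 0;

	/* ... existing code: force a journal commit or sync the inode ... */

	/* Proposed addition: push the drive's write cache to the platter. */
	blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
	return ret;
}
```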

Of course that doesn't help with writing an application to reliably
commit on existing systems.

> > > > 2. Do you know if ext3 (in ordered mode) w/barriers on Linux does it too,
> > > >    for in-place writes which don't modify the inode and therefore don't
> > > >    have a journal entry?
> > > 
> > > I don't think that it does, however it may have changed. A quick grep
> > > would seem to indicate that it has not changed.
> > 
> > Ew.  What do databases do to be reliable then?  Or aren't they, on Linux?
> 
> They probably run on better storage than commodity SATA drives with
> write back caching enabled. To my knowledge, Linux is one of the only OS
> that even attempts to fix this.

I would imagine most of the MySQL databases backing small web sites
run on commodity PATA or SATA drives, and that most people have
assumed fsync() to be good enough for database commits in the absence
of hardware failure, or when one disk goes down in a RAID.  Time to
correct those misassumption!

> > > > On Darwin, fsync() does not issue CACHEFLUSH to the drive.  Instead,
> > > > it has an fcntl F_FULLSYNC which does that, which is documented in
> > > > Darwin's fsync() page as working with all Darwin's filesystems,
> > > > provided the hardware honours CACHEFLUSH or the equivalent.
> > > 
> > > That seems somewhat strange to me, I'd much rather be able to say that
> > > fsync() itself is safe. An added fcntl hack doesn't really help the
> > > applications that already rely on the correct behaviour.
> > 
> > According to the Darwin fsync(2) man page, it claims Darwin is the
> > only OS which has a facility to commit the data to disk platters.
> > (And it claims to do this with IDE, SCSI and FibreChannel.  With
> > journalling filesystems, it requests the journal to do the commit but
> > the cache flush still ultimately reaches the disk.  Sounds like a good
> > implementation to me).
> 
> The implementation may be nice, but it's the idea that is appalling to
> me. But it sounds like the Darwin man page is out of date, or at least
> untrue.
> 
> > SQLite (a nice open source database) will use F_FULLSYNC on Darwin to
> > do this, and it appears to add a large performance penalty relative to
> > using fsync() alone.  People noticed and wondered why.
> 
> Disk cache flushes are nasty, they stall everything. But it's still
> typically faster than disabling write back caching, so...

I agree that it's nasty.  But then, the fsync() interface is rather
sub-optimal.  E.g. something like sendmail which writes a new file
needs to fsync() on the file _and_ its parent directory.  You don't
want two disk flushes then, just one after both fsync() calls have
completed.  Similarly if you're doing anything where you want to
commit data to more than one file.  An fsync_multi() interface would
be more efficient.

> > Other OSes show similar performance as Darwin with fsync() only.
> > 
> > So it looks like the man page is probably accurate: other OSes,
> > particularly including Linux, don't commit the data reliably to disk
> > platters when using fsync().
> 
> How did you reach that conclusion?

From seeing the reported timings for SQLite on Linux and Darwin
with/without F_FULLSYNC.  The Linux timings were similar to Darwin
without F_FULLSYNC.  Others and myself assumed the timings are
probably I/O bound, and reflect the transactions going to disk.  But
it could be Darwin being slower :-)

> reiser certainly does it if you have barriers enabled (which you
> need anyways to be safe with write back caching), and with a little
> investigation we can perhaps conclude that XFS is safe as well.

Yes, reiser and XFS look quite convincing.  Although I notice the
blkdev_issue_flush is conditional in both, and the condition is
non-trivial.  I'll assume the authors thought specifically about this.

> > In which case, I'd imagine that's why Darwin has a separate option,
> > because if Darwin's fsync() was many times slower than all the other
> > OSes, most people would take that as a sign of a badly performing OS,
> > rather than understanding the benefits.
> 
> That sounds like marketing driven engineering, nice. It requires app
> changes, which is pretty silly. I would much rather have a way of just
> enabling/disabling full flush on a per-device basis, you could use the
> cache type as the default indicator of whether to issue the cache flush
> or not. Then let the admin override it, if he wants to run unsafe but
> faster.

I agree, that makes sense to me too.

> > > > from what little documentation I've found, on Linux it appears to be
> > > > much less predictable.  It seems that some filesystems, with some
> > > > kernel versions, and some mount options, on some types of disk, with
> > > > some drive settings, will commit data to a platter before fsync()
> > > > returns, and others won't.  And an application calling fsync() has no
> > > > easy way to find out.  Have I got this wrong?
> > > 
> > > Nope, I'm afraid that is pretty much true... reiser and (it looks like,
> > > just grepped) XFS has best support for this. Unfortunately I don't think
> > > the user can actually tell if the OS does the right thing, outside of
> > > running a blktrace and verifying that it actually sends a flush cache
> > > down the queue.
> > 
> > Ew.  So what do databases on Linux do?  Or are database commits
> > unreliable because of this?
> 
> See above.

I conclude that database commits _are_ unreliable on Linux on a
disturbingly large number of smaller setups.

With ext3 on 2.6 and IDE write cache enabled, fsync() does not even
guarantee the ordering of writes, let alone commit them properly.
This is because it omits a journal commit (and hence IDE barrier), if
the data writes haven't changed the inode, which they don't if it's
within the 1-second mtime granularity.

O_SYNC on ext3 suffers the same problems.  (I don't know if O_SYNC
commits data to platters on reiser and XFS, or maintains write
ordering; I guess that fsync() should be called when those are
needed).

Considering the marketing of ext3 as offering data integrity, I'm
disappointed.

An ugly workaround suggests itself, which is to forcibly modify the
inode after writing and before calling fsync(): write, utime, utime,
fsync.  As a side effect of the journal barrier, it will cause a cache
flush to disk.

> > > > ps. (An aside question): do you happen to know of a good patch which
> > > > implements IDE barriers w/ ext3 on 2.4 kernels?  I found a patch by
> > > > googling, but it seemed that the ext3 parts might not be finished, so
> > > > I don't trust it.  I've found turning off the IDE write cache makes
> > > > writes safe, but with a huge performance cost.
> > > 
> > > The hard part (the IDE code) can be grabbed from the SLES8 latest
> > > kernels, I developed and tested the code there. That also has the ext3
> > > bits, IIRC.
> > 
> > Thanks muchly!  I will definitely take a look at that.  I'm working on
> > a uClinux project which must use a 2.4 kernel, and performance with
> > write cache off has been a real problem.  And I've seen fs corruption
> > after power cycles with write cache on many times, as expected.
> 
> No problem.

Have looked, it's most helpful, and I will use your patches.
Ironically, that 2.4 patch seems to include reliable commits w/ ext3,
because every fsync() commits a journal entry.  Er, I think.  (It was
optimised away in 2.6: http://lkml.org/lkml/2004/3/18/36).

> > It's a shame the ext3 bits don't do fsync() to the platter though. :-/
> 
> It really is, apparently none of the ext3 guys care about write back
> caching problems. The only guy wanting to help with the ext3 bits was
> Andrew. In the reiserfs guys favor, they have actively been pursuing
> solutions to this problem. And XFS recently caught up and should be just
> as good on the barrier side, I have yet to verify the fsync() part.

There's a call to blkdev_issue_flush in XFS fsync(), so it looks
promising.  I'm not sure what the condition for calling it depends on
though, but it seems likely the authors have thought it through.

> > To reliably commit data to an ext3 file, should we do ioctl(block_dev,
> > HDIO_SET_WCACHE, 1) on 2.6 kernels on IDE?  (The side effects look to
> 
> Did you mean (..., 0)? And yes, it looks like it right now that fsync()
> isn't any better than other OS on ext3, so disabling write back caching
> is the safest.

I meant (..., 1).  For some reason I thought the call to
update_ordered() in ide-disk.c issued a barrier, a convenient side
effect of HDIO_SET_WCACHE.  But on re-reading, it doesn't issue a
barrier.  So that's not a solution.

(..., 0) sucks performance wise.  I think calling utime to dirty the
inode prior to fsync() will work with ext3, but it's ugly for many
reasons, not least that it will work on IDE, but it won't work on
anything (e.g. SCSI) which uses ordered tags rather than flushes.

> > me like they may create a barrier then flush the cache, even when it's
> > already enabled, but only on 2.6 kernels).  Or is there a better way?
> > (I don't see any way to do it on vanilla 2.4 kernels).
> 
> 2.4 vanilla doesn't have barrier support, unfortunately.

I was wondering how to force an IDE cache flush on 2.4, from the
application after it's called fsync().  No barrier support implied.  I
guess there is some way to do it using the IDE taskfile ioctls?
Nothing is clear here, unfortunately.

I'm surprised blkdev_issue_flush (or the equivalent in 2.4) isn't
available to userspace through a block device ioctl.  There is
BLKFLSBUF which _almost_ pretends to do it, but that doesn't issue a
low-level disk flush, and it invalidates the read-cached data.

> > Should we change to only reiserfs and expect fsync() to commit data
> > reliably only with that fs?  I realise this is a lot of difficult
> > questions, that apply to more than just Qemu...
> 
> Yes, reiser is the only one that works reliably across power loss with
> write back caching for the journal commits as well as fsync guarantees.

I'll try it.  I see enough problems with ext3 on a tiny embedded
system (writes stalling for a long time, read-cached data being
re-read from disk every 5 seconds) that I was avoiding reiser because
I thought it would be more complicated.  That, and I have high faith
in e2fsck.  But given the problems with ext3, maybe I'll get better
embedded results with reiser :)

-- Jamie

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] Ensuring data is written to disk
  2006-08-02 13:28             ` Jamie Lokier
@ 2006-08-02 15:56               ` Bill C. Riemers
  0 siblings, 0 replies; 13+ messages in thread
From: Bill C. Riemers @ 2006-08-02 15:56 UTC (permalink / raw)
  To: qemu-devel


Just to throw in my two cents: I notice that on the Namesys website they
claim reiser4 is completely safe in the event of a power failure, while
reiserfs 3 still requires some recovery.  Apparently reiser4 somehow
arranges writes into sequences that form atomic events, so either the
whole change is there or none of it is.  I am not sure how this is
accomplished given the state of disk caching... perhaps that is why they
don't consider reiser4 ready for prime-time use.

Bill


On 8/2/06, Jamie Lokier <jamie@shareable.org> wrote:
> > Yes, reiser is the only one that works reliably across power loss with
> > write back caching for the journal commits as well as fsync guarantees.
>
> [rest of the full quote of the preceding message snipped]



* Re: [Qemu-devel] Ensuring data is written to disk
  2006-08-02  6:51           ` Jens Axboe
  2006-08-02 13:28             ` Jamie Lokier
@ 2006-08-07 13:11             ` R. Armiento
  2006-08-07 16:14               ` Bill C. Riemers
  2006-08-07 18:13               ` Thomas Steffen
  1 sibling, 2 replies; 13+ messages in thread
From: R. Armiento @ 2006-08-07 13:11 UTC (permalink / raw)
  To: qemu-devel


Jens Axboe wrote:
> On Tue, Aug 01 2006, Jamie Lokier wrote:
>> Should we change to only reiserfs and expect fsync() to commit data
>> reliably only with that fs?  I realise this is a lot of difficult
>> questions, that apply to more than just Qemu...
> 
> Yes, reiser is the only one that works reliably across power loss with
> write back caching for the journal commits as well as fsync guarantees.

So, just to get this further clarified:

Let's assume this typical website setup.  HARDWARE: commodity SATA/PATA;
the drive cache is not battery-backed.  HOST OS: a late Linux 2.6 kernel
(e.g. 2.6.15) running directly on the host, with a recent version of
database software (e.g. MySQL 5.1) on top.  Running in roughly
'production' conditions.

Now, if I understand the foregoing discussion: the *only* way of running 
this setup with 'full' transactional guarantees on power loss, without 
having to change/patch the Linux kernel, is to turn off write-caching? 
And that severely decreases performance.

To reiterate the foregoing discussion: fsync in ext3 only goes to the
drive cache. ReiserFS v3, which is included in the kernel, does not
guarantee data integrity on power loss. Reiser4 requires a kernel patch
that the developers do not yet recommend for production use, see e.g.
http://www.namesys.com/download.html . Furthermore (unless I am mistaken)
the XFS and JFS versions included in the kernel do not guarantee data
integrity on power loss (please reply and prove me wrong, even flames
are welcome :). Hence, the best bet is Jamie Lokier's one-line ext3
patch, see:
   http://lists.gnu.org/archive/html/qemu-devel/2006-08/msg00032.html
and then run ext3 in fully journaled mode (add data=journal to the mount
options). In the 'unusual' case where database integrity alone is enough,
and it is acceptable that committed transactions may disappear on power
loss, one can run ext3 in default mode.

This is somewhat surprising to me, given claims of data integrity made 
by both ext3 and MySQL documentation.

And then, on top of this, if one instead runs the database in a QEMU
guest with a late Linux 2.6 kernel, one is just making data loss more
likely, right? So QEMU is in no way to blame for any of this.

For people following this discussion (perhaps suitable for the QEMU docs):
to disable write caching on the HOST and GUEST OSes, make sure
       hdparm -W0 [device]
is run on each bootup. On Debian/Ubuntu you do this by editing
       /etc/hdparm.conf
and uncommenting the line:
       write_cache = off
However, this severely decreases performance. Also note: in MySQL the
MyISAM table type still offers no guarantee against data loss; you need
InnoDB for that.

Best regards,
Rickard


* Re: [Qemu-devel] Ensuring data is written to disk
  2006-08-07 13:11             ` R. Armiento
@ 2006-08-07 16:14               ` Bill C. Riemers
  2006-08-07 18:13               ` Thomas Steffen
  1 sibling, 0 replies; 13+ messages in thread
From: Bill C. Riemers @ 2006-08-07 16:14 UTC (permalink / raw)
  To: qemu-devel


I was talking to a friend at Red Hat.  He says they suggest using ext3,
but putting the journal on a small separate internal SCSI drive.  If you
do so, you will get far better performance and reliability than from
reiserfs.

Bill


On 8/7/06, R. Armiento <reply-2006@armiento.net> wrote:
> Now, if I understand the foregoing discussion: the *only* way of running
> this setup with 'full' transactional guarantees on power loss, without
> having to change/patch the Linux kernel, is to turn off write-caching?
>
> [rest of the full quote of the preceding message snipped]



* Re: [Qemu-devel] Ensuring data is written to disk
  2006-08-07 13:11             ` R. Armiento
  2006-08-07 16:14               ` Bill C. Riemers
@ 2006-08-07 18:13               ` Thomas Steffen
  2006-08-08  2:37                 ` R. Armiento
  1 sibling, 1 reply; 13+ messages in thread
From: Thomas Steffen @ 2006-08-07 18:13 UTC (permalink / raw)
  To: qemu-devel

On 8/7/06, R. Armiento <reply-2006@armiento.net> wrote:
> Lets assume this typical website setup: HARDWARE: commodity SATA/PATA;
> drive cache is not battery backed up. HOST OS: late Linux 2.6 kernel
> (e.g. 2.6.15), directly, on top of host, a recent version of database
> software (e.g. MySQL 5.1). Running in ~ 'production'.
>
> Now, if I understand the foregoing discussion: the *only* way of running
> this setup with 'full' transactional guarantees on power loss, without
> having to change/patch the Linux kernel, is to turn off write-caching?
> And that severely decreases performance.

And some IDE disks do not let you switch off write-caching. So as far
as I know, you need SCSI for transactional guarantees. SATA might
work, but since so many things "should work" and then don't in SATA, I
would be very careful.

> To reiterate the foregoing discussion: fsync in ext3 only goes to the
> drive cache. ResiserFS v3, which is included in the kernel, does not
> guarantee data integrity on power loss.

I have heard this before. Basically, the OS can interpret the fsync
command as a request to flush all caches, or it can interpret it as a
write barrier. The latter gives much higher performance and guarantees
the consistency of the disk content, but it does not guarantee
consistency with the rest of the world. My impression is that Linux
only does the latter, but I did not find a lot of information on this.

> This is somewhat surprising to me, given claims of data integrity made
> by both ext3 and MySQL documentation.

I don't have any problems with that. Both MySQL and ext3 are
transaction safe if used on a correct disk (SCSI). But if your disk
does not handle sync correctly, then the resulting system cannot be
transaction safe.

> And then, on top of this, if one instead runs the database in a QEMU
> with a late Linux 2.6 kernel, one are just making data-loss more likely,
> right? So QEMU is in no way to blame for any of this.

If qemu works correctly: yes. It would be interesting to test that.

> However, this severely decreases performance. Also note: in MySQL the
> MyISAM table type still does not guarantee no data loss; you need innoDB
> for that.

Correct, and MyISAM is much more popular, because it is faster.

Thomas


* Re: [Qemu-devel] Ensuring data is written to disk
  2006-08-07 18:13               ` Thomas Steffen
@ 2006-08-08  2:37                 ` R. Armiento
  0 siblings, 0 replies; 13+ messages in thread
From: R. Armiento @ 2006-08-08  2:37 UTC (permalink / raw)
  To: qemu-devel

Thomas Steffen wrote:
> On 8/7/06, R. Armiento <reply-2006@armiento.net> wrote:
> And some IDE disks do not let you switch off write-caching. So as far
> as I know, you need SCSI for transactional guarantees. 

I don't think the fact that there are some buggy drives/firmwares out 
there should be taken to mean that you cannot ever have transactional 
guarantees on SATA/PATA. It just means that you have to be careful to 
make sure your drives are not buggy.

> [write barrier] gives much higher performance and guarantees
> the consistency of the disk content, but it does not guarantee the
> consistency with the rest of the world. My impression was that Linux
> only does the latter, but I did not find a lot of information on this.

Right, and if you want full transaction guarantees in your database
(e.g. ACID, http://en.wikipedia.org/wiki/ACID ) on power loss, you cannot
rely on any such fancy handling of fsync: you *have* to wait for the data
to first reach the disk cache, and then flush the cache to make it go to
the disk platters. *Ext3* (rather than "Linux", as you say) currently has
no mount option to enable this (as far as I know; but the one-line patch
previously mentioned should enable it). However, I interpret Jens
Axboe as saying that Reiser4 in Linux may do this (by default?).

But if "full transaction guarantees" seems excessive, and 'merely' 
internal consistency of your database on power loss is acceptable: what 
file systems and mount options can be used on a later Linux 2.6 kernel? 
I would think ext3 in the non-default data=journal mode is what is 
needed, but the ext3/MySQL docs I have found are lacking on this topic.

> Both MySQL and ext3 are transaction safe if used on a correct disk (SCSI). 
> But if your disk does not handle sync correctly, then the resulting 
> system cannot be transaction safe.

By 'transaction safe', do you mean 'internal consistency'? (That is not
the definition I use; see above.) On SCSI hardware with a write cache and
no battery backup, won't data in the cache be irrecoverably lost just as
on SATA/PATA? Thus, while the file system (if you journal metadata) or
your database (if you also journal data) may be guaranteed to stay
consistent, I think you can still lose transactions that have been
committed.

I don't recall having seen warnings about this in database docs, hence my
surprise. (However, I might just not have been looking hard enough.)

//Rickard


end of thread, other threads:[~2006-08-08  2:14 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-08-01  0:11 [Qemu-devel] Ensuring data is written to disk Armistead, Jason
2006-08-01 10:17 ` Jamie Lokier
2006-08-01 10:45   ` Jens Axboe
2006-08-01 14:17     ` Jamie Lokier
2006-08-01 19:05       ` Jens Axboe
2006-08-01 21:50         ` Jamie Lokier
2006-08-02  6:51           ` Jens Axboe
2006-08-02 13:28             ` Jamie Lokier
2006-08-02 15:56               ` Bill C. Riemers
2006-08-07 13:11             ` R. Armiento
2006-08-07 16:14               ` Bill C. Riemers
2006-08-07 18:13               ` Thomas Steffen
2006-08-08  2:37                 ` R. Armiento

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.