linux-kernel.vger.kernel.org archive mirror
* How to know when file data has been flushed into disk?
@ 2006-04-07 15:42 Xin Zhao
  2006-04-07 15:53 ` Douglas McNaught
  2006-04-07 17:54 ` Zach Brown
  0 siblings, 2 replies; 9+ messages in thread
From: Xin Zhao @ 2006-04-07 15:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

Hi,

If a program accesses data like this:

1. open the file
2. write a lot of data into this file
3. close the file

Assume the underlying file system is ext3.

If ext3 is in data=ordered mode, all data will be forced directly
out to the main file system prior to its metadata being committed to
the journal.

So my questions are:
1. How will the file system be notified after all data has been
flushed into disk?

2. Unlike data=journal mode, in data=ordered mode the data could be
lost if the system crashes while data is being flushed to disk. When the
system reboots, does the journal contain the old metadata for undo?

3. Does sys_close() have to block until all data and metadata
are committed? If not, sys_close() may give the application the illusion
that the file was successfully written, which can cause the application
to take subsequent actions. However, the data flush could fail. In
this case, the file system seems to mislead the application. Is this true?
If so, any solutions?

Thanks in advance for your help!

-x


* Re: How to know when file data has been flushed into disk?
  2006-04-07 15:42 How to know when file data has been flushed into disk? Xin Zhao
@ 2006-04-07 15:53 ` Douglas McNaught
  2006-04-07 16:04   ` Xin Zhao
  2006-04-07 23:54   ` Ric Wheeler
  2006-04-07 17:54 ` Zach Brown
  1 sibling, 2 replies; 9+ messages in thread
From: Douglas McNaught @ 2006-04-07 15:53 UTC (permalink / raw)
  To: Xin Zhao; +Cc: linux-kernel, linux-fsdevel

"Xin Zhao" <uszhaoxin@gmail.com> writes:

> 3. Does sys_close() have to block until all data and metadata
> are committed? If not, sys_close() may give the application the illusion
> that the file was successfully written, which can cause the application
> to take subsequent actions. However, the data flush could fail. In
> this case, the file system seems to mislead the application. Is this true?
> If so, any solutions?

The fsync() call is the way to make sure written data has hit the
disk.  close() doesn't guarantee that.
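
A minimal sketch of that pattern in C, for illustration only. The helper names write_all() and save_durably() are hypothetical, not from this thread; the point is simply that success is reported only after fsync() itself has returned 0, and that the return value of close() is checked as well. As noted elsewhere in this thread, a drive write cache can still sit between fsync() and the platter.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write len bytes, retrying on short writes and EINTR. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: create or truncate `path`, write `data`, and
 * report success only once fsync() has returned 0.
 * Returns 0 on success, -1 on error (errno is preserved).
 */
int save_durably(const char *path, const char *data, size_t len)
{
    assert(path != NULL && data != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write_all(fd, data, len) < 0 || fsync(fd) < 0) {
        int saved = errno;   /* keep the original error across close() */
        close(fd);
        errno = saved;
        return -1;
    }

    /* close() can still report an error, so its result is returned too. */
    return close(fd);
}
```

An application would treat a -1 return as "the data may not be on disk" and avoid taking follow-up actions that depend on it.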

-Doug


* Re: How to know when file data has been flushed into disk?
  2006-04-07 15:53 ` Douglas McNaught
@ 2006-04-07 16:04   ` Xin Zhao
  2006-04-07 16:55     ` linux-os (Dick Johnson)
  2006-04-07 23:54   ` Ric Wheeler
  1 sibling, 1 reply; 9+ messages in thread
From: Xin Zhao @ 2006-04-07 16:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel

Thanks for your reply.

That makes sense. But ext3 at least needs to know when all data has
been flushed so that it can commit the metadata. The question is, how does
ext3 know that? The data flushing is done by the flush daemon. There has
got to be some way to notify ext3 that the data is flushed. Where is this
part of the code in the ext3 module?

Xin


* Re: How to know when file data has been flushed into disk?
  2006-04-07 16:04   ` Xin Zhao
@ 2006-04-07 16:55     ` linux-os (Dick Johnson)
  2006-04-07 17:19       ` Xin Zhao
  0 siblings, 1 reply; 9+ messages in thread
From: linux-os (Dick Johnson) @ 2006-04-07 16:55 UTC (permalink / raw)
  To: Xin Zhao; +Cc: linux-kernel, linux-fsdevel


In principle, you __never__ know that the data got to the
disk platter(s). Any database that thinks differently is
broken by design. You need transaction processing to be
assured that you have all the (correct) data available
in the database. Transaction processing provides atomic
stepping stones so that, in the event of a failure, the
transactions can be rolled back to the last complete one
and then restarted.

The simplest example is the use of a number of journal
files, each containing a record of the previous
transactions and enough information to roll back the
database to the point at which these files were saved.
These files are checksummed and saved in order. In the
event of a crash, these files are read to find the latest
readable one with a correct checksum. The database
manager uses the information in that file to roll back
the main database to the exact content at the time the
journal file was saved.

Once the database is restarted, any previous journal
files can be deleted as well as the bad ones that followed.
However, the journal file that was used to restart the
database is never deleted until it has been superseded
by another that worked in a database restart. That way,
there is always a way to get back to a clean database.
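
As a rough illustration of the "latest readable journal with a correct checksum" step, here is a small C sketch. Everything in it is an assumption made for the example: the struct journal_rec layout, the toy_checksum() function (a stand-in for a real CRC), and the in-memory array standing in for journal files on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one saved journal file. */
struct journal_rec {
    uint32_t seq;                  /* order in which the file was saved */
    uint32_t checksum;             /* checksum stored alongside the payload */
    const unsigned char *payload;  /* rollback information */
    size_t len;
};

/* Toy polynomial checksum; a real system would use a proper CRC. */
uint32_t toy_checksum(const unsigned char *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31u + p[i];
    return sum;
}

/*
 * Return the index of the newest record whose stored checksum matches
 * its payload, or -1 if no record verifies.
 */
int latest_valid(const struct journal_rec *recs, size_t n)
{
    assert(recs != NULL || n == 0);

    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (toy_checksum(recs[i].payload, recs[i].len) != recs[i].checksum)
            continue;  /* torn or corrupted file: skip it */
        if (best < 0 || recs[i].seq > recs[(size_t)best].seq)
            best = (int)i;
    }
    return best;
}
```

The record returned by latest_valid() plays the role of the journal file used to restart the database; files with a bad checksum are the ones that can be discarded afterwards.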

Cheers,
Dick Johnson
Penguin : Linux version 2.6.15.4 on an i686 machine (5589.42 BogoMips).
Warning : 98.36% of all statistics are fiction, book release in April.

* Re: How to know when file data has been flushed into disk?
  2006-04-07 16:55     ` linux-os (Dick Johnson)
@ 2006-04-07 17:19       ` Xin Zhao
  0 siblings, 0 replies; 9+ messages in thread
From: Xin Zhao @ 2006-04-07 17:19 UTC (permalink / raw)
  To: linux-os (Dick Johnson); +Cc: linux-kernel, linux-fsdevel

Thanks for the reply.

I think Douglas answered the third question, and I guess you are trying to
answer the first two. Maybe I don't get your point, but my
question is:

Since ext3 commits the transaction AFTER all data is flushed to
disk, it must know when the data flush is done. But how does ext3 know
that? Where can I find this code in the ext3 module?

Maybe software has no way to know when the data is really written to
the disk platters, since the hard drive has a cache too. But software (like
flushd) should know when it finishes sending the data to the hard drive. I
guess ext3 commits the transaction at that time. So the mysterious
thing to me is how ext3 gets notified that the data has been flushed.

Any further thoughts?

cheers,
Xin


* Re: How to know when file data has been flushed into disk?
  2006-04-07 15:42 How to know when file data has been flushed into disk? Xin Zhao
  2006-04-07 15:53 ` Douglas McNaught
@ 2006-04-07 17:54 ` Zach Brown
  2006-04-08  1:39   ` Xin Zhao
  1 sibling, 1 reply; 9+ messages in thread
From: Zach Brown @ 2006-04-07 17:54 UTC (permalink / raw)
  To: Xin Zhao; +Cc: linux-kernel, linux-fsdevel


> If a program accesses data like this:
> 
> 1. open the file
> 2. write a lot of data into this file

You don't say if this is an extending write or overwriting existing file
data.  I'm going to assume extending writes so that data=ordered kicks in.

> 3. close the file

> So my questions are:
> 1. How will the file system be notified after all data has been
> flushed into disk?

Look at phase 2 in journal_commit_transaction().  The kjournald thread
issues the writeback of the file data by walking t_sync_datalist and
then waits for the writeback to complete by using wait_on_buffer()
before committing the transaction.
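
The ordering described above (submit all data writes, wait for every one of them, only then commit) can be modeled with a toy user-space sketch. This is not the kernel's code: the toy_* types and functions are invented for illustration, with toy_submit() standing in for issuing writeback on the buffers found on t_sync_datalist and toy_wait() standing in for wait_on_buffer().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum buf_state { BUF_DIRTY, BUF_IN_FLIGHT, BUF_ON_DISK };

struct toy_buffer {
    enum buf_state state;
};

/* Hypothetical transaction: ordered-mode data buffers plus a commit flag. */
struct toy_txn {
    struct toy_buffer *data;
    size_t ndata;
    bool committed;
};

/* Start I/O on a dirty buffer and return immediately. */
void toy_submit(struct toy_buffer *b)
{
    if (b->state == BUF_DIRTY)
        b->state = BUF_IN_FLIGHT;
}

/* "Block" until the I/O on this buffer has completed. */
void toy_wait(struct toy_buffer *b)
{
    if (b->state == BUF_IN_FLIGHT)
        b->state = BUF_ON_DISK;
}

/* Commit only after every data buffer has been written and waited on. */
void toy_commit(struct toy_txn *t)
{
    /* Pass 1: issue all the data writes so they can proceed in parallel. */
    for (size_t i = 0; i < t->ndata; i++)
        toy_submit(&t->data[i]);

    /* Pass 2: wait for each write to finish. */
    for (size_t i = 0; i < t->ndata; i++)
        toy_wait(&t->data[i]);

    /* Ordered-mode invariant: no data is outstanding at commit time. */
    for (size_t i = 0; i < t->ndata; i++)
        assert(t->data[i].state == BUF_ON_DISK);

    t->committed = true;
}
```

In this model nothing "notifies" the filesystem asynchronously; the committing thread itself waits on each buffer, which is the answer to question 1.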

> 2. Unlike data=journal mode, in data=ordered mode the data could be
> lost if the system crashes while data is being flushed to disk. When the
> system reboots, does the journal contain the old metadata for undo?

No, ext3 isn't roll-backward.  It doesn't store the *old* data in the
journal and undo the change if it fails halfway through.  It's
roll-forward.  It stores the *new* data in the journal and replays
complete transactions in the journal that weren't moved out to their
final place on disk at the time of the crash.

So if the machine reboots during the writeback phase then the
transaction won't be committed yet and recovery won't replay that
transaction from the journal.  From the metadata's point of view the
file extension will never have happened.
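
A toy sketch of roll-forward recovery, to make the contrast with undo logging concrete. The structures are invented for this example and are far simpler than the real journal format: each journaled transaction carries *new* block contents plus a flag saying whether its commit record made it to disk, and recovery is assumed to stop at the first transaction without one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 4

/* One journaled update: the new contents for a metadata block. */
struct toy_update {
    size_t block;
    int new_value;
};

/* Hypothetical journaled transaction. */
struct toy_jtxn {
    struct toy_update updates[2];
    size_t nupdates;
    bool has_commit_record;  /* false if the crash came before commit */
};

/*
 * Replay complete transactions in order and stop at the first one
 * without a commit record. Returns the number of transactions replayed.
 * Nothing is ever undone: uncommitted transactions are simply ignored.
 */
size_t toy_recover(int disk[NBLOCKS], const struct toy_jtxn *journal, size_t n)
{
    size_t replayed = 0;

    for (size_t i = 0; i < n; i++) {
        if (!journal[i].has_commit_record)
            break;
        for (size_t u = 0; u < journal[i].nupdates; u++) {
            assert(journal[i].updates[u].block < NBLOCKS);
            disk[journal[i].updates[u].block] = journal[i].updates[u].new_value;
        }
        replayed++;
    }
    return replayed;
}
```

A crash during the data writeback phase corresponds to a transaction with has_commit_record == false: its metadata never reaches the disk array, so the file extension is invisible after recovery.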

> 3. Does sys_close() have to block until all data and metadata
> are committed?

No, and neither does sys_getpid() :)

> to take subsequent actions. However, the data flush could fail. In
> this case, the file system seems to mislead the application. Is this true?

No.  The application has no grounds for assuming that a successful
close() has synced previous operations to disk.  It's simply not part of
the API.

> If so, any solutions?

The application should rely on tools like fsync(), fdatasync(), O_SYNC,
mount -o sync, etc.  Whatever suits it best.
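
Two of those options side by side, as a hedged C sketch. The function names append_osync() and append_fdatasync() are made up for the example, and short writes are treated as errors to keep it brief. With O_SYNC each write() returns only after the data has been written out; with fdatasync() the application chooses the point at which it waits.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Option A: open with O_SYNC so that every write() is synchronous. */
int append_osync(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, buf, len);
    int rc = (n == (ssize_t)len) ? 0 : -1;  /* short write counts as failure */

    if (close(fd) < 0)
        rc = -1;
    return rc;
}

/* Option B: ordinary buffered write, then one explicit fdatasync(). */
int append_fdatasync(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, buf, len);
    int rc = (n == (ssize_t)len) ? 0 : -1;

    /* Flush the file data (and the metadata needed to read it back). */
    if (rc == 0 && fdatasync(fd) < 0)
        rc = -1;

    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

Option B tends to suit programs that issue many small writes and only need durability at specific points, since it pays the synchronous cost once rather than on every write().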

- z


* Re: How to know when file data has been flushed into disk?
  2006-04-07 15:53 ` Douglas McNaught
  2006-04-07 16:04   ` Xin Zhao
@ 2006-04-07 23:54   ` Ric Wheeler
  1 sibling, 0 replies; 9+ messages in thread
From: Ric Wheeler @ 2006-04-07 23:54 UTC (permalink / raw)
  To: Douglas McNaught; +Cc: Xin Zhao, linux-kernel, linux-fsdevel


Douglas McNaught wrote:

> "Xin Zhao" <uszhaoxin@gmail.com> writes:
>
>> 3. Does sys_close() have to block until all data and metadata
>> are committed? If not, sys_close() may give the application the illusion
>> that the file was successfully written, which can cause the application
>> to take subsequent actions. However, the data flush could fail. In
>> this case, the file system seems to mislead the application. Is this true?
>> If so, any solutions?
>
> The fsync() call is the way to make sure written data has hit the
> disk.  close() doesn't guarantee that.
>
> -Doug

You should also make sure, if you care about data recovery after a power 
outage, that you have either disabled the write cache on your drives or 
have a working write barrier.  Without this, fsync will move the data 
from the page cache to the disk's write cache where it is up to the 
drive firmware to write it back to permanent, safe storage on the disk 
platter.

ric



* Re: How to know when file data has been flushed into disk?
  2006-04-07 17:54 ` Zach Brown
@ 2006-04-08  1:39   ` Xin Zhao
  0 siblings, 0 replies; 9+ messages in thread
From: Xin Zhao @ 2006-04-08  1:39 UTC (permalink / raw)
  To: Zach Brown; +Cc: linux-kernel, linux-fsdevel

This answered all my questions! Many thanks! Will check the phase 2 code.

Xin



* RE: How to know when file data has been flushed into disk?
@ 2006-04-07 21:07 Michael Guo
  0 siblings, 0 replies; 9+ messages in thread
From: Michael Guo @ 2006-04-07 21:07 UTC (permalink / raw)
  To: Xin Zhao, linux-os (Dick Johnson); +Cc: linux-kernel, linux-fsdevel

1) Check the source code that follows the fsync() system call. I believe you will find more information there.
2) Dick is telling you that ext3, being based on journaling, is a crash-safe filesystem.


