linux-kernel.vger.kernel.org archive mirror
* Re: document ext3 requirements
       [not found]         ` <fa.hQTLXdIllf+hs4yQb092u6fowq0@ifi.uio.no>
@ 2009-01-04 19:08           ` Sitsofe Wheeler
  2009-01-04 19:31             ` Theodore Tso
  0 siblings, 1 reply; 86+ messages in thread
From: Sitsofe Wheeler @ 2009-01-04 19:08 UTC (permalink / raw)
  To: Theodore Tso, Duane Griffin, Valdis.Kletnieks,
	Martin MOKREJŠ,
	Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

Theodore Tso wrote:
> So what's the use case where people want to be able to mount a
> filesystem needing recovery read/only without running the journal?

Corrupted SD card[1] that's been locked to read only for recovery 
purposes without having the FS tear itself apart further?

Others seem to be saying that it is useful for forensics...

[1] http://pavelmachek.livejournal.com/68701.html


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04 19:08           ` document ext3 requirements Sitsofe Wheeler
@ 2009-01-04 19:31             ` Theodore Tso
  2009-01-04 22:40               ` Pavel Machek
  0 siblings, 1 reply; 86+ messages in thread
From: Theodore Tso @ 2009-01-04 19:31 UTC (permalink / raw)
  To: Sitsofe Wheeler
  Cc: Duane Griffin, Valdis.Kletnieks, Martin MOKREJŠ,
	Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

On Sun, Jan 04, 2009 at 07:08:01PM +0000, Sitsofe Wheeler wrote:
> Theodore Tso wrote:
>> So what's the use case where people want to be able to mount a
>> filesystem needing recovery read/only without running the journal?
>
> Corrupted SD card[1] that's been locked to read only for recovery  
> purposes without having the FS tear itself apart further?

In that case, the right answer is to copy the 32 GB SD card to hard
drive, and then operate on the hard drive.....  In general, if the
media has started going bad, the *first* thing you want to do is an
immediate copy of the media to some place stable.

> Others seem to be saying that it is useful for forensics...

Again, the best thing to do is a full image copy of the drive before
you do anything else.....

If someone wants to implement code that scans the journal and creates a
redirection map, so that whenever the filesystem needs to read from block
N it reads from block M instead, they should feel free to do so.  But
so far, the use cases people are talking about are all pretty rare,
which is probably why we don't have it at the moment.

In fact, it's probably possible to create this as a pure userspace
solution using devicemapper.

	 					- Ted
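The redirection-map idea above can be sketched roughly as follows. This is
only an illustration in Python, not ext3 or jbd code: the RedirectingReader
class, the build_redirect_map helper and the read_block callable are all
hypothetical names, and the (fs_block, journal_block) pairs are assumed to
have been extracted from committed transactions already. A real
implementation would have to parse the journal itself and honour revoke
records.

```python
from typing import Callable, Dict, Iterable, Tuple


def build_redirect_map(committed: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Build {filesystem block N -> journal block M}.

    `committed` is assumed to list (fs_block, journal_block) pairs in
    commit order, so a later copy of the same block overrides an earlier one.
    """
    redirect: Dict[int, int] = {}
    for fs_block, journal_block in committed:
        redirect[fs_block] = journal_block
    return redirect


class RedirectingReader:
    """Read-only view of a device that 'replays' the journal without writing."""

    def __init__(self, read_block: Callable[[int], bytes],
                 redirect: Dict[int, int]) -> None:
        self._read_block = read_block
        self._redirect = dict(redirect)

    def read(self, block: int) -> bytes:
        # A read of block N is served from block M if the journal holds
        # a newer copy; otherwise it goes to block N unchanged.
        source = self._redirect.get(block, block)
        return self._read_block(source)
```

Nothing is ever written to the underlying device, which is the property the
forensics and locked-SD-card cases are asking for.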


* Re: document ext3 requirements
  2009-01-04 19:31             ` Theodore Tso
@ 2009-01-04 22:40               ` Pavel Machek
  2009-01-04 23:30                 ` Theodore Tso
  0 siblings, 1 reply; 86+ messages in thread
From: Pavel Machek @ 2009-01-04 22:40 UTC (permalink / raw)
  To: Theodore Tso, Sitsofe Wheeler, Duane Griffin, Valdis.Kletnieks,
	Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sun 2009-01-04 14:31:41, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 07:08:01PM +0000, Sitsofe Wheeler wrote:
> > Theodore Tso wrote:
> >> So what's the use case where people want to be able to mount a
> >> filesystem needing recovery read/only without running the journal?
> >
> > Corrupted SD card[1] that's been locked to read only for recovery  
> > purposes without having the FS tear itself apart further?
> 
> In that case, the right answer is to copy the 32 GB SD card to hard
> drive, and then operate on the hard drive.....  In general, if the
> media has started going bad, the *first* thing you want to do is an
> immediate copy of the media to some place stable.

Not necessarily.

If I have a bit of precious data and a lot of junk on the card, I want
to copy out the precious data before the card dies. Reading the whole
media may just take too long.

That's probably very true for rotating hard drives after a head crash...

"ro,noload" seems like a very acceptable solution in that case.
									Pavel
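For reference, a minimal sketch of what that mount invocation looks like,
written as a Python helper that only builds the argument list and does not
run anything. The function name and the device and mount point in the
usage note are hypothetical. Note that with noload the journal is not
replayed, so the filesystem may be seen in an inconsistent state.

```python
from typing import List


def readonly_noload_mount_argv(device: str, mountpoint: str,
                               fstype: str = "ext3") -> List[str]:
    """Return the argv for mounting read-only without loading the journal."""
    # "ro" keeps the mount read-only; "noload" tells ext3 not to load
    # (and therefore not to replay) the journal.
    return ["mount", "-t", fstype, "-o", "ro,noload", device, mountpoint]
```

Something like subprocess.run(readonly_noload_mount_argv("/dev/sdX1",
"/mnt/rescue"), check=True) would then perform the mount, as root.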

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: document ext3 requirements
  2009-01-04 22:40               ` Pavel Machek
@ 2009-01-04 23:30                 ` Theodore Tso
  2009-01-05  3:49                   ` Rob Landley
  0 siblings, 1 reply; 86+ messages in thread
From: Theodore Tso @ 2009-01-04 23:30 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Sitsofe Wheeler, Duane Griffin, Valdis.Kletnieks,
	Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sun, Jan 04, 2009 at 11:40:52PM +0100, Pavel Machek wrote:
> Not neccessarily.
> 
> If I have a bit of precious data and lot of junk on the card, I want
> to copy out the precious data before the card dies. Reading the whole
> media may just take too long.
> 
> That's probably very true for rotating harddrives after headcrash...

For a small amount of data, maybe; but the number of seeks is often far
more destructive than the amount of time the disk is spinning.  And in
practice, what generally happens is the user starts looking around to
make sure there wasn't anything else on the disk worth saving, and now
data is getting copied off based on human reaction time.  So that's
why I normally advise users that doing a full image copy of the disk
is much better than, say, "cp -r /home/luser /backup", or cd'ing
around a filesystem hierarchy and trying to save files one by one.

Note also that with SD cards, reading is generally non-destructive and
the time it takes to copy off, say, 32 GB really isn't that long.

        	    	     	      	      - Ted


* Re: document ext3 requirements
  2009-01-04 23:30                 ` Theodore Tso
@ 2009-01-05  3:49                   ` Rob Landley
  2009-01-05  4:31                     ` Robert Hancock
                                       ` (3 more replies)
  0 siblings, 4 replies; 86+ messages in thread
From: Rob Landley @ 2009-01-05  3:49 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Pavel Machek, Sitsofe Wheeler, Duane Griffin, Valdis.Kletnieks,
	Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sunday 04 January 2009 17:30:52 Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 11:40:52PM +0100, Pavel Machek wrote:
> > Not neccessarily.
> >
> > If I have a bit of precious data and lot of junk on the card, I want
> > to copy out the precious data before the card dies. Reading the whole
> > media may just take too long.
> >
> > That's probably very true for rotating harddrives after headcrash...
>
> For a small amount data, maybe; but the number of seeks is often far
> more destructive than the amount of time the disk is spinning.  And in
> practice, what generally happens is the user starts looking around to
> make sure there wasn't anything else on the disk worth saving, and now
> data is getting copied off based on human reaction time.  So that's
> why I normally advise users that doing a full image copy of the disk
> is much better than, say, "cp -r /home/luser /backup", or cd'ing
> around a filesystem hierarchy and trying to save files one by one.

That would be true if the disk hardware wasn't doing a gazillion retries to 
read a bad sector internally (taking 5 seconds to come back and report 
failure), and then the darn scsi layer added another gazillion retries on top 
of that, and the two multiply together to make it so slow that when you 
leave the thing copying the disk overnight it's STILL not done 24 hours later.  
Going in and cherry picking individual files looks kind of appealing in that 
situation.

Rob

P.S. Yeah, I had a laptop hard drive crash a month or so back.  I remember 
when it was still possible to buy storage devices that didn't get arbitrarily 
routed through the SCSI layer.  I miss those days.  I found the patch to route 
ramdisks through the scsi layer amusing, though.


* Re: document ext3 requirements
  2009-01-05  3:49                   ` Rob Landley
@ 2009-01-05  4:31                     ` Robert Hancock
  2009-01-05  5:00                     ` david
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 86+ messages in thread
From: Robert Hancock @ 2009-01-05  4:31 UTC (permalink / raw)
  To: Rob Landley
  Cc: Theodore Tso, Pavel Machek, Sitsofe Wheeler, Duane Griffin,
	Valdis.Kletnieks, Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

Rob Landley wrote:
> On Sunday 04 January 2009 17:30:52 Theodore Tso wrote:
>> On Sun, Jan 04, 2009 at 11:40:52PM +0100, Pavel Machek wrote:
>>> Not neccessarily.
>>>
>>> If I have a bit of precious data and lot of junk on the card, I want
>>> to copy out the precious data before the card dies. Reading the whole
>>> media may just take too long.
>>>
>>> That's probably very true for rotating harddrives after headcrash...
>> For a small amount data, maybe; but the number of seeks is often far
>> more destructive than the amount of time the disk is spinning.  And in
>> practice, what generally happens is the user starts looking around to
>> make sure there wasn't anything else on the disk worth saving, and now
>> data is getting copied off based on human reaction time.  So that's
>> why I normally advise users that doing a full image copy of the disk
>> is much better than, say, "cp -r /home/luser /backup", or cd'ing
>> around a filesystem hierarchy and trying to save files one by one.
> 
> That would be true if the disk hardware wasn't doing a gazillion retries to 
> read a bad sector internally (taking 5 seconds to come back and report 
> failure), and then the darn scsi layer added another gazillion retries on top 
> of that, and the two multiply together to make it so slow that that when you 
> leave the thing copying the disk overnight it's STILL not done 24 hours later.  
> Going in and cherry picking individual files looks kind of appealing in that 
> situation.
> 
> Rob
> 
> P.S. Yeah, I had a laptop hard drive crash a month or so back.  I remember 
> when it was still possible to buy storage devices that didn't get arbitrarily 
> routed through the SCSI layer.  I miss those days.  I found the patch to route 
> ramdisks through the scsi layer amusing, though.

SCSI layer doesn't do any retries itself. Block layer does.

Even with zero software retries, however, if there are a ton of bad 
sectors it can still take ages for them all to fail one at a time, 
just from the disk's own retries.


* Re: document ext3 requirements
  2009-01-05  3:49                   ` Rob Landley
  2009-01-05  4:31                     ` Robert Hancock
@ 2009-01-05  5:00                     ` david
  2009-01-05 11:19                     ` Alan Cox
  2009-01-06 10:36                     ` Matthias Andree
  3 siblings, 0 replies; 86+ messages in thread
From: david @ 2009-01-05  5:00 UTC (permalink / raw)
  To: Rob Landley
  Cc: Theodore Tso, Pavel Machek, Sitsofe Wheeler, Duane Griffin,
	Valdis.Kletnieks, Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sun, 4 Jan 2009, Rob Landley wrote:

> On Sunday 04 January 2009 17:30:52 Theodore Tso wrote:
>> On Sun, Jan 04, 2009 at 11:40:52PM +0100, Pavel Machek wrote:
>>> Not neccessarily.
>>>
>>> If I have a bit of precious data and lot of junk on the card, I want
>>> to copy out the precious data before the card dies. Reading the whole
>>> media may just take too long.
>>>
>>> That's probably very true for rotating harddrives after headcrash...
>>
>> For a small amount data, maybe; but the number of seeks is often far
>> more destructive than the amount of time the disk is spinning.  And in
>> practice, what generally happens is the user starts looking around to
>> make sure there wasn't anything else on the disk worth saving, and now
>> data is getting copied off based on human reaction time.  So that's
>> why I normally advise users that doing a full image copy of the disk
>> is much better than, say, "cp -r /home/luser /backup", or cd'ing
>> around a filesystem hierarchy and trying to save files one by one.
>
> That would be true if the disk hardware wasn't doing a gazillion retries to
> read a bad sector internally (taking 5 seconds to come back and report
> failure), and then the darn scsi layer added another gazillion retries on top
> of that, and the two multiply together to make it so slow that that when you
> leave the thing copying the disk overnight it's STILL not done 24 hours later.
> Going in and cherry picking individual files looks kind of appealing in that
> situation.

I've also had cases where one particular spot on the drive is bad: any 
attempt to read that sector fails and causes the drive to error out until 
a reboot. By grabbing individual files I could skip the file(s) in the 
affected portion and retrieve everything else on the drive (or, in some 
cases, on a RAID array with multiple failures).

David Lang


* Re: document ext3 requirements
  2009-01-05  3:49                   ` Rob Landley
  2009-01-05  4:31                     ` Robert Hancock
  2009-01-05  5:00                     ` david
@ 2009-01-05 11:19                     ` Alan Cox
  2009-01-05 19:00                       ` Rob Landley
  2009-01-27 13:24                       ` Thierry Vignaud
  2009-01-06 10:36                     ` Matthias Andree
  3 siblings, 2 replies; 86+ messages in thread
From: Alan Cox @ 2009-01-05 11:19 UTC (permalink / raw)
  To: Rob Landley
  Cc: Theodore Tso, Pavel Machek, Sitsofe Wheeler, Duane Griffin,
	Valdis.Kletnieks, Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

> That would be true if the disk hardware wasn't doing a gazillion retries to 
> read a bad sector internally (taking 5 seconds to come back and report 
> failure), and then the darn scsi layer added another gazillion retries on top 
> of that, and the two multiply together to make it so slow that that when you 
> leave the thing copying the disk overnight it's STILL not done 24 hours later.  
> Going in and cherry picking individual files looks kind of appealing in that 
> situation.

You could of course just learn to use the functions the kernel provides.
If you want to recover disk blocks without retrying you can do that via
SG_IO. If you want to adjust the timeout and retry levels you can do that
too via sysfs.

Alan
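As a rough illustration of the sysfs part, the per-device command timeout
for SCSI-managed disks is exposed as /sys/block/<dev>/device/timeout (in
seconds). The Python helpers below are a sketch under that assumption; the
function names and the sysfs_root parameter are hypothetical, and the retry
knob is deliberately not shown because its exact location is not given in
this thread.

```python
from pathlib import Path


def timeout_path(device: str, sysfs_root: str = "/sys") -> Path:
    """Path of the command timeout attribute for e.g. device='sda'."""
    return Path(sysfs_root) / "block" / device / "device" / "timeout"


def get_timeout(device: str, sysfs_root: str = "/sys") -> int:
    """Read the current command timeout in seconds."""
    return int(timeout_path(device, sysfs_root).read_text().strip())


def set_timeout(device: str, seconds: int, sysfs_root: str = "/sys") -> None:
    """Write a new command timeout in seconds (needs root on a real system)."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    timeout_path(device, sysfs_root).write_text(f"{seconds}\n")
```

Lowering the timeout before imaging a failing disk would be, for example,
set_timeout("sdb", 5), where "sdb" is a placeholder device name.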


* Re: document ext3 requirements
  2009-01-05 11:19                     ` Alan Cox
@ 2009-01-05 19:00                       ` Rob Landley
  2009-01-05 19:27                         ` Martin K. Petersen
  2009-01-27 13:24                       ` Thierry Vignaud
  1 sibling, 1 reply; 86+ messages in thread
From: Rob Landley @ 2009-01-05 19:00 UTC (permalink / raw)
  To: Alan Cox
  Cc: Theodore Tso, Pavel Machek, Sitsofe Wheeler, Duane Griffin,
	Valdis.Kletnieks, Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Monday 05 January 2009 05:19:13 Alan Cox wrote:
> You could of course just learn to use the functions the kernel provides.
> If you want to recover disk blocks without retrying you can do that via
> SG_IO. If you want to adjust the timeout and retry levels you can do that
> too via sysfs.

Good to know, but "my laptop hard drive just died" is not the optimal time to 
learn these sorts of things.

> Alan

Rob


* Re: document ext3 requirements
  2009-01-05 19:00                       ` Rob Landley
@ 2009-01-05 19:27                         ` Martin K. Petersen
  2009-01-06 10:41                           ` Matthias Andree
  0 siblings, 1 reply; 86+ messages in thread
From: Martin K. Petersen @ 2009-01-05 19:27 UTC (permalink / raw)
  To: Rob Landley
  Cc: Alan Cox, Theodore Tso, Pavel Machek, Sitsofe Wheeler,
	Duane Griffin, Valdis.Kletnieks, Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

>>>>> "Rob" == Rob Landley <rob@landley.net> writes:

Rob> On Monday 05 January 2009 05:19:13 Alan Cox wrote:
>> You could of course just learn to use the functions the kernel
>> provides.  If you want to recover disk blocks without retrying you
>> can do that via SG_IO. If you want to adjust the timeout and retry
>> levels you can do that too via sysfs.

Rob> Good to know, but "my laptop hard drive just died" is not the
Rob> optimal time to learn these sorts of things.

http://www.garloff.de/kurt/linux/ddrescue/

-- 
Martin K. Petersen	Oracle Linux Engineering



* Re: document ext3 requirements
  2009-01-05  3:49                   ` Rob Landley
                                       ` (2 preceding siblings ...)
  2009-01-05 11:19                     ` Alan Cox
@ 2009-01-06 10:36                     ` Matthias Andree
  3 siblings, 0 replies; 86+ messages in thread
From: Matthias Andree @ 2009-01-06 10:36 UTC (permalink / raw)
  To: Rob Landley
  Cc: Theodore Tso, Pavel Machek, Sitsofe Wheeler, Duane Griffin,
	Valdis.Kletnieks, Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sun, 04 Jan 2009, Rob Landley wrote:

> On Sunday 04 January 2009 17:30:52 Theodore Tso wrote:
> > On Sun, Jan 04, 2009 at 11:40:52PM +0100, Pavel Machek wrote:
> > > Not neccessarily.
> > >
> > > If I have a bit of precious data and lot of junk on the card, I want
> > > to copy out the precious data before the card dies. Reading the whole
> > > media may just take too long.
> > >
> > > That's probably very true for rotating harddrives after headcrash...
> >
> > For a small amount data, maybe; but the number of seeks is often far
> > more destructive than the amount of time the disk is spinning.  And in
> > practice, what generally happens is the user starts looking around to
> > make sure there wasn't anything else on the disk worth saving, and now
> > data is getting copied off based on human reaction time.  So that's
> > why I normally advise users that doing a full image copy of the disk
> > is much better than, say, "cp -r /home/luser /backup", or cd'ing
> > around a filesystem hierarchy and trying to save files one by one.
> 
> That would be true if the disk hardware wasn't doing a gazillion retries to 
> read a bad sector internally (taking 5 seconds to come back and report 
> failure), and then the darn scsi layer added another gazillion retries on top 
> of that, and the two multiply together to make it so slow that that when you 
> leave the thing copying the disk overnight it's STILL not done 24 hours later.  
> Going in and cherry picking individual files looks kind of appealing in that 
> situation.

Well, I recently (Dec 1st or so) had a venerable HDD fail with a couple
of bad sectors, and my backups were oldish (a couple of days).  The drive
was a Samsung SP2004C plugged into a VIA VT6420 in the south bridge (VT8237).

I couldn't use dd_rescue since the disk drive was detached by the OS
upon the disk's hitting the first error.

Whether it's a software or hardware fault I cannot say.  FreeBSD
7.1-PRERELEASE showed the same behaviour as the openSUSE 11.0 i686
kernels, but then again it might be either OS losing patience with the
drive doing excessive reads, or the drive actually violating the bus
protocol or hanging. It didn't need power cycling, though; detaching and
reattaching the ATA bus was sufficient.

For me, recovery was possible with rsync (or cp -Rp): run rsync -avH
until the drive froze, figure out which file was affected, note the name,
remount the drive, rm the affected file, and continue.

Is there a "don't retry reads" setting in the kernel that I am missing?

(I still have the drive, so I can try some error handling patches if
desired.)

-- 
Matthias Andree


* Re: document ext3 requirements
  2009-01-05 19:27                         ` Martin K. Petersen
@ 2009-01-06 10:41                           ` Matthias Andree
  2009-01-06 15:30                             ` Theodore Tso
       [not found]                             ` <20090106153020.GB13086__11022.1833143898$1231255950$gmane$org@mit.edu>
  0 siblings, 2 replies; 86+ messages in thread
From: Matthias Andree @ 2009-01-06 10:41 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Rob Landley, Alan Cox, Theodore Tso, Pavel Machek,
	Sitsofe Wheeler, Duane Griffin, Valdis.Kletnieks,
	Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Mon, 05 Jan 2009, Martin K. Petersen wrote:

> >>>>> "Rob" == Rob Landley <rob@landley.net> writes:
> 
> Rob> On Monday 05 January 2009 05:19:13 Alan Cox wrote:
> >> You could of course just learn to use the functions the kernel
> >> provides.  If you want to recover disk blocks without retrying you
> >> can do that via SG_IO. If you want to adjust the timeout and retry
> >> levels you can do that too via sysfs.
> 
> Rob> Good to know, but "my laptop hard drive just died" is not the
> Rob> optimal time to learn these sorts of things.
> 
> http://www.garloff.de/kurt/linux/ddrescue/

While nice, it does not reconfigure the block layer to reduce retries;
at least not in a manner I see at a glance; no sysctl or SG_IO or ioctl
or fcntl anywhere.

-- 
Matthias Andree


* Re: document ext3 requirements
  2009-01-06 10:41                           ` Matthias Andree
@ 2009-01-06 15:30                             ` Theodore Tso
       [not found]                             ` <20090106153020.GB13086__11022.1833143898$1231255950$gmane$org@mit.edu>
  1 sibling, 0 replies; 86+ messages in thread
From: Theodore Tso @ 2009-01-06 15:30 UTC (permalink / raw)
  To: Martin K. Petersen, Rob Landley, Alan Cox, Pavel Machek,
	Sitsofe Wheeler, Duane Griffin, Valdis.Kletnieks,
	Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Tue, Jan 06, 2009 at 11:41:46AM +0100, Matthias Andree wrote:
> > Rob> Good to know, but "my laptop hard drive just died" is not the
> > Rob> optimal time to learn these sorts of things.
> > 
> > http://www.garloff.de/kurt/linux/ddrescue/
> 
> While nice, it does not reconfigure the block layer to reduce retries;
> at least not in a manner I see at a glance; no sysctl or SG_IO or ioctl
> or fcntl anywhere.

Well, Kurt Garloff wrote that program years and years ago.  I'm sure
if someone created patches he'd probably accept them, though.  It's
still the best program I've found for doing image backups in
catastrophic situations.

						- Ted


* Re: document ext3 requirements
       [not found]                             ` <20090106153020.GB13086__11022.1833143898$1231255950$gmane$org@mit.edu>
@ 2009-01-06 15:40                               ` Andi Kleen
  2009-01-06 15:57                                 ` Theodore Tso
  0 siblings, 1 reply; 86+ messages in thread
From: Andi Kleen @ 2009-01-06 15:40 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Martin K. Petersen, Rob Landley, Alan Cox, Pavel Machek,
	Sitsofe Wheeler, Duane Griffin, Valdis.Kletnieks,
	Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

Theodore Tso <tytso@mit.edu> writes:

> On Tue, Jan 06, 2009 at 11:41:46AM +0100, Matthias Andree wrote:
>> > Rob> Good to know, but "my laptop hard drive just died" is not the
>> > Rob> optimal time to learn these sorts of things.
>> > 
>> > http://www.garloff.de/kurt/linux/ddrescue/
>> 
>> While nice, it does not reconfigure the block layer to reduce retries;
>> at least not in a manner I see at a glance; no sysctl or SG_IO or ioctl
>> or fcntl anywhere.
>
> Well, Kurt Garloff wrote that program years and years ago.  I'm sure
> if someone created patches he'd probably accept them, though.  It's
> still the best program I've found for doing image backups in
> catastrophic situations.

Better would be just to incorporate the functionality as an option
into standard GNU dd. Then everyone would easily have access to it.

-Andi

-- 
ak@linux.intel.com


* Re: document ext3 requirements
  2009-01-06 15:40                               ` Andi Kleen
@ 2009-01-06 15:57                                 ` Theodore Tso
  2009-01-06 17:31                                   ` Andi Kleen
  2009-01-06 19:31                                   ` Rob Landley
  0 siblings, 2 replies; 86+ messages in thread
From: Theodore Tso @ 2009-01-06 15:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Martin K. Petersen, Rob Landley, Alan Cox, Pavel Machek,
	Sitsofe Wheeler, Duane Griffin, Valdis.Kletnieks,
	Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Tue, Jan 06, 2009 at 04:40:33PM +0100, Andi Kleen wrote:
> > Well, Kurt Garloff wrote that program years and years ago.  I'm sure
> > if someone created patches he'd probably accept them, though.  It's
> > still the best program I've found for doing image backups in
> > catastrophic situations.
> 
> Better would be just to incorporate the functionality as an option
> into standard GNU dd. Then everyone would easily have access to it.

I'm not sure whether the GNU coreutils maintainer would be willing to
accept a series of Linux-specific interfaces, but dd_rescue also has
the advantage that it uses a large blocksize for speed, but when an
error is returned, it backs off to a small block size to recover the
maximum amount of data, and then later returns to the large block
size.  (Ideally, it should be able to query the disk drive to
determine its internal block size, and use that for the smaller block
size, but I'm not sure if there's a standardized way that value is
exposed by HDDs or SSDs.)

The dd_rescue program also has a progress bar, which as we all know
makes things go faster :-), and is useful because it means the user
knows how much of the disk has been copied, and whether he/she should
go to sleep for the night, or grab a cup of coffee or beer.  Its user
interface is also much simpler, and it's much easier to interrupt it
and start it up again where it left off.  (You can do this with dd,
but the average inexperienced user will be horribly confused by the dd
man page, and might easily screw up or skip one of the seek or skip
options.)

Of course, the right answer is to pursue both paths, although my
experience getting changes into the core/file/shell-utils has been
frustrating and unpleasant, although granted that was over ten years
ago, and hopefully the maintainer has been replaced since then by one
who is more responsive.  OTOH, Kurt's a good guy, and would probably
be willing to accept patches to improve dd_rescue.

						- Ted
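The block-size back-off described above can be sketched as follows. This is
not dd_rescue's actual code, just a small Python illustration of the
technique: rescue_copy, read_at and write_at are hypothetical names, the
1 MiB and 512-byte sizes are assumptions, and a failed or short small read
is simply treated as unreadable and zero-filled.

```python
from typing import Callable, List, Tuple


def rescue_copy(read_at: Callable[[int, int], bytes],
                write_at: Callable[[int, bytes], None],
                total_size: int,
                big: int = 1 << 20,
                small: int = 512) -> List[Tuple[int, int]]:
    """Copy total_size bytes; return the (offset, length) ranges that failed."""
    bad: List[Tuple[int, int]] = []
    offset = 0
    while offset < total_size:
        want = min(big, total_size - offset)
        try:
            data = read_at(offset, want)
        except OSError:
            data = None
        if data is not None and len(data) == want:
            # Fast path: the whole large block was readable.
            write_at(offset, data)
            offset += want
            continue
        # Large read failed or came back short: walk the same region in
        # small blocks so only the truly unreadable pieces are lost.
        end = offset + want
        while offset < end:
            piece = min(small, end - offset)
            try:
                data = read_at(offset, piece)
            except OSError:
                data = b""
            if len(data) != piece:
                bad.append((offset, piece))
                data = b"\0" * piece  # keep offsets aligned in the image
            write_at(offset, data)
            offset += piece
        # The outer loop then resumes with the large block size.
    return bad
```

On a real device, read_at could be lambda off, n: os.pread(src_fd, n, off)
and write_at could be lambda off, data: os.pwrite(dst_fd, data, off). The
returned list doubles as a crude log of what still needs another attempt.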


* Re: document ext3 requirements
  2009-01-06 15:57                                 ` Theodore Tso
@ 2009-01-06 17:31                                   ` Andi Kleen
  2009-01-06 19:31                                   ` Rob Landley
  1 sibling, 0 replies; 86+ messages in thread
From: Andi Kleen @ 2009-01-06 17:31 UTC (permalink / raw)
  To: Theodore Tso, Andi Kleen, Martin K. Petersen, Rob Landley,
	Alan Cox, Pavel Machek, Sitsofe Wheeler, Duane Griffin,
	Valdis.Kletnieks, Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

> Of course, the right answer is to pursue both paths, although my
> experiences getting changes into the core/file/shell-utils has been
> frustrating and unpleasant, although granted that was over ten years

I submitted a change to coreutils some time ago and the maintainer
(Jim Meyering) was easy to work with in my experience.

-Andi
-- 
ak@linux.intel.com


* Re: document ext3 requirements
  2009-01-06 15:57                                 ` Theodore Tso
  2009-01-06 17:31                                   ` Andi Kleen
@ 2009-01-06 19:31                                   ` Rob Landley
  1 sibling, 0 replies; 86+ messages in thread
From: Rob Landley @ 2009-01-06 19:31 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andi Kleen, Martin K. Petersen, Alan Cox, Pavel Machek,
	Sitsofe Wheeler, Duane Griffin, Valdis.Kletnieks,
	Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Tuesday 06 January 2009 09:57:29 Theodore Tso wrote:
> On Tue, Jan 06, 2009 at 04:40:33PM +0100, Andi Kleen wrote:
> > > Well, Kurt Garloff wrote that program years and years ago.  I'm sure
> > > if someone created patches he'd probably accept them, though.  It's
> > > still the best program I've found for doing image backups in
> > > catastrophic situations.
> >
> > Better would be just to incorporate the functionality as an option
> > into standard GNU dd. Then everyone would easily have access to it.
>
> I'm not sure whether the GNU coreutils maintainer would be willing to
> accept a series of Linux-specific interfaces, but dd_rescue also has
> the advantage that it uses a large blocksize for speed, but when an
> error is returned, it backs off to a small block size to recovery the
> maximum amount of data, and then later returns to the large block
> size.  (Ideally, it should be able to query the disk drive to
> determine its internal block size, and use that for the smaller block
> size, but I'm not sure if there's a standardized way that value is
> exposed by HDD's or SDD's.)

I don't suppose there's a Documentation file to put data recovery 
information in?

(Maybe the new filesystems expectations file, which doesn't seem the best name 
for it...?)

Rob


* Re: document ext3 requirements
  2009-01-05 11:19                     ` Alan Cox
  2009-01-05 19:00                       ` Rob Landley
@ 2009-01-27 13:24                       ` Thierry Vignaud
  2009-01-27 13:37                         ` Alan Cox
  1 sibling, 1 reply; 86+ messages in thread
From: Thierry Vignaud @ 2009-01-27 13:24 UTC (permalink / raw)
  To: Alan Cox
  Cc: Rob Landley, Theodore Tso, Pavel Machek, Sitsofe Wheeler,
	Duane Griffin, Valdis.Kletnieks, Martin MOKREJŠ,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> > That would be true if the disk hardware wasn't doing a gazillion retries to 
> > read a bad sector internally (taking 5 seconds to come back and report 
> > failure), and then the darn scsi layer added another gazillion retries on top 
> > of that, and the two multiply together to make it so slow that that when you 
> > leave the thing copying the disk overnight it's STILL not done 24 hours later.  
> > Going in and cherry picking individual files looks kind of appealing in that 
> > situation.
> 
> You could of course just learn to use the functions the kernel provides.
> If you want to recover disk blocks without retrying you can do that via
> SG_IO. If you want to adjust the timeout and retry levels you can do that
> too via sysfs.

Sure, but maybe the default values could be changed. I think the current
tradeoff leans way too far towards retrying.

I remember seeing I/O errors on CDs resulting in zillions of retries, and
errors on USB discs resulting in the USB port being reset again & again
for hours... (The CD case was years ago, but I still see the USB layer
trying to reset for quite a long time at least once per month.)


* Re: document ext3 requirements
  2009-01-27 13:24                       ` Thierry Vignaud
@ 2009-01-27 13:37                         ` Alan Cox
  0 siblings, 0 replies; 86+ messages in thread
From: Alan Cox @ 2009-01-27 13:37 UTC (permalink / raw)
  To: Thierry Vignaud
  Cc: Rob Landley, Theodore Tso, Pavel Machek, Sitsofe Wheeler,
	Duane Griffin, Valdis.Kletnieks, Martin MOKREJŠ, kernel list,
	Andrew Morton, mtk.manpages, rdunlap, linux-doc

> I remember seeing I/O errors on CDs resulting in zillions of retries, and
> errors on USB discs resulting in the USB port being reset again & again
> for hours... (The CD case was years ago, but I still see the USB layer
> trying to reset for quite a long time at least once per month.)

Most of the CD ones are caused by tools like hal continuing to probe the
device regularly and causing new avalanches of errors.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-05 11:43     ` Alan Cox
@ 2009-01-07 11:59       ` Rob Landley
  0 siblings, 0 replies; 86+ messages in thread
From: Rob Landley @ 2009-01-07 11:59 UTC (permalink / raw)
  To: Alan Cox
  Cc: Theodore Tso, Alexander E. Patrakov, Pavel Machek, kernel list,
	Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Monday 05 January 2009 05:43:29 Alan Cox wrote:
> > Huh?  I've never heard an assertion that disabling the write cache (I
> > assume you mean using write-through caching as opposed to write-back
> > caching), shortens the lifespan of disk drives.  Aggressive battery
>
> That's what I was told by a disk vendor - simply because the drive makes a
> lot more mechanical movements and writes.

It certainly sounds like less write caching would shorten the lifespan of 
flash devices...

Rob

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-05 19:16           ` Theodore Tso
@ 2009-01-06 19:20             ` Rob Landley
  0 siblings, 0 replies; 86+ messages in thread
From: Rob Landley @ 2009-01-06 19:20 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Martin MOKREJŠ,
	Pavel Machek, Duane Griffin, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc

On Monday 05 January 2009 13:16:58 Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 01:56:32PM -0600, Rob Landley wrote:
> > On Saturday 03 January 2009 17:01:58 Martin MOKREJŠ wrote:
> > > > Still handy for recovering badly broken filesystems, I'd say.
> > >
> > > Me as well. How about improving your doc patch with some summary of
> > > this thread (although it is probably not over yet)? ;-) Definitely,
> > > a note that one can mount it as ext2 while read-only would be helpful
> > > when doing some forensics on the disk.
> >
> > Although make sure you _do_ mount it as read only because if you mount an
> > ext3 filesystem read/write as ext2 I've had it zap the journal entirely
> > and then you have to tune2fs -j the sucker to turn it back into ext3.
> >
> > Ext3 is... touchy.
>
> Um.... horse pucky:

Well I managed to kill it more than once, but I could easily have the 
reproduction sequence wrong.  (I wasn't _trying_ to do it again...)

> # mke2fs -q -t ext3 /dev/thunk/footest
> # debugfs -R features /dev/thunk/footest
> debugfs 1.41.3 (12-Oct-2008)
> Filesystem features: has_journal ext_attr resize_inode dir_index filetype
> sparse_super large_file
> # mount -t ext2 /dev/thunk/footest /mnt
> # touch /mnt/foo
> # umount /mnt
> # debugfs -R features /dev/thunk/footest
> debugfs 1.41.3 (12-Oct-2008)
> Filesystem features: has_journal ext_attr resize_inode dir_index filetype
> sparse_super large_file

If I can figure out what I did, I'll get back to you.

>    	     		 	  	       		 - Ted

Rob

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-06 10:08         ` Matthias Andree
@ 2009-01-06 15:23           ` Theodore Tso
  0 siblings, 0 replies; 86+ messages in thread
From: Theodore Tso @ 2009-01-06 15:23 UTC (permalink / raw)
  To: Martin MOKREJŠ, Duane Griffin, kernel list

On Tue, Jan 06, 2009 at 11:08:10AM +0100, Matthias Andree wrote:
> On Sun, 04 Jan 2009, Martin MOKREJŠ wrote:
> > Hmm, so if my dual-boot machine does not shut down correctly and I boot
> > accidentally in M$ Win where I use the ext2 IFS driver and modify some
> > stuff on the ext3 drive, after a while reboot to linux and the journal
> > gets replayed ... Mmm ...
> 
> If the ext2 IFS driver mounts an ext3 file system that needs journal
> replay, the IFS driver is broken (unless it can replay the journal, of
> course - I stopped using that driver long ago, being unhappy with it).

Indeed; that's why there is an INCOMPAT NEEDS_RECOVERY feature flag to
prevent compliant ext2 implementations from mounting an ext3
filesystem that needs recovery.  We've thought about most of these
issues, almost a decade ago...

							- Ted
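
To make the flag concrete, here is a rough sketch of how a tool could check it on a disk image. It assumes the usual ext2/ext3 on-disk layout (superblock at byte 1024, little-endian fields, magic 0xEF53 at offset 0x38, compat and incompat feature words at 0x5C and 0x60, with bit 0x0004 meaning has_journal and needs_recovery respectively); the function names are hypothetical.

```python
import struct

SUPERBLOCK_OFFSET = 1024        # superblock starts 1 KiB into the device
MAGIC_OFFSET = 0x38             # s_magic, 16-bit little-endian
FEATURE_COMPAT_OFFSET = 0x5C    # s_feature_compat, followed by s_feature_incompat
EXT2_MAGIC = 0xEF53
COMPAT_HAS_JOURNAL = 0x0004     # filesystem has a journal (ext3)
INCOMPAT_RECOVER = 0x0004       # journal needs to be replayed


def journal_state(image: bytes) -> dict:
    """Decode the journal-related feature bits from the start of an image."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    if len(sb) < 0x64:
        raise ValueError("image too short to contain a superblock")
    (magic,) = struct.unpack_from("<H", sb, MAGIC_OFFSET)
    if magic != EXT2_MAGIC:
        raise ValueError("not an ext2/ext3 superblock")
    compat, incompat = struct.unpack_from("<II", sb, FEATURE_COMPAT_OFFSET)
    return {
        "has_journal": bool(compat & COMPAT_HAS_JOURNAL),
        "needs_recovery": bool(incompat & INCOMPAT_RECOVER),
    }


def read_journal_state(path: str) -> dict:
    """Read only the first 2 KiB of a device or image file and decode it."""
    with open(path, "rb") as f:
        return journal_state(f.read(2048))
```

A driver that does not understand an incompat bit is supposed to refuse the mount, which is why a set needs_recovery bit keeps a compliant ext2 implementation away from an unreplayed journal.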

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 23:01       ` Martin MOKREJŠ
                           ` (2 preceding siblings ...)
  2009-01-04 19:56         ` Rob Landley
@ 2009-01-06 10:08         ` Matthias Andree
  2009-01-06 15:23           ` Theodore Tso
  3 siblings, 1 reply; 86+ messages in thread
From: Matthias Andree @ 2009-01-06 10:08 UTC (permalink / raw)
  To: Martin MOKREJŠ; +Cc: Duane Griffin, kernel list

On Sun, 04 Jan 2009, Martin MOKREJŠ wrote:

> Pavel Machek wrote:
> > On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
> >> [Fixed top-posting]
> >>
> >> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> >>> Pavel Machek wrote:
> >>>> readonly mount does actually write to the media in some cases. Document that.
> >>>>
> >>> Can one avoid replay of the journal then if it would be unclean?
> >>> Just curious.
> >> Nope. If the underlying block device is read-only then mounting the
> >> filesystem will fail. I tried to fix this some time ago, and have a
> >> set of patches that almost always work, but "almost always" isn't good
> >> enough. Unfortunately I never managed to figure out a way to finish it
> >> off without disgusting hacks or major surgery.
> > 
> > Uhuh, can you just ignore the journal and mount it anyway?
> > ...basically treating it like an ext2?
> > 
> > ...ok, that will present "old" version of the filesystem to the
> > user... violating fsync() semantics.
> 
> Hmm, so if my dual-boot machine does not shut down correctly and I boot
> accidentally in M$ Win where I use the ext2 IFS driver and modify some
> stuff on the ext3 drive, after a while reboot to linux and the journal
> gets replayed ... Mmm ...

If the ext2 IFS driver mounts an ext3 file system that needs journal
replay, the IFS driver is broken (unless it can replay the journal, of
course - I stopped using that driver long ago, being unhappy with it).

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 22:29     ` Pavel Machek
  2009-01-03 23:01       ` Martin MOKREJŠ
  2009-01-03 23:12       ` Duane Griffin
@ 2009-01-06 10:06       ` Matthias Andree
  2 siblings, 0 replies; 86+ messages in thread
From: Matthias Andree @ 2009-01-06 10:06 UTC (permalink / raw)
  To: kernel list

On Sat, 03 Jan 2009, Pavel Machek wrote:

> On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
> > [Fixed top-posting]
> > 
> > 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> > > Pavel Machek wrote:
> > >> readonly mount does actually write to the media in some cases. Document that.
> > >>
> > > Can one avoid replay of the journal then if it would be unclean?
> > > Just curious.
> > 
> > Nope. If the underlying block device is read-only then mounting the
> > filesystem will fail. I tried to fix this some time ago, and have a
> > set of patches that almost always work, but "almost always" isn't good
> > enough. Unfortunately I never managed to figure out a way to finish it
> > off without disgusting hacks or major surgery.
> 
> Uhuh, can you just ignore the journal and mount it anyway?

An ext3 file system that needs journal recovery sets one of the ext2
incompatible flags to prevent just that.

> ...basically treating it like an ext2?
> 
> ...ok, that will present "old" version of the filesystem to the
> user... violating fsync() semantics.
> 
> Still handy for recovering badly broken filesystems, I'd say.

While you cannot have that, you'll need to dump the file system
(possibly with dd_rescue) to another medium and work on the copy.
That's what you should do anyways. ;-)

I think if you really want to mount the file system without journal
replay, you need to clear the needs-recovery "incompat" flag (on the
copy, obviously).

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-05 22:25       ` Sitsofe Wheeler
@ 2009-01-06  4:08         ` Rob Landley
  0 siblings, 0 replies; 86+ messages in thread
From: Rob Landley @ 2009-01-06  4:08 UTC (permalink / raw)
  To: Sitsofe Wheeler
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

On Monday 05 January 2009 16:25:52 Sitsofe Wheeler wrote:
> Rob Landley wrote:
> > There was also a marvelous thread Linus participated in on some hardware
> > industry web message board, but I have no idea where it's gone...
>
> http://www.realworldtech.com/forums/index.cfm?action=detail&id=92702&threadid=92678&roomid=2
> ? Can you remember a bit more about this thread?

Yeah, that looks like it.

It's a big thread, but I found it educational.  A lot more interesting 
info is scattered further down.  (Linus himself replied a dozen times.)

Rob

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-05 20:36             ` Sitsofe Wheeler
@ 2009-01-05 23:09               ` Theodore Tso
  0 siblings, 0 replies; 86+ messages in thread
From: Theodore Tso @ 2009-01-05 23:09 UTC (permalink / raw)
  To: Sitsofe Wheeler
  Cc: Martin K. Petersen, Pavel Machek, Rob Landley, kernel list,
	Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Mon, Jan 05, 2009 at 08:36:28PM +0000, Sitsofe Wheeler wrote:
> Theodore Tso wrote:
>> Don't forget non-cheesy mounting options so an accidental brush
>> against the side of the unit doesn't cause the hard drive to become
>> disconnected from the system and suffer a power drop.  I guess that gets
>> filed under "Brute force" as well.  :-)
>
> Are you thinking of sync? 

No, I was talking about physical mounting issues; as I said earlier in
the e-mail message you replied to (you must not have read my e-mail
carefully), Pavel ran into a problem where the SD card protruded
slightly from the laptop case, and it would easily (via physical
contact) get knocked loose from its connector so that it would become
disconnected from the laptop, causing it to lose power, sometimes
while it was writing, leading to filesystem corruption.

> If so I have experience of this not helping  
> with ext3 on an 8Gbyte SD card in an EeePC 900. Sooner or later a bunch  
> of zeros overwrites the early part of the partition and an fsck tears  
> the FS apart. This seems to happen quickly if you are booting your root  
> from the SD card (no swap though). A FAT32 partition seems to be  
> unperturbed so far (but it's not being used the same way as the ext3  
> partition).

A quick google search found some interesting posts on the subject:

  http://forum.eeeuser.com/viewtopic.php?id=37174
  http://lists.alioth.debian.org/pipermail/debian-eeepc-devel/2008-August/000837.html
  http://www.spinics.net/lists/linux-scsi/msg28197.html
  http://kerneltrap.org/mailarchive/git-commits-head/2008/8/6/2832894

							- Ted

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
       [not found]     ` <fa.ucJLoSQwk9OAj6T6x60tbWaiTAo@ifi.uio.no>
@ 2009-01-05 22:25       ` Sitsofe Wheeler
  2009-01-06  4:08         ` Rob Landley
  0 siblings, 1 reply; 86+ messages in thread
From: Sitsofe Wheeler @ 2009-01-05 22:25 UTC (permalink / raw)
  To: Rob Landley
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

Rob Landley wrote:
> There was also a marvelous thread Linus participated in on some hardware 
> industry web message board, but I have no idea where it's gone...

http://www.realworldtech.com/forums/index.cfm?action=detail&id=92702&threadid=92678&roomid=2 
? Can you remember a bit more about this thread?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
       [not found]           ` <fa.377DMq2lPMyaHxadPnApFSJFoCg@ifi.uio.no>
@ 2009-01-05 20:36             ` Sitsofe Wheeler
  2009-01-05 23:09               ` Theodore Tso
  0 siblings, 1 reply; 86+ messages in thread
From: Sitsofe Wheeler @ 2009-01-05 20:36 UTC (permalink / raw)
  To: Theodore Tso, Martin K. Petersen, Pavel Machek, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

Theodore Tso wrote:
> Don't forget non-cheesy mounting options so an accidental brush
> against the side of the unit doesn't cause the hard drive to become
> disconnected from the system and suffer a power drop.  I guess that gets
> filed under "Brute force" as well.  :-)

Are you thinking of sync? If so I have experience of this not helping 
with ext3 on an 8Gbyte SD card in an EeePC 900. Sooner or later a bunch 
of zeros overwrites the early part of the partition and an fsck tears 
the FS apart. This seems to happen quickly if you are booting your root 
from the SD card (no swap though). A FAT32 partition seems to be 
unperturbed so far (but it's not being used the same way as the ext3 
partition).

> P.S.  I feel obliged to point out that in my Lenovo X61s, the SD card
> is flush with the laptop case when inserted, and I've never had a
> problem with the SD card being prematurely ejected during operation.   :-)

I have never had premature ejection of SD with my Eee...

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-05 19:15         ` Martin K. Petersen
@ 2009-01-05 20:19           ` Theodore Tso
  0 siblings, 0 replies; 86+ messages in thread
From: Theodore Tso @ 2009-01-05 20:19 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Pavel Machek, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc

On Mon, Jan 05, 2009 at 02:15:44PM -0500, Martin K. Petersen wrote:
> 
> It works some of the time.  But in reality if you yank power halfway
> during a write operation the end result is undefined.
> 
> The saving grace for normal users is that the potential corruption is
> limited to a couple of sectors.

A few years ago it was asserted to me that the internal block size for
spinning magnetic media was around 32k.  So if the hard drive doesn't
have enough of a capacitor or other energy reserve to complete its
internal read-modify-write cycle, attempts to read the 32k chunk of
disk could result in hard ECC failures that would cause the blocks in
question to all return uncorrectable read errors when they are
accessed.

Of course, if the memory goes south first, and you're in the middle of
streaming a 128k update to the inode table of the filesystem, and the power
fails, and the memory starts returning garbage during the DMA
operation, you may have much bigger problems.  :-)

So it's probably more than "a couple of sectors"....

> The current suck of flash SSDs is that the erase block size amplifies
> this problem by at least one order of magnitude, often two.  I have a
> couple of SSDs here that will leave my filesystem in shambles every time
> the machine crashes.  I quickly got tired of reinstalling Fedora several
> times per week so now my main machine is back to spinning media.

The erase block size is typically 1 to 4 megabytes, from my
understanding.  So yeah, that's easily 1-2 orders of magnitude.  Worse
yet, flash's sequential streaming write speeds are much slower than
hard drives' (anywhere from a factor of 3 to 12 depending on how
cheap/trashy the flash drive happens to be), so that opens the time
window even further, by possibly as much as another order of magnitude.
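
As a back-of-the-envelope check on those numbers, the sketch below models the vulnerable window as (bytes rewritten) / (write speed), relative to a hard drive with the 32k internal block size asserted above. The model and the function name are assumptions for illustration only, not measurements.

```python
import math

HDD_INTERNAL_BLOCK = 32 * 1024  # the asserted internal block size for spinning media


def exposure_ratio(erase_block_bytes: int, write_slowdown: float) -> float:
    """How much longer the vulnerable window is than on the reference HDD.

    erase_block_bytes: flash erase block size in bytes
    write_slowdown:    how many times slower the flash streams writes
    """
    return (erase_block_bytes / HDD_INTERNAL_BLOCK) * write_slowdown


for erase_mib, slowdown in [(1, 3), (4, 12)]:
    ratio = exposure_ratio(erase_mib * 1024 * 1024, slowdown)
    print(f"{erase_mib} MiB erase block, {slowdown}x slower: "
          f"{ratio:.0f}x wider window (~{math.log10(ratio):.1f} orders of magnitude)")
```

Under these assumptions the window comes out between roughly 100x and 1500x wider, that is, two to three orders of magnitude.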

I also suspect that HDD manufacturers have learned various tricks (due
to enterprise storage/database vendors leaning on them) to make the
drives appear more atomic in the face of hard drive errors.  Also, in
Pavel's case, as I recall, he was using the card in a laptop where
the SD card protruded slightly from the laptop case, and it was very
easy for it to get dislodged, meaning that power failures during
writes were even more likely than you would expect with a fixed HDD or
SSD which is secured into place using screws or other more reliable
mounting hardware.

Put all of this together: given that Pavel's Really Trashy 32GB SD was
probably the full 3 orders of magnitude worse than a traditional HDD,
and that he was having many more failures due to physical mounting
issues, it's not surprising that most people haven't seen problems with
traditional HDDs, even though none of this is guaranteed by the hard
drive vendors.

> The people that truly and deeply care about this type of write atomicity
> (i.e. enterprises) deploy disk arrays that will do the right thing in
> face of an error.  This involves NVRAM, mirrored caches, uninterruptible
> power supplies, etc.  Brute force if you will.

Don't forget non-cheesy mounting options so an accidental brush
against the side of the unit doesn't cause the hard drive to become
disconnected from the system and suffer a power drop.  I guess that gets
filed under "Brute force" as well.  :-)

							- Ted

P.S.  I feel obliged to point out that in my Lenovo X61s, the SD card
is flush with the laptop case when inserted, and I've never had a
problem with the SD card being prematurely ejected during operation.   :-)


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04 19:56         ` Rob Landley
@ 2009-01-05 19:16           ` Theodore Tso
  2009-01-06 19:20             ` Rob Landley
  0 siblings, 1 reply; 86+ messages in thread
From: Theodore Tso @ 2009-01-05 19:16 UTC (permalink / raw)
  To: Rob Landley
  Cc: Martin MOKREJŠ,
	Pavel Machek, Duane Griffin, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc

On Sun, Jan 04, 2009 at 01:56:32PM -0600, Rob Landley wrote:
> On Saturday 03 January 2009 17:01:58 Martin MOKREJŠ wrote:
> > > Still handy for recovering badly broken filesystems, I'd say.
> >
> > Me as well. How about improving your doc patch with some summary of
> > this thread (although it is probably not over yet)? ;-) Definitely,
> > a note that one can mount it as ext2 while read-only would be helpful
> > when doing some forensics on the disk.
> 
> Although make sure you _do_ mount it as read only because if you mount an ext3 
> filesystem read/write as ext2 I've had it zap the journal entirely and then 
> you have to tune2fs -j the sucker to turn it back into ext3.
> 
> Ext3 is... touchy.

Um.... horse pucky:

# mke2fs -q -t ext3 /dev/thunk/footest
# debugfs -R features /dev/thunk/footest
debugfs 1.41.3 (12-Oct-2008)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype sparse_super large_file
# mount -t ext2 /dev/thunk/footest /mnt
# touch /mnt/foo
# umount /mnt
# debugfs -R features /dev/thunk/footest
debugfs 1.41.3 (12-Oct-2008)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype sparse_super large_file

   	     		 	  	       		 - Ted

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-05  9:45       ` Pavel Machek
  2009-01-05 11:28         ` Alan Cox
@ 2009-01-05 19:15         ` Martin K. Petersen
  2009-01-05 20:19           ` Theodore Tso
  1 sibling, 1 reply; 86+ messages in thread
From: Martin K. Petersen @ 2009-01-05 19:15 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Martin K. Petersen, Rob Landley, kernel list, Andrew Morton,
	tytso, mtk.manpages, rdunlap, linux-doc

>>>>> "Pavel" == Pavel Machek <pavel@suse.cz> writes:

>> It is mostly true on SCSI class devices because various UNIX, RAID
>> array and database vendors have spent many years leaning very hard on
>> the drive manufacturers to make it so.
>> 
>> But it's not a hard guarantee, you can't get it in writing, and it's
>> not in any of the standards.  Hybrid drives with flash had potential
>> to close that particular loophole but those appear to be dead in the
>> water.

Pavel> So "in practice it works but vendors will not guarantee that"?

It works some of the time.  But in reality if you yank power halfway
during a write operation the end result is undefined.

The saving grace for normal users is that the potential corruption is
limited to a couple of sectors.

The current suck of flash SSDs is that the erase block size amplifies
this problem by at least one order of magnitude, often two.  I have a
couple of SSDs here that will leave my filesystem in shambles every time
the machine crashes.  I quickly got tired of reinstalling Fedora several
times per week so now my main machine is back to spinning media.

The people that truly and deeply care about this type of write atomicity
(i.e. enterprises) deploy disk arrays that will do the right thing in
face of an error.  This involves NVRAM, mirrored caches, uninterruptible
power supplies, etc.  Brute force if you will.

High-end arrays even give you atomicity at a bigger granularity such as
filesystem or database blocks.  On some storage you can say "this LUN is
used for an Oracle database that always writes in multiples of 8KB" and
the array will guarantee that each 8KB block of the I/O is written in
its entirety or not at all.  Some arrays even allow you to verify Oracle
logical block checksums to ensure that the I/O is intact and internally
consistent.

I have been bugging storage vendors about a per-I/O write atomicity
setting for a while.  But it really messes up their pipelining so they
aren't keen on the idea.  We may be able to get some of it fixed as a
side-effect of the DIF bits vs. the impending switch to 4KB sectors,
though.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04 18:38   ` Theodore Tso
  2009-01-04 22:37     ` Pavel Machek
@ 2009-01-05 11:43     ` Alan Cox
  2009-01-07 11:59       ` Rob Landley
  1 sibling, 1 reply; 86+ messages in thread
From: Alan Cox @ 2009-01-05 11:43 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Alexander E. Patrakov, Pavel Machek, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc

> If dm supported barriers, this wouldn't be an issue.  Personally, I

"If the dm people applied the patches to support barriers" I believe is
the correct description - Andi ? 

dm and md want fixing and even in the md case it isn't hard to do right.

> > or disabling write cache (but, as Alan Cox said, this  
> > shortens the lifespan of the disk).
> 
> Huh?  I've never heard an assertion that disabling the write cache (I
> assume you mean using write-through caching as opposed to write-back
> caching), shortens the lifespan of disk drives.  Aggressive battery

That's what I was told by a disk vendor - simply because the drive makes a
lot more mechanical movements and writes.

> your noticing it, you can avoid running fsck at boot time.  It's
> really more about shortening the boot time after a crash more than
> anything else.

That depends enormously on your environment. In a secure environment full
data journalling is practically essential to avoid the tiny risk of bits
of important data turning up in another user's file.

Alan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-05  9:45       ` Pavel Machek
@ 2009-01-05 11:28         ` Alan Cox
  2009-01-05 19:15         ` Martin K. Petersen
  1 sibling, 0 replies; 86+ messages in thread
From: Alan Cox @ 2009-01-05 11:28 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Martin K. Petersen, Rob Landley, kernel list, Andrew Morton,
	tytso, mtk.manpages, rdunlap, linux-doc

> How true is it for normal SATA drives? Are there some tests I can
> just run on a machine, power-cycling it a few times, that tell me if my
> disk is non-ATOMIC-SECTORS?

No.

And even if it did, writes to one sector can damage another. The
mathematical certainty stuff lives only in the world of maths. In the real
world everything is probabilities.

Alan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-05  3:20     ` Martin K. Petersen
@ 2009-01-05  9:45       ` Pavel Machek
  2009-01-05 11:28         ` Alan Cox
  2009-01-05 19:15         ` Martin K. Petersen
  0 siblings, 2 replies; 86+ messages in thread
From: Pavel Machek @ 2009-01-05  9:45 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Rob Landley, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

> >>>>> "Pavel" == Pavel Machek <pavel@suse.cz> writes:
> 
> Pavel> Does this sound like a fair summary?
> 
> Pavel> Sector writes are atomic (ATOMIC-SECTORS)
> Pavel> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> I'd just like to point out that the all-or-nothing hardware sector
> atomicity thing is -- to a large extent -- a myth.

It is a myth that linux filesystems depend on for safe operation :-(.

> It is mostly true on SCSI class devices because various UNIX, RAID array
> and database vendors have spent many years leaning very hard on the
> drive manufacturers to make it so.
> 
> But it's not a hard guarantee, you can't get it in writing, and it's not
> in any of the standards.  Hybrid drives with flash had potential to
> close that particular loophole but those appear to be dead in the water.

So "in practice it works but vendors will not guarantee that"?

How true is it for normal SATA drives? Are there some tests I can
just run on a machine, power-cycling it a few times, that tell me if my
disk is non-ATOMIC-SECTORS?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-05  0:16     ` david
@ 2009-01-05  9:38       ` Pavel Machek
  0 siblings, 0 replies; 86+ messages in thread
From: Pavel Machek @ 2009-01-05  9:38 UTC (permalink / raw)
  To: david
  Cc: Rob Landley, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc


> >Sector writes are atomic (ATOMIC-SECTORS)
> >~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >Either whole sector is correctly written or nothing is written during
> >powerfail.
> >
> >       Unfortunately, none of the cheap USB/SD flash cards I've seen
> >       behave like this, so they are unsuitable for all Linux filesystems
> >       I know.
> >
> >               An inherent problem with using flash as a normal block
> >               device is that the flash erase size is bigger than
> >               most filesystem sector sizes.  So when you request a
> >               write, it may erase and rewrite the next 64k, 128k, or
> >               even a couple megabytes on the really _big_ ones.
> >
> >               If you lose power in the middle of that, filesystem
> >               won't notice that data in the "sectors" _after_ the
> >               one you were trying to write to got trashed.
> 
> around, not after. the block you are reading could be in the middle or at 
> the end of an eraseblock.

Applied, thanks.
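
The "around, not after" point can be illustrated with a small sketch. It assumes erase blocks are aligned starting at sector 0 of the device (real cards may not be, and partitions can shift things further); the function name and the default sizes are made up for the example.

```python
def sectors_at_risk(sector: int, sector_size: int = 512,
                    erase_block_size: int = 128 * 1024) -> range:
    """Return every sector sharing an erase block with the one being written.

    If power fails while the device rewrites that erase block, any of
    these sectors may be trashed, not only the target and its successors.
    """
    if erase_block_size % sector_size:
        raise ValueError("erase block must be a whole number of sectors")
    per_block = erase_block_size // sector_size
    first = (sector // per_block) * per_block  # round down to block start
    return range(first, first + per_block)


at_risk = sectors_at_risk(300)
print(f"writing sector 300 endangers sectors {at_risk.start}..{at_risk.stop - 1}")
```

With 512-byte sectors and a 128k erase block, a write to sector 300 puts sectors 256 through 511 at risk, so 44 sectors before the target are exposed as well as the ones after it.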

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-05  2:51       ` Rob Landley
  2009-01-05  3:33         ` Martin K. Petersen
@ 2009-01-05  4:02         ` david
  2009-01-05  3:52           ` Rob Landley
  1 sibling, 1 reply; 86+ messages in thread
From: david @ 2009-01-05  4:02 UTC (permalink / raw)
  To: Rob Landley
  Cc: Sitsofe Wheeler, Pavel Machek, kernel list, Andrew Morton, tytso,
	mtk.manpages, rdunlap, linux-doc

On Sun, 4 Jan 2009, Rob Landley wrote:

> On Sunday 04 January 2009 17:13:08 Sitsofe Wheeler wrote:
>> Pavel Machek wrote:
>>> Is there linux filesystem that can handle that? I know jffs2, but
>>> that's unsuitable for stuff like USB thumb drives, right?
>>
>> This raises the question that if nothing can handle it which FS is the
>> least bad? The last I heard people were saying that with cheap SSDs the
>> recommendation was FAT [1] but in the future btrfs, nilfs and logfs
>> would be better.
>>
>> [1] http://lkml.org/lkml/2008/10/14/129
>
> I wonder if the flash filesystems could be told via mount options that they're
> to use a normal block device as if it was a flash with granularity X?
>
> They can't explicitly control erase, but writing to any block in a block group
> will erase and rewrite the whole group so they can just do large write
> transactions close to each other and the device should aggregate enough for an
> erase block.  (Plus don't touch anything _outside_ where you guess an erase
> block to be until you've finished writing the whole block, which they
> presumably already do.)

this capability would help for raid arrays as well.

if you have a raid5/6 array, writing one sector to a stripe results in you 
reading the parity block for that stripe, reading the rest of the sectors 
for the block on that disk, recalculating the parity information and 
writing the changed sectors out to both disks.

if you are writing the entire stripe, you could calculate the parity and 
just write everything out (no reads necessary).

this would make sequential writes to raid5/6 arrays almost as fast as if 
they were raid0 stripes.

if you could define 'erase block size' to be the raid stripe size the same 
approach would work for both systems.

when I asked about this on the md list a couple of years ago, the response 
that I got was that it was a good idea, but there was no way to get the 
information about the low-level topology to the higher levels that would 
need to act on the information. now that there is a second case where this 
is needed, any mechanism that gets created should be made so that it's 
useable for both.
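
The difference between the two write paths described above can be sketched with plain XOR parity (the raid5 case; raid6's second syndrome is left out). This is a toy model on byte strings with hypothetical function names, not how md implements it.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equally sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def full_stripe_parity(data_blocks: list) -> bytes:
    """Full-stripe write: parity comes from the new data alone, no reads."""
    return xor_blocks(*data_blocks)


def rmw_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Partial write: the old data and old parity must be read back first."""
    return xor_blocks(old_parity, old_data, new_data)


stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = full_stripe_parity(stripe)

# Updating one block the slow way costs two reads plus two writes ...
updated = rmw_parity(parity, stripe[1], b"\xaa\xbb")
# ... but gives the same parity as recomputing over the whole new stripe.
stripe[1] = b"\xaa\xbb"
print(updated == full_stripe_parity(stripe))
```

If the filesystem knew the stripe size, it could batch writes into whole stripes and take the no-read path, which is the same trick as filling a whole erase block on flash.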

David Lang

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-05  4:02         ` david
@ 2009-01-05  3:52           ` Rob Landley
  0 siblings, 0 replies; 86+ messages in thread
From: Rob Landley @ 2009-01-05  3:52 UTC (permalink / raw)
  To: david
  Cc: Sitsofe Wheeler, Pavel Machek, kernel list, Andrew Morton, tytso,
	mtk.manpages, rdunlap, linux-doc

On Sunday 04 January 2009 22:02:05 david@lang.hm wrote:
> when I asked about this on the md list a couple of years ago, the response
> that I got was that it was a good idea, but there was no way to get the
> information about the low-level topology to the higher levels that would
> need to act on the information.

Mount option or an argument to mkfs that stores it in the superblock both work 
for me.

> now that there is a second case where this
> is needed, any mechanism that gets created should be made so that it's
> useable for both.

The embedded and supercomputing worlds have more in common with each other 
than either does with the desktop...

> David Lang

Rob

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-05  2:51       ` Rob Landley
@ 2009-01-05  3:33         ` Martin K. Petersen
  2009-01-05  4:02         ` david
  1 sibling, 0 replies; 86+ messages in thread
From: Martin K. Petersen @ 2009-01-05  3:33 UTC (permalink / raw)
  To: Rob Landley
  Cc: Sitsofe Wheeler, Pavel Machek, kernel list, Andrew Morton, tytso,
	mtk.manpages, rdunlap, linux-doc

>>>>> "Rob" == Rob Landley <rob@landley.net> writes:

Rob> I wonder if the flash filesystems could be told via mount options
Rob> that they're to use a normal block device as if it was a flash with
Rob> granularity X?

I posted some patches a few months ago that allowed us to do this.  In
particular they expose the underlying I/O topology to the filesystems.
That includes minimum, preferred and maximum I/O size for both read and
write as well as alignment.  The patches also allow stacking so we get
alignment right on say LVM on top of MD on top of a partitioned disk.
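
As a toy illustration of the stacking idea (this is not the posted
patches; the `Topology` class, its field names and the LCM rule are
assumptions made up for this sketch, and alignment offsets are left
out), each layer reports its I/O sizes and the stacked device picks
sizes that every layer underneath can live with:

```python
from dataclasses import dataclass
from math import gcd


@dataclass(frozen=True)
class Topology:
    """Hypothetical per-device I/O hints, all in bytes."""
    io_min: int  # smallest I/O that avoids a read-modify-write cycle
    io_opt: int  # preferred I/O size, e.g. a full RAID stripe


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a // gcd(a, b) * b


def stack(upper: Topology, lower: Topology) -> Topology:
    """Combine the hints of a device stacked on top of another one."""
    return Topology(io_min=lcm(upper.io_min, lower.io_min),
                    io_opt=lcm(upper.io_opt, lower.io_opt))


# Example values are invented: a 4k-sector disk, an MD array with a
# 64k chunk and 4 data disks, and an LVM volume on top.
disk = Topology(io_min=4096, io_opt=4096)
md = Topology(io_min=64 * 1024, io_opt=256 * 1024)
lvm = Topology(io_min=512, io_opt=512)

stacked = stack(lvm, stack(md, disk))
print(stacked)
```

The point of the sketch is only that the filesystem at the top sees one
set of numbers, however many layers sit below it.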

At Kernel Summit/Plumbers Linus absolutely hated this idea in the
context of SSDs.  And I don't necessarily disagree with his point that
Intel (claims to have) solved this problem.

However, there are still lots of crappy devices out there that we need to
support.  And we absolutely need this for RAID (both software and
hardware) as well.  I've been meaning to post a new round of these
patches.  I'll take a look at them again this week.

The intent was to use the alignment and block sizes to honor erase block
boundaries when merging requests.
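
One possible reading of "honor erase block boundaries when merging" is
sketched below: adjacent writes are merged only while the result stays
inside a single erase block. The function name, the (offset, length)
request format and the policy itself are assumptions for illustration,
not what the block layer actually does:

```python
def merge_requests(requests, erase_block):
    """Merge adjacent (offset, length) writes, in bytes, without letting
    a merged request cross an erase block boundary.

    Requests that already span a boundary are passed through unchanged.
    """
    merged = []
    for offset, length in sorted(requests):
        if merged:
            prev_off, prev_len = merged[-1]
            contiguous = prev_off + prev_len == offset
            # The merged request would run from prev_off to the last
            # byte of the new request; both must be in one erase block.
            same_block = (prev_off // erase_block
                          == (offset + length - 1) // erase_block)
            if contiguous and same_block:
                merged[-1] = (prev_off, prev_len + length)
                continue
        merged.append((offset, length))
    return merged
```

With a 128k erase block, two 4k writes at offsets 0 and 4096 are merged,
while a write ending exactly at 128k and the one starting at 128k are
kept apart.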

SCSI already has knobs that expose the appropriate sizes although not
many vendors implement them yet.  I've been talking to a few SSD vendors
about exposing similar parameters with SATA.  Most of them are willing
and will happily share this information.  Other vendors stop responding
when you ask them too many questions.

-- 
Martin K. Petersen	Oracle Linux Engineering



* Re: document ext3 requirements
  2009-01-04 19:21                         ` Geert Uytterhoeven
  2009-01-04 19:36                           ` Theodore Tso
  2009-01-04 22:42                           ` Bron Gondwana
@ 2009-01-05  3:22                           ` Rob Landley
  2 siblings, 0 replies; 86+ messages in thread
From: Rob Landley @ 2009-01-05  3:22 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Theodore Tso, Duane Griffin, Valdis.Kletnieks,
	Martin MOKREJŠ,
	Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

On Sunday 04 January 2009 13:21:06 Geert Uytterhoeven wrote:
> I think most people get shocked when they discover that mounting something
> read-only may actually write to the media. This is a bit unexpected (hey, if
> I mount `read-only', I expect that no writes will happen), as it behaved
> differently before the introduction of journalling.

Is this an unreasonable use case:

  kill -STOP $(pidof qemu)
  mount -o loop,ro hdb.img blah
  cp blah/thingy thingy
  umount blah
  kill -CONT $(pidof qemu)

Currently, if your loopback mount is -t ext3 it'll write to the block device, 
and if your mount is -t ext2 it'll refuse to work on an unclean ext3 
filesystem, even if it's read only.  (But it _will_ work on an unclean ext2 
filesystem.)

My theory when I first found out about this was "the filesystem developers 
hate me personally".

Rob


* Re: document ext3 requirements
  2009-01-04 22:55   ` Pavel Machek
  2009-01-05  0:16     ` david
  2009-01-05  1:50     ` Rob Landley
@ 2009-01-05  3:20     ` Martin K. Petersen
  2009-01-05  9:45       ` Pavel Machek
  2 siblings, 1 reply; 86+ messages in thread
From: Martin K. Petersen @ 2009-01-05  3:20 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rob Landley, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

>>>>> "Pavel" == Pavel Machek <pavel@suse.cz> writes:

Pavel> Does this sound like a fair summary?

Pavel> Sector writes are atomic (ATOMIC-SECTORS)
Pavel> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I'd just like to point out that the all-or-nothing hardware sector
atomicity thing is -- to a large extent -- a myth.

It is mostly true on SCSI class devices because various UNIX, RAID array
and database vendors have spent many years leaning very hard on the
drive manufacturers to make it so.

But it's not a hard guarantee, you can't get it in writing, and it's not
in any of the standards.  Hybrid drives with flash had potential to
close that particular loophole but those appear to be dead in the water.

-- 
Martin K. Petersen	Oracle Linux Engineering



* Re: document ext3 requirements
  2009-01-04  0:19         ` Pavel Machek
@ 2009-01-05  2:55           ` Rob Landley
  0 siblings, 0 replies; 86+ messages in thread
From: Rob Landley @ 2009-01-05  2:55 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Martin MOKREJŠ,
	Duane Griffin, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

On Saturday 03 January 2009 18:19:00 Pavel Machek wrote:
> No, you can't mount unclean ext3 as an ext2; patch to do that would be
> possible but...

tune2fs -O ^has_journal /dev/blah
fsck.ext2 -f /dev/blah

Rob


* Re: document ext3 requirements
  2009-01-04 23:13     ` Sitsofe Wheeler
@ 2009-01-05  2:51       ` Rob Landley
  2009-01-05  3:33         ` Martin K. Petersen
  2009-01-05  4:02         ` david
  0 siblings, 2 replies; 86+ messages in thread
From: Rob Landley @ 2009-01-05  2:51 UTC (permalink / raw)
  To: Sitsofe Wheeler
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

On Sunday 04 January 2009 17:13:08 Sitsofe Wheeler wrote:
> Pavel Machek wrote:
> > Is there a Linux filesystem that can handle that? I know jffs2, but
> > that's unsuitable for stuff like USB thumb drives, right?
>
> This raises the question that if nothing can handle it which FS is the
> least bad? The last I heard people were saying that with cheap SSDs the
> recommendation was FAT [1] but in the future btrfs, nilfs and logfs
> would be better.
>
> [1] http://lkml.org/lkml/2008/10/14/129

I wonder if the flash filesystems could be told via mount options that they're 
to use a normal block device as if it was a flash with granularity X?

They can't explicitly control erase, but writing to any block in a block group 
will erase and rewrite the whole group so they can just do large write 
transactions close to each other and the device should aggregate enough for an 
erase block.  (Plus don't touch anything _outside_ where you guess an erase 
block to be until you've finished writing the whole block, which they 
presumably already do.)

The other question is whether there's any way to guess an erase granularity 
that's "good enough" for a device of size X, maybe larger than the device 
actually does but not smaller than any remotely sane manufacturer would 
implement.  (And just _don't_ partition these suckers, so you don't have to 
worry about partitions aligning with erase block sizes.) 
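
For anyone who does partition these things anyway, the alignment check
is simple arithmetic. A minimal sketch, assuming 512-byte sectors and a
guessed erase block size (the helper names are made up for this
example):

```python
SECTOR = 512  # bytes per sector, assumed


def is_aligned(start_sector, erase_block_bytes):
    """True if a partition starting at start_sector begins on an
    erase block boundary."""
    return (start_sector * SECTOR) % erase_block_bytes == 0


def align_up(start_sector, erase_block_bytes):
    """Round start_sector up to the next erase block boundary."""
    sectors_per_block = erase_block_bytes // SECTOR
    # Ceiling division without floating point.
    return -(-start_sector // sectors_per_block) * sectors_per_block
```

A traditional DOS-style first partition at sector 63 is not aligned to a
guessed 128k erase block; rounding up moves the start to sector 256.
Guessing too large a granularity only wastes a little space, while
guessing too small leaves the partition straddling erase blocks.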

Rob


* Re: document ext3 requirements
  2009-01-04 22:55   ` Pavel Machek
  2009-01-05  0:16     ` david
@ 2009-01-05  1:50     ` Rob Landley
  2009-01-05  3:20     ` Martin K. Petersen
  2 siblings, 0 replies; 86+ messages in thread
From: Rob Landley @ 2009-01-05  1:50 UTC (permalink / raw)
  To: Pavel Machek
  Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

On Sunday 04 January 2009 16:55:45 Pavel Machek wrote:
> On Sun 2009-01-04 13:49:49, Rob Landley wrote:
> > On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
> > > +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> > > +behaving disk subsystem, data that have been successfully synced will
> > > +stay on the disk. Sane means:
> > > +
> > > > +* writes to media never fail. Even if disk returns error condition during
> > > > +  write, ext3 can't handle that correctly, because success on fsync was
> > > > +  already returned when data hit the journal.
> > > > +
> > > > +	   (Fortunately writes failing are very uncommon on disks, as they
> > > > +	   have spare sectors they use when write fails.)
> > > > +
> > > > +* either whole sector is correctly written or nothing is written during
> > > > +  powerfail.
> > > > +
> > > > +	   (Unfortunately, none of the cheap USB/SD flash cards I have seen
> > > > +	   behave like this, so they are unsuitable for ext3.
> >
> > Want to document the granularity issues with flash, while you're at it?
> >
> > An inherent problem with using flash as a normal block device is that the
> > flash erase size is bigger than most filesystem sector sizes.  So when
> > you request a write, it may erase and rewrite the next 64k, 128k, or even
> > a couple megabytes on the really _big_ ones.
> >
> > If you lose power in the middle of that, ext3 won't notice that data in
> > > the "sectors" _after_ the one you were trying to write to got trashed.
> >
> > The flash filesystems take this into account as part of their wear
> > levelling stuff (they normally copy the entire chunk into a new chunk,
> > leaving the old one in place until it's no longer needed), but they need
> > to query the device to get the erase granularity in order to do that,
> > which is why they don't work on non-flash block devices.
>
> Is there a Linux filesystem that can handle that? I know jffs2, but
> that's unsuitable for stuff like USB thumb drives, right?

Any of the flash filesystems should handle that.  The main problem with jffs2 
is it doesn't scale well to large device sizes.  UBIFS is supposed to scale 
much better, but I haven't played with it yet.

And the thing about USB thumb drives is they present as a normal block device, 
_not_ as flash, so you can't _query_ their erase granularity.  (It's like 
those hardware raid cards that wouldn't tell you they were striping and such 
so you had to figure out a well-performing layout all by yourself.)   They do 
it magically behind the scenes, and if the power goes out (or you yank the 
device out unexpectedly) if they haven't got a built-in capacitor or battery 
to have enough power to complete their pending transaction, you're screwed.

Plus they do horrible wear levelling, the lot of 'em.  Read Val Henson's 
livejournal entry about it: http://valhenson.livejournal.com/25228.html

There was also a marvelous thread Linus participated in on some hardware 
industry web message board, but I have no idea where it's gone...

> Does this sound like a fair summary?

See Ted's comment.  The summary's fine, the question is where to put this sort 
of thing...

>                 If you lose power in the middle of that, filesystem
>                 won't notice that data in the "sectors" _after_ the
>                 one you were trying to write to got trashed.

Well, the journal won't notice.  An e2fsck will notice huge swaths of missing 
metadata, but won't be able to do anything about it.  (And if what got zapped 
was file _contents_ rather than metadata, you're on your own finding it.  Fun, 
isn't it?)

> 									Pavel

Rob


* Re: document ext3 requirements
  2009-01-04 22:06   ` Theodore Tso
  2009-01-04 22:25     ` Pavel Machek
  2009-01-04 23:07     ` Pavel Machek
@ 2009-01-05  1:38     ` Rob Landley
  2 siblings, 0 replies; 86+ messages in thread
From: Rob Landley @ 2009-01-05  1:38 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

On Sunday 04 January 2009 16:06:34 Theodore Tso wrote:
> True enough, although the newer SSD's will have this problem addressed
> (although at least initially, they are **far** more costly than the
> el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
> alongside battery-powered shavers and trashy ipod speakers).

I have great faith in the ability of PC hardware to continue to be crap for 
the foreseeable future.

> I will stress again, that most of this doesn't belong in
> Documentation/filesystems/ext3.txt, as most of this is *not*
> ext3-specific.

Yes and no.  Ext3 is enough of a "default" filesystem for Linux that some 
documentation on when _not_ to use it sounds like a good idea.

That said, some kind of a "choosing a filesystem" file would be good, perhaps 
under the filesystems directory.  (Then the ext3 doc would just need a brief 
comment and a pointer to the other file.)

Rob


* Re: document ext3 requirements
  2009-01-04 22:55   ` Pavel Machek
@ 2009-01-05  0:16     ` david
  2009-01-05  9:38       ` Pavel Machek
  2009-01-05  1:50     ` Rob Landley
  2009-01-05  3:20     ` Martin K. Petersen
  2 siblings, 1 reply; 86+ messages in thread
From: david @ 2009-01-05  0:16 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rob Landley, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

On Sun, 4 Jan 2009, Pavel Machek wrote:

> On Sun 2009-01-04 13:49:49, Rob Landley wrote:
>> On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
>>> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
>>> +behaving disk subsystem, data that have been successfully synced will
>>> +stay on the disk. Sane means:
>>> +
>>> +* writes to media never fail. Even if disk returns error condition during
>>> +  write, ext3 can't handle that correctly, because success on fsync was
>>> +  already returned when data hit the journal.
>>> +
>>> +	   (Fortunately writes failing are very uncommon on disks, as they
>>> +	   have spare sectors they use when write fails.)
>>> +
>>> +* either whole sector is correctly written or nothing is written during
>>> +  powerfail.
>>> +
>>> +	   (Unfortunately, none of the cheap USB/SD flash cards I have seen
>>> +	   behave like this, so they are unsuitable for ext3.
>>
>> Want to document the granularity issues with flash, while you're at it?
>>
>> An inherent problem with using flash as a normal block device is that the
>> flash erase size is bigger than most filesystem sector sizes.  So when you
>> request a write, it may erase and rewrite the next 64k, 128k, or even a couple
>> megabytes on the really _big_ ones.
>>
>> If you lose power in the middle of that, ext3 won't notice that data in the
>> "sectors" _after_ the one you were trying to write to got trashed.
>>
>> The flash filesystems take this into account as part of their wear levelling
>> stuff (they normally copy the entire chunk into a new chunk, leaving the old
>> one in place until it's no longer needed), but they need to query the device
>> to get the erase granularity in order to do that, which is why they don't work
>> on non-flash block devices.
>
> Is there a Linux filesystem that can handle that? I know jffs2, but
> that's unsuitable for stuff like USB thumb drives, right?
>
> Does this sound like a fair summary?
>
> Sector writes are atomic (ATOMIC-SECTORS)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Either whole sector is correctly written or nothing is written during
> powerfail.
>
>        Unfortunately, none of the cheap USB/SD flash cards I have seen
>        behave like this, so they are unsuitable for all the Linux
>        filesystems I know.
>
>                An inherent problem with using flash as a normal block
>                device is that the flash erase size is bigger than
>                most filesystem sector sizes.  So when you request a
>                write, it may erase and rewrite the next 64k, 128k, or
>                even a couple megabytes on the really _big_ ones.
>
>                If you lose power in the middle of that, filesystem
>                won't notice that data in the "sectors" _after_ the
>                one you were trying to write to got trashed.

around, not after. the block you are reading could be in the middle or at 
the end of an eraseblock.
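
To make "around, not after" concrete, here is a minimal sketch that
works out which sectors share an erase block with a given one. It
assumes the erase blocks are aligned to the start of the device, and
the function name is made up for this example:

```python
def erase_block_range(sector, sectors_per_erase_block):
    """Return (first, last) sector of the erase block containing
    `sector`, assuming erase blocks start at sector 0 of the device."""
    first = (sector // sectors_per_erase_block) * sectors_per_erase_block
    return first, first + sectors_per_erase_block - 1
```

With 256 sectors per erase block (128k at 512 bytes per sector), a write
to sector 300 puts sectors 256 through 511 at risk, so the damage can
land both before and after the sector being written.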

David Lang


* Re: document ext3 requirements
  2009-01-04 22:37     ` Pavel Machek
@ 2009-01-04 23:58       ` Theodore Tso
  0 siblings, 0 replies; 86+ messages in thread
From: Theodore Tso @ 2009-01-04 23:58 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alexander E. Patrakov, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, Alan Cox

On Sun, Jan 04, 2009 at 11:37:56PM +0100, Pavel Machek wrote:
> 
> Are you sure you need to have thrashing? AFAICT metadata + fsync heavy
> workload should be enough... and there were scripts to easily repeat
> that.

The memory pressure is needed to force disk buffers out to disk sooner
than fsync() would normally force buffers out.  The scripts which I've
seen induced memory pressure.  If the disk is *super* aggressive at
reordering writes, I suppose a heavy fsync workload might be enough on
its own, but in practice, it's generally not enough.

							- Ted


* Re: document ext3 requirements
       [not found]   ` <fa.26o5IHCAC3TQdXupl62CLYwQ+Wk@ifi.uio.no>
@ 2009-01-04 23:13     ` Sitsofe Wheeler
  2009-01-05  2:51       ` Rob Landley
       [not found]     ` <fa.GBkQuKdRj+YRVczlNLFhGvaw3WY@ifi.uio.no>
       [not found]     ` <fa.ucJLoSQwk9OAj6T6x60tbWaiTAo@ifi.uio.no>
  2 siblings, 1 reply; 86+ messages in thread
From: Sitsofe Wheeler @ 2009-01-04 23:13 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rob Landley, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

Pavel Machek wrote:
> Is there a Linux filesystem that can handle that? I know jffs2, but
> that's unsuitable for stuff like USB thumb drives, right?

This raises the question that if nothing can handle it which FS is the 
least bad? The last I heard people were saying that with cheap SSDs the 
recommendation was FAT [1] but in the future btrfs, nilfs and logfs 
would be better.

[1] http://lkml.org/lkml/2008/10/14/129


* Re: document ext3 requirements
  2009-01-04 22:06   ` Theodore Tso
  2009-01-04 22:25     ` Pavel Machek
@ 2009-01-04 23:07     ` Pavel Machek
  2009-01-05  1:38     ` Rob Landley
  2 siblings, 0 replies; 86+ messages in thread
From: Pavel Machek @ 2009-01-04 23:07 UTC (permalink / raw)
  To: Theodore Tso, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc

On Sun 2009-01-04 17:06:34, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 01:49:49PM -0600, Rob Landley wrote:
> > 
> > Want to document the granularity issues with flash, while you're at it?
> > 
> > An inherent problem with using flash as a normal block device is that the 
> > flash erase size is bigger than most filesystem sector sizes.  So when you 
> > request a write, it may erase and rewrite the next 64k, 128k, or even a couple 
> > megabytes on the really _big_ ones.
> > 
> > If you lose power in the middle of that, ext3 won't notice that data in the 
> > "sectors" _after_ the one you were trying to write to got trashed.
> 
> True enough, although the newer SSD's will have this problem addressed
> (although at least initially, they are **far** more costly than the
> el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
> alongside battery-powered shavers and trashy ipod speakers).

Hey, I got one of those el-cheapo 32GB SD cards. I fully expected it
to be slow, but eating my data 3 times per month was unexpected even
for me.

I'm not even sure where the blame is. I certainly blame the Linux
documentation: there should be a "DON'T USE CRAPPY SD CARDS" warning in
big bold letters somewhere. I guess mkfs.ext3 should just refuse to
make a filesystem on them. (Of course, the manufacturer should have told
me that the card is crap; I bet it cannot even work with
VFAT/Windows).

Plus I'd hope some filesystem materializes that can handle 128KB
"block size"... because the el-cheapo card I have here is actually
pretty sane. It seems to store data I put on it, and should be safe to
use with huge block size...  

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: document ext3 requirements
  2009-01-04 19:49 ` Rob Landley
  2009-01-04 22:06   ` Theodore Tso
@ 2009-01-04 22:55   ` Pavel Machek
  2009-01-05  0:16     ` david
                       ` (2 more replies)
  1 sibling, 3 replies; 86+ messages in thread
From: Pavel Machek @ 2009-01-04 22:55 UTC (permalink / raw)
  To: Rob Landley
  Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

On Sun 2009-01-04 13:49:49, Rob Landley wrote:
> On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
> > +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> > +behaving disk subsystem, data that have been successfully synced will
> > +stay on the disk. Sane means:
> > +
> > +* writes to media never fail. Even if disk returns error condition during
> > +  write, ext3 can't handle that correctly, because success on fsync was
> > +  already returned when data hit the journal.
> > +
> > +	   (Fortunately writes failing are very uncommon on disks, as they
> > +	   have spare sectors they use when write fails.)
> > +
> > +* either whole sector is correctly written or nothing is written during
> > +  powerfail.
> > +
> > +	   (Unfortunately, none of the cheap USB/SD flash cards I have seen
> > +	   behave like this, so they are unsuitable for ext3.
> 
> Want to document the granularity issues with flash, while you're at it?
> 
> An inherent problem with using flash as a normal block device is that the 
> flash erase size is bigger than most filesystem sector sizes.  So when you 
> request a write, it may erase and rewrite the next 64k, 128k, or even a couple 
> megabytes on the really _big_ ones.
> 
> If you lose power in the middle of that, ext3 won't notice that data in the 
> "sectors" _after_ the one you were trying to write to got trashed.
> 
> The flash filesystems take this into account as part of their wear levelling 
> stuff (they normally copy the entire chunk into a new chunk, leaving the old 
> one in place until it's no longer needed), but they need to query the device 
> to get the erase granularity in order to do that, which is why they don't work 
> on non-flash block devices.

Is there a Linux filesystem that can handle that? I know jffs2, but
that's unsuitable for stuff like USB thumb drives, right?

Does this sound like a fair summary?

Sector writes are atomic (ATOMIC-SECTORS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Either whole sector is correctly written or nothing is written during
powerfail.

        Unfortunately, none of the cheap USB/SD flash cards I have seen
        behave like this, so they are unsuitable for all the Linux
        filesystems I know.

                An inherent problem with using flash as a normal block
                device is that the flash erase size is bigger than
                most filesystem sector sizes.  So when you request a
                write, it may erase and rewrite the next 64k, 128k, or
                even a couple megabytes on the really _big_ ones.

                If you lose power in the middle of that, filesystem
                won't notice that data in the "sectors" _after_ the
                one you were trying to write to got trashed.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: document ext3 requirements
  2009-01-04 19:21                         ` Geert Uytterhoeven
  2009-01-04 19:36                           ` Theodore Tso
@ 2009-01-04 22:42                           ` Bron Gondwana
  2009-01-05  3:22                           ` Rob Landley
  2 siblings, 0 replies; 86+ messages in thread
From: Bron Gondwana @ 2009-01-04 22:42 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Theodore Tso, Duane Griffin, Valdis.Kletnieks,
	Martin MOKREJŠ,
	Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

On Sun, Jan 04, 2009 at 08:21:06PM +0100, Geert Uytterhoeven wrote:
> On Sun, 4 Jan 2009, Theodore Tso wrote:
> > On Sun, Jan 04, 2009 at 02:24:43PM +0000, Duane Griffin wrote:
> > > > Is there a way using md/dm/lvm etc to make the source partition R/O and
> > > > > replay the journal onto a CoW snapshot?  Admittedly, not easy to do inside
> > > > the 'mount' command itself, but at least it might be workable for LiveCD R/O
> > > > mounts and forensics work, where you can *tell* beforehand that's what you
> > > > want and can jump through setup games before doing the mount...
> > > 
> > > Yes, something like that is best practice, as I understand it. The
> > > LiveCD init scripts could check whether they are about to R/O mount an
> > > ext[34] filesystem needing recovery and either refuse with a useful
> > > message to the user, or even automatically create and mount a COW
> > > snapshot, as you described. They'd still need to warn the user though,
> > > since things like remounting R/W wouldn't work as expected.
> > 
> > So what's the use case where people want to be able to mount a
> > filesystem needing recovery read/only without running the journal?
> 
> As mentioned before, suspending a laptop (running from hdd), running a live CD,
> and expecting everything to work fine when resuming from hdd?

Any particular reason why suspend doesn't run the journal during
shutdown and leave a clean filesystem?  It shouldn't take that
long surely.

I know it doesn't solve the "it really just crashed" problem, but
you don't tend to unsuspend from a crash anyway.

Bron ( just curious )


* Re: document ext3 requirements
  2009-01-04 18:38   ` Theodore Tso
@ 2009-01-04 22:37     ` Pavel Machek
  2009-01-04 23:58       ` Theodore Tso
  2009-01-05 11:43     ` Alan Cox
  1 sibling, 1 reply; 86+ messages in thread
From: Pavel Machek @ 2009-01-04 22:37 UTC (permalink / raw)
  To: Theodore Tso, Alexander E. Patrakov, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, Alan Cox

On Sun 2009-01-04 13:38:34, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 06:35:41PM +0500, Alexander E. Patrakov wrote:
> >
> > Ext3 means either hardware that supports barriers (not sure how to  
> > check
> 
> Pretty much all modern disk drives support barriers.  And note that
> w/o barriers ext3 has worked pretty well.  *If* you have a workload that
> pushes your system into a mode where it is very low on memory,
> so it is constantly paging/thrashing and you have a workload which is
> metadata intensive, and you crash the machine while it is thrashing,
> it is possible to end up in a situation where your filesystem is
> corrupted and you have to use e2fsck to correct the filesystem.  In

Are you sure you need to have thrashing? AFAICT metadata + fsync heavy
workload should be enough... and there were scripts to easily repeat
that.

> > Does this requirement apply to other  
> > journaling filesystems? Do I need journaling at all, given that I have  
> > an UPS on my desktop and a battery in the laptop?
> 
> Which requirement?  Barriers?  Most journaling filesystems simply
> enable barriers by default.  
> 
> And journalling is useful so that if your system crashes, say due to
> suspend and resume not working out, or the battery runs dry without
> your noticing it, you can avoid running fsck at boot time.  It's
> really more about shortening the boot time after a crash than
> anything else.

Actually, journalling with barrier=0 should still be "safe" in case of
kernel crashes (*), right? Because if just kernel is dead, disk
firmware will still write the cache back, AFAICT.
									Pavel

(*) kernel crashes that do not involve writing random garbage to disk.
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: document ext3 requirements
  2009-01-04  2:32 ` Theodore Tso
@ 2009-01-04 22:33   ` Pavel Machek
  0 siblings, 0 replies; 86+ messages in thread
From: Pavel Machek @ 2009-01-04 22:33 UTC (permalink / raw)
  To: Theodore Tso, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

Hi!

On Sat 2009-01-03 21:32:11, Theodore Tso wrote:
> On Sat, Jan 03, 2009 at 01:38:15PM +0100, Pavel Machek wrote:
> > +Requirements
> > +============
> > +
> > +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> > +behaving disk subsystem, data that have been successfully synced will
> > +stay on the disk. Sane means:
> > +
> > +* writes to media never fail. Even if disk returns error condition during
> > +  write, ext3 can't handle that correctly, because success on fsync was already
> > +  returned when data hit the journal.
> > +
> > +	   (Fortunately writes failing are very uncommon on disks, as they
> > +	   have spare sectors they use when write fails.)
> 
> This is not unique to ext3; per the discussion two weeks ago, this is
> largely because of the fsync() interface not possibly being able to

Ok, so I guess I should split the patch to truly ext3-specific part,
and the part that is common for all the filesystems. I guess I'll need
some help with everything but ext2 and ext3...

> return errors caused by failures when creating or modifying parent
> directories.  Given this, it's a bit misleading to place this in the
> Documentation/filesystems/ext3.txt.  At the minimum it should include
> a discussion about what the issues might be, and given that pretty
> much any Unix/Linux filesystem doesn't have a way of reflecting these
> errors to application programs, it probably should be in a
> filesystem-independent documentation file.

Ok. I'll have to think about a good name for that file.
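
On the fsync() point quoted above: an application that wants both the
file contents and the new directory entry on disk has to sync the
parent directory as a separate step, and an error on either step is
reported by a different call. A minimal sketch for Linux follows; the
`durable_write` name is made up for this example, and it says nothing
about what a given filesystem does internally:

```python
import os


def durable_write(path, data):
    """Write data to path, then fsync both the file and its directory."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Raises OSError if the kernel reports a write-back failure
        # for this file.
        os.fsync(fd)
    finally:
        os.close(fd)

    # The directory entry belongs to the parent directory, which
    # fsync() on the file does not necessarily cover, so sync it too.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Even with both calls succeeding, this only helps if the storage
underneath behaves as the requirements in this thread assume.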

> > +* either write caching is disabled, or hw can do barriers and they are enabled.
> > +
> > +	   (Note that barriers are disabled by default, use "barrier=1"
> > +	   mount option after making sure hw can support them). 
> 
> We really should get akpm to agree to accept the patch to enable
> barriers by default instead.  :-)

:-). Yes, that would help a bit.

(No, it is not a complete solution. barrier=0 with the write-back cache on
should still be documented as unsafe).
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: document ext3 requirements
  2009-01-04 22:06   ` Theodore Tso
@ 2009-01-04 22:25     ` Pavel Machek
  2009-01-04 23:07     ` Pavel Machek
  2009-01-05  1:38     ` Rob Landley
  2 siblings, 0 replies; 86+ messages in thread
From: Pavel Machek @ 2009-01-04 22:25 UTC (permalink / raw)
  To: Theodore Tso, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc

On Sun 2009-01-04 17:06:34, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 01:49:49PM -0600, Rob Landley wrote:
> > 
> > Want to document the granularity issues with flash, while you're at it?
> > 
> > An inherent problem with using flash as a normal block device is that the 
> > flash erase size is bigger than most filesystem sector sizes.  So when you 
> > request a write, it may erase and rewrite the next 64k, 128k, or even a couple 
> > megabytes on the really _big_ ones.
> > 
> > If you lose power in the middle of that, ext3 won't notice that data in the 
> > "sectors" _after_ the one you were trying to write to got trashed.
> 
> True enough, although the newer SSD's will have this problem addressed
> (although at least initially, they are **far** more costly than the
> el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
> alongside battery-powered shavers and trashy ipod speakers).
> 
> I will stress again, that most of this doesn't belong in
> Documentation/filesystems/ext3.txt, as most of this is *not*
> ext3-specific.

I've initially done the patch for ext3 because that's what I'm using
and because I felt responsible for documenting it after a huge thread.

At least barrier=1 seems to be ext3 specific, and perhaps logfs or
something can survive full eraseblocks disappearing. Anyway, I guess
we all agree that this needs to be documented _somewhere_, and that's
what I'm trying to do.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: document ext3 requirements
  2009-01-04 19:49 ` Rob Landley
@ 2009-01-04 22:06   ` Theodore Tso
  2009-01-04 22:25     ` Pavel Machek
                       ` (2 more replies)
  2009-01-04 22:55   ` Pavel Machek
  1 sibling, 3 replies; 86+ messages in thread
From: Theodore Tso @ 2009-01-04 22:06 UTC (permalink / raw)
  To: Rob Landley
  Cc: Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

On Sun, Jan 04, 2009 at 01:49:49PM -0600, Rob Landley wrote:
> 
> Want to document the granularity issues with flash, while you're at it?
> 
> An inherent problem with using flash as a normal block device is that the 
> flash erase size is bigger than most filesystem sector sizes.  So when you 
> request a write, it may erase and rewrite the next 64k, 128k, or even a couple 
> megabytes on the really _big_ ones.
> 
> If you lose power in the middle of that, ext3 won't notice that data in the 
> "sectors" _after_ the one you were trying to write to got trashed.

True enough, although the newer SSD's will have this problem addressed
(although at least initially, they are **far** more costly than the
el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
alongside battery-powered shavers and trashy ipod speakers).

I will stress again, that most of this doesn't belong in
Documentation/filesystems/ext3.txt, as most of this is *not*
ext3-specific.

						- Ted

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04 21:55                               ` Theodore Tso
@ 2009-01-04 22:06                                 ` Duane Griffin
  0 siblings, 0 replies; 86+ messages in thread
From: Duane Griffin @ 2009-01-04 22:06 UTC (permalink / raw)
  To: Theodore Tso, Duane Griffin, Geert Uytterhoeven,
	Valdis.Kletnieks, Martin MOKREJŠ,
	Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

2009/1/4 Theodore Tso <tytso@mit.edu>:
> On Sun, Jan 04, 2009 at 07:51:27PM +0000, Duane Griffin wrote:
>>
>> If anyone is interested I'd be happy to dust off and send them my old
>> patches to implement this. There are a couple of issues with it.
>> First, I never got around to implementing remount R/W support. Second,
>> I had to introduce a rather nasty hack in order to handle un-escaping
>> JFS magic numbers.
>
> Can you dust off the patches and send a copy to
> linux-ext4@vger.kernel.org so we have them archived someplace where
> hopefully someone might have time to look at it?

OK, will do. I've posted them there before, but not the latest version
that properly handles un-escaping JFS magic numbers (albeit in an ugly
way). I'll rebase them on top of the latest ext4 patch queue and
repost.

>                                          - Ted

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04 19:51                             ` Duane Griffin
@ 2009-01-04 21:55                               ` Theodore Tso
  2009-01-04 22:06                                 ` Duane Griffin
  0 siblings, 1 reply; 86+ messages in thread
From: Theodore Tso @ 2009-01-04 21:55 UTC (permalink / raw)
  To: Duane Griffin
  Cc: Geert Uytterhoeven, Valdis.Kletnieks, Martin MOKREJŠ,
	Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

On Sun, Jan 04, 2009 at 07:51:27PM +0000, Duane Griffin wrote:
> 
> If anyone is interested I'd be happy to dust off and send them my old
> patches to implement this. There are a couple of issues with it.
> First, I never got around to implementing remount R/W support. Second,
> I had to introduce a rather nasty hack in order to handle un-escaping
> JFS magic numbers.

Can you dust off the patches and send a copy to
linux-ext4@vger.kernel.org so we have them archived someplace where
hopefully someone might have time to look at it?

	  	  	     	     	  - Ted

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 23:58             ` Robert Hancock
  2009-01-04  0:08               ` Martin MOKREJŠ
@ 2009-01-04 21:49               ` Ingo Oeser
  1 sibling, 0 replies; 86+ messages in thread
From: Ingo Oeser @ 2009-01-04 21:49 UTC (permalink / raw)
  To: Robert Hancock
  Cc: Martin MOKREJŠ,
	Duane Griffin, Pavel Machek, kernel list, Andrew Morton, tytso,
	mtk.manpages, rdunlap, linux-doc

On Sunday 04 January 2009, Robert Hancock wrote:
> I agree, there should be a way to force it to mount "really read only" 
> so it doesn't try to replay the journal. That might require just 
> ignoring the journal content, which may result in the FS appearing 
> corrupt, but for recovery/forensics purposes that seems better than 
> nothing..

For forensics you ALWAYS get a copy of the full disk first, 
which you set read only with blockdev --setro /dev/$MYDISK.

You then restore from this copy.
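
As a rough illustration of the "copy first" step, here is a minimal Python
sketch that reads a source (a device node or an image file) and writes a copy
while computing a checksum, so that later work can be verified against the
original. The function name image_device and the 1 MiB chunk size are
assumptions made for this example, and the sketch does nothing clever about
read errors on failing media.

```python
import hashlib


def image_device(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src_path to dst_path; return the SHA-256 hex digest of the data.

    The source is only ever opened for reading. Read errors are not
    handled here: on bad media the copy simply raises an exception.
    """
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:          # end of device or file
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the digest lets you confirm at any later point that the working
copy still matches what was read from the disk.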


Best Regards

Ingo Oeser, been there, done that

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04 13:35 ` Alexander E. Patrakov
                     ` (2 preceding siblings ...)
  2009-01-04 18:38   ` Theodore Tso
@ 2009-01-04 20:10   ` Pavel Machek
  3 siblings, 0 replies; 86+ messages in thread
From: Pavel Machek @ 2009-01-04 20:10 UTC (permalink / raw)
  To: Alexander E. Patrakov
  Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap,
	linux-doc, Alan Cox

On Sun 2009-01-04 18:35:41, Alexander E. Patrakov wrote:
> Pavel Machek wrote:
> [CC: Alan Cox because of his reply in the "XFS internal error" thread]
>
>> Using ext3 is only safe if storage subsystem meets certain
>> criteria. Document those.
>
> Thanks for this patch. However, after reading this, I have a stupid  
> question: which file system should I use if I had to reinstall my  
> computers from scratch now?

ext2 is still the safest default... if you can live with fsck.

ext3 is the safest of the journalling ones, AFAICT.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 23:01       ` Martin MOKREJŠ
  2009-01-03 23:38         ` Duane Griffin
  2009-01-04  0:19         ` Pavel Machek
@ 2009-01-04 19:56         ` Rob Landley
  2009-01-05 19:16           ` Theodore Tso
  2009-01-06 10:08         ` Matthias Andree
  3 siblings, 1 reply; 86+ messages in thread
From: Rob Landley @ 2009-01-04 19:56 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: Pavel Machek, Duane Griffin, kernel list, Andrew Morton, tytso,
	mtk.manpages, rdunlap, linux-doc

On Saturday 03 January 2009 17:01:58 Martin MOKREJŠ wrote:
> > Still handy for recovering badly broken filesystems, I'd say.
>
> Me as well. How about improving your doc patch with some summary of
> this thread (although it is probably not over yet)? ;-) Definitely,
> a note that one can mount it as ext2 while read-only would be helpful
> when doing some forensics on the disk.

Although make sure you _do_ mount it read-only: when I've mounted an ext3 
filesystem read/write as ext2, I've had it zap the journal entirely, and then 
you have to tune2fs -j the sucker to turn it back into ext3.

Ext3 is... touchy.

Rob

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04 19:36                           ` Theodore Tso
@ 2009-01-04 19:51                             ` Duane Griffin
  2009-01-04 21:55                               ` Theodore Tso
  0 siblings, 1 reply; 86+ messages in thread
From: Duane Griffin @ 2009-01-04 19:51 UTC (permalink / raw)
  To: Theodore Tso, Geert Uytterhoeven, Duane Griffin,
	Valdis.Kletnieks, Martin MOKREJŠ,
	Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

2009/1/4 Theodore Tso <tytso@mit.edu>:
> On Sun, Jan 04, 2009 at 08:21:06PM +0100, Geert Uytterhoeven wrote:
>> As for mounting the root file system read-only during early boot up, and
>> remounting it read-write later, I guess it's quite complicated to replay the
>> journal (in RAM) on read-only mount, and deferring the replay writeback until
>> remounting read-write?
>
> It's not *that* hard; if someone would like to cons up a patch, please
> feel free....  but it's certainly not a high priority for me or most
> of the other ext3 filesystem developers.

If anyone is interested I'd be happy to dust off and send them my old
patches to implement this. There are a couple of issues with it.
First, I never got around to implementing remount R/W support. Second,
I had to introduce a rather nasty hack in order to handle un-escaping
JFS magic numbers.

>                                        - Ted

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 12:38 Pavel Machek
                   ` (2 preceding siblings ...)
  2009-01-04 13:35 ` Alexander E. Patrakov
@ 2009-01-04 19:49 ` Rob Landley
  2009-01-04 22:06   ` Theodore Tso
  2009-01-04 22:55   ` Pavel Machek
  3 siblings, 2 replies; 86+ messages in thread
From: Rob Landley @ 2009-01-04 19:49 UTC (permalink / raw)
  To: Pavel Machek
  Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* writes to media never fail. Even if disk returns error condition during
> +  write, ext3 can't handle that correctly, because success on fsync was already
> +  returned when data hit the journal.
> +
> +	   (Fortunately writes failing are very uncommon on disks, as they
> +	   have spare sectors they use when write fails.)
> +
> +* either whole sector is correctly written or nothing is written during
> +  powerfail.
> +
> +	   (Unfortunately, none of the cheap USB/SD flash cards I've seen behave
> +	   like this, so they are unsuitable for ext3.

Want to document the granularity issues with flash, while you're at it?

An inherent problem with using flash as a normal block device is that the 
flash erase size is bigger than most filesystem sector sizes.  So when you 
request a write, it may erase and rewrite the next 64k, 128k, or even a couple 
megabytes on the really _big_ ones.

If you lose power in the middle of that, ext3 won't notice that data in the 
"sectors" _after_ the one you were trying to write to got trashed.

The flash filesystems take this into account as part of their wear levelling 
stuff (they normally copy the entire chunk into a new chunk, leaving the old 
one in place until it's no longer needed), but they need to query the device 
to get the erase granularity in order to do that, which is why they don't work 
on non-flash block devices.
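
To make the failure mode concrete, here is a toy Python model of a naive
in-place update on flash. It is an illustration only: the 512-byte sector,
the 128k erase block, and the helper names are assumptions for the example,
and real controllers differ in the details.

```python
SECTOR = 512                    # filesystem-visible sector size (assumed)
ERASE_BLOCK = 128 * 1024        # assumed erase-block size; real devices vary
SECTORS_PER_BLOCK = ERASE_BLOCK // SECTOR


def sectors_at_risk(sector_no, sectors_per_block=SECTORS_PER_BLOCK):
    """Return the range of sectors that share an erase block with sector_no."""
    first = (sector_no // sectors_per_block) * sectors_per_block
    return range(first, first + sectors_per_block)


def write_with_power_cut(device, sector_no, data, power_fails):
    """Naive in-place update: erase the whole block, then reprogram it.

    device is a list of per-sector bytes objects. If power fails after the
    erase, every sector in the block is lost, not just the one being written.
    """
    block = sectors_at_risk(sector_no)
    saved = {n: device[n] for n in block}    # old contents, held only in RAM
    for n in block:
        device[n] = b"\xff" * SECTOR         # erased flash reads as all ones
    if power_fails:
        return                               # the RAM copy is gone with the power
    saved[sector_no] = data
    for n in block:
        device[n] = saved[n]
```

With these numbers, a write to sector 300 puts sectors 256 through 511 at
risk, which is exactly the damage a filesystem that only tracks the sector
it asked to write will not notice.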

Rob

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04 19:21                         ` Geert Uytterhoeven
@ 2009-01-04 19:36                           ` Theodore Tso
  2009-01-04 19:51                             ` Duane Griffin
  2009-01-04 22:42                           ` Bron Gondwana
  2009-01-05  3:22                           ` Rob Landley
  2 siblings, 1 reply; 86+ messages in thread
From: Theodore Tso @ 2009-01-04 19:36 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Duane Griffin, Valdis.Kletnieks, Martin MOKREJŠ,
	Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

On Sun, Jan 04, 2009 at 08:21:06PM +0100, Geert Uytterhoeven wrote:
> As mentioned before, suspending a laptop (running from hdd), running
> a live CD, and expecting everything to work fine when resuming from
> hdd?
> 
> I think most people get shocked when they discover that mounting
> something read-only may actually write to the media. This is a bit
> unexpected (hey, if I mount `read-only', I expect that no writes
> will happen), as it behaved differently before the introduction of
> journalling.

It's been this way for about a decade....  that being said, if you
really want to do this, you can today via "mount -o ro,noload /dev/XXX
/mntpt".  However, the system could crash or fail, because a
filesystem mounted without replaying its journal could be quite inconsistent.

> As for mounting the root file system read-only during early boot up, and
> remounting it read-write later, I guess it's quite complicated to replay the
> journal (in RAM) on read-only mount, and deferring the replay writeback until
> remounting read-write?

It's not *that* hard; if someone would like to cons up a patch, please
feel free....  but it's certainly not a high priority for me or most
of the other ext3 filesystem developers.

					- Ted

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04 18:40                       ` Theodore Tso
@ 2009-01-04 19:21                         ` Geert Uytterhoeven
  2009-01-04 19:36                           ` Theodore Tso
                                             ` (2 more replies)
  0 siblings, 3 replies; 86+ messages in thread
From: Geert Uytterhoeven @ 2009-01-04 19:21 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Duane Griffin, Valdis.Kletnieks, Martin MOKREJŠ,
	Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

On Sun, 4 Jan 2009, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 02:24:43PM +0000, Duane Griffin wrote:
> > > Is there a way using md/dm/lvm etc to make the source partition R/O and
> > > replay the journal onto a CoW snapshot?  Admittedly, not easy to do inside
> > > the 'mount' command itself, but at least it might be workable for LiveCD R/O
> > > mounts and forensics work, where you can *tell* beforehand that's what you
> > > want and can jump through setup games before doing the mount...
> > 
> > Yes, something like that is best practice, as I understand it. The
> > LiveCD init scripts could check whether they are about to R/O mount an
> > ext[34] filesystem needing recovery and either refuse with a useful
> > message to the user, or even automatically create and mount a COW
> > snapshot, as you described. They'd still need to warn the user though,
> > since things like remounting R/W wouldn't work as expected.
> 
> So what's the use case where people want to be able to mount a
> filesystem needing recovery read/only without running the journal?

As mentioned before, suspending a laptop (running from hdd), running a live CD,
and expecting everything to work fine when resuming from hdd?

I think most people get shocked when they discover that mounting something
read-only may actually write to the media. This is a bit unexpected (hey, if I
mount `read-only', I expect that no writes will happen), as it behaved
differently before the introduction of journalling.

As for mounting the root file system read-only during early boot up, and
remounting it read-write later, I guess it's quite complicated to replay the
journal (in RAM) on read-only mount, and deferring the replay writeback until
remounting read-write?

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04 14:24                     ` Duane Griffin
@ 2009-01-04 18:40                       ` Theodore Tso
  2009-01-04 19:21                         ` Geert Uytterhoeven
  0 siblings, 1 reply; 86+ messages in thread
From: Theodore Tso @ 2009-01-04 18:40 UTC (permalink / raw)
  To: Duane Griffin
  Cc: Valdis.Kletnieks, Martin MOKREJŠ,
	Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

On Sun, Jan 04, 2009 at 02:24:43PM +0000, Duane Griffin wrote:
> > Is there a way using md/dm/lvm etc to make the source partition R/O and
> > replay the journal onto a CoW snapshot?  Admittedly, not easy to do inside
> > the 'mount' command itself, but at least it might be workable for LiveCD R/O
> > mounts and forensics work, where you can *tell* beforehand that's what you
> > want and can jump through setup games before doing the mount...
> 
> Yes, something like that is best practice, as I understand it. The
> LiveCD init scripts could check whether they are about to R/O mount an
> ext[34] filesystem needing recovery and either refuse with a useful
> message to the user, or even automatically create and mount a COW
> snapshot, as you described. They'd still need to warn the user though,
> since things like remounting R/W wouldn't work as expected.

So what's the use case where people want to be able to mount a
filesystem needing recovery read/only without running the journal?

	   	   	    	      	      	      	  - Ted

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04 13:35 ` Alexander E. Patrakov
  2009-01-04 13:53   ` Valdis.Kletnieks
  2009-01-04 18:21   ` Michael Tokarev
@ 2009-01-04 18:38   ` Theodore Tso
  2009-01-04 22:37     ` Pavel Machek
  2009-01-05 11:43     ` Alan Cox
  2009-01-04 20:10   ` Pavel Machek
  3 siblings, 2 replies; 86+ messages in thread
From: Theodore Tso @ 2009-01-04 18:38 UTC (permalink / raw)
  To: Alexander E. Patrakov
  Cc: Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, Alan Cox

On Sun, Jan 04, 2009 at 06:35:41PM +0500, Alexander E. Patrakov wrote:
>
> Ext3 means either hardware that supports barriers (not sure how to  
> check

Pretty much all modern disk drives support barriers.  And note that
w/o barriers ext3 has worked pretty well.  *If* you have a workload
which pushes your system into a mode where it is very low on memory,
so it is constantly paging/thrashing, and that workload is
metadata intensive, and you crash the machine while it is thrashing,
it is possible to end up in a situation where your filesystem is
corrupted and you have to use e2fsck to correct the filesystem.  In
practice this is often not the case, which is why the default for ext3
has been with barriers disabled, and most people have not noted major
problems.  This is why Andrew Morton has refused to accept the patch for
ext3 which enables barriers by default; he's not convinced the
performance hit is worth the improvement in reliability.
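
For readers who want to see why ordering matters at all, here is a toy
Python model. It is only a sketch: it assumes a drive whose write-back cache
may flush queued writes in any order, and it treats a barrier as nothing
more than "the commit record goes out after every journal block". It is not
a description of how jbd is implemented, and the function names are made up
for the example.

```python
from itertools import permutations


def crash_states(journal_blocks, commit, barrier):
    """Enumerate every set of writes that may be on the media at power loss.

    With barrier=True the commit record is written only after all journal
    blocks; with barrier=False the drive may flush in any order.
    """
    blocks = tuple(journal_blocks)
    if barrier:
        orders = [p + (commit,) for p in permutations(blocks)]
    else:
        orders = list(permutations(blocks + (commit,)))
    states = set()
    for order in orders:
        for cut in range(len(order) + 1):    # power may fail after any write
            states.add(frozenset(order[:cut]))
    return states


def is_inconsistent(state, journal_blocks, commit):
    """A commit record on the media without all the blocks it covers is bad."""
    return commit in state and not set(journal_blocks) <= state
```

In this model, some crash states without the barrier contain the commit
record but not every journal block, while no crash state with the barrier
does. How often a real system hits that window is a separate question,
which is what the paragraph above is about.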

Ext4 does enable barriers by default, mainly because filesystem
developers tend to believe that reliability is more important than
performance.  (On the other hand, Google runs with ext2 w/o
journalling, because everything is replicated three times and it's
easier to just blow away the filesystem and resync from one of the
duplicate copies; so in the right circumstances, maybe worrying only
about performance and ignoring reliability makes perfect sense.)

> and anyway I have to use encryption on the work laptop due to the  
> corporate policy

If dm supported barriers, this wouldn't be an issue.  Personally, I
find the convenience of LVM is so useful that I use ext4 with LVM,
even though the barrier requests get dropped on the ground.  And I'm a
kernel developer, and I use a laptop with suspend/resume, which means
I often crash uncleanly --- and I've not lost data yet, despite the
lack of barriers.  (On the other hand, my laptop has 4 gigs of memory,
so I'm rarely thrashing due to memory pressure.)

> or disabling write cache (but, as Alan Cox said, this  
> shortens the lifespan of the disk).

Huh?  I've never heard an assertion that disabling the write cache (I
assume you mean using write-through caching as opposed to write-back
caching), shortens the lifespan of disk drives.  Aggressive battery
saving mode is far more likely to shorten disk drive life, due to
spinning the platters up and down a lot.   

> Does this requirement apply to other  
> journaling filesystems? Do I need journaling at all, given that I have  
> an UPS on my desktop and a battery in the laptop?

Which requirement?  Barriers?  Most journaling filesystems simply
enable barriers by default.  

And journalling is useful so that if your system crashes, say due to
suspend and resume not working out, or the battery runs dry without
your noticing it, you can avoid running fsck at boot time.  It's
really more about shortening the boot time after a crash than
anything else.

					- Ted

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04 13:35 ` Alexander E. Patrakov
  2009-01-04 13:53   ` Valdis.Kletnieks
@ 2009-01-04 18:21   ` Michael Tokarev
  2009-01-04 18:38   ` Theodore Tso
  2009-01-04 20:10   ` Pavel Machek
  3 siblings, 0 replies; 86+ messages in thread
From: Michael Tokarev @ 2009-01-04 18:21 UTC (permalink / raw)
  To: Alexander E. Patrakov
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc, Alan Cox

Alexander E. Patrakov wrote:
[]
> Ext3 means either hardware that supports barriers (not sure how to
> check, and anyway I have to use encryption on the work laptop due to the
> corporate policy) or disabling write cache (but, as Alan Cox said, this
> shortens the lifespan of the disk). Does this requirement apply to other
> journaling filesystems? Do I need journaling at all, given that I have
> an UPS on my desktop and a battery in the laptop?

There's another possibility too, somewhat more risky.  Namely, run with
write cache ON by default, and switch it off when running off battery
(either UPS or notebook).  Should give the best of both worlds, PROVIDED
the battery/UPS actually works :)

/mjt



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04  3:52                   ` Valdis.Kletnieks
@ 2009-01-04 14:24                     ` Duane Griffin
  2009-01-04 18:40                       ` Theodore Tso
  0 siblings, 1 reply; 86+ messages in thread
From: Duane Griffin @ 2009-01-04 14:24 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Martin MOKREJŠ,
	Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

2009/1/4  <Valdis.Kletnieks@vt.edu>:
> On Sun, 04 Jan 2009 00:41:51 GMT, Duane Griffin said:
>
>> I agree, it isn't a great situation. Nonetheless, it has always been
>> thus for ext3, and so far we've muddled along. Unless and until we can
>> replay the journal in-memory without touching the on-disk data, we are
>> stuck with it.
>
> Is there a way using md/dm/lvm etc to make the source partition R/O and
> replay the journal onto a CoW snapshot?  Admittedly, not easy to do inside
> the 'mount' command itself, but at least it might be workable for LiveCD R/O
> mounts and forensics work, where you can *tell* beforehand that's what you
> want and can jump through setup games before doing the mount...

Yes, something like that is best practice, as I understand it. The
LiveCD init scripts could check whether they are about to R/O mount an
ext[34] filesystem needing recovery and either refuse with a useful
message to the user, or even automatically create and mount a COW
snapshot, as you described. They'd still need to warn the user though,
since things like remounting R/W wouldn't work as expected.
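
A check of that kind could look roughly like the Python sketch below. The
offsets and flag values are my reading of the ext2/ext3 on-disk superblock
layout (superblock at byte 1024, s_magic at offset 56, s_feature_compat at
92, s_feature_incompat at 96) and should be verified against the kernel
headers before anyone relies on them; journal_state is a hypothetical
helper name.

```python
import struct

SUPERBLOCK_OFFSET = 1024              # superblock starts 1024 bytes into the device
EXT_MAGIC = 0xEF53                    # s_magic for ext2/ext3/ext4
FEATURE_COMPAT_HAS_JOURNAL = 0x0004   # bit in s_feature_compat
FEATURE_INCOMPAT_RECOVER = 0x0004     # bit in s_feature_incompat: needs recovery


def journal_state(path):
    """Return journal flags for an ext[234] device or image, or None.

    Opens the path read-only and never writes to it.
    """
    with open(path, "rb") as f:
        f.seek(SUPERBLOCK_OFFSET)
        sb = f.read(1024)
    if len(sb) < 104:
        return None                   # too short to hold the fields we need
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT_MAGIC:
        return None                   # not an ext2/3/4 superblock
    compat, incompat = struct.unpack_from("<II", sb, 92)
    return {
        "has_journal": bool(compat & FEATURE_COMPAT_HAS_JOURNAL),
        "needs_recovery": bool(incompat & FEATURE_INCOMPAT_RECOVER),
    }
```

An init script could call this before mounting and, when needs_recovery is
set, either refuse with a message or set up a snapshot first.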

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04 13:35 ` Alexander E. Patrakov
@ 2009-01-04 13:53   ` Valdis.Kletnieks
  2009-01-04 18:21   ` Michael Tokarev
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 86+ messages in thread
From: Valdis.Kletnieks @ 2009-01-04 13:53 UTC (permalink / raw)
  To: Alexander E. Patrakov
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc, Alan Cox

[-- Attachment #1: Type: text/plain, Size: 1097 bytes --]

On Sun, 04 Jan 2009 18:35:41 +0500, "Alexander E. Patrakov" said:

> Ext3 means either hardware that supports barriers (not sure how to 
> check, and anyway I have to use encryption on the work laptop due to the 
> corporate policy) or disabling write cache (but, as Alan Cox said, this 
> shortens the lifespan of the disk).

False dichotomy.  This isn't an "either/or", as there's a *third* case:

"understand the issues and risks involved if you have a write cache and
no barrier support, and learn to deal with it".

As you point out, if it's a laptop with a battery, the risk may be *very* low.
Let's say there's a 1 in 10,000 chance that you'll trash a file system and
need to restore from backups.

That may be totally acceptable if you've already estimated a 1 in 500 chance
of the whole damned laptop going walkies while you're not looking, and then
you *still* need to be able to restore from backups onto a replacement machine.

Yes, for some systems, the whole barriers/write cache thing is in fact very
important.  But for others, data loss due to spilled coffee is a bigger worry...
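
As a worked version of that comparison, here is a small Python sketch. It
uses the two illustrative odds from above and assumes the two events are
independent, which is a simplification made for the example.

```python
def p_restore_needed(*risks):
    """Probability that at least one of several independent events occurs."""
    p_none = 1.0
    for p in risks:
        p_none *= (1.0 - p)       # chance that this event does not happen
    return 1.0 - p_none


p_trashed_fs = 1 / 10_000         # illustrative figure from the text above
p_laptop_gone = 1 / 500           # illustrative figure from the text above
p_total = p_restore_needed(p_trashed_fs, p_laptop_gone)
```

With these figures p_total comes out at about 0.0021, so the filesystem
risk raises the chance of needing the backups by only about 5% over the
risk of losing the laptop alone.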

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 12:38 Pavel Machek
  2009-01-03 21:17 ` Martin MOKREJŠ
  2009-01-04  2:32 ` Theodore Tso
@ 2009-01-04 13:35 ` Alexander E. Patrakov
  2009-01-04 13:53   ` Valdis.Kletnieks
                     ` (3 more replies)
  2009-01-04 19:49 ` Rob Landley
  3 siblings, 4 replies; 86+ messages in thread
From: Alexander E. Patrakov @ 2009-01-04 13:35 UTC (permalink / raw)
  To: Pavel Machek
  Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap,
	linux-doc, Alan Cox

Pavel Machek wrote:
[CC: Alan Cox because of his reply in the "XFS internal error" thread]

> Using ext3 is only safe if storage subsystem meets certain
> criteria. Document those.

Thanks for this patch. However, after reading this, I have a stupid 
question: which file system should I use if I had to reinstall my 
computers from scratch now?

Ext3 means either hardware that supports barriers (not sure how to 
check, and anyway I have to use encryption on the work laptop due to the 
corporate policy) or disabling write cache (but, as Alan Cox said, this 
shortens the lifespan of the disk). Does this requirement apply to other 
journaling filesystems? Do I need journaling at all, given that I have 
an UPS on my desktop and a battery in the laptop?

-- 
Alexander E. Patrakov

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04  0:41                 ` Duane Griffin
@ 2009-01-04  3:52                   ` Valdis.Kletnieks
  2009-01-04 14:24                     ` Duane Griffin
  0 siblings, 1 reply; 86+ messages in thread
From: Valdis.Kletnieks @ 2009-01-04  3:52 UTC (permalink / raw)
  To: Duane Griffin
  Cc: Martin MOKREJŠ,
	Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

[-- Attachment #1: Type: text/plain, Size: 665 bytes --]

On Sun, 04 Jan 2009 00:41:51 GMT, Duane Griffin said:

> I agree, it isn't a great situation. Nonetheless, it has always been
> thus for ext3, and so far we've muddled along. Unless and until we can
> replay the journal in-memory without touching the on-disk data, we are
> stuck with it.

Is there a way using md/dm/lvm etc to make the source partition R/O and
replay the journal onto a CoW snapshot?  Admittedly, not easy to do inside
the 'mount' command itself, but at least it might be workable for LiveCD R/O
mounts and forensics work, where you can *tell* beforehand that's what you
want and can jump through setup games before doing the mount...
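
The idea can be shown with a toy Python model in which the origin is
immutable and all writes land in an overlay. This only illustrates
copy-on-write semantics; in practice the role would be played by something
like a device-mapper snapshot, and the class and function names here are
made up for the example.

```python
class CowSnapshot:
    """Toy copy-on-write view over a read-only sequence of blocks."""

    def __init__(self, origin):
        self._origin = origin     # never modified by this class
        self._overlay = {}        # block number -> replacement data

    def read(self, block_no):
        """Return the overlay copy if one exists, else the origin block."""
        return self._overlay.get(block_no, self._origin[block_no])

    def write(self, block_no, data):
        """Record the write in the overlay only."""
        self._overlay[block_no] = data


def replay(journal, target):
    """Apply (block_no, data) journal records to target in order."""
    for block_no, data in journal:
        target.write(block_no, data)
```

After replay(journal, snapshot), reads through the snapshot see the
recovered filesystem while the origin blocks are untouched.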

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 12:38 Pavel Machek
  2009-01-03 21:17 ` Martin MOKREJŠ
@ 2009-01-04  2:32 ` Theodore Tso
  2009-01-04 22:33   ` Pavel Machek
  2009-01-04 13:35 ` Alexander E. Patrakov
  2009-01-04 19:49 ` Rob Landley
  3 siblings, 1 reply; 86+ messages in thread
From: Theodore Tso @ 2009-01-04  2:32 UTC (permalink / raw)
  To: Pavel Machek; +Cc: kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sat, Jan 03, 2009 at 01:38:15PM +0100, Pavel Machek wrote:
> +Requirements
> +============
> +
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* writes to media never fail. Even if disk returns error condition during
> +  write, ext3 can't handle that correctly, because success on fsync was already
> +  returned when data hit the journal.
> +
> +	   (Fortunately writes failing are very uncommon on disks, as they
> +	   have spare sectors they use when write fails.)

This is not unique to ext3; per the discussion two weeks ago, this is
largely because the fsync() interface cannot return errors caused by
failures when creating or modifying parent directories.  Given this,
it's a bit misleading to place this in
Documentation/filesystems/ext3.txt.  At the minimum it should include
a discussion about what the issues might be, and given that pretty
much no Unix/Linux filesystem has a way of reflecting these
errors to application programs, it probably should be in a
filesystem-independent documentation file.

> +* either whole sector is correctly written or nothing is written during
> +  powerfail.
> +
> +	   (Unfortunately, none of the cheap USB/SD flash cards I've seen behave
> +	   like this, so they are unsuitable for ext3. Because RAM tends to fail
> +	   faster than rest of system during powerfail, special hw killing
> +	   DMA transfers may be necessary. Not sure how common that problem
> +	   is on generic PC machines).

Again, this is true for other filesystems (it was first discovered on
SGI "pizza box" machines running XFS, and special hardware changes were
added to allow DMA aborts) --- in fact, because of ext3's use of
physical block journaling, it's much more likely that it will recover
from these sorts of errors.  So it's very misleading to have this sort
of discussion in Documentation/filesystems/ext3.txt.

> +* either write caching is disabled, or hw can do barriers and they are enabled.
> +
> +	   (Note that barriers are disabled by default, use "barrier=1"
> +	   mount option after making sure hw can support them). 

We really should get akpm to agree to accept the patch to enable
barriers by default instead.  :-)

							- Ted

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04  0:11               ` Martin MOKREJŠ
@ 2009-01-04  0:41                 ` Duane Griffin
  2009-01-04  3:52                   ` Valdis.Kletnieks
  0 siblings, 1 reply; 86+ messages in thread
From: Duane Griffin @ 2009-01-04  0:41 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

2009/1/4 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> Duane Griffin wrote:
>> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>>> Why does not "mount -ro" die when it would have to replay the journal
>>> with a message that user must run fsck.ext3 in order to be able to mount
>>> it albeit read-only? Still I would prefer having an extra switch to
>>> force mount RO while not touching the journal for disk forensics.
>>> I think that would also prevent the cases when a LiveCD/rescue distribution
>>> would not mount+replay it automagically but user would really have to
>>> provide the switch to the command. I am really not using the recovery
>>> boot cd to touch my partitions in some cases unwillingly.
>>
>> Well, that would make things rather tricky. As in, shutting down
>> uncleanly would render your system unbootable.
>
> ??? If I am booted off a CD/DVD drive I just do not want my system
> to be touched. I am fine if the dist mounts my drives automagically
> in read-only mode but if that currently forces journal replay then no,
> thanks. ;)

I agree, it isn't a great situation. Nonetheless, it has always been
thus for ext3, and so far we've muddled along. Unless and until we can
replay the journal in-memory without touching the on-disk data, we are
stuck with it.

We can't refuse to mount an unclean FS, as that would break booting.
We also can't ignore the journal by default, if/when we get a patch to
do so at all, as that effectively corrupts random chunks of the FS.
Fine for forensics and recovery; not so much for booting from.
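
A toy sketch of what "replay the journal in-memory without touching the
on-disk data" could mean: committed journal blocks are kept in an overlay
that reads consult first, and the underlying device is never written. The
class and its transaction format are hypothetical; this is not how the
kernel's journaling code is structured, and it ignores revoke records, the
real journal format, and every other detail that makes the actual problem
hard.

```python
class ReplayOverlay:
    """Read-only view of a block device with journal blocks applied in memory."""

    def __init__(self, read_block):
        self._read_block = read_block   # callable: block number -> bytes
        self._overlay = {}              # block number -> replayed contents

    def replay(self, committed_transactions):
        # Apply transactions oldest first so later copies of a block win.
        for transaction in committed_transactions:
            for block_no, data in transaction:
                self._overlay[block_no] = data

    def read(self, block_no):
        # Prefer the replayed copy; fall back to the untouched device.
        if block_no in self._overlay:
            return self._overlay[block_no]
        return self._read_block(block_no)
```

Each transaction here is simply a list of (block number, data) pairs, which
is an assumption made for the example.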

> M.

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 23:01       ` Martin MOKREJŠ
  2009-01-03 23:38         ` Duane Griffin
@ 2009-01-04  0:19         ` Pavel Machek
  2009-01-05  2:55           ` Rob Landley
  2009-01-04 19:56         ` Rob Landley
  2009-01-06 10:08         ` Matthias Andree
  3 siblings, 1 reply; 86+ messages in thread
From: Pavel Machek @ 2009-01-04  0:19 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: Duane Griffin, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

On Sun 2009-01-04 00:01:58, Martin MOKREJŠ wrote:
> Pavel Machek wrote:
> > On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
> >> [Fixed top-posting]
> >>
> >> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> >>> Pavel Machek wrote:
> >>>> readonly mount does actually write to the media in some cases. Document that.
> >>>>
> >>> Can one avoid replay of the journal then if it would be unclean?
> >>> Just curious.
> >> Nope. If the underlying block device is read-only then mounting the
> >> filesystem will fail. I tried to fix this some time ago, and have a
> >> set of patches that almost always work, but "almost always" isn't good
> >> enough. Unfortunately I never managed to figure out a way to finish it
> >> off without disgusting hacks or major surgery.
> > 
> > Uhuh, can you just ignore the journal and mount it anyway?
> > ...basically treating it like an ext2?
> > 
> > ...ok, that will present "old" version of the filesystem to the
> > user... violating fsync() semantics.
> 
> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
> accidentally in M$ Win where I use ext2 IFS driver and modify some
> stuff on the ext3 drive, after a while reboot to linux and the journal
> get re-played ... Mmm ...

ext2 driver should refuse to mount dirty ext3 filesystem. (Linux ext2
driver does that).

> > Still handy for recovering badly broken filesystems, I'd say.
> 
> Me as well. How about improving your doc patch with some summary of
> this thread (although it is probably not over yet)? ;-) Definitely,
> a note that one can mount it as ext2 while read-only would be helpful
> when doing some forensics on the disk.

No, you can't mount unclean ext3 as an ext2; patch to do that would be
possible but...

I believe the patch is correct & useful.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-04  0:00             ` Duane Griffin
@ 2009-01-04  0:11               ` Martin MOKREJŠ
  2009-01-04  0:41                 ` Duane Griffin
  0 siblings, 1 reply; 86+ messages in thread
From: Martin MOKREJŠ @ 2009-01-04  0:11 UTC (permalink / raw)
  To: Duane Griffin
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

Duane Griffin wrote:
> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>> Why does not "mount -ro" die when it would have to replay the journal
>> with a message that user must run fsck.ext3 in order to be able to mount
>> it albeit read-only? Still I would prefer having an extra switch to
>> force mount RO while not touching the journal for disk forensics.
>> I think that would also prevent the cases when a LiveCD/rescue distribution
>> would not mount+replay it automagically but user would really have to
>> provide the switch to the command. I am really not using the recovery
>> boot cd to touch my partitions in some cases unwillingly.
> 
> Well, that would make things rather tricky. As in, shutting down
> uncleanly would render your system unbootable.

??? If I am booted off a CD/DVD drive I just do not want my system
to be touched. I am fine if the dist mounts my drives automagically
in read-only mode but if that currently forces journal replay then no,
thanks. ;)

M.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 23:58             ` Robert Hancock
@ 2009-01-04  0:08               ` Martin MOKREJŠ
  2009-01-04 21:49               ` Ingo Oeser
  1 sibling, 0 replies; 86+ messages in thread
From: Martin MOKREJŠ @ 2009-01-04  0:08 UTC (permalink / raw)
  To: Robert Hancock
  Cc: Duane Griffin, Pavel Machek, kernel list, Andrew Morton, tytso,
	mtk.manpages, rdunlap, linux-doc

Robert Hancock wrote:
> Martin MOKREJŠ wrote:
>> Duane Griffin wrote:
>>> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>>>> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
>>>> accidentally in M$ Win where I use ext2 IFS driver and modify some
>>>> stuff on the ext3 drive, after a while reboot to linux and the journal
>>>> get re-played ... Mmm ...
>>> You *really* wouldn't want to be doing that.
>>>
>>> The other scenario that people have reported trouble with is
>>> suspending the system, booting a live CD which "read-only" mounts the
>>> filesystem (and replays the journal), then resuming.
>>
>> Why does not "mount -ro" die when it would have to replay the journal
>> with a message that user must run fsck.ext3 in order to be able to mount
>> it albeit read-only? Still I would prefer having an extra switch to
> 
> That would break typical system bootup in the unclean journal case,
> normally the root FS is mounted read-only to start with (which replays
> the journal) and remounted read-write later on - and usually the fsck
> utilities are located on the root filesystem..

Couldn't that be handled by e.g. OpenRC during boot, which would pass,
say, a --force-journal-replay switch during "normal" boot?
Yes, that would mean e2fsprogs would become incompatible with older
versions but why not "fix" the logic?

> 
>> force mount RO while not touching the journal for disk forensics. I
>> think that would also prevent the cases when a LiveCD/rescue 
>> distribution would not mount+replay it automagically but user would
>> really have to provide the switch to the command. I am really not
>> using the recovery boot cd to touch my partitions in some cases
>> unwillingly.
>
> I agree, there should be a way to force it to mount "really read only"
> so it doesn't try to replay the journal. That might require just
> ignoring the journal content, which may result in the FS appearing
> corrupt, but for recovery/forensics purposes that seems better than
> nothing..

Fully agree.

M.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 23:50           ` Martin MOKREJŠ
  2009-01-03 23:58             ` Robert Hancock
@ 2009-01-04  0:00             ` Duane Griffin
  2009-01-04  0:11               ` Martin MOKREJŠ
  1 sibling, 1 reply; 86+ messages in thread
From: Duane Griffin @ 2009-01-04  0:00 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> Why does not "mount -ro" die when it would have to replay the journal
> with a message that user must run fsck.ext3 in order to be able to mount
> it albeit read-only? Still I would prefer having an extra switch to
> force mount RO while not touching the journal for disk forensics.
> I think that would also prevent the cases when a LiveCD/rescue distribution
> would not mount+replay it automagically but user would really have to
> provide the switch to the command. I am really not using the recovery
> boot cd to touch my partitions in some cases unwillingly.

Well, that would make things rather tricky. As in, shutting down
uncleanly would render your system unbootable.

> Sure that does not prevent my case when I let ext2 IFS writing onto
> my ext3 partition. Actually, couldn't the driver at least warn me
> the journal log is non-empty (am just a user, sorry, cannot check
> myself the code at www.fs-driver.org if it could do at least this
> although it does not understand ext3). ;-)

The driver certainly should warn you in that case. I have no idea
whether it does, as I don't use it, sorry.

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 23:50           ` Martin MOKREJŠ
@ 2009-01-03 23:58             ` Robert Hancock
  2009-01-04  0:08               ` Martin MOKREJŠ
  2009-01-04 21:49               ` Ingo Oeser
  2009-01-04  0:00             ` Duane Griffin
  1 sibling, 2 replies; 86+ messages in thread
From: Robert Hancock @ 2009-01-03 23:58 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: Duane Griffin, Pavel Machek, kernel list, Andrew Morton, tytso,
	mtk.manpages, rdunlap, linux-doc

Martin MOKREJŠ wrote:
> Duane Griffin wrote:
>> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>>> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
>>> accidentally in M$ Win where I use ext2 IFS driver and modify some
>>> stuff on the ext3 drive, after a while reboot to linux and the journal
>>> get re-played ... Mmm ...
>> You *really* wouldn't want to be doing that.
>>
>> The other scenario that people have reported trouble with is
>> suspending the system, booting a live CD which "read-only" mounts the
>> filesystem (and replays the journal), then resuming.
> 
> Why does not "mount -ro" die when it would have to replay the journal
> with a message that user must run fsck.ext3 in order to be able to mount
> it albeit read-only? Still I would prefer having an extra switch to

That would break typical system bootup in the unclean journal case, 
normally the root FS is mounted read-only to start with (which replays 
the journal) and remounted read-write later on - and usually the fsck 
utilities are located on the root filesystem..

> force mount RO while not touching the journal for disk forensics.
> I think that would also prevent the cases when a LiveCD/rescue distribution
> would not mount+replay it automagically but user would really have to
> provide the switch to the command. I am really not using the recovery
> boot cd to touch my partitions in some cases unwillingly.

I agree, there should be a way to force it to mount "really read only" 
so it doesn't try to replay the journal. That might require just 
ignoring the journal content, which may result in the FS appearing 
corrupt, but for recovery/forensics purposes that seems better than 
nothing..

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 23:38         ` Duane Griffin
@ 2009-01-03 23:50           ` Martin MOKREJŠ
  2009-01-03 23:58             ` Robert Hancock
  2009-01-04  0:00             ` Duane Griffin
  0 siblings, 2 replies; 86+ messages in thread
From: Martin MOKREJŠ @ 2009-01-03 23:50 UTC (permalink / raw)
  To: Duane Griffin
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

Duane Griffin wrote:
> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
>> accidentally in M$ Win where I use ext2 IFS driver and modify some
>> stuff on the ext3 drive, after a while reboot to linux and the journal
>> get re-played ... Mmm ...
> 
> You *really* wouldn't want to be doing that.
> 
> The other scenario that people have reported trouble with is
> suspending the system, booting a live CD which "read-only" mounts the
> filesystem (and replays the journal), then resuming.

Why doesn't "mount -o ro" fail when it would have to replay the journal,
with a message that the user must run fsck.ext3 before the filesystem can
be mounted, even read-only? Still, I would prefer an extra switch to
force a read-only mount that does not touch the journal, for disk forensics.
I think that would also keep a LiveCD/rescue distribution from
mounting and replaying it automagically; the user would really have to
pass the switch to the command. I really do not use the recovery
boot CD to have my partitions touched unwillingly.

Sure, that does not prevent my case, where I let the ext2 IFS driver write
onto my ext3 partition. Actually, couldn't the driver at least warn me that
the journal is non-empty? (I am just a user, sorry, and cannot check
the code at www.fs-driver.org myself to see whether it could do at least
this, although it does not understand ext3.) ;-)

Martin


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 23:01       ` Martin MOKREJŠ
@ 2009-01-03 23:38         ` Duane Griffin
  2009-01-03 23:50           ` Martin MOKREJŠ
  2009-01-04  0:19         ` Pavel Machek
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 86+ messages in thread
From: Duane Griffin @ 2009-01-03 23:38 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
> accidentally in M$ Win where I use ext2 IFS driver and modify some
> stuff on the ext3 drive, after a while reboot to linux and the journal
> get re-played ... Mmm ...

You *really* wouldn't want to be doing that.

The other scenario that people have reported trouble with is
suspending the system, booting a live CD which "read-only" mounts the
filesystem (and replays the journal), then resuming.

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 22:29     ` Pavel Machek
  2009-01-03 23:01       ` Martin MOKREJŠ
@ 2009-01-03 23:12       ` Duane Griffin
  2009-01-06 10:06       ` Matthias Andree
  2 siblings, 0 replies; 86+ messages in thread
From: Duane Griffin @ 2009-01-03 23:12 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Martin MOKREJŠ,
	kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap,
	linux-doc

2009/1/3 Pavel Machek <pavel@suse.cz>:
> On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
>> [Fixed top-posting]
>>
>> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>> > Pavel Machek wrote:
>> >> readonly mount does actually write to the media in some cases. Document that.
>> >>
>> > Can one avoid replay of the journal then if it would be unclean?
>> > Just curious.
>>
>> Nope. If the underlying block device is read-only then mounting the
>> filesystem will fail. I tried to fix this some time ago, and have a
>> set of patches that almost always work, but "almost always" isn't good
>> enough. Unfortunately I never managed to figure out a way to finish it
>> off without disgusting hacks or major surgery.
>
> Uhuh, can you just ignore the journal and mount it anyway?
> ...basically treating it like an ext2?

I'm afraid not, ext2 won't mount an FS with EXT3_FEATURE_INCOMPAT_RECOVER set.
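
For forensics it can help to inspect that flag without mounting anything.
The sketch below decodes the relevant bits from a raw superblock; it
assumes the usual ext2/ext3 on-disk layout (primary superblock 1024 bytes
into the device, little-endian fields, s_magic at byte 56,
s_feature_compat at byte 92, s_feature_incompat at byte 96), and the
function names are made up for illustration. Run it against an image copy
rather than the original media.

```python
import struct

SUPERBLOCK_OFFSET = 1024            # primary superblock starts 1 KiB in
EXT_MAGIC = 0xEF53                  # s_magic, shared by ext2 and ext3
FEATURE_COMPAT_HAS_JOURNAL = 0x0004
FEATURE_INCOMPAT_RECOVER = 0x0004   # EXT3_FEATURE_INCOMPAT_RECOVER


def journal_state(superblock):
    """Decode the journal-related feature bits from raw superblock bytes."""
    if len(superblock) < 100:
        raise ValueError("superblock too short")
    (magic,) = struct.unpack_from("<H", superblock, 56)
    if magic != EXT_MAGIC:
        raise ValueError("not an ext2/ext3 superblock")
    compat, incompat = struct.unpack_from("<II", superblock, 92)
    return {
        "has_journal": bool(compat & FEATURE_COMPAT_HAS_JOURNAL),
        "needs_recovery": bool(incompat & FEATURE_INCOMPAT_RECOVER),
    }


def read_superblock(path):
    """Read the primary superblock from a device or image file, read-only."""
    with open(path, "rb") as dev:
        dev.seek(SUPERBLOCK_OFFSET)
        return dev.read(1024)
```

If "needs_recovery" comes back true, the filesystem is in exactly the
state where ext2 refuses the mount and ext3 would replay the journal.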

> ...ok, that will present "old" version of the filesystem to the
> user... violating fsync() semantics.
>
> Still handy for recovering badly broken filesystems, I'd say.
>
>                                                                        Pavel

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 22:29     ` Pavel Machek
@ 2009-01-03 23:01       ` Martin MOKREJŠ
  2009-01-03 23:38         ` Duane Griffin
                           ` (3 more replies)
  2009-01-03 23:12       ` Duane Griffin
  2009-01-06 10:06       ` Matthias Andree
  2 siblings, 4 replies; 86+ messages in thread
From: Martin MOKREJŠ @ 2009-01-03 23:01 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Duane Griffin, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

Pavel Machek wrote:
> On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
>> [Fixed top-posting]
>>
>> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>>> Pavel Machek wrote:
>>>> readonly mount does actually write to the media in some cases. Document that.
>>>>
>>> Can one avoid replay of the journal then if it would be unclean?
>>> Just curious.
>> Nope. If the underlying block device is read-only then mounting the
>> filesystem will fail. I tried to fix this some time ago, and have a
>> set of patches that almost always work, but "almost always" isn't good
>> enough. Unfortunately I never managed to figure out a way to finish it
>> off without disgusting hacks or major surgery.
> 
> Uhuh, can you just ignore the journal and mount it anyway?
> ...basically treating it like an ext2?
> 
> ...ok, that will present "old" version of the filesystem to the
> user... violating fsync() semantics.

Hmm, so if my dual-boot machine does not shut down correctly and I
accidentally boot into M$ Win, where I use the ext2 IFS driver and modify
some stuff on the ext3 drive, then after a while reboot to Linux and the
journal gets replayed ... Mmm ...

> 
> Still handy for recovering badly broken filesystems, I'd say.

Me as well. How about improving your doc patch with some summary of
this thread (although it is probably not over yet)? ;-) Definitely,
a note that one can mount it as ext2 while read-only would be helpful
when doing some forensics on the disk.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 22:17   ` Duane Griffin
@ 2009-01-03 22:29     ` Pavel Machek
  2009-01-03 23:01       ` Martin MOKREJŠ
                         ` (2 more replies)
  0 siblings, 3 replies; 86+ messages in thread
From: Pavel Machek @ 2009-01-03 22:29 UTC (permalink / raw)
  To: Duane Griffin
  Cc: Martin MOKREJŠ,
	kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap,
	linux-doc

On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
> [Fixed top-posting]
> 
> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> > Pavel Machek wrote:
> >> readonly mount does actually write to the media in some cases. Document that.
> >>
> > Can one avoid replay of the journal then if it would be unclean?
> > Just curious.
> 
> Nope. If the underlying block device is read-only then mounting the
> filesystem will fail. I tried to fix this some time ago, and have a
> set of patches that almost always work, but "almost always" isn't good
> enough. Unfortunately I never managed to figure out a way to finish it
> off without disgusting hacks or major surgery.

Uhuh, can you just ignore the journal and mount it anyway?
...basically treating it like an ext2?

...ok, that will present "old" version of the filesystem to the
user... violating fsync() semantics.

Still handy for recovering badly broken filesystems, I'd say.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 21:17 ` Martin MOKREJŠ
  2009-01-03 22:06   ` Pavel Machek
@ 2009-01-03 22:17   ` Duane Griffin
  2009-01-03 22:29     ` Pavel Machek
  1 sibling, 1 reply; 86+ messages in thread
From: Duane Griffin @ 2009-01-03 22:17 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

[Fixed top-posting]

2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> Pavel Machek wrote:
>> readonly mount does actually write to the media in some cases. Document that.
>>
> Can one avoid replay of the journal then if it would be unclean?
> Just curious.

Nope. If the underlying block device is read-only then mounting the
filesystem will fail. I tried to fix this some time ago, and have a
set of patches that almost always work, but "almost always" isn't good
enough. Unfortunately I never managed to figure out a way to finish it
off without disgusting hacks or major surgery.

> M.

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 21:17 ` Martin MOKREJŠ
@ 2009-01-03 22:06   ` Pavel Machek
  2009-01-03 22:17   ` Duane Griffin
  1 sibling, 0 replies; 86+ messages in thread
From: Pavel Machek @ 2009-01-03 22:06 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

On Sat 2009-01-03 22:17:11, Martin MOKREJŠ wrote:
> Can one avoid replay of the journal then if it would be unclean?
> Just curious.

Well, mounting unclean filesystem is dangerous but depending on
circumstances, it may be better than writing to the filesystems.

(You may not be able to read some data and may provoke kernel bugs,
but at least you don't damage what is on disk. If you are collecting
evidence -- not writing is very important. If you suspect something is
very wrong with the drive, not writing is good idea).

								Pavel
> 
> Pavel Machek wrote:
> > Using ext3 is only safe if storage subsystem meets certain
> > criteria. Document those.
> > 
> > Errors=remount-ro is documented as default, but superblock setting
> > overrides that and mkfs defaults to errors=continue... so the default
> > is errors=continue in practice.
> > 
> > readonly mount does actually write to the media in some cases. Document that.
> > 
> > Signed-off-by: Pavel Machek <pavel@suse.cz>
> > 
> > diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: document ext3 requirements
  2009-01-03 12:38 Pavel Machek
@ 2009-01-03 21:17 ` Martin MOKREJŠ
  2009-01-03 22:06   ` Pavel Machek
  2009-01-03 22:17   ` Duane Griffin
  2009-01-04  2:32 ` Theodore Tso
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 86+ messages in thread
From: Martin MOKREJŠ @ 2009-01-03 21:17 UTC (permalink / raw)
  To: Pavel Machek
  Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

Can one avoid replay of the journal then if it would be unclean?
Just curious.
M.

Pavel Machek wrote:
> Using ext3 is only safe if storage subsystem meets certain
> criteria. Document those.
> 
> Errors=remount-ro is documented as default, but superblock setting
> overrides that and mkfs defaults to errors=continue... so the default
> is errors=continue in practice.
> 
> readonly mount does actually write to the media in some cases. Document that.
> 
> Signed-off-by: Pavel Machek <pavel@suse.cz>
> 
> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt

^ permalink raw reply	[flat|nested] 86+ messages in thread

* document ext3 requirements
@ 2009-01-03 12:38 Pavel Machek
  2009-01-03 21:17 ` Martin MOKREJŠ
                   ` (3 more replies)
  0 siblings, 4 replies; 86+ messages in thread
From: Pavel Machek @ 2009-01-03 12:38 UTC (permalink / raw)
  To: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

Using ext3 is only safe if storage subsystem meets certain
criteria. Document those.

Errors=remount-ro is documented as default, but superblock setting
overrides that and mkfs defaults to errors=continue... so the default
is errors=continue in practice.

readonly mount does actually write to the media in some cases. Document that.

Signed-off-by: Pavel Machek <pavel@suse.cz>

diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..74a73b0 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -14,6 +14,9 @@ Options
 When mounting an ext3 filesystem, the following option are accepted:
 (*) == default
 
+ro			Note that ext3 will replay the journal (and thus write
+			to the partition) even when mounted "read only".
+
 journal=update		Update the ext3 file system's journal to the current
 			format.
 
@@ -95,6 +98,8 @@ debug			Extra debugging information is sent to syslog.
 errors=remount-ro(*)	Remount the filesystem read-only on an error.
 errors=continue		Keep going on a filesystem error.
 errors=panic		Panic and halt the machine if an error occurs.
+			(Note that default is overridden by superblock
+			setting on most systems).
 
 data_err=ignore(*)	Just print an error message if an error occurs
 			in a file data buffer in ordered mode.
@@ -188,6 +193,34 @@ mke2fs: 	create a ext3 partition with the -j flag.
 debugfs: 	ext2 and ext3 file system debugger.
 ext2online:	online (mounted) ext2 and ext3 filesystem resizer
 
+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* writes to media never fail. Even if disk returns error condition during
+  write, ext3 can't handle that correctly, because success on fsync was already
+  returned when data hit the journal.
+
+	   (Fortunately writes failing are very uncommon on disks, as they
+	   have spare sectors they use when write fails.)
+
+* either whole sector is correctly written or nothing is written during
+  powerfail.
+
+	   (Unfortunately, none of the cheap USB/SD flash cards I've seen behave
+	   like this, and are unsuitable for ext3. Because RAM tends to fail
+	   faster than rest of system during powerfail, special hw killing
+	   DMA transfers may be necessary. Not sure how common that problem
+	   is on generic PC machines).
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+	   (Note that barriers are disabled by default, use "barrier=1"
+	   mount option after making sure hw can support them). 
+
 
 References
 ==========

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related	[flat|nested] 86+ messages in thread

end of thread, other threads:[~2009-01-27 13:38 UTC | newest]

Thread overview: 86+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <fa.pmCH9X+XujDl6RH6/TpkNtsTnbk@ifi.uio.no>
     [not found] ` <fa.b62zZFe5e154PhgA+0sdwVXD9F0@ifi.uio.no>
     [not found]   ` <fa.ZTpiSvxEhp3YJDepiUQs+cU0C98@ifi.uio.no>
     [not found]     ` <fa.xvvufQC6zTpU9R6vhDl51DR5V7A@ifi.uio.no>
     [not found]       ` <fa.pkV69eXC76Pb9fnmERdAwXX9OKA@ifi.uio.no>
     [not found]         ` <fa.hQTLXdIllf+hs4yQb092u6fowq0@ifi.uio.no>
2009-01-04 19:08           ` document ext3 requirements Sitsofe Wheeler
2009-01-04 19:31             ` Theodore Tso
2009-01-04 22:40               ` Pavel Machek
2009-01-04 23:30                 ` Theodore Tso
2009-01-05  3:49                   ` Rob Landley
2009-01-05  4:31                     ` Robert Hancock
2009-01-05  5:00                     ` david
2009-01-05 11:19                     ` Alan Cox
2009-01-05 19:00                       ` Rob Landley
2009-01-05 19:27                         ` Martin K. Petersen
2009-01-06 10:41                           ` Matthias Andree
2009-01-06 15:30                             ` Theodore Tso
     [not found]                             ` <20090106153020.GB13086__11022.1833143898$1231255950$gmane$org@mit.edu>
2009-01-06 15:40                               ` Andi Kleen
2009-01-06 15:57                                 ` Theodore Tso
2009-01-06 17:31                                   ` Andi Kleen
2009-01-06 19:31                                   ` Rob Landley
2009-01-27 13:24                       ` Thierry Vignaud
2009-01-27 13:37                         ` Alan Cox
2009-01-06 10:36                     ` Matthias Andree
     [not found] <fa.P4z5CJpM0xT37PWJuOuCHDkO76o@ifi.uio.no>
     [not found] ` <fa.eOwOqydZi0qs6K1nmNxBFGQMV40@ifi.uio.no>
     [not found]   ` <fa.26o5IHCAC3TQdXupl62CLYwQ+Wk@ifi.uio.no>
2009-01-04 23:13     ` Sitsofe Wheeler
2009-01-05  2:51       ` Rob Landley
2009-01-05  3:33         ` Martin K. Petersen
2009-01-05  4:02         ` david
2009-01-05  3:52           ` Rob Landley
     [not found]     ` <fa.GBkQuKdRj+YRVczlNLFhGvaw3WY@ifi.uio.no>
     [not found]       ` <fa.rCyCghh/+staAmYi/+gwYvefIS0@ifi.uio.no>
     [not found]         ` <fa.c5j7jAMUnJPvgI9Oj/VczSDNakE@ifi.uio.no>
     [not found]           ` <fa.377DMq2lPMyaHxadPnApFSJFoCg@ifi.uio.no>
2009-01-05 20:36             ` Sitsofe Wheeler
2009-01-05 23:09               ` Theodore Tso
     [not found]     ` <fa.ucJLoSQwk9OAj6T6x60tbWaiTAo@ifi.uio.no>
2009-01-05 22:25       ` Sitsofe Wheeler
2009-01-06  4:08         ` Rob Landley
2009-01-03 12:38 Pavel Machek
2009-01-03 21:17 ` Martin MOKREJŠ
2009-01-03 22:06   ` Pavel Machek
2009-01-03 22:17   ` Duane Griffin
2009-01-03 22:29     ` Pavel Machek
2009-01-03 23:01       ` Martin MOKREJŠ
2009-01-03 23:38         ` Duane Griffin
2009-01-03 23:50           ` Martin MOKREJŠ
2009-01-03 23:58             ` Robert Hancock
2009-01-04  0:08               ` Martin MOKREJŠ
2009-01-04 21:49               ` Ingo Oeser
2009-01-04  0:00             ` Duane Griffin
2009-01-04  0:11               ` Martin MOKREJŠ
2009-01-04  0:41                 ` Duane Griffin
2009-01-04  3:52                   ` Valdis.Kletnieks
2009-01-04 14:24                     ` Duane Griffin
2009-01-04 18:40                       ` Theodore Tso
2009-01-04 19:21                         ` Geert Uytterhoeven
2009-01-04 19:36                           ` Theodore Tso
2009-01-04 19:51                             ` Duane Griffin
2009-01-04 21:55                               ` Theodore Tso
2009-01-04 22:06                                 ` Duane Griffin
2009-01-04 22:42                           ` Bron Gondwana
2009-01-05  3:22                           ` Rob Landley
2009-01-04  0:19         ` Pavel Machek
2009-01-05  2:55           ` Rob Landley
2009-01-04 19:56         ` Rob Landley
2009-01-05 19:16           ` Theodore Tso
2009-01-06 19:20             ` Rob Landley
2009-01-06 10:08         ` Matthias Andree
2009-01-06 15:23           ` Theodore Tso
2009-01-03 23:12       ` Duane Griffin
2009-01-06 10:06       ` Matthias Andree
2009-01-04  2:32 ` Theodore Tso
2009-01-04 22:33   ` Pavel Machek
2009-01-04 13:35 ` Alexander E. Patrakov
2009-01-04 13:53   ` Valdis.Kletnieks
2009-01-04 18:21   ` Michael Tokarev
2009-01-04 18:38   ` Theodore Tso
2009-01-04 22:37     ` Pavel Machek
2009-01-04 23:58       ` Theodore Tso
2009-01-05 11:43     ` Alan Cox
2009-01-07 11:59       ` Rob Landley
2009-01-04 20:10   ` Pavel Machek
2009-01-04 19:49 ` Rob Landley
2009-01-04 22:06   ` Theodore Tso
2009-01-04 22:25     ` Pavel Machek
2009-01-04 23:07     ` Pavel Machek
2009-01-05  1:38     ` Rob Landley
2009-01-04 22:55   ` Pavel Machek
2009-01-05  0:16     ` david
2009-01-05  9:38       ` Pavel Machek
2009-01-05  1:50     ` Rob Landley
2009-01-05  3:20     ` Martin K. Petersen
2009-01-05  9:45       ` Pavel Machek
2009-01-05 11:28         ` Alan Cox
2009-01-05 19:15         ` Martin K. Petersen
2009-01-05 20:19           ` Theodore Tso
