* A couple of questions
@ 2010-05-27 13:39 Paul Millar
  2010-05-27 14:56 ` Hubert Kario
  2010-05-27 16:00 ` Chris Mason
  0 siblings, 2 replies; 50+ messages in thread
From: Paul Millar @ 2010-05-27 13:39 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I've been looking at Btrfs and have a couple of naive questions that don't 
seem to be answered on the wiki or in the articles I've read on the 
filesystem.


First: discovering a file's checksum value.

Here's the scenario: software is writing some data as a fresh file.  This 
software happens to know (a priori) the checksum of this data; for example, a 
storage server receives the file's data and checksum independently.

I've some confidence that, once the data is stored in btrfs, any corruption 
(from the storage fabric) will be spotted; however, the data may have become 
corrupt before being stored (e.g., from the network).  To catch this, the 
checksum of the stored data needs to be calculated and checked.

One approach is to calculate the checksum (in user-space) after the data is 
stored.  This adds extra IO- and CPU-load and there's also the possibility of 
false-negative results due to the filesystem cache (although btrfs may remove 
this risk).

Another approach would be to ask btrfs for the checksum.  It seems that it's 
possible to combine multiple CRC-32C values to figure out the checksum of the 
combined data [e.g., zlib's crc32_combine() function].  So, obtaining a file's 
checksum might be a light-weight operation.
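
To illustrate the combining idea, here is a minimal Python sketch that ports
the approach of zlib's crc32_combine().  It is an assumption-laden
illustration, not btrfs code: it uses the plain CRC-32 polynomial so that the
result can be checked against Python's zlib.crc32(); CRC-32C would need the
Castagnoli polynomial (0x82F63B78 in reflected form) instead, and the two
byte strings merely stand in for the data of two extents.

```python
import zlib

CRC32_POLY = 0xEDB88320  # reflected CRC-32 polynomial used by zlib (not CRC-32C)


def gf2_matrix_times(mat, vec):
    """Multiply a 32x32 GF(2) matrix (a list of 32 column ints) by a 32-bit vector."""
    total = 0
    column = 0
    while vec:
        if vec & 1:
            total ^= mat[column]
        vec >>= 1
        column += 1
    return total


def gf2_matrix_square(mat):
    """Return mat * mat over GF(2)."""
    return [gf2_matrix_times(mat, col) for col in mat]


def crc32_combine(crc1, crc2, len2):
    """Return CRC(A + B) given crc1 = CRC(A), crc2 = CRC(B) and len2 = len(B)."""
    if len2 <= 0:
        return crc1
    # Operator that advances a CRC register over a single zero bit.
    op = [CRC32_POLY] + [1 << n for n in range(31)]
    op = gf2_matrix_square(op)  # two zero bits
    op = gf2_matrix_square(op)  # four zero bits
    while len2:
        # One zero byte on the first pass, then 2, 4, 8, ... bytes.
        op = gf2_matrix_square(op)
        if len2 & 1:
            crc1 = gf2_matrix_times(op, crc1)
        len2 >>= 1
    return crc1 ^ crc2


# Hypothetical stand-ins for the contents of two consecutive extents.
first, second = b"data in the first extent", b"data in the second extent"
combined = crc32_combine(zlib.crc32(first), zlib.crc32(second), len(second))
print(combined == zlib.crc32(first + second))
```

The cost grows with the logarithm of the second block's length rather than
with the amount of data, which is what would make a whole-file value cheap to
derive from per-block values.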

Yet another possibility would be to push the desired checksum value (via 
fcntl?) and have btrfs compare the desired checksum with the file's actual 
checksum on close(2), failing that call if the checksums don't match.

Would any of this be possible (without an awful lot of work)?



Second: adding support for Adler32?

Looking at the unstable git repo, it looks like there's currently support for 
only the CRC-32C checksum algorithm.  Is this correct?  If so, is anyone 
working on adding support for Adler32?

Cheers,

Paul.
(ps, please keep me CC-ed in on replies)


* Re: A couple of questions
  2010-05-27 13:39 A couple of questions Paul Millar
@ 2010-05-27 14:56 ` Hubert Kario
  2010-05-31 17:59   ` Paul Millar
  2010-05-27 16:00 ` Chris Mason
  1 sibling, 1 reply; 50+ messages in thread
From: Hubert Kario @ 2010-05-27 14:56 UTC (permalink / raw)
  To: Paul Millar; +Cc: linux-btrfs

On Thursday 27 May 2010 15:39:54 Paul Millar wrote:
> Hi,
> 
> I've been looking at Btrfs and have a couple of naive questions that don't
> seem to be answered on the wiki or in the articles I've read on the
> filesystem.
> 
> 
> First: discovering a file's checksum value.
> 
> Here's the scenario: software is writing some data as a fresh file.  This
> software happens to know (a priori) the checksum of this data; for example,
> a storage server receives the file's data and checksum independently.
> 
> I've some confidence that, once the data is stored in btrfs, any corruption
> (from the storage fabric) will be spotted; however, the data may have
> become corrupt before being stored (e.g., from the network).  To catch
> this, the checksum of the stored data needs to be calculated and checked.
> 
> One approach is to calculate the checksum (in user-space) after the data is
> stored.  This adds extra IO- and CPU-load and there's also the possibility
> of false-negative results due to the filesystem cache (although btrfs may
> remove this risk).
> 
> Another approach would be to ask btrfs for the checksum.  It seems that
> it's possible to combine multiple CRC-32C values to figure out the
> checksum of the combined data [e.g., zlib's crc32_combine() function].
> So, obtaining a file's checksum might be a light-weight operation.
> 
> Yet another possibility would be to push the desired checksum value (via
> fcntl?) and have btrfs compare the desired checksum with the file's actual
> checksum on close(2), failing that call if the checksums don't match.
> 
> Would any of this be possible (without an awful lot of work)?

IMO, if an application receives data with a checksum, it can calculate the
checksum of the data on the fly, as it writes it to the disk. That won't add
any additional IO to the storage subsystem. It won't detect in-memory
corruption, though; if you want to be resilient to that, you should be looking
at ECC RAM, as subsequent checks can be affected by it too.
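
A minimal sketch of that on-the-fly approach, in Python.  The function name,
the file name and the choice of Adler-32 are assumptions made for the
example; only zlib.adler32() and its running-value argument are real library
behaviour.

```python
import os
import tempfile
import zlib


def store_with_checksum(chunks, path, expected_adler32):
    """Write chunks to path, checksumming each chunk as it passes through."""
    running = 1  # Adler-32 starting value
    with open(path, "wb") as out:
        for chunk in chunks:
            running = zlib.adler32(chunk, running)
            out.write(chunk)
    if running != expected_adler32:
        # Reject the upload: the bytes seen here differ from what was announced.
        os.remove(path)
        raise ValueError(
            "checksum mismatch: got %08X, expected %08X" % (running, expected_adler32)
        )
    return running


# Hypothetical upload arriving in three pieces with a client-supplied checksum.
payload = [b"first block ", b"second block ", b"third block"]
expected = zlib.adler32(b"".join(payload))
target = os.path.join(tempfile.mkdtemp(), "upload.dat")
print("%08X" % store_with_checksum(payload, target, expected))
```

This adds no extra reads, but it only vouches for the bytes as the
application saw them, not for what later reaches the disk.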

Second, you shouldn't tie an application or network protocol to the CRC scheme
used by the filesystem on the server! Especially when there can be other CRC
algorithms used, not only CRC-32C.

If the checksum algorithm used by the FS was set in stone, then userspace could
employ it somehow, but if there can be different CRCs used, I see no reason to
allow userspace to read them.


-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl

Quality Management System
compliant with ISO 9001:2000

* Re: A couple of questions
  2010-05-27 13:39 A couple of questions Paul Millar
  2010-05-27 14:56 ` Hubert Kario
@ 2010-05-27 16:00 ` Chris Mason
  2010-05-31 18:06   ` Paul Millar
  1 sibling, 1 reply; 50+ messages in thread
From: Chris Mason @ 2010-05-27 16:00 UTC (permalink / raw)
  To: Paul Millar; +Cc: linux-btrfs

On Thu, May 27, 2010 at 03:39:54PM +0200, Paul Millar wrote:
> Hi,
> 
> I've been looking at Btrfs and have a couple of naive questions that don't 
> seem to be answered on the wiki or in the articles I've read on the 
> filesystem.
> 
> 
> First: discovering a file's checksum value.
> 
> Here's the scenario: software is writing some data as a fresh file.  This 
> software happens to know (a priori) the checksum of this data; for example, a 
> storage server receives the file's data and checksum independently.
> 
> I've some confidence that, once the data is stored in btrfs, any corruption 
> (from the storage fabric) will be spotted; however, the data may have became 
> corrupt before being stored (e.g., from the network).  To catch this, the 
> checksum of the stored data needs to be calculated and checked.
> 
> One approach is to calculate the checksum (in user-space) after the data is 
> stored.  This adds extra IO- and CPU-load and there's also the possibility of 
> false-negative results due to the filesystem cache (although btrfs may remove 
> this risk).
> 
> Another approach would be to ask btrfs for the checksum.  It seems that it's 
> possible to combine multiple CRC-32C values to figure out the checksum of the 
> combined data [e.g., zlib's crc32_combine() function].  So, obtaining a file's 
> checksum might be a light-weight operation.
> 
> Yet another possibility would be to push the desired checksum value (via 
> fcntl?) and have btrfs compare the desired checksum with the file's actual 
> checksum on close(2), failing that call if the checksums don't match.
> 
> Would any of this be possible (without an awful lot of work)?

I'd suggest that you look at T10 DIF and DIX, which are targeted at
exactly this kind of thing.  We're looking at integrating dif/dix into
btrfs at some point.

> 
> 
> 
> Second: adding support for Adler32?
> 
> Looking at the unstable git repo, it looks like there's currently support for 
> only the CRC-32C checksum algorithm.  Is this correct?  If so, is anyone 
> working on adding support for Adler32?

We haven't looked at adler32.  crc32c was chosen because it is supported
in hardware by recent intel CPUs.

-chris


* Re: A couple of questions
  2010-05-27 14:56 ` Hubert Kario
@ 2010-05-31 17:59   ` Paul Millar
  2010-06-02 16:19     ` Hubert Kario
  0 siblings, 1 reply; 50+ messages in thread
From: Paul Millar @ 2010-05-31 17:59 UTC (permalink / raw)
  To: Hubert Kario; +Cc: linux-btrfs

Hi Hubert,

On Thursday 27 May 2010 16:56:00 Hubert Kario wrote:
> > Would [obtaining file checksum] be possible (without an awful lot
> > of work)?
> 
> [Calculating checksum in-memory]  won't detect in-memory corruption
> though, but if you want to be resilient to this, you should be looking at
>  ECC RAM as subsequent checks can be affected by it too.

Certainly ECC RAM will help, but unfortunately it doesn't remove the 
possibility of corruption; for example, CERN found [1] that double-bit memory 
corruptions (which ECC cannot recover from) can still happen.

[1] 
http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797

Also, IIRC there was a case where Fermilab tracked down a data corruption to a 
faulty PCI bus in the server.  So who knows where are all the places 
corruption could occur?

I guess the real problem is that, when processing large amounts of data, these 
rare occurrences start to stack up.


> Second, you shouldn't tie application or network protocol to a CRC scheme
>  used by filesystem on server! Especially when there can be other CRC
>  algorithms used, not only CRC-32C.

Sure, but the protocol isn't tied to any particular checksum algorithm.

 
> If the checksum algorithm used by FS was set in stone, then userspace could
> employ it somehow, but if there can be different CRCs used, I see no reason
>  to allow the userspace to read them.

I agree that a checksum value, without knowing the algorithm, isn't much use.  
However, suppose the FS reported a string representation of the tuple 
(algorithm, value); for example:

   0:DCD05C54

(where "0" is from BTRFS_CSUM_TYPE_CRC32)

Would that allow meaningful use of this information?
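
As a rough illustration of how an application might consume such a tagged
value, here is a Python sketch.  Everything about the interface is
hypothetical: the "id:hex" string format is only the proposal above, the
function names are invented, and the bitwise CRC-32C below is the standard
Castagnoli CRC, which is not necessarily bit-for-bit what btrfs stores.

```python
def crc32c(data, crc=0):
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# Algorithm id -> checksum function.  The id 0 mirrors BTRFS_CSUM_TYPE_CRC32;
# the table itself is an assumption of this sketch.
ALGORITHMS = {0: crc32c}


def parse_tagged_checksum(text):
    """Split a string such as "0:DCD05C54" into (algorithm id, value)."""
    algo_id, _, hex_value = text.partition(":")
    return int(algo_id), int(hex_value, 16)


def matches(tagged, data):
    """Return True/False for a known algorithm, None for an unknown one."""
    algo_id, value = parse_tagged_checksum(tagged)
    func = ALGORITHMS.get(algo_id)
    if func is None:
        return None  # caller falls back to its own verification
    return func(data) == value


print(parse_tagged_checksum("0:DCD05C54"))
print(matches("0:E3069283", b"123456789"))
```

Returning None for an unrecognised algorithm lets the application stay
filesystem-independent: it simply keeps doing its own checksum in that case.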

Cheers,

Paul.


* Re: A couple of questions
  2010-05-27 16:00 ` Chris Mason
@ 2010-05-31 18:06   ` Paul Millar
  2010-05-31 20:33     ` Mike Fedyk
  2010-06-01 13:39     ` Martin K. Petersen
  0 siblings, 2 replies; 50+ messages in thread
From: Paul Millar @ 2010-05-31 18:06 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

Hi Chris,

On Thursday 27 May 2010 18:00:44 Chris Mason wrote:
> I'd suggest that you look at T10 DIF and DIX, which are targeted at
> exactly this kind of thing.  We're looking at integrating dif/dix into
> btrfs at some point.

I've been keeping half-an-eye on T10's work in ensuring end-to-end integrity.  
That you guys are planning to integrate dif/dix support is certainly welcome 
news!

In my use-case (a file-server that receives a new file from a remote client),  
I believe that, to ensure end-to-end integrity,  the server software would 
have to push the client-supplied checksum into the FS when writing a new file.  
(I believe there's some T10 slides somewhere that show this use-case) -- or 
(equivalently) the server software obtains the FS checksum for the file and 
matches it against the client-supplied value.

I'm deliberately taking the simplest case when the client has chosen the same 
checksum algorithm as the FS uses.  In reality, this may not be the case, but 
we can probably cope with that.

My concern is that, if the server-software doesn't push the client-provided 
checksum then the FS checksum (plus T-10 DIF/DIX) would not provide a rigorous 
assurance that the bytes are the same.  Without this assurance, corruption 
could still occur; for example, within the server's memory.

> We haven't looked at adler32.  crc32c was chosen because it is supported
> in hardware by recent intel CPUs.

OK, fair enough :)

Cheers,

Paul.


* Re: A couple of questions
  2010-05-31 18:06   ` Paul Millar
@ 2010-05-31 20:33     ` Mike Fedyk
  2010-06-02 11:56       ` Paul Millar
  2010-06-01 13:39     ` Martin K. Petersen
  1 sibling, 1 reply; 50+ messages in thread
From: Mike Fedyk @ 2010-05-31 20:33 UTC (permalink / raw)
  To: Paul Millar; +Cc: Chris Mason, linux-btrfs

On Mon, May 31, 2010 at 11:06 AM, Paul Millar <paul.millar@desy.de> wrote:
> Hi Chris,
>
> On Thursday 27 May 2010 18:00:44 Chris Mason wrote:
>> I'd suggest that you look at T10 DIF and DIX, which are targeted at
>> exactly this kind of thing.  We're looking at integrating dif/dix into
>> btrfs at some point.
>
> I've been keeping half-an-eye on T10's work in ensuring end-to-end integrity.
> That you guys are planning to integrate dif/dix support is certainly welcome
> news!
>
> In my use-case (a file-server that receives a new file from a remote client),
> I believe that, to ensure end-to-end integrity,  the server software would
> have to push the client-supplied checksum into the FS when writing a new file.
> (I believe there's some T10 slides somewhere that show this use-case) -- or
> (equivalently) the server software obtains the FS checksum for the file and
> matches it against the client-supplied value.
>
> I'm deliberately taking the simplest case when the client has chosen the same
> checksum algorithm as the FS uses.  In reality, this may not be the case, but
> we can probably cope with that.
>
> My concern is that, if the server-software doesn't push the client-provided
> checksum then the FS checksum (plus T-10 DIF/DIX) would not provide a rigorous
> assurance that the bytes are the same.  Without this assurance, corruption
> could still occur; for example, within the server's memory.
>

Have you taken into account the boundaries of the data checksums?
Your app may checksum per file or some logical partition in the file
format.  Btrfs does the checksum per-extent so unless you keep track
of where the extent boundaries are, that checksum will be useless to
the userspace app.  Also the app would be tied specifically to a
storage technology.  No matter how great foo might be, not everyone's
going to use it.

Also are you going to get this info over nfs, cifs, lustre, gluster,
ceph, foo, bar and baz?

* Re: A couple of questions
  2010-05-31 18:06   ` Paul Millar
  2010-05-31 20:33     ` Mike Fedyk
@ 2010-06-01 13:39     ` Martin K. Petersen
  2010-06-02 13:40       ` Paul Millar
  1 sibling, 1 reply; 50+ messages in thread
From: Martin K. Petersen @ 2010-06-01 13:39 UTC (permalink / raw)
  To: Paul Millar; +Cc: Chris Mason, linux-btrfs

>>>>> "Paul" == Paul Millar <paul.millar@desy.de> writes:

Paul> My concern is that, if the server-software doesn't push the
Paul> client-provided checksum then the FS checksum (plus T-10 DIF/DIX)
Paul> would not provide a rigorous assurance that the bytes are the
Paul> same.  Without this assurance, corruption could still occur; for
Paul> example, within the server's memory.

For DIX we allow integrity metadata conversion.  Once the data is
received, the server generates appropriate IMD for the next layer.  Then
the server verifies that the original IMD matches the data buffer.  That
way there's no window of error.  But obviously the ideal case is where
the same IMD can be passed throughout the stack without conversion.
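
The ordering described here (generate the new integrity metadata first, then
verify the old one against the same buffer) can be sketched as follows.  This
is only an illustration in Python: Adler-32 and CRC-32 stand in for two
different protection types, and the function name is invented; it does not
model the actual DIF/DIX tuple formats.

```python
import zlib


def convert_integrity(buf, incoming_adler32):
    """Convert protection on buf from Adler-32 to CRC-32 with overlapping coverage."""
    # Step 1: generate the metadata for the next layer while the old
    # checksum still covers the buffer.
    outgoing_crc32 = zlib.crc32(buf)
    # Step 2: only now verify the original metadata against the same buffer.
    # If the buffer was damaged before step 1, this check is what catches it.
    if zlib.adler32(buf) != incoming_adler32:
        raise ValueError("incoming integrity metadata does not match the buffer")
    return outgoing_crc32


data = b"block received from the client"
print("%08X" % convert_integrity(data, zlib.adler32(data)))
```

The price is that every conversion point computes two checksums over the same
data.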

Not sure what you use for file service?  I believe NFSv4 allows for
checksums to be passed along. I have not looked at them closely yet,
though.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: A couple of questions
  2010-05-31 20:33     ` Mike Fedyk
@ 2010-06-02 11:56       ` Paul Millar
  0 siblings, 0 replies; 50+ messages in thread
From: Paul Millar @ 2010-06-02 11:56 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Chris Mason, linux-btrfs

Hi Mike,

On Monday 31 May 2010 22:33:23 Mike Fedyk wrote:
> On Mon, May 31, 2010 at 11:06 AM, Paul Millar <paul.millar@desy.de> wrote:
> > [...] My concern is that, if the server-software doesn't push the
> > client-provided checksum then the FS checksum (plus T-10 DIF/DIX) would
> > not provide a rigorous assurance that the bytes are the same [...]
> 
> Have you taken into account the boundaries of the data checksums?
> Your app may checksum per file or some logical partition in the file
> format. 

I'm thinking specifically of the case when the user creates a file, writes the 
file's contents and closes it;  for us, this is the only use-case when writing 
data.  In this scenario, the checksum would be of the file's complete data 
rather than any particular logical partition.

> Btrfs does the checksum per-extent so unless you keep track
> of where the extent boundaries are, that checksum will be useless to
> the userspace app. 

Sure, this is true with how things are currently.

However, I was hoping that it would be possible to add code within btrfs to 
obtain the checksum over all of the file's data.  Since btrfs knows the 
extent sizes and per-extent checksum values, I believe this is tractable and 
relatively easy.

> Also the app would be tied specifically to a storage technology.  No
> matter how great foo might be, not everyone's going to use it.

Sure, but I'm thinking of this behaviour (within the app) as being optional. 
The app would continue to be FS and storage-technology independent.

If the FS doesn't support internal consistency (e.g., ext3, xfs, ..) then the 
app would continue to do userland checksum verification on write:  it's better 
than nothing.

If the app is deployed on a node with btrfs then the app could try to "align" 
the user-supplied checksum with the value within the FS: either pushing the 
correct checksum value into the FS or reading the resulting FS-generated 
checksum value after writing and comparing it with the user-supplied value.

> Also are you going to get this info over nfs, cifs, lustre, gluster,
> ceph, foo, bar and baz?

This is certainly a valid concern. 

I can't speak for all these protocols and distributed filesystems: we don't 
support mounting our app with CIFS and the software doesn't participate in 
Lustre, Gluster or Ceph cluster filesystems.

However, here's information about the protocols we do support:

The majority of LAN transfers use a custom protocol.  The wire-protocol 
includes support for uploading a checksum value on close.

We also support the xrootd protocol, which allows clients to upload checksum 
values with the kXR_verifyw command.

We also have support for NFS v4.1.   NFS doesn't support uploading a checksum (I 
believe, and it isn't part of the current v4.2 work), but we may be able to work 
around this.

We also support WebDAV.  This currently has no support for checksums.

Almost all WAN transfers currently use GridFTP v2.  This includes the SCKS 
command, which allows the client to upload the correct checksum value.

In short, with current usage, the app will know the checksum value, as 
supplied by the remote client.

Cheers,

Paul.



* Re: A couple of questions
  2010-06-01 13:39     ` Martin K. Petersen
@ 2010-06-02 13:40       ` Paul Millar
  2010-06-04  1:17         ` Martin K. Petersen
  0 siblings, 1 reply; 50+ messages in thread
From: Paul Millar @ 2010-06-02 13:40 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Chris Mason, linux-btrfs

On Tuesday 01 June 2010 15:39:52 Martin K. Petersen wrote:
> >>>>> "Paul" == Paul Millar <paul.millar@desy.de> writes:
> Paul> My concern is that, if the server-software doesn't push the
> Paul> client-provided checksum then the FS checksum (plus T-10 DIF/DIX)
> Paul> would not provide a rigorous assurance that the bytes are the
> Paul> same.  Without this assurance, corruption could still occur; for
> Paul> example, within the server's memory.
> 
> For DIX we allow integrity metadata conversion.  Once the data is
> received, the server generates appropriate IMD for the next layer.  Then
> the server verifies that the original IMD matches the data buffer.  That
> way there's no window of error.  But obviously the ideal case is where
> the same IMD can be passed throughout the stack without conversion.

I think we may be talking slightly at cross-purposes here: in my case, one of 
the end-points (for "end-to-end data integrity") is a remote computer, that is 
uploading a file with a corresponding checksum.

Please correct me if I'm wrong here, but T10 DIF/DIX refers only to data 
integrity protection from the OS's FS-level down to the block device: a 
userland application doesn't know that it is writing into a FS that is 
utilising DIX with a DIF-enabled storage system.

When a file is uploaded from a remote client to an application with the 
checksum, the app can verify this checksum internally.  However, there's then 
a (logical) gap between userland and FS where data integrity is no longer 
assured.  For example, corruption that occurs after the app has verified the 
checksum value would not be picked up, even with T10 DIX/DIF, since the FS 
would receive and store the already-corrupted data "in good faith".

In principle, one can add a btrfs-specific mechanism to continue this 
assurance from userland down to the FS.  Perhaps the simplest would be to 
allow userland applications to read the FS's internal checksum (app would read 
the FS internal checksum after writing and verify it is consistent), but I 
guess more sophisticated (interleaved IMD, T10-like) mechanisms are also 
possible.

Unfortunately, any such solution would be btrfs-specific, since (I believe) no 
one has standardised how to extend T10 into userspace.


> Not sure what you use for file service?  I believe NFSv4 allows for
> checksums to be passed along. I have not looked at them closely yet,
> though.

I believe NFS currently doesn't support checksums (as per v4.1).  Looking in 
more detail, Alok Aggarwal gave a talk at the 2006 Connectathon about this.  
Alok's slides have a nice diagram (slide 11) showing the kind of end-to-end 
integrity I'm after.  The issue is how to achieve the assurance between "NFS 
Server" and "Local FS" on the right.

For NFS, I believe there aren't any plans for introducing checksum support for 
v4.2.  Perhaps it'll appear with the later minor versions of the standard.

Cheers,

Paul.


* Re: A couple of questions
  2010-05-31 17:59   ` Paul Millar
@ 2010-06-02 16:19     ` Hubert Kario
  0 siblings, 0 replies; 50+ messages in thread
From: Hubert Kario @ 2010-06-02 16:19 UTC (permalink / raw)
  To: Paul Millar; +Cc: linux-btrfs

On Monday 31 May 2010 19:59:46 Paul Millar wrote:
> Hi Hubert,
> 
> On Thursday 27 May 2010 16:56:00 Hubert Kario wrote:
> > > Would [obtaining file checksum] be possible (without an awful lot
> > > of work)?
> > 
> > [Calculating checksum in-memory]  won't detect in-memory corruption
> > though, but if you want to be resilient to this, you should be looking at
> > ECC RAM as subsequent checks can be affected by it too.
> 
> Certainly ECC RAM will help, but unfortunately it doesn't remove the
> possibility of corruption; for example, CERN found [1] that double-bit
> memory corruptions (which ECC cannot recover from) can still happen.
> 
> [1]
> http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797
> 
> Also, IIRC there was a case where Fermilab tracked down a data corruption
> to a faulty PCI bus in the server.  So who knows where are all the places
> corruption could occur?
> 
> I guess the real problem is that, when processing large amounts of data,
> these rare occurrences start to stack up.
> 

Yes, but AFAIK btrfs checksums don't have a checksum of their own (i.e. you
can't check whether a checksum you read back is itself valid, as it has no
control bits). As such, if you consider PCI bus corruption likely, you still
don't get 100% certainty that the data reached the HDD unharmed.

If you need that level of certainty when recording data, I'd consider
mainframe hardware and/or duplicating the whole storage stack.

Cheers,
-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl

Quality Management System
compliant with ISO 9001:2000

* Re: A couple of questions
  2010-06-02 13:40       ` Paul Millar
@ 2010-06-04  1:17         ` Martin K. Petersen
  0 siblings, 0 replies; 50+ messages in thread
From: Martin K. Petersen @ 2010-06-04  1:17 UTC (permalink / raw)
  To: Paul Millar; +Cc: Martin K. Petersen, Chris Mason, linux-btrfs

>>>>> "Paul" == Paul Millar <paul.millar@desy.de> writes:

Paul> Please correct me if I'm wrong here, but T10 DIF/DIX refers only
Paul> to data integrity protection from the OS's FS-level down to the
Paul> block device: a userland application doesn't know that it is
Paul> writing into a FS that is utilising DIX with a DIF-enabled storage
Paul> system.

My point was that it is possible to have different protection types in
play (and thus different checksums) as long as you overlap the
protection envelopes.  At the expense of having to calculate checksums
multiple times, of course.


Paul> Unfortunately, any such solution would be btrfs-specific, since (I
Paul> believe) no one has standardised how to extend T10 into userspace.

Not yet, but we're working on a generic interface that would allow the
protection information to be attached.  This is not going to be tied to
just T10 DIF.  The current Linux block layer integrity handles different
types of protection information.


Paul> I believe NFS currently doesn't support checksums (as per v4.1).
Paul> Looking into more detail, Alok Aggarwal gave a talk at 2006
Paul> connectathon about this.  Alok's slides have a nice diagram (slide
Paul> 11) showing the kind of end-to-end integrity I'm after.  The issue
Paul> is how to achieve the assurance between "NFS Server" and "Local
Paul> FS" on the right.

Paul> For NFS, I believe there aren't any plans for introducing checksum
Paul> support for v4.2.  Perhaps it'll appear with the later minor
Paul> versions of the standard.

I haven't looked into this for a long time.  Last time I talked to the
NFS folks they seemed to think it would be possible to bridge the two
methods.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: A couple of questions
  2005-04-18 15:31 ` Linus Torvalds
@ 2005-04-18 16:23   ` Paul Jackson
  0 siblings, 0 replies; 50+ messages in thread
From: Paul Jackson @ 2005-04-18 16:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: is, git

Linus wrote:
> Nothing beats backups and distribution.

Famous quote from the past:

"Only wimps use tape backup: real men just upload their important stuff on ftp,
 and let the rest of the world mirror it ;)" Linus Torvalds

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401


* Re: A couple of questions
  2005-04-18 11:51 Imre Simon
@ 2005-04-18 15:31 ` Linus Torvalds
  2005-04-18 16:23   ` Paul Jackson
  0 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2005-04-18 15:31 UTC (permalink / raw)
  To: Imre Simon; +Cc: git



On Mon, 18 Apr 2005, Imre Simon wrote:
>
> How will git handle a corrupted (git) file system?
> 
> For instance, what can be done if objects/xy/z{38} does not pass the
> simple consistency test, i.e. if the file's sha1 hash is not xyz{38}?
> This might be a serious problem because, in general, one cannot
> reconstruct the contents of file objects/xy/z{38} from its name
> xyz{38}.

Nothing beats backups and distribution. The distributed nature of git 
means that you can replicate your objects arbitrarily.

> Another problem might come up if the file does pass the simple
> consistency test but the file's contents is not a valid git file,

Run "fsck-cache". It not only tests SHA1 and general object sanity, but it
does full tracking of the resulting reachability and everything else. It
prints out any corruption it finds (missing or bad objects), and if you
use the "--unreachable" flag it will also print out objects that exist but 
that aren't reachable from any of the HEAD nodes (which you need to 
specify).

So for example

	fsck-cache --unreachable $(cat .git/HEAD)

will do quite a _lot_ of verification on the tree. There are a few extra 
validity tests I'm going to add (make sure that tree objects are sorted 
properly etc), but on the whole if "fsck-cache" is happy, you do have a 
valid tree.

Any corrupt objects you will have to find in backups or other archives (ie
you can just remove them and do an "rsync" with some other site in the
hopes that somebody else has the object you have corrupted).

Of course, "valid tree" doesn't mean that it wasn't generated by some evil 
person, and the end result might be crap. Git is a revision tracking 
system, not a quality assurance system ;)

		Linus


* A couple of questions
@ 2005-04-18 11:51 Imre Simon
  2005-04-18 15:31 ` Linus Torvalds
  0 siblings, 1 reply; 50+ messages in thread
From: Imre Simon @ 2005-04-18 11:51 UTC (permalink / raw)
  To: git

How will git handle a corrupted (git) file system?

For instance, what can be done if objects/xy/z{38} does not pass the
simple consistency test, i.e. if the file's sha1 hash is not xyz{38}?
This might be a serious problem because, in general, one cannot
reconstruct the contents of file objects/xy/z{38} from its name
xyz{38}.

Another problem might come up if the file does pass the simple
consistency test but the file's contents is not a valid git file,
i.e. something that

  (*) successfully inflates to a stream of bytes that forms a sequence of
  <ascii tag without space> + <space> + <ascii decimal size> +
  <byte\0> + <binary object data>.

Are there enough internal redundancies in git to allow fixing at least
some corrupted file systems? Shouldn't there be some?

Another related observation is that git is not really based on a 160 bit
hashing scheme. Indeed, only files that satisfy the above condition
(*) are allowed and this most certainly reduces the valid range of the
hashing function. I do not think that this will be a problem, but it
doesn't hurt to point this out once.

Cheers,

Imre Simon



* Re: A couple of questions
  2002-05-17 15:04       ` Kuba Ober
@ 2002-05-18 20:40         ` Hans Reiser
  0 siblings, 0 replies; 50+ messages in thread
From: Hans Reiser @ 2002-05-18 20:40 UTC (permalink / raw)
  To: Kuba Ober; +Cc: Marc A. Lehmann

Kuba Ober wrote:

>>>What I'm thinking of is this:
>>>to the user, which most users w/o intimate filesystem knowledge won't be
>>>able to answer at all?
>>>      
>>>
>>Unix traditionally wasn't aimed at the point-and-click users without
>>knowledge.
>>    
>>
>
>Yep. But the thing is that either fsck can restore the data or not. There's no 
>way in between.
>
>What more can unix-poweruser do about recovering a filesystem, other than 
>running a disk editor (say a reiserfs-customized version of norton disk 
>editor, which used to be a good thing for hand recovery of fat fs before it 
>became crap) ?
>
>What kinds of questions can fsck really ask without having to present user 
>with a lot of intricate data, which is better visualized graphically or, at 
>least in a more interactive ui?
>
>Example: If e2fsck starts asking questions like "inode counts don't match for 
>groups (a long list of groups). fix them <y>", what should I answer? no? that 
>would be nonsense.
>
Yeah, I agree with this one.  Especially when there is usually 
absolutely no place where it is documented what in the world these sorts 
of messages mean.  This is one of my pet peeves with our reiserfsck and 
journaling code: I have simply been unable to convey that a message which 
is not explained so that the average user can understand it is programming 
malpractice.

>  
>
>>for some strange reason no fsck behaves like that.
>>    
>>
>
>Because most fscks are hacks. They are useful, they mostly do their job, but 
>they are far from full-featured tools, and that's the reality. I don't 
>complain. I just say that their functionality isn't optimal, and shouldn't be 
>cited as something that's the way to go, or as something that should be a 
>design goal. They reflect how much time the fs-knowing people had to put 
>into them.
>
Yes, very true.  Sigh.  Norton was the exception to this, and it was not 
produced by the author of the FS.

>
>Cheers, Kuba

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-17 13:11 berthiaume_wayne
@ 2002-05-17 16:03 ` Kuba Ober
  0 siblings, 0 replies; 50+ messages in thread
From: Kuba Ober @ 2002-05-17 16:03 UTC (permalink / raw)
  To: berthiaume_wayne; +Cc: reiserfs-list

On Friday 17 May 2002 09:11 am, berthiaume_wayne@emc.com wrote:
> Kuba, I guess the question that should be posed this way: What is
> the downside of not asking the user and just fixing what can be fixed? Is
> there a potential for unrepairable damage if you were to fix blindly
> without "user" intervention?

The downside is that with fsck's that are quick hacks, you really require user 
to know a lot, and you ask complicated questions.

Fsck should *first* get all the information it can from the fs and digest it. 
Only then can it try to fix things, and ask questions about things that are 
doubtful. There are certain things that are 99.999995% true, almost 
assertions, because certain damage patterns have extremely slim chances of 
occurring.

This is essentially a way to formulate fsck algorithm in terms similar to some 
expert systems.

Example: I'll use FAT16 for the example fs, since I assume most people know it 
well enough. With FAT16 filesystem, it was quite easy to discern clusters 
occupied by directories from clusters occupied by file data. And then, there 
was more data that increased the probability that you indeed had a proper 
directory cluster. It might have gone in steps:
(assuming all fat copies were zeroed)
1. Read all disk clusters, detect those that are probable directories based 
solely on cluster contents. Define an "is-directory" property for each 
cluster. Assign 0 to this property in those clusters which failed detection 
in this step, and 0.8 to clusters which were detected.
2. Check for mutual links between directories detected thus far (the forward 
and backward links). Bump the "is-directory" probabilities for clusters that 
have passed to 1.0.
3. Assign "is-first-cluster" probabilities for all clusters. Set them to value 
of "is-directory" from the directory cluster that contained an entry pointing 
to this cluster, or 0 if nothing points to them.
4. Check for consecutive directory clusters, starting at all clusters having 
is-first-cluster > 0 && is-directory > 0. Bump "is-directory" based on 
best-known neighbors, etc, ...
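
Steps 1 to 3 above could be sketched like this. It is a toy model, not real FAT16 parsing: the Cluster class, its fields and the score() function are all invented for the illustration, and the content-only detection of step 1 is assumed to have already produced the looks_like_dir flag.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Cluster:
    looks_like_dir: bool                                 # content-only heuristic result
    children: List[int] = field(default_factory=list)   # clusters its entries point to
    parent: Optional[int] = None                         # target of its ".." entry, if any
    is_directory: float = 0.0
    is_first_cluster: float = 0.0


def score(clusters: Dict[int, Cluster]) -> None:
    # Step 1: 0.8 for clusters detected from contents alone, 0 otherwise.
    for c in clusters.values():
        c.is_directory = 0.8 if c.looks_like_dir else 0.0
    # Step 2: a forward link matched by a backward link bumps both ends to 1.0.
    for idx, c in clusters.items():
        if c.is_directory == 0.0:
            continue
        for child_idx in c.children:
            child = clusters.get(child_idx)
            if child is not None and child.is_directory > 0 and child.parent == idx:
                c.is_directory = 1.0
                child.is_directory = 1.0
    # Step 3: a pointed-to cluster inherits the confidence of the directory
    # that points at it; clusters nothing points to stay at 0.
    for c in clusters.values():
        for child_idx in c.children:
            child = clusters.get(child_idx)
            if child is not None:
                child.is_first_cluster = max(child.is_first_cluster, c.is_directory)
```

Step 4 (following consecutive clusters) is left out, as are loops and multiply-linked entries.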

There were many shortcuts taken here, since I ignore multiply-linked entries, 
loops, etc. It was meant as an example of the idea, not an implementation.  
There is a lot of what-if kind of approach in fs recovery, and by providing 
an expert-system fuzzy-logic (i.e. non-binary) approach, a lot of knowledge 
can be gained about a filesystem without asking a single question. We only 
really need answers from the user when a recovery decision depends on a 
piece of information and we consider the information we have to be too 
doubtful.

That also means that the fsck/recovery program needs to do a lot of stuff, a 
lot more than one thinks. The typical "multi-pass" approach where errors are 
fixed from lower-level to higher-level is wrong, since it inherently either 
loses information, or doesn't have it yet in earlier steps.

There can be only three passes: gather data from the media, ask additional 
questions to the user if they are needed, do the fixes. I don't see it any 
other way, and I was always thinking of an ideal fsck tool in these terms 
since I was about 12 (late 80's, already had a third HD in my 286/8 machine, 
and had done a few recovery operations with diskeditor). An example of the 
wrong approach is, say, norton disk doctor on FAT16: it would first check FAT, 
fix that, and only afterwards check & fix directory structure -- it loses a lot 
of information that each of the passes keeps to itself, eg. fixing cluster 
chains in FAT doesn't really look at what those clusters contain, etc.
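
As a rough sketch of the three passes (gather, ask, fix): everything here is hypothetical, including the Finding class, the 0.99 threshold and the callback names. The only point is that nothing is changed until all data has been gathered, and that the user is asked only about findings the tool is not confident about.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Finding:
    description: str
    confidence: float   # how sure the tool is that its proposed fix is right


def three_pass_fsck(
    gather: Callable[[], Iterable[Finding]],
    ask: Callable[[Finding], bool],
    apply_fix: Callable[[Finding], None],
    threshold: float = 0.99,
) -> List[Finding]:
    # Pass 1: read everything from the media before touching anything.
    findings = list(gather())
    # Pass 2: ask only about the doubtful findings.
    approved = [f for f in findings if f.confidence >= threshold or ask(f)]
    # Pass 3: apply the fixes that were near-certain or confirmed.
    for f in approved:
        apply_fix(f)
    return approved
```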

Cheers, Kuba

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-17 13:10     ` Valdis.Kletnieks
@ 2002-05-17 15:35       ` Kuba Ober
  0 siblings, 0 replies; 50+ messages in thread
From: Kuba Ober @ 2002-05-17 15:35 UTC (permalink / raw)
  To: reiserfs-list

> > Why on earth does a filesystem check & recovery program need to ask
> > questions to the user, which most users w/o intimate filesystem knowledge
> > won't be able to answer at all? Looking at this list, what people want is
> > to get their data back, as much as possible. They never want to get less
> > than that. Why bother asking?
>
> Well.. at least "traditionally", it was possible for a filesystem to get
> scrozzled in ways that just saying 'y' would result in more data loss than
> one judicious 'n' at the wrong time.  There's been more than once in the
> last two decades that I've had  fsck dropping zillions of files into
> lost+found/ because it decided that they were in invalid directories.
>
> Why were those directories invalid? Because their .. pointers were broken.
>
> What was wrong with their .. pointers?  They pointed at a directory that
> had a bum .. as well...
>
> End result?  If you answered 'n' to all the "relink?" questions *except*
> for the one actual broken one, you ended up with an almost-intact directory
> tree in lost+found, and a simple 'mv #004 ../real_name' would finish it,
> rather than 3 zillion #nnnn entries with no directory structure at all.

That's an fsck problem. Fsck has enough data to make sure that directories are 
really correct. That particular fsck you're mentioning didn't do its job, 
that's all.

As far as the pointers are concerned, what kinds of directory pointers are 
you referring to? Pointers from lower-level directories that didn't point to 
the directory in question, pointers from broken lower-level dirs that point 
to the directory in question, pointers from that directory entry to some 
other broken directory?

Please explain the pointer thing and say explicitly what points to what and 
why they were broken. I'd really like to know, as I'm putting down an 
idea-list in case I ever feel like making a usable disk-editor thingo that 
would support different linux fsms.

> The ugliest one of these was on a system where fsck refused to grow
> lost+found (because doing so would require more blocks - which are hard to
> get if you can't trust the free list at the moment - that's why old mkfs
> commands would have a loop that touched a bunch of files in lost+found and
> then rm them - just to grow the directory size).
>
> So quite often, you'd end up doing an 'fsck -n' once to figure out what was
> scrogged, then re-run it several times, answering 'y' to things in the
> right order...

Again, that looks like fsck was broken and not doing its job. Please say what 
kind of info the fsck was providing to you with fsck -n, and how you used 
that info to answer 'y' in the right order. I assume that:
1. the fsck would be able to answer most of these questions if it had code to 
do it,
2. the only leftover questions would be those that would have to be asked 
anyway -- if we make users not answer questions when they don't need to, 
they will learn to answer the important ones.

Please tell me which one of these would you rather see?
- "invalid block counts in groups (blah, lists inodes in groups), fix them?"
- "inode x had zero dtime, fixed"
- or "your fs seems to have suffered corruption that's typical to shutting 
down an fs without properly unmounting it first, corrected."
(that's somewhat generalizing, and based on e2fs)

Cheers, Kuba

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: A couple of questions
@ 2002-05-17 15:27 Steve Pratt
  0 siblings, 0 replies; 50+ messages in thread
From: Steve Pratt @ 2002-05-17 15:27 UTC (permalink / raw)
  To: reiserfs-list


There have been many valid points raised here about the ability of a fsck
program to correctly fix all errors without user input.   Based on a quick
scan of the reiser fsck code, it appears to have relatively few user
prompts (except for the rebuild_bs path which I am excluding from this
discussion).   For keeping FSIM behaviors consistent in the EVMS interface,
I will implement the call to reiser fsck with an auto response of a carriage
return to accept the default action of the user prompt.  I will preface the
running of fsck in other than read-only mode with a warning that this will
occur and in the case where this is unacceptable to an experienced user, he
can still execute the fsck from the command line.  I will also capture all
of the output from fsck (including the banner/credits) and display to the
user.
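
A minimal sketch of the "auto response" idea, in Python rather than the FSIM's own code: the function name and the max_prompts limit are invented, and it assumes the checker reads its answers from standard input (a tool that prompts on /dev/tty would not see them).

```python
import subprocess


def run_with_default_answers(cmd, max_prompts=1000):
    """Run cmd, answer every prompt with a bare newline, return (exit code, output)."""
    result = subprocess.run(
        cmd,
        input="\n" * max_prompts,      # one carriage return per expected prompt
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,      # capture everything, banner included
        text=True,
    )
    return result.returncode, result.stdout
```

The caller would pass the fsck command line as a list and show the captured output to the user afterwards.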

Now, on to expand and shrink......

Steve

EVMS Development - http://www.sf.net/projects/evms
Linux Technology Center - IBM Corporation
(512) 838-9763  EMAIL: SLPratt@US.IBM.COM



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-17  0:45         ` Philipp Gühring
  2002-05-17  1:06           ` Manuel Krause
@ 2002-05-17 15:21           ` Kuba Ober
  1 sibling, 0 replies; 50+ messages in thread
From: Kuba Ober @ 2002-05-17 15:21 UTC (permalink / raw)
  To: reiserfs-list

> > I am a filesystems developer, and I don't know enough to do more than
> > press y with most fscks.
>
> There is one case, in which I know that I have to say no: If the partition
> that a fsck tries to correct has a different type than the fsck thinks.
> (Running e2fsck against a reiserfs partition for example)

These are obvious cases and I don't oppose asking in those cases. That makes 
sense. This is actually a case when the user/sysadmin can provide meaningful 
input to the fsck. But asking questions which reduce to "are you sure you 
want me (fsck) to do as much as I can to bring it back" are pointless.

> > I think that for the most part, if one is
> > going to ask the user to help, one needs to provide a real interface, a
> > filesystem structure editor.....
>
> Well, debugfs (ext2) was an approach in that direction, wasn't it?
> Now I stumbled across debugreiserfs, but it lacks interactive mode.

We need at least as much functionality as norton disk editor had wrt fat 
partition fixing. Anything less is a waste of time.

> > which no FS has ever done....  but
> > right now we need to get what we have debugged thoroughly.  It is on the
> > list of things I would like to add someday.
>
> What I would like to see is a tool to do the following:
> (And I don't think that I will find a sponsor for that tool :-(
>
> After a crash, I make a dd from the crashed partition, into a normal file
> in another partition, that's perhaps on a different harddisk.

That's only needed sometimes, like when your source partition is failing. An 
fsck can warn that "there are read errors while accessing this partition, 
advise making a binary copy and working on that".

Otherwise, typically the amount of bits that fsck actually changes to 
fix things is minor. Restore files are not such a bad idea, you know, 
especially since they are easy to implement (just journal all changes in a 
file, including previous data).
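
To show how little a restore file needs, here is a small sketch. The UndoLog class is made up for this mail, it keeps the journal in memory instead of in a file, and it assumes a fixed-size image where writes never extend past the end.

```python
import io


class UndoLog:
    """Record the previous contents of every region before overwriting it."""

    def __init__(self, image):
        self.image = image      # any seekable binary file object
        self.entries = []       # (offset, previous bytes), in write order

    def write(self, offset, data):
        self.image.seek(offset)
        old = self.image.read(len(data))
        self.entries.append((offset, old))   # journal the old data first
        self.image.seek(offset)
        self.image.write(data)

    def rollback(self):
        # Undo in reverse order so overlapping writes restore correctly.
        for offset, old in reversed(self.entries):
            self.image.seek(offset)
            self.image.write(old)
        self.entries.clear()
```

For example, with img = io.BytesIO(b"abcdefgh"), two overlapping writes followed by rollback() leave the image as it started.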

> Then I want to run a dumping utility, that tries to restore every bit that
> still can be found in the crashed partition, and tries to reassemble all the
> files in it, and even creating a lost&found directory ...
>
> That dumping utility should take an output directory as argument, in which
> it recreates the contents.
>
> Something like "The Coroners Kit", but more for recovery than for
> investigation.
>
> What is important for that tool:
> * It must not crash under any circumstances. Even if every bit of the
> filesystem is corrupted, it has to do its work, and try to recover as much
> as possible.
> * It has to assume that every bit of the filesystem can be corrupt, so it
> has to try to semantically verify the bits, pointers, ...
> * It should try different ways to restore access to lost data, if it
> stumbles across problems in the filesystems.
> * There must not be any assertions that would not allow the tool to run
> over the whole partition, and search everywhere for lost data
> * It has to be designed to work on files which are dumps from partition
> based filesystems.
> * It should be able to detect and correct common hardware or crash related
> problems in the filesystem:
>   * Files that are not statable or accessible, because there only exists an
> entry in the directory, but nothing in the reiserfs tree
>   * Transactions that are open
>   * Corrupted directory entries like filenames with special characters that
> can not be used from the system, or rights with undefined bits, ...
>   * ...
> * It must not change any data on the partition, instead it writes
> everything to an output directory

reiterating
1. It doesn't need to work on the real partition. There are many ways 
(implementations) it can work without writing to it, not even the metadata.
2. It should try to "tick-check" any correct data asap, so as to limit the 
search area for corrupted leftovers
3. It should basically leave corrupted data until it is done with all 
non-corrupted stuff. Typical fs corruptions are tiny, tiny, tiny.
4. Hardware problems are basically "underlying-block-device" problems. 
Sometimes things fail without hardware failing, like bugs in raid stuff, etc. 
This is basically a heuristic pattern detection stuff: if things are borked 
in a certain pattern, we can assume that the block device has problems.

A lot of stuff in such a tool would be fs-independent, like a filename 
verifier (that would give a probability that a given string was a 
filename), etc. Anyway, the point of this exercise is to make a tool which 
can have a decent frontend, and which actually can ask meaningful questions 
that the user can answer. On many server systems, the admin is reasonably 
able to answer a question about say whether this directory was something he 
hasn't seen, or something that he has been working on a lot lately, etc. This 
requires maybe kind of an expert-system approach. Well, that's too much of a 
buzzword, but it boils down to very simple things: it needs to find out as 
many correct things about the filesystem (those that leave little doubt) as 
it can, and treat them as assertions, and then it can ask a few decent 
questions that  will make the recovery possible. Again, a lot of fs 
corruption is pretty much limited. Say if for some reason all the superblocks 
have been overwritten, it should first try assuming that these were generated 
by the most recent mkfs with default options, and then try a few different 
possible options, progressing to uncommon ones, etc. That's the approach to 
the problem as I see it. Rants and flames welcome.
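
A very crude sketch of such an fs-independent filename check, just to make the idea concrete. The function name, the 255-byte limit and the "share of printable ASCII bytes" score are all assumptions picked for the example; a real tool would need far better statistics.

```python
def filename_plausibility(raw: bytes) -> float:
    """Return a rough 0.0-1.0 score that raw is a real filename."""
    # Hard rejections: empty, too long, or containing NUL or '/'.
    if not raw or len(raw) > 255 or b"\0" in raw or b"/" in raw:
        return 0.0
    # Otherwise score by the fraction of printable ASCII bytes.
    printable = sum(1 for b in raw if 0x20 <= b < 0x7F)
    return printable / len(raw)
```

A score like this would feed into the same kind of probabilities as everything else, not decide anything by itself.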

Cheers, Kuba Ober

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-16 21:44     ` Lehmann 
                         ` (2 preceding siblings ...)
  2002-05-17 15:04       ` Kuba Ober
@ 2002-05-17 15:05       ` Kuba Ober
  3 siblings, 0 replies; 50+ messages in thread
From: Kuba Ober @ 2002-05-17 15:05 UTC (permalink / raw)
  To:  ( Marc A. Lehmann ); +Cc: reiserfs-list

> > What I'm thinking of is this:
> > to the user, which most users w/o intimate filesystem knowledge won't be
> > able to answer at all?
>
> Unix traditionally wasn't aimed at the point-and-click users without
> knowledge.

Yep. But the thing is that either fsck can restore the data or not. There's no 
way in between.

What more can unix-poweruser do about recovering a filesystem, other than 
running a disk editor (say a reiserfs-customized version of norton disk 
editor, which used to be a good thing for hand recovery of fat fs before it 
became crap) ?

What kinds of questions can fsck really ask without having to present user 
with a lot of intricate data, which is better visualized graphically or, at 
least in a more interactive ui?

Example: If e2fsck starts asking questions like "inode counts don't match for 
groups (a long list of groups). fix them <y>", what should I answer? no? that 
would be nonsense.

One of the reasons for these questions is that they are eyebrow-raisers. Say, 
if your raid1 array got out of sync somehow, a lot of fsck errors will maybe 
prompt you to look at whether something like that might have happened in the 
first place. But fsck should be able, as far as it can from the fs metadata, 
to tell the user whether the fs was seriously corrupted by a block device 
failure (say raid corruption, hd having transfer problems, etc), overwriting 
of data with garbage, or unclean shutdown / fs-specific kernel bugs. It's the 
fsck utility that has this data. No power admin will have that kind of 
knowledge without at least dumping metadata and having a look at it with some 
specialized tool, or an on-the-spot hacked script to test things out.

I had a hardware crash. Or a raid failure. Obviously I don't have recent 
enough backup, or I really need those last-minute changes back, quickly. Or 
my system crashed in such a way that some metadata got corrupted. Many ifs. 
Now please tell me specifically how that knowledge applies to answering 
particular questions that say ext2fsck or reiserfsck may ask.

> for some strange reason no fsck behaves like that.

Because most fscks are hacks. They are useful, they mostly do their job, but 
they are far from full-featured tools, and that's the reality. I don't 
complain. I just say that their functionality isn't optimal, and shouldn't be 
cited as something that's the way to go, or as something that should be a 
design goal. They reflect how much time the fs-knowing people had to put 
into them.

Cheers, Kuba

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-16 21:44     ` Lehmann 
  2002-05-16 23:57       ` Hans Reiser
  2002-05-17  0:17       ` Manuel Krause
@ 2002-05-17 15:04       ` Kuba Ober
  2002-05-18 20:40         ` Hans Reiser
  2002-05-17 15:05       ` Kuba Ober
  3 siblings, 1 reply; 50+ messages in thread
From: Kuba Ober @ 2002-05-17 15:04 UTC (permalink / raw)
  To:  ( Marc A. Lehmann ); +Cc: reiserfs-list

> > What I'm thinking of is this:
> > to the user, which most users w/o intimate filesystem knowledge won't be
> > able to answer at all?
>
> Unix traditionally wasn't aimed at the point-and-click users without
> knowledge.

Yep. But the thing is that either fsck can restore the data or not. There's no 
way in between.

What more can unix-poweruser do about recovering a filesystem, other than 
running a disk editor (say a reiserfs-customized version of norton disk 
editor, which used to be a good thing for hand recovery of fat fs before it 
became crap) ?

What kinds of questions can fsck really ask without having to present user 
with a lot of intricate data, which is better visualized graphically or, at 
least in a more interactive ui?

Example: If e2fsck starts asking questions like "inode counts don't match for 
groups (a long list of groups). fix them <y>", what should I answer? no? that 
would be nonsense.

One of the reasons for these questions is that they are eyebrow-raisers. Say, 
if your raid1 array got out of sync somehow, a lot of fsck errors will maybe 
prompt you to look at whether something like that might have happened in the 
first place. But fsck should be able, as far as it can from the fs metadata, 
to tell the user whether the fs was seriously corrupted by a block device 
failure (say raid corruption, hd having transfer problems, etc), overwriting 
of data with garbage, or unclean shutdown / fs-specific kernel bugs. It's the 
fsck utility that has this data. No power admin will have that kind of 
knowledge without at least dumping metadata and having a look at it with some 
specialized tool, or an on-the-spot hacked script to test things out.

I had a hardware crash. Or a raid failure. Obviously I don't have recent 
enough backup, or I really need those last-minute changes back, quickly. Or 
my system crashed in such a way that some metadata got corrupted. Many ifs. 
Now please tell me specifically how that knowledge applies to answering 
particular questions that say ext2fsck or reiserfsck may ask.

> for some strange reason no fsck behaves like that.

Because most fscks are hacks. They are useful, they mostly do their job, but 
they are far from full-featured tools, and that's the reality. I don't 
complain. I just say that their functionality isn't optimal, and shouldn't be 
cited as something that's the way to go, or as something that should be a 
design goal. They reflect how much time the fs-knowing people had to put 
into them.

Cheers, Kuba

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: A couple of questions
@ 2002-05-17 13:11 berthiaume_wayne
  2002-05-17 16:03 ` Kuba Ober
  0 siblings, 1 reply; 50+ messages in thread
From: berthiaume_wayne @ 2002-05-17 13:11 UTC (permalink / raw)
  To: kuba; +Cc: reiserfs-list

	Kuba, I guess the question that should be posed this way: What is
the downside of not asking the user and just fixing what can be fixed? Is
there a potential for unrepairable damage if you were to fix blindly without
"user" intervention?

-----Original Message-----
From: Kuba Ober [mailto:kuba@mareimbrium.org]
Sent: Thursday, May 16, 2002 5:24 PM
To: reiserfs-list@namesys.com
Subject: Re: [reiserfs-list] A couple of questions


> >One extra question on this.  I assume that if in Fix mode and errors are
> >encountered that fsck.reiserfs will prompt to fix each error and that
> >there is no way to have it answer 'Yes' automatically like the ext2 -y
> >option.
> >
> >If not then I will probably have to take Jonathan Briggs suggestion of a
> >third process to answer 'Yes' repeatedly.
> >
> >Steve
>
> Why in the world do you want to run fsck without running it manually,
> and passing all information to the user?

What I'm thinking of is this:
1. If a filesystem is really too borked for fsck to recover useful stuff, it
should be left alone. Either fsck is able to help or not. No need to ask
user, fsck has more data to determine whether the fs makes a scant of sense,
or if it has been messed up too much.
2. If we run fsck, we want to recover as much data as possible. That's what
lost+found directory is for -- stuff that is not exactly clean for use, but
may nevertheless be useful, gets hooked there.

Why on earth does a filesystem check & recovery program need to ask questions
to the user, which most users w/o intimate filesystem knowledge won't be able
to answer at all? Looking at this list, what people want is to get their data
back, as much as possible. They never want to get less than that. Why bother
asking?

That's one thing. Another thing is making fsck work on broken media, since
that is what many unsuspecting users actually do. It should simply disregard
read errors and try using whatever data there is in ok-read blocks.

I don't think that asking too many questions is worth it. He who runs fsck in
"fix" mode wants his data back (whatever is left of it). Certain things, like
recovering the deleted files, may be worth specifying as options, but typical
recovery stuff should be w/o questions in my opinion. At least that's what
I'd expect all fsck's to do.

Cheers, Kuba

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-16 21:23   ` Kuba Ober
  2002-05-16 21:44     ` Lehmann 
  2002-05-16 21:44     ` Lehmann 
@ 2002-05-17 13:10     ` Valdis.Kletnieks
  2002-05-17 15:35       ` Kuba Ober
  2 siblings, 1 reply; 50+ messages in thread
From: Valdis.Kletnieks @ 2002-05-17 13:10 UTC (permalink / raw)
  To: Kuba Ober; +Cc: reiserfs-list

[-- Attachment #1: Type: text/plain, Size: 1802 bytes --]

On Thu, 16 May 2002 17:23:42 EDT, Kuba Ober <kuba@mareimbrium.org>  said:

> Why on earth does a filesystem check & recovery program need to ask questions 
> to the user, which most users w/o intimate filesystem knowledge won't be able 
> to answer at all? Looking at this list, what people want is to get their data 
> back, as much as possible. They never want to get less than that. Why bother 
> asking?

Well.. at least "traditionally", it was possible for a filesystem to get
scrozzled in ways that just saying 'y' would result in more data loss than
one judicious 'n' at the wrong time.  There's been more than once in the last
two decades that I've had  fsck dropping zillions of files into lost+found/ because
it decided that they were in invalid directories.

Why were those directories invalid? Because their .. pointers were broken.

What was wrong with their .. pointers?  They pointed at a directory that had
a bum .. as well...

End result?  If you answered 'n' to all the "relink?" questions *except* for
the one actual broken one, you ended up with an almost-intact directory tree
in lost+found, and a simple 'mv #004 ../real_name' would finish it, rather than
3 zillion #nnnn entries with no directory structure at all.

The ugliest one of these was on a system where fsck refused to grow lost+found
(because doing so would require more blocks - which are hard to get if you
can't trust the free list at the moment - that's why old mkfs commands would
have a loop that touched a bunch of files in lost+found and then rm them - just
to grow the directory size).

So quite often, you'd end up doing an 'fsck -n' once to figure out what was
scrogged, then re-run it several times, answering 'y' to things in the right
order...

/Valdis (who has had Very Bad Weekends from this issue ;)


[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-17  0:45         ` Philipp Gühring
@ 2002-05-17  1:06           ` Manuel Krause
  2002-05-17 15:21           ` Kuba Ober
  1 sibling, 0 replies; 50+ messages in thread
From: Manuel Krause @ 2002-05-17  1:06 UTC (permalink / raw)
  To: reiserfs-list

On 05/17/2002 02:45 AM, Philipp Gühring wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Dear Hans,
> 
> 
>>I am a filesystems developer, and I don't know enough to do more than
>>press y with most fscks.  
>>
> 
> There is one case, in which I know that I have to say no: If the partition 
> that a fsck tries to correct has a different type than the fsck thinks. 
> (Running e2fsck against a reiserfs partition for example)
> 
> And those things happen when someone changes harddisks, ...
> 
> 
>>I think that for the most part, if one is
>>going to ask the user to help, one needs to provide a real interface, a
>>filesystem structure editor.....  
>>
> 
> Well, debugfs (ext2) was an approach in that direction, wasn't it?
> Now I stumbled across debugreiserfs, but it lacks interactive mode.
> 
> 
>>which no FS has ever done....  but
>>right now we need to get what we have debugged thoroughly.  It is on the
>>list of things I would like to add someday.
>>
> 
> What I would like to see is a tool to do the following:
> (And I don't think that I will find a sponsor for that tool :-(
> 
> After a crash, I make a dd from the crashed partition, into a normal file in 
> another partition, that's perhaps on a different harddisk.
> 
> Then I want to run a dumping utility, that tries to restore every bit that 
> still can be found in the crashed partition, and tries to reassemble all the 
> files in it, and even creating a lost&found directory ...
> 
> That dumping utility should take an output directory as argument, in which it 
> recreates the contents.


It should take a FILE of the SAME size of the original partition as 
output. So the utility or user may dd it back safely.

> 
> Something like "The Coroners Kit", but more for recovery than for 
> investigation.
> 
> What is important for that tool:
> * It must not crash under any circumstances. Even if every bit of the 
> filesystem is corrupted, it has to do its work, and try to recover as much as 
> possible.
> * It has to assume that every bit of the filesystem can be corrupt, so it has 
> to try to semantically verify the bits, pointers, ...
> * It should try different ways to restore access to lost data, if it stumbles 
> across problems in the filesystems.
> * There must not be any assertions that would not allow the tool to run over 
> the whole partition, and search everywhere for lost data
> * It has to be designed to work on files which are dumps from partition based 
> filesystems.
> * It should be able to detect and correct common hardware or crash related 
> problems in the filesystem: 
>   * Files that are not statable or accessible, because there only exists an 
> entry in the directory, but nothing in the reiserfs tree
>   * Transactions that are open
>   * Corrupted directory entries like filenames with special characters that 
> can not be used from the system, or rights with undefined bits, ...
>   * ...
> * It must not change any data on the partition, instead it writes everything 
> to an output directory


...FILE, that is able to be checked manually via loop-mount to see the 
difference... Philipp is very right on his needs. We all are interested, 
I assume.

> 
> Many greetings,
> - -- 
> ~ Philipp Gühring              p.guehring@futureware.at
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.0.6 (GNU/Linux)
> Comment: For info see http://www.gnupg.org
> 
> iD8DBQE85FKjlqQ+F+0wB3oRAlT3AJ9/2t5pDirnnLs/4daKrSKWD2msxQCeIHZx
> BU+PvfxKKbojRtdnLPerfMY=
> =dohB
> -----END PGP SIGNATURE-----
> 


Best wishes,

Manuel




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-16 23:57       ` Hans Reiser
@ 2002-05-17  0:45         ` Philipp Gühring
  2002-05-17  1:06           ` Manuel Krause
  2002-05-17 15:21           ` Kuba Ober
  0 siblings, 2 replies; 50+ messages in thread
From: Philipp Gühring @ 2002-05-17  0:45 UTC (permalink / raw)
  To: reiserfs-list

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dear Hans,

> I am a filesystems developer, and I don't know enough to do more than
> press y with most fscks.  

There is one case, in which I know that I have to say no: If the partition 
that a fsck tries to correct has a different type than the fsck thinks. 
(Running e2fsck against a reiserfs partition for example)

And those things happen when someone changes harddisks, ...

> I think that for the most part, if one is
> going to ask the user to help, one needs to provide a real interface, a
> filesystem structure editor.....  

Well, debugfs (ext2) was an approach in that direction, wasn't it?
Now I stumbled across debugreiserfs, but it lacks interactive mode.

> which no FS has ever done....  but
> right now we need to get what we have debugged thoroughly.  It is on the
> list of things I would like to add someday.

What I would like to see is a tool to do the following:
(And I don't think that I will find a sponsor for that tool :-(

After a crash, I make a dd from the crashed partition into a normal file in 
another partition, perhaps on a different hard disk.

Then I want to run a dumping utility that tries to restore every bit that 
can still be found in the crashed partition, tries to reassemble all the 
files in it, and even creates a lost&found directory ...

That dumping utility should take an output directory as argument, in which it 
recreates the contents.

Something like "The Coroner's Toolkit", but more for recovery than for 
investigation.

What is important for that tool:
* It must not crash under any circumstances. Even if every bit of the 
filesystem is corrupted, it has to do its work, and try to recover as much as 
possible.
* It has to assume that every bit of the filesystem can be corrupt, so it has 
to try to semantically verify the bits, pointers, ...
* It should try different ways to restore access to lost data, if it stumbles 
across problems in the filesystems.
* There must not be any assertions that would prevent the tool from running 
over the whole partition and searching everywhere for lost data
* It has to be designed to work on files which are dumps from partition based 
filesystems.
* It should be able to detect and correct common hardware or crash related 
problems in the filesystem: 
  * Files that are not statable or accessible, because there only exists an 
entry in the directory, but nothing in the reiserfs tree
  * Transactions that are open
  * Corrupted directory entries like filenames with special characters that 
can not be used from the system, or rights with undefined bits, ...
  * ...
* It must not change any data on the partition, instead it writes everything 
to an output directory
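
As a rough illustration of the "semantically verify the bits, pointers" 
requirement in the list above, here is a minimal C sketch of a bounds-checked 
block lookup on a dumped image. The `struct image` layout and the 
`image_block` name are hypothetical and are not taken from reiserfsprogs. The 
only point is that an untrusted block number is rejected with NULL instead of 
tripping an assertion, and that the image is only ever read.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical read-only view of a partition dump (e.g. made with dd). */
struct image {
    const uint8_t *data;   /* start of the dump, never written to */
    size_t size;           /* size of the dump in bytes */
    size_t block_size;     /* filesystem block size in bytes */
};

/*
 * Returns a pointer to block `blocknr` inside the image, or NULL when the
 * block number cannot be valid. The block number is treated as untrusted,
 * since it may come from a corrupted on-disk pointer.
 */
const uint8_t *image_block(const struct image *img, uint64_t blocknr)
{
    if (img == NULL || img->data == NULL || img->block_size == 0)
        return NULL;

    /* Only whole blocks count; a truncated trailing block is ignored. */
    uint64_t nblocks = img->size / img->block_size;
    if (blocknr >= nblocks)
        return NULL;

    /* Safe: blocknr < nblocks, so the product is below img->size. */
    return img->data + (size_t)blocknr * img->block_size;
}
```

A caller would log or skip a NULL result and keep scanning the rest of the 
image, rather than stopping at the first bad pointer.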

Many greetings,
- -- 
~ Philipp Gühring              p.guehring@futureware.at
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE85FKjlqQ+F+0wB3oRAlT3AJ9/2t5pDirnnLs/4daKrSKWD2msxQCeIHZx
BU+PvfxKKbojRtdnLPerfMY=
=dohB
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-16 21:44     ` Lehmann 
  2002-05-16 23:57       ` Hans Reiser
@ 2002-05-17  0:17       ` Manuel Krause
  2002-05-17 15:04       ` Kuba Ober
  2002-05-17 15:05       ` Kuba Ober
  3 siblings, 0 replies; 50+ messages in thread
From: Manuel Krause @ 2002-05-17  0:17 UTC (permalink / raw)
  To: reiserfs-list

Hi Marc A. Lehmann and all others!

On 05/16/2002 11:44 PM, Marc A. Lehmann <pcg@goof.com> wrote:

> On Thu, May 16, 2002 at 05:23:42PM -0400, Kuba Ober <kuba@mareimbrium.org> wrote:
> 
>>What I'm thinking of is this:
>>to the user, which most users w/o intimate filesystem knowledge won't be able 
>>to answer at all?
>>
> 
> Unix traditionally wasn't aimed at the point-and-click users without
> knowledge.


Don't lose contact with reality. Nowadays things change very often. And 
at least Linux has to be usable for a "traditional Win user" soon if 
it is to exist in the future.

> 
>>Looking at this list, what people want is to get their data 
>>back, as much as possible. They never want to get less than that. Why bother 
>>asking?
>>
> 
> Users who know nothing can still be told to just press "y". Even better,
> somebody with some knowledge about the filesystem (and the contents!)
> layout can often do better with an interactive fsck (see ext2fs).
> 
> I don't think it makes sense to enhance the dumb-user-mode while at the same
> time keeping informed users from working properly.
> 


It makes sense to improve all user modes (in docs and usability) AND all 
automatic modes possible depending on the distro. Pardon, what is an 
"informed" user? Someone who reads the Docs or someone who can decidedly 
type "Yes"... and make a repeat loop??

> 
>>that is what many unsuspecting users actually do. It should simply disregard 
>>read errors and try using whatever data there is in ok-read blocks.
>>
> 
> It should actually ask the user whether she wants the block to be repaired
> (if possible), or permanently marked defect.


Doesn't that really depend on the HW state <-> or can software reliably 
decide whether the info it gets is o.k. so far on Linux?? See the 
previous posts on the list.

> 
>>I don't think that asking too many questions is worth it. He who runs fsck in 
>>"fix" mode wants his data back (whatever is left of it).
>>
> 
> That's a big mistake. He who runs fsck wants to recover as much data as
> possible. Sometimes maybe more than fsck alone can do.
> 


All of us who run "reiserfsck" want as much data back as possible and 
fear that the FS has lost some things or is losing things while fsck 
runs (which really happens with ext2 and vfat). That is a big mistake on 
reiserfs at least, as it retrieves "mostly all" things possible. O.k., 
that point is inaccurate.

> 
>>recovery stuff should be w/o questions in my opinion. At least that's what 
>>I'd expect all fsck's to do.
>>
> 
> for some strange reason no fsck behaves like that.
> 


Yes. I agree with that opinion. The fixable things should pass without 
any question. Who knows the special missing inode, when it should be fixed?

Best wishes,

Manuel




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-16 21:44     ` Lehmann 
@ 2002-05-16 23:57       ` Hans Reiser
  2002-05-17  0:45         ` Philipp Gühring
  2002-05-17  0:17       ` Manuel Krause
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 50+ messages in thread
From: Hans Reiser @ 2002-05-16 23:57 UTC (permalink / raw)
  To: Marc A. Lehmann; +Cc: Kuba Ober, reiserfs-list

Marc A. Lehmann <pcg@goof.com> wrote:

>On Thu, May 16, 2002 at 05:23:42PM -0400, Kuba Ober <kuba@mareimbrium.org> wrote:
>
>>What I'm thinking of is this:
>>to the user, which most users w/o intimate filesystem knowledge won't be able
>>to answer at all?
>
>Unix traditionally wasn't aimed at the point-and-click users without
>knowledge.
>
>>Looking at this list, what people want is to get their data
>>back, as much as possible. They never want to get less than that. Why bother
>>asking?
>
>Users who know nothing can still be told to just press "y". Even better,
>somebody with some knowledge about the filesystem (and the contents!)
>layout can often do better with an interactive fsck (see ext2fs).
>
>I don't think it makes sense to enhance the dumb-user-mode while at the same
>time keeping informed users from working properly.
>
>>that is what many unsuspecting users actually do. It should simply disregard
>>read errors and try using whatever data there is in ok-read blocks.
>
>It should actually ask the user whether she wants the block to be repaired
>(if possible), or permanently marked defect.
>
>>I don't think that asking too many questions is worth it. He who runs fsck in
>>"fix" mode wants his data back (whatever is left of it).
>
>That's a big mistake. He who runs fsck wants to recover as much data as
>possible. Sometimes maybe more than fsck alone can do.
>
>>recovery stuff should be w/o questions in my opinion. At least that's what
>>I'd expect all fsck's to do.
>
>for some strange reason no fsck behaves like that.
>
Actually, I think Kuba makes a good point.  I will ask some questions 
about exactly when we ask the user a question more than once.

I am a filesystems developer, and I don't know enough to do more than 
press y with most fscks.  I think that for the most part, if one is 
going to ask the user to help, one needs to provide a real interface, a 
filesystem structure editor.....  which no FS has ever done....  but 
right now we need to get what we have debugged thoroughly.  It is on the 
list of things I would like to add someday.

Hans



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-16 21:23   ` Kuba Ober
@ 2002-05-16 21:44     ` Lehmann 
  2002-05-16 23:57       ` Hans Reiser
                         ` (3 more replies)
  2002-05-16 21:44     ` Lehmann 
  2002-05-17 13:10     ` Valdis.Kletnieks
  2 siblings, 4 replies; 50+ messages in thread
From: Lehmann  @ 2002-05-16 21:44 UTC (permalink / raw)
  To: Kuba Ober; +Cc: reiserfs-list

On Thu, May 16, 2002 at 05:23:42PM -0400, Kuba Ober <kuba@mareimbrium.org> wrote:
> What I'm thinking of is this:
> to the user, which most users w/o intimate filesystem knowledge won't be able 
> to answer at all?

Unix traditionally wasn't aimed at the point-and-click users without
knowledge.

> Looking at this list, what people want is to get their data 
> back, as much as possible. They never want to get less than that. Why bother 
> asking?

Users who know nothing can still be told to just press "y". Even better,
somebody with some knowledge about the filesystem (and the contents!)
layout can often do better with an interactive fsck (see ext2fs).

I don't think it makes sense to enhance the dumb-user-mode while at the same
time keeping informed users from working properly.

> that is what many unsuspecting users actually do. It should simply disregard 
> read errors and try using whatever data there is in ok-read blocks.

It should actually ask the user whether she wants the block to be repaired
(if possible), or permanently marked defect.

> I don't think that asking too many questions is worth it. He who runs fsck in 
> "fix" mode wants his data back (whatever is left of it).

That's a big mistake. He who runs fsck wants to recover as much data as
possible. Sometimes maybe more than fsck alone can do.

> recovery stuff should be w/o questions in my opinion. At least that's what 
> I'd expect all fsck's to do.

for some strange reason no fsck behaves like that.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-16 20:33 ` Hans Reiser
@ 2002-05-16 21:23   ` Kuba Ober
  2002-05-16 21:44     ` Lehmann 
                       ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Kuba Ober @ 2002-05-16 21:23 UTC (permalink / raw)
  To: reiserfs-list

> >One extra question on this.  I assume that if in Fix mode and errors are
> >encountered that fsck.reiserfs will prompt to fix each error and that
> >there is no way to have it answer 'Yes' automatically like the ext2 -y
> >option.
> >
> >If not then I will probably have to take Jonathan Briggs suggestion of a
> >third process to answer 'Yes' repeatedly.
> >
> >Steve
>
> Why in the world do you want to run fsck without running it manually,
> and passing all information to the user?

What I'm thinking of is this:
1. If a filesystem is really too borked for fsck to recover useful stuff, it 
should be left alone. Either fsck is able to help or not. No need to ask the 
user; fsck has more data to determine whether the fs makes any sense at all, 
or if it has been messed up too much.
2. If we run fsck, we want to recover as much data as possible. That's what 
lost+found directory is for -- stuff that is not exactly clean for use, but 
may nevertheless be useful, gets hooked there.

Why on earth does a filesystem check & recovery program need to ask questions 
to the user, which most users w/o intimate filesystem knowledge won't be able 
to answer at all? Looking at this list, what people want is to get their data 
back, as much as possible. They never want to get less than that. Why bother 
asking?

That's one thing. Another thing is making fsck work on broken media, since 
that is what many unsuspecting users actually do. It should simply disregard 
read errors and try using whatever data there is in ok-read blocks.

I don't think that asking too many questions is worth it. He who runs fsck in 
"fix" mode wants his data back (whatever is left of it). Certain things, like 
recovering the deleted files, may be worth specifying as options, but typical 
recovery stuff should be w/o questions in my opinion. At least that's what 
I'd expect all fsck's to do.

Cheers, Kuba

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-16 18:44 Steve Pratt
  2002-05-16 18:55 ` Oleg Drokin
@ 2002-05-16 20:33 ` Hans Reiser
  2002-05-16 21:23   ` Kuba Ober
  1 sibling, 1 reply; 50+ messages in thread
From: Hans Reiser @ 2002-05-16 20:33 UTC (permalink / raw)
  To: Steve Pratt; +Cc: Oleg Drokin, reiserfs-list

Steve Pratt wrote:

>Oleg Drokin wrote:
>>Hello!
>>
>>On Thu, May 16, 2002 at 10:11:37AM -0500, Steve Pratt wrote:
>>>>>superblock. Neither 3.5 nor 3.6 superblock appear to have a label field,
>>>>>but mkfs has an option for it.
>>>>Labels are supported in reiserfs v3.6 format. (2.4 support was merged into
>>>>2.4.19-pre3, if I remember correctly).
>>>Ok, so it looks like I can use the option and if they have the right kernel
>>>code it will just work.
>>
>>In fact, even a kernel without "support" will work.
>>Support only means that the space in the superblock is marked as used by
>>label/uuid instead of being marked as "reserved".
>>You cannot query uuid/label from within the kernel, anyway.
>
>Ok. This is fine.
>
>>>>You can circumvent this by echo Yes | reiserfsck ...
>>>>if you need.
>>>Actually this is not trivial in fork/exec in C code.  Especially when I
>>
>>I think it is.
>>
>>>want to preserve the return code from the fsck.  If you know of a coding
>>>trick to do this I would be interested.
>
>(pseudocode removed) ...
>
>Thanks! I have this working now.
>
>One extra question on this.  I assume that if in Fix mode and errors are
>encountered that fsck.reiserfs will prompt to fix each error and that
>there is no way to have it answer 'Yes' automatically like the ext2 -y
>option.
>
>If not then I will probably have to take Jonathan Briggs' suggestion of a
>third process to answer 'Yes' repeatedly.
>
>Steve
>
Why in the world do you want to run fsck without running it manually, 
and passing all information to the user?

Hans



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-16 18:44 Steve Pratt
@ 2002-05-16 18:55 ` Oleg Drokin
  2002-05-16 20:33 ` Hans Reiser
  1 sibling, 0 replies; 50+ messages in thread
From: Oleg Drokin @ 2002-05-16 18:55 UTC (permalink / raw)
  To: Steve Pratt; +Cc: reiserfs-list

Hello!

On Thu, May 16, 2002 at 01:44:16PM -0500, Steve Pratt wrote:

> One extra question on this.  I assume that if in Fix mode and errors are
> encountered that fsck.reiserfs will prompt to fix each error and that
> there is no way to have it answer 'Yes' automatically like the ext2 -y
> option.

I am unaware of a situation where reiserfsck asks any question in the middle of
repairing.
It may ask for more than one confirmation in the beginning (one of the cases
being when it thinks it may replay a journal, but it is not sure).
But these cases are really the cases where you want to pass the question
to the user and get his opinion to pass on to fsck.
I think they all look very similar, so you can find them quickly in the output
stream (something like some text and then a "(y/n):" sequence of characters).
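
Spotting such a prompt in captured output could look roughly like the sketch
below. It assumes the marker is the literal "(y/n):" mentioned above, which is
only described as "something like" the real text, so the exact string should be
checked against the reiserfsck version in use. `ends_with_prompt` is a
hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if `text` (the NUL-terminated output collected so far) ends
 * with `marker`, ignoring trailing spaces and tabs; returns 0 otherwise.
 * A prompt is expected at the very end because the program is then
 * waiting for input and prints nothing further.
 */
int ends_with_prompt(const char *text, const char *marker)
{
    size_t len = strlen(text);
    size_t mlen = strlen(marker);

    /* Skip blanks that typically follow the colon of a prompt. */
    while (len > 0 && (text[len - 1] == ' ' || text[len - 1] == '\t'))
        len--;

    if (mlen == 0 || len < mlen)
        return 0;
    return memcmp(text + len - mlen, marker, mlen) == 0;
}
```

A caller would append each chunk read from the pipe to one buffer and test the
whole buffer, so that a prompt split across two reads is still found. When the
test succeeds, the question can be shown to the user instead of being answered
blindly.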

When fs bugs are fixed, this is done without any questions asked,
and though this is a somewhat questionable approach, this is how it is done at
present.

> If not then I will probably have to take Jonathan Briggs suggestion of a
> third process to answer 'Yes' repeatedly.

This is a very dangerous approach. There is exactly one "extra" confirmation,
the one at the beginning, where reiserfsck asks if you really want to run it.
All of the others are really important. (Well, maybe not all, but most of them
for sure.)

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
@ 2002-05-16 18:48 Steve Pratt
  0 siblings, 0 replies; 50+ messages in thread
From: Steve Pratt @ 2002-05-16 18:48 UTC (permalink / raw)
  To: Jonathan Briggs; +Cc: reiserfs-list



Jonathan Briggs wrote:
>On Thu, 2002-05-16 at 09:11, Steve Pratt wrote:

>> >You can circumvent this by echo Yes | reiserfsck ...
>> >if you need.
>>
>> Actually this is not trivial in fork/exec in C code.  Especially when I
>> want to preserve the return code from the fsck.  If you know of a coding
>> trick to do this I would be interested.

>This is not too hard.  I would do two forks, so that the main process
>does not get hung up writing Yes's.  So:

>Main Process:
>Create a pipe with pipe()
>Fork off your fsck.reiserfs and record the pid returned from fork.
>Fork off your Yes writer and record the pid returned from fork.
>Wait for your Yes writer with waitpid.
>Wait for your fsck with waitpid and collect the status.

>fsck Child Process
>Call dup2 to change standard input to your pipe output
>Close the original pipe input file descriptor.
>Close the original pipe output file descriptor.
>Exec fsck.reiserfs

>Yes writer Child Process:
>Close the pipe output file descriptor.
>Loop writing Yes into your pipe input.
>Should not need to worry about exiting, this process should get a
>SIGPIPE when other process exits.
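
The steps above can be sketched in C roughly as follows. This is only an
illustration: `run_with_yes_writer` is a hypothetical name, error handling is
kept short, and the caller supplies the real command line in argv. Keep in mind
the warning elsewhere in this thread that answering every prompt blindly is
dangerous.

```c
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Runs argv[0] while a second child keeps writing `answer` to its stdin.
 * Returns the command's exit status, or -1 on error or death by signal.
 */
int run_with_yes_writer(char *const argv[], const char *answer)
{
    int p[2];                       /* p[0]: read end, p[1]: write end */
    if (pipe(p) < 0)
        return -1;

    pid_t fsck_pid = fork();
    if (fsck_pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (fsck_pid == 0) {            /* fsck child: stdin comes from the pipe */
        dup2(p[0], STDIN_FILENO);
        close(p[0]);
        close(p[1]);
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if exec failed */
    }

    pid_t writer_pid = fork();
    if (writer_pid == 0) {          /* writer child: loop until the pipe breaks */
        size_t len = strlen(answer);
        close(p[0]);
        signal(SIGPIPE, SIG_DFL);   /* default action: die on a broken pipe */
        while (len > 0 && write(p[1], answer, len) >= 0)
            ;
        _exit(0);                   /* reached if write failed without a signal */
    }

    /* Parent: drop both ends so the writer is the pipe's only writer and
     * gets SIGPIPE as soon as the fsck child has gone away. */
    close(p[0]);
    close(p[1]);

    if (writer_pid > 0)
        waitpid(writer_pid, NULL, 0);

    int status;
    if (waitpid(fsck_pid, &status, 0) < 0 || writer_pid < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The return value is the exit status of the command itself, which is what the
question about preserving the fsck return code was after. The writer's own
status is collected only to avoid leaving a zombie.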

Thanks, it looks like I will need to do this..

Steve




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
@ 2002-05-16 18:44 Steve Pratt
  2002-05-16 18:55 ` Oleg Drokin
  2002-05-16 20:33 ` Hans Reiser
  0 siblings, 2 replies; 50+ messages in thread
From: Steve Pratt @ 2002-05-16 18:44 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: reiserfs-list


Oleg Drokin wrote:
>Hello!

>On Thu, May 16, 2002 at 10:11:37AM -0500, Steve Pratt wrote:
>> >> superblock. Neither 3.5 nor 3.6 superblock appear to have a label field,
>> >> but mkfs has an option for it.
>> >Labels are supported in reiserfs v3.6 format. (2.4 support was merged into
>> >2.4.19-pre3, if I remember correctly).
>> Ok, so it looks like I can use the option and if they have the right kernel
>> code it will just work.

>In fact, even a kernel without "support" will work.
>Support only means that the space in the superblock is marked as used by
>label/uuid instead of being marked as "reserved".
>You cannot query uuid/label from within the kernel, anyway.

Ok. This is fine.

>> >You can circumvent this by echo Yes | reiserfsck ...
>> >if you need.
>> Actually this is not trivial in fork/exec in C code.  Especially when I

>I think it is.

>> want to preserve the return code from the fsck.  If you know of a coding
>> trick to do this I would be interested.

(pseudocode removed) ...

Thanks! I have this working now.

One extra question on this.  I assume that if in Fix mode and errors are
encountered that fsck.reiserfs will prompt to fix each error and that
there is no way to have it answer 'Yes' automatically like the ext2 -y
option.

If not then I will probably have to take Jonathan Briggs' suggestion of a
third process to answer 'Yes' repeatedly.

Steve





^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-16 15:11 Steve Pratt
@ 2002-05-16 15:35 ` Oleg Drokin
  0 siblings, 0 replies; 50+ messages in thread
From: Oleg Drokin @ 2002-05-16 15:35 UTC (permalink / raw)
  To: Steve Pratt; +Cc: reiserfs-list

Hello!

On Thu, May 16, 2002 at 10:11:37AM -0500, Steve Pratt wrote:
> >> superblock. Neither 3.5 nor 3.6 superblock appear to have a label field,
> >> but mkfs has an option for it.
> >Labels are supported in reiserfs v3.6 format. (2.4 support was merged into
> >2.4.19-pre3, if I remember correctly).
> Ok, so it looks like I can use the option and if they have the right kernel
> code it will just work.

In fact, even a kernel without "support" will work.
Support only means that the space in the superblock is marked as used by
label/uuid instead of being marked as "reserved".
You cannot query uuid/label from within the kernel, anyway.

> >You can circumvent this by echo Yes | reiserfsck ...
> >if you need.
> Actually this is not trivial in fork/exec in C code.  Especially when I

I think it is.

> want to preserve the return code from the fsck.  If you know of a coding
> trick to do this I would be interested.

Basically it is (half pseudocode):

ifd=create_input_pipe(&pifd);  // these 2 would return fds suitable for child
ofd=create_output_pipe(&pofd); // and modify fds suitable for parent.
			       // in fact simple pipe(2) is what you need.
if (!fork()) {
	dup2(ifd,0);           // dup2 closes the old fd 0/1/2 by itself
	dup2(ofd,1);
	dup2(ofd,2);
	close(ofd);
	close(ifd);
	close(pifd);           // child must not keep the parent's pipe ends
	close(pofd);
	execl("/path/reiserfsck","reiserfsck","param1","param2",...,(char *)NULL);
	_exit(127);            // only reached if execl failed
}
close(ofd);
close(ifd);
read_len=read(pofd, buffer, 100000); // nonblocking read in fact.
print_message(buffer, read_len); // Hans wants you to print mkreiserfs/reiserfsck banner on screen.
write(pifd, "Yes\n", 4);
read_len=read(pofd, buffer, 100000); // nonblocking read in fact. probably in a loop
				     // Here you can even provide user with
print_message(buffer, read_len);     // progress information.

wait(&exitcode);
close(pifd);
close(pofd);
analyze_exitcode_and_notify_user(exitcode);
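
A compilable variant of the idea above might look like the following sketch.
It is a simplification under stated assumptions, not the code EVMS ended up
with: `run_with_answer` is a hypothetical name, plain pipe(2) replaces the two
create_*_pipe helpers, and the single answer is written up front (as with
"echo Yes |") instead of after reading the banner. The child's output is
copied to a FILE so the caller can still show it to the user.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Runs argv[0], feeds `answer` to its stdin, copies its stdout and stderr
 * to `out`, and returns its exit status (-1 on error or death by signal).
 */
int run_with_answer(char *const argv[], const char *answer, FILE *out)
{
    int to_child[2], from_child[2];
    if (pipe(to_child) < 0)
        return -1;
    if (pipe(from_child) < 0) {
        close(to_child[0]); close(to_child[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(to_child[0]); close(to_child[1]);
        close(from_child[0]); close(from_child[1]);
        return -1;
    }
    if (pid == 0) {                          /* child */
        dup2(to_child[0], STDIN_FILENO);
        dup2(from_child[1], STDOUT_FILENO);
        dup2(from_child[1], STDERR_FILENO);
        close(to_child[0]); close(to_child[1]);
        close(from_child[0]); close(from_child[1]);
        execvp(argv[0], argv);
        _exit(127);                          /* only reached if exec failed */
    }

    close(to_child[0]);                      /* parent keeps the other ends */
    close(from_child[1]);

    /* One short answer fits in the pipe buffer, so this does not block.
     * SIGPIPE is ignored for the write in case the child is already gone;
     * note that this briefly changes a process-wide setting. */
    void (*old_handler)(int) = signal(SIGPIPE, SIG_IGN);
    ssize_t written = write(to_child[1], answer, strlen(answer));
    (void)written;
    close(to_child[1]);                      /* child sees EOF after the answer */
    signal(SIGPIPE, old_handler);

    char buf[4096];
    ssize_t n;
    while ((n = read(from_child[0], buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t)n, out);      /* banner and progress output */
    close(from_child[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Using waitpid on the recorded pid, rather than a bare wait, keeps the exit
status tied to this particular child even if the caller has other children.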

Bye,
    Oleg




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-16 14:52 Steve Pratt
@ 2002-05-16 15:13 ` Hans Reiser
  0 siblings, 0 replies; 50+ messages in thread
From: Hans Reiser @ 2002-05-16 15:13 UTC (permalink / raw)
  To: Steve Pratt; +Cc: Oleg Drokin, reiserfs-list

Steve Pratt wrote:

>Hans Reiser wrote:
>>Oleg Drokin wrote:
>>>Hello!
>>>
>>>On Thu, May 16, 2002 at 01:42:48PM +0400, Hans Reiser wrote:
>>>
>>>>>>Second, what is the option to keep fsck from prompting for 'Yes' when
>>>>>>running.  I need to exec this without additional input.  Seems like quiet
>>>>>>should do it, but it doesn't.
>>>>>
>>>>>Hans has a decision that this should not be done for now because
>>>>>reiserfsprogs are in active bugfixing stage.
>>>>
>>>>?
>>>>I don't understand this.:-/
>>>
>>>You have said "We do not trust reiserfsck to be run automatically without
>>>user confirmation"
>>>
>>>Bye,
>>>   Oleg
>>
>>Ah, now I remember the conversation.  Yes, if we put the option in,
>>someone will get the bright idea to run it at every boot (and they might
>>be a distro or appliance vendor:-/), and fsck is not as reliable as
>>journaling so this should not be done.
>>
>>So, why do you want to exec it without additional input?
>>
>>Hans
>
>Because I have already prompted for input from the EVMS interface.  The
>FSIMs in EVMS provide for EVMS to coordinate actions with the file systems
>such as shrinking the filesystem before shrinking the volume.  We also
>provide interfaces for mkfs, fsck and defrag (if supported by the file
>system).  So the FSIM prompts for all of the options to be passed to the
>file system utility and then invokes it.  In the case of fsck I have
>already prompted for all the options and the user is explicitly asking for
>fsck on this volume.  Since the prompt is output on stdout it may not be
>visible to a GUI user.
>
>Steve
>
Do you print the reiserfs credits?

Hans



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
@ 2002-05-16 15:11 Steve Pratt
  2002-05-16 15:35 ` Oleg Drokin
  0 siblings, 1 reply; 50+ messages in thread
From: Steve Pratt @ 2002-05-16 15:11 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: reiserfs-list


Oleg Drokin wrote:
>Hello!

>On Wed, May 15, 2002 at 04:22:22PM -0500, Steve Pratt wrote:

>> A couple of questions about the ReiserFS utilities.  First, does ReiserFS
>> support labels or not?  FAQ says no and refers to lack of space in the 3.5
>> superblock. Neither 3.5 nor 3.6 superblock appear to have a label field,
>> but mkfs has an option for it.

>Labels are supported in reiserfs v3.6 format. (2.4 support was merged into
>2.4.19-pre3, if I remember correctly).

Ok, so it looks like I can use the option and if they have the right kernel
code it will just work.

>> Second, what is the option to keep fsck from prompting for 'Yes' when
>> running.  I need to exec this without additional input.  Seems like quiet
>> should do it, but it doesn't.

>Hans has a decision that this should not be done for now because
>reiserfsprogs are in active bugfixing stage.

Bummer.

>You can circumvent this by echo Yes | reiserfsck ...
>if you need.

Actually this is not trivial in fork/exec in C code.  Especially when I
want to preserve the return code from the fsck.  If you know of a coding
trick to do this I would be interested.

Steve





^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
@ 2002-05-16 14:52 Steve Pratt
  2002-05-16 15:13 ` Hans Reiser
  0 siblings, 1 reply; 50+ messages in thread
From: Steve Pratt @ 2002-05-16 14:52 UTC (permalink / raw)
  To: Hans Reiser; +Cc: Oleg Drokin, reiserfs-list


Hans Reiser wrote:
>Oleg Drokin wrote:

>>Hello!
>>
>>On Thu, May 16, 2002 at 01:42:48PM +0400, Hans Reiser wrote:
>>
>>
>>
>>>>>Second, what is the option to keep fsck from prompting for 'Yes' when
>>>>>running.  I need to exec this without additional input.  Seems like quiet
>>>>>should do it, but it doesn't.
>>>>>
>>>>>
>>>>Hans has a decision that this should not be done for now because
>>>>reiserfsprogs are in active bugfixing stage.
>>>>
>>>>
>>>
>>>?
>>>I don't understand this.:-/
>>>
>>>
>>
>>You have said "We do not trust reiserfsck to be run automatically without
>>user confirmation"
>>
>>Bye,
>>    Oleg
>>
>>
>Ah, now I remember the conversation.  Yes, if we put the option in,
>someone will get the bright idea to run it at every boot (and they might
>be a distro or appliance vendor:-/), and fsck is not as reliable as
>journaling so this should not be done.

>So, why do you want to exec it without additional input?

Because I have already prompted for input from the EVMS interface.  The
FSIMs in EVMS provide for EVMS to coordinate actions with the file systems
such as shrinking the filesystem before shrinking the volume.  We also
provide interfaces for mkfs, fsck and defrag (if supported by the file
system).  So the FSIM prompts for all of the options to be passed to the
file system utility and then invokes it.  In the case of fsck I have
already prompted for all the options and the user is explicitly asking for
fsck on this volume.  Since the prompt is output on stdout it may not be
visible to a GUI user.

Steve




Hans







^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-16 11:40     ` Oleg Drokin
@ 2002-05-16 11:54       ` Hans Reiser
  0 siblings, 0 replies; 50+ messages in thread
From: Hans Reiser @ 2002-05-16 11:54 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Steve Pratt, reiserfs-list

Oleg Drokin wrote:

>Hello!
>
>On Thu, May 16, 2002 at 01:42:48PM +0400, Hans Reiser wrote:
>
>>>>Second, what is the option to keep fsck from prompting for 'Yes' when
>>>>running.  I need to exec this without additional input.  Seems like quiet
>>>>should do it, but it doesn't.
>>>
>>>Hans has a decision that this should not be done for now because
>>>reiserfsprogs are in active bugfixing stage.
>>
>>?
>>I don't understand this.:-/
>
>You have said "We do not trust reiserfsck to be run automatically without
>user confirmation"
>
>Bye,
>    Oleg
>
Ah, now I remember the conversation.  Yes, if we put the option in, 
someone will get the bright idea to run it at every boot (and they might 
be a distro or appliance vendor:-/), and fsck is not as reliable as 
journaling so this should not be done.

So, why do you want to exec it without additional input?

Hans



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-16  9:42   ` Hans Reiser
@ 2002-05-16 11:40     ` Oleg Drokin
  2002-05-16 11:54       ` Hans Reiser
  0 siblings, 1 reply; 50+ messages in thread
From: Oleg Drokin @ 2002-05-16 11:40 UTC (permalink / raw)
  To: Hans Reiser; +Cc: Steve Pratt, reiserfs-list

Hello!

On Thu, May 16, 2002 at 01:42:48PM +0400, Hans Reiser wrote:

> >>Second, what is the option to keep fsck from prompting for 'Yes' when
> >>running.  I need to exec this without additional input.  Seems like quiet
> >>should do it, but it doesn't.
> >Hans has a decision that this should not be done for now because
> >reiserfsprogs are in active bugfixing stage.
> >
> ?
> I don't understand this.:-/

You have said "We do not trust reiserfsck to be run automatically without
user confirmation"

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: A couple of questions
  2002-05-16  5:20 ` Oleg Drokin
@ 2002-05-16  9:42   ` Hans Reiser
  2002-05-16 11:40     ` Oleg Drokin
  0 siblings, 1 reply; 50+ messages in thread
From: Hans Reiser @ 2002-05-16  9:42 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Steve Pratt, reiserfs-list

Oleg Drokin wrote:

>Hello!
>
>On Wed, May 15, 2002 at 04:22:22PM -0500, Steve Pratt wrote:
>
>>A couple of questions about the ReiserFS utilities.  First, does ReiserFS
>>support labels or not?  FAQ says no and refers to lack of space in the 3.5
>>superblock. Neither 3.5 nor 3.6 superblock appear to have a label field,
>>but mkfs has an option for it.
>>
>Labels are supported in reiserfs v3.6 format. (2.4 support was merged into
>2.4.19-pre3, if I remember correctly).
>
>>Second, what is the option to keep fsck from prompting for 'Yes' when
>>running.  I need to exec this without additional input.  Seems like quiet
>>should do it, but it doesn't.
>>
>Hans has decided that this should not be done for now because
>reiserfsprogs are in an active bug-fixing stage.
>
?

I don't understand this.:-/

>You can circumvent this by echo Yes | reiserfsck ...
>if you need.
>
>Bye,
>    Oleg
>

* Re: A couple of questions
  2002-05-15 21:22 Steve Pratt
@ 2002-05-16  5:20 ` Oleg Drokin
  2002-05-16  9:42   ` Hans Reiser
  0 siblings, 1 reply; 50+ messages in thread
From: Oleg Drokin @ 2002-05-16  5:20 UTC (permalink / raw)
  To: Steve Pratt; +Cc: reiserfs-list

Hello!

On Wed, May 15, 2002 at 04:22:22PM -0500, Steve Pratt wrote:

> A couple of questions about the ReiserFS utilities.  First, does ReiserFS
> support labels or not?  FAQ says no and refers to lack of space in the 3.5
> superblock. Neither 3.5 nor 3.6 superblock appear to have a label field,
> but mkfs has an option for it.

Labels are supported in reiserfs v3.6 format. (2.4 support was merged into
2.4.19-pre3, if I remember correctly).

> Second, what is the option to keep fsck from prompting for 'Yes' when
> running.  I need to exec this without additional input.  Seems like quiet
> should do it, but it doesn't.

Hans has decided that this should not be done for now because
reiserfsprogs are in an active bug-fixing stage.
You can circumvent this by echo Yes | reiserfsck ...
if you need.

Bye, 
    Oleg 


* A couple of questions
@ 2002-05-15 21:22 Steve Pratt
  2002-05-16  5:20 ` Oleg Drokin
  0 siblings, 1 reply; 50+ messages in thread
From: Steve Pratt @ 2002-05-15 21:22 UTC (permalink / raw)
  To: reiserfs-list

A couple of questions about the ReiserFS utilities.  First, does ReiserFS
support labels or not?  FAQ says no and refers to lack of space in the 3.5
superblock. Neither 3.5 nor 3.6 superblock appear to have a label field,
but mkfs has an option for it.

Second, what is the option to keep fsck from prompting for 'Yes' when
running.  I need to exec this without additional input.  Seems like quiet
should do it, but it doesn't.

Steve

EVMS Development - http://www.sf.net/projects/evms
Linux Technology Center - IBM Corporation
(512) 838-9763  EMAIL: SLPratt@US.IBM.COM




* Re: A couple of questions
  2001-10-10 11:28 Adil EL YOUSSEFI
@ 2001-10-10 12:11 ` David Woodhouse
  0 siblings, 0 replies; 50+ messages in thread
From: David Woodhouse @ 2001-10-10 12:11 UTC (permalink / raw)
  To: Adil EL YOUSSEFI; +Cc: linux-mtd

adilos2@yahoo.com said:
>  -> Under what license is it released? My boss wants to
>  know if he would have to pay in order to put JFFS2 in our product.

Under what licence is your product? JFFS2 is under a dual licence - both 
GPL for compatibility with the Linux kernel, and RHEPL for use in eCos. 

>  -> We are using a flash device with different sector sizes ( 1*32k,
> 2*16k,1*64k and the others are 128k ), What size should the JFFS2
> reserved sectors have then ? 

128KiB - the 'major' erase size.
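
As a rough illustration of the answer above, the reserved-sector size can be derived from the device's region list by picking the dominant erase size. The sketch below assumes "major" means the largest erase size (which, for the device described, is also the size of most sectors); the function name largest_erase_size is hypothetical and is not part of JFFS2 or MTD.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a JFFS2/MTD API): return the largest
 * erase-block size, in bytes, found in a list of region sizes.
 * Returns 0 for an empty list.
 */
unsigned long largest_erase_size(const unsigned long *sizes, size_t n)
{
    unsigned long best = 0;

    for (size_t i = 0; i < n; i++) {
        if (sizes[i] > best)
            best = sizes[i];
    }
    return best;
}
```

Called with the sizes from the question (32k, 16k, 64k and 128k, each expressed in bytes), this picks the 128KiB size.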

>  -> If JFFS2 is used without enabling compression, will there still be
> corner cases in GC requiring 5 reserved sectors and not 2?

Probably not. To be honest, it probably wouldn't happen even _with_ 
compression - I just don't like releasing software that'll 'probably' work 
:)

>  -> Why can't JFFS2 be used on Compact flashes ? 

Technically, it can - we now have a 'blkmtd' driver which uses any block 
device as backing store for an MTD device - so it can use any hard drive. 
At the moment, it's painfully slow. Checkpointing ought to fix that, but 
nobody's currently working on implementing that.

--
dwmw2


* A couple of questions
@ 2001-10-10 11:28 Adil EL YOUSSEFI
  2001-10-10 12:11 ` David Woodhouse
  0 siblings, 1 reply; 50+ messages in thread
From: Adil EL YOUSSEFI @ 2001-10-10 11:28 UTC (permalink / raw)
  To: linux-mtd

Hi everybody,

I have some questions regarding JFFS2:

-> Under what license is it released? My boss wants to
know if he would have to pay in order to put JFFS2 in
our product.

-> We are using a flash device with different sector
sizes (1*32k, 2*16k, 1*64k and the others are 128k).
What size should the JFFS2 reserved sectors have then?

-> If JFFS2 is used without enabling compression, will
there still be corner cases in GC requiring 5 reserved
sectors and not 2?

-> Why can't JFFS2 be used on Compact flashes?

Thanks,






* Re: A couple of questions
  1999-03-15 22:46   ` neil
@ 1999-03-16 12:22     ` Stephen C. Tweedie
  0 siblings, 0 replies; 50+ messages in thread
From: Stephen C. Tweedie @ 1999-03-16 12:22 UTC (permalink / raw)
  To: neil; +Cc: Stephen C. Tweedie, Linux-MM

Hi,

On Tue, 16 Mar 1999 07:46:06 +0900, neil@tc-1-192.ariake.gol.ne.jp
said:

> Thanks for your reply.  I think you've missed my point on this one.
> The variable "pte" is set before calling __get_free_page(), and being
> local cannot be modified by other processes.  

Umm, OK, you've convinced me. :) I think we have enough locks held
throughout this to prevent the present or writable bits in *page_table
from changing between the test in handle_pte_fault() and do_wp_page()
itself, even on SMP.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/


* Re: A couple of questions
  1999-03-15 18:58 ` Stephen C. Tweedie
  1999-03-15 22:46   ` neil
@ 1999-03-16  2:11   ` Andrea Arcangeli
  1 sibling, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 1999-03-16  2:11 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Neil Booth, linux-mm, Linus Torvalds

On Mon, 15 Mar 1999, Stephen C. Tweedie wrote:

>--- mm/memory.c~	Tue Jan 19 01:33:10 1999
>+++ mm/memory.c	Mon Mar 15 18:57:31 1999
>@@ -651,13 +651,13 @@
> 		delete_from_swap_cache(page_map);
> 		/* FallThrough */
> 	case 1:
>-		/* We can release the kernel lock now.. */
>-		unlock_kernel();
>-
> 		flush_cache_page(vma, address);
> 		set_pte(page_table, pte_mkdirty(pte_mkwrite(pte)));
> 		flush_tlb_page(vma, address);
> end_wp_page:
>+		/* We can release the kernel lock now.. */
>+		unlock_kernel();
>+
> 		if (new_page)
> 			free_page(new_page);
> 		return 1;
>----------------------------------------------------------------

Your patch is certainly safe, and I think it is strictly needed in order to
release the kernel lock in the end_wp_page path.

The reason I think it is already safe to release the kernel lock before updating
the process's page table is that the swap_out engine will do nothing with the
page until it is a clean page (and it should be clean because it was read-only
in the first place... am I really right here?). Every other part of the VM will
block on the semaphore, so it won't race with the page fault handler anyway.

I think this patch against 2.2.3 is needed (except the first chunk, which only
removes superfluous code).

It seems to work fine after some minutes of stress-testing.

Index: mm//memory.c
===================================================================
RCS file: /var/cvs/linux/mm/memory.c,v
retrieving revision 1.1.2.3
diff -u -r1.1.2.3 memory.c
--- memory.c	1999/01/24 02:46:31	1.1.2.3
+++ linux/mm/memory.c	1999/03/16 01:55:45
@@ -624,10 +624,6 @@
 	/* Did someone else copy this page for us while we slept? */
 	if (pte_val(*page_table) != pte_val(pte))
 		goto end_wp_page;
-	if (!pte_present(pte))
-		goto end_wp_page;
-	if (pte_write(pte))
-		goto end_wp_page;
 	old_page = pte_page(pte);
 	if (MAP_NR(old_page) >= max_mapnr)
 		goto bad_wp_page;
@@ -651,13 +647,18 @@
 		delete_from_swap_cache(page_map);
 		/* FallThrough */
 	case 1:
-		/* We can release the kernel lock now.. */
+		/*
+		 * We can release the kernel lock now.. because the swap_out
+		 * engine will do nothing with the page table until it
+		 * will be a clean page (and we are sure it's clean because it
+		 * wasn't writable yet). All other parts of the VM will
+		 * stop on the mmap semaphore. -arca
+		 */
 		unlock_kernel();
 
 		flush_cache_page(vma, address);
 		set_pte(page_table, pte_mkdirty(pte_mkwrite(pte)));
 		flush_tlb_page(vma, address);
-end_wp_page:
 		if (new_page)
 			free_page(new_page);
 		return 1;
@@ -681,9 +682,15 @@
 bad_wp_page:
 	printk("do_wp_page: bogus page at address %08lx (%08lx)\n",address,old_page);
 	send_sig(SIGKILL, tsk, 1);
+	unlock_kernel();
 	if (new_page)
 		free_page(new_page);
 	return 0;
+end_wp_page:
+	unlock_kernel();
+	if (new_page)
+		free_page(new_page);
+	return 1;
 }
 
 /*



Andrea Arcangeli



* Re: A couple of questions
  1999-03-15 18:58 ` Stephen C. Tweedie
@ 1999-03-15 22:46   ` neil
  1999-03-16 12:22     ` Stephen C. Tweedie
  1999-03-16  2:11   ` Andrea Arcangeli
  1 sibling, 1 reply; 50+ messages in thread
From: neil @ 1999-03-15 22:46 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Linux-MM

Hi Stephen,

Stephen C. Tweedie wrote:-
> Hi,
> 
[..snip..]
>
> > 2) The last 2 of the 3 branches to end_wp_page seem to me to be
> > impossible code paths.
> 
> > 	if (!pte_present(pte))
> > 		goto end_wp_page;
> > 	if (pte_write(pte))
> > 		goto end_wp_page;
> 
> No, the start of do_wp_page() looks like:
> 
> 	pte = *page_table;
> 	new_page = __get_free_page(GFP_USER);
> 
> and the get_free_page() call can block if we are out of memory, dropping
> the kernel lock in the process.  The page table can be modified by
> kswapd during this interval.

Thanks for your reply.  I think you've missed my point on this one.
The variable "pte" is set before calling __get_free_page(), and being
local cannot be modified by other processes.  Hence I still believe
the 2 branches shown are impossible, their negative having been the
condition for entering do_wp_page().

The case you mention is captured by the initial test

	if (pte_val(*page_table) != pte_val(pte))
		goto end_wp_page;

performed before the two above.  Do you agree?
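
The argument above can be sketched in user-space C: take a local snapshot, do something that may sleep, then compare the live entry with the snapshot. This is only an illustration, not the kernel's do_wp_page(); the type pte_sketch_t, the SK_* flag values, and the functions sk_swap_out() and wp_fault_sketch() are all made-up names.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a page-table entry and its flag bits. */
typedef unsigned long pte_sketch_t;
#define SK_PRESENT 0x1UL
#define SK_WRITE   0x2UL
#define SK_DIRTY   0x4UL

/* Hook type simulating work that may sleep and let the entry change. */
typedef void (*sk_sleep_hook)(pte_sketch_t *slot);

/* Example interference: something clears the present bit while we sleep. */
static void sk_swap_out(pte_sketch_t *slot)
{
    *slot &= ~SK_PRESENT;
}

/*
 * Precondition (as in the discussion): *slot is present and read-only.
 * Returns 1 if the entry was made writable, 0 if it changed meanwhile.
 */
int wp_fault_sketch(pte_sketch_t *slot, sk_sleep_hook may_sleep)
{
    pte_sketch_t snap = *slot;      /* local copy; nothing else can alter it */

    if (may_sleep)
        may_sleep(slot);            /* the live entry may change here */

    if (*slot != snap)              /* the one test that can detect a change */
        return 0;

    /*
     * snap still satisfies the precondition, so re-testing its present
     * or writable bits here could never branch.
     */
    *slot = snap | SK_WRITE | SK_DIRTY;
    return 1;
}
```

With a NULL hook the entry becomes writable and dirty; with sk_swap_out as the hook the comparison fails and the function bails out, which is the only case the extra tests could have been guarding against.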

Cheers,

Neil.

* Re: A couple of questions
  1999-03-02 13:11 Neil Booth
@ 1999-03-15 18:58 ` Stephen C. Tweedie
  1999-03-15 22:46   ` neil
  1999-03-16  2:11   ` Andrea Arcangeli
  0 siblings, 2 replies; 50+ messages in thread
From: Stephen C. Tweedie @ 1999-03-15 18:58 UTC (permalink / raw)
  To: Neil Booth; +Cc: linux-mm, Stephen Tweedie

Hi,

<Late answer: I've been offline for a couple of weeks>

On Tue, 02 Mar 1999 22:11:45 +0900, Neil Booth <NeilB@earthling.net> said:

> I have a couple of questions about do_wp_page; I hope they're welcome
> here.

> 1) do_wp_page has most execution paths doing an unlock_kernel() but
> there are a couple that don't. Why isn't this inconsistent? 

Good question, and a possible bug.  Anyone else care to glance at this?
It's a possible problem only on SMP, of course.  The obvious fix is:

----------------------------------------------------------------
--- mm/memory.c~	Tue Jan 19 01:33:10 1999
+++ mm/memory.c	Mon Mar 15 18:57:31 1999
@@ -651,13 +651,13 @@
 		delete_from_swap_cache(page_map);
 		/* FallThrough */
 	case 1:
-		/* We can release the kernel lock now.. */
-		unlock_kernel();
-
 		flush_cache_page(vma, address);
 		set_pte(page_table, pte_mkdirty(pte_mkwrite(pte)));
 		flush_tlb_page(vma, address);
 end_wp_page:
+		/* We can release the kernel lock now.. */
+		unlock_kernel();
+
 		if (new_page)
 			free_page(new_page);
 		return 1;
----------------------------------------------------------------
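
The shape of this fix, with every exit path funnelling through one label that drops the lock, can be shown with a small user-space sketch. It is not the kernel code: the spin lock built on a C11 atomic_flag merely stands in for the big kernel lock, and lock_big(), unlock_big(), try_lock_big() and fault_sketch() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the big kernel lock. */
static atomic_flag big_lock = ATOMIC_FLAG_INIT;

static void lock_big(void)
{
    while (atomic_flag_test_and_set(&big_lock))
        ;   /* spin until the flag was previously clear */
}

static void unlock_big(void)
{
    atomic_flag_clear(&big_lock);
}

/* Returns 1 if the lock was free; the caller then holds it. */
static int try_lock_big(void)
{
    return !atomic_flag_test_and_set(&big_lock);
}

/*
 * Called with the lock held; returns with it released on every path,
 * because the early exit jumps to the same label as the normal path.
 */
int fault_sketch(unsigned long *slot, unsigned long snapshot)
{
    void *scratch = malloc(64);     /* plays the role of new_page */

    if (*slot != snapshot)
        goto out;                   /* early exit still reaches the unlock */

    *slot = snapshot | 0x2UL;       /* the "normal" update */
out:
    unlock_big();
    free(scratch);                  /* free(NULL) is harmless */
    return 1;
}
```

A caller takes the lock with lock_big(), calls fault_sketch(), and can then confirm the lock was dropped by checking that try_lock_big() succeeds, whichever path was taken.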

> 2) The last 2 of the 3 branches to end_wp_page seem to me to be
> impossible code paths.

> 	if (!pte_present(pte))
> 		goto end_wp_page;
> 	if (pte_write(pte))
> 		goto end_wp_page;

No, the start of do_wp_page() looks like:

	pte = *page_table;
	new_page = __get_free_page(GFP_USER);

and the get_free_page() call can block if we are out of memory, dropping
the kernel lock in the process.  The page table can be modified by
kswapd during this interval.

--Stephen

* A couple of questions
@ 1999-03-02 13:11 Neil Booth
  1999-03-15 18:58 ` Stephen C. Tweedie
  0 siblings, 1 reply; 50+ messages in thread
From: Neil Booth @ 1999-03-02 13:11 UTC (permalink / raw)
  To: linux-mm

I have a couple of questions about do_wp_page; I hope they're welcome
here.

1) do_wp_page has most execution paths doing an unlock_kernel() but
there are a couple that don't. Why isn't this inconsistent? e.g. any of
the branches that call end_wp_page do not unlock the kernel. What am I
missing? Is it that these branches only happen if we slept while getting
the free page, and sleeping always unlocks the kernel?

2) The last 2 of the 3 branches to end_wp_page seem to me to be
impossible code paths.

	if (!pte_present(pte))
		goto end_wp_page;
	if (pte_write(pte))
		goto end_wp_page;

At entry, pte (= *page_table) is present and not writable as this is the
only way do_wp_page gets called from handle_pte_fault (and we hold the
kernel lock so nothing else can change *page_table). Being a local
variable, its contents cannot change, so why these 2 tests?

Cheers,

Neil.
