Re: filesystem corruption ?

From: Bernd Schubert <bernd-schubert@web.de>
To: Oleg Drokin <green@namesys.com>
Cc: reiserfs-list@namesys.com
Subject: Re: filesystem corruption ?
Date: Thu, 20 Mar 2003 19:23:48 +0100
Message-ID: <200303201923.48454.bernd-schubert@web.de>
In-Reply-To: <20030320200639.A8618@namesys.com>

[-- Attachment #1: Type: text/plain, Size: 3152 bytes --]

On Thursday 20 March 2003 18:06, Oleg Drokin wrote:
> Hello!
>
> On Thu, Mar 20, 2003 at 05:25:13PM +0100, Bernd Schubert wrote:
> > We use this filesystem a nfs-root-fs to several clients (exported as
> > read-only), so we are lucky, since we regularly backup the whole
> > partition. We have a backup from this Morning and another one from
> > Monday. Based on comparing the output of md5sum we can't find any
> > problems between the version from monday and the version of this morning,
> > *but* there are differences for some binaries in /usr/bin, such as gdb,
> > between the backup of this Morning and the Current files.
>
> Hm, interesting.
> And what are the differences? How big are they?

Since these are binary files, a colleague had the idea to use hexdump and diff,
so the command for the attached file was:

diff <(hexdump /worka/gdb) <(hexdump /usr/bin/gdb)|sort -k 2 >gdb.diff

So the lines beginning with '<' are from the working gdb and the lines
beginning with '>' are from the corrupted gdb. When you look into the diff
file you will see that only a few bits per line have changed.
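
The number of differing bytes could also be counted directly with cmp instead
of reading the hexdump diff. This is only a minimal sketch, assuming both files
have the same size; the helper name count_diff_bytes is hypothetical and was
not part of what we actually ran:

```shell
# Hypothetical helper: count the bytes that differ between two files.
# cmp -l prints one line per differing byte (offset and both byte values
# in octal), so counting its output lines gives the number of changed bytes.
count_diff_bytes() {
    cmp -l "$1" "$2" | wc -l | tr -d ' '
}

# Example call with the two paths from the diff command above:
#   count_diff_bytes /worka/gdb /usr/bin/gdb
```

A small count relative to the file size would fit the picture of a few flipped
bits rather than whole blocks being overwritten.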

> Anything interesting in logs?

Except perhaps 'Mar 20 16:46:58 hamilton kernel: invalidate: busy buffer', 
nothing else.

> Any events happening between morning backup and time of problem discovery?

Except that I recompiled a kernel and we installed some programs using
aptitude (it's a Debian system), nothing happened to the filesystem. There was
also no reboot, no crash, etc.

Update: The corruption probably happened at 15:48, since at that time an
xchat on one of the clients crashed, and this was the first thing we noticed.
The xchat binary was also affected by the corruption.
At the very same time another client was rebooted, and something seems to have
caused very strange NFS mounting behaviour from this machine: we see 189
mount attempts for '/', '/etc' and '/var' within 5 seconds from this client.
In the end the mounts succeeded, which is why we didn't notice the strange
mounting pattern earlier. Please note again that we export '/' read-only, so
the client shouldn't be able to corrupt the files.
Since it turns out that the corruption could be NFS-related, I have to
give further information about our server/client setup:
	We have both knfsd and unfsd (clusternfs) running on our server;
	knfsd serves '/' (read-only, reiserfs) and unfsd serves '/etc' and '/var'
	(read-write, ext2).
	Due to a current kernel limitation both have to use the same rpc-port, but
	luckily not the same udp/tcp port (the two mountd's, however, are running
	on different rpc-ports and different tcp/udp ports).
I hope that this is not the reason for our trouble; in any case, I don't see
how it could cause this kind of corruption at all.

I'm now going to modify the client's initrd and prevent something like this.

>
> > Do you have any ideas whats going wrong and what we can do?
>
> We need more info.

Just tell me what else you need! Should we run debugreiserfs ?

> Also check modification date of gdb, may be some process changed it?

It's not only gdb but also several other programs. The modification times and
file sizes are unchanged.
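
To get a complete list of affected files, the md5sum comparison could be run
over whole trees. This is a rough sketch only: the helper name checksum_tree
and the paths in the example are hypothetical, and it assumes GNU md5sum and
find are available:

```shell
# Hypothetical helper: print "md5sum  ./relative/path" for every regular
# file below the directory $1, sorted by path so two listings line up.
checksum_tree() {
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 )
}

# Example (placeholder paths, not our real ones):
#   checksum_tree /backup/usr/bin > backup.md5
#   checksum_tree /usr/bin        > current.md5
#   diff backup.md5 current.md5
```

Every line reported by the final diff names a file whose content differs
between the backup and the current filesystem.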

Thanks for your help,
	Bernd

[-- Attachment #2: gdb.diff.gz --]
[-- Type: application/x-gzip, Size: 2875 bytes --]