* Filesystem corruption?
@ 2018-10-22 20:02 Gervais, Francois
  2018-10-22 23:12 ` Qu Wenruo
  2018-10-23  9:25 ` Juergen Sauer
  0 siblings, 2 replies; 15+ messages in thread
From: Gervais, Francois @ 2018-10-22 20:02 UTC
  To: linux-btrfs

Hi,

I think I lost power on my btrfs disk and it looks like it is now in an unfunctional state.

Any idea how I could debug that issue?

Here is what I have:

kernel 4.4.0-119-generic
btrfs-progs v4.4



sudo btrfs check /dev/sdd
Checking filesystem on /dev/sdd
UUID: 9a14b7a1-672c-44da-b49a-1f6566db3e44
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
checking quota groups
Ignoring qgroup relation key 310
Ignoring qgroup relation key 311
Ignoring qgroup relation key 313
Ignoring qgroup relation key 321
Ignoring qgroup relation key 326
Ignoring qgroup relation key 346
Ignoring qgroup relation key 354
Ignoring qgroup relation key 355
Ignoring qgroup relation key 356
Ignoring qgroup relation key 367
Ignoring qgroup relation key 370
Ignoring qgroup relation key 371
Ignoring qgroup relation key 373
Ignoring qgroup relation key 71213169107796323
Ignoring qgroup relation key 71213169107796323
Ignoring qgroup relation key 71494644084506935
Ignoring qgroup relation key 71494644084506935
Ignoring qgroup relation key 71494644084506937
Ignoring qgroup relation key 71494644084506937
Ignoring qgroup relation key 71494644084506945
Ignoring qgroup relation key 71494644084506945
Ignoring qgroup relation key 71494644084506950
Ignoring qgroup relation key 71494644084506950
Ignoring qgroup relation key 71494644084506970
Ignoring qgroup relation key 71494644084506970
Ignoring qgroup relation key 71494644084506978
Ignoring qgroup relation key 71494644084506978
Ignoring qgroup relation key 71494644084506978
Ignoring qgroup relation key 71494644084506980
Ignoring qgroup relation key 71494644084506980
Ignoring qgroup relation key 71494644084506991
Ignoring qgroup relation key 71494644084506991
Ignoring qgroup relation key 71494644084506994
Ignoring qgroup relation key 71494644084506994
Ignoring qgroup relation key 71494644084506995
Ignoring qgroup relation key 71494644084506995
Ignoring qgroup relation key 71494644084506997
Ignoring qgroup relation key 71494644084506997
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
found 29301522460 bytes used err is 0
total csum bytes: 27525424
total tree bytes: 541573120
total fs tree bytes: 494223360
total extent tree bytes: 16908288
btree space waste bytes: 85047903
file data blocks allocated: 273892241408
 referenced 44667650048
extent buffer leak: start 29360128 len 16384
extent buffer leak: start 740524032 len 16384
extent buffer leak: start 446840832 len 16384
extent buffer leak: start 142819328 len 16384
extent buffer leak: start 143179776 len 16384
extent buffer leak: start 184107008 len 16384
extent buffer leak: start 190513152 len 16384
extent buffer leak: start 190939136 len 16384
extent buffer leak: start 239943680 len 16384
extent buffer leak: start 29392896 len 16384
extent buffer leak: start 295223296 len 16384
extent buffer leak: start 30556160 len 16384
extent buffer leak: start 29376512 len 16384
extent buffer leak: start 29409280 len 16384
extent buffer leak: start 29491200 len 16384
extent buffer leak: start 29556736 len 16384
extent buffer leak: start 29720576 len 16384
extent buffer leak: start 29884416 len 16384
extent buffer leak: start 30097408 len 16384
extent buffer leak: start 30179328 len 16384
extent buffer leak: start 30228480 len 16384
extent buffer leak: start 30277632 len 16384
extent buffer leak: start 30343168 len 16384
extent buffer leak: start 30392320 len 16384
extent buffer leak: start 30457856 len 16384
extent buffer leak: start 30507008 len 16384
extent buffer leak: start 30572544 len 16384
extent buffer leak: start 30621696 len 16384
extent buffer leak: start 30670848 len 16384
extent buffer leak: start 30720000 len 16384
extent buffer leak: start 30769152 len 16384
extent buffer leak: start 30801920 len 16384
extent buffer leak: start 30867456 len 16384
extent buffer leak: start 30916608 len 16384
extent buffer leak: start 102498304 len 16384
extent buffer leak: start 204488704 len 16384
extent buffer leak: start 237912064 len 16384
extent buffer leak: start 328499200 len 16384
extent buffer leak: start 684539904 len 16384
extent buffer leak: start 849362944 len 16384


* Re: Filesystem corruption?
  2018-10-22 20:02 Filesystem corruption? Gervais, Francois
@ 2018-10-22 23:12 ` Qu Wenruo
  2018-10-23  9:25 ` Juergen Sauer
  1 sibling, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2018-10-22 23:12 UTC
  To: Gervais, Francois, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 3657 bytes --]



On 2018/10/23 4:02 AM, Gervais, Francois wrote:
> Hi,
> 
> I think I lost power on my btrfs disk and it looks like it is now in an unfunctional state.

What does the word "unfunctional" mean?

Unable to mount? Or what else?

> 
> Any idea how I could debug that issue?
> 
> Here is what I have:
> 
> kernel 4.4.0-119-generic

The kernel is somewhat old now.

> btrfs-progs v4.4

The progs are definitely too old.

It's highly recommended to use the latest btrfs-progs for its better
"btrfs check" code.

> 
> 
> 
> sudo btrfs check /dev/sdd
> Checking filesystem on /dev/sdd
> UUID: 9a14b7a1-672c-44da-b49a-1f6566db3e44
> checking extents
> checking free space cache
> checking fs roots
> checking csums
> checking root refs

So no errors were reported for any of these essential trees.
Unless there is some bug in btrfs-progs 4.4, your fs should be mostly OK.

> checking quota groups
> Ignoring qgroup relation key 310
[snip]
> Ignoring qgroup relation key 71776119061217590

Just a lot of qgroup relation key problems.
Not a big problem, especially considering you're using an older kernel
without proper qgroup fixes.

Just in case, please run "btrfs check" with the latest btrfs-progs (v4.17.1)
to see if it reports any additional errors.

Regardless, if the fs can be mounted RW, mounting it and then running "btrfs
quota disable <mnt>" should disable quota and solve the problem.

Thanks,
Qu
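
The sequence Qu suggests could be sketched roughly as below. This is only an
illustration: /dev/sdd comes from the original report, while the mount point
/mnt, the helper names run and disable_quota, and the DRY_RUN switch are
assumptions made for the example. By default it only prints the commands;
setting DRY_RUN=0 (as root, with the filesystem unmounted) would execute them.

```shell
#!/bin/sh
# Hypothetical helper: print the command in dry-run mode (the default),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: check the device, mount it read-write,
# disable quota, and unmount again. Stops at the first failing step.
disable_quota() {
    dev=$1
    mnt=$2
    run btrfs check "$dev" &&
    run mount "$dev" "$mnt" &&
    run btrfs quota disable "$mnt" &&
    run umount "$mnt"
}

# Dry run: prints the four commands without touching the disk.
disable_quota /dev/sdd /mnt
```

Keeping the default as a dry run is a deliberate choice here, since the
commands act on a filesystem that may already be damaged.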

> found 29301522460 bytes used err is 0
> total csum bytes: 27525424
> total tree bytes: 541573120
> total fs tree bytes: 494223360
> total extent tree bytes: 16908288
> btree space waste bytes: 85047903
> file data blocks allocated: 273892241408
>  referenced 44667650048
> extent buffer leak: start 29360128 len 16384
> extent buffer leak: start 740524032 len 16384
> extent buffer leak: start 446840832 len 16384
> extent buffer leak: start 142819328 len 16384
> extent buffer leak: start 143179776 len 16384
> extent buffer leak: start 184107008 len 16384
> extent buffer leak: start 190513152 len 16384
> extent buffer leak: start 190939136 len 16384
> extent buffer leak: start 239943680 len 16384
> extent buffer leak: start 29392896 len 16384
> extent buffer leak: start 295223296 len 16384
> extent buffer leak: start 30556160 len 16384
> extent buffer leak: start 29376512 len 16384
> extent buffer leak: start 29409280 len 16384
> extent buffer leak: start 29491200 len 16384
> extent buffer leak: start 29556736 len 16384
> extent buffer leak: start 29720576 len 16384
> extent buffer leak: start 29884416 len 16384
> extent buffer leak: start 30097408 len 16384
> extent buffer leak: start 30179328 len 16384
> extent buffer leak: start 30228480 len 16384
> extent buffer leak: start 30277632 len 16384
> extent buffer leak: start 30343168 len 16384
> extent buffer leak: start 30392320 len 16384
> extent buffer leak: start 30457856 len 16384
> extent buffer leak: start 30507008 len 16384
> extent buffer leak: start 30572544 len 16384
> extent buffer leak: start 30621696 len 16384
> extent buffer leak: start 30670848 len 16384
> extent buffer leak: start 30720000 len 16384
> extent buffer leak: start 30769152 len 16384
> extent buffer leak: start 30801920 len 16384
> extent buffer leak: start 30867456 len 16384
> extent buffer leak: start 30916608 len 16384
> extent buffer leak: start 102498304 len 16384
> extent buffer leak: start 204488704 len 16384
> extent buffer leak: start 237912064 len 16384
> extent buffer leak: start 328499200 len 16384
> extent buffer leak: start 684539904 len 16384
> extent buffer leak: start 849362944 len 16384
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Filesystem corruption?
  2018-10-22 20:02 Filesystem corruption? Gervais, Francois
  2018-10-22 23:12 ` Qu Wenruo
@ 2018-10-23  9:25 ` Juergen Sauer
  1 sibling, 0 replies; 15+ messages in thread
From: Juergen Sauer @ 2018-10-23  9:25 UTC
  To: Gervais, Francois, Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 1060 bytes --]

On 22.10.18 at 22:02, Gervais, Francois wrote:
> Hi,
> 
> I think I lost power on my btrfs disk and it looks like it is now in an unfunctional state.
> 
> Any idea how I could debug that issue?
> 
> Here is what I have:
> 
> kernel 4.4.0-119-generic
> btrfs-progs v4.4
> 
> 
> 
> sudo btrfs check /dev/sdd
> Checking filesystem on /dev/sdd
> UUID: 9a14b7a1-672c-44da-b49a-1f6566db3e44
> checking extents
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> checking quota groups
> Ignoring qgroup relation key 310
> Ignoring qgroup relation key 311

There have been quite a lot of stability changes since kernel 4.4.
In practice, I first try booting a reasonably current kernel with current
btrfs-tools.

The easiest practical way to do so is to download an Arch Linux ISO, dd it
to a USB stick, and boot from that stick.

see: https://www.archlinux.org/

Afterwards I retry the repair.

This procedure has worked for me in more than one case before.
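
Writing the ISO to a stick as described is typically done with dd. A cautious
sketch follows; the image file name, the target device /dev/sdX, and the
helper write_iso with its DRY_RUN switch are assumptions for illustration,
and the options shown (bs=4M conv=fsync status=progress) are GNU dd options.
dd overwrites the target device, so the helper only prints the command
unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical helper: write an ISO image to a block device with dd.
# Prints the dd command by default; runs it only when DRY_RUN=0.
write_iso() {
    iso=$1
    dev=$2
    if [ ! -f "$iso" ]; then
        echo "no such image: $iso" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo dd "if=$iso" "of=$dev" bs=4M conv=fsync status=progress
    else
        dd "if=$iso" "of=$dev" bs=4M conv=fsync status=progress
    fi
}

# Example (file and device names are placeholders):
#   write_iso archlinux.iso /dev/sdX
```

Double-checking the device name (for instance with lsblk) before setting
DRY_RUN=0 is advisable, because the wrong target would be overwritten.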

Kind regards,
Jürgen Sauer



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: filesystem corruption ?
  2003-03-22 18:18         ` Bernd Schubert
@ 2003-03-22 18:37           ` Anders Widman
  0 siblings, 0 replies; 15+ messages in thread
From: Anders Widman @ 2003-03-22 18:37 UTC
  To: reiserfs-list

> Hello,

>> Though this machine will be replaced by a real server in a few month, I'm
>> still rather worried what happend. Even if its 'only' a hardware memory
>> problem this means lots of trouble for us -- on the one hand it seems not
>> to be memtest86 detectable and on the other hand our programs really do
>> need working memory, but of course this is not of your concern.

> Update: I yesterday started our fall-back-server and run another memtest86 on
> the suspected machine. A colleague just told me that memtest86 reported 3
> errors in test 8, well lets see what comes in test 11.
> So this either means that the physicians have run some experiments today or
> that the memory became damaged within 2 weeks.

Just a reminder that a few % of all hardware is broken, whether it is
the CPU, mainboard, RAM or something else.

> Thanks a lot for your help to identify this as a hardware problem.

> Best regards,
>         Bernd


   



--------
PGP public key: https://tnonline.net/secure/pgp_key.txt



* Re: filesystem corruption ?
  2003-03-21 13:01       ` Bernd Schubert
  2003-03-21 13:07         ` Oleg Drokin
@ 2003-03-22 18:18         ` Bernd Schubert
  2003-03-22 18:37           ` Anders Widman
  1 sibling, 1 reply; 15+ messages in thread
From: Bernd Schubert @ 2003-03-22 18:18 UTC
  To: Oleg Drokin; +Cc: reiserfs-list

Hello,

> Though this machine will be replaced by a real server in a few month, I'm
> still rather worried what happend. Even if its 'only' a hardware memory
> problem this means lots of trouble for us -- on the one hand it seems not
> to be memtest86 detectable and on the other hand our programs really do
> need working memory, but of course this is not of your concern.

Update: Yesterday I started our fall-back server and ran another memtest86 on
the suspect machine. A colleague just told me that memtest86 reported 3
errors in test 8; well, let's see what comes up in test 11.
So this means either that the physicists have run some experiments today or
that the memory became damaged within 2 weeks.

Thanks a lot for your help in identifying this as a hardware problem.

Best regards,
	Bernd


* Re: filesystem corruption ?
  2003-03-21 16:00           ` Russell Coker
@ 2003-03-21 17:14             ` Valdis.Kletnieks
  0 siblings, 0 replies; 15+ messages in thread
From: Valdis.Kletnieks @ 2003-03-21 17:14 UTC
  To: Russell Coker; +Cc: Oleg Drokin, reiserfs-list

[-- Attachment #1: Type: text/plain, Size: 1042 bytes --]

On Fri, 21 Mar 2003 17:00:09 +0100, Russell Coker said:

> The problem with lead is that it's poisonous and soft.  Having to wash your 
> hands after touching your computer could get annoying.

Been there, done that.  Had an experimental rig that for various reasons
meant that we had an old AT-class computer that had to be within 4-6 inches
of the beam in the accelerator (the actual constraint was an 18" tether, one
end of which was in a memory slot and the other end of which was actually in
the beam).  We had a lot of problems with backscattered stuff, so we ended up
with a jacket that had a layer of lead to stop the high-energy stuff, but
that had a lot of thermal neutrons coming out of it, so there was a layer
of paraffin (I think, it's been close to 20 years now) to moderate those,
and then the paraffin emitted the occasional alpha and beta particles, so
there was a layer of metal foil to stop THOSE.

If I got some more caffeine into me, I might even be able to relate the
multiple layers needed for that rig to filesystem design. ;)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]


* Re: filesystem corruption ?
  2003-03-21 13:07         ` Oleg Drokin
  2003-03-21 13:20           ` Bernd Schubert
@ 2003-03-21 16:00           ` Russell Coker
  2003-03-21 17:14             ` Valdis.Kletnieks
  1 sibling, 1 reply; 15+ messages in thread
From: Russell Coker @ 2003-03-21 16:00 UTC
  To: Oleg Drokin; +Cc: reiserfs-list

On Fri, 21 Mar 2003 14:07, Oleg Drokin wrote:
> I've learn in the school that if you put some bit amount of plumbum in

It's better known in English as "lead".

The problem with lead is that it's poisonous and soft.  Having to wash your 
hands after touching your computer could get annoying.

Other metals such as copper and steel will reduce the radiation and can also 
be used for protection against mechanical damage.

The best way to reduce radiation is by distance.  The inverse-square law
applies, so moving the computer further away from the experiment will reduce
the radiation more easily than anything else you may do.  One thing to
consider is diskless X-terminal machines if you need to operate a computer
near the experiment, so that if the X-term crashed from radiation, your
server with the data would continue running correctly.
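
As a rough worked example of the inverse-square law, the sketch below prints
the intensity at a new distance relative to the intensity at a reference
distance. It assumes an idealised, unshielded point source; the helper name
relative_intensity and the distances used are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: intensity at distance r2 relative to the intensity
# at distance r1, assuming an unshielded point source (inverse-square law).
relative_intensity() {
    awk -v r1="$1" -v r2="$2" 'BEGIN { printf "%.4f\n", (r1 / r2) ^ 2 }'
}

relative_intensity 1 2     # prints 0.2500
relative_intensity 1 10    # prints 0.0100
```

So under these assumptions, doubling the distance cuts the dose to a quarter,
and moving ten times further away cuts it to one percent.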

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page



* Re: filesystem corruption ?
  2003-03-21 13:07         ` Oleg Drokin
@ 2003-03-21 13:20           ` Bernd Schubert
  2003-03-21 16:00           ` Russell Coker
  1 sibling, 0 replies; 15+ messages in thread
From: Bernd Schubert @ 2003-03-21 13:20 UTC
  To: Oleg Drokin; +Cc: reiserfs-list

> I've learn in the school that if you put some bit amount of plumbum in
> between some area and source of radiation, chances are radiation that will
> reach the protected area will be of much lesser strenght.
> In fact you might go to those guys and ask them what matherial (and how
> much of it) is best suited to shield against stuff they generate.

We already discussed over lunch ordering something like this for our
systems ;-) (would be a rather strange order for a usual computer company,
wouldn't it?)
But in fact, I'm now really going to contact those guys and ask if they
have some equipment to detect their beams.

Have a nice weekend,
	Bernd


* Re: filesystem corruption ?
  2003-03-21 13:01       ` Bernd Schubert
@ 2003-03-21 13:07         ` Oleg Drokin
  2003-03-21 13:20           ` Bernd Schubert
  2003-03-21 16:00           ` Russell Coker
  2003-03-22 18:18         ` Bernd Schubert
  1 sibling, 2 replies; 15+ messages in thread
From: Oleg Drokin @ 2003-03-21 13:07 UTC
  To: Bernd Schubert; +Cc: reiserfs-list

Hello!

On Fri, Mar 21, 2003 at 02:01:38PM +0100, Bernd Schubert wrote:
> > So, the beam of X-rays run through the memory module corrupting some bits?
> > ;) This stuff should not have been written to disk, so probably
> > plain reboot should fix everything? Can you test that?
> indeed after rebooting everything is fine again. We will run another memtest86 

So on-disk corruption is out of the question.

> during the weekend, though I really don't believe we will find a problem.

Ask those physics guys to run some X-ray experiments while you are running memtest86 ;)

> Though this machine will be replaced by a real server in a few month, I'm 
> still rather worried what happend. Even if its 'only' a hardware memory 
> problem this means lots of trouble for us -- on the one hand it seems not to 
> be memtest86 detectable and on the other hand our programs really do need 

Well, it may not be detectable because no high-energy beams are running around at
the time of the test.

> working memory, but of course this is not of your concern.

I learned in school that if you put some amount of plumbum between
some area and a source of radiation, chances are the radiation that reaches the
protected area will be of much lesser strength.
In fact you might go to those guys and ask them what material (and how much of it)
is best suited to shield against the stuff they generate.

Bye,
    Oleg


* Re: filesystem corruption ?
  2003-03-21  7:32     ` Oleg Drokin
  2003-03-21 10:14       ` Bernd Schubert
@ 2003-03-21 13:01       ` Bernd Schubert
  2003-03-21 13:07         ` Oleg Drokin
  2003-03-22 18:18         ` Bernd Schubert
  1 sibling, 2 replies; 15+ messages in thread
From: Bernd Schubert @ 2003-03-21 13:01 UTC
  To: Oleg Drokin; +Cc: reiserfs-list

Hi,

> So, the beam of X-rays run through the memory module corrupting some bits?
> ;) This stuff should not have been written to disk, so probably
> plain reboot should fix everything? Can you test that?
>

Indeed, after rebooting everything is fine again. We will run another memtest86
during the weekend, though I really don't believe we will find a problem.

Though this machine will be replaced by a real server in a few months, I'm
still rather worried about what happened. Even if it's 'only' a hardware memory
problem, this means lots of trouble for us: on the one hand it does not seem to
be detectable by memtest86, and on the other hand our programs really do need
working memory. But of course this is not your concern.


Thanks for your help,
	Bernd


* Re: filesystem corruption ?
  2003-03-21  7:32     ` Oleg Drokin
@ 2003-03-21 10:14       ` Bernd Schubert
  2003-03-21 13:01       ` Bernd Schubert
  1 sibling, 0 replies; 15+ messages in thread
From: Bernd Schubert @ 2003-03-21 10:14 UTC
  To: Oleg Drokin; +Cc: reiserfs-list

On Friday 21 March 2003 08:32, you wrote:
> Hello!
>
> On Thu, Mar 20, 2003 at 07:23:48PM +0100, Bernd Schubert wrote:
> > > Hm, interesting.
> > > And what are the differences? How big are they?
> >
> > Since it are binaries files, a colleague had the idea to use hexdump and
> > diff, so the command for the attached file was:
> > diff <(hexdump /worka/gdb) <(hexdump /usr/bin/gdb)|sort -k 2 >gdb.diff
> > So the lines beginning with '<' are from working gdb and lines beginning
> > with '>' are from corrupted gdb. When you look into the diff-file you
> > will see, that only some bits per line have changed.
>
> I see.
> Basically you have two pages of data corrupted.
> And the corruption indeed looks like bit corruption.
> How about rebooting that box and checking if corruption pattern changes?
> Also I'd recommend you to run memtext86 for some time as this looks like
> bad memory pattern.

All of our machines have to pass a full memtest86 check before we put them
into use. This machine is about 3 weeks old; of course it also had to run
this test, and furthermore it has ECC memory.

>
> > > Any events happening between morning backup and time of problem
> > > discovery?
> >
> > Except, that I recompiled a kernel and we installed some programs using
> > aptitude (its a debian system), nothing happend to the filesystem. There
> > was also no reboot, no crash, etc.
> > Update: The corruption probably happend at 15:48, since at this time also
> > a xchat on one of the clients crashed and this was noticed by us at
> > first. The xchat binary was also affected by the corruption.
>
> So, the beam of X-rays run through the memory module corrupting some bits?

There is the 'Environmental Physics Institute' on the floor below us, and since
we currently have an extremely high hardware failure rate, I have been joking
for some time that they might be causing it (I believe they are indeed using
x-ray beams). I should really ask them if their equipment is shielded
properly ;-)

> ;) This stuff should not have been written to disk, so probably
> plain reboot should fix everything? Can you test that?

Yes of course, if something goes wrong we still have our fall back machine :-)

I will report in the afternoon if it worked.

Best regards,
	Bernd


* Re: filesystem corruption ?
  2003-03-20 18:23   ` Bernd Schubert
@ 2003-03-21  7:32     ` Oleg Drokin
  2003-03-21 10:14       ` Bernd Schubert
  2003-03-21 13:01       ` Bernd Schubert
  0 siblings, 2 replies; 15+ messages in thread
From: Oleg Drokin @ 2003-03-21  7:32 UTC
  To: Bernd Schubert; +Cc: reiserfs-list

Hello!

On Thu, Mar 20, 2003 at 07:23:48PM +0100, Bernd Schubert wrote:
> > Hm, interesting.
> > And what are the differences? How big are they?
> Since it are binaries files, a colleague had the idea to use hexdump and diff, 
> so the command for the attached file was:
> diff <(hexdump /worka/gdb) <(hexdump /usr/bin/gdb)|sort -k 2 >gdb.diff
> So the lines beginning with '<' are from working gdb and lines beginning with 
> '>' are from corrupted gdb. When you look into the diff-file you will see, 
> that only some bits per line have changed.

I see.
Basically you have two pages of data corrupted.
And the corruption indeed looks like bit corruption.
How about rebooting that box and checking if the corruption pattern changes?
Also I'd recommend you run memtest86 for some time, as this looks like
a bad memory pattern.
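
One way to check how many pages a corruption like this touches is to bucket
the differing byte offsets by page. The following is a sketch, not the method
used in the thread: diff_pages is a hypothetical helper, and the 4096-byte
page size is an assumption about the machine. It relies on "cmp -l", which
lists the 1-based offset of every differing byte.

```shell
#!/bin/sh
# Hypothetical helper: list "page_index differing_byte_count" for two files.
# PAGE_SIZE defaults to 4096, an assumption about the machine in question.
diff_pages() {
    cmp -l "$1" "$2" |
        awk -v ps="${PAGE_SIZE:-4096}" '
            # cmp -l prints: 1-based byte offset, then the two byte values
            { pages[int(($1 - 1) / ps)]++ }
            END { for (p in pages) print p, pages[p] }
        ' | sort -n
}

# Example, using the paths from the hexdump command in this thread:
#   diff_pages /worka/gdb /usr/bin/gdb
```

If only a couple of page indexes show up, that would fit damage to a few
cached pages in memory rather than scattered on-disk corruption.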

> > Any events happening between morning backup and time of problem discovery?
> Except, that I recompiled a kernel and we installed some programs using 
> aptitude (its a debian system), nothing happend to the filesystem. There was 
> also no reboot, no crash, etc.
> Update: The corruption probably happend at 15:48, since at this time also a 
> xchat on one of the clients crashed and this was noticed by us at first. The 
> xchat binary was also affected by the corruption.

So, a beam of X-rays ran through the memory module, corrupting some bits? ;)
This stuff should not have been written to disk, so a plain reboot
should probably fix everything. Can you test that?

Bye,
    Oleg


* Re: filesystem corruption ?
  2003-03-20 17:06 ` Oleg Drokin
@ 2003-03-20 18:23   ` Bernd Schubert
  2003-03-21  7:32     ` Oleg Drokin
  0 siblings, 1 reply; 15+ messages in thread
From: Bernd Schubert @ 2003-03-20 18:23 UTC
  To: Oleg Drokin; +Cc: reiserfs-list

[-- Attachment #1: Type: text/plain, Size: 3152 bytes --]

On Thursday 20 March 2003 18:06, Oleg Drokin wrote:
> Hello!
>
> On Thu, Mar 20, 2003 at 05:25:13PM +0100, Bernd Schubert wrote:
> > We use this filesystem a nfs-root-fs to several clients (exported as
> > read-only), so we are lucky, since we regularly backup the whole
> > partition. We have a backup from this Morning and another one from
> > Monday. Based on comparing the output of md5sum we can't find any
> > problems between the version from monday and the version of this morning,
> > *but* there are differences for some binaries in /usr/bin, such as gdb,
> > between the backup of this Morning and the Current files.
>
> Hm, interesting.
> And what are the differences? How big are they?

Since they are binary files, a colleague had the idea to use hexdump and diff,
so the command for the attached file was:

diff <(hexdump /worka/gdb) <(hexdump /usr/bin/gdb)|sort -k 2 >gdb.diff

So the lines beginning with '<' are from the working gdb and the lines beginning
with '>' are from the corrupted gdb. When you look into the diff file you will
see that only some bits per line have changed.

> Anything interesting in logs?

Except perhaps 'Mar 20 16:46:58 hamilton kernel: invalidate: busy buffer', 
nothing else.


> Any events happening between morning backup and time of problem discovery?

Except that I recompiled a kernel and we installed some programs using
aptitude (it's a Debian system), nothing happened to the filesystem. There was
also no reboot, no crash, etc.

Update: The corruption probably happened at 15:48, since at that time an
xchat on one of the clients also crashed, and that was what we noticed first.
The xchat binary was also affected by the corruption.
At the very same time another client was rebooted, and something seems to have
caused a very strange NFS mounting pattern from this machine. We see 189
mount attempts for '/', '/etc' and '/var' within 5 seconds from this client;
finally it was successful, which is why we didn't notice the strange mounting
pattern. Please note again that we export '/' read-only, so the client
shouldn't be able to corrupt the files.
Since it turns out that the corruption could be NFS related, I have to
give further information about our server/client setup:
	We have both knfsd and unfsd (clusternfs) running on our server;
	knfsd serves '/' (read-only, reiserfs) and unfsd serves '/etc' and '/var'
(read-write, ext2).
	Due to a current kernel limitation both have to use the same rpc-port, but
luckily not the same udp/tcp port (but both mountd's are running on different
rpc-ports and different tcp/udp ports).
I hope that this is not the reason for our trouble; anyway, I wouldn't know how
this could cause this kind of trouble at all.

I'm now going to modify the client's initrd to prevent something like this.

>
> > Do you have any ideas whats going wrong and what we can do?
>
> We need more info.

Just tell me what else you need! Should we run debugreiserfs ?

> Also check modification date of gdb, may be some process changed it?

It's not only gdb but also several other programs. The modification time and
file size are the same.


Thanks for your help,
	Bernd

[-- Attachment #2: gdb.diff.gz --]
[-- Type: application/x-gzip, Size: 2875 bytes --]


* Re: filesystem corruption ?
  2003-03-20 16:25 filesystem corruption ? Bernd Schubert
@ 2003-03-20 17:06 ` Oleg Drokin
  2003-03-20 18:23   ` Bernd Schubert
  0 siblings, 1 reply; 15+ messages in thread
From: Oleg Drokin @ 2003-03-20 17:06 UTC
  To: Bernd Schubert; +Cc: reiserfs-list

Hello!

On Thu, Mar 20, 2003 at 05:25:13PM +0100, Bernd Schubert wrote:

> We use this filesystem a nfs-root-fs to several clients (exported as 
> read-only), so we are lucky, since we regularly backup the whole partition. 
> We have a backup from this Morning and another one from Monday. Based on 
> comparing the output of md5sum we can't find any problems between the version 
> from monday and the version of this morning, *but* there are differences for 
> some binaries in /usr/bin, such as gdb, between the backup of this Morning 
> and the Current files.

Hm, interesting.
And what are the differences? How big are they?
Anything interesting in logs?
Any events happening between morning backup and time of problem discovery?

> Do you have any ideas whats going wrong and what we can do?

We need more info.
Also check modification date of gdb, may be some process changed it?

Bye,
    Oleg


* filesystem corruption ?
@ 2003-03-20 16:25 Bernd Schubert
  2003-03-20 17:06 ` Oleg Drokin
  0 siblings, 1 reply; 15+ messages in thread
From: Bernd Schubert @ 2003-03-20 16:25 UTC
  To: reiserfs-list

Hi,

we just encountered serious problems on our '/' reiserfs partition.
In short, before the full problem description:
"reiserfsck{3.6.3,4,5pre2} --check" doesn't find any problems.

Well, in detail this means that some binaries suddenly became corrupted. For 
example running gdb gives:

gdb: Symbol `emacs_ctlx_keymap' has different size in shared object, consider 
re-linking
Illegal instruction

We use this filesystem as an nfs-root-fs for several clients (exported
read-only), so we are lucky, since we regularly back up the whole partition.
We have a backup from this morning and another one from Monday. Comparing the
output of md5sum, we can't find any problems between the version from Monday
and the version from this morning, *but* there are differences for some
binaries in /usr/bin, such as gdb, between this morning's backup and the
current files.
(Well, to tell the truth, there are also some more differences between Monday's
backup, this morning's backup and the current version, but these are, of
course, only differences we caused ourselves by doing updates and kernel
compilations.)
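
A checksum comparison of this kind might look like the sketch below. It is
not the exact procedure used here: changed_files is a hypothetical helper,
and the two directory arguments stand for a backup tree and the live tree.
It lists files in the second tree that are new or whose md5sum differs from
the file at the same relative path in the first tree.

```shell
#!/bin/sh
# Hypothetical helper: print files under $2 that are new or whose MD5
# checksum differs from the file of the same relative path under $1.
changed_files() {
    old_list=$(mktemp) || return 1
    new_list=$(mktemp) || return 1
    (cd "$1" && find . -type f -exec md5sum {} + | sort) > "$old_list"
    (cd "$2" && find . -type f -exec md5sum {} + | sort) > "$new_list"
    # Keep lines found only in the second list, then strip the checksum.
    comm -13 "$old_list" "$new_list" | sed 's/^[0-9a-f]*  //' | sort
    rm -f "$old_list" "$new_list"
}

# Example (the backup path is a placeholder):
#   changed_files /backup/morning/usr/bin /usr/bin
```

Because only content is compared, this also catches files like the ones
reported here, where modification time and size stayed the same.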

We currently have remounted '/' (hda5) read-only and have run several versions 
of reiserfsck (including the current 3.6.5pre2), so 'reiserfsck --check 
/dev/hda5', but it doesn't find any problems.

Do you have any idea what's going wrong and what we can do?


Thanks in advance,
	Bernd


PS: a detailed system description:
		- Athlon 2000+ with 3GB ECC RAM (ECC is enabled in the bios, memtest86 also 
reports enabled ECC)
		- 80GB Western Digital harddisk on /dev/hda
		- (cdwriter on /dev/hdc)
		- kernel is 2.4.20 
		- '/' is on hda5; '/etc' and '/var' are on extra partitions
		- '/home' is mounted from another server
		
Around noon/afternoon I recompiled a new kernel for another system in
'/usr/src', which was probably the main write activity that day.



Thread overview: 15+ messages
2018-10-22 20:02 Filesystem corruption? Gervais, Francois
2018-10-22 23:12 ` Qu Wenruo
2018-10-23  9:25 ` Juergen Sauer
  -- strict thread matches above, loose matches on Subject: below --
2003-03-20 16:25 filesystem corruption ? Bernd Schubert
2003-03-20 17:06 ` Oleg Drokin
2003-03-20 18:23   ` Bernd Schubert
2003-03-21  7:32     ` Oleg Drokin
2003-03-21 10:14       ` Bernd Schubert
2003-03-21 13:01       ` Bernd Schubert
2003-03-21 13:07         ` Oleg Drokin
2003-03-21 13:20           ` Bernd Schubert
2003-03-21 16:00           ` Russell Coker
2003-03-21 17:14             ` Valdis.Kletnieks
2003-03-22 18:18         ` Bernd Schubert
2003-03-22 18:37           ` Anders Widman
