linux-kernel.vger.kernel.org archive mirror
* Re: 2.4.5 data corruption
       [not found] <53B208BD9A7FD311881A009027B6BBFB9EACFE@siamese>
@ 2001-06-19 19:01 ` Alan Cox
  2001-06-19 20:06   ` Stefan Traby
  0 siblings, 1 reply; 14+ messages in thread
From: Alan Cox @ 2001-06-19 19:01 UTC
  To: Adam Radford
  Cc: 'Stefan Traby', Alan Cox, Larry McVoy, linux-kernel, tytso

> Sometimes it takes either the kernel tree or our website some time to get 
> in 'sync' with the latest driver version. The latest driver version is 
> 1.02.00.007.  
> 
> There may be DAC960 like /proc support at some point for GUI haters.

Publishing enough info to let people write a GPL non gui management tool would
be a win in itself


* Re: 2.4.5 data corruption
  2001-06-19 19:01 ` 2.4.5 data corruption Alan Cox
@ 2001-06-19 20:06   ` Stefan Traby
  0 siblings, 0 replies; 14+ messages in thread
From: Stefan Traby @ 2001-06-19 20:06 UTC
  To: Alan Cox
  Cc: Adam Radford, 'Stefan Traby', Larry McVoy, linux-kernel, tytso

On Tue, Jun 19, 2001 at 08:01:11PM +0100, Alan Cox wrote:

> > Sometimes it takes either the kernel tree or our website some time to get 
> > in 'sync' with the latest driver version. The latest driver version is 
> > 1.02.00.007.  
> > 
> > There may be DAC960 like /proc support at some point for GUI haters.
> 
> Publishing enough info to let people write a GPL non gui management tool would
> be a win in itself

And on-disk superblock documentation.
I want to be able to recover from a single disk-failure and
power-fail conditions in all cases, not just in 50%.

3ware is simply unable to recover from some situations where recovery
is possible (reported to them at least two times).

Maybe they will understand this someday. I think that I have
a right to get my data back if it's possible.
It was extremely hard to explain the LGPL to them, and the
fact that they violated it, so I don't expect much.

Not publishing the specs is really extremely unfair;
it just shows how much they care about my data.

-- 

  ciao - 
    Stefan


* Re: 2.4.5 data corruption
  2001-06-14 18:20 ` Alan Cox
  2001-06-14 22:27   ` Eugene Crosser
  2001-06-19  3:00   ` Stefan Traby
@ 2001-06-19  9:13   ` Pedro M. Rodrigues
  2 siblings, 0 replies; 14+ messages in thread
From: Pedro M. Rodrigues @ 2001-06-19  9:13 UTC
  To: linux-kernel

On 19 Jun 2001, at 5:00, Stefan Traby wrote:

> On Thu, Jun 14, 2001 at 07:20:06PM +0100, Alan Cox wrote:
> > > Folks, I believe I have a reproducible test case which corrupts
> > > data in 2.4.5.
> > 
> > 2.4.5 has an out of date 3ware driver that is short
> 
> > +   1.02.00.007 - Fix possible null pointer dereferences in
> > +   tw_ioctl().
> > +                 Remove check for invalid done function pointer
> > +                 from tw_scsi_queue().
> 
> hehe, this one keeps the 3dmd from running here at all.

  Saw that one here too. 

[...]

> (like DAC); I guess that many people would love to get rid
> of the - sorry - fucking closed-source and totally broken 3dmd
> which makes an extremely nice product totally useless (you can't
> trust it; not only because it's closed source, it simply doesn't
> work (except that it wastes memory, that works fine. tested.))
> 
> -- 

   3dmd does have a lot of problems, but I thought it was just me. I 
only made it work once, on one machine, and not very well. Last week 
I installed the latest version on another of my machines and after 
half an hour wrestling with it - trying to make it change passwords 
and ask for one, among other things - I gave up.


> 
>   ciao - 
>     Stefan
> 

Pedro


* Re: 2.4.5 data corruption
  2001-06-19  3:00   ` Stefan Traby
@ 2001-06-19  7:49     ` Alan Cox
  0 siblings, 0 replies; 14+ messages in thread
From: Alan Cox @ 2001-06-19  7:49 UTC
  To: stefan; +Cc: Alan Cox, Larry McVoy, linux-kernel, tytso

> Well, I do not understand how the driver is distributed.
> The current 3ware stuff won't compile on 2.4.x, and the stuff in kernel
> is always different from 3ware releases.

The stuff in the -ac tree is directly from 3ware


* Re: 2.4.5 data corruption
  2001-06-14 18:20 ` Alan Cox
  2001-06-14 22:27   ` Eugene Crosser
@ 2001-06-19  3:00   ` Stefan Traby
  2001-06-19  7:49     ` Alan Cox
  2001-06-19  9:13   ` Pedro M. Rodrigues
  2 siblings, 1 reply; 14+ messages in thread
From: Stefan Traby @ 2001-06-19  3:00 UTC
  To: Alan Cox; +Cc: Larry McVoy, linux-kernel, tytso

On Thu, Jun 14, 2001 at 07:20:06PM +0100, Alan Cox wrote:
> > Folks, I believe I have a reproducible test case which corrupts data in
> > 2.4.5.
> 
> 2.4.5 has an out of date 3ware driver that is short

> +   1.02.00.007 - Fix possible null pointer dereferences in tw_ioctl().
> +                 Remove check for invalid done function pointer from
> +                 tw_scsi_queue().

hehe, this one keeps the 3dmd from running here at all.

> That might be a first thing to check

Well, I do not understand how the driver is distributed.
The current 3ware stuff won't compile on 2.4.x, and the stuff in kernel
is always different from 3ware releases.

I use two 8-port cards (8 disks each) and I see different but
fatal problems on both systems.

Is anyone here using current firmware and RAID-5?
Does it work to some degree on the 6800?

Anyway, a useful proc-interface would be really cool
(like DAC); I guess that many people would love to get rid
of the - sorry - fucking closed-source and totally broken 3dmd
which makes an extremely nice product totally useless (you can't
trust it; not only because it's closed source, it simply doesn't
work (except that it wastes memory, that works fine. tested.))

-- 

  ciao - 
    Stefan

" destroy-your-data-by-3dmd-no-need-for-hammer-anymore CNAME www.3ware.com. "


* Re: 2.4.5 data corruption
  2001-06-15 19:54       ` Eugene Crosser
@ 2001-06-15 20:17         ` Larry McVoy
  0 siblings, 0 replies; 14+ messages in thread
From: Larry McVoy @ 2001-06-15 20:17 UTC
  To: Eugene Crosser; +Cc: linux-kernel

On Fri, Jun 15, 2001 at 11:54:20PM +0400, Eugene Crosser wrote:
> In article <E15Afvk-0005aV-00@the-village.bc.nu>,
>         Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
> >> any problems since 2.4.5 was published, they seem to have surfaced
> >> immediately after I created a rather big file capturing video with
> >> broadcast2000 (video card is bt848).  Filesystem is ext2.
> > 
> > That's something I've seen reported elsewhere. The high-bandwidth capture card
> > stuff seems to show up problems. It could be drivers, could be hardware. On
> > my AMD 751 pre-release board I see that problem, but on the 751 production board
> > I don't.
> 
> You must be right, today I created another big file with the same program
> but without doing capture and the filesystem was intact.  OTOH,
> Russell Leighton reports corruption when creating a file with dd...

For what it is worth, after having three failures in a row, now it isn't
happening.  My test case is/was my nightly backup.  If it happens again,
I'll save the corrupted data so we can do more digging.  I'm kicking 
myself for not having done it the first time around.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 


* Re: 2.4.5 data corruption
  2001-06-14 22:49     ` Alan Cox
@ 2001-06-15 19:54       ` Eugene Crosser
  2001-06-15 20:17         ` Larry McVoy
  0 siblings, 1 reply; 14+ messages in thread
From: Eugene Crosser @ 2001-06-15 19:54 UTC
  To: linux-kernel

In article <E15Afvk-0005aV-00@the-village.bc.nu>,
        Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
>> any problems since 2.4.5 was published, they seem to have surfaced
>> immediately after I created a rather big file capturing video with
>> broadcast2000 (video card is bt848).  Filesystem is ext2.
> 
> That's something I've seen reported elsewhere. The high-bandwidth capture card
> stuff seems to show up problems. It could be drivers, could be hardware. On
> my AMD 751 pre-release board I see that problem, but on the 751 production board
> I don't.

You must be right, today I created another big file with the same program
but without doing capture and the filesystem was intact.  OTOH,
Russell Leighton reports corruption when creating a file with dd...

Eugene


* Re: 2.4.5 data corruption
  2001-06-14 22:27   ` Eugene Crosser
  2001-06-14 22:49     ` Alan Cox
@ 2001-06-15 12:02     ` Russell Leighton
  1 sibling, 0 replies; 14+ messages in thread
From: Russell Leighton @ 2001-06-15 12:02 UTC
  To: linux-kernel


Nuther anecdote:

I was creating a big swapfile on ext2 (because 2.4.5 needs too much swap)
with dd (SCSI disk on Sym53c8-something controller) and corrupted
the partition; THEN fsck would cause the kernel to panic. I thought
I had some bad hw ... the box sits on my office floor awaiting resurrection.

Eugene Crosser wrote:

> In article <E15Abiw-00056O-00@the-village.bc.nu>,
>         Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
> >> Folks, I believe I have a reproducible test case which corrupts data in
> >> 2.4.5.
> >
> > 2.4.5 has an out of date 3ware driver that is short
>
> These days I observed massive FS corruption on vanilla 2.4.5,
> SCSI disk on Sym53c8-something controller (UW).  I did not notice
> any problems since 2.4.5 was published, they seem to have surfaced
> immediately after I created a rather big file capturing video with
> broadcast2000 (video card is bt848).  Filesystem is ext2.
>
> Eugene

--
---------------------------------------------------
Russell Leighton    russell.leighton@247media.com
---------------------------------------------------




* Re: 2.4.5 data corruption
  2001-06-14 22:27   ` Eugene Crosser
@ 2001-06-14 22:49     ` Alan Cox
  2001-06-15 19:54       ` Eugene Crosser
  2001-06-15 12:02     ` Russell Leighton
  1 sibling, 1 reply; 14+ messages in thread
From: Alan Cox @ 2001-06-14 22:49 UTC
  To: Eugene Crosser; +Cc: linux-kernel

> any problems since 2.4.5 was published, they seem to have surfaced
> immediately after I created a rather big file capturing video with
> broadcast2000 (video card is bt848).  Filesystem is ext2.

That's something I've seen reported elsewhere. The high-bandwidth capture card
stuff seems to show up problems. It could be drivers, could be hardware. On
my AMD 751 pre-release board I see that problem, but on the 751 production board
I don't.


* Re: 2.4.5 data corruption
  2001-06-14 18:20 ` Alan Cox
@ 2001-06-14 22:27   ` Eugene Crosser
  2001-06-14 22:49     ` Alan Cox
  2001-06-15 12:02     ` Russell Leighton
  2001-06-19  3:00   ` Stefan Traby
  2001-06-19  9:13   ` Pedro M. Rodrigues
  2 siblings, 2 replies; 14+ messages in thread
From: Eugene Crosser @ 2001-06-14 22:27 UTC
  To: linux-kernel

In article <E15Abiw-00056O-00@the-village.bc.nu>,
        Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
>> Folks, I believe I have a reproducible test case which corrupts data in
>> 2.4.5.
> 
> 2.4.5 has an out of date 3ware driver that is short

These days I observed massive FS corruption on vanilla 2.4.5,
SCSI disk on Sym53c8-something controller (UW).  I did not notice
any problems since 2.4.5 was published, they seem to have surfaced
immediately after I created a rather big file capturing video with
broadcast2000 (video card is bt848).  Filesystem is ext2.

Eugene


* Re: 2.4.5 data corruption
  2001-06-12 20:17 Larry McVoy
  2001-06-13 15:09 ` Nathan Straz
  2001-06-13 23:39 ` Chris Mason
@ 2001-06-14 18:20 ` Alan Cox
  2001-06-14 22:27   ` Eugene Crosser
                     ` (2 more replies)
  2 siblings, 3 replies; 14+ messages in thread
From: Alan Cox @ 2001-06-14 18:20 UTC
  To: Larry McVoy; +Cc: linux-kernel, tytso

> Folks, I believe I have a reproducible test case which corrupts data in
> 2.4.5.

2.4.5 has an out-of-date 3ware driver that is missing these changes:

+   1.02.00.005 - Allocate bounce buffers and custom queue depth for raid5 for
+                 6000 and 5000 series controllers.
+                 Reduce polling mdelays causing problems on some systems.
+                 Fix use_sg = 1 calculation bug.
+                 Check for scsi_register returning NULL.
+                 Add aen count to /proc/scsi/3w-xxxx.
+                 Remove aen code unit masking in tw_aen_complete().
+   1.02.00.006 - Remove unit from printk in tw_scsi_eh_abort(), causing
+                 possible oops.
+                 Fix possible null pointer dereference in tw_scsi_queue()
+                 if done function pointer was invalid.
+   1.02.00.007 - Fix possible null pointer dereferences in tw_ioctl().
+                 Remove check for invalid done function pointer from
+                 tw_scsi_queue().

That might be a first thing to check




* Re: 2.4.5 data corruption
  2001-06-12 20:17 Larry McVoy
  2001-06-13 15:09 ` Nathan Straz
@ 2001-06-13 23:39 ` Chris Mason
  2001-06-14 18:20 ` Alan Cox
  2 siblings, 0 replies; 14+ messages in thread
From: Chris Mason @ 2001-06-13 23:39 UTC
  To: Larry McVoy, linux-kernel



On Tuesday, June 12, 2001 01:17:49 PM -0700 Larry McVoy <lm@bitmover.com>
wrote:

> Folks, I believe I have a reproducible test case which corrupts data in
> 2.4.5.
> 
> We do nightly, weekly, and monthly backups by copying our entire /home
> partition on the company file server:
> 
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/hda1             1.9G  1.7G  123M  93% /
> /dev/hda6             1.9G  437M  1.4G  23% /tmp

What flavor of IDE controller?  Where is swap?

-chris



* Re: 2.4.5 data corruption
  2001-06-12 20:17 Larry McVoy
@ 2001-06-13 15:09 ` Nathan Straz
  2001-06-13 23:39 ` Chris Mason
  2001-06-14 18:20 ` Alan Cox
  2 siblings, 0 replies; 14+ messages in thread
From: Nathan Straz @ 2001-06-13 15:09 UTC
  To: Larry McVoy; +Cc: linux-kernel

On Tue, Jun 12, 2001 at 01:17:49PM -0700, Larry McVoy wrote:
> Folks, I believe I have a reproducible test case which corrupts data in
> 2.4.5.

Why don't you send the test case to the list?  I would love to try it
out and it would be a good addition to LTP.

-- 
Nate Straz                                              nstraz@sgi.com
sgi, inc                                           http://www.sgi.com/
Linux Test Project                                  http://ltp.sf.net/


* 2.4.5 data corruption
@ 2001-06-12 20:17 Larry McVoy
  2001-06-13 15:09 ` Nathan Straz
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Larry McVoy @ 2001-06-12 20:17 UTC
  To: linux-kernel; +Cc: tytso

Folks, I believe I have a reproducible test case which corrupts data in
2.4.5.

We do nightly, weekly, and monthly backups by copying our entire /home
partition on the company file server:

Filesystem            Size  Used Avail Use% Mounted on
/dev/hda1             1.9G  1.7G  123M  93% /
/dev/hda6             1.9G  437M  1.4G  23% /tmp
/dev/sda1              37G   26G   11G  71% /home
/dev/sdc1              37G   26G   11G  70% /weekly
/dev/sdd1              37G   24G   13G  65% /monthly
/dev/sdb1              37G   26G   11G  71% /nightly

The sd? drives are actually IDE drives on a 3ware Escalade controller.
I have reason to believe the drives are good: before I installed them
I scrubbed them with varying data patterns and verified that I got
back what I put there.  All tested cleanly overnight.
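
[Editor's sketch, not the procedure used in the original post.] A
write-then-verify scrub of the kind described above might look roughly
like this. The function name, the pattern set, and the block size are
assumptions made for illustration; it is shown against a scratch file,
and a read-back that is served from the page cache would not actually
exercise the disk.

```python
import os

# Hypothetical pattern set: all zeros, all ones, and two alternating-bit bytes.
PATTERNS = (b"\x00", b"\xff", b"\xaa", b"\x55")

def scrub(path, size, block=1 << 20, patterns=PATTERNS):
    """Fill `path` with each pattern, read it back, and return a list of
    (pattern, offset) pairs for every block that did not read back intact."""
    bad = []
    for pat in patterns:
        # Write pass: fill `size` bytes with the current pattern.
        with open(path, "wb") as f:
            remaining = size
            while remaining > 0:
                n = min(block, remaining)
                f.write(pat * n)
                remaining -= n
            f.flush()
            os.fsync(f.fileno())  # push the data out before verifying
        # Verify pass: compare every block against what was written.
        with open(path, "rb") as f:
            offset = 0
            while offset < size:
                n = min(block, size - offset)
                if f.read(n) != pat * n:
                    bad.append((pat, offset))
                offset += n
    return bad
```

An empty return list means every block read back as written for every
pattern.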

I recently added an integrity check to our backups - the integrity checker
writes out the path, the gzip adler32 checksum, the size, and the mtime of
each file.  Each time I do a backup, the backup scripts look for the 
integrity listing in the other partitions and compare all files with the
same path, size, and modtime.  
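
[Editor's sketch, not the actual backup scripts.] The integrity check
described above could be approximated as follows, assuming the listing is
held as an in-memory dict rather than a file on each partition. The
function names are hypothetical; zlib.adler32 is the standard-library
Adler-32 implementation.

```python
import os
import zlib

def adler32_of(path, chunk_size=1 << 16):
    """Stream a file through zlib.adler32 so large files need little memory."""
    checksum = 1  # Adler-32 is defined to start at 1
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF

def build_listing(root):
    """Map each relative path under `root` to (adler32, size, mtime)."""
    listing = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # only regular files are checksummed in this sketch
            st = os.stat(full)
            rel = os.path.relpath(full, root)
            listing[rel] = (adler32_of(full), st.st_size, int(st.st_mtime))
    return listing

def find_mismatches(new, old):
    """Return paths whose size and mtime agree but whose checksums differ."""
    bad = []
    for path, (csum, size, mtime) in new.items():
        other = old.get(path)
        if other is None:
            continue  # file is not present in the other listing
        o_csum, o_size, o_mtime = other
        if (size, mtime) == (o_size, o_mtime) and csum != o_csum:
            bad.append(path)
    return sorted(bad)
```

Comparing only entries whose size and mtime match keeps files that were
legitimately modified between backups from being reported as corrupt.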

This morning I had a pile of errors after things having gone smoothly for
the last few weeks.  I suspected that I had screwed something up, looked
over the backup scripts, simplified them down to a simple cpio, and tried
again.  Another pile of errors, different set of files.  

In both cases, the newly created files were corrupted, the ones on the 
live /home partition as well as the /weekly & /monthly partitions all 
compared cleanly.

I rebooted into 2.2.19, tried again, no errors.  I was running 2.4.5,
no patches.  I power cycled the machine between each reboot, went through
the bios memory check, and also went through my own memory check; memory 
does not seem to be an issue.

I think I can reproduce this, it takes a reboot and about 2 hours.  I made
it happen twice with 2.4.5, the first try on 2.2.19 did not reproduce it.

The data corruption looks like *extra* bytes added at the beginning of
files.  I only looked at a few, if we go down the path of debugging this
I'll save them all next time.  The extra byte counts were small, in one
case there was the letter "1" added to the start of the file, other than
that it was identical.  That's really weird, as a file system guy, I'd
expect to see blocks of data not small chunks of data.  Very strange.
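
[Editor's sketch.] One way to test for the pattern described above
(the corrupted copy equals a few extra bytes followed by the intact
original) is shown below. The function name is hypothetical, and it
assumes both copies fit in memory.

```python
def extra_prefix(original: bytes, corrupted: bytes):
    """Return the extra leading bytes if `corrupted` is exactly some
    prefix followed by `original`; otherwise return None."""
    extra = len(corrupted) - len(original)
    if extra > 0 and corrupted.endswith(original):
        return corrupted[:extra]
    return None
```

For the case mentioned above, a copy with the letter "1" prepended would
return b"1"; any other kind of damage returns None.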

One thing I haven't done is to rule out the 3ware controller.  I tend to
doubt it is the problem but who knows.  

There were no kernel messages complaining about anything during the 
backup, so the kernel doesn't seem to know there is a problem.

So, does anyone recognize these symptoms?  Does anyone care?  


Thread overview: 14+ messages
     [not found] <53B208BD9A7FD311881A009027B6BBFB9EACFE@siamese>
2001-06-19 19:01 ` 2.4.5 data corruption Alan Cox
2001-06-19 20:06   ` Stefan Traby
2001-06-12 20:17 Larry McVoy
2001-06-13 15:09 ` Nathan Straz
2001-06-13 23:39 ` Chris Mason
2001-06-14 18:20 ` Alan Cox
2001-06-14 22:27   ` Eugene Crosser
2001-06-14 22:49     ` Alan Cox
2001-06-15 19:54       ` Eugene Crosser
2001-06-15 20:17         ` Larry McVoy
2001-06-15 12:02     ` Russell Leighton
2001-06-19  3:00   ` Stefan Traby
2001-06-19  7:49     ` Alan Cox
2001-06-19  9:13   ` Pedro M. Rodrigues
