linux-kernel.vger.kernel.org archive mirror
* Re: Undo aic7xxx changes (now rc7+aic20030603)
       [not found]                   ` <20030618111010$154f@gated-at.bofh.it>
@ 2003-06-18 12:46                     ` Pascal Schmidt
  2003-06-18 12:49                       ` Stephan von Krawczynski
  0 siblings, 1 reply; 64+ messages in thread
From: Pascal Schmidt @ 2003-06-18 12:46 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel

Stephan von Krawczynski wrote in linux-kernel:

> around 70-100 GB of data is transferred to a nfs-server with rc8 onto a RAID5
> on 3ware-controller.
> The data is then copied via tar onto a SDLT drive connected to an aic
> controller.
> Afterwards the data is verified by tar.

Have you tried with a different SCSI controller to rule out bugs in st.c?

-- 
Ciao,
Pascal

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-18 12:46                     ` Undo aic7xxx changes (now rc7+aic20030603) Pascal Schmidt
@ 2003-06-18 12:49                       ` Stephan von Krawczynski
  0 siblings, 0 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-18 12:49 UTC (permalink / raw)
  To: Pascal Schmidt; +Cc: linux-kernel

On Wed, 18 Jun 2003 14:46:02 +0200
Pascal Schmidt <der.eremit@email.de> wrote:

> Stephan von Krawczynski wrote in linux-kernel:
> 
> > around 70-100 GB of data is transferred to a nfs-server with rc8 onto a RAID5
> > on 3ware-controller.
> > The data is then copied via tar onto a SDLT drive connected to an aic
> > controller.
> > Afterwards the data is verified by tar.
> 
> Have you tried with a different SCSI controller to rule out bugs in st.c?

Replacement part is not yet shipped.

Regards,
Stephan


* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-30 11:39                                                                               ` Marcelo Tosatti
@ 2003-06-30 12:08                                                                                 ` Stephan von Krawczynski
  0 siblings, 0 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-30 12:08 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel, stoffel, willy, kpfleming, gibbs, green

On Mon, 30 Jun 2003 08:39:38 -0300 (BRT)
Marcelo Tosatti <marcelo@conectiva.com.br> wrote:

> 
> 
> On Mon, 30 Jun 2003, Stephan von Krawczynski wrote:
> 
> > Hello all,
> >
> > it looks like the problem gets worse currently. This is the second day I
> > see 4 verification errors. This is with kernel 2.4.22-pre2 now.
> 
> 
> As far as I understood, the tape is corrupting the data (either when writing or
> when reading back).
> 
> Is this correct?

Actually my guess is that the _data_ itself is not corrupt, neither the
original set located on the 3ware RAID nor the backed-up set on the aic-connected
SDLT. In my opinion the flaw lies in the readback that occurs while verifying.
I do not know whether the data is corrupted by the aic-driver (less probable
currently) or by some flaw inside the caching of the _original_ set. The
situation is complex because of the multiple subsystems involved.

My experience is this:

If you reboot and run a backup/verify cycle from 3ware to aic/tape, everything
seems fine.

If you reboot and push data over NFS to the 3ware disk, then run the
backup/verify cycle (with this data) from 3ware to aic/tape, corruption is very
likely.

If you run another verify of the same data, the corruptions show up on
_other_ files than in the previous verify. It is therefore unlikely that the data
at either "end" is part of the problem, because then you would expect the same
corruptions to show up again - at least this is my hope.

Regards,
Stephan
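
The reasoning about repeated verify runs can be made mechanical. A minimal
sketch in Python, assuming each verify run has already been reduced to the list
of file names it reported as differing (the function name and the sample paths
are hypothetical, not taken from the actual runs):

```python
def classify_verify_runs(first_run, second_run):
    """Split the failures of two verify runs over the same tape.

    'repeated' holds files that failed in both runs, which would point at
    data that is wrong at rest (on disk or on tape). 'transient' holds files
    that failed in only one run, which would point at the read path.
    """
    first, second = set(first_run), set(second_run)
    return {
        "repeated": sorted(first & second),
        "transient": sorted(first ^ second),
    }


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    result = classify_verify_runs(["data/a.img", "data/b.img"],
                                  ["data/b.img", "data/c.img"])
    print(result)
```

If nearly everything lands in "transient", that supports the guess that neither
stored copy is damaged, although it cannot prove it.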



* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-30 10:10                                                                             ` Stephan von Krawczynski
@ 2003-06-30 11:39                                                                               ` Marcelo Tosatti
  2003-06-30 12:08                                                                                 ` Stephan von Krawczynski
  0 siblings, 1 reply; 64+ messages in thread
From: Marcelo Tosatti @ 2003-06-30 11:39 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: linux-kernel, stoffel, willy, kpfleming, gibbs, green



On Mon, 30 Jun 2003, Stephan von Krawczynski wrote:

> Hello all,
>
> it looks like the problem gets worse currently. This is the second day I see 4
> verification errors. This is with kernel 2.4.22-pre2 now.


As far as I understood, the tape is corrupting the data (either when writing or
when reading back).

Is this correct?


* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-26 11:34                                                                           ` Stephan von Krawczynski
@ 2003-06-30 10:10                                                                             ` Stephan von Krawczynski
  2003-06-30 11:39                                                                               ` Marcelo Tosatti
  0 siblings, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-30 10:10 UTC (permalink / raw)
  To: linux-kernel; +Cc: stoffel, willy, marcelo, kpfleming, gibbs, green

Hello all,

it looks like the problem gets worse currently. This is the second day I see 4
verification errors. This is with kernel 2.4.22-pre2 now.

Regards,
Stephan


* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-25 20:30                                                                         ` John Stoffel
  2003-06-26  9:36                                                                           ` Stephan von Krawczynski
@ 2003-06-26 11:34                                                                           ` Stephan von Krawczynski
  2003-06-30 10:10                                                                             ` Stephan von Krawczynski
  1 sibling, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-26 11:34 UTC (permalink / raw)
  To: John Stoffel; +Cc: willy, linux-kernel, marcelo, kpfleming, gibbs, green

On Wed, 25 Jun 2003 16:30:22 -0400
"John Stoffel" <stoffel@lucent.com> wrote:

> Maybe I need to try and generate 15-18 files 2gb+ each and dump them
> to tape with tar and see how that's handled, and if we get errors.

More data on this:
Today was a very bad day regarding the issue. I experienced three verification
errors, the filesizes were:

  563162975
  746555206
12679280738

So it seems it is not really linked to the filesize.

Regards,
Stephan



* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-25 20:30                                                                         ` John Stoffel
@ 2003-06-26  9:36                                                                           ` Stephan von Krawczynski
  2003-06-26 11:34                                                                           ` Stephan von Krawczynski
  1 sibling, 0 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-26  9:36 UTC (permalink / raw)
  To: John Stoffel; +Cc: willy, linux-kernel, marcelo, kpfleming, gibbs, green

On Wed, 25 Jun 2003 16:30:22 -0400
"John Stoffel" <stoffel@lucent.com> wrote:

> >>>>> "Stephan" == Stephan von Krawczynski <skraw@ithnet.com> writes:
> 
> Stephan> I have tried that already but never managed to get
> Stephan> verification errors on tar archives written to disk.  Maybe I
> Stephan> try again some more...
> 
> I've been trying to get tar errors myself, while writing a 35gb
> filesystem to a DLT7000.  I'm now running 2.4.21-pre5-ac1 and I
> haven't seen any errors.  Yet.  I'm using the 6.2.8 version of the
> driver as well.  The filesystem is just a copy of my home directory
> and some MP3s and other random files and such.  Lots of text and jpeg
> files, along with some other stuff. 
> 
> Maybe I need to try and generate 15-18 files 2gb+ each and dump them
> to tape with tar and see how that's handled, and if we get errors.
> 
> Stephan, can you double check your version info as well?  And it would
> be great to get some info on your 3ware setup as well, just so we can
> work on narrowing down the issues.

Hm, I guess you mean kernel version? I have been experiencing this problem since
about the 21-rcX versions, and am currently running 22-pre1.
The 3ware setup is pretty straightforward: a RAID5 with three 160 GB disks and no
spare.
I would not rule out that nfs interacts with this problem. Can you try to move
your backed-up data from somewhere via nfs to your tar'ing box?

Regards,
Stephan



* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-25 19:16                                                                     ` Willy Tarreau
  2003-06-25 19:42                                                                       ` Stephan von Krawczynski
@ 2003-06-25 23:04                                                                       ` Bernd Eckenfels
  1 sibling, 0 replies; 64+ messages in thread
From: Bernd Eckenfels @ 2003-06-25 23:04 UTC (permalink / raw)
  To: linux-kernel

In article <20030625191655.GA15970@alpha.home.local> you wrote:
> Hmmm no, you're right, I forgot about this case. I think that access time or
> other time-dependent information may change often enough to make a big diff
> on checksums. I have no other ideas at the moment. Or perhaps tar to a disk file
> instead of the tape and check that file :-/

You can cat the tree into md5sum, or run md5sum on the tree:

find . -print0 | xargs -0 cat | md5sum

This compares file content only. If you also want to test writes, you could
first dump it to a file and then md5sum that file.

Greetings
Bernd
-- 
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/
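
The content-only checksum of a tree can also be sketched in Python. This is an
illustration under assumptions, not a drop-in replacement for the pipeline
above: the function name is hypothetical, and files are visited in sorted order
so that two runs stay comparable even if the directory listing order changes.
Because the visiting order differs from what find produces, the digest will in
general not equal the one printed by the shell command.

```python
import hashlib
import os


def tree_md5(root):
    """Return one MD5 digest over the contents of all regular files under root."""
    digest = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the order in which subdirectories are visited
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only file content is hashed.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
    return digest.hexdigest()
```

Running it twice on an unchanged tree and getting two different digests would
indicate a problem on the read side, since names, timestamps and other metadata
do not enter the hash.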


* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-25 19:42                                                                       ` Stephan von Krawczynski
@ 2003-06-25 20:30                                                                         ` John Stoffel
  2003-06-26  9:36                                                                           ` Stephan von Krawczynski
  2003-06-26 11:34                                                                           ` Stephan von Krawczynski
  0 siblings, 2 replies; 64+ messages in thread
From: John Stoffel @ 2003-06-25 20:30 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Willy Tarreau, linux-kernel, marcelo, kpfleming, stoffel, gibbs, green

>>>>> "Stephan" == Stephan von Krawczynski <skraw@ithnet.com> writes:

Stephan> On Wed, 25 Jun 2003 21:16:55 +0200
Stephan> Willy Tarreau <willy@w.ods.org> wrote:

>> Hmmm no, you're right, I forgot about this case. I think that
>> access time or other time-dependent information may change often
>> enough to make a big diff on checksums. I have no other ideas at the
>> moment. Or perhaps tar to a disk file instead of the tape and check
>> that file :-/

Stephan> I have tried that already but never managed to get
Stephan> verification errors on tar archives written to disk.  Maybe I
Stephan> try again some more...

I've been trying to get tar errors myself, while writing a 35gb
filesystem to a DLT7000.  I'm now running 2.4.21-pre5-ac1 and I
haven't seen any errors.  Yet.  I'm using the 6.2.8 version of the
driver as well.  The filesystem is just a copy of my home directory
and some MP3s and other random files and such.  Lots of text and jpeg
files, along with some other stuff. 

Maybe I need to try and generate 15-18 files 2gb+ each and dump them
to tape with tar and see how that's handled, and if we get errors.

Stephan, can you double check your version info as well?  And it would
be great to get some info on your 3ware setup as well, just so we can
work on narrowing down the issues.

Unfortunately, due to the way I have to set things up, the RAID array
and the tape drive are on the same channel, which slows things down,
I'm sure.

Here are some timings from dumping and verifying the data to tape:

  jfsnew:/# time tar -c -W -b 128 -f /dev/st0 /scratch
  tar: Removing leading `/' from member names
  408.840u 869.730s 4:03:02.80 8.7%       0+0k 0+0io 258pf+0w

  jfsnew:/# time tar -c -W -b 256 -f /dev/st0 /scratch
  tar: Removing leading `/' from member names
  443.210u 1104.930s 4:07:00.89 10.4%     0+0k 0+0io 264pf+0w

My filesystem is as follows:

  jfsnew:/home# mdadm -D /dev/md1
  /dev/md1:
          Version : 00.90.00
    Creation Time : Mon Jun 23 22:51:43 2003
       Raid Level : raid0
       Array Size : 44457600 (42.40 GiB 45.57 GB)
     Raid Devices : 5
    Total Devices : 5
  Preferred Minor : 1
      Persistence : Superblock is persistent

      Update Time : Mon Jun 23 22:51:43 2003
            State : dirty, no-errors
   Active Devices : 5
  Working Devices : 5
   Failed Devices : 0
    Spare Devices : 0

       Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
         0       8       48        0      active sync   /dev/sdd
         1       8       64        1      active sync   /dev/sde
         2       8       80        2      active sync   /dev/sdf
         3       8       96        3      active sync   /dev/sdg
         4       8      112        4      active sync   /dev/sdh
             UUID : ffa7efb1:1c151f2d:4f6a138c:77085f29
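
The percentage column in the two timing lines can be re-derived from the other
fields, since csh-style time output lists user seconds, system seconds, elapsed
time and CPU share in that order. A short sketch with hypothetical helper
names; the only outside fact used is that tar's -b blocking factor counts
512-byte units:

```python
def elapsed_seconds(text):
    """Convert an elapsed time such as '4:03:02.80' (h:mm:ss) to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def cpu_percent(user, system, elapsed):
    """CPU share: (user + system) time divided by wall-clock time, in percent."""
    return 100.0 * (user + system) / elapsed_seconds(elapsed)


def tar_record_bytes(blocking_factor):
    """Size of one tar record for a given -b value (units of 512 bytes)."""
    return blocking_factor * 512


if __name__ == "__main__":
    runs = [(128, 408.840, 869.730, "4:03:02.80"),
            (256, 443.210, 1104.930, "4:07:00.89")]
    for factor, user, system, elapsed in runs:
        print(factor, tar_record_bytes(factor),
              round(cpu_percent(user, system, elapsed), 2))
```

Both computed shares come out close to the 8.7% and 10.4% that time printed, so
doubling the record size raised CPU time somewhat without shortening the run.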



* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-25 19:16                                                                     ` Willy Tarreau
@ 2003-06-25 19:42                                                                       ` Stephan von Krawczynski
  2003-06-25 20:30                                                                         ` John Stoffel
  2003-06-25 23:04                                                                       ` Bernd Eckenfels
  1 sibling, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-25 19:42 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: willy, linux-kernel, marcelo, kpfleming, stoffel, gibbs, green

On Wed, 25 Jun 2003 21:16:55 +0200
Willy Tarreau <willy@w.ods.org> wrote:

> On Wed, Jun 25, 2003 at 01:43:53AM +0200, Stephan von Krawczynski wrote:
> > > Ah, OK ! I didn't understand this. You're right, this is also a
> > > possibility. Perhaps a tar cf - /mnt/3ware | chkblk would get evidence of
> > > some corruption?
> > 
> > Hm, probably a dumb question: does repeated tar'ing of the same files lead
> > to exactly the same archive? There is no timestamp inside or something
> > equivalent?
> 
> Hmmm no, you're right, I forgot about this case. I think that access time or
> other time-dependent information may change often enough to make a big diff
> on checksums. I have no other ideas at the moment. Or perhaps tar to a disk
> file instead of the tape and check that file :-/

I have tried that already but never managed to get verification errors on tar
archives written to disk.
Maybe I try again some more...

Regards,
Stephan


* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-24 23:43                                                                   ` Stephan von Krawczynski
@ 2003-06-25 19:16                                                                     ` Willy Tarreau
  2003-06-25 19:42                                                                       ` Stephan von Krawczynski
  2003-06-25 23:04                                                                       ` Bernd Eckenfels
  0 siblings, 2 replies; 64+ messages in thread
From: Willy Tarreau @ 2003-06-25 19:16 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Willy Tarreau, linux-kernel, marcelo, kpfleming, stoffel, gibbs, green

On Wed, Jun 25, 2003 at 01:43:53AM +0200, Stephan von Krawczynski wrote:
> > Ah, OK ! I didn't understand this. You're right, this is also a possibility.
> > Perhaps a tar cf - /mnt/3ware | chkblk would get evidence of some corruption?
> 
> Hm, probably a dumb question: does repeated tar'ing of the same files lead to
> exactly the same archive? There is no timestamp inside or something equivalent?

Hmmm no, you're right, I forgot about this case. I think that access time or
other time-dependent information may change often enough to make a big diff
on checksums. I have no other ideas at the moment. Or perhaps tar to a disk file
instead of the tape and check that file :-/

Cheers,
Willy
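
Whether two archive runs over unchanged files give byte-identical output can be
tested directly rather than guessed. A small sketch that uses Python's tarfile
module instead of the tar binary (that substitution and the helper name are
assumptions of this example, so the result says nothing definite about a
particular tar version). It archives a tree into memory and returns a digest,
so two runs can be compared without a tape or a second copy on disk.

```python
import hashlib
import io
import os
import tarfile


def tar_digest(root):
    """Archive all files under root into memory and return a SHA-256 digest."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w",
                      format=tarfile.GNU_FORMAT) as archive:
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # fixed member order across runs
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                archive.add(path, arcname=os.path.relpath(path, root),
                            recursive=False)
    return hashlib.sha256(buffer.getvalue()).hexdigest()
```

Each member header written this way carries the file's modification time,
owner and mode, so identical digests are only to be expected while both the
contents and that metadata stay unchanged between runs.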



* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-24 21:26                                                               ` Stephan von Krawczynski
  2003-06-24 22:03                                                                 ` Willy Tarreau
@ 2003-06-25  2:22                                                                 ` Valdis.Kletnieks
  1 sibling, 0 replies; 64+ messages in thread
From: Valdis.Kletnieks @ 2003-06-25  2:22 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel


On Tue, 24 Jun 2003 23:26:09 +0200, Stephan von Krawczynski said:

> sorry, you probably misunderstood my flaky explanation. What I meant was not a
> cached block from the _tape_ (obviously indeed a char-type device) but from the
> 3ware disk (i.e. the other side of the verification). Consider the tape
> completely working, but the disk data corrupt (possibly not from real reading
> but from corrupted cache).

Don't rule out odder explanations either.  True story follows.. ;)

I once had the misfortune of being the admin for a Gould PN/9080. UTX/32 1.2
came out, and since it changed the inode format on disk, it's dump/mkfs/restore
time.  So I take the last 3 full backups, and do 2 more complete dumps besides.
I checked, and *NO* I/O errors had been reported (and then I checked THAT by
giving it a known bad tape and seeing errors WERE reported).

Do the upgrade... and *every single* tape was 'not in dump/restore format'.

Finally traced it down (this was the days when oscilloscopes were still useful)
to a bad 7400 series chip on the tape controller.  The backplane was a 32-bit
bus, the tape was an 8-bit device - so there was a 4-to-1 mux that had a bad
chip.  Bit 3 would be correct for 4 bytes, inverted for 4 bytes, correct for
4, etc..  Tape drive *NEVER* complained, because what came over the *cable*
was correct, parity and all..

Oh, and I got the data back something like this:

cat > mangle.c
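#include <unistd.h>

/* Toggle bit 0x20 in each byte of the second 32-bit word of every 8-byte group. */
int main(void) {
  int muck[2];
  while (read(0,muck,8) == 8) {
    muck[1] ^= 0x20202020;
    write(1,muck,8);
  }
  return 0;
}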
^D
cc -o mangle mangle.c
dd if=/dev/rmt0 bs=32k | ./mangle | restore -f -
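
The same repair, sketched in Python for illustration only (the actual fix was
the C program above; the function name is hypothetical). It toggles bit 0x20 in
bytes 4 to 7 of every 8-byte group, and since XOR is its own inverse, applying
it twice gives back the input. Unlike the C loop, which stops at a short read,
this version passes a trailing partial group through unchanged.

```python
def repair(data: bytes) -> bytes:
    """Flip bit 0x20 in the second half of each complete 8-byte group."""
    out = bytearray(data)
    whole = len(out) - len(out) % 8  # end of the last complete group
    for start in range(0, whole, 8):
        for index in range(start + 4, start + 8):
            out[index] ^= 0x20
    return bytes(out)


if __name__ == "__main__":
    damaged = bytes(range(16))
    fixed = repair(damaged)
    # Running the repair a second time restores the damaged input.
    print(repair(fixed) == damaged)
```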




* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-24 22:03                                                                 ` Willy Tarreau
@ 2003-06-24 23:43                                                                   ` Stephan von Krawczynski
  2003-06-25 19:16                                                                     ` Willy Tarreau
  0 siblings, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-24 23:43 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: willy, linux-kernel, marcelo, kpfleming, stoffel, gibbs, green

On Wed, 25 Jun 2003 00:03:31 +0200
Willy Tarreau <willy@w.ods.org> wrote:

> On Tue, Jun 24, 2003 at 11:26:09PM +0200, Stephan von Krawczynski wrote:
>  
> > sorry, you probably misunderstood my flaky explanation. What I meant was
> > not a cached block from the _tape_ (obviously indeed a char-type device)
> > but from the 3ware disk (i.e. the other side of the verification). Consider
> > the tape completely working, but the disk data corrupt (possibly not from
> > real reading but from corrupted cache).
> 
> Ah, OK ! I didn't understand this. You're right, this is also a possibility.
> Perhaps a tar cf - /mnt/3ware | chkblk would get evidence of some corruption?

Hm, probably a dumb question: does repeated tar'ing of the same files lead to
exactly the same archive? There is no timestamp inside or something equivalent?

Regards,
Stephan


* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-24 21:26                                                               ` Stephan von Krawczynski
@ 2003-06-24 22:03                                                                 ` Willy Tarreau
  2003-06-24 23:43                                                                   ` Stephan von Krawczynski
  2003-06-25  2:22                                                                 ` Valdis.Kletnieks
  1 sibling, 1 reply; 64+ messages in thread
From: Willy Tarreau @ 2003-06-24 22:03 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Willy Tarreau, linux-kernel, marcelo, kpfleming, stoffel, gibbs, green

On Tue, Jun 24, 2003 at 11:26:09PM +0200, Stephan von Krawczynski wrote:
 
> sorry, you probably misunderstood my flaky explanation. What I meant was not a
> cached block from the _tape_ (obviously indeed a char-type device) but from the
> 3ware disk (i.e. the other side of the verification). Consider the tape
> completely working, but the disk data corrupt (possibly not from real reading
> but from corrupted cache).

Ah, OK ! I didn't understand this. You're right, this is also a possibility.
Perhaps a tar cf - /mnt/3ware | chkblk would get evidence of some corruption?

<...snip... OK for these points ...>
 
> Hm, interestingly the former freeze bug (solved by marcelo through backout of
> some patch in rc8) did not show up in UP. Since then I did not test UP any
> more. The problem itself does not necessarily point to flaky hardware, as I
> would have no idea how bad cache can only show up during a tape verification,
> that does not sound all that reasonable.

OK, I agree. And right after posting, I remembered that if this was the case,
you should also see some MCEs which doesn't seem to be your case.

> More likely could be a SMP race anywhere from nfs-server, 3ware disk driver to
> page cache, or not?

fairly possible. That's also what Justin suggested in the past, BTW :-)

Cheers,
Willy



* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-24 17:43                                                             ` Willy Tarreau
@ 2003-06-24 21:26                                                               ` Stephan von Krawczynski
  2003-06-24 22:03                                                                 ` Willy Tarreau
  2003-06-25  2:22                                                                 ` Valdis.Kletnieks
  0 siblings, 2 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-24 21:26 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: linux-kernel, willy, marcelo, kpfleming, stoffel, gibbs, green

On Tue, 24 Jun 2003 19:43:31 +0200
Willy Tarreau <willy@w.ods.org> wrote:

> Hi Stephan,
> 
> > Is it possible that the verification errors do not occur because of a read
> > problem, but because of a page cached block getting trashed somehow between
> > "tar to tape" and "read from tape". I would suspect that some blocks
> > survive in memory and are re-used during verification. If for some reason
> > this data is invalid or corrupted the verification fails although the read
> > was correct.
> 
> That seems strange to me, I don't see how we could cache data from a char
> device.

Hello Willy,

sorry, you probably misunderstood my flaky explanation. What I meant was not a
cached block from the _tape_ (obviously indeed a char-type device) but from the
3ware disk (i.e. the other side of the verification). Consider the tape
completely working, but the disk data corrupt (possibly not from real reading
but from corrupted cache).

> It is possible that chkblk and tar don't use same block size and that
> your problem only occurs on larger transfers, or particularly aligned ones.

Very likely not the same block size, with tar I use -b64.
 
> You could try to increase the block size in chkblk to something bigger than a
> page for example. I don't know if tar reads your tape at full speed,

It does. There's no head repositioning.

> but it's
> possible that if it doesn't cope with the tape speed, an overrun occurs and
> something finally gets dropped :-/

Very unlikely, how do you create an overrun in a synchronous single read
operation?
 
> > I know that this sounds weird, but nevertheless possible, or not?
> > It may even be worse, the data may have also been left from the original
> > nfs action, correct?
> > Is there a way to completely invalidate/flush all cached blocks concerning
> > this fs (besides umount)?
> 
> I don't believe this. But as Justin says, this card can reach very high
> performance and stress the hardware. Perhaps you have a rare weakness in
> your hardware that only occurs under these conditions, although I don't know
> how this could be checked.

I doubt that. The reason is that although the tape is pretty fast for a tape, it
is still pretty slow compared to a disk. Since I have been using the box for
months now, I would have expected such a hardware problem to show up for disk
access, too. And there was none.

> IIRC, you said that it works flawlessly in UP and you need SMP to hit the
> bug. Perhaps your second CPU is sometimes flaky (bad cache, etc...) :-/

Hm, interestingly the former freeze bug (solved by marcelo through backout of
some patch in rc8) did not show up in UP. Since then I did not test UP any
more. The problem itself does not necessarily point to flaky hardware, as I
would have no idea how bad cache can only show up during a tape verification,
that does not sound all that reasonable.
More likely could be a SMP race anywhere from nfs-server, 3ware disk driver to
page cache, or not?

Regards,
Stephan




* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-20 22:03                                                   ` Willy Tarreau
  2003-06-20 23:48                                                     ` Stephan von Krawczynski
@ 2003-06-24 18:31                                                     ` Bill Davidsen
  1 sibling, 0 replies; 64+ messages in thread
From: Bill Davidsen @ 2003-06-24 18:31 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Marcelo Tosatti, Linux Kernel Mailing List

On Sat, 21 Jun 2003, Willy Tarreau wrote:

> On Fri, Jun 20, 2003 at 06:13:53PM -0300, Marcelo Tosatti wrote:
> > > Actually, without another copy of the data on a different system to
> > > verify it with, you can't know that for sure. It could easily be getting
> > > to the tape (the actual media) just fine, but then get corrupted during
> > > the verify readback.
> > 
> > Right. Stephan, if you could use a bit of your time to isolate the problem
> > I would be VERY grateful.
> 
> I remember Stephan once said that he used tar to verify the tape, and that for
> one backup, he did several tests showing corruption on different files. Although
> that doesn't mean that the tape is written totally correctly, it proves that
> there's at least a read corruption.
> 
> I think that comparing multiple reads to find a pattern in corruption offsets
> (if any) is the only thing he could do (not speaking about mixing read/writes
> with good/bad kernels). Of course, storing several times 70GB on disk is not
> easy, but at least a 16-bit checksum for each 1kB block would result in about
> 140 MB files, which will be "easier" to compare. It could be enough to check
> for empty blocks, duplicated blocks or totally random ones.

Actually, to find problems like this, a change to cpio would be useful:

  find /home | cpio -oB -Hcrc >/dev/st0

as an example. When reading back you will see errors from the CRC on each
file. I use cpio for this reason in some cases where knowing it's right
is critical.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
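
The idea of a per-file checksum recorded at backup time can be sketched outside
cpio as well. The following does not reproduce cpio's archive format; it uses
zlib.crc32 and hypothetical helper names, builds a manifest of checksums before
the backup, and afterwards lists the files whose read-back copy no longer
matches.

```python
import os
import zlib


def file_crc32(path, chunk_size=1 << 20):
    """CRC-32 of one file, read in chunks so large files fit in memory."""
    crc = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def build_manifest(root):
    """Map each file's path relative to root to its CRC-32."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            manifest[os.path.relpath(path, root)] = file_crc32(path)
    return manifest


def mismatches(manifest, restored_root):
    """Return the relative paths that are missing or differ after read-back."""
    bad = []
    for relative, crc in sorted(manifest.items()):
        path = os.path.join(restored_root, relative)
        if not os.path.isfile(path) or file_crc32(path) != crc:
            bad.append(relative)
    return bad
```

Because the manifest is computed once and stored, a later mismatch cannot be
explained by the source data having changed in the meantime, which is the
ambiguity a plain compare against the live filesystem leaves open.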



* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-24 11:11                                                           ` Stephan von Krawczynski
@ 2003-06-24 17:43                                                             ` Willy Tarreau
  2003-06-24 21:26                                                               ` Stephan von Krawczynski
  0 siblings, 1 reply; 64+ messages in thread
From: Willy Tarreau @ 2003-06-24 17:43 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: linux-kernel, willy, marcelo, kpfleming, stoffel, gibbs, green

Hi Stephan,

> Is it possible that the verification errors do not occur because of a read
> problem, but because of a page cached block getting trashed somehow between
> "tar to tape" and "read from tape". I would suspect that some blocks survive in
> memory and are re-used during verification. If for some reason this data is
> invalid or corrupted the verification fails although the read was correct.

That seems strange to me, I don't see how we could cache data from a char
device. It is possible that chkblk and tar don't use same block size and that
your problem only occurs on larger transfers, or particularly aligned ones.

You could try to increase the block size in chkblk to something bigger than a
page for example. I don't know if tar reads your tape at full speed, but it's
possible that if it doesn't cope with the tape speed, an overrun occurs and
something finally gets dropped :-/

> I know that this sounds weird, but nevertheless possible, or not?
> It may even be worse, the data may have also been left from the original nfs
> action, correct?
> Is there a way to completely invalidate/flush all cached blocks concerning this
> fs (besides umount)?

I don't believe this. But as Justin says, this card can reach very high
performance and stress the hardware. Perhaps you have a rare weakness in your
hardware that only occurs under these conditions, although I don't know how
this could be checked.

IIRC, you said that it works flawlessly in UP and you need SMP to hit the bug.
Perhaps your second CPU is sometimes flaky (bad cache, etc...) :-/

Cheers,
Willy



* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-23 11:30                                                         ` Stephan von Krawczynski
@ 2003-06-24 11:11                                                           ` Stephan von Krawczynski
  2003-06-24 17:43                                                             ` Willy Tarreau
  0 siblings, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-24 11:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: willy, marcelo, kpfleming, stoffel, gibbs, green

Hello all, hello Willy,

I tried to reproduce the problem by using your chkblk tool, but was not
successful up to now. All checksums are the same. Is it possible that the
problem lies deeper in the process than expected? Remember I do:

copy data via NFS to server
tar data on server to tape
read data back for verification with tar -d

Is it possible that the verification errors do not occur because of a read
problem, but because of a page cached block getting trashed somehow between
"tar to tape" and "read from tape". I would suspect that some blocks survive in
memory and are re-used during verification. If for some reason this data is
invalid or corrupted the verification fails although the read was correct.
I know that this sounds weird, but nevertheless possible, or not?
It may even be worse, the data may have also been left from the original nfs
action, correct?
Is there a way to completely invalidate/flush all cached blocks concerning this
fs (besides umount)?
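
For single files, one way this could be attempted on newer kernels is sketched
below. It assumes a kernel and C library that provide posix_fadvise(), which
may not be the case for the 2.4 kernels discussed here. drop_file_cache() is a
made-up helper name, and the call is only a hint that the kernel is free to
ignore.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Asks the kernel to drop the cached pages of one file (hypothetical helper).
 * Returns 0 on success or an errno-style error number.
 */
static int drop_file_cache(const char *path)
{
    int fd, err;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;

    /* dirty pages are not dropped, so they are written out first */
    if (fsync(fd) < 0) {
        err = errno;
        close(fd);
        return err;
    }

    /* offset 0 and length 0 mean "the whole file"; the call returns an
     * error number directly instead of setting errno */
    err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return err;
}
```

This would have to be called for every file of the data set before the verify
pass. It does not prove that the pages are really gone, it only requests it.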

Regards,
Stephan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-21 10:50                                                       ` Willy TARREAU
  2003-06-22 19:00                                                         ` Stephan von Krawczynski
@ 2003-06-23 11:30                                                         ` Stephan von Krawczynski
  2003-06-24 11:11                                                           ` Stephan von Krawczynski
  1 sibling, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-23 11:30 UTC (permalink / raw)
  To: Willy TARREAU
  Cc: willy, marcelo, kpfleming, stoffel, gibbs, linux-kernel, green

Hello again,

so we learned that working on the weekend is no good ;-)
The problem is back - still on 22-pre1. I had two failed verifications this
morning.
Now I am giving Willy's checksumming a try. I'll keep you informed.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-21 10:50                                                       ` Willy TARREAU
@ 2003-06-22 19:00                                                         ` Stephan von Krawczynski
  2003-06-23 11:30                                                         ` Stephan von Krawczynski
  1 sibling, 0 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-22 19:00 UTC (permalink / raw)
  To: Willy TARREAU
  Cc: willy, marcelo, kpfleming, stoffel, gibbs, linux-kernel, green

Hello all,

here is the interesting result of my working weekend with intensive testing:
As 22-pre1 just came out I decided to use it for further testing of the issue,
because I don't like testing old kernels particularly. And to my great surprise
I have not managed to break 22-pre1 so far. I have up to now moved about 1 TB
of data through the box (written to tape and verified) and have not yet
produced a single verify error.
Question is: how do I continue?
Of course the tape-writing actions will be continuing, so I still have a look
at the issue every day.
Are we interested in finding out what particular patch in pre1 is responsible
for this?

Well, at least there is the positive result that pre1 seems significantly
better...

Regards,
Stephan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-20 23:48                                                     ` Stephan von Krawczynski
@ 2003-06-21 10:50                                                       ` Willy TARREAU
  2003-06-22 19:00                                                         ` Stephan von Krawczynski
  2003-06-23 11:30                                                         ` Stephan von Krawczynski
  0 siblings, 2 replies; 64+ messages in thread
From: Willy TARREAU @ 2003-06-21 10:50 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Willy Tarreau, marcelo, kpfleming, stoffel, gibbs, linux-kernel, green

On Sat, Jun 21, 2003 at 01:48:28AM +0200, Stephan von Krawczynski wrote:
 
> Well, in fact I am a bit lost in the case, because of the sheer data volume, I
> have space for several sets on disk, but it takes a damn long time to produce
> one cycle write/verify. Anyway I will do if that helps. The big problem with
> tar is that I have (to my knowledge) no chance to let it somewhere save the
> verify-failing data parts. I guess this could help a lot, because we could then
> see what the corruption looks like, how long (in bytes) it is and so on.
> If anybody has an idea how to achieve this goal let me know.

I wanted to implement a compare-and-capture feature in my check tool, but
realized that it would certainly be of no help if you get duplicated blocks or
so, because you'll have no way to tell *where* the captured block should have
been. That's why I suggested the checksum instead : if you get a pattern such
as :
   check1  check2
0: 1234    1234
1: 4567    4567
2: 789a    4567
3: bcde    789a
4: f012    bcde

... it will mean that block 1 was duplicated in check2. If you see :

   check1  check2
0: 1234    1234
1: 4567    4567
2: 789a    4567
3: bcde    bcde
4: f012    f012

... it will mean that block 1 was repeated instead of block 2 in check2.
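
As a rough sketch, telling these two patterns apart could be automated once
both checksum files are loaded into memory. The names below (chk_classify,
chk_first_diff, chk_shift_run) are made up for illustration and are not part
of chkblk. The arrays are assumed to hold one 16 bit checksum per block, in
file order.

```c
#include <stddef.h>
#include <stdint.h>

/* Possible verdicts for one block index (hypothetical names). */
enum chk_verdict { CHK_SAME, CHK_ZERO, CHK_REPEAT, CHK_OTHER };

/*
 * Classifies block i of the second pass (b) against the first pass (a).
 */
static enum chk_verdict chk_classify(const uint16_t *a, const uint16_t *b,
                                     size_t i)
{
    if (a[i] == b[i])
        return CHK_SAME;
    if (b[i] == 0)
        return CHK_ZERO;        /* probably a block full of zeros */
    if (i > 0 && b[i] == a[i - 1])
        return CHK_REPEAT;      /* the previous block showed up again */
    return CHK_OTHER;
}

/*
 * Returns the index of the first differing block, or n if both passes agree.
 */
static size_t chk_first_diff(const uint16_t *a, const uint16_t *b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (a[i] != b[i])
            break;
    return i;
}

/*
 * Counts consecutive CHK_REPEAT verdicts starting at 'from'. A long run
 * means the stream is shifted by one block (a block was duplicated), a run
 * of one means a single block was replaced by its predecessor.
 */
static size_t chk_shift_run(const uint16_t *a, const uint16_t *b, size_t n,
                            size_t from)
{
    size_t i = from;

    while (i < n && chk_classify(a, b, i) == CHK_REPEAT)
        i++;
    return i - from;
}
```

With the two example tables above, both passes first differ at the third row.
The first table gives a run of 3 from there (shifted stream), the second a run
of 1 (single repeated block).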

If you see 0000, it probably means that you got a block full of zeros, since
the algorithm is only additive.
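
To illustrate the "only additive" point, here is a small variant of the
checksum. It is an assumption for illustration, not the exact chkblk code: it
adds the data as 16 bit words into a 64 bit accumulator, so no carry can get
lost for blocks of this size, and the result is 0000 only if every word of the
block is zero. fold16_sum is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sums a buffer as little-endian 16 bit words and folds the result down to
 * 16 bits with end-around carry. An odd trailing byte is ignored.
 */
static uint16_t fold16_sum(const unsigned char *buf, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint64_t)buf[i] | ((uint64_t)buf[i + 1] << 8);

    /* the carries are added back in, so a non-zero sum stays non-zero */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)sum;
}
```

A 1 kB block of zeros gives 0000 here, a 1 kB block of 0xff bytes gives ffff.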

The resulting files will be 1/512 of the input, I think you'll find some space
on your disk for such a file.

It may be interesting to do regular checks during the second read, so that you
can abort after the first error, and not have to get a second full read.

> Ok, weekend is here, I see what can be done.

Here is my proposed program. I tried it on my local hard disk, it took 5 min
to check the full 8 GB (30 MB/s), and I reached 123 MB/s on a 4 disks software
raid5 array with an AHA29160. It outputs the current offset every 64 MB.
Here it is running on a DDS3 :

[root@alpha /root]# ~willy/c/chkblk.alpha /dev/nst0 > nst0.chk
At offset 603979776...

I hope it can help.

Cheers,
Willy


/*
 * chkblk - computes block checksums - 2003/06/21 - Willy Tarreau <w@w.ods.org>
 *
 * This program is free, do what you want with it, I will not be responsible if
 * it trashes all your data.
 *
 * Reads a file and outputs a binary 16 bit checksum for each 1KB block.
 * Useful to check for data corruption. Eg :
 *
 *  # chkblk /dev/tape > test1.chk
 *  # chkblk /dev/tape > test2.chk
 *  # cmp -l test[12].chk
 *
 * or :
 *  # chkblk /dev/sda2 |od -tx2 -Ax > test1.txt
 *  # chkblk /dev/sda2 |od -tx2 -Ax > test2.txt
 *  # diff -u test[12].txt
 *
 * To be able to read files bigger than 2GB, you should compile it
 * with "-D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64".
 *
 *
 */

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>

#define BLOCKSIZE 1024

/* off_t is cast to long long when printed, so one format fits both the
 * 32 bit and the 64 bit off_t builds.
 */
#define OFF_T_FMT "%lld"

void usage(void) {
    fprintf(stderr,
	    "Usage: chkblk input > output\n"
	    "   - input is a file, device, ...\n"
	    "   - output will be a binary file 1/512th the size of input\n"
	    );
    exit(1);
}


int main(int argc, char **argv) {
    int fd;
    ssize_t len;
    off_t inp_off;
    unsigned long *buffer;

    if (argc != 2)
	usage();

    buffer = malloc(BLOCKSIZE);
    if (buffer == NULL) {
	fprintf(stderr,"Out of memory\n");
	exit(2);
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
	perror("open");
	exit(3);
    }

    inp_off = 0;
    while ((len = read(fd, buffer, BLOCKSIZE)) > 0) {
	unsigned long sum = 0;
	size_t off;
	inp_off += len;

	/* displays the offset every 64 MB */
	if ((inp_off & 0x3ffffff) == 0)
	    fprintf(stderr,"At offset " OFF_T_FMT "...\r", (long long)inp_off);

	/* sums whole words only, trailing bytes of a short read are ignored */
	for (off = 0; off < (size_t)len/sizeof(*buffer); off++)
	    sum += buffer[off];
	/* folds the sum down to 16 bits */
	while (sum >= (1<<16)) {
	    sum = (sum & 0xffff) + (sum >> 16);
	}
	/* low byte first, then high byte */
	putchar(sum & 0xff);
	putchar((sum >> 8) & 0xff);
    }
    fprintf(stderr,"At offset " OFF_T_FMT, (long long)inp_off);
    if (len < 0) {
	fprintf(stderr, ", read returned : \n");
	perror("");
	close(fd);
	exit(4);
    }
    else {
	fprintf(stderr, ", check completed without error\n");
    }
	
    close(fd);
    exit(0);
}

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-20 22:03                                                   ` Willy Tarreau
@ 2003-06-20 23:48                                                     ` Stephan von Krawczynski
  2003-06-21 10:50                                                       ` Willy TARREAU
  2003-06-24 18:31                                                     ` Bill Davidsen
  1 sibling, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-20 23:48 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: marcelo, kpfleming, stoffel, gibbs, linux-kernel, willy, green

On Sat, 21 Jun 2003 00:03:31 +0200
Willy Tarreau <willy@w.ods.org> wrote:

> Hi !
> 
> On Fri, Jun 20, 2003 at 06:13:53PM -0300, Marcelo Tosatti wrote:
> > > Actually, without another copy of the data on a different system to
> > > verify it with, you can't know that for sure. It could easily be getting
> > > to the tape (the actual media) just fine, but then get corrupted during
> > > the verify readback.
> > 
> > Right. Stephan, if you could use a bit of your time to isolate the problem
> > I would be VERY grateful.
> 
> I remember Stephan once said that he used tar to verify the tape, and that
> for one backup, he did several tests showing corruption on different files.
> Although that doesn't mean that the tape is written totally correctly, it
> proves that there's at least a read corruption.

Hello Willy, hello Marcelo,

in fact I noticed that doing multiple verify cycles the so-called corruption
happens rarely (read _very_ rarely) on the same files. So it is indeed very
likely that the read case is a problem.
Another thing to note is that I did not manage to produce a failed verify on a
dataset tar'ed to the 3ware raid and not to tape. I did not test that very
intensively, but from the tests I did I would have expected a corruption to
happen based on the cycles I did on tape.

> I think that comparing multiple reads to find a pattern in corruption offsets
> (if any) is the only thing he could do (not speaking about mixing read/writes
> with good/bad kernels). Of course, storing several times 70GB on disk is not
> easy, but at least a 16-bit checksum for each 1 kB block would result in
> about 140 MB files, which will be "easier" to compare. It could be enough to
> check for empty blocks, duplicated blocks or totally random ones.
> 
> Stephan, if you're willing to do the test but don't have such a tool, I may
> write a quick dirty one tomorrow if that helps.
> 
> BTW, it could be interesting to note the read buffer's hardware address for
> each test, in case it matters.

Well, in fact I am a bit lost in the case, because of the sheer data volume, I
have space for several sets on disk, but it takes a damn long time to produce
one cycle write/verify. Anyway I will do if that helps. The big problem with
tar is that I have (to my knowledge) no chance to let it somewhere save the
verify-failing data parts. I guess this could help a lot, because we could then
see what the corruption looks like, how long (in bytes) it is and so on.
If anybody has an idea how to achieve this goal let me know.

I am not 100% confident that the tests would look the same if I simply read the
whole tape onto the disks again and then verify via file compare, but anyway I
should try this too several times to complete the picture. 
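
A minimal sketch of such a file compare is given below, under the assumption
that the data read back from tape and the original both sit on disk as plain
files. It reports where the first differing run starts and how many
consecutive bytes differ. struct diff_run and first_diff_run are made-up names
for illustration.

```c
#include <stdio.h>

/* Hypothetical result record: offset is -1 and length is 0 if the compared
 * part of both streams is identical. */
struct diff_run {
    long long offset;   /* start of the first differing run */
    long long length;   /* number of consecutive differing bytes */
};

/*
 * Compares two streams byte by byte and stops after the first run of
 * differing bytes. The comparison ends at the shorter stream, a pure
 * length difference is not reported.
 */
static struct diff_run first_diff_run(FILE *fa, FILE *fb)
{
    struct diff_run r = { -1, 0 };
    long long pos = 0;
    int ca, cb;

    for (;;) {
        ca = getc(fa);
        cb = getc(fb);
        if (ca == EOF || cb == EOF)
            break;
        if (ca != cb) {
            if (r.length == 0)
                r.offset = pos;     /* remembers where the run starts */
            r.length++;
        } else if (r.length > 0) {
            break;                  /* the run is over */
        }
        pos++;
    }
    return r;
}
```

The offset and length could then be used to dump just the damaged region from
both files and look at it by hand.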

Ok, weekend is here, I see what can be done.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-20 21:13                                                 ` Marcelo Tosatti
@ 2003-06-20 22:03                                                   ` Willy Tarreau
  2003-06-20 23:48                                                     ` Stephan von Krawczynski
  2003-06-24 18:31                                                     ` Bill Davidsen
  0 siblings, 2 replies; 64+ messages in thread
From: Willy Tarreau @ 2003-06-20 22:03 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Kevin P. Fleming, Stephan von Krawczynski, stoffel, gibbs,
	linux-kernel, willy, green

Hi !

On Fri, Jun 20, 2003 at 06:13:53PM -0300, Marcelo Tosatti wrote:
> > Actually, without another copy of the data on a different system to
> > verify it with, you can't know that for sure. It could easily be getting
> > to the tape (the actual media) just fine, but then get corrupted during
> > the verify readback.
> 
> Right. Stephan, if you could use a bit of your time to isolate the problem
> I would be VERY grateful.

I remember Stephan once said that he used tar to verify the tape, and that for
one backup, he did several tests showing corruption on different files. Although
that doesn't mean that the tape is written totally correctly, it proves that
there's at least a read corruption.

I think that comparing multiple reads to find a pattern in corruption offsets
(if any) is the only thing he could do (not speaking about mixing read/writes
with good/bad kernels). Of course, storing several times 70GB on disk is not
easy, but at least a 16-bit checksum for each 1 kB block would result in about
140 MB files, which will be "easier" to compare. It could be enough to check
for empty blocks, duplicated blocks or totally random ones.
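
The size estimate works out as follows, assuming one 2 byte checksum per
1024 byte block and a final short block that still gets its own checksum.
chk_output_bytes is a made-up helper name.

```c
#include <stdint.h>

/*
 * Size in bytes of the checksum file for a given input size, with one
 * 16 bit checksum per 1024 byte block (a partial last block counts too).
 */
static uint64_t chk_output_bytes(uint64_t input_bytes)
{
    uint64_t blocks = (input_bytes + 1023) / 1024;

    return blocks * 2;
}
```

For 70 GB taken as 70 * 2^30 bytes this gives 146800640 bytes, which is
140 * 2^20, so the checksum file is 1/512 of the input.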

Stephan, if you're willing to do the test but don't have such a tool, I may
write a quick dirty one tomorrow if that helps.

BTW, it could be interesting to note the read buffer's hardware address for
each test, in case it matters.

Cheers,
Willy


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-20 20:59                                               ` Kevin P. Fleming
@ 2003-06-20 21:13                                                 ` Marcelo Tosatti
  2003-06-20 22:03                                                   ` Willy Tarreau
  0 siblings, 1 reply; 64+ messages in thread
From: Marcelo Tosatti @ 2003-06-20 21:13 UTC (permalink / raw)
  To: Kevin P. Fleming
  Cc: Stephan von Krawczynski, stoffel, gibbs, linux-kernel, willy, green


On Fri, 20 Jun 2003, Kevin P. Fleming wrote:

> Marcelo Tosatti wrote:
>
> > So the data is intact when it arrives on the 3ware and gets corrupted
> > on the write to the tape?
> >
>
> Actually, without another copy of the data on a different system to
> verify it with, you can't know that for sure. It could easily be getting
> to the tape (the actual media) just fine, but then get corrupted during
> the verify readback.

Right. Stephan, if you could use a bit of your time to isolate the problem
I would be VERY grateful.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-20 19:59                                             ` Marcelo Tosatti
@ 2003-06-20 20:59                                               ` Kevin P. Fleming
  2003-06-20 21:13                                                 ` Marcelo Tosatti
  0 siblings, 1 reply; 64+ messages in thread
From: Kevin P. Fleming @ 2003-06-20 20:59 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephan von Krawczynski, stoffel, gibbs, linux-kernel, willy, green

Marcelo Tosatti wrote:

> So the data is intact when it arrives on the 3ware and gets corrupted
> on the write to the tape?
> 

Actually, without another copy of the data on a different system to verify it 
with, you can't know that for sure. It could easily be getting to the tape (the 
actual media) just fine, but then get corrupted during the verify readback.



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-18 11:05                                           ` Stephan von Krawczynski
  2003-06-18 14:21                                             ` John Stoffel
@ 2003-06-20 19:59                                             ` Marcelo Tosatti
  2003-06-20 20:59                                               ` Kevin P. Fleming
  1 sibling, 1 reply; 64+ messages in thread
From: Marcelo Tosatti @ 2003-06-20 19:59 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: stoffel, gibbs, linux-kernel, willy, green



On Wed, 18 Jun 2003, Stephan von Krawczynski wrote:

> On Tue, 17 Jun 2003 17:47:02 -0300 (BRT)
> Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
>
> >
> >
> > On Fri, 13 Jun 2003, Stephan von Krawczynski wrote:
> >
> > > Hello all,
> > >
> > > this is the second day of stress-testing pure rc8 in SMP, apic mode. Today
> > > everything is fine, no freeze, no data corruption.
> > >
> > > current standings:
> > >
> > > 2 days continuous test, one file data corruption on day 1
> >
> >
> > > What kind of data corruption and what tests are you doing? (sorry if you
> > > already mentioned that on the list)
>
> Today's score:
>
> 7 days continuous test
> one file data corruption on day 1
> one file data corruption on day 4
> two file data corruptions on day 6
>
> Test is performed as follows:
>
> around 70-100 GB of data is transferred to a nfs-server with rc8 onto a
> RAID5 on 3ware-controller. The data is then copied via tar onto a SDLT
> drive connected to an aic controller. Afterwards the data is verified by
> tar.

So the data is intact when it arrives on the 3ware and gets corrupted
on the write to the tape?


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-18 14:21                                             ` John Stoffel
@ 2003-06-18 14:54                                               ` Stephan von Krawczynski
  0 siblings, 0 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-18 14:54 UTC (permalink / raw)
  To: John Stoffel; +Cc: marcelo, stoffel, gibbs, linux-kernel, willy, green

On Wed, 18 Jun 2003 10:21:25 -0400
"John Stoffel" <stoffel@lucent.com> wrote:

> 
> Stephan> 7 days continuous test
> Stephan> one file data corruption on day 1
> Stephan> one file data corruption on day 4
> Stephan> two file data corruptions on day 6
>  
> Stephan> Test is performed as follows:
> 
> Stephan> around 70-100 GB of data is transferred to a nfs-server with
> Stephan> rc8 onto a RAID5 on 3ware-controller.  The data is then
> Stephan> copied via tar onto a SDLT drive connected to an aic
> Stephan> controller.  Afterwards the data is verified by tar.
> 
> Is the data verified after the transfer to the NFS server?  Does it
> pass muster then using MD5 sums on the files?

No, the content is not verified to be the same as on the nfs clients. But
this is not the point here, it could as well be bad content that is saved to
tape, and if you get wrong verification for this, your bad data simply got
worse. Right?

> What happens if you cut the tape drive out of the loop and copy the
> data to another partition on the 3ware controller and do the compare
> then?

I have not managed to get the corruption on archives written to (the same)
3ware partition instead of tape up to this day.

> 
> I assume you're doing:
> 
>   tar -c -f /dev/tape --verify /path/to/files

No. See your second guess.

> and that's when you get the errors?  Or are you writing to tape, and
> then doing a compare with:
> 
>   tar -c -f /dev/tape /path/to/files
>   tar -d -f /dev/tape /path/to/files

Yes, I am separately verifying with "-d".

> Stephan> Since rc8 this runs stable (froze before during the first
> Stephan> day).
> 
> How much RAM is in the box, and how much free space is on the
> filesystem?  I've been trying to replicate this type of test on
> 2.5.7x, but I've been having issues.  I'm also just dumping a pile of
> MP3s to tape and reading them back to check.

See first post of the thread, in case it already vanished: 3 GB RAM, 320 GB
filesystem space, at least half free.

> Stephan> Most of the several files tar'ed are beyond the 2 GB file
> Stephan> size. They vary from around 100 MB up to about 15 GB per file,
> Stephan> around 70 GB minimum summed up.  Of course I exchanged the
> Stephan> tapes and the drive. Didn't get better.
> 
> This is an interesting data point.  What happens if you make all the
> files 2.5 GB in size, do you get corruption more consistently then?

Hm, I haven't tried this so far. My next guess would have been not to verify
but to read the data completely in (to disk) again and then do a verification
based on a file-compare utility. If there is a difference one can have a real
look at the data, which is a bit of a mess on tape.

> I'm interested in this issue because I want to make sure that tape
> backups work reliably on Linux.  

Well, two of the same kind :-)

Regards,
Stephan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-18 11:05                                           ` Stephan von Krawczynski
@ 2003-06-18 14:21                                             ` John Stoffel
  2003-06-18 14:54                                               ` Stephan von Krawczynski
  2003-06-20 19:59                                             ` Marcelo Tosatti
  1 sibling, 1 reply; 64+ messages in thread
From: John Stoffel @ 2003-06-18 14:21 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Marcelo Tosatti, stoffel, gibbs, linux-kernel, willy, green


Stephan> 7 days continuous test
Stephan> one file data corruption on day 1
Stephan> one file data corruption on day 4
Stephan> two file data corruptions on day 6
 
Stephan> Test is performed as follows:

Stephan> around 70-100 GB of data is transferred to a nfs-server with
Stephan> rc8 onto a RAID5 on 3ware-controller.  The data is then
Stephan> copied via tar onto a SDLT drive connected to an aic
Stephan> controller.  Afterwards the data is verified by tar.

Is the data verified after the transfer to the NFS server?  Does it
pass muster then using MD5 sums on the files?

What happens if you cut the tape drive out of the loop and copy the
data to another partition on the 3ware controller and do the compare
then?

I assume you're doing:

  tar -c -f /dev/tape --verify /path/to/files

and that's when you get the errors?  Or are you writing to tape, and
then doing a compare with:

  tar -c -f /dev/tape /path/to/files
  tar -d -f /dev/tape /path/to/files

Stephan> Since rc8 this runs stable (froze before during the first
Stephan> day).

How much RAM is in the box, and how much free space is on the
filesystem?  I've been trying to replicate this type of test on
2.5.7x, but I've been having issues.  I'm also just dumping a pile of
MP3s to tape and reading them back to check.

Stephan> Most of the several files tar'ed are beyond the 2 GB file
Stephan> size. They vary from around 100 MB up to about 15 GB per file,
Stephan> around 70 GB minimum summed up.  Of course I exchanged the
Stephan> tapes and the drive. Didn't get better.

This is an interesting data point.  What happens if you make all the
files 2.5 GB in size, do you get corruption more consistently then?

I'm interested in this issue because I want to make sure that tape
backups work reliably on Linux.  

John

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-17 20:47                                         ` Marcelo Tosatti
@ 2003-06-18 11:05                                           ` Stephan von Krawczynski
  2003-06-18 14:21                                             ` John Stoffel
  2003-06-20 19:59                                             ` Marcelo Tosatti
  0 siblings, 2 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-18 11:05 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: stoffel, gibbs, linux-kernel, willy, green

On Tue, 17 Jun 2003 17:47:02 -0300 (BRT)
Marcelo Tosatti <marcelo@conectiva.com.br> wrote:

> 
> 
> On Fri, 13 Jun 2003, Stephan von Krawczynski wrote:
> 
> > Hello all,
> >
> > this is the second day of stress-testing pure rc8 in SMP, apic mode. Today
> > everything is fine, no freeze, no data corruption.
> >
> > current standings:
> >
> > 2 days continuous test, one file data corruption on day 1
> 
> 
> What kind of data corruption and what tests are you doing? (sorry if you
> already mentioned that on the list)

Today's score:

7 days continuous test
one file data corruption on day 1
one file data corruption on day 4
two file data corruptions on day 6
 
Test is performed as follows:

around 70-100 GB of data is transferred to a nfs-server with rc8 onto a RAID5
on 3ware-controller.
The data is then copied via tar onto a SDLT drive connected to an aic
controller.
Afterwards the data is verified by tar.

Since rc8 this runs stable (froze before during the first day).
What's left is that the verify sometimes fails (see above). It does not look
like a write error to tape, because when retrying the verify cycle the errors
occur in other files most of the time (or even none at all). It seems reading
back is the problem. I doubt the problem lies on the 3ware side, because this
would mean you cannot use it at all (there should be errors all over other
actions as well then).
Most of the several files tar'ed are beyond the 2 GB file size. They vary from
around 100 MB up to about 15 GB per file, around 70 GB minimum summed up.
Of course I exchanged the tapes and the drive. Didn't get better.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-13  9:45                                       ` Stephan von Krawczynski
  2003-06-15 12:56                                         ` Stephan von Krawczynski
@ 2003-06-17 20:47                                         ` Marcelo Tosatti
  2003-06-18 11:05                                           ` Stephan von Krawczynski
  1 sibling, 1 reply; 64+ messages in thread
From: Marcelo Tosatti @ 2003-06-17 20:47 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: John Stoffel, gibbs, linux-kernel, willy, green



On Fri, 13 Jun 2003, Stephan von Krawczynski wrote:

> Hello all,
>
> this is the second day of stress-testing pure rc8 in SMP, apic mode. Today
> everything is fine, no freeze, no data corruption.
>
> current standings:
>
> 2 days continuous test, one file data corruption on day 1


What kind of data corruption and what tests are you doing? (sorry if you
already mentioned that on the list)



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-15 12:56                                         ` Stephan von Krawczynski
@ 2003-06-15 13:26                                           ` John Stoffel
  0 siblings, 0 replies; 64+ messages in thread
From: John Stoffel @ 2003-06-15 13:26 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: linux-kernel, stoffel, gibbs, willy, marcelo, green


Stephan> this is the fourth day of stress-testing pure rc8/2.4.21 in
Stephan> SMP, apic mode. Today another corruption happened.
 
Stephan> current standings:
 
Stephan> 4 days continuous test, 
Stephan> one file data corruption on day 1
Stephan> one file data corruption on day 4

Can you define corruption?  Can you tell us what commands you are
using to generate the data which is written to tape?  

John


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-13  9:45                                       ` Stephan von Krawczynski
@ 2003-06-15 12:56                                         ` Stephan von Krawczynski
  2003-06-15 13:26                                           ` John Stoffel
  2003-06-17 20:47                                         ` Marcelo Tosatti
  1 sibling, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-15 12:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: stoffel, gibbs, willy, marcelo, green

Hello all,
 
this is the fourth day of stress-testing pure rc8/2.4.21 in SMP, apic mode. Today
another corruption happened.
 
current standings:
 
4 days continuous test, 
one file data corruption on day 1
one file data corruption on day 4

Regards,
Stephan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-11 21:01                                     ` John Stoffel
@ 2003-06-13  9:45                                       ` Stephan von Krawczynski
  2003-06-15 12:56                                         ` Stephan von Krawczynski
  2003-06-17 20:47                                         ` Marcelo Tosatti
  0 siblings, 2 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-13  9:45 UTC (permalink / raw)
  To: John Stoffel; +Cc: gibbs, linux-kernel, willy, marcelo, green

Hello all,

this is the second day of stress-testing pure rc8 in SMP, apic mode. Today
everything is fine, no freeze, no data corruption.

current standings:

2 days continuous test, one file data corruption on day 1

Regards,
Stephan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-11 20:23                                   ` Stephan von Krawczynski
  2003-06-11 21:01                                     ` John Stoffel
@ 2003-06-12 13:54                                     ` Stephan von Krawczynski
  1 sibling, 0 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-12 13:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: gibbs, willy, marcelo, green

On Wed, 11 Jun 2003 22:23:46 +0200
Stephan von Krawczynski <skraw@ithnet.com> wrote:

> Hello,
> [...]
> Anyway it looks like failures have gotten fewer since rc8. I will try an
> overnight stress test now to see if I get it freezing again.

Interestingly it does not freeze. One file shows data corruption, but the
system looks stable. None of the older rc's made it up to this point. Looks
like something in rc8 got better and I am in fact experiencing a set of bugs
and not only one.

Regards,
Stephan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-11 20:23                                   ` Stephan von Krawczynski
@ 2003-06-11 21:01                                     ` John Stoffel
  2003-06-13  9:45                                       ` Stephan von Krawczynski
  2003-06-12 13:54                                     ` Stephan von Krawczynski
  1 sibling, 1 reply; 64+ messages in thread
From: John Stoffel @ 2003-06-11 21:01 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Justin T. Gibbs, linux-kernel, willy, marcelo, green


Stephan> I switched to rc8 (SMP, apic), took three cycles until it
Stephan> failed.  rc8 (SMP, apic, HIGHIO) failed on the first try.  I
Stephan> thought HIGHIO could make a difference if there were inherent
Stephan> problems with bounce buffers. Unfortunately this does not
Stephan> seem to be the case.

I'm doing testing on 2.5.70-mm3, SMP, APIC, PREEMPT with an AIC7880
driving a DLT7000 along with some idle disks on the same bus.  I'm
dumping data to tape and verifying it.  Once I get more data, I'll
follow up.

John

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-11  4:39                                 ` Justin T. Gibbs
@ 2003-06-11 20:23                                   ` Stephan von Krawczynski
  2003-06-11 21:01                                     ` John Stoffel
  2003-06-12 13:54                                     ` Stephan von Krawczynski
  0 siblings, 2 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-11 20:23 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-kernel, willy, marcelo, green

Hello,

a short note on today's test cycles.
I switched to rc8 (SMP, apic), took three cycles until it failed.
rc8 (SMP, apic, HIGHIO) failed on the first try.
I thought HIGHIO could make a difference if there were inherent problems with
bounce buffers. Unfortunately this does not seem to be the case.

Anyway it looks like failures have gotten fewer since rc8. I will try an
overnight stress test now to see if I get it freezing again.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-11  0:51                               ` Stephan von Krawczynski
@ 2003-06-11  4:39                                 ` Justin T. Gibbs
  2003-06-11 20:23                                   ` Stephan von Krawczynski
  0 siblings, 1 reply; 64+ messages in thread
From: Justin T. Gibbs @ 2003-06-11  4:39 UTC (permalink / raw)
  To: Stephan von Krawczynski, Justin T. Gibbs
  Cc: linux-kernel, willy, marcelo, green

>> 99% of the problems have to do with broken interrupt routing.  There is
>> plenty of information about this issue on the mailing lists, but people
>> still ask me.
>
> You should state an exact definition of "broken interrupt routing" in this
> case. The only thing I would call a broken interrupt routing is if an
> interrupt does not show up at all.

That's the only definition for it and 99% of the email I field about
the aic7xxx driver is due to interrupts *not arriving*.

>> I just don't believe that this is true.  Most of the questions that people
>> email me directly are questions that are easily answered by a google search.
>> In other words, the information is already readily available.  It is just
>> easier to send email than to actually investigate a potential solution
>> to the problem.  So, people send email and ask the same questions, and
>> get the same answers.
>
> Do you have a FAQ?

It's the driver readme file.

>> You're the one being silly.  You are oversimplifying what it takes to
>> do I/O and the components that are involved in doing that I/O.  If you
>> don't understand that the load on several components in the kernel changes,
>> often in subtle but important ways, when you change the target of your
>> I/O, then I don't know what to say to you.
>
> Data corruption is nothing subtle. We are not talking about performance tweaks,
> we are talking about the basics. Something like "a synchronous action (like
> reading during a verify) has to be synchronous". We are not talking about a
> hardware related problem on scsi bus. We are not talking about the box
> stumbling over a massive data flood. We are talking about reading a file/device
> to a memory buffer and doing a cmp action between two of those. If your os is
> not able to perform something like this you can do virtually nothing, not even
> booting (because your reading action corrupts the data).

And with any experience you will find that subtle races in all of these
"basic operations" can often only be triggered by certain scenarios.  Saying
that "well my machine boots" is not enough to prove that the components
involved to that point are bug free.  You may be able to operate just
fine in 99% of your test scenarios yet still have a very catastrophic
flaw in the code.

>> >> >>  When testing our drivers against RHAS2.1 we found that the stock
>> >> >> kernel had data corruption issues very similar to what you are talking
>> >> >> about when run on very fast, hyperthreading, SMP machines.  The data
>> >> >> corruption occurred with any SCSI controller we tried, regardless of
>> >> > vendor.
>> >> >
>> >> > My question is: is it solved?
>> >>
>> >> My understanding is that it was fixed in 2.4.18 level kernels, but since
>> >> I don't know the root cause of the corruption, it could have just been
>> >> made more difficult to reproduce.
>> >
>> > Can you point to some URL where information about this is available?
>>
>> https://rhn.redhat.com/errata/RHSA-2003-147.html
>
> The scenario described there is unlikely for my case because
> a) I have only 3 GB of mem
> b) no hints are available that UP can solve the problem on the same hardware

This is only the latest corruption bug that has been addressed.  You
should really read all of the kernel errata.  The one we hit originally
was this one:

https://rhn.redhat.com/errata/RHSA-2002-227.html

I'm not saying that this is your problem or even related, but just to
point out that the type of data corruption you are talking about can
occur due to bugs in core kernel functionality.

>> To reproduce your problem, I need the same MB, memory configuration, drive
>> types, a 3ware card, and the same tape drive you have.  I have tried various
>> backup scenarios with *other hardware* and have failed to reproduce your
>> problem.
>
> I have talked to others with similar problems and none has the same mb or a
> 3ware controller.

Define similar.  You are the only person I know of that is currently
indicating they are having *data corruption* with the aic7xxx driver.
That is, in particular, what I am trying to reproduce locally.

> All have problems with streamers on aic. All solutions I
> heard so far were done by replacing aic by whatever strange controller
> they got their hands on.

I'm glad they were able to resolve their problems.

>> >> I suggest you go browse the code that is exercised by such an activity
>> >> before you say that.
>> >
>> > What kind of a statement is this?
>>
>> It's one way of saying that you need to understand all of the code involved
>> with turning a write syscall into a call into the aic7xxx driver.  If you
>> review the code path, you'll find that there are thousands of lines of
>> code involved that have nothing to do with SCSI or the aic7xxx driver.
>> To say that you have created a simple example that proves that the problem
>> is in the aic7xxx driver is naive at best.
>
> To tell me it is not is just as good.

You mean "just as naive"?  Pointing your finger at the aic7xxx driver
is not going to solve your problem.  Ruling out other system components
(of which there are many in your test case) also won't help find it.

>> In this case, the information you have so far provided points away from
>> the aic7xxx driver.  I don't say that in all cases that I investigate,
>> but I believe it to be true in this case.  If past experience is any guide,
>> 80-90% of the problems like this that I have debugged (and that I could
>> actually replicate) were induced by using the aic7xxx driver, but turned
>> out to be bugs in other components in the system.  The aic7xxx driver
>> happens to be one of the more aggressive SCSI drivers in the system and
>> that can often lead to finding bugs in other components.
>
> Aggressive is indeed a good term for it. And it describes exactly what I don't
> like about it.

Then don't choose to use it.

> The primary goal of a driver (in my eyes) is to make some
> connected hardware work as expected. It is definitely not its primary goal to
> be overly brilliant and therefore detecting bugs in other subsystems.

My goal is to take full advantage of the hardware I support in my drivers.
That isn't an attempt to be "brilliant", but rather just taking advantage
of the hardware you have purchased.  The end result is that for instance
the aic79xx driver can achieve sustained random I/O throughput 40% above
its main competitor.  That isn't an attempt to break the rest of linux,
but to get the most performance possible out of Linux.

> I have
> told you months ago that a symbios driven system feels somehow smoother and
> faster - elegant.

Which doesn't tell me anything about the relative performance of the
two drivers.  Such subjective remarks do not provide any feedback that
can be turned into a concrete plan to improve the driver.  They don't even
really tell me what you think is wrong with it.

> And btw: you win nothing with your way, not even performance.

Another unsubstantiated claim.  Again, if you don't like the driver, or
its style, you should just use something else if it will make you happier.
It certainly sounds like that is the case.

>> I have lots of test setups that show the aic7xxx and aic79xx driver working
>> just fine in PIII and P4 dual and quad configurations with and without apic
>> interrupt routing and writing to tape.
>
> This does only mean you have not yet met something similar to my setup. It
> does not really prove a lot.

Which is exactly my point!  You act as though I should be able to magically
reproduce and fix your problem.  I've said that I can't reproduce it and
that means I can't fix it without more information.  I never claimed anything
more than that other than your current data points do not, in my opinion,
point to an aic7xxx driver problem.  That doesn't *eliminate* the aic7xxx
driver as a cause just as your test cases don't eliminate the other
components of the system.

> Well, the thing is, I try to achieve information. But since the whole issue is
> all about lots of data I try to find an intelligent way to locate the cause of
> it all. I am not very confident that analysis of the trashed data will lead
> somewhere.

If you filter the available information down to only what you believe will be
relevant to solving the problem, you will likely filter out things that might
give others a clue as to the true cause of your problem.

> I think narrowing the code path that leads to the problem by
> multiple distinct test scenarios looks more/faster promising. Can you think of
> something reducing the test complexity (not using tar, not comparing to a file
> or whatever)?

I would be analyzing the current failure modes first, but if you just want
to try to narrow the cause by varying your configuration, you could do
that by using a different source filesystem or even using /dev/zero or
a program that generates the data that will be written to tape.  You might
also try to determine if the corruption happens when the tape is written
or if the data is corrupted during the read.  You could do this by
doing multiple read sessions to see if the corruption is consistent or
doing the write in what appears to be a safe kernel mode and the read
in the unsafe kernel and vice versa, etc.
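
The multiple-read-session idea can be sketched mechanically. This is only a
minimal Python illustration, assuming each read pass from tape has been dumped
to a regular file and the source data is still available as a reference. The
function names and the 64 KiB chunk size are hypothetical choices, not
anything taken from tar, st.c or the aic7xxx driver.

```python
import hashlib

def chunk_digests(stream, chunk_size=64 * 1024):
    """Return one SHA-1 digest per fixed-size chunk of a binary stream."""
    digests = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digests.append(hashlib.sha1(chunk).hexdigest())
    return digests

def compare_passes(reference, passes):
    """Compare several read passes against the reference digests.

    Returns {chunk_index: "consistent" | "varies"} for every chunk where at
    least one pass disagrees with the reference.
    """
    report = {}
    for i, ref in enumerate(reference):
        seen = [p[i] if i < len(p) else None for p in passes]
        if all(d == ref for d in seen):
            continue
        # Same wrong digest on every pass: the bad data is probably on the
        # tape itself.  Different digests per pass: suspect the read side.
        report[i] = "consistent" if len(set(seen)) == 1 else "varies"
    return report
```

A chunk reported as "consistent" was wrong in the same way on every read,
which hints at corruption during the write. A chunk reported as "varies"
hints at corruption during the read. Neither result is proof on its own.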

--
Justin


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-10 18:07                             ` Justin T. Gibbs
@ 2003-06-11  0:51                               ` Stephan von Krawczynski
  2003-06-11  4:39                                 ` Justin T. Gibbs
  0 siblings, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-11  0:51 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-kernel, willy, marcelo, green

On Tue, 10 Jun 2003 12:07:00 -0600
"Justin T. Gibbs" <gibbs@scsiguy.com> wrote:

> >> I never said that it wasn't serious, I just haven't seen any indication
> >> that this problem is caused by my driver.  There is a big difference.
> >> If your complaint is that I typically help people to solve their problems
> >> *off-list*, then I'm sorry if that offends your sensibilities.
> >
> > It does not offend my sensibilities, it is simply damaging the available
> > information about typical problems and their solving. If you don't do it
> > open, there is no way for others to follow your thoughts and debugging and
> > therefore you are confronted a hundred times with the same questions. People
> > have no choice but asking you, because your debugging cases are hidden.
> 
> 99% of the problems have to do with broken interrupt routing.  There is
> plenty of information about this issue on the mailing lists, but people
> still ask me.

You should state an exact definition of "broken interrupt routing" in this
case. The only thing I would call a broken interrupt routing is if an interrupt
does not show up at all. Everything else is in my eyes a broken interrupt
handling in the driver (generally spoken). A driver has (in my programming
world) to cope with:
- interrupts showing up immediately during the currently running interrupt
handling (immediate recausing)
- multiple interrupt causes per one shot (software or interrupt controller were
too lazy for producing single interrupts per cause)
- lost interrupts (may cause error condition of course but at least a message
in some log)
- continuous interrupts (handler has to know when he is too long inside
interrupt and give the rest of the system a chance to survive)
- optimistic interrupt requeuing (handler has to know from the past what is the
right flow of interrupt causes in a multiple caused interrupt, though hardware
may be unable to tell him).
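
A few of the points in this list (several causes per interrupt, an interrupt
with no cause, and a limit on time spent in the handler) can be illustrated
with a toy model. This is a hedged Python sketch only: FakeDevice,
service_interrupt and the budget parameter are hypothetical names made up for
illustration and do not reflect how the aic7xxx driver or any kernel code is
written.

```python
class FakeDevice:
    """Hypothetical device model: a queue of pending interrupt causes."""

    def __init__(self, causes):
        self.causes = list(causes)

    def pending(self):
        # Analogous to reading an interrupt status register.
        return list(self.causes)

    def ack(self, cause):
        # Analogous to servicing and clearing one cause.
        self.causes.remove(cause)


def service_interrupt(dev, budget=4, log=print):
    """Service pending causes, but never more than `budget` per invocation.

    Returns (handled, more_pending) so the caller can reschedule the rest.
    """
    handled = 0
    pending = dev.pending()
    if not pending:
        # Interrupt arrived but nothing is pending: at least leave a trace.
        log("interrupt with no pending cause")
        return 0, False
    while pending and handled < budget:
        dev.ack(pending[0])
        handled += 1
        # Re-read the status: new causes may have shown up meanwhile.
        pending = dev.pending()
    return handled, bool(pending)
```

With three causes queued and a budget of two, the first call handles two
causes and reports that more work is pending. The second call drains the rest.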

> I just don't believe that this is true.  Most of the questions that people
> email me directly are questions that are easily answered by a google search.
> In other words, the information is already readily available.  It is just
> easier to send email than to actually investigate a potential solution
> to the problem.  So, people send email and ask the same questions, and
> get the same answers.

Do you have a FAQ?

> >> >> a buffer layer bug, or a filesystem bug.
> >> >
> >> > /dev/tape with a filesystem? Have you read what we are talking about?
> >>
> >> Where did you get the data to place on the tape?  /dev/zero?
> >
> > Don't be silly. If reading a file from some hd would be a problem in
> > itself, then we could all go home and have a beer. You are talking about
> > the minimum requirement for an os.
> 
> You're the one being silly.  You are oversimplifying what it takes to
> do I/O and the components that are involved in doing that I/O.  If you
> don't understand that the load on several components in the kernel changes,
> often in subtle but important ways, when you change the target of your
> I/O, then I don't know what to say to you.

Data corruption is nothing subtle. We are not talking about performance tweaks,
we are talking about the basics. Something like "a synchronous action (like
reading during a verify) has to be synchronous". We are not talking about a
hardware related problem on scsi bus. We are not talking about the box
stumbling over a massive data flood. We are talking about reading a file/device
to a memory buffer and doing a cmp action between two of those. If your os is
not able to perform something like this you can do virtually nothing, not even
booting (because your reading action corrupts the data).

> >> >>  When testing our drivers against RHAS2.1 we found that the stock
> >> >> kernel had data corruption issues very similar to what you are talking
> >> >> about when run on very fast, hyperthreading, SMP machines.  The data
> >> >> corruption occurred with any SCSI controller we tried, regardless of
> >> > vendor.
> >> >
> >> > My question is: is it solved?
> >>
> >> My understanding is that it was fixed in 2.4.18 level kernels, but since
> >> I don't know the root cause of the corruption, it could have just been
> >> made more difficult to reproduce.
> >
> > Can you point to some URL where information about this is available?
> 
> https://rhn.redhat.com/errata/RHSA-2003-147.html

The scenario described there is unlikely for my case because 
a) I have only 3 GB of mem
b) no hints are available that UP can solve the problem on the same hardware 


> > No, it is only the most simple one. Unfortunately scsi-driver development
> > is everything but simple for the standard problem case. It requires the
> > ability to set up equipment just like the discussed case for reproduction
> > of the problem.  Of course only for cases the author cannot reproduce
> > inside his software via brain.  All information needed to reproduce the
> > main problem is available in this thread.
> 
> To reproduce your problem, I need the same MB, memory configuration, drive
> types, a 3ware card, and the same tape drive you have.  I have tried various
> backup scenarios with *other hardware* and have failed to reproduce your
> problem.

I have talked to others with similar problems and none has the same mb or a
3ware controller. All have problems with streamers on aic. All solutions I
heard so far were done by replacing aic by whatever strange controller they got
their hands on.

> >> I suggest you go browse the code that is exercised by such an activity
> >> before you say that.
> >
> > What kind of a statement is this?
> 
> It's one way of saying that you need to understand all of the code involved
> with turning a write syscall into a call into the aic7xxx driver.  If you
> review the code path, you'll find that there are thousands of lines of
> code involved that have nothing to do with SCSI or the aic7xxx driver.
> To say that you have created a simple example that proves that the problem
> is in the aic7xxx driver is naive at best.

To tell me it is not is just as good. 

> In this case, the information you have so far provided points away from
> the aic7xxx driver.  I don't say that in all cases that I investigate,
> but I believe it to be true in this case.  If past experience is any guide,
> 80-90% of the problems like this that I have debugged (and that I could
> actually replicate) were induced by using the aic7xxx driver, but turned
> out to be bugs in other components in the system.  The aic7xxx driver
> happens to be one of the more aggressive SCSI drivers in the system and
> that can often lead to finding bugs in other components.

Aggressive is indeed a good term for it. And it describes exactly what I don't
like about it. The primary goal of a driver (in my eyes) is to make some
connected hardware work as expected. It is definitely not its primary goal to
be overly brilliant and therefore detecting bugs in other subsystems. I have
told you months ago that a symbios driven system feels somehow smoother and
faster - elegant. Whereas aic gives you the feeling someone tried to kick the
system's butt with a big hammer. It's a matter of style and _defensiveness_.
As long as you ride it aggressively, don't complain that a lot of people go after you
for explanations.
And btw: you win nothing with your way, not even performance.

> > Back to the facts:
> > Simple question: you say it's not a problem inside the driver. Ok. Question:
> > how do you prove that? Can you specify a test setup (program or something)
> > I can check to see that there is no problem with the general SMP tape usage
> > of the aic driver? I mean you must have seen something working, or not?
> 
> The only way to do this is to find the actual bug.  The problem feels like
> a VM or FS race condition most likely caused by having the source controller
> and the destination controller on separate interrupts in the apic case so
> that you have real concurrency in the system.  In the non apic case, it looks
> like everyone shares the same interrupt, so you cannot field interrupts
> for both the 3ware and the aic7xxx driver at the same time.  I also say
> this because data corruption is something that is very difficult for the
> aic7xxx driver to accomplish without there being some kind of error message
> from the driver.

Well, at least I managed to get some interesting statement from you after all.
I have to think about this a bit.

> I have lots of test setups that show the aic7xxx and aic79xx driver working
> just fine in PIII and P4 dual and quad configurations with and without apic
> interrupt routing and writing to tape.

This does only mean you have not yet met something similar to my setup. It does
not really prove a lot.

>  There's not much more that I can
> do here without having your exact system here or having more information.

Well, the thing is, I try to achieve information. But since the whole issue is
all about lots of data I try to find an intelligent way to locate the cause of
it all. I am not very confident that analysis of the trashed data will lead
somewhere. I think narrowing the code path that leads to the problem by
multiple distinct test scenarios looks more/faster promising. Can you think of
something reducing the test complexity (not using tar, not comparing to a file
or whatever)?

Regards,
Stephan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-10 18:15                                 ` Zwane Mwaikambo
@ 2003-06-10 23:55                                   ` Stephan von Krawczynski
  0 siblings, 0 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-10 23:55 UTC (permalink / raw)
  To: Zwane Mwaikambo; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003 14:15:58 -0400 (EDT)
Zwane Mwaikambo <zwane@linuxpower.ca> wrote:

> On Tue, 10 Jun 2003, Stephan von Krawczynski wrote:
> 
> > The controller used is the second aic7xxx. The 31 interrupts on CPU0 have
> > occurred before the test. This setup fails during verify (data corruption).
> > 
> > I would say that the interrupt code of the aic in itself is therefore ok
> > with SMP. If it were a SMP race condition inside the interrupt routine this
> > test should have been ok (as only one CPU is used).
> 
> Thanks for verifying this, at least i know the problem isn't with 
> interrupt routing in your specific case.
> 
> 	Zwane

I guess your comment is a bit ahead of my tests. I just completed the test with
rc7+aic20030603 SMP, apic and maxcpus=1. It fails.
This means that although there is only one CPU used through the whole kernel
the data corruption occurs.
I would therefore conclude that the corruption is only possible if in fact the
standard code path is flaky in terms of data completeness per request.
Something like a broken synchronous action, a read request coming back
completed although it is in fact still running or the like.
May also be a misinterpretation of a kind of an "action completed" interrupt.
Or something like one interrupt for multiple running actions with a mixup of
the various causes.
To make sure it is not a problem in the SMP code path through the driver I have
to check a UP kernel with apic support enabled. I will do this tomorrow.
If this is ok then things are simple, because it's nailed down to the SMP code
path without a concurrency cause.
Let's see ...

Regards,
Stephan



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-10 17:44                               ` Stephan von Krawczynski
  2003-06-10 18:15                                 ` Zwane Mwaikambo
@ 2003-06-10 18:20                                 ` Zwane Mwaikambo
  1 sibling, 0 replies; 64+ messages in thread
From: Zwane Mwaikambo @ 2003-06-10 18:20 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003, Stephan von Krawczynski wrote:

> occurred before the test. This setup fails during verify (data corruption).

Can you reproduce this with disks only?

	Zwane
-- 
function.linuxpower.ca

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-10 17:44                               ` Stephan von Krawczynski
@ 2003-06-10 18:15                                 ` Zwane Mwaikambo
  2003-06-10 23:55                                   ` Stephan von Krawczynski
  2003-06-10 18:20                                 ` Zwane Mwaikambo
  1 sibling, 1 reply; 64+ messages in thread
From: Zwane Mwaikambo @ 2003-06-10 18:15 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003, Stephan von Krawczynski wrote:

> The controller used is the second aic7xxx. The 31 interrupts on CPU0 have
> occurred before the test. This setup fails during verify (data corruption).
> 
> I would say that the interrupt code of the aic in itself is therefore ok with
> SMP. If it were a SMP race condition inside the interrupt routine this test
> should have been ok (as only one CPU is used).

Thanks for verifying this, at least i know the problem isn't with 
interrupt routing in your specific case.

	Zwane
-- 
function.linuxpower.ca

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-10 17:11                           ` Stephan von Krawczynski
@ 2003-06-10 18:07                             ` Justin T. Gibbs
  2003-06-11  0:51                               ` Stephan von Krawczynski
  0 siblings, 1 reply; 64+ messages in thread
From: Justin T. Gibbs @ 2003-06-10 18:07 UTC (permalink / raw)
  To: Stephan von Krawczynski, Justin T. Gibbs
  Cc: linux-kernel, willy, marcelo, green

>> I never said that it wasn't serious, I just haven't seen any indication
>> that this problem is caused by my driver.  There is a big difference.
>> If your complaint is that I typically help people to solve their problems
>> *off-list*, then I'm sorry if that offends your sensibilities.
>
> It does not offend my sensibilities, it is simply damaging the available
> information about typical problems and their solving. If you don't do it open,
> there is no way for others to follow your thoughts and debugging and therefore
> you are confronted a hundred times with the same questions. People have no
> choice but asking you, because your debugging cases are hidden.

99% of the problems have to do with broken interrupt routing.  There is
plenty of information about this issue on the mailing lists, but people
still ask me.  It seems that SCSI is suitably complex for the common
user that even when the driver explicitly tells you "your drive is dying",
I get email asking how I can fix my driver so that their drive doesn't
die.  The same is true if you look at the large body of dump card state
information that people have posted from the aic7xxx and aic79xx drivers
to this list.  Anyone who gets this type of output seems to think that
their problem must be the same as any other person that gets a dump
card state.  I don't think that any amount of posting information about
how I decipher what the registers are telling me will cut down on this
confusion.

>> I'm just sick of being blamed for anything that goes wrong on any system
>> that happens to have an aic7xxx controller in it.  99% of the time it's
>> not my fault, but I suppose since I debug and resolve these issues off
>> list for people that contact me, the general assumption is that these
>> issues are the aic7xxx driver's fault.
>
> No, you produce your own problem. You cannot help every single person who has
> a problem around his box/aic. This is impossible. So you have to create a
> valuable information basis others can read and think about. This is most
> simply done by debugging problems _openly_.

I just don't believe that this is true.  Most of the questions that people
email me directly are questions that are easily answered by a google search.
In other words, the information is already readily available.  It is just
easier to send email than to actually investigate a potential solution
to the problem.  So, people send email and ask the same questions, and
get the same answers.

>> >> a buffer layer bug, or a filesystem bug.
>> >
>> > /dev/tape with a filesystem? Have you read what we are talking about?
>>
>> Where did you get the data to place on the tape?  /dev/zero?
>
> Don't be silly. If reading a file from some hd would be a problem in itself,
> then we could all go home and have a beer. You are talking about the minimum
> requirement for an os.

You're the one being silly.  You are oversimplifying what it takes to
do I/O and the components that are involved in doing that I/O.  If you
don't understand that the load on several components in the kernel changes,
often in subtle but important ways, when you change the target of your
I/O, then I don't know what to say to you.

>> >>  When testing our drivers against RHAS2.1 we found that the stock
>> >> kernel had data corruption issues very similar to what you are talking
>> >> about when run on very fast, hyperthreading, SMP machines.  The data
>> >> corruption occurred with any SCSI controller we tried, regardless of
>> > vendor.
>> >
>> > My question is: is it solved?
>>
>> My understanding is that it was fixed in 2.4.18 level kernels, but since
>> I don't know the root cause of the corruption, it could have just been
>> made more difficult to reproduce.
>
> Can you point to some URL where information about this is available?

https://rhn.redhat.com/errata/RHSA-2003-147.html

This is just the most recent attempt to fix these issues.  You might
want to go back and read the other errata.

>> > Justin, this is nothing quite serious, I just mentioned it for a feedback
>> > to something _simple_.
>>
>> It's the only thing you've mentioned that I have enough information to
>> look at.
>
> No, it is only the most simple one. Unfortunately scsi-driver development is
> everything but simple for the standard problem case. It requires the ability
> to set up equipment just like the discussed case for reproduction of the
> problem.  Of course only for cases the author cannot reproduce inside his
> software via brain.  All information needed to reproduce the main problem is
> available in this thread.

To reproduce your problem, I need the same MB, memory configuration, drive
types, a 3ware card, and the same tape drive you have.  I have tried various
backup scenarios with *other hardware* and have failed to reproduce your
problem.

>> I suggest you go browse the code that is exercised by such an activity
>> before you say that.
>
> What kind of a statement is this?

It's one way of saying that you need to understand all of the code involved
with turning a write syscall into a call into the aic7xxx driver.  If you
review the code path, you'll find that there are thousands of lines of
code involved that have nothing to do with SCSI or the aic7xxx driver.
To say that you have created a simple example that proves that the problem
is in the aic7xxx driver is naive at best.

> I want to solve a problem - for me _and_ for others (and this is
> why I do it openly).
> I really have not understood what you want, besides not being spoken to.
> If I were you I would try to _prove_ that it is _not_ my problem, in best by
> finding the real problem.

As I said before, I have tried to reproduce your problem, but I cannot.
I have no hope of proving that a problem I cannot replicate is not a
problem with my driver.

Some additional things that might help:

 o Characterize the type of corruption that you are seeing in a more
   formal way.  For example, use an easy to verify pattern that will
   allow you to actually analyze the corruption.  Is the corruption
   following some pattern?

 o Can you determine if the corruption is happening when writing to
   the tape vs. reading from it?  You might do this by writing to
   the tape in an SMP mode that shows data corruption and then validate
   the driver in a safe, UP, mode and vice-versa.

 o What happens when you use different hardware/FS type/etc for the source
   and destination?
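
The first suggestion, an easy to verify pattern, could look roughly like the
following. This is a minimal Python sketch under stated assumptions: the
512-byte block size, the header layout and the names make_block and classify
are all hypothetical and would need to be matched to the tape block size
actually in use. The point is only that every block carries its own index, so
a damaged block can be told apart from an intact block that landed in the
wrong place.

```python
import struct
import zlib

BLOCK_SIZE = 512                  # assumed; match the tape block size in use
HEADER = struct.Struct(">QI")     # block index, CRC32 of the payload

def make_block(index):
    """Build one block whose entire content is derived from its index."""
    payload_len = BLOCK_SIZE - HEADER.size
    seed = struct.pack(">Q", index)
    payload = (seed * (payload_len // len(seed) + 1))[:payload_len]
    return HEADER.pack(index, zlib.crc32(payload)) + payload

def classify(data):
    """Return (position, kind, index) for each block that is not as written."""
    problems = []
    for pos in range(len(data) // BLOCK_SIZE):
        block = data[pos * BLOCK_SIZE:(pos + 1) * BLOCK_SIZE]
        if block == make_block(pos):
            continue
        index, _crc = HEADER.unpack_from(block)
        if block == make_block(index):
            # Intact block, wrong place: data was mixed up, not mangled.
            problems.append((pos, "misplaced", index))
        else:
            problems.append((pos, "damaged", None))
    return problems
```

Writing the concatenation of make_block(0), make_block(1), ... to tape and
running classify over what is read back shows whether the corruption follows
a pattern, for example whole blocks turning up at the wrong offset.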

> Unfortunately I (and some others) do have the
> impression that you simply live by the idea that as long as nobody can
> _prove_ your code has a problem, there is no problem.
> This is in fact the bofh lifestyle that works for you (as long as you do not
> meet an equally skilled person), but not for the users (spell "rest of us").

In this case, the information you have so far provided points away from
the aic7xxx driver.  I don't say that in all cases that I investigate,
but I believe it to be true in this case.  If past experience is any guide,
80-90% of the problems like this that I have debugged (and that I could
actually replicate) were induced by using the aic7xxx driver, but turned
out to be bugs in other components in the system.  The aic7xxx driver
happens to be one of the more aggressive SCSI drivers in the system and
that can often lead to finding bugs in other components.

> Back to the facts:
> Simple question: you say it's not a problem inside the driver. Ok. Question:
> how do you prove that? Can you specify a test setup (program or something) I
> can check to see that there is no problem with the general SMP tape usage of
> the aic driver? I mean you must have seen something working, or not?

The only way to do this is to find the actual bug.  The problem feels like
a VM or FS race condition most likely caused by having the source controller and
the destination controller on separate interrupts in the apic case so that
you have real concurrency in the system.  In the non apic case, it looks
like everyone shares the same interrupt, so you cannot field interrupts
for both the 3ware and the aic7xxx driver at the same time.  I also say
this because data corruption is something that is very difficult for the
aic7xxx driver to accomplish without there being some kind of error message
from the driver.

I have lots of test setups that show the aic7xxx and aic79xx driver working
just fine in PIII and P4 dual and quad configurations with and without apic
interrupt routing and writing to tape.  There's not much more that I can
do here without having your exact system here or having more information.

--
Justin


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-10 13:51                             ` Zwane Mwaikambo
  2003-06-10 15:55                               ` Stephan von Krawczynski
@ 2003-06-10 17:44                               ` Stephan von Krawczynski
  2003-06-10 18:15                                 ` Zwane Mwaikambo
  2003-06-10 18:20                                 ` Zwane Mwaikambo
  1 sibling, 2 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-10 17:44 UTC (permalink / raw)
  To: Zwane Mwaikambo; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003 09:51:34 -0400 (EDT)
Zwane Mwaikambo <zwane@linuxpower.ca> wrote:

> > Reading around the whole interrupt stuff I came across a very simple idea
> > which I am going to test right now. See you in some hours ;-)

I now tried rc7+aic20030603 SMP apic _but_ interrupts from aic only bound to
single cpu. I did this with help of irqbalance from Arjan.

/proc/interrupts:

           CPU0       CPU1       
  0:       5148     571297    IO-APIC-edge  timer
  1:       9733         97    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
 12:      43720       1271    IO-APIC-edge  PS/2 Mouse
 15:          4          4    IO-APIC-edge  ide1
 17:       1297    1336383   IO-APIC-level  3ware Storage Controller
 18:        344      16447   IO-APIC-level  eth0, eth1
 20:        570          3   IO-APIC-level  fcpcipnp
 21:      57292        340   IO-APIC-level  eth2
 22:     443161       2776   IO-APIC-level  aic7xxx
 23:         31    2005037   IO-APIC-level  aic7xxx
 26:          0          0   IO-APIC-level  EMU10K1
NMI:     593524     582633 
LOC:     576356     576330 
ERR:          0
MIS:          0

The controller used is the second aic7xxx. The 31 interrupts on CPU0 have
occurred before the test. This setup fails during verify (data corruption).

I would say that the interrupt code of the aic in itself is therefore ok with
SMP. If it were a SMP race condition inside the interrupt routine this test
should have been ok (as only one CPU is used).
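
The claim that the second aic7xxx controller was serviced almost entirely by one CPU can be checked mechanically from the table above. The following is a minimal sketch, assuming the 2.4-style /proc/interrupts layout shown in this mail; the helper names parse_interrupts and cpu_share are made up for illustration, and SAMPLE is a subset of the rows quoted above.

```python
SAMPLE = """\
           CPU0       CPU1
 17:       1297    1336383   IO-APIC-level  3ware Storage Controller
 22:     443161       2776   IO-APIC-level  aic7xxx
 23:         31    2005037   IO-APIC-level  aic7xxx
ERR:          0
"""


def parse_interrupts(text):
    """Return {irq_label: (per_cpu_counts, description)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        counts = fields[:ncpu]
        if len(counts) < ncpu or not all(c.isdigit() for c in counts):
            continue                      # ERR:/MIS: rows carry one global counter
        table[label.strip()] = ([int(c) for c in counts], " ".join(fields[ncpu:]))
    return table


def cpu_share(counts, cpu):
    """Fraction of an IRQ's interrupts that were handled on the given CPU."""
    total = sum(counts)
    return counts[cpu] / total if total else 0.0


table = parse_interrupts(SAMPLE)
counts, desc = table["23"]
print(desc, counts, round(cpu_share(counts, 1), 5))
```

A share close to 1.0 for IRQ 23 on CPU1 is consistent with the binding described above; it says nothing about which CPU runs the rest of the I/O path.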

Regards,
Stephan





^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-10 15:38                         ` Justin T. Gibbs
@ 2003-06-10 17:11                           ` Stephan von Krawczynski
  2003-06-10 18:07                             ` Justin T. Gibbs
  0 siblings, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-10 17:11 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-kernel, willy, marcelo, green

On Tue, 10 Jun 2003 09:38:31 -0600
"Justin T. Gibbs" <gibbs@scsiguy.com> wrote:

> I never said that it wasn't serious, I just haven't seen any indication
> that this problem is caused by my driver.  There is a big difference.
> If your complaint is that I typically help people to solve their problems
> *off-list*, then I'm sorry if that offends your sensibilities.

It does not offend my sensibilities, it is simply damaging the available
information about typical problems and their solving. If you don't do it open,
there is no way for others to follow your thoughts and debugging and therefore
you are confronted a hundred times with the same questions. People have no choice
but to ask you, because your debugging cases are hidden.

> I personally don't think that I need to CC a million people while I'm
> passing back various debugging information and asking for new output.  It's
> just a lot of noise for the majority of people on the linux-kernel list.

Keep in mind the broad user base of aics. Compared to other stuff in the kernel
your messages may be a whole lot more interesting to listening LKML readers
than other threads.
 
> I'm just sick of being blamed for anything that goes wrong on any system
> that happens to have an aic7xxx controller in it.  99% of the time it's
> not my fault, but I suppose since I debug and resolve these issues off
> list for people that contact me, the general assumption is that these
> issues are the aic7xxx driver's fault.

No, you produce your own problem. You cannot help every single person who has a
problem around his box/aic. This is impossible. So you have to create a
valuable information basis others can read and think about. This is most simply
done by debugging problems _openly_.

> >> a buffer layer bug, or a filesystem bug.
> >
> > /dev/tape with a filesystem? Have you read what we are talking about?
> 
> Where did you get the data to place on the tape?  /dev/zero?

Don't be silly. If reading a file from some hd would be a problem in itself,
then we could all go home and have a beer. You are talking about the minimum
requirement for an os.

> >>  When testing our drivers against RHAS2.1 we found that the stock
> >> kernel had data corruption issues very similar to what you are talking
> >> about when run on very fast, hyperthreading, SMP machines.  The data
> >> corruption occurred with any SCSI controller we tried, regardless of
> >> vendor.
> >
> > My question is: is it solved?
> 
> My understanding is that it was fixed in 2.4.18 level kernels, but since
> I don't know the root cause of the corruption, it could have just been
> made more difficult to reproduce.

Can you point to some URL where information about this is available?

> > This is not the first discussion about an instability in aic.
> 
> I'm not talking about *every case of aic7xxx driver instability*, I'm
> talking about *this particular case* of driver instability.  Problems
> that to the naive user look similar are typically not.

Sorry, I should have said: "This is not the first discussion about an
instability in aic between you and me". 

> > Justin, this is nothing quite serious, I just mentioned it for a feedback
> > to something _simple_.
> 
> It's the only thing you've mentioned that I have enough information to
> look at.

No, it is only the most simple one. Unfortunately scsi-driver development is
anything but simple for the standard problem case. It requires the ability to
set up equipment just like the discussed case for reproduction of the problem.
Of course only for cases the author cannot reproduce inside his software via
brain.
All information needed to reproduce the main problem is available in this
thread.

> > What exactly is "elsewhere" if your data is bogus when tar'ing onto
> > /dev/tape via aic and it is completely ok when tar'ing into a file via
> > reiserfs/3ware ? There is not really much left between tar and the
> > aic-driver and the tape.
> 
> I suggest you go browse the code that is exercised by such an activity
> before you say that.

What kind of a statement is this? I spent days for reproduction of the error
case, every single test takes something from 3.5 to 24 hours. And you tell me
"well, guy, if you want to know what I know go ahead and read my code", well
knowing that at least 50% of the knowledge is not in the code but in the
surrounding material you read to get where you are. I don't want to become scsi
maintainer, I want to solve a problem - for me _and_ for others (and this is
why I do it openly).
I really have not understood what you want, besides not being spoken to.
If I were you I would try to _prove_ that it is _not_ my problem, ideally by
finding the real problem. Unfortunately I (and some others) do have the
impression that you simply live by the idea that as long as nobody can _prove_ 
your code has a problem, there is no problem.
This is in fact the bofh lifestyle that works for you (as long as you do not
meet an equally skilled person), but not for the users (spell "rest of us").

Back to the facts:
Simple question: you say it's not a problem inside the driver. Ok. Question: how
do you prove that? Can you specify a test setup (program or something) I can
check to see that there is no problem with the general SMP tape usage of the
aic driver? I mean you must have seen something working, or not?

Regards,
Stephan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-10 15:55                               ` Stephan von Krawczynski
@ 2003-06-10 16:23                                 ` Oleg Drokin
  0 siblings, 0 replies; 64+ messages in thread
From: Oleg Drokin @ 2003-06-10 16:23 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Zwane Mwaikambo, linux-kernel, willy, gibbs, marcelo

Hello!

On Tue, Jun 10, 2003 at 05:55:06PM +0200, Stephan von Krawczynski wrote:

> Jun 10 17:50:53 admin kernel: Process tar (pid: 4004, stackpage=dead5000)

Hehe, with this kind of stackpage, this process was doomed just after the fork() ;)

> >>EIP; c0221c37 <st_do_scsi+127/180>   <=====

It seems that in st_do_scsi, in this line
                (STp->buffer)->syscall_result = st_chk_result(STp, SRpnt);

STp is garbage for some reason, though it was valid before.
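
The numbers in the oops quoted in this thread line up with that reading. A small arithmetic check, using only values from the report (fault address 00000b2c, edi 00000b00, and the faulting instruction "mov 0x2c(%edi),%ebx"); that %edi holds STp and that 0x2c is the offset of its buffer member is an inference from the disassembly, not something confirmed against the st.c source here, and the helper name is made up for illustration.

```python
def effective_address(base, displacement):
    """Address touched by a 'mov disp(%reg),...' on 32-bit x86 (wraps at 2**32)."""
    return (base + displacement) & 0xFFFFFFFF


FAULT_ADDRESS = 0x00000B2C   # "NULL pointer dereference at virtual address 00000b2c"
EDI = 0x00000B00             # edi from the register dump
DISPLACEMENT = 0x2C          # from "mov 0x2c(%edi),%ebx"

addr = effective_address(EDI, DISPLACEMENT)
print(hex(addr), addr == FAULT_ADDRESS)
```

If the inference holds, the faulting load is the read of STp->buffer with STp equal to 0xb00, which is not a plausible kernel pointer on this system (the other pointers in the dump are all above 0xc0000000).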

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-10 13:51                             ` Zwane Mwaikambo
@ 2003-06-10 15:55                               ` Stephan von Krawczynski
  2003-06-10 16:23                                 ` Oleg Drokin
  2003-06-10 17:44                               ` Stephan von Krawczynski
  1 sibling, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-10 15:55 UTC (permalink / raw)
  To: Zwane Mwaikambo; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003 09:51:34 -0400 (EDT)
Zwane Mwaikambo <zwane@linuxpower.ca> wrote:

> > Reading around the whole interrupt stuff I came across a very simple idea which
> > I am going to test right now. See you in some hours ;-)
> 
> Cool

Hoho, how about this one:

ksymoops 2.4.8 on i686 2.4.21-rc7-aic.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.21-rc7-aic/ (default)
     -m /boot/System.map-2.4.21-rc7-aic (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

Jun 10 17:50:53 admin kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000b2c
Jun 10 17:50:53 admin kernel: c0221c37
Jun 10 17:50:53 admin kernel: *pde = 00000000
Jun 10 17:50:53 admin kernel: Oops: 0000
Jun 10 17:50:53 admin kernel: CPU:    0
Jun 10 17:50:53 admin kernel: EIP:    0010:[st_do_scsi+295/384]    Not tainted
Jun 10 17:50:53 admin kernel: EIP:    0010:[<c0221c37>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Jun 10 17:50:53 admin kernel: EFLAGS: 00010246
Jun 10 17:50:53 admin kernel: eax: 00000000   ebx: 00000001   ecx: 00000000   edx: c34a0424
Jun 10 17:50:53 admin kernel: esi: f5f2c180   edi: 00000b00   ebp: 00008090   esp: dead5edc
Jun 10 17:50:53 admin kernel: ds: 0018   es: 0018   ss: 0018
Jun 10 17:50:53 admin kernel: Process tar (pid: 4004, stackpage=dead5000)
Jun 10 17:50:53 admin kernel: Stack: f5f2c180 00000000 c0090000 00008000 c0221a10 00015f90 00000000 dead5f7c
Jun 10 17:50:53 admin kernel:        c34a0400 00000001 00008000 c0223abd 00000000 c34a0400 dead5f40 00008000
Jun 10 17:50:53 admin kernel:        00000002 00015f90 00000000 00000001 00000000 00000000 c34a04c0 c34a0450
Jun 10 17:50:53 admin kernel: Call Trace:    [st_sleep_done+0/256] [read_tape+269/1024] [scsi_finish_command+152/208] [st_read+1015/1152] [sys_read+155/384]
Jun 10 17:50:53 admin kernel: Call Trace:    [<c0221a10>] [<c0223abd>] [<c01ede38>] [<c02241a7>] [<c0141c0b>]
Jun 10 17:50:53 admin kernel:   [<c010782f>]
Jun 10 17:50:53 admin kernel: Code: 8b 5f 2c 89 74 24 04 89 3c 24 e8 ea fb ff ff 89 43 1c eb a5


>>EIP; c0221c37 <st_do_scsi+127/180>   <=====

>>edx; c34a0424 <_end+310e0e4/38547d20>
>>esi; f5f2c180 <_end+35b99e40/38547d20>
>>esp; dead5edc <_end+1e743b9c/38547d20>

Trace; c0221a10 <st_sleep_done+0/100>
Trace; c0223abd <read_tape+10d/400>
Trace; c01ede38 <scsi_finish_command+98/d0>
Trace; c02241a7 <st_read+3f7/480>
Trace; c0141c0b <sys_read+9b/180>
Trace; c010782f <system_call+33/38>

Code;  c0221c37 <st_do_scsi+127/180>
00000000 <_EIP>:
Code;  c0221c37 <st_do_scsi+127/180>   <=====
   0:   8b 5f 2c                  mov    0x2c(%edi),%ebx   <=====
Code;  c0221c3a <st_do_scsi+12a/180>
   3:   89 74 24 04               mov    %esi,0x4(%esp,1)
Code;  c0221c3e <st_do_scsi+12e/180>
   7:   89 3c 24                  mov    %edi,(%esp,1)
Code;  c0221c41 <st_do_scsi+131/180>
   a:   e8 ea fb ff ff            call   fffffbf9 <_EIP+0xfffffbf9>
Code;  c0221c46 <st_do_scsi+136/180>
   f:   89 43 1c                  mov    %eax,0x1c(%ebx)
Code;  c0221c49 <st_do_scsi+139/180>
  12:   eb a5                     jmp    ffffffb9 <_EIP+0xffffffb9>


1 warning issued.  Results may not be reliable.

Anybody able to comment on that?

Regards,
Stephan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-10 10:23                       ` Stephan von Krawczynski
@ 2003-06-10 15:38                         ` Justin T. Gibbs
  2003-06-10 17:11                           ` Stephan von Krawczynski
  0 siblings, 1 reply; 64+ messages in thread
From: Justin T. Gibbs @ 2003-06-10 15:38 UTC (permalink / raw)
  To: Stephan von Krawczynski, Justin T. Gibbs
  Cc: linux-kernel, willy, marcelo, green

>> Stephan,
>>
>> Other than your most recent complaint that the driver doesn't function
>> correctly in an SMP kernel when you specify the nosmp option, you have
>> yet to provide any information that points to a problem in the aic7xxx
>> driver.
>
> Dear Justin,
>
> I am really not complaining about you not helping specifically _me_, I am
> complaining about your quite visible general opinion that this whole thing is
> really not serious, or maybe it is only that you are not making your efforts
> transparent to others, I don't know.

I never said that it wasn't serious, I just haven't seen any indication
that this problem is caused by my driver.  There is a big difference.
If your complaint is that I typically help people to solve their problems
*off-list*, then I'm sorry if that offends your sensibilities.
I personally don't think that I need to CC a million people while I'm
passing back various debugging information and asking for new output.  It's
just a lot of noise for the majority of people on the linux-kernel list.

>>  Without such information, I'm at a loss to help you.  One thing
>> that you forgot to mention in your "report" is that data corruption can
>> happen in many more places than just in the aic7xxx driver.
>
> <sarcasm>Did I mention the big magnet right beside the tape?</sarcasm>

I'm just sick of being blamed for anything that goes wrong on any system
that happens to have an aic7xxx controller in it.  99% of the time it's
not my fault, but I suppose since I debug and resolve these issues off
list for people that contact me, the general assumption is that these
issues are the aic7xxx driver's fault.

>>  The data could be corrupted by a VM bug,
>
> VM is quite the same, tar'ing to /dev/tape or /var/bak/mybackfile.tar.

No, the VM activity is quite different.

>> a buffer layer bug, or a filesystem bug.
>
> /dev/tape with a filesystem? Have you read what we are talking about?

Where did you get the data to place on the tape?  /dev/zero?

>>  When testing our drivers against RHAS2.1 we found that the stock
> >> kernel had data corruption issues very similar to what you are talking
>> about when run on very fast, hyperthreading, SMP machines.  The data
>> corruption occurred with any SCSI controller we tried, regardless of vendor.
>
> My question is: is it solved?

My understanding is that it was fixed in 2.4.18 level kernels, but since
I don't know the root cause of the corruption, it could have just been
made more difficult to reproduce.

>> If you continue to feel that the aic7xxx driver is at fault, I encourage you
> >> to try to reproduce this failure with someone else's card.  I think you'll
>> find that the problem persists even with this change.
>
> This is not the first discussion about an instability in aic.

I'm not talking about *every case of aic7xxx driver instability*, I'm
talking about *this particular case* of driver instability.  Problems
that to the naive user look similar are typically not.

>> I will be more than happy to look into why the aic7xxx driver may not
>> operate correctly in an SMP kernel with the nosmp option.  Considering
>> that your complaint about this failure came into my email box just
>> yesterday, perhaps you can give me just a few days to look into this
>> before you decide to call me unresponsive.  Since I'm attending a
>> conference this whole week, I won't even be able to look at this
>> until I return on Monday of next week.
>
> Justin, this is nothing quite serious, I just mentioned it for a feedback to
> something _simple_.

It's the only thing you've mentioned that I have enough information to
look at.

>> I'm sorry that you are experiencing data corruption.  I take those
>> issues very seriously, but all of your panics and other reports point
>> to issues elsewhere in the kernel that should be resolved before you
>> conclude that the data corruption you are experiencing is somehow
>> the aic7xxx driver's fault.  I'll be more than happy to fess up to
>> and correct any defect that is found in the driver, but I cannot fix
>> bugs that I cannot reproduce and that have no usable debugging information
>> associated with them.
>
> What exactly is "elsewhere" if your data is bogus when tar'ing onto /dev/tape
> via aic and it is completely ok when tar'ing into a file via reiserfs/3ware ?
> There is not really much left between tar and the aic-driver and the tape.

I suggest you go browse the code that is exercised by such an activity
before you say that.

--
Justin


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-10 13:38                           ` Stephan von Krawczynski
@ 2003-06-10 13:51                             ` Zwane Mwaikambo
  2003-06-10 15:55                               ` Stephan von Krawczynski
  2003-06-10 17:44                               ` Stephan von Krawczynski
  0 siblings, 2 replies; 64+ messages in thread
From: Zwane Mwaikambo @ 2003-06-10 13:51 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003, Stephan von Krawczynski wrote:

> On Tue, 10 Jun 2003 08:51:35 -0400 (EDT)
> Zwane Mwaikambo <zwane@linuxpower.ca> wrote:
> 
> > > Can you clarify? Do you mean options "nosmp noapic" or just "noapic" on SMP
> > > kernel?
> > 
> > Kernel built with CONFIG_SMP and booted with 'noapic' kernel parameter
> 
> Ok. To speed up the tests I  call it "ok" if there are no verify errors within
> 70 GB and "fail" if there are one or more.
> I have tried rc7+aic20030603 SMP with noapic and it is ok.

Can you also test it with an SMP kernel and only maxcpus=1 ?

> Reading around the whole interrupt stuff I came across a very simple idea which
> I am going to test right now. See you in some hours ;-)

Cool

	Zwane
-- 
function.linuxpower.ca

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-10 12:51                         ` Zwane Mwaikambo
@ 2003-06-10 13:38                           ` Stephan von Krawczynski
  2003-06-10 13:51                             ` Zwane Mwaikambo
  0 siblings, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-10 13:38 UTC (permalink / raw)
  To: Zwane Mwaikambo; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003 08:51:35 -0400 (EDT)
Zwane Mwaikambo <zwane@linuxpower.ca> wrote:

> > Can you clarify? Do you mean options "nosmp noapic" or just "noapic" on SMP
> > kernel?
> 
> Kernel built with CONFIG_SMP and booted with 'noapic' kernel parameter

Ok. To speed up the tests I  call it "ok" if there are no verify errors within
70 GB and "fail" if there are one or more.
I have tried rc7+aic20030603 SMP with noapic and it is ok.
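
The ok/fail rule above is easy to apply to a saved verify log. A minimal sketch, assuming the verify output was captured to text and that mismatches show up as lines containing "Contents differ" (the wording GNU tar is believed to use in compare mode; the thread itself only says "content differs"). The function name and the regular expression are illustrative.

```python
import re


def classify_run(verify_output, pattern=r"contents? differs?"):
    """Return ("ok" | "fail", mismatch_count) for one verify pass.

    A run is "ok" only if no line matches the mismatch pattern, and
    "fail" if one or more lines do, mirroring the rule stated above.
    """
    mismatches = [
        line for line in verify_output.splitlines()
        if re.search(pattern, line, re.IGNORECASE)
    ]
    return ("fail" if mismatches else "ok", len(mismatches))


# Hypothetical log excerpt; the file names are invented for illustration.
log = "data/a.bin: Contents differ\ndata/b.bin: Contents differ\n"
print(classify_run(log))
```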

/proc/interrupts:

           CPU0       CPU1       
  0:    1061143          0          XT-PIC  timer
  1:       6582          0          XT-PIC  keyboard
  2:          0          0          XT-PIC  cascade
  5:       1229          0          XT-PIC  EMU10K1
  9:    9269694          0          XT-PIC  aic7xxx, aic7xxx, 3ware Storage Controller, fcpcipnp, eth0, eth1, eth2
 12:     129555          0          XT-PIC  PS/2 Mouse
 15:          4          0          XT-PIC  ide1
NMI:          0          0 
LOC:    1061054    1061028 
ERR:          1
MIS:          0
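
With noapic, the table above shows both aic7xxx controllers, the 3ware controller and the network cards all on IRQ 9. A short sketch for listing such shared lines, assuming the same /proc/interrupts layout as shown here; the function name shared_irqs is made up for illustration and SAMPLE is a subset of the rows above.

```python
SAMPLE = """\
           CPU0       CPU1
  0:    1061143          0          XT-PIC  timer
  9:    9269694          0          XT-PIC  aic7xxx, aic7xxx, 3ware Storage Controller, fcpcipnp, eth0, eth1, eth2
 12:     129555          0          XT-PIC  PS/2 Mouse
NMI:          0          0
ERR:          1
"""


def shared_irqs(text):
    """Return {irq_label: [device, ...]} for IRQ lines used by more than one device."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())               # header row: CPU0 CPU1 ...
    shared = {}
    for ln in lines[1:]:
        label, sep, rest = ln.partition(":")
        # Expected fields: one count per CPU, the controller type, then the device list.
        parts = rest.split(None, ncpu + 1)
        if not sep or len(parts) < ncpu + 2:
            continue                           # NMI:/ERR: rows have no device list
        devices = [d.strip() for d in parts[ncpu + 1].split(",")]
        if len(devices) > 1:
            shared[label.strip()] = devices
    return shared


for irq, devices in shared_irqs(SAMPLE).items():
    print(irq, devices)
```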


Reading around the whole interrupt stuff I came across a very simple idea which
I am going to test right now. See you in some hours ;-)

Regards,
Stephan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-10 10:30                       ` Stephan von Krawczynski
@ 2003-06-10 12:51                         ` Zwane Mwaikambo
  2003-06-10 13:38                           ` Stephan von Krawczynski
  0 siblings, 1 reply; 64+ messages in thread
From: Zwane Mwaikambo @ 2003-06-10 12:51 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003, Stephan von Krawczynski wrote:

> Uh, do I trust Linus ? ;-) Well, probably I am going to take a look. The whole
> story eats a lot of time as I have to deal with GBs of data for every single
> test.

Cool, i'll wait on that then.

> Can you clarify? Do you mean options "nosmp noapic" or just "noapic" on SMP
> kernel?

Kernel built with CONFIG_SMP and booted with 'noapic' kernel parameter

> Hm, my question is: if it were exclusively an apic problem, why do other
> controllers (in a filesystem environment) work flawlessly. Maybe the driver and
> apic simply have differing opinions in certain race cases, but that does not
> mean that apic is always to blame, does it?

I'm a bit wary of blaming the interrupt routing setup, as i have also 
noted that other devices work fine. But we have to be objective and try 
and isolate things first. You seem to have a good head start on that.

	Zwane
-- 
function.linuxpower.ca

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-10  1:38                     ` Zwane Mwaikambo
@ 2003-06-10 10:30                       ` Stephan von Krawczynski
  2003-06-10 12:51                         ` Zwane Mwaikambo
  0 siblings, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-10 10:30 UTC (permalink / raw)
  To: Zwane Mwaikambo; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Mon, 9 Jun 2003 21:38:16 -0400 (EDT)
Zwane Mwaikambo <zwane@linuxpower.ca> wrote:

> On Mon, 9 Jun 2003, Stephan von Krawczynski wrote:
> 
> > During the whole testing with SMP I recognised that the tar-verify always
> > brought up "content differs" warnings. Which basically means that the
> > filesize is ok but the content is not. As there might be various causes for
> > this (bad tape, bad drive, bad cabling) I did not think much of it.
> > But it turns out there are no more such warnings when using an UP kernel
> > (on the same box with the complete same hardware including tapes).
> > 
> > From this experience I would conclude the following (for my personal test
> > case):
> 
> Can you also try this with 2.5?

Uh, do I trust Linus ? ;-) Well, probably I am going to take a look. The whole
story eats a lot of time as I have to deal with GBs of data for every single
test.

> > 1) aic-driver has problems with smp/up switching (meaning crashes when
> > trying an SMP build with nosmp). This is completely reproducible.
> 
> Can you also try an SMP kernel with noapic?

Can you clarify? Do you mean options "nosmp noapic" or just "noapic" on SMP
kernel?

> > 2) aic-driver (almost no matter what version) has problems with SMP setup
> > and tape drives. Obviously data integrity is not given. This is completely
> > reproducible in my test setup.
> 
> I have had problems with symmetric interrupt handling but can normally get 
> it working with noapic. And no it doesn't appear to be an interrupt 
> routing problem on my box (If it is someone please clearly state what the 
> exact problem is to me)

Hm, my question is: if it were exclusively an apic problem, why do other
controllers (in a filesystem environment) work flawlessly. Maybe the driver and
apic simply have differing opinions in certain race cases, but that does not
mean that apic is always to blame, does it?

Regards,
Stephan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-09 15:32                     ` Justin T. Gibbs
@ 2003-06-10 10:23                       ` Stephan von Krawczynski
  2003-06-10 15:38                         ` Justin T. Gibbs
  0 siblings, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-10 10:23 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-kernel, willy, marcelo, green

On Mon, 09 Jun 2003 15:32:11 +0000
"Justin T. Gibbs" <gibbs@scsiguy.com> wrote:

> > For Justin:
> > Thank you for your continuous openness and support in the whole issue in
> > form of exactly _zero_ comments (besides "how do you know aic is to
> > blame?").
> 
> Stephan,
> 
> Other than your most recent complaint that the driver doesn't function
> correctly in an SMP kernel when you specify the nosmp option, you have
> yet to provide any information that points to a problem in the aic7xxx
> driver.

Dear Justin,

I am really not complaining about you not helping specifically _me_, I am
complaining about your quite visible general opinion that this whole thing is
really not serious, or maybe it is only that you are not making your efforts
transparent to others, I don't know.

>  Without such information, I'm at a loss to help you.  One thing
> that you forgot to mention in your "report" is that data corruption can
> happen in many more places than just in the aic7xxx driver.

<sarcasm>Did I mention the big magnet right beside the tape?</sarcasm>

>  The data
> could be corrupted by a VM bug,

VM is quite the same, tar'ing to /dev/tape or /var/bak/mybackfile.tar.

> a buffer layer bug, or a filesystem
> bug.

/dev/tape with a filesystem? Have you read what we are talking about?

>  When testing our drivers against RHAS2.1 we found that the stock
> kernel had data corruption issues very similar to what you are talking
> about when run on very fast, hyperthreading, SMP machines.  The data
> corruption occurred with any SCSI controller we tried, regardless of vendor.

My question is: is it solved?

> If you continue to feel that the aic7xxx driver is at fault, I encourage you
> to try to reproduce this failure with someone else's card.  I think you'll
> find that the problem persists even with this change.

This is not the first discussion about an instability in aic. We had the same
thing months ago for another setup (where btw you said the same thing). Back
then I switched to symbios and everything went ok from then on. Thing is: I am
not a big learner, I just re-tried with aic now, and it happened again. I will
do the same thing now like back then: switching to symbios. Be sure I am going
to tell my experiences. Be aware that I have already received reports from
others with the same problem solving it the same way - switching away from aic.

> I will be more than happy to look into why the aic7xxx driver may not
> operate correctly in an SMP kernel with the nosmp option.  Considering
> that your complaint about this failure came into my email box just
> yesterday, perhaps you can give me just a few days to look into this
> before you decide to call me unresponsive.  Since I'm attending a
> conference this whole week, I won't even be able to look at this
> until I return on Monday of next week.

Justin, this is nothing quite serious, I just mentioned it for a feedback to
something _simple_.

> I'm sorry that you are experiencing data corruption.  I take those
> issues very seriously, but all of your panics and other reports point
> to issues elsewhere in the kernel that should be resolved before you
> conclude that the data corruption you are experiencing is somehow
> the aic7xxx driver's fault.  I'll be more than happy to fess up to
> and correct any defect that is found in the driver, but I cannot fix
> bugs that I cannot reproduce and that have no usable debugging information
> associated with them.

What exactly is "elsewhere" if your data is bogus when tar'ing onto /dev/tape
via aic and it is completely ok when tar'ing into a file via reiserfs/3ware ?
There is not really much left between tar and the aic-driver and the tape.
Where is your favourite in this game?

Regards,
Stephan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-09 15:10                   ` Stephan von Krawczynski
  2003-06-09 15:32                     ` Justin T. Gibbs
@ 2003-06-10  1:38                     ` Zwane Mwaikambo
  2003-06-10 10:30                       ` Stephan von Krawczynski
  1 sibling, 1 reply; 64+ messages in thread
From: Zwane Mwaikambo @ 2003-06-10  1:38 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Mon, 9 Jun 2003, Stephan von Krawczynski wrote:

> During the whole testing with SMP I recognised that the tar-verify always
> brought up "content differs" warnings. Which basically means that the filesize
> is ok but the content is not. As there might be various causes for this (bad
> tape, bad drive, bad cabling) I did not think much of it. But it turns
> out there are no more such warnings when using an UP kernel (on the same box
> with the complete same hardware including tapes).
> 
> From this experience I would conclude the following (for my personal test
> case):

Can you also try this with 2.5?

> 1) aic-driver has problems with smp/up switching (meaning crashes when trying
> an SMP build with nosmp). This is completely reproducible.

Can you also try an SMP kernel with noapic?

> 2) aic-driver (almost no matter what version) has problems with SMP setup and
> tape drives. Obviously data integrity is not given. This is completely
> reproducible in my test setup.

I have had problems with symmetric interrupt handling but can normally get 
it working with noapic. And no it doesn't appear to be an interrupt 
routing problem on my box (If it is someone please clearly state what the 
exact problem is to me)

	Zwane
-- 
function.linuxpower.ca

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-09 15:10                   ` Stephan von Krawczynski
@ 2003-06-09 15:32                     ` Justin T. Gibbs
  2003-06-10 10:23                       ` Stephan von Krawczynski
  2003-06-10  1:38                     ` Zwane Mwaikambo
  1 sibling, 1 reply; 64+ messages in thread
From: Justin T. Gibbs @ 2003-06-09 15:32 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green

> For Justin:
> Thank you for your continuous openness and support in the whole issue in form
> of exactly _zero_ comments (besides "how do you know aic is to blame?").

Stephan,

Other than your most recent complaint that the driver doesn't function
correctly in an SMP kernel when you specify the nosmp option, you have
yet to provide any information that points to a problem in the aic7xxx
driver.  Without such information, I'm at a loss to help you.  One thing
that you forgot to mention in your "report" is that data corruption can
happen in many more places than just in the aic7xxx driver.  The data
could be corrupted by a VM bug, a buffer layer bug, or a filesystem
bug.  When testing our drivers against RHAS2.1 we found that the stock
kernel had data corruption issues very similar to what you are talking
about when run on very fast, hyperthreading, SMP machines.  The data
corruption occurred with any SCSI controller we tried, regardless of vendor.
If you continue to feel that the aic7xxx driver is at fault, I encourage you
to try to reproduce this failure with someone else's card.  I think you'll
find that the problem persists even with this change.

I will be more than happy to look into why the aic7xxx driver may not
operate correctly in an SMP kernel with the nosmp option.  Considering
that your complaint about this failure came into my email box just
yesterday, perhaps you can give me just a few days to look into this
before you decide to call me unresponsive.  Since I'm attending a
conference this whole week, I won't even be able to look at this
until I return on Monday of next week.

I'm sorry that you are experiencing data corruption.  I take those
issues very seriously, but all of your panics and other reports point
to issues elsewhere in the kernel that should be resolved before you
conclude that the data corruption you are experiencing is somehow
the aic7xxx driver's fault.  I'll be more than happy to fess up to
and correct any defect that is found in the driver, but I cannot fix
bugs that I cannot reproduce and that have no usable debugging information
associated with them.

--
Justin


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-08 11:49                 ` Stephan von Krawczynski
  2003-06-08 16:07                   ` Stephan von Krawczynski
@ 2003-06-09 15:10                   ` Stephan von Krawczynski
  2003-06-09 15:32                     ` Justin T. Gibbs
  2003-06-10  1:38                     ` Zwane Mwaikambo
  1 sibling, 2 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-09 15:10 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green

Hello all,

I just finished another bunch of tests around the discussed issue and it's
getting to an end.
Yesterday I started using the test box with UP kernel instead of SMP, because I
have the feeling the whole problem is somewhere around an SMP race condition.
As far as I can see now the box runs 24h stable _and_ (and this is the
important part) one problem I did not talk about till now is completely gone:

During the whole testing with SMP I noticed that the tar-verify always
brought up "content differs" warnings, which basically means that the filesize
is ok but the content is not. As there might be various causes for this (bad
tape, bad drive, bad cabling) I did not think much of it. But it turns
out there are no more such warnings when using an UP kernel (on the same box
with the complete same hardware including tapes).

From this experience I would conclude the following (for my personal test
case):

1) aic-driver has problems with smp/up switching (meaning crashes when trying
an SMP build with nosmp). This is completely reproducible.

2) aic-driver (almost no matter what version) has problems with SMP setup and
tape drives. Obviously data integrity is not given. This is completely
reproducible in my test setup.

For Marcelo: 
It seems you can take any version of the aic driver for small box setups with
UP, I never saw any troubles with it. As soon as you look at SMP flush it down
the t..let.

For Justin:
Thank you for your continuous openness and support in the whole issue in form of
exactly _zero_ comments (besides "how do you know aic is to blame?").

For Willy:
I honour your efforts, but we are not capable of solving the issue.

For Oleg:
Stay tuned, I will test the re-creation issue and your patch.

And now I go and buy a Symbios controller and re-try.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-08 11:49                 ` Stephan von Krawczynski
@ 2003-06-08 16:07                   ` Stephan von Krawczynski
  2003-06-09 15:10                   ` Stephan von Krawczynski
  1 sibling, 0 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-08 16:07 UTC (permalink / raw)
  To: gibbs; +Cc: linux-kernel

Hello Justin,

another thing I stumbled across: if you compile the latest aic-driver
(20030603) for smp, but boot the kernel with nosmp flag, the driver hangs
during device-scan.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-08 11:19               ` Stephan von Krawczynski
@ 2003-06-08 11:49                 ` Stephan von Krawczynski
  2003-06-08 16:07                   ` Stephan von Krawczynski
  2003-06-09 15:10                   ` Stephan von Krawczynski
  0 siblings, 2 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-08 11:49 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green

Hello author,

shoot me for the last comment regarding __kmem_cache_alloc (which means: forget
it).
Still you have significant source code duplication between "#define
kmem_cache_alloc_one" and "void* kmem_cache_alloc_batch".
How about an exit-symbol parameter?

Regards,
Stephan
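
One way to read the "exit-symbol parameter" idea is to pass the bail-out label to the macro explicitly, so the jump is visible at every call site instead of being hidden in the macro body. The following is a toy sketch only, not the slab code: `toy_cache`, `TOY_POOL`, `TOY_ALLOC_ONE`, `toy_alloc` and the label name are all invented for illustration.

```c
#include <stddef.h>

#define TOY_POOL 4

/* Hypothetical stand-in for a cache: a fixed pool plus a fill counter. */
struct toy_cache {
    int objs[TOY_POOL];
    size_t used;
};

/*
 * Hand out one object, or jump to the caller-supplied label when the
 * pool is empty. The label is an explicit argument, so a reader of the
 * calling function can see where control may go.
 */
#define TOY_ALLOC_ONE(cache, out, exit_label)            \
    do {                                                 \
        if ((cache)->used == TOY_POOL)                   \
            goto exit_label;                             \
        (out) = &(cache)->objs[(cache)->used++];         \
    } while (0)

/* Returns a pointer to the next free object, or NULL when exhausted. */
int *toy_alloc(struct toy_cache *c)
{
    int *obj = NULL;

    TOY_ALLOC_ONE(c, obj, out_of_objects);
    return obj;

out_of_objects:
    return NULL;
}
```

With a zero-initialised `struct toy_cache`, the first four calls to `toy_alloc` return the four pool slots in order and the fifth returns NULL.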


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-05 18:14             ` Willy Tarreau
  2003-06-06  8:17               ` Oleg Drokin
@ 2003-06-08 11:19               ` Stephan von Krawczynski
  2003-06-08 11:49                 ` Stephan von Krawczynski
  1 sibling, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-08 11:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: willy, gibbs, marcelo, green

Hello all,

looking at code around my problem I discovered this:

static inline void * __kmem_cache_alloc (kmem_cache_t *cachep, int flags)
{
        unsigned long save_flags;
        void* objp;

        kmem_cache_alloc_head(cachep, flags);
try_again:
        local_irq_save(save_flags);
#ifdef CONFIG_SMP
        {
                cpucache_t *cc = cc_data(cachep);

                if (cc) {
                        if (cc->avail) {
                                STATS_INC_ALLOCHIT(cachep);
                                objp = cc_entry(cc)[--cc->avail];
                        } else {
                                STATS_INC_ALLOCMISS(cachep);
                                objp = kmem_cache_alloc_batch(cachep,cc,flags);
                                if (!objp)
                                        goto alloc_new_slab_nolock;
                        }
                } else {
                        spin_lock(&cachep->spinlock);
                        objp = kmem_cache_alloc_one(cachep);
                        spin_unlock(&cachep->spinlock);
                }
        }
#else
        objp = kmem_cache_alloc_one(cachep);
#endif
        local_irq_restore(save_flags);
        return objp;
alloc_new_slab:  
#ifdef CONFIG_SMP
        spin_unlock(&cachep->spinlock);
alloc_new_slab_nolock:
#endif
        local_irq_restore(save_flags);
        if (kmem_cache_grow(cachep, flags))
                /* Someone may have stolen our objs.  Doesn't matter, we'll
                 * just come back here again.
                 */
                goto try_again;
        return NULL;
} 
  

I suggest it for most-absurd-goto-usage-award.

1) There seems to be no reference for symbol "alloc_new_slab"
2) "spin_unlock" (right below) is never reached
3) The not-ifdef'ed code below is only used if CONFIG_SMP
4) The code "alloc_new_slab_nolock" is referenced only once by a goto
   (why not simply pasted there?)

This does not look like a problem, it only is damn ugly. I have no idea 
what this code actually does, but it looks patched-to-the-limit. Has 
anybody reviewed slab regarding CONFIG_SMP?

Regards,
Stephan
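
On point 1: the follow-up in this thread notes that kmem_cache_alloc_one is a `#define`, and a goto buried in a macro body would explain why `alloc_new_slab` has no visible reference in the function text. A toy sketch of that pattern, assuming nothing about the real slab.c beyond what is quoted here; `toy_slab`, `TOY_TAKE_ONE`, `toy_take` and the `refill` label are invented names.

```c
#include <stddef.h>

/* Hypothetical stand-in for a slab: just a count of free objects. */
struct toy_slab {
    int free_objs;
};

/*
 * The goto target is hard-coded inside the macro. Any function using
 * the macro must define a label called 'refill', even though no goto
 * to it appears in the function's own text.
 */
#define TOY_TAKE_ONE(slab)              \
    do {                                \
        if ((slab)->free_objs == 0)     \
            goto refill;                \
        (slab)->free_objs--;            \
    } while (0)

/* Takes one object; returns how many refills were needed to do so. */
int toy_take(struct toy_slab *s)
{
    int refills = 0;

try_again:
    TOY_TAKE_ONE(s);        /* may jump to 'refill' below */
    return refills;

refill:                     /* looks unreferenced, but the macro uses it */
    s->free_objs = 8;
    refills++;
    goto try_again;
}
```

Grepping the function body for `goto refill` finds nothing, which is the same impression the quoted `__kmem_cache_alloc` gives for `alloc_new_slab`.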

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-06  9:17                   ` Oleg Drokin
@ 2003-06-08 10:15                     ` Stephan von Krawczynski
  0 siblings, 0 replies; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-08 10:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: willy, gibbs, marcelo, green

On Fri, 6 Jun 2003 13:17:59 +0400
Oleg Drokin <green@namesys.com> wrote:

> Hello!
> 
> On Fri, Jun 06, 2003 at 11:04:08AM +0200, Stephan von Krawczynski wrote:
> > > No, it did crash in allocation code (you skipped one trace line):
> > > Jun  5 16:53:55 admin kernel: Call Trace:    [__kmem_cache_alloc+107/304]
> > > [kmem_cache_grow+508/624]
> > > [__kmem_cache_alloc+125/304]+[get_mem_for_virtual_node+87/224]
> > > [fix_nodes+198/1008]
> > > 
> > > And the EIP is in kmem_cache_alloc_batch, sounds like it tripped on bad
> > > pointer or something like this. So something is corrupting slab lists it
> > > seems.
> > I agree with you. Only problem is: how can I find out what caused the problem.
> 
> Probably by careful code observations.
> 
> > The only thing I can tell is that the box never hangs when using only HDs on
> > the aic & 3ware controllers. As soon as I begin to use a SDLT drive on aic
> > things get fishy.
> 
> You do not have reiserfs filesystem on a tape drive, right? ;)
> But that reduces the region to review to parts that deal with tape devices and
> tape-specific stuff, it seems.
> 
> Bye,
>     Oleg

Hello all,

in the meantime I got another oops and it looks like this:

ksymoops 2.4.8 on i686 2.4.21-rc7-aic.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.21-rc7-aic/ (default)
     -m /boot/System.map-2.4.21-rc7-aic (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

Jun  8 10:48:49 linux kernel: Oops: 0000
Jun  8 10:48:49 linux kernel: CPU:    1
Jun  8 10:48:49 linux kernel: EIP:    0010:[<c013755e>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Jun  8 10:48:49 linux kernel: EFLAGS: 00010006
Jun  8 10:48:49 linux kernel: eax: 5a005139   ebx: 5a005139   ecx: edb89c21   edx: 00000060
Jun  8 10:48:49 linux kernel: esi: 00000021   edi: 0000005c   ebp: c342fecc   esp: e4007d74
Jun  8 10:48:49 linux kernel: ds: 0018   es: 0018   ss: 0018
Jun  8 10:48:49 linux kernel: Process tar (pid: 17369, stackpage=e4007000)
Jun  8 10:48:49 linux kernel: Stack: c342fed4 c342fedc c342fecc 00000246 00000070 effa58a0 c01382eb c342fecc
Jun  8 10:48:49 linux kernel:        c3467800 00000070 00000000 c1000020 effa58a0 effa58a0 c013f7d9 c342fecc
Jun  8 10:48:49 linux kernel:        00000070 00000000 c013f8a5 c349d418 f6fc1200 00000000 00000000 c1000020
Jun  8 10:48:49 linux kernel: Call Trace:    [<c01382eb>] [<c013f7d9>] [<c013f8a5>] [<c01b8f73>] [<c01b929e>]
Jun  8 10:48:49 linux kernel:   [<c01b936c>] [<c0145596>] [<c0139fc2>] [<c013069e>] [<c017c4e0>] [<c013124f>]
Jun  8 10:48:49 linux kernel:   [<c0131531>] [<c0131ad0>] [<c0131d20>] [<c0131ad0>] [<c0141c0b>] [<c010782f>]
Jun  8 10:48:49 linux kernel: Code: 8b 44 81 18 0f af da 8b 51 0c 89 41 14 01 d3 40 0f 84 89 00


>>EIP; c013755e <kmem_cache_alloc_batch+4e/110>   <=====

>>ecx; edb89c21 <_end+2d7f78e1/38547d20>
>>ebp; c342fecc <_end+309db8c/38547d20>
>>esp; e4007d74 <_end+23c75a34/38547d20>

Trace; c01382eb <__kmem_cache_alloc+6b/130>
Trace; c013f7d9 <alloc_bounce_bh+19/a0>
Trace; c013f8a5 <create_bounce+45/190>
Trace; c01b8f73 <__make_request+3d3/640>
Trace; c01b929e <generic_make_request+be/140>
Trace; c01b936c <submit_bh+4c/70>
Trace; c0145596 <block_read_full_page+2c6/2e0>
Trace; c0139fc2 <__alloc_pages+42/190>
Trace; c013069e <generic_buffer_fdatasync+5e/110>
Trace; c017c4e0 <reiserfs_get_block+0/12c0>
Trace; c013124f <generic_file_readahead+af/1a0>
Trace; c0131531 <do_generic_file_read+1c1/470>
Trace; c0131ad0 <file_read_actor+0/110>
Trace; c0131d20 <generic_file_read+140/160>
Trace; c0131ad0 <file_read_actor+0/110>
Trace; c0141c0b <sys_read+9b/180>
Trace; c010782f <system_call+33/38>

Code;  c013755e <kmem_cache_alloc_batch+4e/110>
00000000 <_EIP>:
Code;  c013755e <kmem_cache_alloc_batch+4e/110>   <=====
   0:   8b 44 81 18               mov    0x18(%ecx,%eax,4),%eax   <=====
Code;  c0137562 <kmem_cache_alloc_batch+52/110>
   4:   0f af da                  imul   %edx,%ebx
Code;  c0137565 <kmem_cache_alloc_batch+55/110>
   7:   8b 51 0c                  mov    0xc(%ecx),%edx
Code;  c0137568 <kmem_cache_alloc_batch+58/110>
   a:   89 41 14                  mov    %eax,0x14(%ecx)
Code;  c013756b <kmem_cache_alloc_batch+5b/110>
   d:   01 d3                     add    %edx,%ebx
Code;  c013756d <kmem_cache_alloc_batch+5d/110>
   f:   40                        inc    %eax
Code;  c013756e <kmem_cache_alloc_batch+5e/110>
  10:   0f 84 89 00 00 00         je     9f <_EIP+0x9f>


1 warning issued.  Results may not be reliable.


This is the second oops inside kmem_cache_alloc_batch, so the problem can be considered reproducible.
This is a 2.4.21-rc7+aic20030603 kernel.

Regards,
Stephan
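
A quick cross-check of the oops data, offered as a sketch rather than a diagnosis: the faulting instruction decoded by ksymoops is `mov 0x18(%ecx,%eax,4),%eax`, so the address it touches should be ecx + eax*4 + 0x18, truncated to 32 bits. The Jun 5 oops in this thread shows the same Code bytes and also prints the faulting virtual address, so its register values (eax e62d70eb, ecx f57ae401) can be plugged in and compared. The helper name `effective_address` is made up for illustration.

```c
#include <stdint.h>

/*
 * Effective address of an i386 operand of the form disp(base,index,scale).
 * All arithmetic is done in uint32_t, so it wraps modulo 2^32 the same
 * way 32-bit address generation does.
 */
uint32_t effective_address(uint32_t base, uint32_t index,
                           uint32_t scale, uint32_t disp)
{
    return base + index * scale + disp;
}

/* Register values copied from the Jun 5 oops: ecx is base, eax is index. */
uint32_t jun5_fault_address(void)
{
    return effective_address(0xf57ae401u, 0xe62d70ebu, 4u, 0x18u);
}
```

`jun5_fault_address()` evaluates to 0x8e30a7c5, which matches the "Unable to handle kernel paging request at virtual address 8e30a7c5" line of that oops. In both oopses eax looks like random data rather than a small index and ecx is not even word-aligned, which fits Oleg's guess that something is corrupting the slab lists. It does not say what the corruptor is.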

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-06  9:04                 ` Stephan von Krawczynski
@ 2003-06-06  9:17                   ` Oleg Drokin
  2003-06-08 10:15                     ` Stephan von Krawczynski
  0 siblings, 1 reply; 64+ messages in thread
From: Oleg Drokin @ 2003-06-06  9:17 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: willy, gibbs, marcelo, linux-kernel

Hello!

On Fri, Jun 06, 2003 at 11:04:08AM +0200, Stephan von Krawczynski wrote:
> > No, it did crash in allocation code (you skipped one trace line):
> > Jun  5 16:53:55 admin kernel: Call Trace:    [__kmem_cache_alloc+107/304]
> > [kmem_cache_grow+508/624]
> > [__kmem_cache_alloc+125/304]+[get_mem_for_virtual_node+87/224]
> > [fix_nodes+198/1008]
> > 
> > And the EIP is in kmem_cache_alloc_batch, sounds like it tripped on bad
> > pointer or something like this. So something is corrupting slab lists it
> > seems.
> I agree with you. Only problem is: how can I find out what caused the problem.

Probably by careful code observations.

> The only thing I can tell is that the box never hangs when using only HDs on
> the aic & 3ware controllers. As soon as I begin to use a SDLT drive on aic
> things get fishy.

You do not have reiserfs filesystem on a tape drive, right? ;)
But that reduces the region to review to parts that deal with tape devices and
tape-specific stuff, it seems.

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-06  8:17               ` Oleg Drokin
@ 2003-06-06  9:04                 ` Stephan von Krawczynski
  2003-06-06  9:17                   ` Oleg Drokin
  0 siblings, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-06  9:04 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: willy, gibbs, marcelo, linux-kernel

On Fri, 6 Jun 2003 12:17:12 +0400
Oleg Drokin <green@namesys.com> wrote:

> Hello!
> 
> On Thu, Jun 05, 2003 at 08:14:23PM +0200, Willy Tarreau wrote:
> > > It took some days to produce output for my freezing problem. This one is
> > > rc7+aic20030603:
> > Good !
> > It seems that it crashed in the reiserfs code rather than in aic7xxx !
> > perhaps you hit 2 different bugs, or perhaps there's a race that only newer
> > code can trigger, or there's a leak somewhere. You may want to forward the
> > oops to the reiserfs team too.
> 
> No, it did crash in allocation code (you skipped one trace line):
> Jun  5 16:53:55 admin kernel: Call Trace:    [__kmem_cache_alloc+107/304]
> [kmem_cache_grow+508/624]
> [__kmem_cache_alloc+125/304]+[get_mem_for_virtual_node+87/224]
> [fix_nodes+198/1008]
> 
> And the EIP is in kmem_cache_alloc_batch, sounds like it tripped on bad
> pointer or something like this. So something is corrupting slab lists it
> seems.
> 
> Bye,
>     Oleg

I agree with you. Only problem is: how can I find out what caused the problem.
The only thing I can tell is that the box never hangs when using only HDs on
the aic & 3ware controllers. As soon as I begin to use a SDLT drive on aic
things get fishy.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-05 18:14             ` Willy Tarreau
@ 2003-06-06  8:17               ` Oleg Drokin
  2003-06-06  9:04                 ` Stephan von Krawczynski
  2003-06-08 11:19               ` Stephan von Krawczynski
  1 sibling, 1 reply; 64+ messages in thread
From: Oleg Drokin @ 2003-06-06  8:17 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Stephan von Krawczynski, gibbs, marcelo, linux-kernel

Hello!

On Thu, Jun 05, 2003 at 08:14:23PM +0200, Willy Tarreau wrote:
> > It took some days to produce output for my freezing problem. This one is rc7+aic20030603:
> Good !
> It seems that it crashed in the reiserfs code rather than in aic7xxx ! perhaps
> you hit 2 different bugs, or perhaps there's a race that only newer code can
> trigger, or there's a leak somewhere. You may want to forward the oops to the
> reiserfs team too.

No, it did crash in allocation code (you skipped one trace line):
Jun  5 16:53:55 admin kernel: Call Trace:    [__kmem_cache_alloc+107/304] [kmem_cache_grow+508/624] [__kmem_cache_alloc+125/304]
+[get_mem_for_virtual_node+87/224] [fix_nodes+198/1008]

And the EIP is in kmem_cache_alloc_batch, sounds like it tripped on bad pointer or something like this.
So something is corrupting slab lists it seems.

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-06-05 15:05           ` Undo aic7xxx changes (now rc7+aic20030603) Stephan von Krawczynski
@ 2003-06-05 18:14             ` Willy Tarreau
  2003-06-06  8:17               ` Oleg Drokin
  2003-06-08 11:19               ` Stephan von Krawczynski
  0 siblings, 2 replies; 64+ messages in thread
From: Willy Tarreau @ 2003-06-05 18:14 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: Willy Tarreau, gibbs, marcelo, linux-kernel

On Thu, Jun 05, 2003 at 05:05:51PM +0200, Stephan von Krawczynski wrote:
> Hello all,
> 
> It took some days to produce output for my freezing problem. This one is rc7+aic20030603:

Good !

It seems that it crashed in the reiserfs code rather than in aic7xxx ! perhaps
you hit 2 different bugs, or perhaps there's a race that only newer code can
trigger, or there's a leak somewhere. You may want to forward the oops to the
reiserfs team too.

> Jun  5 16:53:55 admin kernel: Call Trace:    [<c01382eb>] [<c013749c>] [<c01382fd>] [<c01846a7>] [<c0184bc6>]
> Jun  5 16:53:55 admin kernel:   [reiserfs_paste_into_item+147/304] [reiserfs_get_block+1989/4800] [bh_action+106/112] [tasklet_hi_action+83/160] [smp_apic_timer_interrupt+264/304] [.text.lock.buffer+191/610]
> Jun  5 16:53:55 admin kernel:   [<c0191ae3>] [<c017cca5>] [<c012252a>] [<c01223b3>] [<c0115d88>] [<c01474bd>]
> Jun  5 16:53:55 admin kernel:   [getblk+109/128] [is_tree_node+100/112] [search_by_key+1824/3792] [__block_prepare_write+479/880] [block_prepare_write+51/144] [reiserfs_get_block+0/4800]
> Jun  5 16:53:55 admin kernel:   [<c014447d>] [<c018e8f4>] [<c018f020>] [<c014503f>] [<c0145a23>] [<c017c4e0>]
> Jun  5 16:53:55 admin kernel:   [generic_file_write+970/2128] [reiserfs_get_block+0/4800] [sys_write+155/384] [system_call+51/56]
> Jun  5 16:53:55 admin kernel:   [<c013397a>] [<c017c4e0>] [<c0141d8b>] [<c010782f>]
> Jun  5 16:53:55 admin kernel: 
> Jun  5 16:53:55 admin kernel: Code: 8b 44 81 18 0f af da 8b 51 0c 89 41 14 01 d3 40 0f 84 89 00

Cheers and thanks for the test !

Willy


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Undo aic7xxx changes (now rc7+aic20030603)
  2003-05-24 11:16         ` Willy Tarreau
@ 2003-06-05 15:05           ` Stephan von Krawczynski
  2003-06-05 18:14             ` Willy Tarreau
  0 siblings, 1 reply; 64+ messages in thread
From: Stephan von Krawczynski @ 2003-06-05 15:05 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: willy, gibbs, marcelo, linux-kernel

Hello all,

It took some days to produce output for my freezing problem. This one is rc7+aic20030603:

Jun  5 16:53:55 admin kernel: Unable to handle kernel paging request at virtual address 8e30a7c5
Jun  5 16:53:55 admin kernel:  printing eip:
Jun  5 16:53:55 admin kernel: c013755e
Jun  5 16:53:55 admin kernel: *pde = 00000000
Jun  5 16:53:55 admin kernel: Oops: 0000
Jun  5 16:53:55 admin kernel: CPU:    0 
Jun  5 16:53:55 admin kernel: EIP:    0010:[kmem_cache_alloc_batch+78/272]    Not tainted
Jun  5 16:53:55 admin kernel: EIP:    0010:[<c013755e>]    Not tainted
Jun  5 16:53:55 admin kernel: EFLAGS: 00010006
Jun  5 16:53:55 admin kernel: eax: e62d70eb   ebx: e62d70eb   ecx: f57ae401   edx: 00000020
Jun  5 16:53:55 admin kernel: esi: 00000043   edi: 0000003a   ebp: c342b060   esp: e5e63a28
Jun  5 16:53:55 admin kernel: ds: 0018   es: 0018   ss: 0018
Jun  5 16:53:55 admin kernel: Process tar (pid: 7112, stackpage=e5e63000)
Jun  5 16:53:55 admin kernel: Stack: c342b068 c342b070 c342b060 00000246 00000020 e7420000 c01382eb c342b060
Jun  5 16:53:55 admin kernel:        c3461000 00000020 00000000 c342bdb8 00000000 e7420000 c013749c c342b060
Jun  5 16:53:55 admin kernel:        00000020 d3d05ec0 00000003 00000020 c342bdb8 00000246 00000020 e5e63b14
Jun  5 16:53:55 admin kernel: Call Trace:    [__kmem_cache_alloc+107/304] [kmem_cache_grow+508/624] [__kmem_cache_alloc+125/304] [get_mem_for_virtual_node+87/224] [fix_nodes+198/1008]
Jun  5 16:53:55 admin kernel: Call Trace:    [<c01382eb>] [<c013749c>] [<c01382fd>] [<c01846a7>] [<c0184bc6>]
Jun  5 16:53:55 admin kernel:   [reiserfs_paste_into_item+147/304] [reiserfs_get_block+1989/4800] [bh_action+106/112] [tasklet_hi_action+83/160] [smp_apic_timer_interrupt+264/304] [.text.lock.buffer+191/610]
Jun  5 16:53:55 admin kernel:   [<c0191ae3>] [<c017cca5>] [<c012252a>] [<c01223b3>] [<c0115d88>] [<c01474bd>]
Jun  5 16:53:55 admin kernel:   [getblk+109/128] [is_tree_node+100/112] [search_by_key+1824/3792] [__block_prepare_write+479/880] [block_prepare_write+51/144] [reiserfs_get_block+0/4800]
Jun  5 16:53:55 admin kernel:   [<c014447d>] [<c018e8f4>] [<c018f020>] [<c014503f>] [<c0145a23>] [<c017c4e0>]
Jun  5 16:53:55 admin kernel:   [generic_file_write+970/2128] [reiserfs_get_block+0/4800] [sys_write+155/384] [system_call+51/56]
Jun  5 16:53:55 admin kernel:   [<c013397a>] [<c017c4e0>] [<c0141d8b>] [<c010782f>]
Jun  5 16:53:55 admin kernel: 
Jun  5 16:53:55 admin kernel: Code: 8b 44 81 18 0f af da 8b 51 0c 89 41 14 01 d3 40 0f 84 89 00


Does this help?

Regards,
Stephan

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2003-06-30 11:53 UTC | newest]

Thread overview: 64+ messages
     [not found] <20030507203025$6f60@gated-at.bofh.it>
     [not found] ` <20030509005011$6cee@gated-at.bofh.it>
     [not found]   ` <20030509101012$732a@gated-at.bofh.it>
     [not found]     ` <20030509122007$758f@gated-at.bofh.it>
     [not found]       ` <20030509131009$00f3@gated-at.bofh.it>
     [not found]         ` <20030611045008$03cf@gated-at.bofh.it>
     [not found]           ` <20030611203031$12de@gated-at.bofh.it>
     [not found]             ` <20030611211012$34cf@gated-at.bofh.it>
     [not found]               ` <20030613095017$1680@gated-at.bofh.it>
     [not found]                 ` <20030617210022$3e37@gated-at.bofh.it>
     [not found]                   ` <20030618111010$154f@gated-at.bofh.it>
2003-06-18 12:46                     ` Undo aic7xxx changes (now rc7+aic20030603) Pascal Schmidt
2003-06-18 12:49                       ` Stephan von Krawczynski
2003-05-07 20:22 Undo aic7xxx changes Marcelo Tosatti
2003-05-09  0:45 ` Justin T. Gibbs
2003-05-09 10:06   ` Stephan von Krawczynski
2003-05-09 12:06     ` Willy Tarreau
2003-05-09 13:02       ` Stephan von Krawczynski
2003-05-24 11:16         ` Willy Tarreau
2003-06-05 15:05           ` Undo aic7xxx changes (now rc7+aic20030603) Stephan von Krawczynski
2003-06-05 18:14             ` Willy Tarreau
2003-06-06  8:17               ` Oleg Drokin
2003-06-06  9:04                 ` Stephan von Krawczynski
2003-06-06  9:17                   ` Oleg Drokin
2003-06-08 10:15                     ` Stephan von Krawczynski
2003-06-08 11:19               ` Stephan von Krawczynski
2003-06-08 11:49                 ` Stephan von Krawczynski
2003-06-08 16:07                   ` Stephan von Krawczynski
2003-06-09 15:10                   ` Stephan von Krawczynski
2003-06-09 15:32                     ` Justin T. Gibbs
2003-06-10 10:23                       ` Stephan von Krawczynski
2003-06-10 15:38                         ` Justin T. Gibbs
2003-06-10 17:11                           ` Stephan von Krawczynski
2003-06-10 18:07                             ` Justin T. Gibbs
2003-06-11  0:51                               ` Stephan von Krawczynski
2003-06-11  4:39                                 ` Justin T. Gibbs
2003-06-11 20:23                                   ` Stephan von Krawczynski
2003-06-11 21:01                                     ` John Stoffel
2003-06-13  9:45                                       ` Stephan von Krawczynski
2003-06-15 12:56                                         ` Stephan von Krawczynski
2003-06-15 13:26                                           ` John Stoffel
2003-06-17 20:47                                         ` Marcelo Tosatti
2003-06-18 11:05                                           ` Stephan von Krawczynski
2003-06-18 14:21                                             ` John Stoffel
2003-06-18 14:54                                               ` Stephan von Krawczynski
2003-06-20 19:59                                             ` Marcelo Tosatti
2003-06-20 20:59                                               ` Kevin P. Fleming
2003-06-20 21:13                                                 ` Marcelo Tosatti
2003-06-20 22:03                                                   ` Willy Tarreau
2003-06-20 23:48                                                     ` Stephan von Krawczynski
2003-06-21 10:50                                                       ` Willy TARREAU
2003-06-22 19:00                                                         ` Stephan von Krawczynski
2003-06-23 11:30                                                         ` Stephan von Krawczynski
2003-06-24 11:11                                                           ` Stephan von Krawczynski
2003-06-24 17:43                                                             ` Willy Tarreau
2003-06-24 21:26                                                               ` Stephan von Krawczynski
2003-06-24 22:03                                                                 ` Willy Tarreau
2003-06-24 23:43                                                                   ` Stephan von Krawczynski
2003-06-25 19:16                                                                     ` Willy Tarreau
2003-06-25 19:42                                                                       ` Stephan von Krawczynski
2003-06-25 20:30                                                                         ` John Stoffel
2003-06-26  9:36                                                                           ` Stephan von Krawczynski
2003-06-26 11:34                                                                           ` Stephan von Krawczynski
2003-06-30 10:10                                                                             ` Stephan von Krawczynski
2003-06-30 11:39                                                                               ` Marcelo Tosatti
2003-06-30 12:08                                                                                 ` Stephan von Krawczynski
2003-06-25 23:04                                                                       ` Bernd Eckenfels
2003-06-25  2:22                                                                 ` Valdis.Kletnieks
2003-06-24 18:31                                                     ` Bill Davidsen
2003-06-12 13:54                                     ` Stephan von Krawczynski
2003-06-10  1:38                     ` Zwane Mwaikambo
2003-06-10 10:30                       ` Stephan von Krawczynski
2003-06-10 12:51                         ` Zwane Mwaikambo
2003-06-10 13:38                           ` Stephan von Krawczynski
2003-06-10 13:51                             ` Zwane Mwaikambo
2003-06-10 15:55                               ` Stephan von Krawczynski
2003-06-10 16:23                                 ` Oleg Drokin
2003-06-10 17:44                               ` Stephan von Krawczynski
2003-06-10 18:15                                 ` Zwane Mwaikambo
2003-06-10 23:55                                   ` Stephan von Krawczynski
2003-06-10 18:20                                 ` Zwane Mwaikambo
