* Undo aic7xxx changes @ 2003-05-07 20:22 Marcelo Tosatti 2003-05-09 0:45 ` Justin T. Gibbs 0 siblings, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2003-05-07 20:22 UTC (permalink / raw) To: lkml; +Cc: Justin T. Gibbs Hi, I've undone aic7xxx changes which were locking up some machines on initialization. The new driver is now named drivers/scsi/aic79xx and is under CONFIG_AIC79XX. Justin, unfortunately I can't even THINK about updating aic7xxx to your new driver at the current release stage. I will do so in the 2.4.22. The update also contains a PCI posting flush fix from Arjan. People, please test the driver. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-07 20:22 Undo aic7xxx changes Marcelo Tosatti @ 2003-05-09 0:45 ` Justin T. Gibbs 2003-05-09 10:06 ` Stephan von Krawczynski 0 siblings, 1 reply; 110+ messages in thread From: Justin T. Gibbs @ 2003-05-09 0:45 UTC (permalink / raw) To: Marcelo Tosatti, lkml > Hi, > > I've undone aic7xxx changes which were locking up some machines on > initialization. Hmm. It would have been nice to have the opportunity to fix this correctly. As it stands now, I have really no idea what people were testing or not since by taking Alan's patch you have lost the complete change history and the ability to step people through the changes. I have preserved this history in the bk send output that is available on my site if at some point that is useful to you. > The new driver is now named drivers/scsi/aic79xx and is under > CONFIG_AIC79XX. So we now have an extra copy of the assembler, the Config files, and the aiclib files. This is not a solution. If you wanted to selectively update the aic79xx driver, all you had to do was ask me for the requisite change sets. This is what a maintainer is for. > Justin, unfortunately I can't even THINK about updating aic7xxx to your > new driver at the current release stage. I will do so in the 2.4.22. Does this mean that you will actually take BK changes from me instead of from just about anyone else that sends you aic7xxx driver updates? I had pretty much given up on this. > The update also contains a PCI posting flush fix from Arjan. Which is completely unnecessary and in fact will cause hangs and crashes on many Dell servers. The "fix" for the VIA systems that violate the PCI spec is to either: 1) Update the driver correctly so that its detection logic will automatically disable memory mapped I/O for these broken systems. or 2) Just disable the BIOS options that configure the system to violate the PCI prefetching rules.
Slowing down all systems, even the ones that are *not broken* by doing extra, random, PCI read cycles is not a fix. If you want some verification of the Dell issue (which I'm sure will cause problems on other "fast" systems too), just ask Matt Domsh. Again, if you have concerns about the aic7xxx or aic79xx drivers, my mail box is always open. Waiting to contact me until the last minute where I can only sit on the sidelines and watch another train wreck is not the best way to ensure that the drivers function correctly in 2.4.X. What this basically boils down to is trust. If you don't trust me, tell me how I can build that trust. Without it, I can only continue to tell most people that contact me with bug reports, "It's already fixed in the official driver. You can pull the latest from ..." -- Justin ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-09 0:45 ` Justin T. Gibbs @ 2003-05-09 10:06 ` Stephan von Krawczynski 2003-05-09 12:06 ` Willy Tarreau 0 siblings, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-05-09 10:06 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: marcelo, linux-kernel On Thu, 08 May 2003 18:45:42 -0600 "Justin T. Gibbs" <gibbs@scsiguy.com> wrote: > > Hi, > > [...] > > Justin, unfortunately I can't even THINK about updating aic7xxx to your > > new driver at the current release stage. I will do so in the 2.4.22. > > [...] > Again, if you have concerns about the aic7xxx or aic79xx drivers, my > mail box is always open. Waiting to contact me until the last minute > where I can only sit on the sidelines and watch another train wreck is > not the best way to ensure that the drivers function correctly in 2.4.X. > > What this basically boils down to is trust. If you don't trust me, > tell me how I can build that trust. Without it, I can only continue > to tell most people that contact me with bug reports, "It's already > fixed in the official driver. You can pull the latest from ..." Justin, just to complete the picture: as I wrote some days ago concerning your hint to "use the latest from ..." your latest driver does not complete booting on (at least) my system but freezes - which I wrote to LKML. I have not yet heard anything about this issue. You cannot expect to include a newer driver which performs obviously worse in some cases. "Worse" here means "fails" and not "performs badly". Marcelo's decision on the topic looks pretty reasonable to me... Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-09 10:06 ` Stephan von Krawczynski @ 2003-05-09 12:06 ` Willy Tarreau 2003-05-09 13:02 ` Stephan von Krawczynski 0 siblings, 1 reply; 110+ messages in thread From: Willy Tarreau @ 2003-05-09 12:06 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: Justin T. Gibbs, marcelo, linux-kernel On Fri, May 09, 2003 at 12:06:48PM +0200, Stephan von Krawczynski wrote: > Justin, just to complete the picture: as I wrote some days ago concerning your > hint to "use the latest from ..." your latest driver does not complete booting > on (at least) my system but freezes - which I wrote to LKML. I have not yet > heard > anything about this issue. You cannot expect to include a newer driver which > performs obviously worse in some cases. > "Worse" here means "fails" and not "performs bad". Marcelos' decision on the > topic looks pretty reasonable to me... What's your setup ? Are you in SMP ? I was hit by a lock bug introduced near 6.2.30, which Justin fixed recently and included in his latest driver (20030502). Justin suggested to me to try the NMI watchdog to find what was wrong and it pointed us to a spinlock problem. Have you tried to debug something ? I must say that this driver seems really robust now on my setup (dual athlon), but perhaps your problem is of the same order and could be fixed easily with some help, which would be good for you and everyone else. Regards, Willy ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-09 12:06 ` Willy Tarreau @ 2003-05-09 13:02 ` Stephan von Krawczynski 2003-05-09 13:27 ` Willy Tarreau 0 siblings, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-05-09 13:02 UTC (permalink / raw) To: Willy Tarreau; +Cc: gibbs, marcelo, linux-kernel On Fri, 9 May 2003 14:06:59 +0200 Willy Tarreau <willy@w.ods.org> wrote: > On Fri, May 09, 2003 at 12:06:48PM +0200, Stephan von Krawczynski wrote: > > > Justin, just to complete the picture: as I wrote some days ago concerning > > your hint to "use the latest from ..." your latest driver does not complete > > booting on (at least) my system but freezes - which I wrote to LKML. I have > > not yet heard > > anything about this issue. You cannot expect to include a newer driver > > which performs obviously worse in some cases. > > "Worse" here means "fails" and not "performs bad". Marcelos' decision on > > the topic looks pretty reasonable to me... > > What's your setup ? Are you in SMP ? SMP PIII 1.4 GHz, dual Adaptec AIC-7899P U160/m (rev 01) > I was hit by a lock bug introduced near > 6.2.30, which Justin fixed recently and included in his latest driver > (20030502). Justin suggested to me to try the NMI watchdog to find what was > wrong and it pointed us to a spinlock problem. Have you tried to debug > something ? I cannot say which version of the driver it was, the only thing I can tell you is that the archive was called aic79xx-linux-2.4-20030410-tar.gz. > I must say that this driver seems really robust now on my setup > (dual athlon), but perhaps your problem is of the same order and could be > fixed easily with some help, which would be good for you and everyone else. I can't tell, basic problem in my setup is that it seems virtually impossible to bring some 100GB of data onto a streamer connected to the above aic. It crashes almost every day with a freeze and no oops or other message. 
I am at the moment willing to await 2.4.21 and see, and if that does not solve it, then I will probably go back to a dual symbios controller which I used before and never had any glitches with. This is a system in production and not particularly useful for debugging a lot and corresponding downtime. Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-09 13:02 ` Stephan von Krawczynski @ 2003-05-09 13:27 ` Willy Tarreau 2003-05-09 13:46 ` Stephan von Krawczynski 2003-05-09 14:11 ` Stephan von Krawczynski 0 siblings, 2 replies; 110+ messages in thread From: Willy Tarreau @ 2003-05-09 13:27 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: Willy Tarreau, gibbs, marcelo, linux-kernel On Fri, May 09, 2003 at 03:02:07PM +0200, Stephan von Krawczynski wrote: > I cannot say which version of the driver it was, the only thing I can tell you > is that the archive was called aic79xx-linux-2.4-20030410-tar.gz. That's really interesting, because I got the bug since around this version (20030417 IIRC), and it locked up only on SMP, sometimes during boot, or during heavy disk accesses caused by "updatedb" and "make -j dep". It's fixed in 20030502 from http://people.freebsd.org/~gibbs/linux/SRC/ > I can't tell, basic problem in my setup is that it seems virtually impossible > to bring some 100GB of data onto a streamer connected to the above aic. It > crashes almost every day with a freeze and no oops or other message. I had the same symptom which is very frustrating, I agree. I even had difficulties to catch the NMI watchdog output which was often truncated. > I am at the moment willing to await 2.4.21 and see, and if that does not solve it, Well, would you at least agree to retest current version from the above URL ? I find it a bit of a shame that the driver goes back in -rc stage. Marcelo, do you have some information about the setup from the people who reported hangs to you ? Perhaps we could even ask them to confirm that Justin's updated driver fixes their problems ? > This is a system in production and not particularly useful for debugging a lot > and correspoding downtime. I certainly can understand ;-) Regards, Willy ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-09 13:27 ` Willy Tarreau @ 2003-05-09 13:46 ` Stephan von Krawczynski 2003-05-09 14:56 ` Willy Tarreau 2003-05-09 14:11 ` Stephan von Krawczynski 1 sibling, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-05-09 13:46 UTC (permalink / raw) To: Willy Tarreau; +Cc: willy, gibbs, marcelo, linux-kernel On Fri, 9 May 2003 15:27:57 +0200 Willy Tarreau <willy@w.ods.org> wrote: > On Fri, May 09, 2003 at 03:02:07PM +0200, Stephan von Krawczynski wrote: > > > I cannot say which version of the driver it was, the only thing I can tell > > you is that the archive was called aic79xx-linux-2.4-20030410-tar.gz. > > That's really interesting, because I got the bug since around this version > (20030417 IIRC), and it locked up only on SMP, sometimes during boot, or > during heavy disk accesses caused by "updatedb" and "make -j dep". It's > fixed in 20030502 from http://people.freebsd.org/~gibbs/linux/SRC/ I tried to merge the latest aic archive into 2.4.21-rc2, besides the "usual" signed/unsigned warnings I got this one: aic7xxx_osm.c: In function `ahc_linux_map_seg': aic7xxx_osm.c:770: warning: integer constant is too large for "long" type FYI -- Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-09 13:46 ` Stephan von Krawczynski @ 2003-05-09 14:56 ` Willy Tarreau 2003-05-09 15:08 ` Arjan van de Ven ` (2 more replies) 0 siblings, 3 replies; 110+ messages in thread From: Willy Tarreau @ 2003-05-09 14:56 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: Willy Tarreau, gibbs, marcelo, linux-kernel On Fri, May 09, 2003 at 03:46:37PM +0200, Stephan von Krawczynski wrote: > On Fri, 9 May 2003 15:27:57 +0200 > Willy Tarreau <willy@w.ods.org> wrote: > > > On Fri, May 09, 2003 at 03:02:07PM +0200, Stephan von Krawczynski wrote: > > > > > I cannot say which version of the driver it was, the only thing I can tell > > > you is that the archive was called aic79xx-linux-2.4-20030410-tar.gz. > > > > That's really interesting, because I got the bug since around this version > > (20030417 IIRC), and it locked up only on SMP, sometimes during boot, or > > during heavy disk accesses caused by "updatedb" and "make -j dep". It's > > fixed in 20030502 from http://people.freebsd.org/~gibbs/linux/SRC/ > > I tried to merge the latest aic archive into 2.4.21-rc2, besides the "usual" > signed/unsigned warnings I got this one: > > aic7xxx_osm.c: In function `ahc_linux_map_seg': > aic7xxx_osm.c:770: warning: integer constant is too large for "long" type Good catch, but in fact, it's more this line which worries me : 758: if ((addr ^ (addr + len - 1)) & ~0xFFFFFFFF) { I don't see how ~0xFFFFFFFF can be non-null on 32 bits archs, because addr is a bus_addr_t which is in turn dma_addr_t which itself is u32. So unless I don't find the trick this would mean that this code should never be executed. Perhaps ~0xFFFFFFFFULL would be more appropriate, or even >0xFFFFFFFF, since this can be detected with u32 using the carry left by the addition. Regards, Willy ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-09 14:56 ` Willy Tarreau @ 2003-05-09 15:08 ` Arjan van de Ven 2003-05-09 16:27 ` Willy Tarreau 2003-05-09 15:18 ` Andreas Schwab 2003-05-09 15:19 ` William Lee Irwin III 2 siblings, 1 reply; 110+ messages in thread From: Arjan van de Ven @ 2003-05-09 15:08 UTC (permalink / raw) To: Willy Tarreau; +Cc: marcelo, linux-kernel > ull on 32 bits archs, because addr is > a bus_addr_t which is in turn dma_addr_t which itself is u32. So unless I don't > find the trick this would mean that this code should never be executed. Perhaps dma_addr_t is either u32 or u64 on x86 ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-09 15:08 ` Arjan van de Ven @ 2003-05-09 16:27 ` Willy Tarreau 0 siblings, 0 replies; 110+ messages in thread From: Willy Tarreau @ 2003-05-09 16:27 UTC (permalink / raw) To: Arjan van de Ven; +Cc: Willy Tarreau, marcelo, linux-kernel On Fri, May 09, 2003 at 05:08:03PM +0200, Arjan van de Ven wrote: > > ull on 32 bits archs, because addr is > > a bus_addr_t which is in turn dma_addr_t which itself is u32. So unless I don't > > find the trick this would mean that this code should never be executed. Perhaps > > dma_addr_t is either u32 or u64 on x86 Yes Arjan, but it's u64 only if CONFIG_HIGHMEM is set. So I repost my question in another way : is this code supposed to be executed when CONFIG_HIGHMEM=n since (u32)(~0xFFFFFFFF) = 0 ? Regards, Willy ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-09 14:56 ` Willy Tarreau 2003-05-09 15:08 ` Arjan van de Ven @ 2003-05-09 15:18 ` Andreas Schwab 2003-05-09 15:19 ` William Lee Irwin III 2 siblings, 0 replies; 110+ messages in thread From: Andreas Schwab @ 2003-05-09 15:18 UTC (permalink / raw) To: Willy Tarreau; +Cc: Stephan von Krawczynski, gibbs, marcelo, linux-kernel Willy Tarreau <willy@w.ods.org> writes: |> On Fri, May 09, 2003 at 03:46:37PM +0200, Stephan von Krawczynski wrote: |> > On Fri, 9 May 2003 15:27:57 +0200 |> > Willy Tarreau <willy@w.ods.org> wrote: |> > |> > > On Fri, May 09, 2003 at 03:02:07PM +0200, Stephan von Krawczynski wrote: |> > > |> > > > I cannot say which version of the driver it was, the only thing I can tell |> > > > you is that the archive was called aic79xx-linux-2.4-20030410-tar.gz. |> > > |> > > That's really interesting, because I got the bug since around this version |> > > (20030417 IIRC), and it locked up only on SMP, sometimes during boot, or |> > > during heavy disk accesses caused by "updatedb" and "make -j dep". It's |> > > fixed in 20030502 from http://people.freebsd.org/~gibbs/linux/SRC/ |> > |> > I tried to merge the latest aic archive into 2.4.21-rc2, besides the "usual" |> > signed/unsigned warnings I got this one: |> > |> > aic7xxx_osm.c: In function `ahc_linux_map_seg': |> > aic7xxx_osm.c:770: warning: integer constant is too large for "long" type |> |> Good catch, but in fact, it's more this line which worries me : |> |> 758: if ((addr ^ (addr + len - 1)) & ~0xFFFFFFFF) { |> |> I don't see how ~0xFFFFFFFF can be non-null on 32 bits archs It will always be zero even on 64 bit archs, because ~0xFFFFFFFF is of type unsigned int. The context doesn't matter. Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." 
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-09 14:56 ` Willy Tarreau 2003-05-09 15:08 ` Arjan van de Ven 2003-05-09 15:18 ` Andreas Schwab @ 2003-05-09 15:19 ` William Lee Irwin III 2 siblings, 0 replies; 110+ messages in thread From: William Lee Irwin III @ 2003-05-09 15:19 UTC (permalink / raw) To: Willy Tarreau; +Cc: Stephan von Krawczynski, gibbs, marcelo, linux-kernel On Fri, May 09, 2003 at 04:56:21PM +0200, Willy Tarreau wrote: > I don't see how ~0xFFFFFFFF can be non-null on 32 bits archs, because addr is > a bus_addr_t which is in turn dma_addr_t which itself is u32. So unless I don't > find the trick this would mean that this code should never be executed. Perhaps > ~0xFFFFFFFFULL would be more appropriate, or even >0xFFFFFFFF, since this can be > detected with u32 using the carry left by the addition. include/asm-i386/types.h line 55 #ifdef CONFIG_HIGHMEM typedef u64 dma_addr_t; #else typedef u32 dma_addr_t; #endif -- wli ^ permalink raw reply [flat|nested] 110+ messages in thread
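The point made in the replies above about the mask can be checked in isolation. What follows is only a minimal sketch, not code from the driver: it assumes a platform where unsigned int is 32 bits, it uses a 64-bit address type to stand in for the CONFIG_HIGHMEM case of dma_addr_t, and the names bus_addr64_t, crosses_4g_unsuffixed and crosses_4g_ull are hypothetical, chosen for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for dma_addr_t when it is 64 bits wide
 * (the CONFIG_HIGHMEM case quoted from include/asm-i386/types.h). */
typedef uint64_t bus_addr64_t;

/* Same expression as the line quoted in the thread. With a 32-bit
 * unsigned int, the constant 0xFFFFFFFF has type unsigned int, so
 * ~0xFFFFFFFF is already 0 before it is widened to 64 bits, and the
 * whole test can never be true. */
static int crosses_4g_unsuffixed(bus_addr64_t addr, uint64_t len)
{
    return ((addr ^ (addr + len - 1)) & ~0xFFFFFFFF) != 0;
}

/* Variant with the ULL suffix suggested in the thread: the constant is
 * 64 bits wide, so the complement keeps the upper 32 bits set and the
 * test fires when the first and last byte differ above bit 31. */
static int crosses_4g_ull(bus_addr64_t addr, uint64_t len)
{
    return ((addr ^ (addr + len - 1)) & ~0xFFFFFFFFULL) != 0;
}
```

For a segment starting at 0xFFFFF000 with a length of 0x2000, the last byte lies at 0x100000FFF, beyond the 4 GB boundary. Under the stated assumption, crosses_4g_unsuffixed() reports 0 for that segment while crosses_4g_ull() reports 1. With a 32-bit address type neither variant can fire, since no upper bits exist to compare.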
* Re: Undo aic7xxx changes 2003-05-09 13:27 ` Willy Tarreau 2003-05-09 13:46 ` Stephan von Krawczynski @ 2003-05-09 14:11 ` Stephan von Krawczynski 2003-05-09 14:57 ` Willy Tarreau 1 sibling, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-05-09 14:11 UTC (permalink / raw) To: Willy Tarreau; +Cc: willy, gibbs, marcelo, linux-kernel On Fri, 9 May 2003 15:27:57 +0200 Willy Tarreau <willy@w.ods.org> wrote: > Well, would you at least agree to retest current version from the above URL ? > I find it a bit of a shame that the driver goes back in -rc stage. Ok, I can tell you at least this: it boots. Just did it. I can tell tomorrow how it behaves with my specific problem. This is a setup with 2.4.21-rc2 and aic79xx-linux-2.4-20030502-tar.gz. -- Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-09 14:11 ` Stephan von Krawczynski @ 2003-05-09 14:57 ` Willy Tarreau 2003-05-12 9:02 ` Stephan von Krawczynski 0 siblings, 1 reply; 110+ messages in thread From: Willy Tarreau @ 2003-05-09 14:57 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: Willy Tarreau, gibbs, marcelo, linux-kernel On Fri, May 09, 2003 at 04:11:06PM +0200, Stephan von Krawczynski wrote: > On Fri, 9 May 2003 15:27:57 +0200 > Willy Tarreau <willy@w.ods.org> wrote: > > > Well, would you at least agree to retest current version from the above URL ? > > I find it a bit of a shame that the driver goes back in -rc stage. > > Ok, I can tell you at least this: it boots. Just did it. I can tell tomorrow > how it behaves with my specific problem. Thanks for having tried ;-) Willy ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-09 14:57 ` Willy Tarreau @ 2003-05-12 9:02 ` Stephan von Krawczynski 2003-05-12 15:43 ` Marc-Christian Petersen ` (2 more replies) 0 siblings, 3 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-05-12 9:02 UTC (permalink / raw) To: Willy Tarreau; +Cc: willy, gibbs, marcelo, linux-kernel On Fri, 9 May 2003 16:57:38 +0200 Willy Tarreau <willy@w.ods.org> wrote: > On Fri, May 09, 2003 at 04:11:06PM +0200, Stephan von Krawczynski wrote: > > On Fri, 9 May 2003 15:27:57 +0200 > > Willy Tarreau <willy@w.ods.org> wrote: > > > > > Well, would you at least agree to retest current version from the above > > > URL ? I find it a bit of a shame that the driver goes back in -rc stage. > > > > Ok, I can tell you at least this: it boots. Just did it. I can tell > > tomorrow how it behaves with my specific problem. > > Thanks for having tried ;-) Hello all, I have tried 2.4.21-rc2 with aic79xx-linux-2.4-20030502-tar.gz for three days now and have to say it performs well. I had no freezes any more and nothing weird happening. Everything is smooth and ok. This is the best performance I have seen comparing all 2.4.21-X versions tested. Thanks a lot. I will proceed with further stress tests... Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-12 9:02 ` Stephan von Krawczynski @ 2003-05-12 15:43 ` Marc-Christian Petersen 2003-05-12 17:25 ` Willy Tarreau 2003-05-23 10:38 ` Stephan von Krawczynski 2 siblings, 0 replies; 110+ messages in thread From: Marc-Christian Petersen @ 2003-05-12 15:43 UTC (permalink / raw) To: Stephan von Krawczynski, Willy Tarreau Cc: willy, gibbs, marcelo, linux-kernel On Monday 12 May 2003 11:02, Stephan von Krawczynski wrote: > I have tried 2.4.21-rc2 with aic79xx-linux-2.4-20030502-tar.gz for three > days now and have to say it performs well. I had no freezes any more and > nothing weird happening. Everything is smooth and ok. This is the best > performance I have seen comparing all 2.4.21-X versions tested. > > Thanks a lot. same here. 0 Problems at all. ciao, Marc ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-12 9:02 ` Stephan von Krawczynski 2003-05-12 15:43 ` Marc-Christian Petersen @ 2003-05-12 17:25 ` Willy Tarreau 2003-05-23 10:38 ` Stephan von Krawczynski 2 siblings, 0 replies; 110+ messages in thread From: Willy Tarreau @ 2003-05-12 17:25 UTC (permalink / raw) To: Stephan von Krawczynski, marcelo; +Cc: Willy Tarreau, gibbs, linux-kernel Hi All, On Mon, May 12, 2003 at 11:02:18AM +0200, Stephan von Krawczynski wrote: > I have tried 2.4.21-rc2 with aic79xx-linux-2.4-20030502-tar.gz for three days > now and have to say it performs well. I had no freezes any more and nothing > weird happening. Everything is smooth and ok. This is the best performance I > have seen comparing all 2.4.21-X versions tested. Same here, it seems rock solid on my dual athlon and has survived several hours of 5 simultaneous make -j 8 bzImage modules with swapping. Definitely the most stable for me since I've switched from Doug's to Justin's driver. Marcelo, would it be unreasonable to include it in -rc3 ? After all, it would not be a radical update, since it was removed from -rc2 ? Just a few bug fixes. What do you think ? Regards, Willy ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-12 9:02 ` Stephan von Krawczynski 2003-05-12 15:43 ` Marc-Christian Petersen 2003-05-12 17:25 ` Willy Tarreau @ 2003-05-23 10:38 ` Stephan von Krawczynski 2003-05-23 12:58 ` Justin T. Gibbs 2003-05-23 18:30 ` Undo aic7xxx changes Marcelo Tosatti 2 siblings, 2 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-05-23 10:38 UTC (permalink / raw) To: willy; +Cc: gibbs, marcelo, linux-kernel On Mon, 12 May 2003 11:02:18 +0200 Stephan von Krawczynski <skraw@ithnet.com> wrote: > On Fri, 9 May 2003 16:57:38 +0200 > Willy Tarreau <willy@w.ods.org> wrote: > > > On Fri, May 09, 2003 at 04:11:06PM +0200, Stephan von Krawczynski wrote: > > > On Fri, 9 May 2003 15:27:57 +0200 > > > Willy Tarreau <willy@w.ods.org> wrote: > > > > > > > Well, would you at least agree to retest current version from the above > > > > URL ? I find it a bit of a shame that the driver goes back in -rc > > > > stage. > > > > > > Ok, I can tell you at least this: it boots. Just did it. I can tell > > > tomorrow how it behaves with my specific problem. > > > > Thanks for having tried ;-) > > Hello all, > > I have tried 2.4.21-rc2 with aic79xx-linux-2.4-20030502-tar.gz for three days > now and have to say it performs well. I had no freezes any more and nothing > weird happening. Everything is smooth and ok. This is the best performance I > have seen comparing all 2.4.21-X versions tested. > > Thanks a lot. > > I will proceed with further stress tests... Ok. I managed to crash the tested machine after 14 days now. The crash itself is exactly like former 2.4.21-X. It just freezes, no oops no nothing. It looks like things got better, but not solved. Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-23 10:38 ` Stephan von Krawczynski @ 2003-05-23 12:58 ` Justin T. Gibbs 2003-05-23 13:11 ` Stephan von Krawczynski 2003-05-23 19:57 ` Willy Tarreau 2003-05-23 18:30 ` Undo aic7xxx changes Marcelo Tosatti 1 sibling, 2 replies; 110+ messages in thread From: Justin T. Gibbs @ 2003-05-23 12:58 UTC (permalink / raw) To: Stephan von Krawczynski, willy; +Cc: marcelo, linux-kernel > Ok. I managed to crash the tested machine after 14 days now. The crash itself > is exactly like former 2.4.21-X. It just freezes, no oops no nothing. It looks > like things got better, but not solved. What is telling you that the freeze is SCSI related? Are you running with the nmi watchdog and have a trace? Do you have driver messages that you aren't sharing? -- Justin ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-23 12:58 ` Justin T. Gibbs @ 2003-05-23 13:11 ` Stephan von Krawczynski 2003-05-23 19:57 ` Willy Tarreau 1 sibling, 0 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-05-23 13:11 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: willy, marcelo, linux-kernel On Fri, 23 May 2003 06:58:41 -0600 "Justin T. Gibbs" <gibbs@scsiguy.com> wrote: > > Ok. I managed to crash the tested machine after 14 days now. The crash > > itself is exactly like former 2.4.21-X. It just freezes, no oops no > > nothing. It looks like things got better, but not solved. > > What is telling you that the freeze is SCSI related? Are you running > with the nmi watchdog and have a trace? Do you have driver messages > that you aren't sharing? Hello Justin, to make that clear: I am in no way sure _what_ is causing the problem. I am only updating the (very little) information I gave/could give during the last weeks. From what I have seen so far I would say your driver patch (URL already sent several times) made things better. This obviously does not mean that the kernel-included aic-driver is the sole cause of the troubles. I am in fact very pleased that rc2/aic-20030502 made things quite noticeably better than every 21-rc/pre before. What I am giving is positive feedback, but I have as few logs for it as I had for the very negative report I sent some time ago. Anyway, I am continuing with stress-tests on rc3/aic-20030520. Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-23 12:58 ` Justin T. Gibbs 2003-05-23 13:11 ` Stephan von Krawczynski @ 2003-05-23 19:57 ` Willy Tarreau 2003-05-24 10:52 ` Stephan von Krawczynski 1 sibling, 1 reply; 110+ messages in thread From: Willy Tarreau @ 2003-05-23 19:57 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: Stephan von Krawczynski, willy, marcelo, linux-kernel Hello ! On Fri, May 23, 2003 at 06:58:41AM -0600, Justin T. Gibbs wrote: > > Ok. I managed to crash the tested machine after 14 days now. The crash itself > > is exactly like former 2.4.21-X. It just freezes, no oops no nothing. It looks > > like things got better, but not solved. > > What is telling you that the freeze is SCSI related? Are you running > with the nmi watchdog and have a trace? Do you have driver messages > that you aren't sharing? Stephen, Justin is right, you should run it through the NMI watchdog, in the hope to find something useful. If it hangs again in 14 days, you won't know why and that may be frustrating. With the NMI watchdog, you at least have a chance to see where it locks up, and you may find it to be within the driver, which would help Justin stabilize it, or within any other kernel subsystem. I had to use nmi_watchdog=2 at boot time, but other people use 1. Regards, Willy ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-23 19:57 ` Willy Tarreau @ 2003-05-24 10:52 ` Stephan von Krawczynski 2003-05-24 11:16 ` Willy Tarreau 0 siblings, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-05-24 10:52 UTC (permalink / raw) To: Willy Tarreau; +Cc: gibbs, willy, marcelo, linux-kernel On Fri, 23 May 2003 21:57:57 +0200 Willy Tarreau <willy@w.ods.org> wrote: > Hello ! > > On Fri, May 23, 2003 at 06:58:41AM -0600, Justin T. Gibbs wrote: > > > Ok. I managed to crash the tested machine after 14 days now. The crash > > > itself is exactly like former 2.4.21-X. It just freezes, no oops no > > > nothing. It looks like things got better, but not solved. > > > > What is telling you that the freeze is SCSI related? Are you running > > with the nmi watchdog and have a trace? Do you have driver messages > > that you aren't sharing? > > Stephen, > > Justin is right, you should run it through the NMI watchdog, in the hope to > find something useful. If it hangs again in 14 days, you won't know why and > that may be frustrating. With the NMI watchdog, you at least have a chance to > see where it locks up, and you may find it to be within the driver, which > would help Justin stabilize it, or within any other kernel subsystem. > > I had to use nmi_watchdog=2 at boot time, but other people use 1. > > Regards, > Willy Hello Willy, I will do that, but I am not so confident about this, because the box runs X and a console oops output from nmi may as well not be visible nor written to disk. Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-24 10:52 ` Stephan von Krawczynski @ 2003-05-24 11:16 ` Willy Tarreau 2003-05-25 10:58 ` Stephan von Krawczynski 2003-06-05 15:05 ` Undo aic7xxx changes (now rc7+aic20030603) Stephan von Krawczynski 0 siblings, 2 replies; 110+ messages in thread From: Willy Tarreau @ 2003-05-24 11:16 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: Willy Tarreau, gibbs, marcelo, linux-kernel On Sat, May 24, 2003 at 12:52:52PM +0200, Stephan von Krawczynski wrote: > On Fri, 23 May 2003 21:57:57 +0200 > Willy Tarreau <willy@w.ods.org> wrote: > > > Hello ! > > > > On Fri, May 23, 2003 at 06:58:41AM -0600, Justin T. Gibbs wrote: > > > > Ok. I managed to crash the tested machine after 14 days now. The crash > > > > itself is exactly like former 2.4.21-X. It just freezes, no oops no > > > > nothing. It looks like things got better, but not solved. > > > > > > What is telling you that the freeze is SCSI related? Are you running > > > with the nmi watchdog and have a trace? Do you have driver messages > > > that you aren't sharing? > > > > Stephen, > > > > Justin is right, you should run it through the NMI watchdog, in the hope to > > find something useful. If it hangs again in 14 days, you won't know why and > > that may be frustrating. With the NMI watchdog, you at least have a chance to > > see where it locks up, and you may find it to be within the driver, which > > would help Justin stabilize it, or within any other kernel subsystem. > > > > I had to use nmi_watchdog=2 at boot time, but other people use 1. > > > > Regards, > > Willy > > Hello Willy, > > I will do that, but I am not so confident about this, because the box runs X > and a console oops output from nmi may as well not be visible nor written to > disk. OK, I understand. 
Other options are : serial console (worked for me after several retries), remote syslogd (sometimes works if the system can still schedule a bit), or patches such as netconsole, which sends the logs to a remote host, and kmsgdump which tries to get them onto a floppy after a panic or a forced dump. Regards, Willy ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-24 11:16 ` Willy Tarreau @ 2003-05-25 10:58 ` Stephan von Krawczynski 2003-05-25 12:35 ` Willy TARREAU ` (2 more replies) 2003-06-05 15:05 ` Undo aic7xxx changes (now rc7+aic20030603) Stephan von Krawczynski 1 sibling, 3 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-05-25 10:58 UTC (permalink / raw) To: Willy Tarreau; +Cc: willy, gibbs, marcelo, linux-kernel On Sat, 24 May 2003 13:16:08 +0200 Willy Tarreau <willy@w.ods.org> wrote: > > Hello Willy, > > > > I will do that, but I am not so confident about this, because the box runs > > X and a console oops output from nmi may as well not be visible nor written > > to disk. > > OK, I understand. Other options are : serial console (worked for me after > several retries), remote syslogd (sometimes works if the system can still > schedule a bit), or patches such as netconsole, which sends the logs to a > remote host, and kmsgdump which tries to get them onto a floppy after a > panic or a forced dump. > > Regards, > Willy Hello all, it did not take really long for rc3+aic20030520 to freeze - exactly one day. Though I used nmi_watchdog there are no presentable outputs. As I expected the screen simply is black and no messages are in any logfiles. Again it froze while tar-ing about 80 GB of data onto an aic-driven SDLT. Data is coming from IDE drive connected to a 3ware 7500-8 (though no raid configuration). I conclude that rc2+aic20030502 was way better. Ah yes, one more thing: I can ping the box, but keyboard, mouse, display is dead and usually working processes stopped (like snmp). Willy: I am willing to try a serial console setup (as it does not interfere with X). I have tried this before with no luck. Can you provide some hints how you got that working (yes, I read Documentation/serial-console.txt, but I could not manage any output on the serial line). Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-25 10:58 ` Stephan von Krawczynski @ 2003-05-25 12:35 ` Willy TARREAU 2003-05-25 12:47 ` Marc-Christian Petersen 2003-05-25 18:30 ` Justin T. Gibbs 2 siblings, 0 replies; 110+ messages in thread From: Willy TARREAU @ 2003-05-25 12:35 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: Willy Tarreau, gibbs, marcelo, linux-kernel Hello ! On Sun, May 25, 2003 at 12:58:11PM +0200, Stephan von Krawczynski wrote: > it did not take really long for rc3+aic20030520 to freeze - exactly one day. Well, in some ways, it will be easier to debug it than when it took 14 days, if it's the same bug, of course. > Though I used nmi_watchdog there are no presentable outputs. As I expected the > screen simply is black and no messages are in any logfiles. > Again it froze while tar-ing about 80 GB of data onto an aic-driven SDLT. Data > is coming from IDE drive connected to a 3ware 7500-8 (though no raid > configuration). OK, so there's a high probability that the problem is related to either SCSI or IDE (or both), and it is less likely to involve any other part. > Ah yes, one more thing: I can ping the box, but keyboard, mouse, display is > dead and usually working processes stopped (like snmp). That's surprising, mine was completely dead IIRC. It's like it doesn't schedule anymore but still processes interrupts. I don't know if a deadlock can cause this behaviour. > Willy: I am willing to try a serial console setup (as it does not interfere > with X). I have tried this before with no luck. Can you provide some hints how > you got that working (yes, I read Documentation/serial-console.txt, but I could > not manage any output on the serial line). I had to try several times, because the freeze was so sudden that I often caught only a few chars. Justin didn't even believe me. First, you have to check that CONFIG_SERIAL_CONSOLE is enabled. After that, you'll need a remote console which can work at high speeds (I could get interesting results at 38400 bps).
Surprisingly, above that speed I got mangled output. Perhaps my cable wasn't good enough (flat cisco RJ45 console cable). I also disabled hard and soft flow control. But as I already stated, in my case it was easier because it froze every 2-3 boots, and when it didn't I only had to start a "make -j dep" to get it. So if I got frozen with no messages, I simply hit the reset button and tried again. It seems more complicated in your case (although your big tar may be helping). When your setup seems OK, you should test it to be sure. I often use "mdir" with nothing in the drive, or AltGr-SysRq-P to get console messages. If you don't see anything on your serial console, then your setup is not ready yet for a test. Oh and by the way, if you're using modules, you may find it useful to keep copies of lsmod output and /proc/ksyms to get a more accurate decoding when you run ksymoops later. If you really cannot catch anything, I suggest one of these solutions : - apply the netconsole patch and have a linux box on the same LAN with the netconsole server. You can find it in -aa kernels for example. - apply the kmsgdump patch, only if you have a floppy drive or a parallel printer. It will try to reset the system after a panic, and use BIOS calls to write the kernel messages buffer on the media. This usually works, but there are some corner cases where it doesn't. But it's easy to try with AltGr-SysRq-D. Download it from http://w.ods.org/tools/kmsgdump/ Good luck ! Willy ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-25 10:58 ` Stephan von Krawczynski 2003-05-25 12:35 ` Willy TARREAU @ 2003-05-25 12:47 ` Marc-Christian Petersen 2003-05-25 13:50 ` Stephan von Krawczynski 2003-05-26 15:00 ` Stephan von Krawczynski 2003-05-25 18:30 ` Justin T. Gibbs 2 siblings, 2 replies; 110+ messages in thread From: Marc-Christian Petersen @ 2003-05-25 12:47 UTC (permalink / raw) To: Stephan von Krawczynski, Willy Tarreau; +Cc: willy, gibbs, linux-kernel On Sunday 25 May 2003 12:58, Stephan von Krawczynski wrote: Hi Stephan, > Though I used nmi_watchdog there are no presentable outputs. As I expected > the screen simply is black and no messages are in any logfiles. > Again it froze while tar-ing about 80 GB of data onto an aic-driven SDLT. > Data is coming from IDE drive connected to a 3ware 7500-8 (though no raid > configuration). > > I conclude that rc2+aic20030502 was way better. > Ah yes, one more thing: I can ping the box, but keyboard, mouse, display is > dead and usually working processes stopped (like snmp). > Willy: I am willing to try a serial console setup (as it does not interfere > with X). I have tried this before with no luck. Can you provide some hints > how you got that working (yes, I read Documentation/serial-console.txt, but > I could not manage any output on the serial line). before trying this, could you please update to aic20030523? Thank you. ciao, Marc ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-25 12:47 ` Marc-Christian Petersen @ 2003-05-25 13:50 ` Stephan von Krawczynski 2003-05-25 14:01 ` Marc-Christian Petersen 2003-05-25 14:03 ` Geller Sandor 2003-05-26 15:00 ` Stephan von Krawczynski 1 sibling, 2 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-05-25 13:50 UTC (permalink / raw) To: Marc-Christian Petersen; +Cc: willy, gibbs, linux-kernel On Sun, 25 May 2003 14:47:56 +0200 Marc-Christian Petersen <m.c.p@wolk-project.de> wrote: > On Sunday 25 May 2003 12:58, Stephan von Krawczynski wrote: > > Hi Stephan, > before trying this, could you please update to aic20030523? Thank you. Is there a changelog somewhere? What is the difference between 20030520 and 20030523 ? Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-25 13:50 ` Stephan von Krawczynski @ 2003-05-25 14:01 ` Marc-Christian Petersen 2003-05-25 14:03 ` Geller Sandor 1 sibling, 0 replies; 110+ messages in thread From: Marc-Christian Petersen @ 2003-05-25 14:01 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: willy, gibbs, linux-kernel On Sunday 25 May 2003 15:50, Stephan von Krawczynski wrote: Hi Stephan, > > before trying this, could you please update to aic20030523? Thank you. > Is there a changelog somewhere? What is the difference between 20030520 and > 20030523 ? yes, there is a changelog. Unfortunately in the tar.gz package because the one on Justins website isn't up2date. I've made it available on my website. http://wolk.sf.net/tmp/AIC-CHANGELOG ciao, Marc ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-25 13:50 ` Stephan von Krawczynski 2003-05-25 14:01 ` Marc-Christian Petersen @ 2003-05-25 14:03 ` Geller Sandor 1 sibling, 0 replies; 110+ messages in thread From: Geller Sandor @ 2003-05-25 14:03 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: linux-kernel On Sun, 25 May 2003, Stephan von Krawczynski wrote: > Is there a changelog somewhere? What is the difference between 20030520 > and 20030523 ? See drivers/scsi/aic7xxx/CHANGELOG Geller Sandor <wildy@petra.hos.u-szeged.hu> ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-25 12:47 ` Marc-Christian Petersen 2003-05-25 13:50 ` Stephan von Krawczynski @ 2003-05-26 15:00 ` Stephan von Krawczynski 2003-05-26 16:44 ` Willy Tarreau 1 sibling, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-05-26 15:00 UTC (permalink / raw) To: Marc-Christian Petersen; +Cc: willy, gibbs, linux-kernel, marcelo On Sun, 25 May 2003 14:47:56 +0200 Marc-Christian Petersen <m.c.p@wolk-project.de> wrote: > On Sunday 25 May 2003 12:58, Stephan von Krawczynski wrote: > > Hi Stephan, > before trying this, could you please update to aic20030523? Thank you. > > > ciao, Marc Hello Marc, I did this. The combination rc3+aic20030523 survived the first day of tests. So it seems at least better than rc3+aic20030520. I'll keep you informed. Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-26 15:00 ` Stephan von Krawczynski @ 2003-05-26 16:44 ` Willy Tarreau 2003-05-30 8:09 ` Stephan von Krawczynski 0 siblings, 1 reply; 110+ messages in thread From: Willy Tarreau @ 2003-05-26 16:44 UTC (permalink / raw) To: Stephan von Krawczynski Cc: Marc-Christian Petersen, willy, gibbs, linux-kernel, marcelo On Mon, May 26, 2003 at 05:00:58PM +0200, Stephan von Krawczynski wrote: > On Sun, 25 May 2003 14:47:56 +0200 > Marc-Christian Petersen <m.c.p@wolk-project.de> wrote: > > > On Sunday 25 May 2003 12:58, Stephan von Krawczynski wrote: > > > > Hi Stephan, > > before trying this, could you please update to aic20030523? Thank you. > > > > > > ciao, Marc > > Hello Marc, > > I did this. The combination rc3+aic20030523 survived the first day of tests. So > it seems at least better than rc3+aic20030520. The same has been running on my Alpha since yesterday evening on a 54GB raid0 which I transformed to raid5 (39 GB backed up to IDE ; mkraid ; 39GB restored). Still alive. Cheers, Willy ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-26 16:44 ` Willy Tarreau @ 2003-05-30 8:09 ` Stephan von Krawczynski 2003-05-30 8:19 ` Marc-Christian Petersen ` (3 more replies) 0 siblings, 4 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-05-30 8:09 UTC (permalink / raw) To: marcelo; +Cc: m.c.p, willy, gibbs, linux-kernel Hello Marcelo, I tried plain rc6 now and have to tell you it does not survive a single day of my usual tests. It freezes during tar from 3ware-driven IDE to aic-driven SDLT. This is identical to all previous rc (and some pre) releases of 2.4.21. So far I can tell you that the only thing that has recently cured this problem is replacing the aic-driver with latest of justins' releases. As plain rc6 does definitely not work I will now switch over to rc6+aic-20030523. Remember that rc3+aic-20030523 already worked quite ok (4 days test survived). My personal opinion is a known-to-be-broken 2.4.21 should not be released, as a lot of people only try/use the releases and therefore an immediately released 2.4.22-pre1 with justins driver will not be a good solution. Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-30 8:09 ` Stephan von Krawczynski @ 2003-05-30 8:19 ` Marc-Christian Petersen 2003-05-30 8:21 ` Arjan van de Ven ` (2 subsequent siblings) 3 siblings, 0 replies; 110+ messages in thread From: Marc-Christian Petersen @ 2003-05-30 8:19 UTC (permalink / raw) To: Stephan von Krawczynski, marcelo; +Cc: willy, gibbs, linux-kernel On Friday 30 May 2003 10:09, Stephan von Krawczynski wrote: Hi Stephan, > I tried plain rc6 now and have to tell you it does not survive a single day > of my usual tests. It freezes during tar from 3ware-driven IDE to > aic-driven SDLT. This is identical to all previous rc (and some pre) > releases of 2.4.21. So far I can tell you that the only thing that has > recently cured this problem is replacing the aic-driver with latest of > justins' releases. > As plain rc6 does definitely not work I will now switch over to > rc6+aic-20030523. Remember that rc3+aic-20030523 already worked quite ok (4 > days test survived). same experience on my boxen (quite much with AIC) > My personal opinion is a known-to-be-broken 2.4.21 should not be released, > as a lot of people only try/use the releases and therefore an immediately > released 2.4.22-pre1 with justins driver will not be a good solution. ACK! Maybe we should disable AIC Config option and instead add a comment like: comment 'For AICXXXX, please go to http://people.freebsd.org/~gibbs/linux/' comment 'and download the latest tar.gz and unpack these drivers!' comment 'After unpacking, enable Config.in option in drivers/scsi/Config.in' *scnr* ;) ciao, Marc ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-30 8:09 ` Stephan von Krawczynski 2003-05-30 8:19 ` Marc-Christian Petersen @ 2003-05-30 8:21 ` Arjan van de Ven 2003-05-30 8:51 ` Stephan von Krawczynski 2003-05-30 13:34 ` Jeff Garzik 2003-05-30 13:35 ` Jeff Garzik 3 siblings, 1 reply; 110+ messages in thread From: Arjan van de Ven @ 2003-05-30 8:21 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: marcelo, m.c.p, willy, gibbs, linux-kernel [-- Attachment #1: Type: text/plain, Size: 555 bytes --] > My personal opinion is a known-to-be-broken 2.4.21 should not be released, as a > lot of people only try/use the releases and therefore an immediately released > 2.4.22-pre1 with justins driver will not be a good solution. I think you missed the point entirely before. 2.4.21 CANNOT cause regressions most of all. At this point there is no way to know if the thing that fixes your machine breaks on 100s others that DO work correctly in 2.4.20. Even if it would fix 100s and break 1 it's still not acceptable for stable kernel releases. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-30 8:21 ` Arjan van de Ven @ 2003-05-30 8:51 ` Stephan von Krawczynski 0 siblings, 0 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-05-30 8:51 UTC (permalink / raw) To: arjanv; +Cc: marcelo, m.c.p, willy, gibbs, linux-kernel On 30 May 2003 10:21:33 +0200 Arjan van de Ven <arjanv@redhat.com> wrote: > > > > My personal opinion is a known-to-be-broken 2.4.21 should not be released, > > as a lot of people only try/use the releases and therefore an immediately > > released 2.4.22-pre1 with justins driver will not be a good solution. > > I think you missed the point entirely before. 2.4.21 CANNOT cause > regressions most of all. At this point there is no way to know if the > thing that fixes your machine breaks on 100s others that DO work > correctly in 2.4.20. Even if it would fix 100s and break 1 it's still > not acceptable for stable kernel releases. Unfortunately you miss my point (which is probably too simple to be clearly visible): I want to give some feedback on a topic/problem I am experiencing since _long_. I was _asked_ to do so. Additionally I am stating my _opinion_. I am _not_ telling anybody what to do. I am not in a position to do so. Very likely only _few_ people are in such a position, very likely the maintainer of aic and hopefully Marcelo. Have you read all available bug reports Justin got? If you have not, don't play with numbers. Another personal opinion: software development tends to make things possible that "cannot be". ;-) Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-30 8:09 ` Stephan von Krawczynski 2003-05-30 8:19 ` Marc-Christian Petersen 2003-05-30 8:21 ` Arjan van de Ven @ 2003-05-30 13:34 ` Jeff Garzik 2003-05-30 13:59 ` Stephan von Krawczynski 2003-05-30 13:35 ` Jeff Garzik 3 siblings, 1 reply; 110+ messages in thread From: Jeff Garzik @ 2003-05-30 13:34 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: marcelo, m.c.p, willy, gibbs, linux-kernel On Fri, May 30, 2003 at 10:09:00AM +0200, Stephan von Krawczynski wrote: > Hello Marcelo, > > I tried plain rc6 now and have to tell you it does not survive a single day of > my usual tests. It freezes during tar from 3ware-driven IDE to aic-driven SDLT. > This is identical to all previous rc (and some pre) releases of 2.4.21. So far > I can tell you that the only thing that has recently cured this problem is > replacing the aic-driver with latest of justins' releases. So Justin's driver fixes your 3ware problems??? And exactly what -rc/-pre release stopped working for you? Jeff ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-30 13:34 ` Jeff Garzik @ 2003-05-30 13:59 ` Stephan von Krawczynski 0 siblings, 0 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-05-30 13:59 UTC (permalink / raw) To: Jeff Garzik; +Cc: marcelo, m.c.p, willy, gibbs, linux-kernel On Fri, 30 May 2003 09:34:56 -0400 Jeff Garzik <jgarzik@pobox.com> wrote: > On Fri, May 30, 2003 at 10:09:00AM +0200, Stephan von Krawczynski wrote: > > Hello Marcelo, > > > > I tried plain rc6 now and have to tell you it does not survive a single day > > of my usual tests. It freezes during tar from 3ware-driven IDE to > > aic-driven SDLT. This is identical to all previous rc (and some pre) > > releases of 2.4.21. So far I can tell you that the only thing that has > > recently cured this problem is replacing the aic-driver with latest of > > justins' releases. > > So Justin's driver fixes your 3ware problems??? This is _no_ 3ware problem. As I told you data comes from 3ware and goes to aic. The problem occurs if using plain-version aic and is gone if using justins latest releases. As long as we do nothing with the aic driver there is no problem at all (3ware works fine here). > And exactly what -rc/-pre release stopped working for you? Very good question. I can check, but I need one day per version to check. It may well be that in fact none of the pre/rc releases worked, we have this box since about pre3 and to my knowledge we always had the problem. Boy, we were quite happy when we found out that Justins stuff got it going - it already got on our nerves quite a bit ;-) If you want to know about some special kernel release just tell me and I will try it. Maybe I should tell again details about the test setup as not all may remember in this long-lasting thread. Basically the problem seldomly arises after booting. I have the impression that this got in fact better over the releases, earlier pre's froze earlier. 
What we do:
1) copy around 50 - 100 GB of data via NFS to a 3ware drive (always works well)
2) tar this data on the NFS server from the 3ware drive to the aic(-driven) SDLT (Quantum)
3) verify the archived data via tar
Freezes happen during 2) or 3). If you reboot after 1) they are very rare, and have never occurred on any later rc release. As this whole thing takes time we do it overnight and have a look at the box next morning. Not a single plain release is OK the next morning. Checking the logs we find out it froze in 2) or 3). If you do exactly the same thing on exactly the same box with exactly the same data but with Justin's driver, everything is OK (aic-20030523). It was not OK with aic-20030520 (just to mention this); aic-20030502 was quite OK (survived 14 days). What else can I tell you? Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
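For readers who want to reproduce the archive-then-verify part of this procedure on a small scale, here is a minimal sketch. It is not the setup used in the thread: it writes to an ordinary file rather than a tape device, the function names (`archive`, `verify`, `sha1_of`) and the sample paths are made up for illustration, and it compares checksums instead of calling tar's own verify mode.

```python
import hashlib
import os
import tarfile
import tempfile


def sha1_of(fileobj, chunk=1 << 20):
    """Hash a file-like object in chunks so large files need little memory."""
    digest = hashlib.sha1()
    for block in iter(lambda: fileobj.read(chunk), b""):
        digest.update(block)
    return digest.hexdigest()


def archive(src_dir, tar_path):
    """Step 2: write every regular file under src_dir into an uncompressed tar."""
    with tarfile.open(tar_path, "w") as tar:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                full = os.path.join(root, name)
                tar.add(full, arcname=os.path.relpath(full, src_dir))


def verify(src_dir, tar_path):
    """Step 3: return the archive members whose content differs from src_dir."""
    mismatches = []
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            with open(os.path.join(src_dir, member.name), "rb") as on_disk:
                if sha1_of(tar.extractfile(member)) != sha1_of(on_disk):
                    mismatches.append(member.name)
    return mismatches


# Tiny stand-in for the 50 - 100 GB data set (hypothetical file names).
with tempfile.TemporaryDirectory() as work:
    src = os.path.join(work, "data")
    os.makedirs(os.path.join(src, "sub"))
    samples = [("a.bin", b"x" * 4096), (os.path.join("sub", "b.bin"), b"y" * 10)]
    for rel, payload in samples:
        with open(os.path.join(src, rel), "wb") as handle:
            handle.write(payload)
    tar_path = os.path.join(work, "backup.tar")
    archive(src, tar_path)
    print("mismatches:", verify(src, tar_path))
```

Run overnight in a loop against a large data set, a check like this at least tells you afterwards whether the archive that was written matches the source, independent of whether the box survived.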
* Re: Undo aic7xxx changes 2003-05-30 8:09 ` Stephan von Krawczynski ` (2 preceding siblings ...) 2003-05-30 13:34 ` Jeff Garzik @ 2003-05-30 13:35 ` Jeff Garzik 3 siblings, 0 replies; 110+ messages in thread From: Jeff Garzik @ 2003-05-30 13:35 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: marcelo, m.c.p, willy, gibbs, linux-kernel On Fri, May 30, 2003 at 10:09:00AM +0200, Stephan von Krawczynski wrote: > Hello Marcelo, > > I tried plain rc6 now and have to tell you it does not survive a single day of > my usual tests. It freezes during tar from 3ware-driven IDE to aic-driven SDLT. > This is identical to all previous rc (and some pre) releases of 2.4.21. So far > I can tell you that the only thing that has recently cured this problem is > replacing the aic-driver with latest of justins' releases. > As plain rc6 does definitely not work I will now switch over to > rc6+aic-20030523. Remember that rc3+aic-20030523 already worked quite ok (4 > days test survived). Also, does the aic7xxx_old driver work for you? The "old" part is only in regards to lack of support for very-new aic7xxx hardware. Jeff ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-25 10:58 ` Stephan von Krawczynski 2003-05-25 12:35 ` Willy TARREAU 2003-05-25 12:47 ` Marc-Christian Petersen @ 2003-05-25 18:30 ` Justin T. Gibbs 2 siblings, 0 replies; 110+ messages in thread From: Justin T. Gibbs @ 2003-05-25 18:30 UTC (permalink / raw) To: Stephan von Krawczynski, Willy Tarreau; +Cc: marcelo, linux-kernel > Willy: I am willing to try a serial console setup (as it does not interfere > with X). Are you still running all of your tests with X up? You then have no chance of getting any useful diagnostics without a serial console. Can't you switch back to a text VT while the test is running? > I have tried this before with no luck. Can you provide some hints how > you got that working (yes, I read Documentation/serial-console.txt, but > I could not manage any output on the serial line). You will need a null modem cable. Configure a kernel with serial console support enabled. Use a fairly high speed for your console (115200). To enable your first serial port as a console add something like the following to your kernel command line: console=ttyS0,115200 console=tty0 This will retain console output on the virtual terminal too. -- Justin ^ permalink raw reply [flat|nested] 110+ messages in thread
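Before waiting a day for the next freeze, it can help to confirm that the running kernel was actually booted with a serial console. The sketch below parses the `console=` parameters out of a kernel command line string. The helper names and the example command line (including `root=/dev/sda1`) are illustrative assumptions, not taken from the thread.

```python
def console_args(cmdline):
    """Return the value of every console= parameter, in the order given."""
    return [arg.split("=", 1)[1]
            for arg in cmdline.split()
            if arg.startswith("console=")]


def serial_consoles(cmdline):
    """Return (device, baud) pairs for serial consoles (ttyS*) only."""
    found = []
    for value in console_args(cmdline):
        device, _, options = value.partition(",")
        if not device.startswith("ttyS"):
            continue
        # Options look like "115200" or "9600n8"; keep only the leading digits.
        digits = ""
        for ch in options:
            if not ch.isdigit():
                break
            digits += ch
        found.append((device, int(digits) if digits else None))
    return found


# Hypothetical command line in the style suggested above.
example = "ro root=/dev/sda1 console=ttyS0,115200 console=tty0"
print(serial_consoles(example))
```

On a live Linux system the same function can be fed the contents of /proc/cmdline. An empty result means no serial console was requested at boot, which would explain a silent serial line regardless of cabling.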
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-05-24 11:16 ` Willy Tarreau 2003-05-25 10:58 ` Stephan von Krawczynski @ 2003-06-05 15:05 ` Stephan von Krawczynski 2003-06-05 18:14 ` Willy Tarreau 1 sibling, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-05 15:05 UTC (permalink / raw) To: Willy Tarreau; +Cc: willy, gibbs, marcelo, linux-kernel Hello all, It took some days to produce output for my freezing problem. This one is rc7+aic20030603:
Jun 5 16:53:55 admin kernel: Unable to handle kernel paging request at virtual address 8e30a7c5
Jun 5 16:53:55 admin kernel: printing eip:
Jun 5 16:53:55 admin kernel: c013755e
Jun 5 16:53:55 admin kernel: *pde = 00000000
Jun 5 16:53:55 admin kernel: Oops: 0000
Jun 5 16:53:55 admin kernel: CPU: 0
Jun 5 16:53:55 admin kernel: EIP: 0010:[kmem_cache_alloc_batch+78/272] Not tainted
Jun 5 16:53:55 admin kernel: EIP: 0010:[<c013755e>] Not tainted
Jun 5 16:53:55 admin kernel: EFLAGS: 00010006
Jun 5 16:53:55 admin kernel: eax: e62d70eb ebx: e62d70eb ecx: f57ae401 edx: 00000020
Jun 5 16:53:55 admin kernel: esi: 00000043 edi: 0000003a ebp: c342b060 esp: e5e63a28
Jun 5 16:53:55 admin kernel: ds: 0018 es: 0018 ss: 0018
Jun 5 16:53:55 admin kernel: Process tar (pid: 7112, stackpage=e5e63000)
Jun 5 16:53:55 admin kernel: Stack: c342b068 c342b070 c342b060 00000246 00000020 e7420000 c01382eb c342b060
Jun 5 16:53:55 admin kernel: c3461000 00000020 00000000 c342bdb8 00000000 e7420000 c013749c c342b060
Jun 5 16:53:55 admin kernel: 00000020 d3d05ec0 00000003 00000020 c342bdb8 00000246 00000020 e5e63b14
Jun 5 16:53:55 admin kernel: Call Trace: [__kmem_cache_alloc+107/304] [kmem_cache_grow+508/624] [__kmem_cache_alloc+125/304] [get_mem_for_virtual_node+87/224] [fix_nodes+198/1008]
Jun 5 16:53:55 admin kernel: Call Trace: [<c01382eb>] [<c013749c>] [<c01382fd>] [<c01846a7>] [<c0184bc6>]
Jun 5 16:53:55 admin kernel: [reiserfs_paste_into_item+147/304] [reiserfs_get_block+1989/4800] [bh_action+106/112] [tasklet_hi_action+83/160] [smp_apic_timer_interrupt+264/304] [.text.lock.buffer+191/610]
Jun 5 16:53:55 admin kernel: [<c0191ae3>] [<c017cca5>] [<c012252a>] [<c01223b3>] [<c0115d88>] [<c01474bd>]
Jun 5 16:53:55 admin kernel: [getblk+109/128] [is_tree_node+100/112] [search_by_key+1824/3792] [__block_prepare_write+479/880] [block_prepare_write+51/144] [reiserfs_get_block+0/4800]
Jun 5 16:53:55 admin kernel: [<c014447d>] [<c018e8f4>] [<c018f020>] [<c014503f>] [<c0145a23>] [<c017c4e0>]
Jun 5 16:53:55 admin kernel: [generic_file_write+970/2128] [reiserfs_get_block+0/4800] [sys_write+155/384] [system_call+51/56]
Jun 5 16:53:55 admin kernel: [<c013397a>] [<c017c4e0>] [<c0141d8b>] [<c010782f>]
Jun 5 16:53:55 admin kernel:
Jun 5 16:53:55 admin kernel: Code: 8b 44 81 18 0f af da 8b 51 0c 89 41 14 01 d3 40 0f 84 89 00
Does this help? Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
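When reading a trace like the one above, it is easy to skip a frame by eye. A small script can pull the decoded `[symbol+offset/size]` entries out of the log in order, so the EIP frame and the innermost callers are listed explicitly. This is only a sketch: the regular expression is an assumption based on the entries visible in this oops, the function name `parse_trace` is made up, and raw address entries such as `[<c01382eb>]` are skipped on purpose.

```python
import re

# Matches decoded entries such as "[kmem_cache_grow+508/624]" or
# "[.text.lock.buffer+191/610]". Raw "[<c01382eb>]" entries do not match.
ENTRY = re.compile(r"\[([A-Za-z_.][\w.]*)\+(\d+)/(\d+)\]")


def parse_trace(log_text):
    """Return (symbol, offset, size) tuples in the order they appear."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in ENTRY.finditer(log_text)]


# A few lines copied from the oops in this thread.
sample = (
    "kernel: EIP: 0010:[kmem_cache_alloc_batch+78/272] Not tainted\n"
    "kernel: Call Trace: [__kmem_cache_alloc+107/304] [kmem_cache_grow+508/624]\n"
    "kernel: Call Trace: [<c01382eb>] [<c013749c>]\n"
)

frames = parse_trace(sample)
for symbol, offset, size in frames:
    print(f"{symbol:28s} +{offset} of {size} bytes")
```

The first tuple is the faulting function from the EIP line; the remaining tuples are the call trace entries, innermost first as logged.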
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-05 15:05 ` Undo aic7xxx changes (now rc7+aic20030603) Stephan von Krawczynski @ 2003-06-05 18:14 ` Willy Tarreau 2003-06-06 8:17 ` Oleg Drokin 2003-06-08 11:19 ` Stephan von Krawczynski 0 siblings, 2 replies; 110+ messages in thread From: Willy Tarreau @ 2003-06-05 18:14 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: Willy Tarreau, gibbs, marcelo, linux-kernel On Thu, Jun 05, 2003 at 05:05:51PM +0200, Stephan von Krawczynski wrote: > Hello all, > > It took some days to produce output for my freezing problem. This one is rc7+aic20030603: Good ! It seems that it crashed in the reiserfs code rather than in aic7xxx ! perhaps you hit 2 different bugs, or perhaps there's a race that only newer code can trigger, or there's a leak somewhere. You may want to forward the oops to the reiserfs team too. > Jun 5 16:53:55 admin kernel: Call Trace: [<c01382eb>] [<c013749c>] [<c01382fd>] [<c01846a7>] [<c0184bc6>] > Jun 5 16:53:55 admin kernel: [reiserfs_paste_into_item+147/304] [reiserfs_get_block+1989/4800] [bh_action+106/112] [tasklet_hi_action+83/160] [smp_apic_timer_interrupt+264/304] [.text.lock.buffer+191/610] > Jun 5 16:53:55 admin kernel: [<c0191ae3>] [<c017cca5>] [<c012252a>] [<c01223b3>] [<c0115d88>] [<c01474bd>] > Jun 5 16:53:55 admin kernel: [getblk+109/128] [is_tree_node+100/112] [search_by_key+1824/3792] [__block_prepare_write+479/880] [block_prepare_write+51/144] [reiserfs_get_block+0/4800] > Jun 5 16:53:55 admin kernel: [<c014447d>] [<c018e8f4>] [<c018f020>] [<c014503f>] [<c0145a23>] [<c017c4e0>] > Jun 5 16:53:55 admin kernel: [generic_file_write+970/2128] [reiserfs_get_block+0/4800] [sys_write+155/384] [system_call+51/56] > Jun 5 16:53:55 admin kernel: [<c013397a>] [<c017c4e0>] [<c0141d8b>] [<c010782f>] > Jun 5 16:53:55 admin kernel: > Jun 5 16:53:55 admin kernel: Code: 8b 44 81 18 0f af da 8b 51 0c 89 41 14 01 d3 40 0f 84 89 00 Cheers and thanks for the test ! 
Willy ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-05 18:14 ` Willy Tarreau @ 2003-06-06 8:17 ` Oleg Drokin 2003-06-06 9:04 ` Stephan von Krawczynski 2003-06-08 11:19 ` Stephan von Krawczynski 1 sibling, 1 reply; 110+ messages in thread From: Oleg Drokin @ 2003-06-06 8:17 UTC (permalink / raw) To: Willy Tarreau; +Cc: Stephan von Krawczynski, gibbs, marcelo, linux-kernel Hello! On Thu, Jun 05, 2003 at 08:14:23PM +0200, Willy Tarreau wrote: > > It took some days to produce output for my freezing problem. This one is rc7+aic20030603: > Good ! > It seems that it crashed in the reiserfs code rather than in aic7xxx ! perhaps > you hit 2 different bugs, or perhaps there's a race that only newer code can > trigger, or there's a leak somewhere. You may want to forward the oops to the > reiserfs team too. No, it crashed in the allocation code (you skipped one trace line): Jun 5 16:53:55 admin kernel: Call Trace: [__kmem_cache_alloc+107/304] [kmem_cache_grow+508/624] [__kmem_cache_alloc+125/304] [get_mem_for_virtual_node+87/224] [fix_nodes+198/1008] And the EIP is in kmem_cache_alloc_batch, which sounds like it tripped over a bad pointer or something like that. So it seems something is corrupting the slab lists. Bye, Oleg ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-06 8:17 ` Oleg Drokin @ 2003-06-06 9:04 ` Stephan von Krawczynski 2003-06-06 9:17 ` Oleg Drokin 0 siblings, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-06 9:04 UTC (permalink / raw) To: Oleg Drokin; +Cc: willy, gibbs, marcelo, linux-kernel On Fri, 6 Jun 2003 12:17:12 +0400 Oleg Drokin <green@namesys.com> wrote: > Hello! > > On Thu, Jun 05, 2003 at 08:14:23PM +0200, Willy Tarreau wrote: > > > It took some days to produce output for my freezing problem. This one is > > > rc7+aic20030603: > > Good ! > > It seems that it crashed in the reiserfs code rather than in aic7xxx ! > > perhaps you hit 2 different bugs, or perhaps there's a race that only newer > > code can trigger, or there's a leak somewhere. You may want to forward the > > oops to the reiserfs team too. > > No, it did crashed in allocation code (you skipped one trace line): > Jun 5 16:53:55 admin kernel: Call Trace: [__kmem_cache_alloc+107/304] > [kmem_cache_grow+508/624] > [__kmem_cache_alloc+125/304]+[get_mem_for_virtual_node+87/224] > [fix_nodes+198/1008] > > And the EIP is in kmem_cache_alloc_batch, sounds like it tripped on bad > pointer or something like this. So something is corrupting slab lists it > seems. > > Bye, > Oleg I agree with you. Only problem is: how can I find out what caused the problem. The only thing I can tell is that the box never hangs when using only HDs on the aic & 3ware controllers. As soon as I begin to use a SDLT drive on aic things get fishy. Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-06 9:04 ` Stephan von Krawczynski @ 2003-06-06 9:17 ` Oleg Drokin 2003-06-06 15:24 ` short freezing while file re-creation Stephan von Krawczynski 2003-06-08 10:15 ` Undo aic7xxx changes (now rc7+aic20030603) Stephan von Krawczynski 0 siblings, 2 replies; 110+ messages in thread From: Oleg Drokin @ 2003-06-06 9:17 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: willy, gibbs, marcelo, linux-kernel Hello! On Fri, Jun 06, 2003 at 11:04:08AM +0200, Stephan von Krawczynski wrote: > > No, it did crashed in allocation code (you skipped one trace line): > > Jun 5 16:53:55 admin kernel: Call Trace: [__kmem_cache_alloc+107/304] > > [kmem_cache_grow+508/624] > > [__kmem_cache_alloc+125/304]+[get_mem_for_virtual_node+87/224] > > [fix_nodes+198/1008] > > > > And the EIP is in kmem_cache_alloc_batch, sounds like it tripped on bad > > pointer or something like this. So something is corrupting slab lists it > > seems. > I agree with you. Only problem is: how can I find out what caused the problem. Probably by careful code inspection. > The only thing I can tell is that the box never hangs when using only HDs on > the aic & 3ware controllers. As soon as I begin to use a SDLT drive on aic > things get fishy. You do not have a reiserfs filesystem on a tape drive, right? ;) But that reduces the region to review to the parts that deal with tape devices and tape-specific stuff, it seems. Bye, Oleg ^ permalink raw reply [flat|nested] 110+ messages in thread
* short freezing while file re-creation 2003-06-06 9:17 ` Oleg Drokin @ 2003-06-06 15:24 ` Stephan von Krawczynski 2003-06-06 16:02 ` Oleg Drokin 2003-06-08 10:15 ` Undo aic7xxx changes (now rc7+aic20030603) Stephan von Krawczynski 1 sibling, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-06 15:24 UTC (permalink / raw) To: Oleg Drokin; +Cc: linux-kernel Hello Oleg, while experimenting around my other problem I noticed my box freezes for some seconds while tar is re-creating an archive of around 70 GB size on a reiserfs with a 3ware-connected device. This is experienced with 2.4.21-rc7. Reproducible via: 1) create a BIG tar archive file (mine is 70 GB) on a reiserfs; 2) re-create the same archive and watch the box go dead while the old archive is zapped. (Gone dead means: mouse froze, keyboard froze, X froze) The effect is visible for several seconds, then everything is back to normal. It's no big deal if you are interactively dealing with the cause (tar). But if you deal with background processes in a server environment where your primary process suddenly goes dead for seconds you are probably not amused... Can you verify this? Is this device or fs dependent? Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: short freezing while file re-creation 2003-06-06 15:24 ` short freezing while file re-creation Stephan von Krawczynski @ 2003-06-06 16:02 ` Oleg Drokin 2003-06-06 19:00 ` Chris Mason 0 siblings, 1 reply; 110+ messages in thread From: Oleg Drokin @ 2003-06-06 16:02 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: linux-kernel Hello! On Fri, Jun 06, 2003 at 05:24:54PM +0200, Stephan von Krawczynski wrote: > while experimenting around my other problem I noticed my box freezes for some > seconds while tar is re-creating an archive of around 70 GB size on a reiserfs > with 3ware-connected device. > This is experienced with 2.4.21-rc7. Reproducible via: > create BIG tar archive file (my size 70 GB) on a reiserfs > re-create same archive and watch box gone dead while the old archive is zapped. > (Gone dead means: mouse froze, keyboard froze, X froze) Hm, I will try. Wild guess: does this patch help? (untested, not even compiled, but should be safe) Bye, Oleg

===== stree.c 1.21 vs edited =====
--- 1.21/fs/reiserfs/stree.c Tue Mar 4 19:48:52 2003
+++ edited/fs/reiserfs/stree.c Fri Jun 6 20:01:29 2003
@@ -1773,6 +1773,8 @@
 	    journal_begin(th, p_s_inode->i_sb, orig_len_alloc) ;
 	    reiserfs_update_inode_transaction(p_s_inode) ;
 	}
+	if (current->need_resched)
+	    schedule() ;
     } while ( n_file_size > ROUND_UP (n_new_file_size) &&
 	      search_for_position_by_key(p_s_inode->i_sb, &s_item_key, &s_search_path) == POSITION_FOUND ) ;

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: short freezing while file re-creation 2003-06-06 16:02 ` Oleg Drokin @ 2003-06-06 19:00 ` Chris Mason 2003-06-06 19:10 ` Oleg Drokin 0 siblings, 1 reply; 110+ messages in thread From: Chris Mason @ 2003-06-06 19:00 UTC (permalink / raw) To: Oleg Drokin; +Cc: Stephan von Krawczynski, linux-kernel On Fri, 2003-06-06 at 12:02, Oleg Drokin wrote: > Hello! > > On Fri, Jun 06, 2003 at 05:24:54PM +0200, Stephan von Krawczynski wrote: > > > while experimenting around my other problem I noticed my box freezes for some > > seconds while tar is re-creating an archive of around 70 GB size on a reiserfs > > with 3ware-connected device. > > This is experienced with 2.4.21-rc7. Reproducable via: > > create BIG tar archive file (my size 70 GB) on a reiserfs > > re-create same archive and watch box gone dead while the old archive is zapped. > > (Gone dead means: mouse froze, keyboard froze, X froze) > > Hm, I will try . > > Wild guess: does this patch helps? (untessted, not even compiled, but should be safe ) > There are still some latency issues with io in rc7, it could be a general problem. -chris ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: short freezing while file re-creation 2003-06-06 19:00 ` Chris Mason @ 2003-06-06 19:10 ` Oleg Drokin 2003-06-06 19:20 ` Chris Mason 0 siblings, 1 reply; 110+ messages in thread From: Oleg Drokin @ 2003-06-06 19:10 UTC (permalink / raw) To: Chris Mason; +Cc: Stephan von Krawczynski, linux-kernel Hello! On Fri, Jun 06, 2003 at 03:00:54PM -0400, Chris Mason wrote: > There are still some latency issues with io in rc7, it could be a > general problem. Hm. But I think everything that was not needing disk io (i.e. mouse stuff) should not be affected? Bye, Oleg ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: short freezing while file re-creation 2003-06-06 19:10 ` Oleg Drokin @ 2003-06-06 19:20 ` Chris Mason 0 siblings, 0 replies; 110+ messages in thread From: Chris Mason @ 2003-06-06 19:20 UTC (permalink / raw) To: Oleg Drokin; +Cc: Stephan von Krawczynski, linux-kernel On Fri, 2003-06-06 at 15:10, Oleg Drokin wrote: > Hello! > > On Fri, Jun 06, 2003 at 03:00:54PM -0400, Chris Mason wrote: > > > There are still some latency issues with io in rc7, it could be a > > general problem. > > Hm. But I think everything that was not needing disk io (i.e. mouse stuff) > should not be affected? > It shouldn't ;-) But the problems are still not completely understood. This particular problem could still be reiserfs, it's hard to say right now. -chris ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-06 9:17 ` Oleg Drokin 2003-06-06 15:24 ` short freezing while file re-creation Stephan von Krawczynski @ 2003-06-08 10:15 ` Stephan von Krawczynski 1 sibling, 0 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-08 10:15 UTC (permalink / raw) To: linux-kernel; +Cc: willy, gibbs, marcelo, green On Fri, 6 Jun 2003 13:17:59 +0400 Oleg Drokin <green@namesys.com> wrote: > Hello! > > On Fri, Jun 06, 2003 at 11:04:08AM +0200, Stephan von Krawczynski wrote: > > > No, it did crashed in allocation code (you skipped one trace line): > > > Jun 5 16:53:55 admin kernel: Call Trace: [__kmem_cache_alloc+107/304] > > > [kmem_cache_grow+508/624] > > > [__kmem_cache_alloc+125/304]+[get_mem_for_virtual_node+87/224] > > > [fix_nodes+198/1008] > > > > > > And the EIP is in kmem_cache_alloc_batch, sounds like it tripped on bad > > > pointer or something like this. So something is corrupting slab lists it > > > seems. > > I agree with you. Only problem is: how can I find out what caused the problem. > > Probably by careful code observations. > > > The only thing I can tell is that the box never hangs when using only HDs on > > the aic & 3ware controllers. As soon as I begin to use a SDLT drive on aic > > things get fishy. > > You do not have a reiserfs filesystem on a tape drive, right? ;) > But that reduces the region to review to the parts that deal with tape devices and > tape-specific stuff, it seems. > > Bye, > Oleg Hello all, in the meantime I got another oops and it looks like this:

ksymoops 2.4.8 on i686 2.4.21-rc7-aic. Options used
 -V (default)
 -k /proc/ksyms (default)
 -l /proc/modules (default)
 -o /lib/modules/2.4.21-rc7-aic/ (default)
 -m /boot/System.map-2.4.21-rc7-aic (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Jun 8 10:48:49 linux kernel: Oops: 0000
Jun 8 10:48:49 linux kernel: CPU: 1
Jun 8 10:48:49 linux kernel: EIP: 0010:[<c013755e>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Jun 8 10:48:49 linux kernel: EFLAGS: 00010006
Jun 8 10:48:49 linux kernel: eax: 5a005139 ebx: 5a005139 ecx: edb89c21 edx: 00000060
Jun 8 10:48:49 linux kernel: esi: 00000021 edi: 0000005c ebp: c342fecc esp: e4007d74
Jun 8 10:48:49 linux kernel: ds: 0018 es: 0018 ss: 0018
Jun 8 10:48:49 linux kernel: Process tar (pid: 17369, stackpage=e4007000)
Jun 8 10:48:49 linux kernel: Stack: c342fed4 c342fedc c342fecc 00000246 00000070 effa58a0 c01382eb c342fecc
Jun 8 10:48:49 linux kernel: c3467800 00000070 00000000 c1000020 effa58a0 effa58a0 c013f7d9 c342fecc
Jun 8 10:48:49 linux kernel: 00000070 00000000 c013f8a5 c349d418 f6fc1200 00000000 00000000 c1000020
Jun 8 10:48:49 linux kernel: Call Trace: [<c01382eb>] [<c013f7d9>] [<c013f8a5>] [<c01b8f73>] [<c01b929e>]
Jun 8 10:48:49 linux kernel: [<c01b936c>] [<c0145596>] [<c0139fc2>] [<c013069e>] [<c017c4e0>] [<c013124f>]
Jun 8 10:48:49 linux kernel: [<c0131531>] [<c0131ad0>] [<c0131d20>] [<c0131ad0>] [<c0141c0b>] [<c010782f>]
Jun 8 10:48:49 linux kernel: Code: 8b 44 81 18 0f af da 8b 51 0c 89 41 14 01 d3 40 0f 84 89 00

>>EIP; c013755e <kmem_cache_alloc_batch+4e/110> <=====

>>ecx; edb89c21 <_end+2d7f78e1/38547d20>
>>ebp; c342fecc <_end+309db8c/38547d20>
>>esp; e4007d74 <_end+23c75a34/38547d20>

Trace; c01382eb <__kmem_cache_alloc+6b/130>
Trace; c013f7d9 <alloc_bounce_bh+19/a0>
Trace; c013f8a5 <create_bounce+45/190>
Trace; c01b8f73 <__make_request+3d3/640>
Trace; c01b929e <generic_make_request+be/140>
Trace; c01b936c <submit_bh+4c/70>
Trace; c0145596 <block_read_full_page+2c6/2e0>
Trace; c0139fc2 <__alloc_pages+42/190>
Trace; c013069e <generic_buffer_fdatasync+5e/110>
Trace; c017c4e0 <reiserfs_get_block+0/12c0>
Trace; c013124f <generic_file_readahead+af/1a0>
Trace; c0131531 <do_generic_file_read+1c1/470>
Trace; c0131ad0 <file_read_actor+0/110>
Trace; c0131d20 <generic_file_read+140/160>
Trace; c0131ad0 <file_read_actor+0/110>
Trace; c0141c0b <sys_read+9b/180>
Trace; c010782f <system_call+33/38>

Code; c013755e <kmem_cache_alloc_batch+4e/110>
00000000 <_EIP>:
Code; c013755e <kmem_cache_alloc_batch+4e/110> <=====
 0: 8b 44 81 18 mov 0x18(%ecx,%eax,4),%eax <=====
Code; c0137562 <kmem_cache_alloc_batch+52/110>
 4: 0f af da imul %edx,%ebx
Code; c0137565 <kmem_cache_alloc_batch+55/110>
 7: 8b 51 0c mov 0xc(%ecx),%edx
Code; c0137568 <kmem_cache_alloc_batch+58/110>
 a: 89 41 14 mov %eax,0x14(%ecx)
Code; c013756b <kmem_cache_alloc_batch+5b/110>
 d: 01 d3 add %edx,%ebx
Code; c013756d <kmem_cache_alloc_batch+5d/110>
 f: 40 inc %eax
Code; c013756e <kmem_cache_alloc_batch+5e/110>
 10: 0f 84 89 00 00 00 je 9f <_EIP+0x9f>

1 warning issued. Results may not be reliable.

This is the second oops inside kmem_cache_alloc_batch, so the problem can be considered reproducible. This is a 2.4.21-rc7+aic20030603 kernel. Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
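[Editor's sketch] The two oopses can be compared mechanically. The C helpers below are an illustration only, not anything posted in the thread: `parse_frame`, `same_location` and `effective_address` are hypothetical names. They assume that the Jun 5 trace prints `symbol+offset/size` in decimal while ksymoops prints the same fields in hex. Under that assumption, `__kmem_cache_alloc+107/304` and `__kmem_cache_alloc+6b/130` are the same return address. The last helper evaluates the faulting `mov 0x18(%ecx,%eax,4),%eax` with the register values from the dump, using 32-bit wraparound as on i386.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One "symbol+offset/size" frame as printed in a 2.4 call trace. */
struct frame {
	char symbol[64];
	unsigned long offset;
	unsigned long size;
};

/*
 * Parse "name+off/size". base is 10 for the decimal trace and 16 for
 * ksymoops output (an assumption about the two log formats).
 * Returns 0 on success, -1 on malformed input.
 */
static int parse_frame(const char *text, int base, struct frame *out)
{
	const char *plus = strchr(text, '+');
	const char *slash = plus ? strchr(plus, '/') : NULL;
	size_t len;

	assert(out != NULL);
	if (!plus || !slash)
		return -1;
	len = (size_t)(plus - text);
	if (len == 0 || len >= sizeof(out->symbol))
		return -1;
	memcpy(out->symbol, text, len);
	out->symbol[len] = '\0';
	out->offset = strtoul(plus + 1, NULL, base);
	out->size = strtoul(slash + 1, NULL, base);
	return 0;
}

/* Nonzero when both frames name the same spot in the same function. */
static int same_location(const struct frame *a, const struct frame *b)
{
	return strcmp(a->symbol, b->symbol) == 0 &&
	       a->offset == b->offset && a->size == b->size;
}

/* Effective address of "disp(%base,%index,scale)", wrapping at 32 bits. */
static uint32_t effective_address(uint32_t base, uint32_t index,
				  uint32_t scale, uint32_t disp)
{
	return base + index * scale + disp;
}
```

With ecx=edb89c21 and eax=5a005139 from the dump, `effective_address(0xedb89c21, 0x5a005139, 4, 0x18)` evaluates to 0x55b9e11d. That address lies far from the c0000000-and-up range seen in the rest of the trace, which would fit Oleg's reading that a slab list holds a bad pointer; this is an interpretation of the dump, not a diagnosis.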
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-05 18:14 ` Willy Tarreau 2003-06-06 8:17 ` Oleg Drokin @ 2003-06-08 11:19 ` Stephan von Krawczynski 2003-06-08 11:49 ` Stephan von Krawczynski 1 sibling, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-08 11:19 UTC (permalink / raw) To: linux-kernel; +Cc: willy, gibbs, marcelo, green Hello all, looking at code around my problem I discovered this:

static inline void * __kmem_cache_alloc (kmem_cache_t *cachep, int flags)
{
	unsigned long save_flags;
	void* objp;

	kmem_cache_alloc_head(cachep, flags);
try_again:
	local_irq_save(save_flags);
#ifdef CONFIG_SMP
	{
		cpucache_t *cc = cc_data(cachep);

		if (cc) {
			if (cc->avail) {
				STATS_INC_ALLOCHIT(cachep);
				objp = cc_entry(cc)[--cc->avail];
			} else {
				STATS_INC_ALLOCMISS(cachep);
				objp = kmem_cache_alloc_batch(cachep,cc,flags);
				if (!objp)
					goto alloc_new_slab_nolock;
			}
		} else {
			spin_lock(&cachep->spinlock);
			objp = kmem_cache_alloc_one(cachep);
			spin_unlock(&cachep->spinlock);
		}
	}
#else
	objp = kmem_cache_alloc_one(cachep);
#endif
	local_irq_restore(save_flags);
	return objp;
alloc_new_slab:
#ifdef CONFIG_SMP
	spin_unlock(&cachep->spinlock);
alloc_new_slab_nolock:
#endif
	local_irq_restore(save_flags);
	if (kmem_cache_grow(cachep, flags))
		/* Someone may have stolen our objs. Doesn't matter, we'll
		 * just come back here again.
		 */
		goto try_again;
	return NULL;
}

I suggest it for most-absurd-goto-usage-award.
1) There seems to be no reference for symbol "alloc_new_slab"
2) "spin_unlock" (right below) is never reached
3) The not-ifdef'ed code below is only used if CONFIG_SMP
4) The code "alloc_new_slab_nolock" is referenced only once by a goto (why not simply pasted there?)

This does not look like a problem, it only is damn ugly. I have no idea what this code actually does, but it looks patched-to-the-limit. Has anybody reviewed slab regarding CONFIG_SMP? Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-08 11:19 ` Stephan von Krawczynski @ 2003-06-08 11:49 ` Stephan von Krawczynski 2003-06-08 16:07 ` Stephan von Krawczynski 2003-06-09 15:10 ` Stephan von Krawczynski 0 siblings, 2 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-08 11:49 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green Hello author, shoot me for the last comment regarding __kmem_cache_alloc (which means: forget it). Still you have significant source code duplication between "#define kmem_cache_alloc_one" and "void* kmem_cache_alloc_batch". How about an exit-symbol parameter? Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
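[Editor's sketch] One plausible reason for the retraction above, stated here as an assumption because the body of `kmem_cache_alloc_one` is not quoted in this thread: a macro can contain a `goto` whose target label lives in the function that uses the macro. The label then looks unreferenced when reading only the function. The toy C code below illustrates that pattern in user space. `toy_cache`, `toy_alloc_one` and `toy_alloc` are hypothetical names and do not reproduce the kernel's slab code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a slab cache: just counters. */
struct toy_cache {
	int free_objs;   /* objects currently available */
	int grow_calls;  /* how often the cache had to grow */
};

/*
 * Simplified analogue of a macro-based fast path: the goto is hidden in
 * the macro body, so no visible goto in the caller names the label.
 */
#define toy_alloc_one(cachep, result)		\
	do {					\
		if ((cachep)->free_objs == 0)	\
			goto alloc_new_slab;	\
		(cachep)->free_objs--;		\
		(result) = 1;			\
	} while (0)

/* Returns 1 on success, 0 if the cache is empty and may not grow. */
static int toy_alloc(struct toy_cache *cachep, int can_grow)
{
	int got = 0;

	assert(cachep != NULL);
try_again:
	toy_alloc_one(cachep, got);
	return got;

alloc_new_slab:
	/* Reached only through the goto inside toy_alloc_one(). */
	if (can_grow) {
		cachep->grow_calls++;
		cachep->free_objs += 4;  /* pretend a new slab holds 4 objects */
		goto try_again;
	}
	return 0;
}
```

Calling `toy_alloc()` on an empty cache with `can_grow` set takes the slow path once and then succeeds on the retry, which mirrors the `try_again` loop in the quoted function.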
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-08 11:49 ` Stephan von Krawczynski @ 2003-06-08 16:07 ` Stephan von Krawczynski 2003-06-09 15:10 ` Stephan von Krawczynski 1 sibling, 0 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-08 16:07 UTC (permalink / raw) To: gibbs; +Cc: linux-kernel Hello Justin, another thing I stumbled across: if you compile the latest aic-driver (20030603) for smp, but boot the kernel with nosmp flag, the driver hangs during device-scan. Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-08 11:49 ` Stephan von Krawczynski 2003-06-08 16:07 ` Stephan von Krawczynski @ 2003-06-09 15:10 ` Stephan von Krawczynski 2003-06-09 15:32 ` Justin T. Gibbs 2003-06-10 1:38 ` Zwane Mwaikambo 1 sibling, 2 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-09 15:10 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green Hello all, I just finished another bunch of tests around the discussed issue and it's getting to an end. Yesterday I started using the test box with a UP kernel instead of SMP, because I have the feeling the whole problem is somewhere around an SMP race condition. As far as I can see now the box runs 24h stable _and_ (and this is the important part) one problem I did not talk about till now is completely gone: During the whole testing with SMP I noticed that the tar-verify always brought up "content differs" warnings, which basically means that the filesize is ok but the content is not. As there might be various causes for this (bad tape, bad drive, bad cabling) I did not think much of it. But it turns out there are no more such warnings when using a UP kernel (on the same box with exactly the same hardware including tapes). From this experience I would conclude the following (for my personal test case): 1) aic-driver has problems with smp/up switching (meaning crashes when trying an SMP build with nosmp). This is completely reproducible. 2) aic-driver (almost no matter what version) has problems with SMP setup and tape drives. Obviously data integrity is not given. This is completely reproducible in my test setup. For Marcelo: It seems you can take any version of the aic driver for small box setups with UP, I never saw any troubles with it. As soon as you look at SMP flush it down the t..let. 
For Justin: Thank you for your continuous openness and support in the whole issue in form of exactly _zero_ comments (besides "how do you know aic is to blame?"). For Willy: I honour your efforts, but we are not capable of solving the issue. For Oleg: Stay tuned, I will test the re-creation issue and your patch. And now I go and buy a Symbios controller and re-try. Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-09 15:10 ` Stephan von Krawczynski @ 2003-06-09 15:32 ` Justin T. Gibbs 2003-06-10 10:23 ` Stephan von Krawczynski 2003-06-10 1:38 ` Zwane Mwaikambo 1 sibling, 1 reply; 110+ messages in thread From: Justin T. Gibbs @ 2003-06-09 15:32 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green > For Justin: > Thank you for your continuous openness and support in the whole issue in form > of exactly _zero_ comments (besides "how do you know aic is to blame?"). Stephan, Other than your most recent complaint that the driver doesn't function correctly in an SMP kernel when you specify the nosmp option, you have yet to provide any information that points to a problem in the aic7xxx driver. Without such information, I'm at a loss to help you. One thing that you forgot to mention in your "report" is that data corruption can happen in many more places than just in the aic7xxx driver. The data could be corrupted by a VM bug, a buffer layer bug, or a filesystem bug. When testing our drivers against RHAS2.1 we found that the stock kernel had data corruption issues very similar to what you are talking about when run on very fast, hyperthreading, SMP machines. The data corruption occurred with any SCSI controller we tried, regardless of vendor. If you continue to feel that the aic7xxx driver is at fault, I encourage you to try to reproduce this failure with someone else's card. I think you'll find that the problem persists even with this change. I will be more than happy to look into why the aic7xxx driver may not operate correctly in an SMP kernel with the nosmp option. Considering that your complaint about this failure came into my email box just yesterday, perhaps you can give me just a few days to look into this before you decide to call me unresponsive. Since I'm attending a conference this whole week, I won't even be able to look at this until I return on Monday of next week. 
I'm sorry that you are experiencing data corruption. I take those issues very seriously, but all of your panics and other reports point to issues elsewhere in the kernel that should be resolved before you conclude that the data corruption you are experiencing is somehow the aic7xxx driver's fault. I'll be more than happy to fess up to and correct any defect that is found in the driver, but I cannot fix bugs that I cannot reproduce and that have no usable debugging information associated with them. -- Justin ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-09 15:32 ` Justin T. Gibbs @ 2003-06-10 10:23 ` Stephan von Krawczynski 2003-06-10 15:38 ` Justin T. Gibbs 0 siblings, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-10 10:23 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-kernel, willy, marcelo, green On Mon, 09 Jun 2003 15:32:11 +0000 "Justin T. Gibbs" <gibbs@scsiguy.com> wrote: > > For Justin: > > Thank you for your continous openness and support in the whole issue in > > form of exactly _zero_ comments (,besides "how do you know aic is to > > blame?"). > > Stephan, > > Other than your most recent complaint that the driver doesn't function > correctly in an SMP kernel when you specify the nosmp option, you have > yet to provide any information that points to a problem in the aic7xxx > driver. Dear Justin, I am really not complaining about you not helping specifically _me_, I am complaining about your quite visible general opinion that this whole thing is really not serious, or maybe it is only that you are not making your efforts transparent to others, I don't know. > Without such information, I'm at a loss to help you. One thing > that you forgot to mention in your "report" is that data corruption can > happen in many more places than just in the aic7xxx driver. <sarcasm>Did I mention the big magnet right beside the tape?</sarcasm> > The data > could be corrupted by a VM bug, VM is quite the same, tar'ing to /dev/tape or /var/bak/mybackfile.tar. > a buffer layer bug, or a filesystem > bug. /dev/tape with a filesystem? Have you read what we are talking about? > When testing our drivers against RHAS2.1 we found that the stock > kernel had data corruption issues very similar to what your are talking > about when run on very fast, hyperthreading, SMP machines. The data > corruption occurred with any SCSI controller we tried, regardless of vendor. My question is: is it solved? 
> If you continue to feel that the aic7xxx driver is at fault, I encourage you > to try to reproduce this failure with someone elses card. I think you'll > find that the problem persists even with this change. This is not the first discussion about an instability in aic. We had the same thing months ago for another setup (where btw you said the same thing). Back then I switched to symbios and everything went ok from then on. Thing is: I am not a big learner, I just re-tried with aic now, and it happened again. I will do the same thing now like back then: switching to symbios. Be sure I am going to tell my experiences. Be aware that I have already received reports from others with the same problem solving it the same way - switching away from aic. > I will be more than happy to look into why the aic7xxx driver may not > operate correctly in an SMP kernel with the nosmp option. Considering > that your complaint about this failure came into my email box just > yesterday, perhaps you can give me just a few days to look into this > before you decide to call me unresponsive. Since I'm attending a > conference this whole week, I won't even be able to look at this > until I return on Monday of next week. Justin, this is nothing quite serious, I just mentioned it for a feedback to something _simple_. > I'm sorry that you are experiencing data corruption. I take those > issues very seriously, but all of your panics and other reports point > to issues elsewhere in the kernel that should be resolved before you > conclude that the data corruption you are experiencing is somehow > the aic7xxx driver's fault. I'll be more than happy to fess up to > and correct any defect that is found in the driver, but I cannot fix > bugs that I cannot reproduce and that have no usable debugging information > associated with them. What exactly is "elsewhere" if your data is bogus when tar'ing onto /dev/tape via aic and it is completely ok when tar'ing into a file via reiserfs/3ware ? 
There is not really much left between tar and the aic-driver and the tape. Where is your favourite in this game? Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-10 10:23 ` Stephan von Krawczynski @ 2003-06-10 15:38 ` Justin T. Gibbs 2003-06-10 17:11 ` Stephan von Krawczynski 0 siblings, 1 reply; 110+ messages in thread From: Justin T. Gibbs @ 2003-06-10 15:38 UTC (permalink / raw) To: Stephan von Krawczynski, Justin T. Gibbs Cc: linux-kernel, willy, marcelo, green >> Stephan, >> >> Other than your most recent complaint that the driver doesn't function >> correctly in an SMP kernel when you specify the nosmp option, you have >> yet to provide any information that points to a problem in the aic7xxx >> driver. > > Dear Justin, > > I am really not complaining about you not helping specifically _me_, I am > complaining about your quite visible general opinion that this whole thing is > really not serious, or maybe it is only that you are not making your efforts > transparent to others, I don't know. I never said that it wasn't serious, I just haven't seen any indication that this problem is caused by my driver. There is a big difference. If your complaint is that I typically help people to solve their problems *off-list*, then I'm sorry if that offends your sensibilities. I personally don't think that I need to CC a million people while I'm passing back various debugging information and asking for new output. It's just a lot of noise for the majority of people on the linux-kernel list. >> Without such information, I'm at a loss to help you. One thing >> that you forgot to mention in your "report" is that data corruption can >> happen in many more places than just in the aic7xxx driver. > > <sarcasm>Did I mention the big magnet right beside the tape?</sarcasm> I'm just sick of being blamed for anything that goes wrong on any system that happens to have an aic7xxx controller in it. 
99% of the time it's not my fault, but I suppose since I debug and resolve these issues off list for people that contact me, the general assumption is that these issues are the aic7xxx driver's fault. >> The data could be corrupted by a VM bug, > > VM is quite the same, tar'ing to /dev/tape or /var/bak/mybackfile.tar. No, the VM activity is quite different. >> a buffer layer bug, or a filesystem bug. > > /dev/tape with a filesystem? Have you read what we are talking about? Where did you get the data to place on the tape? /dev/zero? >> When testing our drivers against RHAS2.1 we found that the stock >> kernel had data corruption issues very similar to what you are talking >> about when run on very fast, hyperthreading, SMP machines. The data >> corruption occurred with any SCSI controller we tried, regardless of vendor. > > My question is: is it solved? My understanding is that it was fixed in 2.4.18 level kernels, but since I don't know the root cause of the corruption, it could have just been made more difficult to reproduce. >> If you continue to feel that the aic7xxx driver is at fault, I encourage you >> to try to reproduce this failure with someone else's card. I think you'll >> find that the problem persists even with this change. > > This is not the first discussion about an instability in aic. I'm not talking about *every case of aic7xxx driver instability*, I'm talking about *this particular case* of driver instability. Problems that to the naive user look similar are typically not. >> I will be more than happy to look into why the aic7xxx driver may not >> operate correctly in an SMP kernel with the nosmp option. Considering >> that your complaint about this failure came into my email box just >> yesterday, perhaps you can give me just a few days to look into this >> before you decide to call me unresponsive. Since I'm attending a >> conference this whole week, I won't even be able to look at this >> until I return on Monday of next week. 
> > Justin, this is nothing quite serious, I just mentioned it for a feedback to > something _simple_. It's the only thing you've mentioned that I have enough information to look at. >> I'm sorry that you are experiencing data corruption. I take those >> issues very seriously, but all of your panics and other reports point >> to issues elsewhere in the kernel that should be resolved before you >> conclude that the data corruption you are experiencing is somehow >> the aic7xxx driver's fault. I'll be more than happy to fess up to >> and correct any defect that is found in the driver, but I cannot fix >> bugs that I cannot reproduce and that have no usable debugging information >> associated with them. > > What exactly is "elsewhere" if your data is bogus when tar'ing onto /dev/tape > via aic and it is completely ok when tar'ing into a file via reiserfs/3ware ? > There is not really much left between tar and the aic-driver and the tape. I suggest you go browse the code that is exercised by such an activity before you say that. -- Justin ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-10 15:38 ` Justin T. Gibbs @ 2003-06-10 17:11 ` Stephan von Krawczynski 2003-06-10 18:07 ` Justin T. Gibbs 0 siblings, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-10 17:11 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-kernel, willy, marcelo, green On Tue, 10 Jun 2003 09:38:31 -0600 "Justin T. Gibbs" <gibbs@scsiguy.com> wrote: > I never said that it wasn't serious, I just haven't seen any indication > that this problem is caused by my driver. There is a big difference. > If your complaint is that I typically help people to solve their problems > *off-list*, then I'm sorry if that offends your sensibilities. It does not offend my sensibilities, it is simply damaging the available information about typical problems and their solutions. If you don't do it openly, there is no way for others to follow your thoughts and debugging and therefore you are confronted a hundred times with the same questions. People have no choice but to ask you, because your debugging cases are hidden. > I personally don't think that I need to CC a million people while I'm > passing back various debugging information and asking for new output. It's > just a lot of noise for the majority of people on the linux-kernel list. Keep in mind the broad user base of aics. Compared to other stuff in the kernel your messages may be a whole lot more interesting to listening LKML readers than other threads. > I'm just sick of being blamed for anything that goes wrong on any system > that happens to have an aic7xxx controller in it. 99% of the time it's > not my fault, but I suppose since I debug and resolve these issues off > list for people that contact me, the general assumption is that these > issues are the aic7xxx driver's fault. No, you produce your own problem. You cannot help every single person who has a problem around his box/aic. This is impossible. 
So you have to create a valuable information basis others can read and think about. This is most simply done by debugging problems _openly_. > >> a buffer layer bug, or a filesystem bug. > > > > /dev/tape with a filesystem? Have you read what we are talking about? > > Where did you get the data to place on the tape? /dev/zero? Don't be silly. If reading a file from some hd would be a problem in itself, then we could all go home and have a beer. You are talking about the minimum requirement for an os. > >> When testing our drivers against RHAS2.1 we found that the stock > >> kernel had data corruption issues very similar to what your are talking > >> about when run on very fast, hyperthreading, SMP machines. The data > >> corruption occurred with any SCSI controller we tried, regardless of > >vendor. > > > > My question is: is it solved? > > My understanding is that it was fixed in 2.4.18 level kernels, but since > I don't know the root cause of the corruption, it could have just been > made more difficult to reproduce. Can you point to some URL where information about this is available? > > This is not the first discussion about an instability in aic. > > I'm not talking about *every case of aic7xxx driver instability*, I'm > talking about *this particular case* of driver instability. Problems > that to the naive user look similar are typically not. Sorry, I should have said: "This is not the first discussion about an instability in aic between you and me". > > Justin, this is nothing quite serious, I just mentioned it for a feedback > > to something _simple_. > > It's the only thing you've mentioned that I have enough information to > look at. No, it is only the most simple one. Unfortunately scsi-driver development is everything but simple for the standard problem case. It requires the ability to set up equipment just like the discussed case for reproduction of the problem. Of course only for cases the author cannot reproduce inside his software via brain. 
All information needed to reproduce the main problem is available in this thread. > > What exactly is "elsewhere" if your data is bogus when tar'ing onto > > /dev/tape via aic and it is completely ok when tar'ing into a file via > > reiserfs/3ware ? There is not really much left between tar and the > > aic-driver and the tape. > > I suggest you go browse the code that is exercised by such an activity > before you say that. What kind of a statement is this? I spent days reproducing the error case, every single test takes anywhere from 3.5 to 24 hours. And you tell me "well, guy, if you want to know what I know go ahead and read my code", well knowing that at least 50% of the knowledge is not in the code but in the surrounding material you read to get where you are. I don't want to become scsi maintainer, I want to solve a problem - for me _and_ for others (and this is why I do it openly). I really have not understood what you want, besides not being spoken to. If I were you I would try to _prove_ that it is _not_ my problem, ideally by finding the real problem. Unfortunately I (and some others) do have the impression that you simply live by the idea that as long as nobody can _prove_ your code has a problem, there is no problem. This is in fact the bofh lifestyle that works for you (as long as you do not meet an equally skilled person), but not for the users (read: "rest of us"). Back to the facts: Simple question: you say it's not a problem inside the driver. Ok. Question: how do you prove that? Can you specify a test setup (program or something) I can check to see that there is no problem with the general SMP tape usage of the aic driver? I mean you must have seen something working, or not? Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-10 17:11 ` Stephan von Krawczynski @ 2003-06-10 18:07 ` Justin T. Gibbs 2003-06-11 0:51 ` Stephan von Krawczynski 0 siblings, 1 reply; 110+ messages in thread From: Justin T. Gibbs @ 2003-06-10 18:07 UTC (permalink / raw) To: Stephan von Krawczynski, Justin T. Gibbs Cc: linux-kernel, willy, marcelo, green >> I never said that it wasn't serios, I just haven't seen any indication >> that this problem is caused by my driver. There is a big difference. >> If your complaint is that I typically help people to solve their problems >> *off-list*, then I'm sorry if that offends your sensibilities. > > It does not offend my sensibilities, it is simply damaging the available > information about typical problems and their solving. If you don't do it open, > there is no way for others to follow your thoughts and debugging and therefore > you are confronted hundred times with the same questions. People have no > choice but asking you, because your debugging cases are hidden. 99% of the problems have to do with broken interrupt routing. There is plenty of information about this issue on the mailing lists, but people still ask me. It seems that SCSI is suitably complex for the common user that even when the driver explictly tells you "your drive is dying", I get email asking how I can fix my driver so that their drive doesn't die. The same is true if you look at the large body of dump card state information that people have posted from the aic7xxx and aic79xx drivers to this list. Anyone who gets this type of output seems to think that their problem must be the same as any other person that gets a dump card state. I don't think that any amount of posting information about how I decifer what the registers are telling me will cut down on this confusion. >> I'm just sick of being blamed for anything that goes wrong on any system >> that happens to have an aic7xxx controller in it. 
99% or the time its >> not my fault, but I suppose since I debug and resolve these issues off >> list for people that contact me, the general assumption is that these >> issues are the aic7xxx driver's fault. > > No, you produce your own problem. You cannot help every single who has a > problem around his box/aic. This is impossible. So you have to create a > valuable information basis others can read and think about. This is most > simply done by debugging problems _openly_. I just don't believe that this is true. Most of the questions that people email me directly are questions that are easily answered by a google search. In otherwords, the information is already readily available. It is just easier to send email than to actually investigate a potential solution to the problem. So, people send email and ask the same questions, and get the same answers. >> >> a buffer layer bug, or a filesystem bug. >> > >> > /dev/tape with a filesystem? Have you read what we are talking about? >> >> Where did you get the data to place on the tape? /dev/zero? > > Don't be silly. If reading a file from some hd would be a problem in itself, > then we could all go home and have a beer. You are talking about the minimum > requirement for an os. You're the one being silly. You are oversimplifying what it takes to do I/O and the components that are involved in doing that I/O. If you don't understand that the load on several components in the kernel changes, often in subtle but important ways, when you change the target of your I/O, then I don't know what to say to you. >> >> When testing our drivers against RHAS2.1 we found that the stock >> >> kernel had data corruption issues very similar to what your are talking >> >> about when run on very fast, hyperthreading, SMP machines. The data >> >> corruption occurred with any SCSI controller we tried, regardless of >> > vendor. >> > >> > My question is: is it solved? 
>> >> My understanding is that it was fixed in 2.4.18 level kernels, but since >> I don't know the root cause of the corruption, it could have just been >> made more difficult to reproduce. > > Can you point to some URL where information about this is available? https://rhn.redhat.com/errata/RHSA-2003-147.html This is just the most recent attempt to fix these issues. You might want to go back and read the other errata. >> > Justin, this is nothing quite serious, I just mentioned it for a feedback >> > to something _simple_. >> >> It's the only thing you've mentioned that I have enough information to >> look at. > > No, it is only the most simple one. Unfortunately scsi-driver development is > everything but simple for the standard problem case. It requires the ability > to set up equipment just like the discussed case for reproduction of the > problem. Of course only for cases the author cannot reproduce inside his > software via brain. All information needed to reproduce the main problem is > available in this thread. To reproduce your problem, I need the same MB, memory configuration, drive types, a 3ware card, and the same tape drive you have. I have tried various backup scenarios with *other hardware* and have failed to reproduce your problem. >> I suggest you go browse the code that is exercised by such an activity >> before you say that. > > What kind of a statement is this? It's one way of saying that you need to understand all of the code involved with turning a write syscall into a call into the aic7xxx driver. If you review the code path, you'll find that there are thousands of lines of code involved that have nothing to do with SCSI or the aic7xxx driver. To say that you have created a simple example that proves that the problem is in the aic7xxx driver is naive at best. > I want to solve a problem - for me _and_ for others (and this is > why I do it openly). > I really have not understood what you want, besides not being spoken to. 
> If I were you I would try to _prove_ that it is _not_ my problem, in best by > finding the real problem. As I said before, I have tried to reproduce your problem, but I cannot. I have no hope of proving that a problem I cannot replicate is not a problem with my driver. Some additional things that might help: o Characterize the type of corruption that you are seeing in a more formal way. For example, use an easy-to-verify pattern that will allow you to actually analyze the corruption. Is the corruption following some pattern? o Can you determine if the corruption is happening when writing to the tape vs. reading from it? You might do this by writing to the tape in an SMP mode that shows data corruption and then validate the driver in a safe, UP, mode and vice-versa. o What happens when you use different hardware/FS type/etc for the source and destination? > Unfortunately I (and some others) do have the > impression that you simply live by the idea that as long as nobody can > _prove_ your code has a problem, there is no problem. > This is in fact the bofh lifestyle that works for you (as long as you do not > meet an equally skilled person), but not for the users (spell "rest of us"). In this case, the information you have so far provided points away from the aic7xxx driver. I don't say that in all cases that I investigate, but I believe it to be true in this case. If past experience is any guide, 80-90% of the problems like this that I have debugged (and that I could actually replicate) were induced by using the aic7xxx driver, but turned out to be bugs in other components in the system. The aic7xxx driver happens to be one of the more aggressive SCSI drivers in the system and that can often lead to finding bugs in other components. > Back to the facts: > Simple question: you say it's not a problem inside the driver. Ok. Question: > how do you prove that? 
Can you specify a test setup (program or something) I > can check to see that there is no problem with the general SMP tape usage of > the aic driver? I mean you must have seen something working, or not? The only way to do this is to find the actual bug. The problem feels like a VM or FS race condition most likely caused by having the source controller and the destination controller on separate interrupts in the apic case so that you have real concurrency in the system. In the non apic case, it looks like everyone shares the same interrupt, so you cannot field interrupts for both the 3ware and the aic7xxx driver at the same time. I also say this because data corruption is something that is very difficult for the aic7xxx driver to accomplish without there being some kind of error message from the driver. I have lots of test setups that show the aic7xxx and aic79xx drivers working just fine in PIII and P4 dual and quad configurations with and without apic interrupt routing and writing to tape. There's not much more that I can do here without having your exact system here or having more information. -- Justin ^ permalink raw reply [flat|nested] 110+ messages in thread
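The "easy to verify pattern" suggested in the message above could look roughly like the following. This is only a sketch under stated assumptions, not a tool anyone in the thread used: the 512-byte block size and the names `make_block`, `write_pattern` and `analyze` are all hypothetical. Every block carries its own block number, so a bad block can be classified as either intact data from the wrong place ("misplaced") or damaged data ("garbled"), which is the kind of pattern analysis being asked for.

```python
import struct

BLOCK_SIZE = 512                 # assumed block size, not taken from the thread
WORD = struct.Struct(">QQ")      # (block index, word index), 16 bytes per word
WORDS_PER_BLOCK = BLOCK_SIZE // WORD.size


def make_block(index):
    """Return the expected contents of block number `index`."""
    return b"".join(WORD.pack(index, w) for w in range(WORDS_PER_BLOCK))


def write_pattern(path, n_blocks):
    """Write n_blocks self-describing blocks to `path`."""
    with open(path, "wb") as f:
        for i in range(n_blocks):
            f.write(make_block(i))


def analyze(path):
    """Return a list of (block, kind, found_index) for every bad block.

    kind is "misplaced" when the block holds the intact contents of a
    different block, and "garbled" otherwise (found_index is then None).
    """
    bad = []
    with open(path, "rb") as f:
        i = 0
        while True:
            data = f.read(BLOCK_SIZE)
            if not data:
                break
            if data != make_block(i):
                found = None
                if len(data) == BLOCK_SIZE:
                    j, _ = WORD.unpack_from(data)
                    if j != i and data == make_block(j):
                        found = j
                kind = "misplaced" if found is not None else "garbled"
                bad.append((i, kind, found))
            i += 1
    return bad
```

A pattern file written this way could be sent through the same tar-to-tape path as the real data and then run through `analyze` after reading it back. A run of "misplaced" blocks would hint at stale or misdirected buffers, while "garbled" blocks would hint at damage to the data itself.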
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-10 18:07 ` Justin T. Gibbs @ 2003-06-11 0:51 ` Stephan von Krawczynski 2003-06-11 4:39 ` Justin T. Gibbs 0 siblings, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-11 0:51 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-kernel, willy, marcelo, green On Tue, 10 Jun 2003 12:07:00 -0600 "Justin T. Gibbs" <gibbs@scsiguy.com> wrote: > >> I never said that it wasn't serios, I just haven't seen any indication > >> that this problem is caused by my driver. There is a big difference. > >> If your complaint is that I typically help people to solve their problems > >> *off-list*, then I'm sorry if that offends your sensibilities. > > > > It does not offend my sensibilities, it is simply damaging the available > > information about typical problems and their solving. If you don't do it > > open, there is no way for others to follow your thoughts and debugging and > > therefore you are confronted hundred times with the same questions. People > > have no choice but asking you, because your debugging cases are hidden. > > 99% of the problems have to do with broken interrupt routing. There is > plenty of information about this issue on the mailing lists, but people > still ask me. You should state an exact definition of "broken interrupt routing" in this case. The only thing I would call a broken interrupt routing is if an interrupt does not show up at all. Everything else is in my eyes a broken interrupt handling in the driver (generally spoken). 
A driver has (in my programming world) to cope with: - interrupts showing up immediately during the currently running interrupt handling (immediate recausing) - multiple interrupt causes per one shot (software or interrupt controller were to lazy for producing single interrupts per cause) - lost interrupts (may cause error condition of course but at least a message in some log) - continous interrupts (handler has to know when he is too long inside interrupt and give the rest of the system a chance to survive) - optimistic interrupt requeuing (handler has to know from the past what is the right flow of interrupt causes in a multiple caused interrupt, though hardware may be unable to tell him). > I just don't believe that this is true. Most of the questions that people > email me directly are questions that are easily answered by a google search. > In otherwords, the information is already readily available. It is just > easier to send email than to actually investigate a potential solution > to the problem. So, people send email and ask the same questions, and > get the same answers. Do you have a FAQ? > >> >> a buffer layer bug, or a filesystem bug. > >> > > >> > /dev/tape with a filesystem? Have you read what we are talking about? > >> > >> Where did you get the data to place on the tape? /dev/zero? > > > > Don't be silly. If reading a file from some hd would be a problem in > > itself, then we could all go home and have a beer. You are talking about > > the minimum requirement for an os. > > You're the one being silly. You are oversimplifying what it takes to > do I/O and the components that are involved in doing that I/O. If you > don't understand that the load on several components in the kernel changes, > often in subtle but important ways, when you change the target of your > I/O, then I don't know what to say to you. Data corruption is nothing subtle. We are not talking about performance tweaks, we are talking about the basics. 
Something like "a synchronous action (like reading during a verify) has to be synchronous". We are not talking about a hardware related problem on scsi bus. We are not talking about the box stumbling over a massive data flood. We are talking about reading a file/device to a memory buffer and doing a cmp action between two of those. If your os is not able to perform something like this you can do virtually nothing, not even booting (because your reading action corrupts the data). > >> >> When testing our drivers against RHAS2.1 we found that the stock > >> >> kernel had data corruption issues very similar to what your are talking > >> >> about when run on very fast, hyperthreading, SMP machines. The data > >> >> corruption occurred with any SCSI controller we tried, regardless of > >> > vendor. > >> > > >> > My question is: is it solved? > >> > >> My understanding is that it was fixed in 2.4.18 level kernels, but since > >> I don't know the root cause of the corruption, it could have just been > >> made more difficult to reproduce. > > > > Can you point to some URL where information about this is available? > > https://rhn.redhat.com/errata/RHSA-2003-147.html The scenario described there is unlikely for my case because a) I have only 3 GB of mem b) no hints are available that UP can solve the problem on the same hardware > > No, it is only the most simple one. Unfortunately scsi-driver development > > is everything but simple for the standard problem case. It requires the > > ability to set up equipment just like the discussed case for reproduction > > of the problem. Of course only for cases the author cannot reproduce > > inside his software via brain. All information needed to reproduce the > > main problem is available in this thread. > > To reproduce your problem, I need the same MB, memory configuration, drive > types, a 3ware card, and the same tape drive you have. 
I have tried various > backup scenarios with *other hardware* and have failed to reproduce your > problem. I have talked to others with similar problems and none has the same mb or a 3ware controller. All have problems with streamers on aic. All solutions I heard so far were done by replacing aic by whatever strange controller they got their hands on. > >> I suggest you go browse the code that is exercised by such an activity > >> before you say that. > > > > What kind of a statement is this? > > Its one way of saying that you need to understand all of the code involved > with turing a write syscall into a call into the aic7xxx driver. If you > review the code path, you'll find that there are thousands of lines of > code involved that have nothing to do with SCSI or the aic7xxx driver. > To say that you have created a simple example that proves that the problem > is in the aic7xxx driver is naive at best. To tell me it is not is just as good. > In this case, the information you have so far provided points away from > the aic7xxx driver. I don't say that in all cases that I investigate, > but I believe it to be true in this case. If past experience is any guide, > 80-90% of the problems like this that I have debugged (and that I could > actually replicate) were induced by using the aic7xxx driver, but turned > out to be bugs in other components in the system. The aic7xxx driver > happens to be one of the more agressive SCSI drivers in the system and > that can often lead to finding bugs in other components. Agressive is indeed a good term for it. And it describes exactly what I don't like about it. The primary goal of a driver (in my eyes) is to make some connected hardware work as expected. It is definitely not its primary goal to be overly brilliant and therefore detecting bugs in other subsystems. I have told you months ago that a symbios driven systems feels somehow smoother and faster - elegant. 
Whereas aic gives you the feeling someone tried to kick the systems butt with a big hammer. Its a matter of style and _defensiveness_. As long as you ride it agressively don't complain a lot of people go after you for explanations. And btw: you win nothing with your way, not even performance. > > Back to the facts: > > Simple question: you say its not a problem inside the driver. Ok. Question: > > how to you prove that? Can you specify a test setup (program or something) > > I can check to see that there is no problem with the general SMP tape usage > > of the aic driver? I mean you must have seen something working, or not? > > The only way to do this is to find the actual bug. The problem feels like > a VM or FS race condition most likely caused by having the source controller > and the destination controller on separate interrupts in the apic case so > that you have real concurrency in the system. In the non apic case, it looks > like everyone shares the same interrupt, so you cannot field interrupts > for both the 3ware and the aic7xxx driver at the same time. I also say > this because data corruption is something that is very difficult for the > aic7xxx driver to acomplish without there being some kind of error message > from the driver. Well, at least I managed to get some interesting statement from you after all. I have to think about this a bit. > I have lots of test setups that show the aic7xxx and aic79xx driver working > just fine in PIII and P4 dual and quad configurations with and without apic > interrupt routing and writing to tape. This does only mean you have not yet met something similar to my setup. It does not really prove a lot. > There's not much more that I can > do here without having your exact system here or having more information. Well, the thing is, I try to achieve information. But since the whole issue is all about lots of data I try to find an intelligent way to locate the cause of it all. 
I am not very confident that analysis of the trashed data will lead somewhere. I think narrowing the code path that leads to the problem by multiple distinct test scenarios looks more/faster promising. Can you think of something reducing the test complexity (not using tar, not comparing to a file or whatever)? Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-11 0:51 ` Stephan von Krawczynski @ 2003-06-11 4:39 ` Justin T. Gibbs 2003-06-11 20:23 ` Stephan von Krawczynski 0 siblings, 1 reply; 110+ messages in thread From: Justin T. Gibbs @ 2003-06-11 4:39 UTC (permalink / raw) To: Stephan von Krawczynski, Justin T. Gibbs Cc: linux-kernel, willy, marcelo, green >> 99% of the problems have to do with broken interrupt routing. There is >> plenty of information about this issue on the mailing lists, but people >> still ask me. > > You should state an exact definition of "broken interrupt routing" in this > case. The only thing I would call a broken interrupt routing is if an > interrupt does not show up at all. That's the only definition for it and 99% of the email I field about the aic7xxx driver is due to interrupts *not arriving*. >> I just don't believe that this is true. Most of the questions that people >> email me directly are questions that are easily answered by a google search. >> In otherwords, the information is already readily available. It is just >> easier to send email than to actually investigate a potential solution >> to the problem. So, people send email and ask the same questions, and >> get the same answers. > > Do you have a FAQ? It's the driver readme file. >> You're the one being silly. You are oversimplifying what it takes to >> do I/O and the components that are involved in doing that I/O. If you >> don't understand that the load on several components in the kernel changes, >> often in subtle but important ways, when you change the target of your >> I/O, then I don't know what to say to you. > > Data corruption is nothing subtle. We are not talking about performance tweaks, > we are talking about the basics. Something like "a synchronous action (like > reading during a verify) has to be synchronous". We are not talking about a > hardware related problem on scsi bus. We are not talking about the box > stumbling over a massive data flood. 
We are talking about reading a file/device > to a memory buffer and doing a cmp action between two of those. If your os is > not able to perform something like this you can do virtually nothing, not even > booting (because your reading action corrupts the data). And with any experience you will find that subtle races in all of these "basic operations" can often only be triggered by certain scenarios. Saying that "well my machine boots" is not enough to prove that the components involved to that point are bug free. You may be able to operate just fine in 99% of your test scenarios yet still have a very catastrophic flaw in the code. >> >> >> When testing our drivers against RHAS2.1 we found that the stock >> >> >> kernel had data corruption issues very similar to what your are talking >> >> >> about when run on very fast, hyperthreading, SMP machines. The data >> >> >> corruption occurred with any SCSI controller we tried, regardless of >> >> > vendor. >> >> > >> >> > My question is: is it solved? >> >> >> >> My understanding is that it was fixed in 2.4.18 level kernels, but since >> >> I don't know the root cause of the corruption, it could have just been >> >> made more difficult to reproduce. >> > >> > Can you point to some URL where information about this is available? >> >> https://rhn.redhat.com/errata/RHSA-2003-147.html > > The scenario described there is unlikely for my case because > a) I have only 3 GB of mem > b) no hints are available that UP can solve the problem on the same hardware This is only the latest corruption bug that has been addressed. You should really read all of the kernel erratas. The one we hit originally was this one: https://rhn.redhat.com/errata/RHSA-2002-227.html I'm not saying that this is your problem or even related, but just to point out that the type of data corruption you are talking about can occur due to bugs in core kernel functionality. 
>> To reproduce your problem, I need the same MB, memory configuration, drive >> types, a 3ware card, and the same tape drive you have. I have tried various >> backup scenarios with *other hardware* and have failed to reproduce your >> problem. > > I have talked to others with similar problems and none has the same mb or a > 3ware controller. Define similar. You are the only person I know of that is currently indicating they are having *data corruption* with the aic7xxx driver. That is, in particular, what I am trying to reproduce locally. > All have problems with streamers on aic. All solutions I > heard so far were done by replacing aic by whatever strange controller > they got their hands on. I'm glad they were able to resolve their problems. >> >> I suggest you go browse the code that is exercised by such an activity >> >> before you say that. >> > >> > What kind of a statement is this? >> >> Its one way of saying that you need to understand all of the code involved >> with turing a write syscall into a call into the aic7xxx driver. If you >> review the code path, you'll find that there are thousands of lines of >> code involved that have nothing to do with SCSI or the aic7xxx driver. >> To say that you have created a simple example that proves that the problem >> is in the aic7xxx driver is naive at best. > > To tell me it is not is just as good. You mean "just as naive"? Pointing your finger at the aic7xxx driver is not going to solve your problem. Ruling out other system components (of which there are many in your test case) also won't help find it. >> In this case, the information you have so far provided points away from >> the aic7xxx driver. I don't say that in all cases that I investigate, >> but I believe it to be true in this case. 
If past experience is any guide, >> 80-90% of the problems like this that I have debugged (and that I could >> actually replicate) were induced by using the aic7xxx driver, but turned >> out to be bugs in other components in the system. The aic7xxx driver >> happens to be one of the more agressive SCSI drivers in the system and >> that can often lead to finding bugs in other components. > > Agressive is indeed a good term for it. And it describes exactly what I don't > like about it. Then don't use choose to use it. > The primary goal of a driver (in my eyes) is to make some > connected hardware work as expected. It is definitely not its primary goal to > be overly brilliant and therefore detecting bugs in other subsystems. My goal is to take full advantage of the hardware I support in my drivers. That isn't an attempt to be "brilliant", but rather just taking advantage of the hardware you have purchased. The end result is that for instance the aic79xx driver can achieve sustained random I/O throughput 40% above it's main competetor. That isn't an attempt to break the rest of linux, but to get the most performance possible out of Linux. > I have > told you months ago that a symbios driven systems feels somehow smoother and > faster - elegant. Which doesn't tell me anything about the relative performance of the two drivers. Such subjective remarks do not provide any feedback that can be turned into a concrete plan to improve the driver. They don't even really tell me what you think is wrong with it. > And btw: you win nothing with your way, not even performance. Another unsubstantiated claim. Again, if you don't like the driver, or its style, you should just use something else if it will make you happier. It certainly sounds like that is the case. >> I have lots of test setups that show the aic7xxx and aic79xx driver working >> just fine in PIII and P4 dual and quad configurations with and without apic >> interrupt routing and writing to tape. 
> > This does only mean you have not yet met something similar to my setup. It > does not really prove a lot. Which is exactly my point! You act as though I should be able to magically reproduce and fix your problem. I've said that I can't reproduce it and that means I can't fix it without more information. I never claimed anything more than that other than your current data points do not, in my opinion, point to an aic7xxx driver problem. That doesn't *eliminate* the aic7xxx driver as a cause just as your test cases don't eliminate the other components of the system. > Well, the thing is, I try to achieve information. But since the whole issue is > all about lots of data I try to find an intelligent way to locate the cause of > it all. I am not very confident that analysis of the trashed data will lead > somewhere. If you filter all available information down to only what you believe will be relevant to solving the problem, then you will likely filter out things that might give others a clue as to the true cause of your problem. > I think narrowing the code path that leads to the problem by > multiple distinct test scenarios looks more/faster promising. Can you think of > something reducing the test complexity (not using tar, not comparing to a file > or whatever)? I would be analyzing the current failure modes first, but if you just want to try to narrow the cause by varying your configuration, you could do that by using a different source filesystem or even using /dev/zero or a program that generates the data that will be written to tape. You might also try to determine if the corruption happens when the tape is written or if the data is corrupted during the read. You could do this by doing multiple read sessions to see if the corruption is consistent or doing the write in what appears to be a safe kernel mode and the read in the unsafe kernel and vice versa. Etc. -- Justin ^ permalink raw reply [flat|nested] 110+ messages in thread
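The "multiple read sessions" idea from the message above can be sketched as a comparison of per-file checksums across several read-back passes of the same tape. This is an illustration only. The helper names `file_digest` and `classify` are hypothetical, and it assumes each pass has been restored to disk so its files can be hashed. A file that comes back wrong in the same way on every pass suggests the bad data is on the tape (write side). A file that is wrong only sometimes, or wrong in different ways, suggests the read path.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def classify(reference, read_passes):
    """Compare digests from several read passes against the source digests.

    reference:   {file name: digest of the source file}
    read_passes: list of {file name: digest as read back}, one per pass
    Returns {file name: "consistent" | "inconsistent"} for files that
    mismatched in at least one pass. With fewer than two passes every
    mismatch is trivially "consistent", so use at least two.
    """
    verdicts = {}
    for name, good in reference.items():
        seen = [p.get(name) for p in read_passes]
        wrong = [d for d in seen if d != good]
        if not wrong:
            continue
        if len(wrong) == len(seen) and len(set(wrong)) == 1:
            verdicts[name] = "consistent"      # same bad data every time
        else:
            verdicts[name] = "inconsistent"    # varies between reads
    return verdicts
```

The same classification could be repeated with the write done under one kernel configuration and the reads under another, which is the second variation proposed in the message.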
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-11 4:39 ` Justin T. Gibbs @ 2003-06-11 20:23 ` Stephan von Krawczynski 2003-06-11 21:01 ` John Stoffel 2003-06-12 13:54 ` Stephan von Krawczynski 0 siblings, 2 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-11 20:23 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-kernel, willy, marcelo, green Hello, a short note on today's test cycles. I switched to rc8 (SMP, apic); it took three cycles until it failed. rc8 (SMP, apic, HIGHIO) failed on the first try. I thought HIGHIO could make a difference if there were inherent problems with bounce buffers. Unfortunately this does not seem to be the case. Anyway it looks like failures have gotten fewer since rc8. I will try an overnight stress test now to see if I get it freezing again. Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-11 20:23 ` Stephan von Krawczynski @ 2003-06-11 21:01 ` John Stoffel 2003-06-13 9:45 ` Stephan von Krawczynski 2003-06-12 13:54 ` Stephan von Krawczynski 1 sibling, 1 reply; 110+ messages in thread From: John Stoffel @ 2003-06-11 21:01 UTC (permalink / raw) To: Stephan von Krawczynski Cc: Justin T. Gibbs, linux-kernel, willy, marcelo, green Stephan> I switched to rc8 (SMP, apic), took three cycles until it Stephan> failed. rc8 (SMP, apic, HIGHIO) failed on the first try. I Stephan> thought HIGHIO could make a difference if there were inherent Stephan> problems with bounce buffers. Unfortunately this seems not Stephan> the case. I'm doing testing on 2.5.70-mm3, SMP, APIC, PREEMPT with an AIC7880 driving a DLT7000 along with some idle disks on the same bus. I'm dumping data to tape and verifying it. Once I get more data, I'll followup. John ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-11 21:01 ` John Stoffel @ 2003-06-13 9:45 ` Stephan von Krawczynski 2003-06-15 12:56 ` Stephan von Krawczynski 2003-06-17 20:47 ` Marcelo Tosatti 0 siblings, 2 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-13 9:45 UTC (permalink / raw) To: John Stoffel; +Cc: gibbs, linux-kernel, willy, marcelo, green Hello all, this is the second day of stress-testing pure rc8 in SMP, apic mode. Today everything is fine, no freeze, no data corruption. current standings: 2 days continuous test, one file data corruption on day 1 Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-13 9:45 ` Stephan von Krawczynski @ 2003-06-15 12:56 ` Stephan von Krawczynski 2003-06-15 13:26 ` John Stoffel 2003-06-17 20:47 ` Marcelo Tosatti 1 sibling, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-15 12:56 UTC (permalink / raw) To: linux-kernel; +Cc: stoffel, gibbs, willy, marcelo, green Hello all, this is the fourth day of stress-testing pure rc8/2.4.21 in SMP, apic mode. Today another corruption happened. current standings: 4 days continuous test, one file data corruption on day 1 one file data corruption on day 4 Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-15 12:56 ` Stephan von Krawczynski @ 2003-06-15 13:26 ` John Stoffel 0 siblings, 0 replies; 110+ messages in thread From: John Stoffel @ 2003-06-15 13:26 UTC (permalink / raw) To: Stephan von Krawczynski Cc: linux-kernel, stoffel, gibbs, willy, marcelo, green Stephan> this is the fourth day of stress-testing pure rc8/2.4.21 in Stephan> SMP, apic mode. Today another corruption happened. Stephan> current standings: Stephan> 4 days continuous test, Stephan> one file data corruption on day 1 Stephan> one file data corruption on day 4 Can you define corruption? Can you tell us what commands you are using to generate the data which is written to tape? John ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-13 9:45 ` Stephan von Krawczynski 2003-06-15 12:56 ` Stephan von Krawczynski @ 2003-06-17 20:47 ` Marcelo Tosatti 2003-06-18 11:05 ` Stephan von Krawczynski 1 sibling, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2003-06-17 20:47 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: John Stoffel, gibbs, linux-kernel, willy, green On Fri, 13 Jun 2003, Stephan von Krawczynski wrote: > Hello all, > > this is the second day of stress-testing pure rc8 in SMP, apic mode. Today > everything is fine, no freeze, no data corruption. > > current standings: > > 2 days continuous test, one file data corruption on day 1 What kind of data corruption and what tests are you doing? (sorry if you already mentioned that on the list) ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-17 20:47 ` Marcelo Tosatti @ 2003-06-18 11:05 ` Stephan von Krawczynski 2003-06-18 14:21 ` John Stoffel 2003-06-20 19:59 ` Marcelo Tosatti 0 siblings, 2 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-18 11:05 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: stoffel, gibbs, linux-kernel, willy, green On Tue, 17 Jun 2003 17:47:02 -0300 (BRT) Marcelo Tosatti <marcelo@conectiva.com.br> wrote: > > > On Fri, 13 Jun 2003, Stephan von Krawczynski wrote: > > > Hello all, > > > > this is the second day of stress-testing pure rc8 in SMP, apic mode. Today > > everything is fine, no freeze, no data corruption. > > > > current standings: > > > > 2 days continuous test, one file data corruption on day 1 > > > What kind of data corruption and what tests are you doing ? (sorry if you > already mentionad that on the list) Today's score: 7 days continuous test one file data corruption on day 1 one file data corruption on day 4 two file data corruptions on day 6 Test is performed as follows: around 70-100 GB of data is transferred to a nfs-server with rc8 onto a RAID5 on 3ware-controller. The data is then copied via tar onto a SDLT drive connected to an aic controller. Afterwards the data is verified by tar. Since rc8 this runs stable (it froze before during the first day). What's left is that the verify sometimes fails (see above). It does not look like a write error to tape, because when retrying the verify cycle the errors occur in other files most of the time (or even not at all). It seems reading back is the problem. I doubt the problem lies on the 3ware side, because this would mean you cannot use it at all (there should be errors all over other actions as well then). Most of the several files tar'ed are beyond the 2 GB file size. They vary from around 100 MB up to about 15 GB per file, around 70 GB minimum summed up. Of course I exchanged the tapes and the drive. Didn't get better. 
Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-18 11:05 ` Stephan von Krawczynski @ 2003-06-18 14:21 ` John Stoffel 2003-06-18 14:54 ` Stephan von Krawczynski 2003-06-20 19:59 ` Marcelo Tosatti 1 sibling, 1 reply; 110+ messages in thread From: John Stoffel @ 2003-06-18 14:21 UTC (permalink / raw) To: Stephan von Krawczynski Cc: Marcelo Tosatti, stoffel, gibbs, linux-kernel, willy, green Stephan> 7 days continuous test Stephan> one file data corruption on day 1 Stephan> one file data corruption on day 4 Stephan> two file data corruptions on day 6 Stephan> Test is performed as follows: Stephan> around 70-100 GB of data is transferred to a nfs-server with Stephan> rc8 onto a RAID5 on 3ware-controller. The data is then Stephan> copied via tar onto a SDLT drive connected to an aic Stephan> controller. Afterwards the data is verified by tar. Is the data verified after the transfer to the NFS server? Does it pass muster then using MD5 sums on the files? What happens if you cut the tape drive out of the loop and copy the data to another partition on the 3ware controller and do the compare then? I assume you're doing: tar -c -f /dev/tape --verify /path/to/files and that's when you get the errors? Or are you writing to tape, and then doing a compare with: tar -c -f /dev/tape /path/to/files tar -d -f /dev/tape /path/to/files Stephan> Since rc8 this runs stable (froze before during the first Stephan> day). How much RAM is in the box, and how much free space is on the filesystem? I've been trying to replicate this type of test on 2.5.7x, but I've been having issues. I'm also just dumping a pile of MP3s to tape and reading them back to check. Stephan> Most of the several files tar'ed are beyond the 2 GB file Stephan> size. They vary from around 100MB upto about 15 GB per file, Stephan> around 70 GB minimum summed up. Of course I exchanged the Stephan> tapes and the drive. Didn't get better. This is an interesting data point. 
What happens if you make all the files be 2.5gb in size, do you get corruption more consistently then? I'm interested in this issue because I want to make sure that tape backups work reliably on Linux. John ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-18 14:21 ` John Stoffel @ 2003-06-18 14:54 ` Stephan von Krawczynski 0 siblings, 0 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-18 14:54 UTC (permalink / raw) To: John Stoffel; +Cc: marcelo, stoffel, gibbs, linux-kernel, willy, green On Wed, 18 Jun 2003 10:21:25 -0400 "John Stoffel" <stoffel@lucent.com> wrote: > > Stephan> 7 days continuous test > Stephan> one file data corruption on day 1 > Stephan> one file data corruption on day 4 > Stephan> two file data corruptions on day 6 > > Stephan> Test is performed as follows: > > Stephan> around 70-100 GB of data is transferred to a nfs-server with > Stephan> rc8 onto a RAID5 on 3ware-controller. The data is then > Stephan> copied via tar onto a SDLT drive connected to an aic > Stephan> controller. Afterwards the data is verified by tar. > > Is the data verified after the transfer to the NFS server? Does it > pass muster then using MD5 sums on the files? No, the content is not verified to be the same as on the nfs clients. But this is not the point here, it could as well be bad content that is saved to tape, and if you get wrong verification for this, your bad data simply got worse. Right? > What happens if you cut the tape drive out of the loop and copy the > data to another partition on the 3ware controller and do the compare > then? I have not managed to get the corruption on archives written to (the same) 3ware partition instead of tape up to this day. > > I assume you're doing: > > tar -c -f /dev/tape --verify /path/to/files No. See your second guess. > and that's when you get the errors? Or are you writing to tape, and > then doing a compare with: > > tar -c -f /dev/tape /path/to/files > tar -d -f /dev/tape /path/to/files Yes, I am separately verifying with "-d". > Stephan> Since rc8 this runs stable (froze before during the first > Stephan> day). > > How much RAM is in the box, and how much free space is on the > filesystem? 
I've been trying to replicate this type of test on > 2.5.7x, but I've been having issues. I'm also just dumping a pile of > MP3s to tape and reading them back to check. See first post of the thread, in case it already vanished: 3 GB RAM, 320 GB filesystem space, at least half free. > Stephan> Most of the several files tar'ed are beyond the 2 GB file > Stephan> size. They vary from around 100MB upto about 15 GB per file, > Stephan> around 70 GB minimum summed up. Of course I exchanged the > Stephan> tapes and the drive. Didn't get better. > > This is an interesting data point. What happens if you make all the > files be 2.5gb in size, do you get corruption more consistently then? Hm, I haven't tried this so far. My next guess would have been not to verify but to read the data completely in (to disk) again and then do a verification based on a file-compare utility. If there is a difference one can have a real look on the data, which is a bit of a mess on tape. > I'm interested in this issue because I want to make sure that tape > backups work reliably on Linux. Well, two of the same kind :-) Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-18 11:05 ` Stephan von Krawczynski 2003-06-18 14:21 ` John Stoffel @ 2003-06-20 19:59 ` Marcelo Tosatti 2003-06-20 20:59 ` Kevin P. Fleming 1 sibling, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2003-06-20 19:59 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: stoffel, gibbs, linux-kernel, willy, green On Wed, 18 Jun 2003, Stephan von Krawczynski wrote: > On Tue, 17 Jun 2003 17:47:02 -0300 (BRT) > Marcelo Tosatti <marcelo@conectiva.com.br> wrote: > > > > > > > On Fri, 13 Jun 2003, Stephan von Krawczynski wrote: > > > > > Hello all, > > > > > > this is the second day of stress-testing pure rc8 in SMP, apic mode. Today > > > everything is fine, no freeze, no data corruption. > > > > > > current standings: > > > > > > 2 days continuous test, one file data corruption on day 1 > > > > > > What kind of data corruption and what tests are you doing ? (sorry if you > > already mentionad that on the list) > > Todays score: > > 7 days continuous test > one file data corruption on day 1 > one file data corruption on day 4 > two file data corruptions on day 6 > > Test is performed as follows: > > around 70-100 GB of data is transferred to a nfs-server with rc8 onto a > RAID5 on 3ware-controller. The data is then copied via tar onto a SDLT > drive connected to an aic controller. Afterwards the data is verified by > tar. So the data is intact when it arrives on the 3ware and gets corrupted on the write to the tape? ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-20 19:59 ` Marcelo Tosatti @ 2003-06-20 20:59 ` Kevin P. Fleming 2003-06-20 21:13 ` Marcelo Tosatti 0 siblings, 1 reply; 110+ messages in thread From: Kevin P. Fleming @ 2003-06-20 20:59 UTC (permalink / raw) To: Marcelo Tosatti Cc: Stephan von Krawczynski, stoffel, gibbs, linux-kernel, willy, green Marcelo Tosatti wrote: > So the data is intact when it arrives on the 3ware and gets corrupted > on the write to the tape? > Actually, without another copy of the data on a different system to verify it with, you can't know that for sure. It could easily be getting to the tape (the actual media) just fine, but then get corrupted during the verify readback. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-20 20:59 ` Kevin P. Fleming @ 2003-06-20 21:13 ` Marcelo Tosatti 2003-06-20 22:03 ` Willy Tarreau 0 siblings, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2003-06-20 21:13 UTC (permalink / raw) To: Kevin P. Fleming Cc: Stephan von Krawczynski, stoffel, gibbs, linux-kernel, willy, green On Fri, 20 Jun 2003, Kevin P. Fleming wrote: > Marcelo Tosatti wrote: > > > So the data is intact when it arrives on the 3ware and gets corrupted > > on the write to the tape? > > > > Actually, without another copy of the data on a different system to > verify it with, you can't know that for sure. It could easily be getting > to the tape (the actual media) just fine, but then get corrupted during > the verify readback. Right. Stephan, if you could use a bit of your time to isolate the problem I would be VERY grateful. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-20 21:13 ` Marcelo Tosatti @ 2003-06-20 22:03 ` Willy Tarreau 2003-06-20 23:48 ` Stephan von Krawczynski 2003-06-24 18:31 ` Bill Davidsen 0 siblings, 2 replies; 110+ messages in thread From: Willy Tarreau @ 2003-06-20 22:03 UTC (permalink / raw) To: Marcelo Tosatti Cc: Kevin P. Fleming, Stephan von Krawczynski, stoffel, gibbs, linux-kernel, willy, green Hi ! On Fri, Jun 20, 2003 at 06:13:53PM -0300, Marcelo Tosatti wrote: > > Actually, without another copy of the data on a different system to > > verify it with, you can't know that for sure. It could easily be getting > > to the tape (the actual media) just fine, but then get corrupted during > > the verify readback. > > Right. Stephan, if you could use a bit of your time to isolate the problem > I would be VERY grateful. I remember Stephan once said that he used tar to verify the tape, and that for one backup, he did several tests showing corruption on different files. Although that doesn't mean that the tape is written totally correctly, it at least proves that there is a read corruption. I think that comparing multiple reads to find a pattern in corruption offsets (if any) is the only thing he could do (not speaking about mixing read/writes with good/bad kernels). Of course, storing 70 GB several times on disk is not easy, but a 16-bit checksum for each 1 kB block would result in files of about 140 MB, which will be "easier" to compare. It could be enough to check for empty blocks, duplicated blocks or totally random ones. Stephan, if you're willing to do the test but don't have such a tool, I may write a quick and dirty one tomorrow if that helps. BTW, it could be interesting to note the read buffer's hardware address for each test, in case it matters. Cheers, Willy ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-20 22:03 ` Willy Tarreau @ 2003-06-20 23:48 ` Stephan von Krawczynski 2003-06-21 10:50 ` Willy TARREAU 2003-06-24 18:31 ` Bill Davidsen 1 sibling, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-20 23:48 UTC (permalink / raw) To: Willy Tarreau Cc: marcelo, kpfleming, stoffel, gibbs, linux-kernel, willy, green On Sat, 21 Jun 2003 00:03:31 +0200 Willy Tarreau <willy@w.ods.org> wrote: > Hi ! > > On Fri, Jun 20, 2003 at 06:13:53PM -0300, Marcelo Tosatti wrote: > > > Actually, without another copy of the data on a different system to > > > verify it with, you can't know that for sure. It could easily be getting > > > to the tape (the actual media) just fine, but then get corrupted during > > > the verify readback. > > > > Right. Stephan, if you could use a bit of your time to isolate the problem > > I would be VERY grateful. > > I remember Stephan once said that he used tar to verify the tape, and that > for one backup, he did several tests showing corruption on different files. > Altough that doesn't mean that the tape is written totally correctly, it at > proves that there's at least a read corruption. Hello Willy, hello Marcelo, in fact I noticed that doing multiple verify cycles the so-called corruption happens rarely (read _very_ rarely) on the same files. So it is indeed very likely that the read case is a problem. Another thing to note is that I did not manage to produce a failed verify on a dataset tar'ed to the 3ware raid and not to tape. I did not test that very intensively, but from the tests I did I would have expected a corruption to happen based on the cycles I did on tape. > I think that comparing multiple reads to find a pattern in corruption offsets > (if any) is the only thing he could do (not speaking about mixing read/writes > with good/bad kernels). 
Of course, storing several times 70GB on disk is not > easy, but at least a 16 bits checksum for each 1kB block would result on > about 140 MB files, which will be "easier" to compare. It could be enough to > check for empty blocks, duplicated blocks or totally random ones. > > Stephan, if you're willing to do the test but don't have such a tool, I may > write a quick dirty one tomorrow if that helps. > > BTW, it could be interesting to note the read buffer's hardware address for > each test, in case it matters. Well, in fact I am a bit lost in the case because of the sheer data volume. I have space for several sets on disk, but it takes a damn long time to produce one write/verify cycle. Anyway, I will do it if that helps. The big problem with tar is that I have (to my knowledge) no way to make it save the data parts that fail verification. I guess this could help a lot, because we could then see what the corruption looks like, how long (in bytes) it is and so on. If anybody has an idea how to achieve this goal let me know. I am not 100% confident that the tests would look the same if I simply read the whole tape onto the disks again and then verify via file compare, but anyway I should try this too several times to complete the picture. Ok, weekend is here, I'll see what can be done. Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
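One way to get at the question raised in the message above (what a corruption looks like and how long it is in bytes) would be to read the archive back to disk twice and compare the two copies byte by byte. The following is only a sketch under that assumption, not a tool anybody in this thread used: the names find_diff_runs and report_stream_diffs, the 4 KiB chunk size and the output format are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define CHUNK    4096  /* bytes compared per fread() call (arbitrary choice) */
#define MAX_RUNS 64    /* runs remembered per chunk (arbitrary choice) */

/* One contiguous run of differing bytes. */
struct diff_run {
    size_t offset;  /* offset of the first differing byte */
    size_t length;  /* number of consecutive differing bytes */
};

/*
 * Scan two buffers of equal length and record up to max_runs runs of
 * differing bytes. Returns the number of runs found, which may exceed
 * max_runs; only the first max_runs are stored.
 */
size_t find_diff_runs(const unsigned char *a, const unsigned char *b,
                      size_t len, struct diff_run *runs, size_t max_runs)
{
    size_t nruns = 0, i = 0, start;

    while (i < len) {
        if (a[i] == b[i]) {
            i++;
            continue;
        }
        start = i;
        while (i < len && a[i] != b[i])
            i++;
        if (nruns < max_runs) {
            runs[nruns].offset = start;
            runs[nruns].length = i - start;
        }
        nruns++;
    }
    return nruns;
}

/*
 * Compare two streams chunk by chunk and print every differing run.
 * A run that crosses a chunk boundary is reported as two runs.
 * Returns the number of runs, or -1 if the streams differ in length.
 */
long report_stream_diffs(FILE *fa, FILE *fb, FILE *out)
{
    static unsigned char bufa[CHUNK], bufb[CHUNK];
    struct diff_run runs[MAX_RUNS];
    unsigned long long base = 0;
    long total = 0;

    assert(fa != NULL && fb != NULL && out != NULL);
    for (;;) {
        size_t na = fread(bufa, 1, CHUNK, fa);
        size_t nb = fread(bufb, 1, CHUNK, fb);
        size_t n = na < nb ? na : nb;
        size_t found = find_diff_runs(bufa, bufb, n, runs, MAX_RUNS);
        size_t k;

        for (k = 0; k < found && k < MAX_RUNS; k++)
            fprintf(out, "differ at offset %llu for %lu bytes\n",
                    base + runs[k].offset, (unsigned long)runs[k].length);
        total += (long)found;
        base += n;
        if (na != nb) {
            fprintf(out, "one input ends at offset %llu\n", base);
            return -1;
        }
        if (n == 0)
            return total;
    }
}
```

Called on two disk copies of the same tape, report_stream_diffs(fa, fb, stdout) would list where the copies disagree and for how many bytes, which is the information tar -d does not keep.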
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-20 23:48 ` Stephan von Krawczynski @ 2003-06-21 10:50 ` Willy TARREAU 2003-06-22 19:00 ` Stephan von Krawczynski 2003-06-23 11:30 ` Stephan von Krawczynski 0 siblings, 2 replies; 110+ messages in thread From: Willy TARREAU @ 2003-06-21 10:50 UTC (permalink / raw) To: Stephan von Krawczynski Cc: Willy Tarreau, marcelo, kpfleming, stoffel, gibbs, linux-kernel, green On Sat, Jun 21, 2003 at 01:48:28AM +0200, Stephan von Krawczynski wrote: > Well, in fact I am a bit lost in the case, because of the shere data volume, I > have space for several sets on disk, but it takes a damn long time to produce > one cycle write/verify. Anyway I will do if that helps. The big problem with > tar is that I have (to my knowledge) no chance to let it somewhere save the > verify-failing data parts. I guess this could help a lot, because we could then > see what the corruption looks like, how long (in bytes) it is and so on. > If anybody has an idea how to achieve this goal let me know. I wanted to implement a compare-and-capture feature in my check tool, but realized that it would certainly be of no help if you get duplicated blocks or so, because you'll have no way to tell *where* the captured block should have been. That's why I suggested the checksum instead: if you get a pattern such as:

      check1  check2
   0:  1234    1234
   1:  4567    4567
   2:  789a    4567
   3:  bcde    789a
   4:  f012    bcde
   ...

it will mean that block 1 was duplicated in check2. If you see:

      check1  check2
   0:  1234    1234
   1:  4567    4567
   2:  789a    4567
   3:  bcde    bcde
   4:  f012    f012
   ...

it will mean that block 1 was repeated instead of block 2 in check2. If you see 0000, it probably means that you got a block full of zeros, since the algorithm is only additive. The resulting files will be 1/512 of the input, so I think you'll find some space on your disk for such a file. 
It may be interesting to do regular checks during the second read, so that you can abort after the first error, and not have to do a second full read. > Ok, weekend is here, I see what can be done. Here is my proposed program. I tried it on my local hard disk, it took 5 min to check the full 8 GB (30 MB/s), and I reached 123 MB/s on a 4-disk software raid5 array with an AHA29160. It outputs the current offset every 64 MB. Here it is running on a DDS3:

[root@alpha /root]# ~willy/c/chkblk.alpha /dev/nst0 > nst0.chk
At offset 603979776...

I hope it can help. Cheers, Willy

/*
 * chkblk - computes block checksums - 2003/06/21 - Willy Tarreau <w@w.ods.org>
 *
 * This program is free, do what you want with it, I will not be responsible if
 * it trashes all your data.
 *
 * Reads a file and outputs a binary 16 bit checksum for each 1KB block.
 * Useful to check for data corruption. Eg :
 *
 *   # chkblk /dev/tape > test1.chk
 *   # chkblk /dev/tape > test2.chk
 *   # cmp -l test[12].chk
 *
 * or :
 *   # chkblk /dev/sda2 |od -tx2 -Ax > test1.txt
 *   # chkblk /dev/sda2 |od -tx2 -Ax > test2.txt
 *   # diff -u test[12].txt
 *
 * To be able to read files bigger than 2GB, you should compile it
 * with "-D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64".
 *
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>

#define BLOCKSIZE 1024

static void usage(void)
{
    fprintf(stderr,
            "Usage: chkblk input > output\n"
            "  - input is a file, device, ...\n"
            "  - output will be a binary file 1/512th the size of input\n");
    exit(1);
}

int main(int argc, char **argv)
{
    int fd;
    int len;
    int saved_errno;
    off_t inp_off;
    unsigned long *buffer;

    if (argc != 2)
        usage();

    buffer = malloc(BLOCKSIZE);
    if (buffer == NULL) {
        fprintf(stderr, "Out of memory\n");
        exit(2);
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        exit(3);
    }

    inp_off = 0;
    while ((len = read(fd, buffer, BLOCKSIZE)) > 0) {
        unsigned long sum = 0;
        size_t off;

        inp_off += len;

        /* displays the offset every 64 MB */
        if ((inp_off & 0x3ffffff) == 0)
            fprintf(stderr, "At offset %llu...\r",
                    (unsigned long long)inp_off);

        /* sums whole words only; trailing bytes of a short read are ignored */
        for (off = 0; off < len / sizeof(*buffer); off++)
            sum += buffer[off];

        /* folds the sum down to 16 bits */
        while (sum >= (1UL << 16))
            sum = (sum & 0xffff) + (sum >> 16);

        /* low byte first, then high byte */
        putchar(sum & 0xff);
        putchar(sum >> 8);
    }
    saved_errno = errno;  /* fprintf() below may clobber errno */

    fprintf(stderr, "At offset %llu", (unsigned long long)inp_off);
    if (len < 0) {
        fprintf(stderr, ", read returned : %s\n", strerror(saved_errno));
        close(fd);
        exit(4);
    } else {
        fprintf(stderr, ", check completed without error\n");
    }
    close(fd);
    exit(0);
}

^ permalink raw reply [flat|nested] 110+ messages in thread
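The two patterns described in the message above (a duplicated block that shifts everything after it, versus a block repeated in place of its successor) can also be told apart mechanically once two chkblk outputs are loaded into memory. A minimal sketch, assuming the byte order chkblk writes (low byte first, then high byte); the enum and function names are hypothetical and not part of chkblk:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum chk_verdict {
    CHK_ZERO_BLOCK,    /* second run has a block whose checksum is 0000 */
    CHK_DUP_SHIFTED,   /* previous block seen twice, the rest shifted by one */
    CHK_DUP_REPLACED,  /* previous block seen in place of this one */
    CHK_OTHER          /* none of the patterns above */
};

/* Load up to max checksums stored as (low byte, high byte) pairs. */
size_t load_chk(FILE *in, uint16_t *sums, size_t max)
{
    size_t n = 0;
    int lo, hi;

    while (n < max && (lo = getc(in)) != EOF && (hi = getc(in)) != EOF)
        sums[n++] = (uint16_t)(lo | (hi << 8));
    return n;
}

/* Index of the first block whose checksums differ, or n if none do. */
size_t first_mismatch(const uint16_t *c1, const uint16_t *c2, size_t n)
{
    size_t i;

    for (i = 0; i < n && c1[i] == c2[i]; i++)
        ;
    return i;
}

/*
 * Classify the mismatch at block i, taking c1 as the reference run.
 * If i is the last block, a duplicate cannot be told apart further and
 * is reported as CHK_DUP_REPLACED.
 */
enum chk_verdict classify_mismatch(const uint16_t *c1, const uint16_t *c2,
                                   size_t n, size_t i)
{
    assert(i < n && c1[i] != c2[i]);

    if (c2[i] == 0)
        return CHK_ZERO_BLOCK;
    if (i > 0 && c2[i] == c2[i - 1]) {
        /* the expected block showing up one slot later means a shift */
        if (i + 1 < n && c2[i + 1] == c1[i])
            return CHK_DUP_SHIFTED;
        return CHK_DUP_REPLACED;
    }
    return CHK_OTHER;
}
```

Since different blocks can share a 16-bit checksum, a verdict from this sketch is a hint about where to look, not proof of what happened on the bus.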
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-21 10:50 ` Willy TARREAU @ 2003-06-22 19:00 ` Stephan von Krawczynski 2003-06-23 11:30 ` Stephan von Krawczynski 1 sibling, 0 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-22 19:00 UTC (permalink / raw) To: Willy TARREAU Cc: willy, marcelo, kpfleming, stoffel, gibbs, linux-kernel, green Hello all, here is the interesting result of my working weekend with intensive testing: As 22-pre1 just came out I decided to use it for further testing of the issue, because I don't like testing old kernels particularly. And to my great surprise I have not managed to break 22-pre1 so far. I have up to now moved about 1 TB of data through the box (written to tape and verified) and have not yet produced a single verify error. Question is: how do I continue? Of course the tape-writing actions will be continuing, so I still have a look at the issue every day. Are we interested in finding out what particular patch in pre1 is responsible for this? Well, at least there is the positive result that pre1 seems significantly better... Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-21 10:50 ` Willy TARREAU 2003-06-22 19:00 ` Stephan von Krawczynski @ 2003-06-23 11:30 ` Stephan von Krawczynski 2003-06-24 11:11 ` Stephan von Krawczynski 1 sibling, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-23 11:30 UTC (permalink / raw) To: Willy TARREAU Cc: willy, marcelo, kpfleming, stoffel, gibbs, linux-kernel, green Hello again, so we learned that working on the weekend is no good ;-) The problem is back - still on 22-pre1 . I had two failed verifications this morning. Now I am giving Willy's checksumming a try. I'll keep you informed. Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-23 11:30 ` Stephan von Krawczynski @ 2003-06-24 11:11 ` Stephan von Krawczynski 2003-06-24 17:43 ` Willy Tarreau 0 siblings, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-24 11:11 UTC (permalink / raw) To: linux-kernel; +Cc: willy, marcelo, kpfleming, stoffel, gibbs, green Hello all, hello Willy, I tried to reproduce the problem using your chkblk tool, but was not successful up to now. All checksums are the same. Is it possible that the problem lies deeper in the process than expected? Remember I do:

  copy data via NFS to server
  tar data on server to tape
  read data back for verification with tar -d

Is it possible that the verification errors do not occur because of a read problem, but because of a page-cached block getting trashed somehow between "tar to tape" and "read from tape"? I would suspect that some blocks survive in memory and are re-used during verification. If for some reason this data is invalid or corrupted, the verification fails although the read was correct. I know that this sounds weird, but it is nevertheless possible, or not? It may even be worse: the data may also have been left over from the original nfs action, correct? Is there a way to completely invalidate/flush all cached blocks concerning this fs (besides umount)? Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-24 11:11 ` Stephan von Krawczynski @ 2003-06-24 17:43 ` Willy Tarreau 2003-06-24 21:26 ` Stephan von Krawczynski 0 siblings, 1 reply; 110+ messages in thread From: Willy Tarreau @ 2003-06-24 17:43 UTC (permalink / raw) To: Stephan von Krawczynski Cc: linux-kernel, willy, marcelo, kpfleming, stoffel, gibbs, green Hi Stephan, > Is it possible that the verification errors do not occur because of a read > problem, but because of a page cached block getting trashed somehow between > "tar to tape" and "read from tape". I would suspect that some blocks survive in > memory and are re-used during verification. If for some reason this data is > invalid or corrupted the verification fails although the read was correct. That seems strange to me, I don't see how we could cache data from a char device. It is possible that chkblk and tar don't use same block size and that your problem only occurs on larger transfers, or particularly aligned ones. You could try to increase the block size in chkblk to something bigger than a page for example. I don't know if tar reads your tape at full speed, but it's possible that if it doesn't cope with the tape speed, an overrun occurs and something finally gets dropped :-/ > I know that this sounds weird, but nevertheless possible, or not? > It may even be worse, the data may have also been left from the original nfs > action, correct? > Is there a way to completely invalidate/flush all cached blocks concerning this > fs (besides umount)? I don't believe in this. But as Justin says, this card can get very high performances and hassle the hardware. Perhaps you have a rare weakness in your hardware that only occurs under these conditions, although I don't know how this could be checked. IIRC, you said that it works flawlessly in UP and you need SMP to hit the bug. Perhaps your second CPU is sometimes flaky (bad cache, etc...) 
:-/ Cheers, Willy ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-24 17:43 ` Willy Tarreau @ 2003-06-24 21:26 ` Stephan von Krawczynski 2003-06-24 22:03 ` Willy Tarreau 2003-06-25 2:22 ` Valdis.Kletnieks 0 siblings, 2 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-24 21:26 UTC (permalink / raw) To: Willy Tarreau Cc: linux-kernel, willy, marcelo, kpfleming, stoffel, gibbs, green On Tue, 24 Jun 2003 19:43:31 +0200 Willy Tarreau <willy@w.ods.org> wrote: > Hi Stephan, > > > Is it possible that the verification errors do not occur because of a read > > problem, but because of a page cached block getting trashed somehow between > > "tar to tape" and "read from tape". I would suspect that some blocks > > survive in memory and are re-used during verification. If for some reason > > this data is invalid or corrupted the verification fails although the read > > was correct. > > That seems strange to me, I don't see how we could cache data from a char > device. Hello Willy, sorry, you probably misunderstood my flaky explanation. What I meant was not a cached block from the _tape_ (obviously indeed a char-type device) but from the 3ware disk (i.e. the other side of the verification). Consider the tape completely working, but the disk data corrupt (possibly not from real reading but from corrupted cache). > It is possible that chkblk and tar don't use same block size and that > your problem only occurs on larger transfers, or particularly aligned ones. Very likely not the same block size, with tar I use -b64. > You could try to increase the block size in chkblk to something bigger than a > page for example. I don't know if tar reads your tape at full speed, It does. There's no head repositioning. > but it's > possible that if it doesn't cope with the tape speed, an overrun occurs and > something finally gets dropped :-/ Very unlikely, how do you create an overrun in a synchronous single read operation? 
> > I know that this sounds weird, but nevertheless possible, or not? > > It may even be worse, the data may have also been left from the original > > nfs action, correct? > > Is there a way to completely invalidate/flush all cached blocks concerning > > this fs (besides umount)? > > I don't believe in this. But as Justin says, this card can get very high > performances and hassle the hardware. Perhaps you have a rare weakness in > your hardware that only occurs under these conditions, although I don't know > how this could be checked. I doubt that. Reason is that though the tape is pretty fast for a tape it is still pretty slow compared to a disk. Since I use the box for months now I would have expected such a hardware problem to show up for disk access, too. And there was none. > IIRC, you said that it works flawlessly in UP and you need SMP to hit the > bug. Perhaps your second CPU is sometimes flaky (bad cache, etc...) :-/ Hm, interestingly the former freeze bug (solved by marcelo through backout of some patch in rc8) did not show up in UP. Since then I did not test UP any more. The problem itself does not necessarily point to flaky hardware, as I would have no idea how bad cache can only show up during a tape verification, that does not sound all that reasonable. More likely could be a SMP race anywhere from nfs-server, 3ware disk driver to page cache, or not? Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
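Since tar was run with -b64 here (64 x 512 = 32768 bytes per record) while chkblk reads 1 KB at a time, one way to follow the suggestion about larger transfers would be a variant that issues 32 KB reads but still emits one checksum per 1 KB, so the output keeps the same granularity. What follows is a rough sketch only. The checksum is a stand-in 16-bit end-around-carry sum over little-endian byte pairs and is not claimed to match chkblk's output byte for byte, and the names sum16, checksum_record and checksum_fd are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define SUBBLOCK 1024        /* checksum granularity, as in chkblk */
#define RECORD   (64 * 512)  /* read size matching tar -b64 (assumption) */

/* 16-bit end-around-carry sum over little-endian byte pairs. */
uint16_t sum16(const unsigned char *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(len <= SUBBLOCK);  /* keeps the 32-bit accumulator from overflowing */
    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)p[i] | ((uint32_t)p[i + 1] << 8);
    if (i < len)
        sum += p[i];          /* odd trailing byte */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * Emit one checksum per SUBBLOCK of a record already in memory.
 * out must have room for (len + SUBBLOCK - 1) / SUBBLOCK entries.
 * Returns the number of checksums written.
 */
size_t checksum_record(const unsigned char *rec, size_t len, uint16_t *out)
{
    size_t n = 0, off;

    for (off = 0; off < len; off += SUBBLOCK) {
        size_t chunk = len - off < SUBBLOCK ? len - off : SUBBLOCK;

        out[n++] = sum16(rec + off, chunk);
    }
    return n;
}

/*
 * Read fd with RECORD-sized requests and write the checksums to out,
 * low byte first. A short read in mid-stream shifts the sub-block
 * boundaries for the rest of the run. Returns 0 on success, -1 on a
 * read error.
 */
int checksum_fd(int fd, FILE *out)
{
    static unsigned char rec[RECORD];
    uint16_t sums[RECORD / SUBBLOCK];
    ssize_t len;

    while ((len = read(fd, rec, RECORD)) > 0) {
        size_t n = checksum_record(rec, (size_t)len, sums);
        size_t k;

        for (k = 0; k < n; k++) {
            putc(sums[k] & 0xff, out);
            putc(sums[k] >> 8, out);
        }
    }
    return len < 0 ? -1 : 0;
}
```

Two runs of checksum_fd on the same tape can be compared with cmp -l just like chkblk output, the difference being that each read request now has the size tar used.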
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-24 21:26 ` Stephan von Krawczynski @ 2003-06-24 22:03 ` Willy Tarreau 2003-06-24 23:43 ` Stephan von Krawczynski 2003-06-25 2:22 ` Valdis.Kletnieks 1 sibling, 1 reply; 110+ messages in thread From: Willy Tarreau @ 2003-06-24 22:03 UTC (permalink / raw) To: Stephan von Krawczynski Cc: Willy Tarreau, linux-kernel, marcelo, kpfleming, stoffel, gibbs, green On Tue, Jun 24, 2003 at 11:26:09PM +0200, Stephan von Krawczynski wrote: > sorry, you probably misunderstood my flaky explanation. What I meant was not a > cached block from the _tape_ (obviously indeed a char-type device) but from the > 3ware disk (i.e. the other side of the verification). Consider the tape > completely working, but the disk data corrupt (possibly not from real reading > but from corrupted cache). Ah, OK ! I didn't understand this. You're right, this is also a possibility. Perhaps a tar cf - /mnt/3ware | chkblk would show evidence of some corruption? <...snip... OK for these points ...> > Hm, interestingly the former freeze bug (solved by marcelo through backout of > some patch in rc8) did not show up in UP. Since then I did not test UP any > more. The problem itself does not necessarily point to flaky hardware, as I > would have no idea how bad cache can only show up during a tape verification, > that does not sound all that reasonable. OK, I agree. And right after posting, I remembered that if this was the case, you should also see some MCEs, which doesn't seem to be the case for you. > More likely could be a SMP race anywhere from nfs-server, 3ware disk driver to > page cache, or not? Quite possible. That's also what Justin suggested in the past, BTW :-) Cheers, Willy ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-24 22:03 ` Willy Tarreau @ 2003-06-24 23:43 ` Stephan von Krawczynski 2003-06-25 19:16 ` Willy Tarreau 0 siblings, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-24 23:43 UTC (permalink / raw) To: Willy Tarreau Cc: willy, linux-kernel, marcelo, kpfleming, stoffel, gibbs, green On Wed, 25 Jun 2003 00:03:31 +0200 Willy Tarreau <willy@w.ods.org> wrote: > On Tue, Jun 24, 2003 at 11:26:09PM +0200, Stephan von Krawczynski wrote: > > > sorry, you probably misunderstood my flaky explanation. What I meant was > > not a cached block from the _tape_ (obviously indeed a char-type device) > > but from the 3ware disk (i.e. the other side of the verification). Consider > > the tape completely working, but the disk data corrupt (possibly not from > > real reading but from corrupted cache). > > Ah, OK ! I didn't understand this. You're right, this is also a possibility. > Perhaps a tar cf - /mnt/3ware | chkblk would get evidence of somme corruption > ? Hm, probably a dumb question: does repeated tar'ing of the same files lead to exactly the same archive? There is no timestamp inside or something equivalent ? Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-24 23:43 ` Stephan von Krawczynski @ 2003-06-25 19:16 ` Willy Tarreau 2003-06-25 19:42 ` Stephan von Krawczynski 2003-06-25 23:04 ` Bernd Eckenfels 0 siblings, 2 replies; 110+ messages in thread From: Willy Tarreau @ 2003-06-25 19:16 UTC (permalink / raw) To: Stephan von Krawczynski Cc: Willy Tarreau, linux-kernel, marcelo, kpfleming, stoffel, gibbs, green

On Wed, Jun 25, 2003 at 01:43:53AM +0200, Stephan von Krawczynski wrote:
> > Ah, OK ! I didn't understand this. You're right, this is also a possibility.
> > Perhaps a tar cf - /mnt/3ware | chkblk would get evidence of some corruption
> > ?
>
> Hm, probably a dumb question: does repeated tar'ing of the same files lead to
> exactly the same archive? There is no timestamp inside or something equivalent
> ?

Hmmm no, you're right, I forgot about this case. I think that access time or
other time-dependent information may change often enough to make a big
difference in the checksums. I have no other ideas at the moment. Or perhaps
tar to a disk file instead of the tape and check that file :-/

Cheers,
Willy

^ permalink raw reply [flat|nested] 110+ messages in thread
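On the question of whether archiving the same files twice yields the same
bytes, here is a minimal sketch. It assumes Python 3 and its standard tarfile
module, not GNU tar itself, so it only illustrates the principle. The headers
written this way carry each member's modification time but no access time, so
merely reading the files should not change the archive, while a changed mtime
does. The helper name archive_bytes is hypothetical.

```python
import io
import tarfile


def archive_bytes(root: str) -> bytes:
    """Return an uncompressed tar archive of `root` as bytes (hypothetical helper)."""
    buf = io.BytesIO()
    # No compression layer is used, so nothing depends on the creation time
    # of the archive itself; only per-member metadata ends up in the headers.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # add() recurses into directories; the archive name is made relative.
        tar.add(root, arcname=".")
    return buf.getvalue()
```

Comparing two such archives of an untouched tree with a plain byte comparison
(or a checksum) is then meaningful; whether the same holds for a given GNU tar
invocation depends on the archive format and options in use.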
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-25 19:16 ` Willy Tarreau @ 2003-06-25 19:42 ` Stephan von Krawczynski 2003-06-25 20:30 ` John Stoffel 2003-06-25 23:04 ` Bernd Eckenfels 1 sibling, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-25 19:42 UTC (permalink / raw) To: Willy Tarreau Cc: willy, linux-kernel, marcelo, kpfleming, stoffel, gibbs, green

On Wed, 25 Jun 2003 21:16:55 +0200
Willy Tarreau <willy@w.ods.org> wrote:

> On Wed, Jun 25, 2003 at 01:43:53AM +0200, Stephan von Krawczynski wrote:
> > > Ah, OK ! I didn't understand this. You're right, this is also a
> > > possibility. Perhaps a tar cf - /mnt/3ware | chkblk would get evidence of
> > > some corruption?
> >
> > Hm, probably a dumb question: does repeated tar'ing of the same files lead
> > to exactly the same archive? There is no timestamp inside or something
> > equivalent?
>
> Hmmm no, you're right, I forgot about this case. I think that access time or
> other time-dependent information may change often enough to make a big diff
> on checksums. I have no more idea at the moment. Or perhaps tar to a disk
> file instead of the tape and check that file :-/

I have tried that already but never managed to get verification errors on tar
archives written to disk. Maybe I'll try again some more...

Regards,
Stephan

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-25 19:42 ` Stephan von Krawczynski @ 2003-06-25 20:30 ` John Stoffel 2003-06-26 9:36 ` Stephan von Krawczynski 2003-06-26 11:34 ` Stephan von Krawczynski 0 siblings, 2 replies; 110+ messages in thread From: John Stoffel @ 2003-06-25 20:30 UTC (permalink / raw) To: Stephan von Krawczynski Cc: Willy Tarreau, linux-kernel, marcelo, kpfleming, stoffel, gibbs, green

>>>>> "Stephan" == Stephan von Krawczynski <skraw@ithnet.com> writes:

Stephan> On Wed, 25 Jun 2003 21:16:55 +0200
Stephan> Willy Tarreau <willy@w.ods.org> wrote:

>> Hmmm no, you're right, I forgot about this case. I think that
>> access time or other time-dependent information may change often
>> enough to make a big diff on checksums. I have no more idea at the
>> moment. Or perhaps tar to a disk file instead of the tape and check
>> that file :-/

Stephan> I have tried that already but never managed to get
Stephan> verification errors on tar archives written to disk. Maybe I
Stephan> try again some more...

I've been trying to get tar errors myself, while writing a 35gb
filesystem to a DLT7000. I'm now running 2.4.21-pre5-ac1 and I
haven't seen any errors. Yet. I'm using the 6.2.8 version of the
driver as well. The filesystem is just a copy of my home directory
and some MP3s and other random files and such. Lots of text and jpeg
files, along with some other stuff.

Maybe I need to try and generate 15-18 files 2gb+ each and dump them
to tape with tar and see how that's handled, and if we get errors.

Stephan, can you double check your version info as well? And it would
be great to get some info on your 3ware setup as well, just so we can
work on narrowing down the issues.

Unfortunately, due to the way I have to set things up, the RAID array
and the tape drive are on the same channel, which slows things down,
I'm sure.

Here are some timings from dumping and verifying the data to tape:

  jfsnew:/# time tar -c-W -b 128 -f /dev/st0 /scratch
  tar: Removing leading `/' from member names
  408.840u 869.730s 4:03:02.80 8.7% 0+0k 0+0io 258pf+0w

  jfsnew:/# time tar -c-W -b 256 -f /dev/st0 /scratch
  tar: Removing leading `/' from member names
  443.210u 1104.930s 4:07:00.89 10.4% 0+0k 0+0io 264pf+0w

My filesystem is as follows:

  jfsnew:/home# mdadm -D /dev/md1
  /dev/md1:
          Version : 00.90.00
    Creation Time : Mon Jun 23 22:51:43 2003
       Raid Level : raid0
       Array Size : 44457600 (42.40 GiB 45.57 GB)
     Raid Devices : 5
    Total Devices : 5
  Preferred Minor : 1
      Persistence : Superblock is persistent

      Update Time : Mon Jun 23 22:51:43 2003
            State : dirty, no-errors
   Active Devices : 5
  Working Devices : 5
   Failed Devices : 0
    Spare Devices : 0

       Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
         0       8       48        0      active sync   /dev/sdd
         1       8       64        1      active sync   /dev/sde
         2       8       80        2      active sync   /dev/sdf
         3       8       96        3      active sync   /dev/sdg
         4       8      112        4      active sync   /dev/sdh
             UUID : ffa7efb1:1c151f2d:4f6a138c:77085f29

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-25 20:30 ` John Stoffel @ 2003-06-26 9:36 ` Stephan von Krawczynski 2003-06-26 11:34 ` Stephan von Krawczynski 1 sibling, 0 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-26 9:36 UTC (permalink / raw) To: John Stoffel; +Cc: willy, linux-kernel, marcelo, kpfleming, gibbs, green

On Wed, 25 Jun 2003 16:30:22 -0400
"John Stoffel" <stoffel@lucent.com> wrote:

> >>>>> "Stephan" == Stephan von Krawczynski <skraw@ithnet.com> writes:
>
> Stephan> I have tried that already but never managed to get
> Stephan> verification errors on tar archives written to disk. Maybe I
> Stephan> try again some more...
>
> I've been trying to get tar errors myself, while writing a 35gb
> filesystem to a DLT7000. I'm now running 2.4.21-pre5-ac1 and I
> haven't seen any errors. Yet. I'm using the 6.2.8 version of the
> driver as well. The filesystem is just a copy of my home directory
> and some MP3s and other random files and such. Lots of text and jpeg
> files, along with some other stuff.
>
> Maybe I need to try and generate 15-18 files 2gb+ each and dump them
> to tape with tar and see how that's handled, and if we get errors.
>
> Stephan, can you double check your version info as well? And it would
> be great to get some info on your 3ware setup as well, just so we can
> work on narrowing down the issues.

Hm, I guess you mean the kernel version? I have been experiencing this problem
since about the 21-rcX versions and am currently running 22-pre1. The 3ware
setup is pretty straightforward: a RAID5 with three 160 GB disks and no spare.
I would not rule out that NFS interacts with this problem. Can you try to move
the data you back up from somewhere via NFS to your tar'ing box?

Regards,
Stephan

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-25 20:30 ` John Stoffel 2003-06-26 9:36 ` Stephan von Krawczynski @ 2003-06-26 11:34 ` Stephan von Krawczynski 2003-06-30 10:10 ` Stephan von Krawczynski 1 sibling, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-26 11:34 UTC (permalink / raw) To: John Stoffel; +Cc: willy, linux-kernel, marcelo, kpfleming, gibbs, green

On Wed, 25 Jun 2003 16:30:22 -0400
"John Stoffel" <stoffel@lucent.com> wrote:

> Maybe I need to try and generate 15-18 files 2gb+ each and dump them
> to tape with tar and see how that's handled, and if we get errors.

More data on this:
Today was a very bad day regarding the issue. I experienced three verification
errors. The file sizes (in bytes) were:

    563162975
    746555206
  12679280738

So it seems it is not really linked to the file size.

Regards,
Stephan

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-26 11:34 ` Stephan von Krawczynski @ 2003-06-30 10:10 ` Stephan von Krawczynski 2003-06-30 11:39 ` Marcelo Tosatti 0 siblings, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-30 10:10 UTC (permalink / raw) To: linux-kernel; +Cc: stoffel, willy, marcelo, kpfleming, gibbs, green Hello all, it looks like the problem gets worse currently. This is the second day I see 4 verification errors. This is with kernel 2.4.22-pre2 now. Regards, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-30 10:10 ` Stephan von Krawczynski @ 2003-06-30 11:39 ` Marcelo Tosatti 2003-06-30 12:08 ` Stephan von Krawczynski 0 siblings, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2003-06-30 11:39 UTC (permalink / raw) To: Stephan von Krawczynski Cc: linux-kernel, stoffel, willy, kpfleming, gibbs, green

On Mon, 30 Jun 2003, Stephan von Krawczynski wrote:

> Hello all,
>
> it looks like the problem gets worse currently. This is the second day I see 4
> verification errors. This is with kernel 2.4.22-pre2 now.

As far as I understand, the tape is corrupting the data (either when writing
or when reading back).

Is this correct?

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-30 11:39 ` Marcelo Tosatti @ 2003-06-30 12:08 ` Stephan von Krawczynski 0 siblings, 0 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-30 12:08 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: linux-kernel, stoffel, willy, kpfleming, gibbs, green

On Mon, 30 Jun 2003 08:39:38 -0300 (BRT)
Marcelo Tosatti <marcelo@conectiva.com.br> wrote:

>
>
> On Mon, 30 Jun 2003, Stephan von Krawczynski wrote:
>
> > Hello all,
> >
> > it looks like the problem gets worse currently. This is the second day I
> > see 4 verification errors. This is with kernel 2.4.22-pre2 now.
>
>
> As far as I understood, the tape is corrupting the data (or writing, or
> when reading back).
>
> Is this correct?

Actually my guess is that the _data_ itself is not corrupt, neither the
original set located on the 3ware RAID nor the backed-up set on the
aic-connected SDLT. In my personal opinion, the flaw occurs during the
readback that happens while verifying. I do not know if the data is corrupted
by the aic driver (less probable currently) or by some flaw inside the caching
of the _original_ set.
The situation is complex because of the multiple subsystems involved. My
experience is this:

If you reboot and run a backup/verify cycle from 3ware to aic/tape, everything
seems fine.
If you reboot and push data over NFS to the 3ware disk, then run the
backup/verify cycle (with this data) from 3ware to aic/tape, corruption is
very likely.
If you then do another verify run of the data, you see corruption on _other_
files than in the verify run before.
It is therefore unlikely that both data "ends" are part of the problem,
because you would expect the same corruptions to show up. At least this is my
hope.

Regards,
Stephan

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-25 19:16 ` Willy Tarreau 2003-06-25 19:42 ` Stephan von Krawczynski @ 2003-06-25 23:04 ` Bernd Eckenfels 1 sibling, 0 replies; 110+ messages in thread From: Bernd Eckenfels @ 2003-06-25 23:04 UTC (permalink / raw) To: linux-kernel

In article <20030625191655.GA15970@alpha.home.local> you wrote:
> Hmmm no, you're right, I forgot about this case. I think that access time or
> other time-dependent information may change often enough to make a big diff
> on checksums. I have no more idea at the moment. Or perhaps tar to a disk file
> instead of the tape and check that file :-/

you can cat the tree into md5sum or run md5sum on the tree:

  find . -print0 | xargs -0 cat | md5sum

this will only compare file content. You could first dump it to a file and
then md5sum it, if you also want to test writes.

Greetings
Bernd
--
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/

^ permalink raw reply [flat|nested] 110+ messages in thread
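The content-only comparison suggested above can be sketched roughly as
follows. This is an illustration in Python 3, not the tool used in the thread;
the function name tree_content_digest is hypothetical. Unlike the find
pipeline, the sketch visits files in sorted order, so the digest does not
depend on the order in which the filesystem returns directory entries.

```python
import hashlib
import os


def tree_content_digest(root: str) -> str:
    """MD5 over the concatenated contents of all regular files below `root`.

    Only file contents are hashed, so timestamps, owners and modes do not
    influence the result (hypothetical helper).
    """
    digest = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the order in which subdirectories are visited
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Skip symlinks, devices and other non-regular entries.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as fh:
                # Read in 1 MiB chunks so large files need not fit in memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
    return digest.hexdigest()
```

Running this twice on the same tree (for example before and after dropping
caches, which is an assumption about how one might use it here) and comparing
the two hex strings shows whether any file content was read back differently.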
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-24 21:26 ` Stephan von Krawczynski 2003-06-24 22:03 ` Willy Tarreau @ 2003-06-25 2:22 ` Valdis.Kletnieks 1 sibling, 0 replies; 110+ messages in thread From: Valdis.Kletnieks @ 2003-06-25 2:22 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1603 bytes --]

On Tue, 24 Jun 2003 23:26:09 +0200, Stephan von Krawczynski said:

> sorry, you probably misunderstood my flaky explanation. What I meant was not a
> cached block from the _tape_ (obviously indeed a char-type device) but from the
> 3ware disk (i.e. the other side of the verification). Consider the tape
> completely working, but the disk data corrupt (possibly not from real reading
> but from corrupted cache).

Don't rule out odder explanations either. True story follows.. ;)

I once had the misfortune of being the admin for a Gould PN/9080. UTX/32 1.2
came out, and since it changed the inode format on disk, it's
dump/mkfs/restore time. So I take the last 3 full backups, and do 2 more
complete dumps besides. I checked, and *NO* I/O errors had been reported (and
then I checked THAT by giving it a known bad tape and seeing errors WERE
reported).

Do the upgrade... and *every single* tape was 'not in dump/restore format'.

Finally traced it down (this was the days when oscilloscopes were still
useful) to a bad 7400 series chip on the tape controller. The backplane was a
32-bit bus, the tape was an 8-bit device - so there was a 4-to-1 mux that had
a bad chip. Bit 3 would be correct for 4 bits, inverted for 4 bits, correct
for 4, etc.. Tape drive *NEVER* complained, because what came over the *cable*
was correct, parity and all..

Oh, and I got the data back something like this:

  cat > mangle.c
  #include <unistd.h>

  int main(void)
  {
          int muck[2];

          /* Process the stream 8 bytes (two 32-bit words) at a time. */
          while (read(0, muck, 8) == 8) {
                  /* XOR the second word with the mask to undo the inversion. */
                  muck[1] ^= 0x20202020;
                  write(1, muck, 8);
          }
          return 0;
  }
  ^D
  cc -o mangle mangle.c
  dd if=/dev/rmt0 bs=32k | ./mangle | restore -f -

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-20 22:03 ` Willy Tarreau 2003-06-20 23:48 ` Stephan von Krawczynski @ 2003-06-24 18:31 ` Bill Davidsen 1 sibling, 0 replies; 110+ messages in thread From: Bill Davidsen @ 2003-06-24 18:31 UTC (permalink / raw) To: Willy Tarreau; +Cc: Marcelo Tosatti, Linux Kernel Mailing List

On Sat, 21 Jun 2003, Willy Tarreau wrote:

> On Fri, Jun 20, 2003 at 06:13:53PM -0300, Marcelo Tosatti wrote:
> > > Actually, without another copy of the data on a different system to
> > > verify it with, you can't know that for sure. It could easily be getting
> > > to the tape (the actual media) just fine, but then get corrupted during
> > > the verify readback.
> >
> > Right. Stephan, if you could use a bit of your time to isolate the problem
> > I would be VERY grateful.
>
> I remember Stephan once said that he used tar to verify the tape, and that for
> one backup, he did several tests showing corruption on different files. Although
> that doesn't mean that the tape is written totally correctly, it proves that
> there's at least a read corruption.
>
> I think that comparing multiple reads to find a pattern in corruption offsets
> (if any) is the only thing he could do (not speaking about mixing read/writes
> with good/bad kernels). Of course, storing several times 70GB on disk is not
> easy, but at least a 16-bit checksum for each 1kB block would result in about
> 140 MB files, which will be "easier" to compare. It could be enough to check
> for empty blocks, duplicated blocks or totally random ones.

Actually, to find problems like this, a change to cpio would be useful:

  find /home | cpio -oB -Hcrc >/dev/st0

as an example. When reading back you will see errors from the CRC on each
file. I use cpio for this reason in some cases where knowing it's right is
critical.

--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
^ permalink raw reply [flat|nested] 110+ messages in thread
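The per-block checksum idea quoted above (one 16-bit value per 1 kB block, so
roughly 70 million blocks times 2 bytes, or about 140 MB, for 70 GB of data)
could look something like the sketch below. It is written in Python 3 as an
illustration only. The names block_checksums and differing_blocks are
hypothetical, and the checksum is a plain byte sum truncated to 16 bits, which
is an assumption; a CRC would catch more kinds of damage.

```python
from typing import BinaryIO, Iterator, List

BLOCK_SIZE = 1024  # 1 kB blocks, as proposed in the quoted message


def block_checksums(stream: BinaryIO, block_size: int = BLOCK_SIZE) -> Iterator[int]:
    """Yield one 16-bit checksum per block read from `stream`."""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        # Simple byte sum modulo 2**16; an all-zero block yields 0.
        yield sum(block) & 0xFFFF


def differing_blocks(first: List[int], second: List[int]) -> List[int]:
    """Return the indices of blocks whose checksums differ between two reads."""
    common = min(len(first), len(second))
    diffs = [i for i in range(common) if first[i] != second[i]]
    # Blocks present in only one of the two reads also count as differences.
    diffs.extend(range(common, max(len(first), len(second))))
    return diffs
```

Multiplying a reported block index by the block size gives the byte offset of
the damage, which is what one would compare across several verify runs to look
for a pattern.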
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-11 20:23 ` Stephan von Krawczynski 2003-06-11 21:01 ` John Stoffel @ 2003-06-12 13:54 ` Stephan von Krawczynski 1 sibling, 0 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-12 13:54 UTC (permalink / raw) To: linux-kernel; +Cc: gibbs, willy, marcelo, green

On Wed, 11 Jun 2003 22:23:46 +0200
Stephan von Krawczynski <skraw@ithnet.com> wrote:

> Hello,
> [...]
> Anyway it looks like failures have gotten fewer since rc8. I will try an
> overnight stress test now to see if I get it freezing again.

Interestingly it does not freeze. One file shows data corruption, but the
system looks stable. None of the older rc's made it up to this point.
Looks like something in rc8 got better and I am in fact experiencing a set of
bugs and not only one.

Regards,
Stephan

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-09 15:10 ` Stephan von Krawczynski 2003-06-09 15:32 ` Justin T. Gibbs @ 2003-06-10 1:38 ` Zwane Mwaikambo 2003-06-10 10:30 ` Stephan von Krawczynski 1 sibling, 1 reply; 110+ messages in thread From: Zwane Mwaikambo @ 2003-06-10 1:38 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Mon, 9 Jun 2003, Stephan von Krawczynski wrote:

> During the whole testing with SMP I noticed that the tar-verify always
> brought up "content differs" warnings. Which basically means that the filesize
> is ok but the content is not. As there might be various causes for this (bad
> tape, bad drive, bad cabling) I did not give very much about it. But it turns
> out there are no more such warnings when using an UP kernel (on the same box
> with the complete same hardware including tapes).
>
> From this experience I would conclude the following (for my personal test
> case):

Can you also try this with 2.5?

> 1) aic-driver has problems with smp/up switching (meaning crashes when trying
> an SMP build with nosmp). This is completely reproducible.

Can you also try an SMP kernel with noapic?

> 2) aic-driver (almost no matter what version) has problems with SMP setup and
> tape drives. Obviously data integrity is not given. This is completely
> reproducible in my test setup.

I have had problems with symmetric interrupt handling but can normally get
it working with noapic. And no, it doesn't appear to be an interrupt
routing problem on my box (if it is, someone please clearly state to me what
the exact problem is).

	Zwane
--
function.linuxpower.ca

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-10 1:38 ` Zwane Mwaikambo @ 2003-06-10 10:30 ` Stephan von Krawczynski 2003-06-10 12:51 ` Zwane Mwaikambo 0 siblings, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-10 10:30 UTC (permalink / raw) To: Zwane Mwaikambo; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Mon, 9 Jun 2003 21:38:16 -0400 (EDT)
Zwane Mwaikambo <zwane@linuxpower.ca> wrote:

> On Mon, 9 Jun 2003, Stephan von Krawczynski wrote:
>
> > During the whole testing with SMP I noticed that the tar-verify always
> > brought up "content differs" warnings. Which basically means that the
> > filesize is ok but the content is not. As there might be various causes for
> > this (bad tape, bad drive, bad cabling) I did not give very much about it.
> > But it turns out there are no more such warnings when using an UP kernel
> > (on the same box with the complete same hardware including tapes).
> >
> > From this experience I would conclude the following (for my personal test
> > case):
>
> Can you also try this with 2.5?

Uh, do I trust Linus ? ;-) Well, probably I am going to take a look. The whole
story eats a lot of time as I have to deal with GBs of data for every single
test.

> > 1) aic-driver has problems with smp/up switching (meaning crashes when
> > trying an SMP build with nosmp). This is completely reproducible.
>
> Can you also try an SMP kernel with noapic?

Can you clarify? Do you mean options "nosmp noapic" or just "noapic" on SMP
kernel?

> > 2) aic-driver (almost no matter what version) has problems with SMP setup
> > and tape drives. Obviously data integrity is not given. This is completely
> > reproducible in my test setup.
>
> I have had problems with symmetric interrupt handling but can normally get
> it working with noapic. And no it doesn't appear to be an interrupt
> routing problem on my box (If it is someone please clearly state what the
> exact problem is to me)

Hm, my question is: if it were exclusively an apic problem, why do other
controllers (in a filesystem environment) work flawlessly? Maybe the driver
and apic simply have differing opinions in certain race cases, but that does
not mean that apic is always to blame, does it?

Regards,
Stephan

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-10 10:30 ` Stephan von Krawczynski @ 2003-06-10 12:51 ` Zwane Mwaikambo 2003-06-10 13:38 ` Stephan von Krawczynski 0 siblings, 1 reply; 110+ messages in thread From: Zwane Mwaikambo @ 2003-06-10 12:51 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003, Stephan von Krawczynski wrote:

> Uh, do I trust Linus ? ;-) Well, probably I am going to take a look. The whole
> story eats a lot of time as I have to deal with GBs of data for every single
> test.

Cool, i'll wait on that then.

> Can you clarify? Do you mean options "nosmp noapic" or just "noapic" on SMP
> kernel?

Kernel built with CONFIG_SMP and booted with 'noapic' kernel parameter

> Hm, my question is: if it were exclusively an apic problem, why do other
> controllers (in a filesystem environment) work flawlessly. Maybe the driver and
> apic simply have differing opinions in certain race cases, but that does not
> mean that apic is always to blame, does it?

I'm a bit wary of blaming the interrupt routing setup, as i have also
noted that other devices work fine. But we have to be objective and try
and isolate things first. You seem to have a good head start on that.

	Zwane
--
function.linuxpower.ca

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-10 12:51 ` Zwane Mwaikambo @ 2003-06-10 13:38 ` Stephan von Krawczynski 2003-06-10 13:51 ` Zwane Mwaikambo 0 siblings, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-10 13:38 UTC (permalink / raw) To: Zwane Mwaikambo; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003 08:51:35 -0400 (EDT)
Zwane Mwaikambo <zwane@linuxpower.ca> wrote:

> > Can you clarify? Do you mean options "nosmp noapic" or just "noapic" on SMP
> > kernel?
>
> Kernel built with CONFIG_SMP and booted with 'noapic' kernel parameter

Ok. To speed up the tests I call it "ok" if there are no verify errors within
70 GB and "fail" if there are one or more.
I have tried rc7+aic20030603 SMP with noapic and it is ok.

/proc/interrupts:

           CPU0       CPU1
  0:    1061143          0          XT-PIC  timer
  1:       6582          0          XT-PIC  keyboard
  2:          0          0          XT-PIC  cascade
  5:       1229          0          XT-PIC  EMU10K1
  9:    9269694          0          XT-PIC  aic7xxx, aic7xxx, 3ware Storage Controller, fcpcipnp, eth0, eth1, eth2
 12:     129555          0          XT-PIC  PS/2 Mouse
 15:          4          0          XT-PIC  ide1
NMI:          0          0
LOC:    1061054    1061028
ERR:          1
MIS:          0

Reading around the whole interrupt stuff I came across a very simple idea which
I am going to test right now. See you in some hours ;-)

Regards,
Stephan

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-10 13:38 ` Stephan von Krawczynski @ 2003-06-10 13:51 ` Zwane Mwaikambo 2003-06-10 15:55 ` Stephan von Krawczynski 2003-06-10 17:44 ` Stephan von Krawczynski 0 siblings, 2 replies; 110+ messages in thread From: Zwane Mwaikambo @ 2003-06-10 13:51 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003, Stephan von Krawczynski wrote:

> On Tue, 10 Jun 2003 08:51:35 -0400 (EDT)
> Zwane Mwaikambo <zwane@linuxpower.ca> wrote:
>
> > > Can you clarify? Do you mean options "nosmp noapic" or just "noapic" on SMP
> > > kernel?
> >
> > Kernel built with CONFIG_SMP and booted with 'noapic' kernel parameter
>
> Ok. To speed up the tests I call it "ok" if there are no verify errors within
> 70 GB and "fail" if there are one or more.
> I have tried rc7+aic20030603 SMP with noapic and it is ok.

Can you also test it with an SMP kernel and only maxcpus=1 ?

> Reading around the whole interrupt stuff I came across a very simple idea which
> I am going to test right now. See you in some hours ;-)

Cool

	Zwane
--
function.linuxpower.ca

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-10 13:51 ` Zwane Mwaikambo @ 2003-06-10 15:55 ` Stephan von Krawczynski 2003-06-10 16:23 ` Oleg Drokin 2003-06-10 17:44 ` Stephan von Krawczynski 1 sibling, 1 reply; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-10 15:55 UTC (permalink / raw) To: Zwane Mwaikambo; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003 09:51:34 -0400 (EDT)
Zwane Mwaikambo <zwane@linuxpower.ca> wrote:

> > Reading around the whole interrupt stuff I came across a very simple idea which
> > I am going to test right now. See you in some hours ;-)
>
> Cool

Hoho, how about this one:

ksymoops 2.4.8 on i686 2.4.21-rc7-aic. Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.21-rc7-aic/ (default)
     -m /boot/System.map-2.4.21-rc7-aic (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Jun 10 17:50:53 admin kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000b2c
Jun 10 17:50:53 admin kernel: c0221c37
Jun 10 17:50:53 admin kernel: *pde = 00000000
Jun 10 17:50:53 admin kernel: Oops: 0000
Jun 10 17:50:53 admin kernel: CPU: 0
Jun 10 17:50:53 admin kernel: EIP: 0010:[st_do_scsi+295/384] Not tainted
Jun 10 17:50:53 admin kernel: EIP: 0010:[<c0221c37>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Jun 10 17:50:53 admin kernel: EFLAGS: 00010246
Jun 10 17:50:53 admin kernel: eax: 00000000 ebx: 00000001 ecx: 00000000 edx: c34a0424
Jun 10 17:50:53 admin kernel: esi: f5f2c180 edi: 00000b00 ebp: 00008090 esp: dead5edc
Jun 10 17:50:53 admin kernel: ds: 0018 es: 0018 ss: 0018
Jun 10 17:50:53 admin kernel: Process tar (pid: 4004, stackpage=dead5000)
Jun 10 17:50:53 admin kernel: Stack: f5f2c180 00000000 c0090000 00008000 c0221a10 00015f90 00000000 dead5f7c
Jun 10 17:50:53 admin kernel: c34a0400 00000001 00008000 c0223abd 00000000 c34a0400 dead5f40 00008000
Jun 10 17:50:53 admin kernel: 00000002 00015f90 00000000 00000001 00000000 00000000 c34a04c0 c34a0450
Jun 10 17:50:53 admin kernel: Call Trace: [st_sleep_done+0/256] [read_tape+269/1024] [scsi_finish_command+152/208] [st_read+1015/1152] [sys_read+155/384]
Jun 10 17:50:53 admin kernel: Call Trace: [<c0221a10>] [<c0223abd>] [<c01ede38>] [<c02241a7>] [<c0141c0b>]
Jun 10 17:50:53 admin kernel: [<c010782f>]
Jun 10 17:50:53 admin kernel: Code: 8b 5f 2c 89 74 24 04 89 3c 24 e8 ea fb ff ff 89 43 1c eb a5

>>EIP; c0221c37 <st_do_scsi+127/180>   <=====

>>edx; c34a0424 <_end+310e0e4/38547d20>
>>esi; f5f2c180 <_end+35b99e40/38547d20>
>>esp; dead5edc <_end+1e743b9c/38547d20>

Trace; c0221a10 <st_sleep_done+0/100>
Trace; c0223abd <read_tape+10d/400>
Trace; c01ede38 <scsi_finish_command+98/d0>
Trace; c02241a7 <st_read+3f7/480>
Trace; c0141c0b <sys_read+9b/180>
Trace; c010782f <system_call+33/38>

Code;  c0221c37 <st_do_scsi+127/180>
00000000 <_EIP>:
Code;  c0221c37 <st_do_scsi+127/180>   <=====
   0:   8b 5f 2c                  mov    0x2c(%edi),%ebx   <=====
Code;  c0221c3a <st_do_scsi+12a/180>
   3:   89 74 24 04               mov    %esi,0x4(%esp,1)
Code;  c0221c3e <st_do_scsi+12e/180>
   7:   89 3c 24                  mov    %edi,(%esp,1)
Code;  c0221c41 <st_do_scsi+131/180>
   a:   e8 ea fb ff ff            call   fffffbf9 <_EIP+0xfffffbf9>
Code;  c0221c46 <st_do_scsi+136/180>
   f:   89 43 1c                  mov    %eax,0x1c(%ebx)
Code;  c0221c49 <st_do_scsi+139/180>
  12:   eb a5                     jmp    ffffffb9 <_EIP+0xffffffb9>

1 warning issued. Results may not be reliable.

Anybody able to comment on that?

Regards,
Stephan

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-10 15:55 ` Stephan von Krawczynski @ 2003-06-10 16:23 ` Oleg Drokin 0 siblings, 0 replies; 110+ messages in thread From: Oleg Drokin @ 2003-06-10 16:23 UTC (permalink / raw) To: Stephan von Krawczynski Cc: Zwane Mwaikambo, linux-kernel, willy, gibbs, marcelo

Hello!

On Tue, Jun 10, 2003 at 05:55:06PM +0200, Stephan von Krawczynski wrote:

> Jun 10 17:50:53 admin kernel: Process tar (pid: 4004, stackpage=dead5000)

Hehe, with this kind of stackpage, this process was doomed just after the
fork() ;)

> >>EIP; c0221c37 <st_do_scsi+127/180> <=====

It seems that in st_do_scsi, in this line

  (STp->buffer)->syscall_result = st_chk_result(STp, SRpnt);

STp is garbage for some reason, though it was valid before.

Bye,
    Oleg

^ permalink raw reply [flat|nested] 110+ messages in thread
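The register dump in the oops is consistent with this reading. As a small
worked check, written in Python purely for the arithmetic: the faulting
instruction is mov 0x2c(%edi),%ebx, edi holds 00000b00, and 0xb00 + 0x2c is
exactly the reported fault address 00000b2c. That edi carries STp at this
point is an inference from the decoded code (edi is stored as the first stack
argument just before the call), not something stated in the thread. The
function name effective_address is hypothetical.

```python
def effective_address(base: int, displacement: int) -> int:
    """Address accessed by an x86 operand of the form disp(%reg), wrapped to 32 bits."""
    return (base + displacement) & 0xFFFFFFFF


EDI = 0x00000B00            # edi from the register dump in the oops
DISPLACEMENT = 0x2C         # from "mov 0x2c(%edi),%ebx"
FAULT_ADDRESS = 0x00000B2C  # "virtual address 00000b2c" in the first oops line

# The access through edi lands exactly on the faulting address.
assert effective_address(EDI, DISPLACEMENT) == FAULT_ADDRESS
```

A base value as small as 0xb00 cannot be a valid kernel pointer on this
system, which fits the description of the pointer as garbage.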
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-10 13:51 ` Zwane Mwaikambo 2003-06-10 15:55 ` Stephan von Krawczynski @ 2003-06-10 17:44 ` Stephan von Krawczynski 2003-06-10 18:15 ` Zwane Mwaikambo 2003-06-10 18:20 ` Zwane Mwaikambo 1 sibling, 2 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-10 17:44 UTC (permalink / raw) To: Zwane Mwaikambo; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003 09:51:34 -0400 (EDT)
Zwane Mwaikambo <zwane@linuxpower.ca> wrote:

> > Reading around the whole interrupt stuff I came across a very simple idea
> > which I am going to test right now. See you in some hours ;-)

I have now tried rc7+aic20030603 SMP with apic, _but_ with the interrupts from
the aic bound to a single CPU. I did this with the help of irqbalance from
Arjan.

/proc/interrupts:

           CPU0       CPU1
  0:       5148     571297    IO-APIC-edge  timer
  1:       9733         97    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
 12:      43720       1271    IO-APIC-edge  PS/2 Mouse
 15:          4          4    IO-APIC-edge  ide1
 17:       1297    1336383   IO-APIC-level  3ware Storage Controller
 18:        344      16447   IO-APIC-level  eth0, eth1
 20:        570          3   IO-APIC-level  fcpcipnp
 21:      57292        340   IO-APIC-level  eth2
 22:     443161       2776   IO-APIC-level  aic7xxx
 23:         31    2005037   IO-APIC-level  aic7xxx
 26:          0          0   IO-APIC-level  EMU10K1
NMI:     593524     582633
LOC:     576356     576330
ERR:          0
MIS:          0

The controller used is the second aic7xxx. The 31 interrupts on CPU0
occurred before the test. This setup fails during verify (data corruption).

I would say that the interrupt code of the aic in itself is therefore ok with
SMP. If it were an SMP race condition inside the interrupt routine, this test
should have been ok (as only one CPU is used).

Regards,
Stephan

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-10 17:44 ` Stephan von Krawczynski @ 2003-06-10 18:15 ` Zwane Mwaikambo 2003-06-10 23:55 ` Stephan von Krawczynski 2003-06-10 18:20 ` Zwane Mwaikambo 1 sibling, 1 reply; 110+ messages in thread From: Zwane Mwaikambo @ 2003-06-10 18:15 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003, Stephan von Krawczynski wrote:

> The controller used is the second aic7xxx. The 31 interrupts on CPU0 have
> occurred before the test. This setup fails during verify (data corruption).
>
> I would say that the interrupt code of the aic in itself is therefore ok with
> SMP. If it were a SMP race condition inside the interrupt routine this test
> should have been ok (as only one CPU is used).

Thanks for verifying this, at least i know the problem isn't with
interrupt routing in your specific case.

	Zwane
--
function.linuxpower.ca

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-10 18:15 ` Zwane Mwaikambo @ 2003-06-10 23:55 ` Stephan von Krawczynski 0 siblings, 0 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-10 23:55 UTC (permalink / raw) To: Zwane Mwaikambo; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003 14:15:58 -0400 (EDT) Zwane Mwaikambo <zwane@linuxpower.ca> wrote:

> On Tue, 10 Jun 2003, Stephan von Krawczynski wrote:
>
> > The controller used is the second aic7xxx. The 31 interrupts on CPU0 have
> > occurred before the test. This setup fails during verify (data corruption).
> >
> > I would say that the interrupt code of the aic in itself is therefore ok
> > with SMP. If it were an SMP race condition inside the interrupt routine this
> > test should have been ok (as only one CPU is used).
>
> Thanks for verifying this, at least I know the problem isn't with
> interrupt routing in your specific case.
>
> Zwane

I guess your comment is a bit ahead of my tests. I just completed the test with rc7+aic20030603 SMP, apic and maxcpus=1. It fails. This means that although only one CPU is used throughout the whole kernel, the data corruption still occurs. I would therefore conclude that the corruption is only possible if the standard code path is in fact flaky in terms of data completeness per request: something like a broken synchronous action, or a read request coming back as completed although it is in fact still running. It may also be a misinterpretation of some kind of "action completed" interrupt, or one interrupt for multiple running actions with a mixup of the various causes.

To make sure it is not a problem in the SMP code path through the driver, I have to check a UP kernel with apic support enabled. I will do this tomorrow. If this is ok, then things are simple, because it's nailed down to the SMP code path without a concurrency cause. Let's see ...

Regards,
Stephan

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-10 17:44 ` Stephan von Krawczynski 2003-06-10 18:15 ` Zwane Mwaikambo @ 2003-06-10 18:20 ` Zwane Mwaikambo 1 sibling, 0 replies; 110+ messages in thread From: Zwane Mwaikambo @ 2003-06-10 18:20 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: linux-kernel, willy, gibbs, marcelo, green

On Tue, 10 Jun 2003, Stephan von Krawczynski wrote:

> occurred before the test. This setup fails during verify (data corruption).

Can you reproduce this with disks only?

Zwane
--
function.linuxpower.ca

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-23 10:38 ` Stephan von Krawczynski 2003-05-23 12:58 ` Justin T. Gibbs @ 2003-05-23 18:30 ` Marcelo Tosatti 2003-05-23 19:25 ` Stephan von Krawczynski 1 sibling, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2003-05-23 18:30 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: willy, gibbs, linux-kernel

On Fri, 23 May 2003, Stephan von Krawczynski wrote:

> On Mon, 12 May 2003 11:02:18 +0200
> Stephan von Krawczynski <skraw@ithnet.com> wrote:
>
> > On Fri, 9 May 2003 16:57:38 +0200
> > Willy Tarreau <willy@w.ods.org> wrote:
> >
> > > On Fri, May 09, 2003 at 04:11:06PM +0200, Stephan von Krawczynski wrote:
> > > > On Fri, 9 May 2003 15:27:57 +0200
> > > > Willy Tarreau <willy@w.ods.org> wrote:
> > > >
> > > > > Well, would you at least agree to retest current version from the above
> > > > > URL ? I find it a bit of a shame that the driver goes back in -rc
> > > > > stage.
> > > >
> > > > Ok, I can tell you at least this: it boots. Just did it. I can tell
> > > > tomorrow how it behaves with my specific problem.
> > >
> > > Thanks for having tried ;-)
> >
> > Hello all,
> >
> > I have tried 2.4.21-rc2 with aic79xx-linux-2.4-20030502-tar.gz for three days
> > now and have to say it performs well. I had no freezes any more and nothing
> > weird happening. Everything is smooth and ok. This is the best performance I
> > have seen comparing all 2.4.21-X versions tested.
> >
> > Thanks a lot.
> >
> > I will proceed with further stress tests...
>
> Ok. I managed to crash the tested machine after 14 days now. The crash itself
> is exactly like former 2.4.21-X. It just freezes, no oops no nothing. It looks
> like things got better, but not solved.
>

What about rc3?

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes 2003-05-23 18:30 ` Undo aic7xxx changes Marcelo Tosatti @ 2003-05-23 19:25 ` Stephan von Krawczynski 0 siblings, 0 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-05-23 19:25 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: willy, gibbs, linux-kernel

On Fri, 23 May 2003 15:30:33 -0300 (BRT) Marcelo Tosatti <marcelo@conectiva.com.br> wrote:

> What about rc3?

I will inform you if anything bad happens :-)
rc3+aic20030520 tests started today.

Regards,
Stephan

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) [not found] ` <20030618111010$154f@gated-at.bofh.it> @ 2003-06-18 12:46 ` Pascal Schmidt 2003-06-18 12:49 ` Stephan von Krawczynski 0 siblings, 1 reply; 110+ messages in thread From: Pascal Schmidt @ 2003-06-18 12:46 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: linux-kernel

Stephan von Krawczynski wrote in linux-kernel:

> around 70-100 GB of data is transferred to a nfs-server with rc8 onto a RAID5
> on 3ware-controller.
> The data is then copied via tar onto a SDLT drive connected to an aic
> controller.
> Afterwards the data is verified by tar.

Have you tried with a different SCSI controller to rule out bugs in st.c?

--
Ciao, Pascal

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Undo aic7xxx changes (now rc7+aic20030603) 2003-06-18 12:46 ` Undo aic7xxx changes (now rc7+aic20030603) Pascal Schmidt @ 2003-06-18 12:49 ` Stephan von Krawczynski 0 siblings, 0 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2003-06-18 12:49 UTC (permalink / raw) To: Pascal Schmidt; +Cc: linux-kernel

On Wed, 18 Jun 2003 14:46:02 +0200 Pascal Schmidt <der.eremit@email.de> wrote:

> Stephan von Krawczynski wrote in linux-kernel:
>
> > around 70-100 GB of data is transferred to a nfs-server with rc8 onto a RAID5
> > on 3ware-controller.
> > The data is then copied via tar onto a SDLT drive connected to an aic
> > controller.
> > Afterwards the data is verified by tar.
>
> Have you tried with a different SCSI controller to rule out bugs in st.c?

Replacement part is not yet shipped.

Regards,
Stephan

^ permalink raw reply [flat|nested] 110+ messages in thread
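The write-then-verify procedure quoted above can be illustrated independently of tar's own compare mode. The thread does not say which tar options were used, so what follows is only a sketch of the idea, under the assumption that the archive members were stored with paths relative to a source directory. The names `sha256_stream` and `verify_archive` are hypothetical. For a non-seekable source such as a tape device, the stream mode `"r|"` of Python's `tarfile` module would be the one to pass.

```python
import hashlib
import os
import tarfile


def sha256_stream(fileobj, chunk_size=1 << 20):
    """Hash a binary file object in chunks so large files are never held in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_archive(archive_path, source_root, mode="r"):
    """Return the names of regular-file members that differ from, or are missing under, source_root.

    Assumption: member names are relative to source_root.
    Use mode="r|" for non-seekable sources; members are read strictly in order either way.
    """
    mismatches = []
    with tarfile.open(archive_path, mode) as archive:
        for member in archive:
            if not member.isfile():
                continue  # directories, links and devices carry no data to compare
            source_path = os.path.join(source_root, member.name)
            if not os.path.isfile(source_path):
                mismatches.append(member.name)
                continue
            with open(source_path, "rb") as source:
                expected = sha256_stream(source)
            actual = sha256_stream(archive.extractfile(member))
            if actual != expected:
                mismatches.append(member.name)
    return mismatches
```

An empty result means every regular file read back from the archive matched its source. A non-empty list only says that corruption happened somewhere between the write and the read-back, which is the ambiguity the thread is trying to narrow down by swapping controllers.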