linux-kernel.vger.kernel.org archive mirror
* aic7xxx errors
@ 2001-09-05  6:21 Joseph Mathewson
  2001-09-05  7:58 ` Olaf Zaplinski
  2001-09-05 20:23 ` aic7xxx errors Justin T. Gibbs
  0 siblings, 2 replies; 28+ messages in thread
From: Joseph Mathewson @ 2001-09-05  6:21 UTC (permalink / raw)
  To: linux-kernel

I've just woken up this morning to find my internet gateway machine only
responding to pings, and on giving it a keyboard & monitor, a load of

scsi0:0:1:0: Attempting to queue an ABORT message
scsi0:0:1:0: Cmd aborted from QINFIFO
aic7xxx_abort returns 8194

errors.

Is this a problem with the hard drive on ID 1 or a driver issue?  It's now
working fine after a restart (eventually it seems to have given up on ID 1
completely and it restarted cleanly [it boots off ID 0]).
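
A side note on the last log line, as a sketch rather than a diagnosis: the
number 8194 is 0x2002 in hex. Assuming the error-handler return codes from
the 2.4-era SCSI header (the table below is written from memory and should
be checked against your kernel source), that value is the midlayer's
SUCCESS code, so the line would mean the abort itself was accepted.

```python
# Error-handler return codes as assumed from the 2.4-era SCSI midlayer
# header. Treat these values as an assumption and verify them locally.
SCSI_EH_CODES = {
    0x2001: "NEEDS_RETRY",
    0x2002: "SUCCESS",
    0x2003: "FAILED",
    0x2004: "QUEUED",
    0x2005: "SOFT_ERROR",
    0x2006: "ADD_TO_MLQUEUE",
}


def describe_abort_return(value):
    """Map the decimal value printed in the log to its symbolic name."""
    name = SCSI_EH_CODES.get(value, "unknown")
    return f"{value} = {value:#06x} ({name})"


print(describe_abort_return(8194))
```

A successful abort says nothing about why the command timed out in the
first place, so the drive-or-driver question stays open.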

I'm using kernel 2.4.7, the card is an Adaptec 2940UW (aic7xxx), the drive
on ID 1 a Seagate Barracuda 18LP.

Joe.

+-------------------------------------------------+
| Joseph Mathewson <joe@mathewson.co.uk>          |
+-------------------------------------------------+


* Re: aic7xxx errors
  2001-09-05  6:21 aic7xxx errors Joseph Mathewson
@ 2001-09-05  7:58 ` Olaf Zaplinski
  2001-09-05  9:04   ` Frank Schneider
  2001-09-07 20:32   ` AIC + RAID1 error? (was: Re: aic7xxx errors) Olaf Zaplinski
  2001-09-05 20:23 ` aic7xxx errors Justin T. Gibbs
  1 sibling, 2 replies; 28+ messages in thread
From: Olaf Zaplinski @ 2001-09-05  7:58 UTC (permalink / raw)
  To: joe.mathewson; +Cc: linux-kernel

Joseph Mathewson wrote:
> 
> I've just woken up this morning to find my internet gateway machine only
> responding to pings, and on giving it a keyboard & monitor, a load of
> 
> scsi0:0:1:0: Attempting to queue an ABORT message
> scsi0:0:1:0: Cmd aborted from QINFIFO
> aic7xxx_abort returns 8194
> 
> errors.
[...]

/me too. I had this while booting 2.4.9 with a freshly installed SCSI card
(AHA2940) + harddisk. What worked for me was to compile the kernel with the
old Adaptec driver, so it looks like a driver issue.

Olaf


* Re: aic7xxx errors
  2001-09-05  7:58 ` Olaf Zaplinski
@ 2001-09-05  9:04   ` Frank Schneider
  2001-09-05 10:27     ` Antonio Miguel Trindade
  2001-09-07 20:32   ` AIC + RAID1 error? (was: Re: aic7xxx errors) Olaf Zaplinski
  1 sibling, 1 reply; 28+ messages in thread
From: Frank Schneider @ 2001-09-05  9:04 UTC (permalink / raw)
  To: Olaf Zaplinski; +Cc: joe.mathewson, linux-kernel

Olaf Zaplinski wrote:
> 
> Joseph Mathewson wrote:
> >
> > I've just woken up this morning to find my internet gateway machine only
> > responding to pings, and on giving it a keyboard & monitor, a load of
> >
> > scsi0:0:1:0: Attempting to queue an ABORT message
> > scsi0:0:1:0: Cmd aborted from QINFIFO
> > aic7xxx_abort returns 8194
> >
> > errors.
> [...]
> 
> /me too. I had this while booting 2.4.9 with a fresh installed SCSI card
> (AHA2940) + harddisk. What worked for me was to compile the kernel with the
> old Adaptec driver, so it's a driver issue.
> 
> Olaf

Hello...

I had this effect here too (RH7.1, kernel 2.4.3), but I put it down to
wrong termination of the LVD bus. Be careful if you have LVD drives
with a "Termination" jumper (e.g. IBM DGHS18V): this termination is
only usable if you run the drive as single-ended SCSI-UW, *not* if you
use the drive in a true LVD environment!

I learnt this the hard way: I used this "Termination" jumper, the system
booted without problems and ran for about 2 weeks, and then the
above errors occurred, followed by system crashes. After reading the
original IBM docs, and not the OEM reseller crap, the reason was clear.

The second thing I noticed was that the value for "Maximum Number of TCQ
Commands per Device" defaults to 255, but on my system the
driver always complained that it could only use 64 ("locked on
64"). So I decided to switch to 32 and not let it auto-detect the
maximum value. Since then I have had no problems at all.

Solong..
Frank.

--
Frank Schneider, <SPATZ1@T-ONLINE.DE>.                           
Microsoft isn't the answer.
Microsoft is the question, and the answer is NO.
... -.-


* Re: aic7xxx errors
  2001-09-05  9:04   ` Frank Schneider
@ 2001-09-05 10:27     ` Antonio Miguel Trindade
  2001-09-05 10:44       ` Frank Schneider
  0 siblings, 1 reply; 28+ messages in thread
From: Antonio Miguel Trindade @ 2001-09-05 10:27 UTC (permalink / raw)
  To: linux-kernel

On Wednesday 05 September 2001 10:04, Frank Schneider wrote:
> Olaf Zaplinski wrote:
>
> I had this effect too here (RH7.1, Kernel 2.4.3), but i put it on a
> wrong termination of the LVD Bus...be careful if you have LVD-Drives
> with a "Termination"-Jumper...(e.g. IBM DGHS18V)...this Termination is
> only usable if you use the drive as Single Ended SCSI-UW, *not* if you
> use the drive i a true LVD-environment !
>
> I learnt this the hard way, because i used this "Termination"-jumper and
> the system bootet without problems and ran about 2 weeks...then the
> above errors occured, followed by system crashes....after reading the
> original ibm-docs, and not the oem-reseller-crap, the reason was clear.
>

   According to IBM specs, _no LVD drive has terminators built-in_... I have 
several servers with LVD drives (all IBM) and none of them has terminators, 
even in SE mode. You always have to use an external terminator...

>
> Solong..
> Frank.

-- 
A year spent in artificial intelligence
is enough to make one believe in God.

    -------------------------------
     António Miguel F. M. Trindade
        System's Administrator
           D.E.I. F.C.T.U.C.
    -------------------------------


* Re: aic7xxx errors
  2001-09-05 10:27     ` Antonio Miguel Trindade
@ 2001-09-05 10:44       ` Frank Schneider
  2001-09-05 11:21         ` Thorsten Kranzkowski
  0 siblings, 1 reply; 28+ messages in thread
From: Frank Schneider @ 2001-09-05 10:44 UTC (permalink / raw)
  To: Antonio Miguel Trindade; +Cc: linux-kernel

Antonio Miguel Trindade wrote:
> 
> On Wednesday 05 September 2001 10:04, Frank Schneider wrote:
> > Olaf Zaplinski wrote:
> >
> > I had this effect too here (RH7.1, Kernel 2.4.3), but i put it on a
> > wrong termination of the LVD Bus...be careful if you have LVD-Drives
> > with a "Termination"-Jumper...(e.g. IBM DGHS18V)...this Termination is
> > only usable if you use the drive as Single Ended SCSI-UW, *not* if you
> > use the drive i a true LVD-environment !
> >
> > I learnt this the hard way, because i used this "Termination"-jumper and
> > the system bootet without problems and ran about 2 weeks...then the
> > above errors occured, followed by system crashes....after reading the
> > original ibm-docs, and not the oem-reseller-crap, the reason was clear.
> >
> 
>    According to IBM specs, _no LVD drive has terminators built-in_... I have
> several servers with LVD drives (all IBM) and none of them has terminators,
> even in SE mode. You always have to use an external terminator...

That is what I thought too. But if you get a photocopied sheet from your
vendor on which a jumper is named "Termination on", and the sheet also
says you can use it, then you probably think the disk has an
LVD terminator built in. Such a terminator is quite simple,
some resistors, perhaps a small chip, not more, so it would be possible to
integrate it into the drive logic.

But as I said, my DGHS disk has a built-in terminator for use with
UW buses. The bad thing is that if you "terminate" the LVD bus with
it, it seems to work, for some time. I had "/" on it and a part of
my /home RAID5, and it ran for 2 weeks.

Solong..
Frank.

--
Frank Schneider, <SPATZ1@T-ONLINE.DE>.                           
Microsoft isn't the answer.
Microsoft is the question, and the answer is NO.
... -.-


* Re: aic7xxx errors
  2001-09-05 10:44       ` Frank Schneider
@ 2001-09-05 11:21         ` Thorsten Kranzkowski
  2001-09-05 13:05           ` Frank Schneider
  0 siblings, 1 reply; 28+ messages in thread
From: Thorsten Kranzkowski @ 2001-09-05 11:21 UTC (permalink / raw)
  To: Frank Schneider; +Cc: Antonio Miguel Trindade, linux-kernel

On Wed, Sep 05, 2001 at 12:44:24PM +0200, Frank Schneider wrote:
> Antonio Miguel Trindade wrote:
> > On Wednesday 05 September 2001 10:04, Frank Schneider wrote:
> > > Olaf Zaplinski wrote:
> > >
> > > I had this effect too here (RH7.1, Kernel 2.4.3), but i put it on a
> > > wrong termination of the LVD Bus...be careful if you have LVD-Drives
> > > with a "Termination"-Jumper...(e.g. IBM DGHS18V)...this Termination is
> > > only usable if you use the drive as Single Ended SCSI-UW, *not* if you
> > > use the drive i a true LVD-environment !
> > >
> > 
> >    According to IBM specs, _no LVD drive has terminators built-in_... I have

There are definitely some that have this SE-Termination jumper.

> 
> But as said, my DGHS-Disk has a build-in terminator for use with
> UW-buses...the bad thing is, that if you "terminate" the LVD-bus with
> this, it seems to work...for some time...i had "/" on it and a part of
> my /home-RAID5, and it run 2 weeks....

Usually when a single device in a LVD chain is operated in SE mode all LVD
devices also switch to SE mode automatically. The use of a SE terminator
such as the one on your harddisk qualifies for SE operation.

But in SE mode you are tied to the much stricter specifications like length
of cable etc. compared to LVD mode. 

So maybe you just exceeded specifications too much.
 

Bye,
Thorsten

-- 
| Thorsten Kranzkowski        Internet: dl8bcu@dl8bcu.de                      |
| Mobile: ++49 170 1876134       Snail: Niemannsweg 30, 49201 Dissen, Germany |
| Ampr: dl8bcu@db0lj.#rpl.deu.eu, dl8bcu@marvin.dl8bcu.ampr.org [44.130.8.19] |


* Re: aic7xxx errors
  2001-09-05 11:21         ` Thorsten Kranzkowski
@ 2001-09-05 13:05           ` Frank Schneider
  0 siblings, 0 replies; 28+ messages in thread
From: Frank Schneider @ 2001-09-05 13:05 UTC (permalink / raw)
  To: dl8bcu; +Cc: Antonio Miguel Trindade, linux-kernel

Thorsten Kranzkowski wrote:
> 
> On Wed, Sep 05, 2001 at 12:44:24PM +0200, Frank Schneider wrote:
> > Antonio Miguel Trindade wrote:
> > > On Wednesday 05 September 2001 10:04, Frank Schneider wrote:
> > > > Olaf Zaplinski wrote:
> > > >
> > > > I had this effect too here (RH7.1, Kernel 2.4.3), but i put it on a
> > > > wrong termination of the LVD Bus...be careful if you have LVD-Drives
> > > > with a "Termination"-Jumper...(e.g. IBM DGHS18V)...this Termination is
> > > > only usable if you use the drive as Single Ended SCSI-UW, *not* if you
> > > > use the drive i a true LVD-environment !
> > > >
> > >
> > >    According to IBM specs, _no LVD drive has terminators built-in_... I have
> 
> There are definitely some that have this SE-Termination jumper.

Yes...i can send you one if you send me a spare-drive instead...:-))
 
> >
> > But as said, my DGHS-Disk has a build-in terminator for use with
> > UW-buses...the bad thing is, that if you "terminate" the LVD-bus with
> > this, it seems to work...for some time...i had "/" on it and a part of
> > my /home-RAID5, and it run 2 weeks....
> 
> Usually when a single device in a LVD chain is operated in SE mode all LVD
> devices also switch to SE mode automatically. The use of a SE terminator
> such as the one on your harddisk qualifies for SE operation.

That's exactly what I expected, but it did not happen. I tried this
once by setting the "SE" jumper on *all* devices *and* connecting
them to the UW cable (I use an Asus P2B-DS mobo with 3 connectors:
Fast SCSI, UW SCSI, LVD SCSI). There it worked in the described way, but
on the LVD cable not even the SCSI BIOS mentioned the problem at
bootup: all devices were rated "LVD-SCSI", not "SE/FastSCSI", at
bootup, and /proc/scsi/aic7xxx/0 also said something about "80MByte/sec
synchronous speed...".

It seems that in this particular case you don't get any hint where the
problem lies, neither from the BIOS nor from the driver. I noticed it
when I changed the LVD cable and took a closer look at the disks, and
then at the specs on www.storage.ibm.com.

> But in SE mode you are tied to the much stricter specifications like length
> of cable etc. compared to LVD mode.

That's clear: the maximum cable length is 1.50 m (if more than 4 devices
are connected) in total, including the Fast SCSI cable or external cables,
if used.

> So maybe you just exceeded specifications too much.

I also did this once (6 devices, 2 m cable length) and it did indeed show
the same problems: randomly appearing crashes on the SCSI bus,
sometimes recoverable, sometimes not, sometimes under heavy disk load,
sometimes without.
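
The cable-length rule discussed here can be written down as a quick check.
This is only a sketch: the 1.5 m limit for more than 4 devices is the figure
quoted in this message, while the 3.0 m limit for 4 or fewer devices is an
assumption for a single-ended Ultra bus, and both function names are made up
for illustration.

```python
# Limits in metres for a single-ended (SE) bus.
SE_LIMIT_MANY_DEVICES_M = 1.5   # more than 4 devices, as quoted above
SE_LIMIT_FEW_DEVICES_M = 3.0    # 4 or fewer devices (assumption)


def se_cable_limit(devices):
    """Return the assumed maximum total cable length for an SE bus."""
    if devices > 4:
        return SE_LIMIT_MANY_DEVICES_M
    return SE_LIMIT_FEW_DEVICES_M


def within_se_limit(devices, total_cable_m):
    """True if the total cable length (all segments) is within the limit."""
    return total_cable_m <= se_cable_limit(devices)


# The setup described above: 6 devices on 2 m of cable.
print(within_se_limit(6, 2.0))
```

The total has to include every segment on the bus, internal and external,
which is why a bus that looks short enough can still be out of spec.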

Solong..
Frank.

--
Frank Schneider, <SPATZ1@T-ONLINE.DE>.                           
Microsoft isn't the answer.
Microsoft is the question, and the answer is NO.
... -.-


* Re: aic7xxx errors
  2001-09-05  6:21 aic7xxx errors Joseph Mathewson
  2001-09-05  7:58 ` Olaf Zaplinski
@ 2001-09-05 20:23 ` Justin T. Gibbs
  1 sibling, 0 replies; 28+ messages in thread
From: Justin T. Gibbs @ 2001-09-05 20:23 UTC (permalink / raw)
  To: joe.mathewson; +Cc: linux-kernel

>I've just woken up this morning to find my internet gateway machine only
>responding to pings, and on giving it a keyboard & monitor, a load of
>
>scsi0:0:1:0: Attempting to queue an ABORT message
>scsi0:0:1:0: Cmd aborted from QINFIFO
>aic7xxx_abort returns 8194
>
>errors.

I would have to see the messages with "aic7xxx=verbose" in order
to better diagnose the problem.  A full dmesg that includes driver
initialization and SCSI device detection would be useful too.
You might also want to upgrade your driver to something newer:

	http://people.FreeBSD.org/~gibbs/linux/

--
Justin


* AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-05  7:58 ` Olaf Zaplinski
  2001-09-05  9:04   ` Frank Schneider
@ 2001-09-07 20:32   ` Olaf Zaplinski
  2001-09-07 22:32     ` Justin T. Gibbs
  2001-09-08 20:25     ` Frank Schneider
  1 sibling, 2 replies; 28+ messages in thread
From: Olaf Zaplinski @ 2001-09-07 20:32 UTC (permalink / raw)
  To: linux-kernel

Olaf Zaplinski wrote:
> 
> Joseph Mathewson wrote:
> >
> > I've just woken up this morning to find my internet gateway machine only
> > responding to pings, and on giving it a keyboard & monitor, a load of
> >
> > scsi0:0:1:0: Attempting to queue an ABORT message
> > scsi0:0:1:0: Cmd aborted from QINFIFO
> > aic7xxx_abort returns 8194
> >
> > errors.
> [...]
> 
> /me too. I had this while booting 2.4.9 with a fresh installed SCSI card
> (AHA2940) + harddisk. What worked for me was to compile the kernel with the
> old Adaptec driver, so it's a driver issue.

Okay, I had it again today:

Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Attempting to queue an ABORT message
Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Cmd aborted from QINFIFO
Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Attempting to queue an ABORT message
Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Command not found
Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Attempting to queue an ABORT message
Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Cmd aborted from QINFIFO
Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Attempting to queue an ABORT message
Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Command not found
Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Attempting to queue an ABORT message
Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Cmd aborted from QINFIFO
Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Attempting to queue an ABORT message
Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Command not found
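
When a log fills up with lines like these, a per-device tally shows quickly
whether one target is affected or several. The sketch below assumes the
prefix reads host:channel:target:lun (consistent with "ID 1" in the first
report, but still an assumption) and uses a few of the lines above as sample
input; the `summarize` helper is hypothetical.

```python
import re
from collections import Counter

# A few of the lines from the log above, each on a single line.
SAMPLE = """\
Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Attempting to queue an ABORT message
Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Cmd aborted from QINFIFO
Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Attempting to queue an ABORT message
Sep  7 19:15:19 binky kernel: scsi0:0:0:0: Command not found
"""

# Assumed prefix layout: scsi<host>:<channel>:<target>:<lun>: <message>
LINE = re.compile(r"scsi(\d+):(\d+):(\d+):(\d+): (.+)$")


def summarize(log_text):
    """Count messages per (host, channel, target, lun, message)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE.search(line)
        if match:
            host, channel, target, lun = (int(g) for g in match.groups()[:4])
            counts[(host, channel, target, lun, match.group(5).strip())] += 1
    return counts


for key, count in sorted(summarize(SAMPLE).items()):
    print(count, key)
```

Lines that the mail client has wrapped need to be rejoined first, otherwise
the message text after the prefix is cut short.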

Kernel was 2.4.9ac9 with (new) AIC driver 6.2.1, compiled with "Maximum
Number of TCQ Commands per Device" set to 64. I was lucky since it's a RAID1
system (mirror disk is hda). Distro is SuSE 7.2 Professional, machine
K6-2/300 with 128 MB EDO RAM, FS is reiser 3.6.25. Average load is low, it's
a small smtp/imap/www system.

So I compiled the same kernel with the old AIC driver, and it works fine.

I should mention that it is a rather old PCI AHA-2940 Fast SCSI card with an
also older harddisk IBM 0662S12 (that's the whole SCSI chain).
My other machine (AIC-something U2W with Tandberg SLR (U2W) and SCSI CDR
(SE) attached, no HDDs) works fine with the new driver. I just guess when
saying that it seems to me that the driver developers were focused on
up-to-date cards but not the older ones.

Olaf


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-07 20:32   ` AIC + RAID1 error? (was: Re: aic7xxx errors) Olaf Zaplinski
@ 2001-09-07 22:32     ` Justin T. Gibbs
  2001-09-07 22:51       ` Frank Schneider
  2001-09-08 20:25     ` Frank Schneider
  1 sibling, 1 reply; 28+ messages in thread
From: Justin T. Gibbs @ 2001-09-07 22:32 UTC (permalink / raw)
  To: Olaf Zaplinski; +Cc: linux-kernel

>Okay, I had it again today:

You need to be running with aic7xxx=verbose for these messages to be
useful.  In the 6.2.2 driver release I've turned these messages on
by default.

>Kernel was 2.4.9ac9 with (new) AIC driver 6.2.1, compiled with "Maximum
>Number of TCQ Commands per Device" set to 64.

This is 8 times the tag load the old driver defaults to.

>So I compiled the same kernel with the old AIC driver, and it works fine.

Which may be due to a lighter load on the drive.  It's hard to say without
the verbose messages and the full dmesg for the machine.  Your IBM drive
may be running the "if I miss a seek, I fall off the bus" firmware where
the bug is only triggered under high load.  Send the dmesg output and we'll
see.

>I just guess when
>saying that it seems to me that the driver developers were focused on
>up-to-date cards but not the older ones.

This isn't true.

--
Justin


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-07 22:32     ` Justin T. Gibbs
@ 2001-09-07 22:51       ` Frank Schneider
  2001-09-07 23:37         ` Justin T. Gibbs
  0 siblings, 1 reply; 28+ messages in thread
From: Frank Schneider @ 2001-09-07 22:51 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-kernel

"Justin T. Gibbs" wrote:
> 
> >Okay, I had it again today:
> 
> You need to be running with aic7xxx=verbose for these messages to be
> useful.  In the 6.2.2 driver release I've turned these messages on
> by default.

Could you please briefly explain what this option does (before it
fills my logfiles with notes like "successfully wrote 1 byte to disk abc"..:-)?
I recently also had some problems with aic7xxx, but they were due to a
misconfigured SCSI bus and perhaps a bad drive (still under test), so
I enabled SCSI error logging in the kernel (2.4.3, RH7.1) and sent
the following strings to /proc/scsi/scsi:

/bin/echo "scsi log error 5" > /proc/scsi/scsi
/bin/echo "scsi log mlqueue 3" > /proc/scsi/scsi
/bin/echo "scsi log hlcomplete 1" > /proc/scsi/scsi
/bin/echo "scsi log scan 5" > /proc/scsi/scsi

But it did not give me the kind of info I wanted to see. Does
"aic7xxx=verbose" do something similar or something completely different?
 
> >Kernel was 2.4.9ac9 with (new) AIC driver 6.2.1, compiled with "Maximum
> >Number of TCQ Commands per Device" set to 64.
> 
> This is 8 times the tag load the old driver defaults to.

That's true, and e.g. my relatively new IBM drives (DGHS18V, 2x
DNES-309170W, DDRS-39130W, all server disks according to IBM) can only
handle 64, and the kernel complains if I compile it with 255 and locks to
64. When I played with this feature a while ago, I did not notice a
big performance gain from 8 to 64, so I switched to 32, and I would go
down to <8 if I were in doubt.

> >So I compiled the same kernel with the old AIC driver and it works fine.

Test it longer and under load. I also "cured" a bad SCSI bus by
switching drivers once...sometimes it really seems to work...for
some days...:-)
 
> Which may be due to a lighter load on the drive.  Its hard to say without
> the verbose messages and the full dmesg for the machine.  You're IBM drive
> may be running the "if I miss a seek, I fall off the bus" firmware where
> the bug is only triggered under high load.  Send the dmesg output and we'll
> see.

Solong...
Frank.

--
Frank Schneider, <SPATZ1@T-ONLINE.DE>.                           
Microsoft isn't the answer.
Microsoft is the question, and the answer is NO.
... -.-


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-07 22:51       ` Frank Schneider
@ 2001-09-07 23:37         ` Justin T. Gibbs
  2001-09-10 13:50           ` Olaf Zaplinski
  0 siblings, 1 reply; 28+ messages in thread
From: Justin T. Gibbs @ 2001-09-07 23:37 UTC (permalink / raw)
  To: Frank Schneider; +Cc: linux-kernel

>> You need to be running with aic7xxx=verbose for these messages to be
>> useful.  In the 6.2.2 driver release I've turned these messages on
>> by default.
>
>Could you please shortly explain what this option does...(before it
>fills my logfiles with notes "succesfully wrote 1 Byte to disk abc"..:-)

It turns on some diagnostics regarding:

1) Card initialization
2) Transfer Negotiation (occurs with every check condition that occurs
   prior to sending data, so while not rare, is not a common occurrence).
3) Abort/Timeout processing

It should not fill your log file unless you have a timeout.  This is
exactly the time you want it to fill your logs, so I can help diagnose
and fix your problem.
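
To confirm that the option actually reached the kernel at boot, the command
line can be checked for it. A minimal sketch, assuming the driver options
follow `aic7xxx=` as a comma-separated list and that the boot line is
available as text (for example from /proc/cmdline); the sample line and both
helper names are made up for illustration.

```python
def aic7xxx_options(cmdline):
    """Return the options given after 'aic7xxx=' on a kernel command line.

    Assumes simple comma-separated options; values that themselves contain
    commas would be split incorrectly by this sketch.
    """
    for token in cmdline.split():
        if token.startswith("aic7xxx="):
            return token[len("aic7xxx="):].split(",")
    return []


def verbose_enabled(cmdline):
    """True if 'verbose' is among the aic7xxx options."""
    return "verbose" in aic7xxx_options(cmdline)


# Hypothetical boot line, for illustration only.
print(verbose_enabled("root=/dev/sda1 ro aic7xxx=verbose"))
```

This only covers a driver built into the kernel; when the driver is loaded
as a module, the option is given on the modprobe line instead and does not
show up on the kernel command line.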

>> This is 8 times the tag load the old driver defaults to.
>
>Thats true, and e.g., my relatively new IBM-drives (DGHS18V, 2x
>DNES-309170W,  DDRS-39130W, all Server-disks according to IBM) can only
>64...and the kernel complains, if i compile it with 255 and locks to
>64...

It's not really "complaining", it's just telling you that it has determined
the proper setting for this device.  There is an advantage to setting
your tag depth to the locked value - the SCSI layer cannot be told
dynamically to lower the tag depth, so there may be extra transactions
sitting in the driver queue for no real purpose - but it's not that
big of a deal.
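
As a rough model of the numbers in this exchange: the depth actually used is
the smaller of the compiled-in value and the value the driver locks to, and
anything above that can only wait in the driver queue. The sketch is an
illustration, not driver code; the old default of 8 is inferred from the
"8 times" remark about a setting of 64, and the function names are made up.

```python
OLD_DRIVER_DEFAULT_TAGS = 8  # inferred: 64 is "8 times" the old default


def effective_tag_depth(configured, locked):
    """Depth the device actually sees: capped at the locked value."""
    return min(configured, locked)


def surplus_in_driver_queue(configured, locked):
    """Upper bound on transactions that can only sit in the driver queue."""
    return max(0, configured - locked)


for configured in (255, 64, 32):
    depth = effective_tag_depth(configured, 64)
    extra = surplus_in_driver_queue(configured, 64)
    ratio = depth / OLD_DRIVER_DEFAULT_TAGS
    print(configured, depth, extra, ratio)
```

With a configured value at or below the locked one, the surplus is zero,
which is the advantage mentioned above.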

>as i have played with this feature a while ago, i did not realize a
>big performance-plus from 8 to 64, so i switched to 32...and i would go
>down to <8 if i where in doubt....

It all depends on your workload.  If you run a news server or have lots
of concurrent active users on the machine, you are more likely to see
a difference.

--
Justin


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-07 20:32   ` AIC + RAID1 error? (was: Re: aic7xxx errors) Olaf Zaplinski
  2001-09-07 22:32     ` Justin T. Gibbs
@ 2001-09-08 20:25     ` Frank Schneider
  2001-09-08 22:07       ` Justin T. Gibbs
  1 sibling, 1 reply; 28+ messages in thread
From: Frank Schneider @ 2001-09-08 20:25 UTC (permalink / raw)
  To: linux-kernel

Olaf Zaplinski wrote:
> 
> Olaf Zaplinski wrote:
> >
> > Joseph Mathewson wrote:
> > >
> > > I've just woken up this morning to find my internet gateway machine only
> > > responding to pings, and on giving it a keyboard & monitor, a load of
> > >
> > > scsi0:0:1:0: Attempting to queue an ABORT message
> > > scsi0:0:1:0: Cmd aborted from QINFIFO
> > > aic7xxx_abort returns 8194
> > >
> > > errors.
> > [...]
> >
> > /me too. I had this while booting 2.4.9 with a fresh installed SCSI card
> > (AHA2940) + harddisk. What worked for me was to compile the kernel with the
> > old Adaptec driver, so it's a driver issue.
> 

Hello...

I am currently seeing a probably similar problem with aic7xxx and
RAID5:

I run a RAID5 array on three SCSI disks, all IBM, all LVD, on the
AIC7xxx controller on the mobo (ASUS P2B-DS), and from time to time
(usually about once per week) the same partition of the RAID5
gets a read error and falls out of the array:

-------------------------
Sep  8 20:49:31 falcon kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 8000002
Sep  8 20:49:31 falcon kernel: [valid=0] Info fld=0x0, Current sd08:04: sense key Hardware Error
Sep  8 20:49:31 falcon kernel: Additional sense indicates Internal target failure
Sep  8 20:49:31 falcon kernel:  I/O error: dev 08:04, sector 8545688
Sep  8 20:49:31 falcon kernel: raid5: Disk failure on sda4, disabling device. Operation continuing on 2 devices
Sep  8 20:49:31 falcon kernel: md: recovery thread got woken up ...
Sep  8 20:49:31 falcon kernel: md0: no spare disk to reconstruct array! -- continuing in degraded mode
Sep  8 20:49:31 falcon kernel: md: recovery thread finished ...
Sep  8 20:49:31 falcon kernel: md: updating md0 RAID superblock on device
Sep  8 20:49:31 falcon kernel: sdc1 [events: 000000be](write) sdc1's sb offset: 8707072
Sep  8 20:49:32 falcon kernel: sdb1 [events: 000000be](write) sdb1's sb offset: 8707072
Sep  8 20:49:32 falcon kernel: (skipping faulty sda4 )
Sep  8 20:49:32 falcon kernel: .
----------------------------
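
The "return code = 8000002" in the first line can be split into its parts.
A sketch, assuming the value is printed in hex and uses the usual Linux SCSI
result layout (status in the low byte, then message, host and driver bytes);
the `split_result` helper is made up for illustration.

```python
def split_result(result):
    """Split a 32-bit SCSI result word into its four byte fields."""
    return {
        "status": result & 0xFF,          # raw SCSI status byte
        "msg": (result >> 8) & 0xFF,      # message byte
        "host": (result >> 16) & 0xFF,    # host adapter byte
        "driver": (result >> 24) & 0xFF,  # driver byte
    }


# The log prints the value without a 0x prefix; it is read as hex here.
fields = split_result(int("8000002", 16))
print(fields)
```

Under that assumption the status byte is 0x02 (CHECK CONDITION) and the
driver byte is 0x08 (sense data available), with nothing set in the host
byte, which matches the sense information printed on the next two lines.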

OK, I also thought "bad disk", and to verify this (the drive is still
under warranty) I formatted it, let the AIC BIOS do a "remap of
bad blocks" and ran "badblocks" on it about 5 times with the
"-w" option. Last but not least I copied over 160GB from and to the
drive over two days: nothing, not a single failure of the drive. Today
I re-integrated the disk into my array and already got the first
fall-off.

I have now also switched to the old aic7xxx driver, just to get an idea
where to look for the problem: in the RAID code, in the driver or
somewhere else.

Solong..
Frank.

--
Frank Schneider, <SPATZ1@T-ONLINE.DE>.                           
Microsoft isn't the answer.
Microsoft is the question, and the answer is NO.
... -.-


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-08 20:25     ` Frank Schneider
@ 2001-09-08 22:07       ` Justin T. Gibbs
  0 siblings, 0 replies; 28+ messages in thread
From: Justin T. Gibbs @ 2001-09-08 22:07 UTC (permalink / raw)
  To: Frank Schneider; +Cc: linux-kernel

>I run a RAID5-Array on three SCSI-Disks, all IBM, all LVD on the
>AIC7xxx-Controller on the Mobo (ASUS-P2B-DS)...and from time to time
>(usually about once per week) always the same partition of the RAID5
>gets a readerror and falls out of the array:

This is a very different issue.  The drive has even told you what is
wrong.

>-------------------------
>Sep  8 20:49:31 falcon kernel: SCSI disk error : host 0 channel 0 id 0
>lun 0 return code = 8000002
>Sep  8 20:49:31 falcon kernel: [valid=0] Info fld=0x0, Current sd08:04:
>sense key Hardware Error
>Sep  8 20:49:31 falcon kernel: Additional sense indicates Internal
				^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>target failure
 ^^^^^^^^^^^^^^

Something bad happened inside the disk.  Perhaps IBM can tell you what,
but it is not the aic7xxx driver, SCSI layer, or md's fault for this
disk going offline.

>Ok, i also thought: "Bad disk" and to verify this (i have still
>guarantee on the drive) i formated it, let the AIC-BIOS do a "remap of
>bad blocks" and ran "badblocks" about 5 times on it with the

Target failures are not "media errors".  If the drive was experiencing
a media problem, it would have said so.
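
The distinction can be made concrete with the standard SCSI sense keys. The
table below is a partial sketch of the commonly documented values, and the
`blames_media` helper is made up for illustration.

```python
# Partial table of SCSI sense keys (standard values, not exhaustive).
SENSE_KEYS = {
    0x0: "No Sense",
    0x1: "Recovered Error",
    0x2: "Not Ready",
    0x3: "Medium Error",
    0x4: "Hardware Error",
    0x5: "Illegal Request",
    0x6: "Unit Attention",
    0x7: "Data Protect",
    0xB: "Aborted Command",
}


def blames_media(sense_key):
    """True only when the drive itself reports a medium (surface) error."""
    return SENSE_KEYS.get(sense_key) == "Medium Error"


# The log above reported sense key "Hardware Error".
print(SENSE_KEYS[0x4], blames_media(0x4))
```

A surface test such as badblocks exercises the medium, so it can pass
cleanly while the drive still reports hardware errors of this kind.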

--
Justin


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-07 23:37         ` Justin T. Gibbs
@ 2001-09-10 13:50           ` Olaf Zaplinski
  2001-09-10 19:11             ` Frank Schneider
  2001-09-11 15:00             ` Olaf Zaplinski
  0 siblings, 2 replies; 28+ messages in thread
From: Olaf Zaplinski @ 2001-09-10 13:50 UTC (permalink / raw)
  To: linux-kernel

Okay, I tested it today, compiled 2.4.9ac10 with the new driver and TCQ set
to 32. I built the driver as a module to make sure that the machine at least
boots into runlevel 3 (I have no console access, only access to the reset
switch).

I rebooted and inserted the driver with 'modprobe aic7xxx', remembered that
I forgot the verbose flag, removed the driver with 'modprobe -r' and
re-inserted it with 'modprobe aic7xxx aic7xxx=verbose'. The machine was
still alive then. But right after entering 'raidhotadd /dev/md1 /dev/sda1'
the machine hung. reiserfs erased the last lines of /var/log/messages, but
AFAIK the verbose driver output showed no errors.

But how can I help to reproduce the error? Of course I could break the
mirror, compile the driver into the kernel (non-module) and do some stress
test on the SCSI drive. But it's not so good when I drive this machine into
a hang too often.

I compiled the old driver now, also with TCQ set to 32, and the machine
seems to work fine.

Olaf


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-10 13:50           ` Olaf Zaplinski
@ 2001-09-10 19:11             ` Frank Schneider
  2001-09-10 22:29               ` Andreas Steinmetz
  2001-09-11 15:00             ` Olaf Zaplinski
  1 sibling, 1 reply; 28+ messages in thread
From: Frank Schneider @ 2001-09-10 19:11 UTC (permalink / raw)
  To: linux-kernel

Olaf Zaplinski wrote:
> 
> Okay, I tested it today, compiled 2.4.9ac10 with the new driver and TCQ set
> to 32. I built the driver as a module to make sure that the machine at least
> boots into runlevel 3 (I have no console access, only access to the reset
> switch).
> 
> I rebooted and inserted the driver with 'modprobe aic7xxx', remembered that
> I forgot the verbose flag, removed the driver with 'modprobe -r' and
> re-inserted it with 'modprobe aic7xxx aic7xxx=verbose'. The machine was
> still alive then. But right after entering 'raidhotadd /dev/md1 /dev/sda1'
> the machine hung. reiserfs erased the last lines of /var/log/messages, but
> AFAIK the verbose driver output showed no errors.
> 
> But how can I help to reproduce the error? Of course I could break the
> mirror, compile the driver into the kernel (non-module) and do some stress
> test on the SCSI drive. But it's not so good when I drive this machine into
> a hang too often.
> 
> I compiled the old driver now, also with TCQ set to 32, and the machine
> seems to work fine.
> 

Hello...

I am also currently testing my RAID problem, where one drive
falls out of the array. So far it has not happened with the old driver,
but that means nothing, as it only happened once a week or so.

Something else made me wonder:
I ran the machine several times with the *new* aic7xxx driver (TCQ=32)
and "aic7xxx=verbose" on the kernel command line, and I noticed the
following: at every reboot (done with "reboot", RH7.1), the machine was
not able to stop the RAID5 correctly. It unmounted the mountpoint (/home)
and then normally wants to stop the RAID (you see the messages
"mdrecoveryd got waken up..."), but that did not work, and after some time
(30 sec) the kernel oopsed. This was reproducible and only occurred when
booted with the "aic7xxx=verbose" kernel parameter.
The effect after the reboot was that the RAID had to be resynced because
one partition (the one which always falls out) was damaged, or at least
seemed to be.
(The filesystem was clean; it had already been unmounted when the oops
occurred.)

Perhaps someone can test whether this is reproducible on their machine
too. I use kernel 2.4.3, RAID is built in, as is aic7xxx, and there are
three RAID disks (LVD, aic7xxx controller on the mobo) in a RAID5 mounted
as /home.

Solong...
Frank.

--
Frank Schneider, <SPATZ1@T-ONLINE.DE>.                           
Microsoft isn't the answer.
Microsoft is the question, and the answer is NO.
... -.-


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-10 19:11             ` Frank Schneider
@ 2001-09-10 22:29               ` Andreas Steinmetz
  2001-09-10 22:42                 ` Justin T. Gibbs
  2001-09-10 22:46                 ` Frank Schneider
  0 siblings, 2 replies; 28+ messages in thread
From: Andreas Steinmetz @ 2001-09-10 22:29 UTC (permalink / raw)
  To: Frank Schneider; +Cc: linux-kernel

> Something other made me wonder:
> I ran the machine several times with the *new* aic7xxx-driver (TCQ=32)
> and the "aic7xxx=verbose" commandline, and i noticed the following:
> At every reboot (made by "reboot", RH7.1), the machine was not able to
> stop the raid5 correctly...it un-mounted the mountpoint (/home) and then
> it normaly wants to stop the raid...(you see the messages "mdrecoveryd
> got waken up...") but that did not work and after some time (30sec) the
> kernel Ooopsed. This was reproducable and only occured if booted with
> the "aic7xxx=verbose" kernel-parameter.
> The effect after reboot was, that the raid had to be resynced because
> one partition (that which always falls out) was damaged or at least
> seemed to.
> (The filesystem was clean, that was already unmounted as the oops
> occured.)
> 
> Perhaps someone can test if this is reproducable with his machine
> too...i use kernel 2.4.3, raid is built-in, also the aic7xxx, there are
> three raid-disks (LVD, aic7xxx-controller on Mobo) in a raid5 mounted as
> /home.
> 
Same behaviour for RAID1 and the new aic7xxx driver for me at nearly every
reboot. The old driver works just fine (2.4.9).


Andreas Steinmetz
D.O.M. Datenverarbeitung GmbH


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-10 22:29               ` Andreas Steinmetz
@ 2001-09-10 22:42                 ` Justin T. Gibbs
  2001-09-10 22:55                   ` Frank Schneider
  2001-09-10 23:05                   ` Andreas Steinmetz
  2001-09-10 22:46                 ` Frank Schneider
  1 sibling, 2 replies; 28+ messages in thread
From: Justin T. Gibbs @ 2001-09-10 22:42 UTC (permalink / raw)
  To: Andreas Steinmetz; +Cc: Frank Schneider, linux-kernel

>> Something other made me wonder:
>> I ran the machine several times with the *new* aic7xxx-driver (TCQ=32)
>> and the "aic7xxx=verbose" commandline, and i noticed the following:
>> At every reboot (made by "reboot", RH7.1), the machine was not able to
>> stop the raid5 correctly...it un-mounted the mountpoint (/home) and then
>> it normaly wants to stop the raid...(you see the messages "mdrecoveryd
>> got waken up...") but that did not work and after some time (30sec) the
>> kernel Ooopsed.

...

>Same behaviour for RAID1 and the new aic7xxx driver for me at nearly every
>reboot. The old driver works just fine (2.4.9).

The new driver registers a "reboot notifier" with the system.  If MD
continues to perform I/O after the aic7xxx driver's notification routine
is called, the result is undefined.  The aic7xxx driver has already
shut down the hardware.  Perhaps I should use a different event to indicate
it is safe for me to clean up the hardware?

--
Justin


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-10 22:29               ` Andreas Steinmetz
  2001-09-10 22:42                 ` Justin T. Gibbs
@ 2001-09-10 22:46                 ` Frank Schneider
  1 sibling, 0 replies; 28+ messages in thread
From: Frank Schneider @ 2001-09-10 22:46 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andreas Steinmetz

Andreas Steinmetz wrote:
> 
> > Something other made me wonder:
> > I ran the machine several times with the *new* aic7xxx-driver (TCQ=32)
> > and the "aic7xxx=verbose" commandline, and i noticed the following:
> > At every reboot (made by "reboot", RH7.1), the machine was not able to
> > stop the raid5 correctly...it un-mounted the mountpoint (/home) and then
> > it normaly wants to stop the raid...(you see the messages "mdrecoveryd
> > got waken up...") but that did not work and after some time (30sec) the
> > kernel Ooopsed. This was reproducable and only occured if booted with
> > the "aic7xxx=verbose" kernel-parameter.
> > The effect after reboot was, that the raid had to be resynced because
> > one partition (that which always falls out) was damaged or at least
> > seemed to.
> > (The filesystem was clean, that was already unmounted as the oops
> > occured.)
> >
> > Perhaps someone can test if this is reproducable with his machine
> > too...i use kernel 2.4.3, raid is built-in, also the aic7xxx, there are
> > three raid-disks (LVD, aic7xxx-controller on Mobo) in a raid5 mounted as
> > /home.
> >
> Same behaviour for RAID1 and the new aic7xxx driver for me at nearly every
> reboot. The old driver works just fine (2.4.9).

OK, as I am using kernel 2.4.3, it seems that the problem exists from
2.4.3 to 2.4.9. Could you post the kernel oops?

I can and will as well, but I am still testing the old driver against my
disk-falls-out-of-raid problem, so I cannot reboot for the next week or
so, as that problem only occurs randomly, about once per week :-( and I
want to narrow it down to be sure that it is not something else.

One thing I notice at the moment:
The old driver uses a default TCQ of 8, yet my /proc/scsi/aic7xxx/0 says
that the actual queue depth per device is 1,1,1,1,1... while the TCQ is 8.
We should test whether the problem with the new driver goes away if we
set a TCQ of 1. Or has someone done this already?

IMHO this points to the theory that the raid code and the (new)
aic7xxx code interfere in some way (race condition?). Perhaps this
also causes my disk to fall out of the raid.

Solong...
Frank.

--
Frank Schneider, <SPATZ1@T-ONLINE.DE>.                           
Microsoft isn't the answer.
Microsoft is the question, and the answer is NO.
... -.-


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-10 22:42                 ` Justin T. Gibbs
@ 2001-09-10 22:55                   ` Frank Schneider
  2001-09-10 23:06                     ` Justin T. Gibbs
  2001-09-10 23:05                   ` Andreas Steinmetz
  1 sibling, 1 reply; 28+ messages in thread
From: Frank Schneider @ 2001-09-10 22:55 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-kernel

"Justin T. Gibbs" wrote:
> 
> >> Something other made me wonder:
> >> I ran the machine several times with the *new* aic7xxx-driver (TCQ=32)
> >> and the "aic7xxx=verbose" commandline, and i noticed the following:
> >> At every reboot (made by "reboot", RH7.1), the machine was not able to
> >> stop the raid5 correctly...it un-mounted the mountpoint (/home) and then
> >> it normaly wants to stop the raid...(you see the messages "mdrecoveryd
> >> got waken up...") but that did not work and after some time (30sec) the
> >> kernel Ooopsed.
> 
> ...
> 
> >Same behaviour for RAID1 and the new aic7xxx driver for me at nearly every
> >reboot. The old driver works just fine (2.4.9).
> 
> The new driver registers a "reboot notifier" with the system.  If MD
> continues to perform I/O after the aic7xxx driver's notification routine
> is called, the result is undefined.  The aic7xxx driver has already
> shutdown the hardware.  Perhaps I should use a different event to indicate
> it is safe for me to clean up the hardware?

What about a kind of timer?

When the driver gets the "reboot" notification, it could watch for
activity and shut down the hardware 5 or 10 secs after the last activity.

Shutting down the user processes is done in a similar way: "send
TERM", sleep 5, "send KILL". By the time the reboot notification arrives,
all unmounts and kills should already have occurred, so it can only be a
matter of <5 secs until the last (raid) process has exited.

The only other possibility would be to let the kernel send this message
just before it reboots the machine via a BIOS call, but even then you
would have to wait a little until the hardware reacts. Difficult problem...

Solong...
Frank

--
Frank Schneider, <SPATZ1@T-ONLINE.DE>.                           
Microsoft isn't the answer.
Microsoft is the question, and the answer is NO.
... -.-


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-10 22:42                 ` Justin T. Gibbs
  2001-09-10 22:55                   ` Frank Schneider
@ 2001-09-10 23:05                   ` Andreas Steinmetz
  1 sibling, 0 replies; 28+ messages in thread
From: Andreas Steinmetz @ 2001-09-10 23:05 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-kernel, linux-kernel, Frank Schneider

> 
> The new driver registers a "reboot notifier" with the system.  If MD
> continues to perform I/O after the aic7xxx driver's notification routine
> is called, the result is undefined.  The aic7xxx driver has already
> shutdown the hardware.  Perhaps I should use a different event to indicate
> it is safe for me to clean up the hardware?
> 

Gotcha!

Actually the problem seems to be that the raid code and the scsi code both
register reboot notifiers with the same priority (0, see below).

include/linux/notifier.h:
 
struct notifier_block
{
        int (*notifier_call)(struct notifier_block *self, unsigned long, void *);
        struct notifier_block *next;
        int priority;
};
 
drivers/md/md.c:
 
struct notifier_block md_notifier = {
        md_notify_reboot,
        NULL,
        0
};
 
drivers/scsi/aic7xxx/aic7xxx_linux.c:
 
static struct notifier_block ahc_linux_notifier = {
        ahc_linux_halt, NULL, 0
};

When notifiers are registered at the same priority level, their order in the
chain depends on who registers first.

kernel/sys.c:
 
int notifier_chain_register(struct notifier_block **list, struct notifier_block *n)
{
        write_lock(&notifier_lock);
        while(*list)
        {
                if(n->priority > (*list)->priority)
                        break;
                list= &((*list)->next);
        }
        n->next = *list;
        *list=n;
        write_unlock(&notifier_lock);
        return 0;
}

The notifier chain is then processed sequentially.

kernel/sys.c:

int notifier_call_chain(struct notifier_block **n, unsigned long val, void *v)
{
        int ret=NOTIFY_DONE;
        struct notifier_block *nb = *n;
 
        while(nb)
        {
                ret=nb->notifier_call(nb,val,v);
                if(ret&NOTIFY_STOP_MASK)
                {
                        return ret;
                }
                nb=nb->next;
        }
        return ret;
}

So what's actually required is to set the raid notifier to a higher priority
than the scsi notifier to ensure that raid is stopped before scsi.
Unfortunately I can't test this right now, as I'm working from home and I need
physical access to the systems (reset button) if it doesn't work out.
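
A minimal userspace sketch of the ordering described above. It reuses the
insertion rule from the kernel/sys.c excerpt (without the locking); the
callbacks fake_md_notify and fake_scsi_halt and the helper shutdown_order
are hypothetical stand-ins and are not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel structure quoted above. */
struct notifier_block {
        int (*notifier_call)(struct notifier_block *self, unsigned long val,
                             void *v);
        struct notifier_block *next;
        int priority;
};

/* Same insertion rule as notifier_chain_register(), minus the locking. */
static void chain_register(struct notifier_block **list,
                           struct notifier_block *n)
{
        while (*list) {
                if (n->priority > (*list)->priority)
                        break;
                list = &((*list)->next);
        }
        n->next = *list;
        *list = n;
}

/* Call order is recorded here: 'M' for md, 'S' for the SCSI driver. */
static char call_log[3];
static size_t call_count;

static void log_call(char who)
{
        assert(call_count < sizeof(call_log) - 1);
        call_log[call_count++] = who;
        call_log[call_count] = '\0';
}

/* Hypothetical stand-in for md_notify_reboot. */
static int fake_md_notify(struct notifier_block *self, unsigned long val,
                          void *v)
{
        (void)self; (void)val; (void)v;
        log_call('M');
        return 0;
}

/* Hypothetical stand-in for ahc_linux_halt. */
static int fake_scsi_halt(struct notifier_block *self, unsigned long val,
                          void *v)
{
        (void)self; (void)val; (void)v;
        log_call('S');
        return 0;
}

/*
 * Hypothetical helper: builds a two-entry chain, walks it once and
 * returns the call order as a string, "SM" or "MS".
 */
const char *shutdown_order(int md_priority, int md_registers_first)
{
        struct notifier_block md = { fake_md_notify, NULL, 0 };
        struct notifier_block scsi = { fake_scsi_halt, NULL, 0 };
        struct notifier_block *chain = NULL;
        struct notifier_block *nb;

        md.priority = md_priority;
        if (md_registers_first) {
                chain_register(&chain, &md);
                chain_register(&chain, &scsi);
        } else {
                chain_register(&chain, &scsi);
                chain_register(&chain, &md);
        }

        call_count = 0;
        call_log[0] = '\0';
        for (nb = chain; nb; nb = nb->next)
                nb->notifier_call(nb, 1UL, NULL);
        return call_log;
}
```

With both priorities at 0, shutdown_order(0, 0) yields "SM" (scsi halts
first, the failing case) while shutdown_order(0, 1) yields "MS". With md at
any priority above 0, the result is "MS" regardless of registration order.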

Could you please straighten the priority issue out with the raid maintainer?


Andreas Steinmetz
D.O.M. Datenverarbeitung GmbH


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-10 22:55                   ` Frank Schneider
@ 2001-09-10 23:06                     ` Justin T. Gibbs
  2001-09-10 23:37                       ` Andreas Steinmetz
  2001-09-11 12:10                       ` Frank Schneider
  0 siblings, 2 replies; 28+ messages in thread
From: Justin T. Gibbs @ 2001-09-10 23:06 UTC (permalink / raw)
  To: Frank Schneider; +Cc: linux-kernel

>What about a kind of timer ?

The functions are run serially.  If I'm to wait, I must block
or risk having the machine powered off prior to completing my shutdown.

A coworker of mine playing with the MD code reminded me that
he had to change the priority of the MD notifier to make it work.
I believe that this is the correct fix as there are other SCSI
drivers that have shutdown hooks.

All HBA drivers currently use 0 (or the lowest) as their priority.
MD (line 3475 of drivers/md/md.c) uses 0 too.  Change it to INT_MAX
and MD will always be shut down prior to any child devices it might
use.
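
As a rough illustration, the change might look like the sketch below. It is
written as a userspace stand-in so that it compiles on its own: the struct
is copied from the include/linux/notifier.h excerpt earlier in the thread,
the body of md_notify_reboot is a placeholder, and md_always_runs_before is
a hypothetical helper that only exists for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Userspace copy of the kernel structure, for illustration only. */
struct notifier_block {
        int (*notifier_call)(struct notifier_block *self, unsigned long val,
                             void *v);
        struct notifier_block *next;
        int priority;
};

/* Placeholder body; the real function lives in drivers/md/md.c. */
static int md_notify_reboot(struct notifier_block *self, unsigned long code,
                            void *x)
{
        (void)self; (void)code; (void)x;
        return 0;
}

/* The suggested change: priority INT_MAX instead of 0. */
struct notifier_block md_notifier = {
        md_notify_reboot,
        NULL,
        INT_MAX
};

/*
 * Hypothetical helper: non-zero if MD is placed ahead of 'other' in the
 * chain no matter which of the two registers first. That holds only when
 * MD's priority is strictly greater.
 */
int md_always_runs_before(const struct notifier_block *other)
{
        assert(other != NULL);
        return md_notifier.priority > other->priority;
}
```

Any HBA notifier at priority 0 then sorts after MD. A second notifier that
also used INT_MAX would tie with MD and fall back to registration order.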

--
Justin


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-10 23:06                     ` Justin T. Gibbs
@ 2001-09-10 23:37                       ` Andreas Steinmetz
  2001-09-10 23:46                         ` Justin T. Gibbs
  2001-09-11 12:10                       ` Frank Schneider
  1 sibling, 1 reply; 28+ messages in thread
From: Andreas Steinmetz @ 2001-09-10 23:37 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-kernel, linux-kernel, Frank Schneider

> MD (line 3475 of drivers/md/md.c) uses 0 too.  Change it to INT_MAX
> and MD will always get shutdown prior to any child devices it might

I don't believe INT_MAX is a good idea. What happens if anything else needs
to shut down prior to md (think of tux, knfsd)? As a suggestion, it would be
good if someone with a broader overview defined some reboot
priorities in include/linux/notifier.h.
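
To make the suggestion concrete, such a set of levels could look roughly
like the sketch below. Every name and value here is hypothetical; the thread
implies that no such definitions existed in include/linux/notifier.h at the
time, and the gaps between the values are arbitrary.

```c
#include <assert.h>

/*
 * HYPOTHETICAL reboot notifier priority levels. Higher values are called
 * earlier, matching the comparison in notifier_chain_register().
 */
enum reboot_prio {
        REBOOT_PRIO_HBA         = 0,    /* host adapters: shut down last */
        REBOOT_PRIO_BLOCK_STACK = 1000, /* stacked block drivers such as md */
        REBOOT_PRIO_FS          = 2000  /* anything that must flush before md */
};

/* Non-zero if a notifier at level 'a' is called before one at level 'b'. */
int reboot_prio_runs_before(enum reboot_prio a, enum reboot_prio b)
{
        return (int)a > (int)b;
}

/* Sanity check for the table above: the levels must be strictly ordered. */
void reboot_prio_check(void)
{
        assert(REBOOT_PRIO_FS > REBOOT_PRIO_BLOCK_STACK);
        assert(REBOOT_PRIO_BLOCK_STACK > REBOOT_PRIO_HBA);
}
```

Leaving room between the levels would let a later subsystem slot in between
md and the HBA drivers without renumbering, which INT_MAX does not allow.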


Andreas Steinmetz
D.O.M. Datenverarbeitung GmbH


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-10 23:37                       ` Andreas Steinmetz
@ 2001-09-10 23:46                         ` Justin T. Gibbs
  2001-09-11  0:00                           ` Andreas Steinmetz
  0 siblings, 1 reply; 28+ messages in thread
From: Justin T. Gibbs @ 2001-09-10 23:46 UTC (permalink / raw)
  To: Andreas Steinmetz; +Cc: linux-kernel, Frank Schneider

>> MD (line 3475 of drivers/md/md.c) uses 0 too.  Change it to INT_MAX
>> and MD will always get shutdown prior to any child devices it might
>
>I don't believe INT_MAX to be a good idea. What happens if anything else needs
>to shutdown prior to md (think of tux, knfsd)?

Your examples are processes (albeit in the kernel) which should have
received a signal long before the notifier chain is called.

>As a suggestion it would be a
>good idea if someone with a broader overview would define some reboot
>priorities in include/linux/notifier.h.

And expand the codes that are used for the notifier.  The current set
of codes is not well defined and most drivers treat all of them the
same.
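
For illustration, a handler that does distinguish the codes might be shaped
like the userspace sketch below. The SYS_* and NOTIFY_* values are redefined
locally as I recall them from include/linux/notifier.h, so treat them as an
assumption; the actions, last_action and example_reboot_handler are
hypothetical and do not come from any driver.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies for the sketch; assumed to match include/linux/notifier.h. */
#define NOTIFY_DONE   0x0000
#define NOTIFY_OK     0x0001
#define SYS_DOWN      0x0001    /* also used for restart */
#define SYS_HALT      0x0002
#define SYS_POWER_OFF 0x0003

struct notifier_block {
        int (*notifier_call)(struct notifier_block *self, unsigned long val,
                             void *v);
        struct notifier_block *next;
        int priority;
};

/* Hypothetical actions a driver could take for each code. */
enum example_action {
        ACT_NONE,
        ACT_QUIESCE,                /* restart: stop I/O, leave devices up */
        ACT_QUIESCE_AND_POWER_DOWN  /* halt or power off: also park devices */
};

/* Records what the handler decided, so the sketch can be inspected. */
enum example_action last_action = ACT_NONE;

/* Hypothetical handler that treats the codes differently. */
int example_reboot_handler(struct notifier_block *self, unsigned long code,
                           void *unused)
{
        (void)unused;
        assert(self != NULL);   /* userspace-only sanity check */

        switch (code) {
        case SYS_DOWN:
                last_action = ACT_QUIESCE;
                return NOTIFY_OK;
        case SYS_HALT:
        case SYS_POWER_OFF:
                last_action = ACT_QUIESCE_AND_POWER_DOWN;
                return NOTIFY_OK;
        default:
                last_action = ACT_NONE;
                return NOTIFY_DONE;
        }
}
```

A handler that ignores its code argument, which is the behaviour described
above, would collapse all three cases into one.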

--
Justin


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-10 23:46                         ` Justin T. Gibbs
@ 2001-09-11  0:00                           ` Andreas Steinmetz
  0 siblings, 0 replies; 28+ messages in thread
From: Andreas Steinmetz @ 2001-09-11  0:00 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: SPATZ1, Frank Schneider, linux-kernel


On 10-Sep-2001 Justin T. Gibbs wrote:
>>> MD (line 3475 of drivers/md/md.c) uses 0 too.  Change it to INT_MAX
>>> and MD will always get shutdown prior to any child devices it might
>>
>>I don't believe INT_MAX to be a good idea. What happens if anything else
>>needs
>>to shutdown prior to md (think of tux, knfsd)?
> 
> Your examples are processes (albeit in the kernel) which should have
> received a signal long before the notifier chain is called.
> 

Granted. I could, however, imagine a fs requiring a reboot notifier, and that
would definitely need to be processed before md.

>>As a suggestion it would be a
>>good idea if someone with a broader overview would define some reboot
>>priorities in include/linux/notifier.h.
> 
> And expand the codes that are used for the notifier.  The current set
> of codes are not well defined and most drivers treat all of them the
> same.
> 

I just posted more or less this request to the list.

> --
> Justin
> 

Andreas Steinmetz
D.O.M. Datenverarbeitung GmbH


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-10 23:06                     ` Justin T. Gibbs
  2001-09-10 23:37                       ` Andreas Steinmetz
@ 2001-09-11 12:10                       ` Frank Schneider
  2001-09-11 16:51                         ` Justin T. Gibbs
  1 sibling, 1 reply; 28+ messages in thread
From: Frank Schneider @ 2001-09-11 12:10 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-kernel

"Justin T. Gibbs" wrote:
> 
> >What about a kind of timer ?
> 
> The functions are run serially.  If I'm to wait, I must block
> or risk having the machine powered off prior to completing my shutdown.
> 
> A coworker of mine playing with the MD code reminded me that
> he had to change the priority of the MD notifier to make it work.
> I believe that this is the correct fix as there are other SCSI
> drivers that have shutdown hooks.
> 
> All HBA drivers currently use 0 (or the lowest) as their priority.
> MD (line 3475 of drivers/md/md.c) uses 0 too.  Change it to INT_MAX
> and MD will always get shutdown prior to any child devices it might
> use.

One question is still open in this case:
Why does the oops only occur if "aic7xxx=verbose" is set?

The above explanation is correct (AFAIK), but the kernel oops should
then happen on *every* reboot, not only when this verbose parameter is
set. Or does the driver try to shut down the drives and then write
"AIC7xxx shutdown successful" to the log? :-))

Solong...
Frank.

--
Frank Schneider, <SPATZ1@T-ONLINE.DE>.                           
Microsoft isn't the answer.
Microsoft is the question, and the answer is NO.
... -.-


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-10 13:50           ` Olaf Zaplinski
  2001-09-10 19:11             ` Frank Schneider
@ 2001-09-11 15:00             ` Olaf Zaplinski
  1 sibling, 0 replies; 28+ messages in thread
From: Olaf Zaplinski @ 2001-09-11 15:00 UTC (permalink / raw)
  To: linux-kernel

Olaf Zaplinski wrote:
[...]
> But how can I help to reproduce the error? Of course I could break the
> mirror, compile the driver into the kernel (non-module) and do some stress
> test on the SCSI drive. But it's not so good when I drive this machine into
> a hang too often.

Well, I tried that actually:

- insmod'ed the new driver ('verbose', 'tcq=32')
- broke mirror
- mke2fs /dev/sda1
- tar'ed / to /mnt (which was the mounted sda1)

=> no errors

So it has to do with the RAID code, I think.

Olaf


* Re: AIC + RAID1 error? (was: Re: aic7xxx errors)
  2001-09-11 12:10                       ` Frank Schneider
@ 2001-09-11 16:51                         ` Justin T. Gibbs
  0 siblings, 0 replies; 28+ messages in thread
From: Justin T. Gibbs @ 2001-09-11 16:51 UTC (permalink / raw)
  To: Frank Schneider; +Cc: linux-kernel

>One question is still open on this case:
>Why does the Oops only occur if the "aic7xxx=verbose" is set ?

I haven't looked to determine why this is so.

--
Justin


Thread overview: 28+ messages
2001-09-05  6:21 aic7xxx errors Joseph Mathewson
2001-09-05  7:58 ` Olaf Zaplinski
2001-09-05  9:04   ` Frank Schneider
2001-09-05 10:27     ` Antonio Miguel Trindade
2001-09-05 10:44       ` Frank Schneider
2001-09-05 11:21         ` Thorsten Kranzkowski
2001-09-05 13:05           ` Frank Schneider
2001-09-07 20:32   ` AIC + RAID1 error? (was: Re: aic7xxx errors) Olaf Zaplinski
2001-09-07 22:32     ` Justin T. Gibbs
2001-09-07 22:51       ` Frank Schneider
2001-09-07 23:37         ` Justin T. Gibbs
2001-09-10 13:50           ` Olaf Zaplinski
2001-09-10 19:11             ` Frank Schneider
2001-09-10 22:29               ` Andreas Steinmetz
2001-09-10 22:42                 ` Justin T. Gibbs
2001-09-10 22:55                   ` Frank Schneider
2001-09-10 23:06                     ` Justin T. Gibbs
2001-09-10 23:37                       ` Andreas Steinmetz
2001-09-10 23:46                         ` Justin T. Gibbs
2001-09-11  0:00                           ` Andreas Steinmetz
2001-09-11 12:10                       ` Frank Schneider
2001-09-11 16:51                         ` Justin T. Gibbs
2001-09-10 23:05                   ` Andreas Steinmetz
2001-09-10 22:46                 ` Frank Schneider
2001-09-11 15:00             ` Olaf Zaplinski
2001-09-08 20:25     ` Frank Schneider
2001-09-08 22:07       ` Justin T. Gibbs
2001-09-05 20:23 ` aic7xxx errors Justin T. Gibbs
