linux-kernel.vger.kernel.org archive mirror
* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
@ 2003-08-29 16:35 Marcelo Tosatti
  2003-08-29 19:57 ` Ville Herva
  0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2003-08-29 16:35 UTC (permalink / raw)
  To: Ville Herva; +Cc: Stephan von Krawczynski, lkml


> Ville,
>
> Which kernel doesn't hang on your box? 2.4 something?

> 2.4.20pre7 ran for over 9 months before it suddenly began locking up (I
> _suppose_ it could just mean the bug/problem is hard to trigger.)
> Nothing had been changed: the box had been up for that nine month
> period, and the same oracle dump cron job had been running each night.

Strange.

> Earlier 2.4's had too many problems with aic7xxx (crashes and so on), so
> I can't comment on them.

> After 2.4.20pre7, I tried 2.4.21-jam1 (based on -aa patchset) and
> 2.4.22-pre8. I also tried compiling 2.4.21-jam1 with gcc-3.2.1 instead
> of 2.96. All of those locked up eventually, sometimes within a day from
> reboot, sometimes it takes weeks. At one point, 2.4.21-jam1 seemed to
> reliably lock up when compiling kernel, but it no longer happens no
> matter how hard I try. Usually the lock up happens during nightly oracle
> backup dump.

So NMI and sysrq don't help. I suggest a few things:

Try to make the bug easy to reproduce. Force the Oracle dumps again and
again to crash the box. Can you try it, or is it a production machine?
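
One way to force a workload "again and again" while keeping a trace of how far it got is sketched below. This is only an illustration under stated assumptions: `run_repeatedly` and the log format are hypothetical names made up for this note, and the command passed in stands for whatever reproduces the load (the Oracle export, a kernel compile, and so on). Each log line is flushed and fsync'ed before the next step, so after a hard hang the last line on disk shows which run was in progress.

```python
import os
import subprocess
import time


def run_repeatedly(cmd, iterations, log_path):
    """Run `cmd` `iterations` times, syncing a log line before and after each run.

    Returns the list of exit codes. `cmd` is an argument list for subprocess.run.
    """
    codes = []
    with open(log_path, "a") as log:

        def note(message):
            # Flush Python's buffer and ask the kernel to write the line out,
            # so it has a chance to survive a solid lock-up.
            log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {message}\n")
            log.flush()
            os.fsync(log.fileno())

        for i in range(1, iterations + 1):
            note(f"start run {i}")
            rc = subprocess.run(cmd).returncode
            note(f"done run {i} rc={rc}")
            codes.append(rc)
    return codes
```

It is probably wise to keep the log on a disk other than the one under suspicion, since an fsync to a wedged controller may never complete.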

BTW, can you describe these "Oracle dumps" in more detail? What do they do?
Do they just save lots of data to disk, or something more?

Hope we can trace this down.


* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-29 16:35 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs) Marcelo Tosatti
@ 2003-08-29 19:57 ` Ville Herva
  2003-09-09  7:05   ` Ville Herva
  0 siblings, 1 reply; 21+ messages in thread
From: Ville Herva @ 2003-08-29 19:57 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Stephan von Krawczynski, lkml

On Fri, Aug 29, 2003 at 01:35:25PM -0300, you [Marcelo Tosatti] wrote:
> 
> So NMI and sysrq don't help. I suggest a few things:
> 
> Try to make the bug easy to reproduce. Force the Oracle dumps again and
> again to crash the box. 

I happened to work towards that direction this morning (before I read your
mail). Taking the stance that this very probably had something to do with io
stress, I played around with several io loads. Eventually I found out that
fsx on scsi disk reliably caused the box to either lock up or the aic7xxx
driver to barf. What's more, it took under 15 minutes to trigger.
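
For readers who do not have fsx at hand, the general idea (random writes and reads against a file, checked against an in-memory copy) can be sketched roughly as follows. This is not fsx and does not replicate its operations; `exercise_file` and its parameters are hypothetical names chosen for this illustration. Because reads are normally served from the page cache, a sketch like this mostly generates write load rather than proving the disk contents are intact.

```python
import os
import random


def exercise_file(path, size=1 << 16, ops=200, seed=0):
    """Do `ops` random reads/writes on `path` and verify them against a shadow copy.

    Returns True if every read matched what was previously written.
    """
    rng = random.Random(seed)
    shadow = bytearray(size)  # what the file is expected to contain
    with open(path, "w+b") as f:
        f.write(shadow)
        for _ in range(ops):
            offset = rng.randrange(size)
            length = rng.randint(1, min(4096, size - offset))
            if rng.random() < 0.5:
                # Write random bytes and record them in the shadow copy.
                data = bytes(rng.getrandbits(8) for _ in range(length))
                f.seek(offset)
                f.write(data)
                shadow[offset:offset + length] = data
            else:
                # Read back a random range and compare with the shadow copy.
                f.seek(offset)
                if f.read(length) != bytes(shadow[offset:offset + length]):
                    return False
        f.flush()
        os.fsync(f.fileno())
        f.seek(0)
        return f.read() == bytes(shadow)
```

Pointing `path` at a file on the suspect disk and raising `size` and `ops` gives a crude stress load with a fixed seed, so a failing sequence can be replayed.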

So I copied the rootfs and everything else from the scsi disk to the ide
disk (just barely had enough space), and took all the scsi disk partitions
away from fstab. After reboot, I have been unable to lock it up with fsx
(scsi disk is not accessed at all), but it will take several weeks before
I'm confident that the lock up is cured.

aic7xxx / scsi hw seems quite a strong suspect for the lock ups. 2.2 possibly
worked because it has the older aic7xxx 5.x driver.

> Can you try it, or is it a production machine?

It is a sort-of-a production machine -- that's why I have been so wary of
trying different things. Sorry for that...
 
> BTW, can you describe these "Oracle dumps" in more detail? What do they do?
> Do they just save lots of data to disk, or something more?

They dump the oracle database to a backup file.

${ORAHOME}/bin/exp \
        ***/*** full=Y grants=Y \
        file=${DMPDIR}/fullexp.dmp 1>${LOGDIR}/fulllog.`date '+%a'` 2>&1
 
So basically just heavy IO afaict.

> Hope we can trace this down.

I'm still not 100% sure that the aic7xxx barfs (see
http://lkml.org/lkml/2003/7/29/33 for an example) and the lockups are of
the same origin. But it seems at least 99.5% certain.

If aic7xxx/scsi is to blame, then is it the
  - 2940 scsi adapter
  - the disk
  - the cabling or something (I've checked the termination)
  - the motherboard (irq routing?)
  - the aic7xxx driver?
  - some other kernel issue?

The hw is:
 Intel 815EEA2LU (i815 Chipset)
 Celeron 1.3GHz (Tualatin)
 Adaptec AHA-2940 / AIC-7871
   - Disk (rootfs) SEAGATE  Model: ST19171W Rev: 0024
   - Tape Drive    HP       Model: C1537A Rev: L708
 30GB IDE disk (scratch)


-- v --

v@iki.fi



* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-29 19:57 ` Ville Herva
@ 2003-09-09  7:05   ` Ville Herva
  2003-09-09  8:48     ` Stephan von Krawczynski
  0 siblings, 1 reply; 21+ messages in thread
From: Ville Herva @ 2003-09-09  7:05 UTC (permalink / raw)
  To: Marcelo Tosatti, Stephan von Krawczynski, lkml

On Fri, Aug 29, 2003 at 10:57:37PM +0300, you [Ville Herva] wrote:
> On Fri, Aug 29, 2003 at 01:35:25PM -0300, you [Marcelo Tosatti] wrote:
> > 
> > So NMI and sysrq don't help. I suggest a few things:
> > 
> > Try to make the bug easy to reproduce. Force the Oracle dumps again and
> > again to crash the box. 
> 
> I happened to work towards that direction this morning (before I read your
> mail). Taking the stance that this very probably had something to do with io
> stress, I played around with several io loads. Eventually I found out that
> fsx on scsi disk reliably caused the box to either lock up or the aic7xxx
> driver to barf. What's more, it took under 15 minutes to trigger.
> 
> So I copied the rootfs and everything else from the scsi disk to the ide
> disk (just barely had enough space), and took all the scsi disk partitions
> away from fstab. After reboot, I have been unable to lock it up with fsx
> (scsi disk is not accessed at all), but it will take several weeks before
> I'm confident that the lock up is cured.

And indeed it did lock even though the scsi disk is not used at all. It just
took weeks. 

At the time no heavy IO was going on afaict (but there might have been some
io.)

I'm completely out of ideas here. What the heck is the culprit...? Perhaps a
faulty motherboard?

> The hw is:
>  Intel 815EEA2LU (i815 Chipset)
>  Celeron 1.3GHz (Tualatin)
>  Adaptec AHA-2940 / AIC-7871 (NOT USED)
>    - Disk          SEAGATE  Model: ST19171W Rev: 0024 (NOT USED)
>    - Tape Drive    HP       Model: C1537A Rev: L708
>  30GB IDE disk  (All fs's here at the moment)


 
-- v --

v@iki.fi


* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-09-09  7:05   ` Ville Herva
@ 2003-09-09  8:48     ` Stephan von Krawczynski
  0 siblings, 0 replies; 21+ messages in thread
From: Stephan von Krawczynski @ 2003-09-09  8:48 UTC (permalink / raw)
  To: Ville Herva; +Cc: marcelo, linux-kernel

On Tue, 9 Sep 2003 10:05:07 +0300
Ville Herva <vherva@niksula.hut.fi> wrote:

> On Fri, Aug 29, 2003 at 10:57:37PM +0300, you [Ville Herva] wrote:
> > [...]
> > So I copied the rootfs and everything else from the scsi disk to the ide
> > disk (just barely had enough space), and took all the scsi disk partitions
> > away from fstab. After reboot, I have been unable to lock it up with fsx
> > (scsi disk is not accessed at all), but it will take several weeks before
> > I'm confident that the lock up is cured.
> 
> And indeed it did lock even though the scsi disk is not used at all. It just
> took weeks. 
> 
> At the time no heavy IO was going on afaict (but there might have been some
> io.)
> 
> I'm completely out of ideas here. What the heck is the culprit...? Perhaps a
> faulty motherboard?

Hm, after my experiences I would advise you to save time and headache and try
to replace everything but the ide disk at once. This is an easy and fast action
and gives you a chance to rule out any form of hardware error.

Regards,
Stephan


* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-28  9:26                           ` Ingo Oeser
@ 2003-08-28 19:09                             ` Ville Herva
  0 siblings, 0 replies; 21+ messages in thread
From: Ville Herva @ 2003-08-28 19:09 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: linux-kernel

On Thu, Aug 28, 2003 at 11:26:30AM +0200, you [Ingo Oeser] wrote:
> 
> But heavy (disk) IO and mysterious crashes sound like power problems,
> don't they?

Hmm. It doesn't crash, it locks up solid. (Well the aic7xxx driver sometimes
crashes (spits a huge log of errors, rather), but I'm still not sure if
that's related.)

The box only has two disks, 1.3GHz Celeron (~30W), and other lighter power
consumers. Not exactly a power hungry config. I'm not sure about the power
supply - I think it's a 250W one - I'll have to check.

According to sensors, the voltages do not fluctuate much. Also, the
temperatures are moderate (34.0°C system, 41.0°C CPU).

Power problems are surely possible, but don't exactly sound like a promising
lead to me.


-- v --

v@iki.fi


* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-27 11:30                         ` Stephan von Krawczynski
@ 2003-08-28  9:26                           ` Ingo Oeser
  2003-08-28 19:09                             ` Ville Herva
  0 siblings, 1 reply; 21+ messages in thread
From: Ingo Oeser @ 2003-08-28  9:26 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: Ville Herva, linux-kernel, tejun

On Wed, Aug 27, 2003 at 01:30:55PM +0200, Stephan von Krawczynski wrote:
> On Wed, 27 Aug 2003 14:04:17 +0300
> Ville Herva <vherva@niksula.hut.fi> wrote:
> > I don't think vga interferes with anything: I never run X on the box, and
> > even the text console remains quiescent as nothing is logged.
> 
> The thing I ran into once was not really an intensive use of VGA and its ints
> but rather some weird glitches in the boards' int logic that sometimes drove
> the software drivers crazy (was network back then).

I have seen this too, with some DSP board.

But heavy (disk) IO and mysterious crashes sound like power problems,
don't they?

Regards

Ingo Oeser


* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-27 23:22 Marcelo Tosatti
@ 2003-08-28  6:10 ` Ville Herva
  0 siblings, 0 replies; 21+ messages in thread
From: Ville Herva @ 2003-08-28  6:10 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Stephan von Krawczynski, lkml

On Wed, Aug 27, 2003 at 08:22:07PM -0300, you [Marcelo Tosatti] wrote:
> 
> Ville,
> 
> Which kernel doesn't hang on your box? 2.4 something?

2.4.20pre7 ran for over 9 months before it suddenly began locking up (I
_suppose_ it could just mean the bug/problem is hard to trigger.) Nothing
had been changed: the box had been up for that nine month period, and the
same oracle dump cron job had been running each night.

Earlier 2.4's had too many problems with aic7xxx (crashes and so on), so I
can't comment on them.

After 2.4.20pre7, I tried 2.4.21-jam1 (based on -aa patchset) and
2.4.22-pre8. I also tried compiling 2.4.21-jam1 with gcc-3.2.1 instead of
2.96. All of those locked up eventually, sometimes within a day from
reboot, sometimes it takes weeks. At one point, 2.4.21-jam1 seemed to
reliably lock up when compiling kernel, but it no longer happens no matter
how hard I try. Usually the lock up happens during nightly oracle backup
dump.

> 2.2?

2.2.22 and 2.2.22-secure1 never locked up, but I didn't run them for more
than a couple of months.

I realize I should try to change all pieces of hardware one at a time and
try various different kernels, but it all seems tedious and like shooting in the
dark: the lock up can take weeks to trigger. I would love to get some kind
of lead or theory on what causes the problem before resorting to brute force
search.

I think I'll try 2.4.22-final + a kernel debugger next. Or is there a better
way of getting information on hard lock ups than kdb? (Tried nmi watchdog
and sysrq already).


-- v --

v@iki.fi


* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-28  5:40                   ` Ville Herva
@ 2003-08-28  5:57                     ` Tejun Huh
  0 siblings, 0 replies; 21+ messages in thread
From: Tejun Huh @ 2003-08-28  5:57 UTC (permalink / raw)
  To: Ville Herva, TeJun Huh, Stephan von Krawczynski, linux-kernel

On Thu, Aug 28, 2003 at 08:40:00AM +0300, Ville Herva wrote:
> On Thu, Aug 28, 2003 at 10:13:41AM +0900, you [TeJun Huh] wrote:
> > 
> >  Your problem sounds very similar to the problem we were suffering from.
> > The problem was a spinlock deadlock inside drivers/char/random.c which
> > is used by tcp to generate random initial sequence number.  The bug
> > fix was checked into 2.4 tree on 28th July after the release of pre8
> > at 24th July.
> 
> Uhh, I tried 2.4.22pre8 a while ago (I think it was Herbert Pötzl's
> suggestion), and it locked up too. Shame that the fix didn't make it in
> it...
> 
> I'll give .22-final a spin.
>  
> > This problem can happen on UP machine if the kernel is compiled with
> > CONFIG_SMP.
> 
> This is UP box and the kernel is _not_ compiled with CONFIG_SMP.

 Then, it should be a different problem.  That deadlock wouldn't occur
with UP kernel.

> > Because the offending routine is called only every five
> > minutes and it should receive a SYN packet while it's connecting, it
> > occurs rarely, but it happens when it happens.
> 
> In my case, the lock up seems clearly related to disk io: it usually happens
> during the nightly oracle backup dump, and at some point it kept happening
> while compiling kernel. (It's random, I can no longer reproduce it by just
> compiling a kernel.)
> 
> Do you still think it could be the same one?

 No, I don't think so anymore.  I think trying kdb/kgdb would be
better.

> >  Please try 2.4.22.
> > 
> > P.S. This bug is a real headache.  We had many servers deployed and
> > they all randomly locked up about every two or four weeks.  I believe
> > people should be warned about this one.
> 
> What's really strange is that the box kept running with 2.4.20pre7 for
> almost a year without problems (with the same oracle dump job in nightly
> cron), and then suddenly began acting up on the first day of my summer
> vacation...

 Good luck. :-)

-- 
tejun



* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-28  1:13                 ` TeJun Huh
@ 2003-08-28  5:40                   ` Ville Herva
  2003-08-28  5:57                     ` Tejun Huh
  0 siblings, 1 reply; 21+ messages in thread
From: Ville Herva @ 2003-08-28  5:40 UTC (permalink / raw)
  To: TeJun Huh; +Cc: Stephan von Krawczynski, linux-kernel

On Thu, Aug 28, 2003 at 10:13:41AM +0900, you [TeJun Huh] wrote:
> 
>  Your problem sounds very similar to the problem we were suffering from.
> The problem was a spinlock deadlock inside drivers/char/random.c which
> is used by tcp to generate random initial sequence number.  The bug
> fix was checked into 2.4 tree on 28th July after the release of pre8
> at 24th July.

Uhh, I tried 2.4.22pre8 a while ago (I think it was Herbert Pötzl's
suggestion), and it locked up too. Shame that the fix didn't make it in
it...

I'll give .22-final a spin.
 
> This problem can happen on UP machine if the kernel is compiled with
> CONFIG_SMP.

This is UP box and the kernel is _not_ compiled with CONFIG_SMP.

> Because the offending routine is called only every five
> minutes and it should receive a SYN packet while it's connecting, it
> occurs rarely, but it happens when it happens.

In my case, the lock up seems clearly related to disk io: it usually happens
during the nightly oracle backup dump, and at some point it kept happening
while compiling kernel. (It's random, I can no longer reproduce it by just
compiling a kernel.)

Do you still think it could be the same one?

>  Please try 2.4.22.
> 
> P.S. This bug is a real headache.  We had many servers deployed and
> they all randomly locked up about every two or four weeks.  I believe
> people should be warned about this one.

What's really strange is that the box kept running with 2.4.20pre7 for
almost a year without problems (with the same oracle dump job in nightly
cron), and then suddenly began acting up on the first day of my summer
vacation...


-- v --

v@iki.fi


* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-27  7:37               ` Ville Herva
  2003-08-27  9:30                 ` Stephan von Krawczynski
@ 2003-08-28  1:13                 ` TeJun Huh
  2003-08-28  5:40                   ` Ville Herva
  1 sibling, 1 reply; 21+ messages in thread
From: TeJun Huh @ 2003-08-28  1:13 UTC (permalink / raw)
  To: Ville Herva, Stephan von Krawczynski, linux-kernel

On Wed, Aug 27, 2003 at 10:37:58AM +0300, Ville Herva wrote:
> On Wed, Aug 27, 2003 at 09:21:39AM +0200, you [Stephan von Krawczynski] wrote:
> > 
> > Sorry, then you have to look for another explanation. 
> 
> Yep, but I don't have any reasonable suspects.
> 
> > Did you already try to exchange everything but the harddisks ?
> 
> No. Do you suspect faulty hardware?
> 
> Apart from perhaps Adaptec 2940 (Adaptecs always give me trouble), I
> believe the hw is pretty solid. It had no problems with 2.2 kernels.  Based
> on my experience, the i815 chipset is not that shaky (unlike the Via dung),
> and I would expect the Intel motherboard to be on the better side as well.
> 
> I can't completely rule faulty hw out, though.
> 
> Exchanging hw will be quite difficult, as the hangs take as much as three
> weeks to trigger (sometimes they happen within a day after reboot), the box
> is a production server, and I don't have much spare hardware atm.
> 
> What I had hoped for is to be able to get some information on where it hangs.
> But sysrq and nmi watchdog don't cut it...
> 

 Hello Ville.  Hello Stephan. :-)

 Your problem sounds very similar to the problem we were suffering from.
The problem was a spinlock deadlock inside drivers/char/random.c which
is used by tcp to generate random initial sequence number.  The bug
fix was checked into 2.4 tree on 28th July after the release of pre8
at 24th July.

ChangeSet@1.1019.1.7, 2003-07-24 14:21:29-03:00, marcelo@freak.distro.conectiva
    Changed EXTRAVERSION to -pre8
  TAG: v2.4.22-pre8

ChangeSet@1.1019.3.10, 2003-07-28 17:25:49-07:00, olof@austin.ibm.com
  [RANDOM]: Fix SMP deadlock in __check_and_rekey().

 This problem can happen on UP machine if the kernel is compiled with
CONFIG_SMP.  Because the offending routine is called only every five
minutes and it should receive a SYN packet while it's connecting, it
occurs rarely, but it happens when it happens.
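
Why the build option matters can be illustrated with a user-space analogy. The sketch below is not the random.c code, which this thread does not show; it assumes the deadlock is of the "same CPU tries to take a lock it already holds" kind, which is the only way a single CPU can wedge itself on a spinlock. `NoOpLock` and `would_self_deadlock` are hypothetical names. A plain `threading.Lock` plays the role of a spinlock in a CONFIG_SMP build, and `NoOpLock` plays the role of a spinlock in a UP build, where (as far as I understand 2.4) the locking compiles away to nothing.

```python
import threading


class NoOpLock:
    """Stand-in for a spinlock on a non-SMP build: taking it does nothing."""

    def acquire(self, timeout=-1):
        return True

    def release(self):
        pass

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc_info):
        self.release()
        return False


def would_self_deadlock(lock, timeout=0.2):
    """Return True if re-taking `lock` while already holding it would block."""
    with lock:
        # A real spinlock would spin forever here; the timeout lets us observe it.
        got_it = lock.acquire(timeout=timeout)
        if got_it:
            lock.release()
        return not got_it
```

With a real non-reentrant lock the second acquire never succeeds, while the no-op variant sails through, which matches the remark that the deadlock needs CONFIG_SMP even on a one-CPU box.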

 Please try 2.4.22.

P.S. This bug is a real headache.  We had many servers deployed and
they all randomly locked up about every two or four weeks.  I believe
people should be warned about this one.

-- 
tejun



* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
@ 2003-08-27 23:22 Marcelo Tosatti
  2003-08-28  6:10 ` Ville Herva
  0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2003-08-27 23:22 UTC (permalink / raw)
  To: Ville Herva; +Cc: Stephan von Krawczynski, lkml


Ville,

Which kernel doesn't hang on your box? 2.4 something? 2.2?



* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-27 11:04                       ` Ville Herva
@ 2003-08-27 11:30                         ` Stephan von Krawczynski
  2003-08-28  9:26                           ` Ingo Oeser
  0 siblings, 1 reply; 21+ messages in thread
From: Stephan von Krawczynski @ 2003-08-27 11:30 UTC (permalink / raw)
  To: Ville Herva; +Cc: linux-kernel, tejun

On Wed, 27 Aug 2003 14:04:17 +0300
Ville Herva <vherva@niksula.hut.fi> wrote:

> > You're right, it looks pretty clean and simple. Possibly the only thing I
> > would try is moving aic away from int 9 to int 10 or so. Int 9 sometimes
> > interferes with VGA int routing on broken boxes. But that is unlikely
> > (though simple to test).
> 
> I don't think vga interferes with anything: I never run X on the box, and
> even the text console remains quiescent as nothing is logged.

The thing I ran into once was not really an intensive use of VGA and its ints
but rather some weird glitches in the boards' int logic that sometimes drove
the software drivers crazy (was network back then).

Regards,
Stephan


* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-27 10:56                     ` Stephan von Krawczynski
@ 2003-08-27 11:04                       ` Ville Herva
  2003-08-27 11:30                         ` Stephan von Krawczynski
  0 siblings, 1 reply; 21+ messages in thread
From: Ville Herva @ 2003-08-27 11:04 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, tejun

On Wed, Aug 27, 2003 at 12:56:33PM +0200, you [Stephan von Krawczynski] wrote:
> 
> > Or, did you use kdb/kgdb in addition to serial console?
> 
> No.
 
Ok. 

I might give a debugger a shot anyway when I find the time.

> You're right, it looks pretty clean and simple. Possibly the only thing I would
> try is moving aic away from int 9 to int 10 or so. Int 9 sometimes interferes
> with VGA int routing on broken boxes. But that is unlikely (though simple to
> test).

I don't think vga interferes with anything: I never run X on the box, and
even the text console remains quiescent as nothing is logged.

Better test would perhaps be to get rid of Adaptec 2940 altogether and move
the rootfs on an ide disk. But that's not exactly convenient either...


-- v --

v@iki.fi


* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-27 10:13                   ` Ville Herva
@ 2003-08-27 10:56                     ` Stephan von Krawczynski
  2003-08-27 11:04                       ` Ville Herva
  0 siblings, 1 reply; 21+ messages in thread
From: Stephan von Krawczynski @ 2003-08-27 10:56 UTC (permalink / raw)
  To: Ville Herva; +Cc: linux-kernel, tejun

On Wed, 27 Aug 2003 13:13:13 +0300
Ville Herva <vherva@niksula.hut.fi> wrote:

> On Wed, Aug 27, 2003 at 11:30:27AM +0200, you [Stephan von Krawczynski]
> wrote:
> > 
> > Hm, did you try a serial console? On my side this was a big step forward.
> 
> Do you mean in your case nothing shown on monitor (I've disabled monitor
> blanking, so that is not it), sysrq key didn't work, nmi watchdog didn't
> trigger but you were still able to get output from serial console? An oops?

I often have X setups, so console output gets _somewhere_ in the background.

> Or, did you use kdb/kgdb in addition to serial console?

No.

> > What does your /proc/interrupts look like compared between 2.2 and 2.4 ?
> 
> I don't have 2.2 output at hand, but the 2.4.21-jam1 output doesn't seem too
> suspicious:

You're right, it looks pretty clean and simple. Possibly the only thing I would
try is moving aic away from int 9 to int 10 or so. Int 9 sometimes interferes
with VGA int routing on broken boxes. But that is unlikely (though simple to
test).

Regards,
Stephan



* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-27  9:30                 ` Stephan von Krawczynski
@ 2003-08-27 10:13                   ` Ville Herva
  2003-08-27 10:56                     ` Stephan von Krawczynski
  0 siblings, 1 reply; 21+ messages in thread
From: Ville Herva @ 2003-08-27 10:13 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, tejun

On Wed, Aug 27, 2003 at 11:30:27AM +0200, you [Stephan von Krawczynski] wrote:
> 
> Hm, did you try a serial console? On my side this was a big step forward.

Do you mean in your case nothing shown on monitor (I've disabled monitor
blanking, so that is not it), sysrq key didn't work, nmi watchdog didn't
trigger but you were still able to get output from serial console? An oops?

Or, did you use kdb/kgdb in addition to serial console?

> If you experience complete hangs it may be something around hanging
> interrupts.

Probably, yes.

> Did you play with apic/acpi etc. to try different interrupt handling? 

ACPI has never been enabled. I enabled local APIC when I enabled nmi
watchdog, so I've tried it on and off.

> What does your /proc/interrupts look like compared between 2.2 and 2.4 ?

I don't have 2.2 output at hand, but the 2.4.21-jam1 output doesn't seem too
suspicious:

cat /proc/interrupts 
           CPU0       
  0:    1675428          XT-PIC  timer
  1:          3          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  4:      19625          XT-PIC  serial
  9:      25447          XT-PIC  aic7xxx
 11:      25203          XT-PIC  eth0
 12:          0          XT-PIC  PS/2 Mouse
 14:     178082          XT-PIC  ide0
NMI:      16763 
LOC:    1675326 
ERR:          0
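
Comparing two such listings by eye gets tedious, so here is a small sketch of how the comparison could be automated. The function names `parse_interrupts` and `count_deltas` are hypothetical, and the parser only assumes the layout shown above: a label before the colon, one counter per CPU, then the controller and device names. Taking two snapshots some time apart and looking at the differences shows which interrupt lines are still advancing.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: (total_count, description)}."""
    table = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip the "CPU0" header and any other non-entry lines
        label, rest = line.split(":", 1)
        fields = rest.split()
        counts = []
        # Leading numeric fields are the per-CPU counters.
        while fields and fields[0].isdigit():
            counts.append(int(fields.pop(0)))
        table[label.strip()] = (sum(counts), " ".join(fields))
    return table


def count_deltas(before, after):
    """Return {label: increase in count} for labels present in both snapshots."""
    return {
        label: after[label][0] - before[label][0]
        for label in before
        if label in after
    }
```

A line whose count stops increasing while its device is supposedly busy would be worth a closer look, though on a box that locks up solid the last snapshot has to be logged somewhere that survives the hang.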



-- v --

v@iki.fi


* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-27  7:37               ` Ville Herva
@ 2003-08-27  9:30                 ` Stephan von Krawczynski
  2003-08-27 10:13                   ` Ville Herva
  2003-08-28  1:13                 ` TeJun Huh
  1 sibling, 1 reply; 21+ messages in thread
From: Stephan von Krawczynski @ 2003-08-27  9:30 UTC (permalink / raw)
  To: Ville Herva; +Cc: linux-kernel, tejun

On Wed, 27 Aug 2003 10:37:58 +0300
Ville Herva <vherva@niksula.hut.fi> wrote:

> > Did you already try to exchange everything but the harddisks ?
> 
> No. Do you suspect faulty hardware?
> [...]
> What I had hoped for is to be able to get some information on where it hangs.
> But sysrq and nmi watchdog don't cut it...

Hm, did you try a serial console? On my side this was a big step forward.
If you experience complete hangs it may be something around hanging interrupts.
Did you play with apic/acpi etc. to try different interrupt handling? What does
your /proc/interrupts look like compared between 2.2 and 2.4 ?

Regards,
Stephan


* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-27  7:21             ` Stephan von Krawczynski
@ 2003-08-27  7:37               ` Ville Herva
  2003-08-27  9:30                 ` Stephan von Krawczynski
  2003-08-28  1:13                 ` TeJun Huh
  0 siblings, 2 replies; 21+ messages in thread
From: Ville Herva @ 2003-08-27  7:37 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, tejun

On Wed, Aug 27, 2003 at 09:21:39AM +0200, you [Stephan von Krawczynski] wrote:
> 
> Sorry, then you have to look for another explanation. 

Yep, but I don't have any reasonable suspects.

> Did you already try to exchange everything but the harddisks ?

No. Do you suspect faulty hardware?

Apart from perhaps Adaptec 2940 (Adaptecs always give me trouble), I
believe the hw is pretty solid. It had no problems with 2.2 kernels.  Based
on my experience, the i815 chipset is not that shaky (unlike the Via dung),
and I would expect the Intel motherboard to be on the better side as well.

I can't completely rule faulty hw out, though.

Exchanging hw will be quite difficult, as the hangs take as much as three
weeks to trigger (sometimes they happen within a day after reboot), the box
is a production server, and I don't have much spare hardware atm.

What I had hoped for is to be able to get some information on where it hangs.
But sysrq and nmi watchdog don't cut it...


-- v --

v@iki.fi


* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-27  7:12           ` Ville Herva
@ 2003-08-27  7:21             ` Stephan von Krawczynski
  2003-08-27  7:37               ` Ville Herva
  0 siblings, 1 reply; 21+ messages in thread
From: Stephan von Krawczynski @ 2003-08-27  7:21 UTC (permalink / raw)
  To: Ville Herva; +Cc: linux-kernel, tejun

On Wed, 27 Aug 2003 10:12:59 +0300
Ville Herva <vherva@niksula.hut.fi> wrote:

> > > Perhaps this is related to the "Race condition in 2.4 tasklet handling
> > > (cli() broken?)" problem TeJun Huh and Stephan von Krawczynski have been
> > > discussing?
> > 
> > > This is no SMP box, is it? If it is not SMP, it is probably unrelated.
> 
> Yes, no SMP.

Sorry, then you have to look for another explanation. 
Did you already try to exchange everything but the harddisks ?

Regards,
Stephan


* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-27  7:03         ` Stephan von Krawczynski
@ 2003-08-27  7:12           ` Ville Herva
  2003-08-27  7:21             ` Stephan von Krawczynski
  0 siblings, 1 reply; 21+ messages in thread
From: Ville Herva @ 2003-08-27  7:12 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, tejun

On Wed, Aug 27, 2003 at 09:03:51AM +0200, you [Stephan von Krawczynski] wrote:
> > On Wed, Jul 30, 2003 at 09:10:03PM +0300, you [Ville Herva] wrote:
> > [...]
> >   - HW: Intel 815EEA2LU mobo, i815, Celeron Tualatin 1.3GHz. Adaptec 2940,
> >     9GB Seagate, HP C1537A tapedrive (not used), IBM-DTLA-305030 ide disk.
> >   - The aic7xxx driver has been acting up in the past: crashes on boot and
> >     sometimes at runtime too. I don't know if this is at all related to the
> >     lock ups.
> >   - Kernels tried: 2.4.22-pre8/gcc-2.96-85, 2.4.21-jam1/2.4.21-jam1, 
> >     2.4.21-jam1/gcc-3.2.1-2, 2.4.20pre7 -- all hang.

Forgot to mention: all fs's are ext2.
 
> > Perhaps this is related to the "Race condition in 2.4 tasklet handling
> > (cli() broken?)" problem TeJun Huh and Stephan von Krawczynski have been
> > discussing?
> 
> This is no SMP box, is it? If it is not SMP, it is probably unrelated.

Yes, no SMP.


-- v --

v@iki.fi


* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-08-27  6:43       ` 2.4.22pre8 hangs too (Re: 2.4.21-jam1 " Ville Herva
@ 2003-08-27  7:03         ` Stephan von Krawczynski
  2003-08-27  7:12           ` Ville Herva
  0 siblings, 1 reply; 21+ messages in thread
From: Stephan von Krawczynski @ 2003-08-27  7:03 UTC (permalink / raw)
  To: Ville Herva; +Cc: linux-kernel, tejun

On Wed, 27 Aug 2003 09:43:02 +0300
Ville Herva <vherva@niksula.hut.fi> wrote:

> On Wed, Jul 30, 2003 at 09:10:03PM +0300, you [Ville Herva] wrote:
> [...]
>   - HW: Intel 815EEA2LU mobo, i815, Celeron Tualatin 1.3GHz. Adaptec 2940,
>     9GB Seagate, HP C1537A tapedrive (not used), IBM-DTLA-305030 ide disk.
>   - The aic7xxx driver has been acting up in the past: crashes on boot and
>     sometimes at runtime too. I don't know if this is at all related to the
>     lock ups.
>   - Kernels tried: 2.4.22-pre8/gcc-2.96-85, 2.4.21-jam1/2.4.21-jam1, 
>     2.4.21-jam1/gcc-3.2.1-2, 2.4.20pre7 -- all hang.
> 
> Perhaps this is related to the "Race condition in 2.4 tasklet handling
> (cli() broken?)" problem TeJun Huh and Stephan von Krawczynski have been
> discussing?

This is no SMP box, is it? If it is not SMP, it is probably unrelated.

Regards,
Stephan


* Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)
  2003-07-30 18:10     ` Ville Herva
@ 2003-08-27  6:43       ` Ville Herva
  2003-08-27  7:03         ` Stephan von Krawczynski
  0 siblings, 1 reply; 21+ messages in thread
From: Ville Herva @ 2003-08-27  6:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: tejun, skraw

On Wed, Jul 30, 2003 at 09:10:03PM +0300, you [Ville Herva] wrote:
> 
> However, I just realized that all of those kernels were compiled with a fairly
> dubious gcc, version 2.96-85. I just compiled otherwise identically
> configured 2.4.21-jam1 with gcc-3.2.1-2. It'll take some time to tell
> whether this cures it. This is my main suspect now.

I celebrated too early.

The kernel compiled with gcc 3.2.1 20021207 (Red Hat Linux 8.0 3.2.1-2) hung
too, it just happened to take a little longer.

Short summary:

  - The hangs are solid: 
    - nothing in the log, nothing on the screen
    - no ctrl-alt-del, numlock
    - no sysrq-s, sysrq-u, sysrq-b 
    - nmi watchdog doesn't trigger
  - The hangs mostly happen when the nightly oracle backup dump is in
    progress
    - the oracle database is on an ide disk, oracle app and the dump
      destination are on an scsi disk (Adaptec 2940, SEAGATE ST19171W)
  - HW: Intel 815EEA2LU mobo, i815, Celeron Tualatin 1.3GHz. Adaptec 2940,
    9GB Seagate, HP C1537A tapedrive (not used), IBM-DTLA-305030 ide disk.
  - The aic7xxx driver has been acting up in the past: crashes on boot and
    sometimes at runtime too. I don't know if this is at all related to the
    lock ups.
  - Kernels tried: 2.4.22-pre8/gcc-2.96-85, 2.4.21-jam1/2.4.21-jam1, 
    2.4.21-jam1/gcc-3.2.1-2, 2.4.20pre7 -- all hang.

Perhaps this is related to the "Race condition in 2.4 tasklet handling
(cli() broken?)" problem TeJun Huh and Stephan von Krawczynski have been
discussing?

Any ideas?


-- v --

v@iki.fi


end of thread, other threads:[~2003-09-09  8:48 UTC | newest]

Thread overview: 21+ messages
2003-08-29 16:35 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs) Marcelo Tosatti
2003-08-29 19:57 ` Ville Herva
2003-09-09  7:05   ` Ville Herva
2003-09-09  8:48     ` Stephan von Krawczynski
  -- strict thread matches above, loose matches on Subject: below --
2003-08-27 23:22 Marcelo Tosatti
2003-08-28  6:10 ` Ville Herva
2003-07-29  7:39 2.4.21-jam1, aic7xxx-6.2.36: solid hangs, crashes on boot Ville Herva
2003-07-30  7:13 ` 2.4.22pre8 hangs too (Re: 2.4.21-jam1, aic7xxx-6.2.36: solid hangs) Ville Herva
2003-07-30 14:50   ` Marcelo Tosatti
2003-07-30 18:10     ` Ville Herva
2003-08-27  6:43       ` 2.4.22pre8 hangs too (Re: 2.4.21-jam1 " Ville Herva
2003-08-27  7:03         ` Stephan von Krawczynski
2003-08-27  7:12           ` Ville Herva
2003-08-27  7:21             ` Stephan von Krawczynski
2003-08-27  7:37               ` Ville Herva
2003-08-27  9:30                 ` Stephan von Krawczynski
2003-08-27 10:13                   ` Ville Herva
2003-08-27 10:56                     ` Stephan von Krawczynski
2003-08-27 11:04                       ` Ville Herva
2003-08-27 11:30                         ` Stephan von Krawczynski
2003-08-28  9:26                           ` Ingo Oeser
2003-08-28 19:09                             ` Ville Herva
2003-08-28  1:13                 ` TeJun Huh
2003-08-28  5:40                   ` Ville Herva
2003-08-28  5:57                     ` Tejun Huh
