linux-kernel.vger.kernel.org archive mirror
* Re: Fwd: Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
@ 2003-12-15 14:30 Ross Dickson
  2003-12-15 15:02 ` Craig Bradney
  0 siblings, 1 reply; 17+ messages in thread
From: Ross Dickson @ 2003-12-15 14:30 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: recbo, linux-kernel

>> APIC error on CPU0: 02(02) 
> > what?? no crash though. 
> [...] 
> > bob@where cat /proc/interrupts 
> > CPU0 
> > 0: 3350153 IO-APIC-edge timer 
> > 1: 5775 IO-APIC-edge i8042 
> > 2: 0 XT-PIC cascade 
> > 8: 1 IO-APIC-edge rtc 
> > 9: 0 IO-APIC-level acpi 
> > 12: 5385 IO-APIC-edge i8042 
> > 14: 10 IO-APIC-edge ide0 
> > 15: 10 IO-APIC-edge ide1 
> > 16: 1717957 IO-APIC-level ide2, ide3, eth0 
> > 19: 472929 IO-APIC-level ide4, ide5 
> > 21: 0 IO-APIC-level NVidia nForce2 
> > NMI: 822 
> > LOC: 3350073 
> > ERR: 35 
> > MIS: 15818 

>It looks like the infamous APIC delivery bug -- the "MIS" counter shows 
>how many level-triggered interrupts have been erroneously delivered as 
>edge-triggered ones. No wonder the system shows instability -- you have 
>noise problems at the APIC bus. 
 
Thanks Maciej
I was wondering about those. I had seen the workaround code but would not
have thought it needed to apply to recent Athlon chipsets?


For comparison here is my /proc/interrupts 
CPU0
  0:   50462204    IO-APIC-edge  timer
  1:      49153    IO-APIC-edge  keyboard
  2:          0          XT-PIC  cascade
  9:          0   IO-APIC-level  acpi
 12:     395912    IO-APIC-edge  PS/2 Mouse
 14:     995872    IO-APIC-edge  ide0
 15:        283    IO-APIC-edge  ide1
 16:    3921102   IO-APIC-level  nvidia
 18:          2   IO-APIC-level  bttv
 20:     136325   IO-APIC-level  eth0, usb-ohci
 21:     146903   IO-APIC-level  ehci_hcd, NVIDIA nForce Audio
 22:          0   IO-APIC-level  usb-ohci
NMI:          0
LOC:   50457798
ERR:          0
MIS:          0

Albatron KM18G-Pro, nforce2, Phoenix BIOS, 2200XP, 255fsb, ddr400,
ide0 is hard drive, ide1 is cdrom, nmi watchdog off

Report seems OK but this machine locks up hard without the apic delay patch.

I am currently trying the simpler v1 (always add a delay) patch but on all apic
acks as per this posting

http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-12/3291.html

which is a reply to an earlier posting of the same name, but I accidentally 
omitted the Re in the subject.

Regards,
Ross.



* Re: Fwd: Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-15 14:30 Fwd: Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered Ross Dickson
@ 2003-12-15 15:02 ` Craig Bradney
  2003-12-15 15:56   ` Maciej W. Rozycki
  2003-12-15 16:54   ` Ross Dickson
  0 siblings, 2 replies; 17+ messages in thread
From: Craig Bradney @ 2003-12-15 15:02 UTC (permalink / raw)
  To: ross; +Cc: Maciej W. Rozycki, recbo, linux-kernel

Just to give the status here ...
I'm still running the original 2.6 test 11 patches for apic and ioapic.
Uptime is now 2d 20h with lots of idle time and hard work too.. 

/proc/interrupts as follows:

           CPU0
  0:  245382420    IO-APIC-edge  timer
  1:     139577    IO-APIC-edge  i8042
  2:          0          XT-PIC  cascade
  8:          3    IO-APIC-edge  rtc
  9:          0   IO-APIC-level  acpi
 12:    1478615    IO-APIC-edge  i8042
 14:    1055548    IO-APIC-edge  ide0
 15:     737664    IO-APIC-edge  ide1
 19:   18405692   IO-APIC-level  radeon@PCI:3:0:0
 21:    5257090   IO-APIC-level  ehci_hcd, NVidia nForce2, eth0
 22:          3   IO-APIC-level  ohci1394
NMI:      14944
LOC:  245087891
ERR:          0
MIS:          6

As for NMI.. I actually forget which I booted from... I think =1, but NMI is a small number now.. would it have wrapped?

Craig
A7N8X Deluxe V2 BIOS 1007



On Mon, 2003-12-15 at 15:30, Ross Dickson wrote:
> >> APIC error on CPU0: 02(02) 
> > > what?? no crash though. 
> > [...] 
> > > bob@where cat /proc/interrupts 
> > > CPU0 
> > > 0: 3350153 IO-APIC-edge timer 
> > > 1: 5775 IO-APIC-edge i8042 
> > > 2: 0 XT-PIC cascade 
> > > 8: 1 IO-APIC-edge rtc 
> > > 9: 0 IO-APIC-level acpi 
> > > 12: 5385 IO-APIC-edge i8042 
> > > 14: 10 IO-APIC-edge ide0 
> > > 15: 10 IO-APIC-edge ide1 
> > > 16: 1717957 IO-APIC-level ide2, ide3, eth0 
> > > 19: 472929 IO-APIC-level ide4, ide5 
> > > 21: 0 IO-APIC-level NVidia nForce2 
> > > NMI: 822 
> > > LOC: 3350073 
> > > ERR: 35 
> > > MIS: 15818 
> 
> >It looks like the infamous APIC delivery bug -- the "MIS" counter shows 
> >how many level-triggered interrupts have been erroneously delivered as 
> >edge-triggered ones. No wonder the system shows instability -- you have 
> >noise problems at the APIC bus. 
>  
> Thanks Maciej
> I was wondering about those, I had seen the work around code and would not
> have thought it need apply to recent athlon chipsets?
> 
> 
> For comparison here is my proc/interrupts 
> CPU0
>   0:   50462204    IO-APIC-edge  timer
>   1:      49153    IO-APIC-edge  keyboard
>   2:          0          XT-PIC  cascade
>   9:          0   IO-APIC-level  acpi
>  12:     395912    IO-APIC-edge  PS/2 Mouse
>  14:     995872    IO-APIC-edge  ide0
>  15:        283    IO-APIC-edge  ide1
>  16:    3921102   IO-APIC-level  nvidia
>  18:          2   IO-APIC-level  bttv
>  20:     136325   IO-APIC-level  eth0, usb-ohci
>  21:     146903   IO-APIC-level  ehci_hcd, NVIDIA nForce Audio
>  22:          0   IO-APIC-level  usb-ohci
> NMI:          0
> LOC:   50457798
> ERR:          0
> MIS:          0
> 
> Albatron KM18G-Pro, nforce2, pheonix bios, 2200XP, 255fsb, ddr400,
> ide0 is hard drive, ide1 is cdrom, nmi watchdog off
> 
> Report seems OK but this machine locks up hard without the apic delay patch.
> 
> I am currently trying the simpler v1 (always add a delay) patch but on all apic
> acks as per this posting
> 
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-12/3291.html
> 
> which is a reply to an earlier posting of the same name but I accidently 
> omitted the Re in the subject.
> 
> Regards,
> Ross.
> 
> 



* Re: Fwd: Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-15 15:02 ` Craig Bradney
@ 2003-12-15 15:56   ` Maciej W. Rozycki
  2003-12-15 16:54   ` Ross Dickson
  1 sibling, 0 replies; 17+ messages in thread
From: Maciej W. Rozycki @ 2003-12-15 15:56 UTC (permalink / raw)
  To: Craig Bradney; +Cc: ross, recbo, linux-kernel

On Mon, 15 Dec 2003, Craig Bradney wrote:

>            CPU0
>   0:  245382420    IO-APIC-edge  timer
>   1:     139577    IO-APIC-edge  i8042
>   2:          0          XT-PIC  cascade
>   8:          3    IO-APIC-edge  rtc
>   9:          0   IO-APIC-level  acpi
>  12:    1478615    IO-APIC-edge  i8042
>  14:    1055548    IO-APIC-edge  ide0
>  15:     737664    IO-APIC-edge  ide1
>  19:   18405692   IO-APIC-level  radeon@PCI:3:0:0
>  21:    5257090   IO-APIC-level  ehci_hcd, NVidia nForce2, eth0
>  22:          3   IO-APIC-level  ohci1394
> NMI:      14944
> LOC:  245087891
> ERR:          0
> MIS:          6
> 
> As for NMI.. I actually forget which I booted from... I think =1, but NMI is a small number now.. would it have wrapped?

 That's "=2" -- otherwise the NMI count would be roughly the same as the
sum of counts for IRQ 0 for all processors.  And you can actually get your
kernel's command line from /proc/cmdline.
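
The check described above can be written down as a small user-space helper. This is only a minimal sketch: the function names and the 1% tolerance are assumptions made for illustration, not anything taken from the kernel. It pulls the nmi_watchdog= value out of a command-line string (as read from /proc/cmdline) and, separately, guesses the mode from the IRQ 0 and NMI counts of a uniprocessor /proc/interrupts.

```c
#include <stdlib.h>
#include <string.h>

/* Return N from "nmi_watchdog=N" in a kernel command line, or -1 if the
 * option is absent. Only matches at the start of a word. */
int nmi_watchdog_from_cmdline(const char *cmdline)
{
    static const char key[] = "nmi_watchdog=";
    const char *p = cmdline;

    while ((p = strstr(p, key)) != NULL) {
        if (p == cmdline || p[-1] == ' ')
            return atoi(p + strlen(key));
        p += strlen(key);
    }
    return -1;
}

/* Heuristic guess (assumed 1% tolerance, hypothetical helper):
 *   0 -> no NMIs counted, watchdog apparently off
 *   1 -> NMI count tracks the IRQ 0 count, as expected for nmi_watchdog=1
 *   2 -> NMI count differs widely, as expected for nmi_watchdog=2 */
int guess_nmi_mode(unsigned long irq0, unsigned long nmi)
{
    unsigned long diff = irq0 > nmi ? irq0 - nmi : nmi - irq0;

    if (nmi == 0)
        return 0;
    return diff <= irq0 / 100 ? 1 : 2;
}
```

With the counts quoted above (IRQ 0 at 245382420, NMI at 14944) the guess comes out as 2.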

-- 
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--------------------------------------------------------------+
+        e-mail: macro@ds2.pg.gda.pl, PGP key available        +


* Re: Fwd: Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-15 15:02 ` Craig Bradney
  2003-12-15 15:56   ` Maciej W. Rozycki
@ 2003-12-15 16:54   ` Ross Dickson
  2003-12-16  6:07     ` Bob
  1 sibling, 1 reply; 17+ messages in thread
From: Ross Dickson @ 2003-12-15 16:54 UTC (permalink / raw)
  To: Craig Bradney; +Cc: recbo, linux-kernel, Ian Kumlien

On Tuesday 16 December 2003 01:02, you wrote:
> Just to give the status here ...
> Im still running the original 2.6 test 11 patches for apic and ioapic.
> Uptime is now 2d 20h with lots of idle time and hard work too.. 
> 
> /proc/interrupts as follows:
> 
>            CPU0
>   0:  245382420    IO-APIC-edge  timer
>   1:     139577    IO-APIC-edge  i8042
>   2:          0          XT-PIC  cascade
>   8:          3    IO-APIC-edge  rtc
>   9:          0   IO-APIC-level  acpi
>  12:    1478615    IO-APIC-edge  i8042
>  14:    1055548    IO-APIC-edge  ide0
>  15:     737664    IO-APIC-edge  ide1
>  19:   18405692   IO-APIC-level  radeon@PCI:3:0:0
>  21:    5257090   IO-APIC-level  ehci_hcd, NVidia nForce2, eth0
>  22:          3   IO-APIC-level  ohci1394
> NMI:      14944
> LOC:  245087891
> ERR:          0
> MIS:          6

Uptime sounds good so far.
 
I am not convinced my v2 apic patch is a great overall improvement; I am 
thinking the v1 apic patch is safer for now. 

Having said that
Ian Kumlien currently has an uptime of
1 day, 15 hours +
on v2 patches but with the apic delay timeout increased from 600UL to 800UL.
He has a Barton core - see below.

> 
> Craig
> A7N8X Deluxe V2 BIOS 1007
> 
> 
<snip>

> > I am currently trying the simpler v1 (always add a delay) patch but on all apic
> > acks as per this posting
> > 
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-12/3291.html
> > 
> > which is a reply to an earlier posting of the same name but I accidently 
> > omitted the Re in the subject.
> > 

I don't think it is necessary to put the delay in all apic acks - I just tried it 
to see if it worked and have not yet put my code back the way it was. 
My hard lockups went away with the original v1 apic 
timer delay patch anyway.

Please note that in the posting above I wrote that I stuffed up the #ifdefs
in my v1 and v2 patches; adjust your code accordingly. The patches worked,
but they were only testing the first config item after the #ifdef.

apic code should have had
#if defined(CONFIG_MK7) && defined(CONFIG_BLK_DEV_AMD74XX)

ioapic code should have had
#if defined(CONFIG_ACPI_BOOT) && defined(CONFIG_X86_UP_IOAPIC)
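
To illustrate the difference, here is a minimal stand-alone sketch. FEATURE_A, FEATURE_B and workaround_enabled() are hypothetical names invented for the example, not real kernel config symbols. With only FEATURE_A defined, the corrected form leaves the workaround out, whereas the broken form (shown only in the comment) would have compiled it in.

```c
/* Broken form: "#ifdef" takes a single identifier, so
 *
 *     #ifdef FEATURE_A && FEATURE_B
 *
 * tests FEATURE_A only; the rest of the line is ignored (compilers
 * typically warn about extra tokens). */

#define FEATURE_A 1
/* FEATURE_B is deliberately left undefined. */

/* Corrected form: both symbols must be defined. */
#if defined(FEATURE_A) && defined(FEATURE_B)
#define WORKAROUND_ENABLED 1
#else
#define WORKAROUND_ENABLED 0
#endif

/* Reports whether the guarded block would have been compiled in. */
int workaround_enabled(void)
{
    return WORKAROUND_ENABLED;
}
```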

Brief summary at this point

1) 2? reports are in that latest award bios with "C1 disconnect" set to "auto?"
 may remove need for apic ack delay patch and still keep cpu thermo managed

2) apic ack delay v1 patch seems safe for all cpu cores but introduces a small
 delay of about half the time of an XTPIC access on each apic timer interrupt
	
3) apic ack delay v2 patch seems safe only on barton cores and gives more debugging
 info and wastes less time than apic v1 patch

4) io-apic v2 patch gives more debugging info but functions same as io-apic v1 patch

Regards
Ross



* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-15 16:54   ` Ross Dickson
@ 2003-12-16  6:07     ` Bob
  0 siblings, 0 replies; 17+ messages in thread
From: Bob @ 2003-12-16  6:07 UTC (permalink / raw)
  To: linux-kernel

Ross,

my_make_script nf2-800UL 2>&1 | tee /tmp/make.err

#/tmp/make.err
<snip>

  CC      arch/i386/kernel/apic.o
arch/i386/kernel/apic.c: In function `smp_apic_timer_interrupt':
arch/i386/kernel/apic.c:1105: warning: unsigned int format, long unsigned int arg (arg 2)


...which is around the printk line here--

                       printk("..APIC TIMER ack delay, reload:%u, safe:%u\n",

+                if(!passno) { /* calculate timing */
+                        safecnt = apic_read(APIC_TMICT) -
+                                ( (800UL * apic_read(APIC_TMICT) ) /
+                                (1000000000UL/HZ) );
+                        printk("..APIC TIMER ack delay, reload:%u, safe:%u\n",
+                                apic_read(APIC_TMICT), safecnt);
+                        passno++;


Here are the two patches with "#ifdef N" to "#if defined(N)" change
but not the unsigned int change --


diff -urN linux-2.6.0-test11/arch/i386/kernel/apic.c linux-2.6.0-test11-nf2/arch/i386/kernel/apic.c
--- linux-2.6.0-test11/arch/i386/kernel/apic.c	2003-11-26 15:46:07.000000000 -0500
+++ linux-2.6.0-test11-nf2/arch/i386/kernel/apic.c	2003-12-13 23:48:30.000000000 -0500
@@ -1089,6 +1089,37 @@
 	 */
 	irq_stat[cpu].apic_timer_irqs++;
 
+#if defined(CONFIG_MK7) && defined(CONFIG_BLK_DEV_AMD74XX)
+        /*
+         * on 2200XP & nforce2 chipset we need 600ns? 800? 1000? 1100?
+         * from timer irq start to apic irq ack to prevent
+         * hard lockups, use apic timer itself.
+         * C1 disconnect bit related.  Ross Dickson.
+         */
+        {
+                static unsigned int passno, safecnt;
+                if(!passno) { /* calculate timing */
+                        safecnt = apic_read(APIC_TMICT) -
+                                ( (800UL * apic_read(APIC_TMICT) ) /
+                                (1000000000UL/HZ) );
+                        printk("..APIC TIMER ack delay, reload:%u, safe:%u\n",
+                                apic_read(APIC_TMICT), safecnt);
+                        passno++;
+                }
+#if APIC_DEBUG
+                if(passno<12) {
+                        unsigned int at1 = apic_read(APIC_TMCCT);
+                        if( passno > 1 )
+                                Dprintk("..APIC TIMER ack delay, predelay count:%u \n", at1 );
+                        passno++;
+                }
+#endif
+                /* delay only if required */
+                while( apic_read(APIC_TMCCT) > safecnt )
+                        ndelay(100);
+        }
+#endif
+
 	/*
 	 * NOTE! We'd better ACK the irq immediately,
 	 * because timer handling can be slow.*/


diff -urN linux-2.6.0-test11/arch/i386/kernel/io_apic.c linux-2.6.0-test11-nf2/arch/i386/kernel/io_apic.c
--- linux-2.6.0-test11/arch/i386/kernel/io_apic.c	2003-11-26 15:43:32.000000000 -0500
+++ linux-2.6.0-test11-nf2/arch/i386/kernel/io_apic.c	2003-12-13 15:14:25.000000000 -0500
@@ -2128,6 +2128,54 @@
 		printk(KERN_ERR "..MP-BIOS bug: 8254 timer not connected to IO-APIC\n");
 	}
 
+#if defined(CONFIG_ACPI_BOOT) && defined(CONFIG_X86_UP_IOAPIC)
+        /* for nforce2 try vector 0 on pin0
+         * Note 8259a is already masked, also by default
+         * the io_apic_set_pci_routing call disables the 8259 irq 0
+         * so we must be connected directly to the 8254 timer if this works
+         * Note2: this violates the above comment re Subtle but works!
+         */
+        printk(KERN_INFO "..TIMER: Is timer irq0 connected to IOAPIC Pin0? ...\n");
+        if (pin1 != -1) {
+                extern spinlock_t i8259A_lock;
+                unsigned long flags;
+                int tok, saved_timer_ack = timer_ack;
+                /*
+                 * Ok, does IRQ0 through the IOAPIC work?
+                 */
+                io_apic_set_pci_routing ( 0, 0, 0, 0, 0); /* connect pin */
+                unmask_IO_APIC_irq(0);
+                timer_ack = 0;
+
+                /*
+
+
+
+                 * Ok, does IRQ0 through the IOAPIC work?
+                 */
+                spin_lock_irqsave(&i8259A_lock, flags);
+                Dprintk("..TIMER check 8259 ints disabled, imr1:%02x, imr2:%02x\n", inb(0x21), inb(0xA1));
+                tok = timer_irq_works();
+                spin_unlock_irqrestore(&i8259A_lock, flags);
+                if (tok) {
+                        if (nmi_watchdog == NMI_IO_APIC) {
+                                disable_8259A_irq(0);
+                                setup_nmi();
+                                enable_8259A_irq(0);
+                                check_nmi_watchdog();
+                        }
+                        printk(KERN_INFO "..TIMER: works OK on apic pin0 irq0\n" );
+                        return;
+                }
+                /* failed */
+                timer_ack = saved_timer_ack;
+                clear_IO_APIC_pin(0, 0);
+                io_apic_set_pci_routing ( 0, pin1, 0, 0, 0);
+                printk(KERN_ERR "..MP-BIOS bug: 8254 timer not connected to IO-APIC Pin 0\n");
+        }
+/* end new stuff for nforce2 */
+#endif
+
 	printk(KERN_INFO "...trying to set up timer (IRQ0) through the 8259A ... ");
 	if (pin2 != -1) {
 		printk("\n..... (found pin %d) ...", pin2);
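
The safe-count arithmetic in the apic.c patch above can be checked in user space. The following is a minimal sketch under stated assumptions: apic_safe_count() and polls_until_safe() are hypothetical helpers, the 64-bit intermediates are a choice made here to rule out overflow (the patch itself computes in unsigned long), and the down-counting timer is simulated with a fixed step per poll rather than read from real hardware.

```c
#include <stdint.h>

/* Count value below which at least delay_ns has elapsed since the timer
 * was reloaded. The APIC timer counts down from `reload` once per tick,
 * and one tick lasts 1000000000/hz nanoseconds. */
uint32_t apic_safe_count(uint32_t reload, uint32_t delay_ns, uint32_t hz)
{
    uint64_t ns_per_tick = 1000000000ULL / hz;
    uint64_t counts = ((uint64_t)delay_ns * reload) / ns_per_tick;

    return reload - (uint32_t)counts;
}

/* Simulation of the "delay only if required" loop: the counter drops by
 * `step` on every poll; returns how many polls were needed before the
 * counter was no longer above safecnt. */
unsigned polls_until_safe(uint32_t current, uint32_t safecnt, uint32_t step)
{
    unsigned polls = 0;

    if (step == 0)
        return 0;               /* a stalled counter would never finish */
    while (current > safecnt) {
        current = current > step ? current - step : 0;
        polls++;
    }
    return polls;
}
```

For an assumed reload value of 100000 at HZ=1000, an 800 ns delay corresponds to 80 counts, so the safe count is 99920; with 600 ns it is 99940. If the handler is entered late enough that the counter is already at or below the safe count, the loop does not spin at all.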




* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-15 13:11   ` Maciej W. Rozycki
@ 2003-12-16  7:18     ` Bob
  0 siblings, 0 replies; 17+ messages in thread
From: Bob @ 2003-12-16  7:18 UTC (permalink / raw)
  To: linux-kernel

apic.c patch needs reload:%lu instead of %u  ---------->
printk("..APIC TIMER ack delay, reload:%lu, safe:%u\n",



amd xp3000+, 1:1 333mhz fsb to ram, 166mhz cpu
bus clock x dual channel 2-512mb pc3200 tested cas2
sticks, 1:1 fsb to ram for 333mhz, Award bios with
update that works for non-crashing but not for edge
timer without patch.  MSI K7N2 Delta MCP2-T mbo
linux-2.6.0-test11

This was with 3ware controller and unpatched 2.6.0-test11
Note low MIS score but PIC timer and no nmi--

          CPU0
 0:  244393560          XT-PIC  timer
 1:      31963    IO-APIC-edge  i8042
 2:          0          XT-PIC  cascade
 8:          1    IO-APIC-edge  rtc
 9:          0   IO-APIC-level  acpi
12:     251884    IO-APIC-edge  i8042
14:         22    IO-APIC-edge  ide0
15:         24    IO-APIC-edge  ide1
16:    4290216   IO-APIC-level  3ware Storage Controller, yenta, yenta
17:    5929405   IO-APIC-level  eth0
21:          0   IO-APIC-level  NVidia nForce2
NMI:          0
LOC:  244378698
ERR:          0
MIS:          6

Next is with the first edge timer patch, nmi_watchdog=2
works but =1 does not, MIS really high("noisy bus"),
replacing 3ware with promise cards and hdparm udma133
causes apic error logged to console during bonnie++ test--

>>APIC error on CPU0: 02(02)
>>what?? no crash though.
>>    
>>
>>bob@where cat /proc/interrupts
>>           CPU0      
>>  0:    3350153    IO-APIC-edge  timer
>>  1:       5775    IO-APIC-edge  i8042
>>  2:          0          XT-PIC  cascade
>>  8:          1    IO-APIC-edge  rtc
>>  9:          0   IO-APIC-level  acpi
>> 12:       5385    IO-APIC-edge  i8042
>> 14:         10    IO-APIC-edge  ide0
>> 15:         10    IO-APIC-edge  ide1
>> 16:    1717957   IO-APIC-level  ide2, ide3, eth0
>> 19:     472929   IO-APIC-level  ide4, ide5
>> 21:          0   IO-APIC-level  NVidia nForce2
>>NMI:        822
>>LOC:    3350073
>>ERR:         35
>>MIS:      15818
>>    
>>

now with promise controllers again, new edge timer patch
permits nmi_watchdog=1 not =2, lots of nmi ticks, MIS count
is only half with first timer patch, NMI ticks = LOC?


bob@where cat /proc/interrupts
           CPU0      
  0:   46188571    IO-APIC-edge  timer
  1:      12396    IO-APIC-edge  i8042
  2:          0          XT-PIC  cascade
  8:          1    IO-APIC-edge  rtc
  9:          0   IO-APIC-level  acpi
 12:     147429    IO-APIC-edge  i8042
 14:         10    IO-APIC-edge  ide0
 15:         10    IO-APIC-edge  ide1
 16:    1413705   IO-APIC-level  ide2, ide3, eth0
 17:          0   IO-APIC-level  yenta, yenta
 19:     258804   IO-APIC-level  ide4, ide5
 21:          0   IO-APIC-level  NVidia nForce2
NMI:   46188592
LOC:   46188482
ERR:         36
MIS:       6877
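
Since the two runs had very different uptimes, raw MIS counts are hard to compare; scaling them by the timer count is one way to do it. The sketch below is an assumption-laden helper, not an existing tool: read_counter() and per_million() are hypothetical names, and the parser only understands the simple single-CPU "NAME: value" layout shown in this thread.

```c
#include <stdio.h>
#include <string.h>

/* Find a row such as "MIS:" or "0:" in /proc/interrupts text and store
 * its first (CPU0) count in *out. Returns 0 on success, -1 if not found. */
int read_counter(const char *text, const char *name, unsigned long *out)
{
    size_t n = strlen(name);
    const char *line = text;

    while (line && *line) {
        const char *p = line;

        while (*p == ' ')
            p++;                          /* skip leading padding */
        if (strncmp(p, name, n) == 0 && p[n] == ':')
            return sscanf(p + n + 1, "%lu", out) == 1 ? 0 : -1;
        line = strchr(line, '\n');
        if (line)
            line++;                       /* step past the newline */
    }
    return -1;
}

/* Events per million timer ticks; 0.0 if no ticks were counted. */
double per_million(unsigned long count, unsigned long ticks)
{
    return ticks ? (double)count * 1e6 / (double)ticks : 0.0;
}
```

Feeding in the two tables above (MIS 15818 against timer 3350153, and MIS 6877 against timer 46188571) gives a much lower rate for the second run than the raw "half" suggests.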

Now I'll try 800UL/100ndelay to see if it helps with
MIS count(pseudo-sci masochism), be back in a while.

Oh, by the way, I set debug 1 in apic.h but I don't
see anything, and I thought I saw a compile error
flash by, so now I'll compile > logfile 2>&1 and
might see why I don't see--

"..APIC TIMER ack delay, predelay count: 20769"

I don't see any of that debug stuff. Maybe the compile
errors I found were it, see my previous message about
"unsigned int format", maybe printk needs %lu(I don't
know hardly nuffing yet). I'm going to boot 800UL/100ndelay
now.


it needs reload:%lu instead of %u  ---------->
printk("..APIC TIMER ack delay, reload:%lu, safe:%u\n",

Ross: "Can you also advise if your bios setting of the
"C1 disconnect" is set"

I can only guess by my 41C low load 48C high load
temps exactly equal to range for "2.1Ghz 333mhz"
of Ian Kumlien(his?) which is same speed as mine,
that probably cpu disconnect is not on. I have
no visible choice in setup for cpu disconnect.
I'll try athcool to see how disconnect is set.

Ross:"I have heard lockups are not supposed to happen
at all if the fsb (host bus clock speed) matches the
ddr speed. One of my systems went about 4 hours (xp2500
333fsb, DDR333) without the apic delay patch on a phoenix 
bios before lockup"

A couple of months ago I was overly optimistic a couple
of times before the bios update, and it seemed to work
to use 1:1 and only amd74xx onboard hd controller, no
hd cards, and pre-emptive, anticipatory sched not
deadline, apic off in setup but on in linux, lapic
off, acpi on. It was almost stable if using only one
drive, but I really can't go without hd cards for
software raid, so the first fsck on boot if using hd
card, and crash. I could finesse stability by using
options but never quite reach reliability without a
bios update, and certain functions need patching, and
I still have "MIS count, noisy bus" and agp8 crash(I can
use the X nv driver and agpgart no problem, but not nvidia
drivers for X and agp8).




* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-13  9:20 Ross Dickson
@ 2003-12-13  9:51 ` Bob
  0 siblings, 0 replies; 17+ messages in thread
From: Bob @ 2003-12-13  9:51 UTC (permalink / raw)
  To: linux-kernel

Ross Dickson wrote:

>>>So the fix was absolutely a BIOS fix. 
>>>
>>>      
>>>
><snip>
>
>==That's why I'm trying to contact shuttle.
>Jesse
>  
>
R> Good Work Jesse, I hope shuttle give up some info - especially as I have
pheonix bioses and they are doing ?? about it? -Ross

B> I was expecting to hear that.

I have an Award bios on MSI nforce2 mboard. Their bios flash
file begins with
"w" for Award  w6570nms.760(W6570 v760 bios flash file)
"p" phoenix
"a" ami
also appears at boot but goes by in a flash
and appears on first cmos setup page

So Award bios has a fix for the nforce2.

How about Jesse's bios that can fix the
problem without a kernel patch, as my
Award bios is doing? What kind of bios
is that you have, Jesse?

My Award bios does not make any way for
me to have ioapic edge timer turn on,
though. I need a patch to get that on.

Also I don't have a cpu disconnect choice in
setup and by running temp range 41C to 48C
I guess cpu disconnect is not on. 48C once in
a while does not hurt anything though.  -Bob

>>...but we're stuck looking at smoke and mirrors, 
>>when the kernel might be able to work around 
>>bioses that have not been "updated". Or to put 
>>it another way, "voodoo" may be done by 
>>kernel if not done by bios. Whatever is being 
>>tweaked may be accessible to kernel code. 
>>    
>>
><snip>
>Bob
>
>Please ignore the following if you are already up to speed on SMM. Some
>readers may not know why we cannot do all that the bios can do aside from
>a lack of information.
> 
>Agreed but the keywords are might and may. I remember doing dos based data acquisition 
>with 486SX laptops and then Intel brought out the 486Sl and our pulse counting 
>went bad because of the power saving core. I got the data book from Intel and
>was very dismayed to see that bios code was being executed when I thought our code
>was running and there was not a darn thing I could do about it and keep the
>laptop warranty intact. 
>
>Its offspring as you may already know is SMM. It is a priviledged mode that we can
>do pretty much squat about. It can pop up anywhere in the middle of our code 
>and the only thing we will know about it aside from missing time is when it has
>stuffed something up - like setting registers back to the wrong values. Think of
>it like a kernel within our kernel with permissions set so it can hack us but we
>cannot hack it.
>
>Maciej recently writes of its continuing effect on NMI debug here.
>
>http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-12/2940.html
>
>Regards
>Ross
>
>  
>
Thanks for explaining. We got some new functionality
just by turning nmi_watchdog on but I don't know if
anybody has learned anything from the extra debug
have they, as far as this nforce2 timing thing?     -Bob


* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
@ 2003-12-13  9:20 Ross Dickson
  2003-12-13  9:51 ` Bob
  0 siblings, 1 reply; 17+ messages in thread
From: Ross Dickson @ 2003-12-13  9:20 UTC (permalink / raw)
  To: linux-kernel

<snip>
>>I decided to wait till this morning, to try the BIOS "C1 Disconnect" set to
>> enabled. Still no lockups under this kernel. Tried a vanilla kernel, no
>> lockups (but timer and watchdog messed up still). Now that I read
>> your message Bob, I understand what you are saying. Luckily, the 
>>updated BIOS changelog states "Add C1 disconnect item." And this exact
>> version seems to have fixed it, and now we have an exact fix (another one?) 
>>to refer to. 
> > 
> >So the fix was absolutely a BIOS fix. 
> > 
<snip>

==That's why I'm trying to contact shuttle.
Jesse

Good work Jesse, I hope Shuttle gives up some info - especially as I have
Phoenix BIOSes and they are doing ?? about it?


> ...but we're stuck looking at smoke and mirrors, 
> when the kernel might be able to work around 
> bioses that have not been "updated". Or to put 
> it another way, "voodoo" may be done by 
> kernel if not done by bios. Whatever is being 
> tweaked may be accessible to kernel code. 
<snip>
Bob

Please ignore the following if you are already up to speed on SMM. Some
readers may not know why we cannot do all that the bios can do aside from
a lack of information.
 
Agreed but the keywords are might and may. I remember doing dos based data acquisition 
with 486SX laptops and then Intel brought out the 486Sl and our pulse counting 
went bad because of the power saving core. I got the data book from Intel and
was very dismayed to see that bios code was being executed when I thought our code
was running and there was not a darn thing I could do about it and keep the
laptop warranty intact. 

Its offspring as you may already know is SMM. It is a privileged mode that we can
do pretty much squat about. It can pop up anywhere in the middle of our code 
and the only thing we will know about it aside from missing time is when it has
stuffed something up - like setting registers back to the wrong values. Think of
it like a kernel within our kernel with permissions set so it can hack us but we
cannot hack it.

Maciej recently writes of its continuing effect on NMI debug here.

http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-12/2940.html

Regards
Ross.



* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-12 16:59             ` Working nforce2, was " Jesse Allen
  2003-12-12 17:18               ` Jesse Allen
  2003-12-12 18:18               ` Josh McKinney
@ 2003-12-13  6:34               ` Bob
  2 siblings, 0 replies; 17+ messages in thread
From: Bob @ 2003-12-13  6:34 UTC (permalink / raw)
  To: linux-kernel

hackers be clever--

"system temperature was getting -- above 40 deg C. CPU was getting up to 
49 deg C...how poorly it's thermal management was operating then. Now 
with the new patches, and ultimately, BIOS update, system temperature is 
about 35 deg C -JesseAllen"

Maybe that tells me that my bios update fixed my
lockup problems without turning on cpu disconnect
or even by turning it off with no doc as face-saver
and not allowing me to see a choice in setup, since
like yours before cpu disconnect working my temp
is 41C most of the time and 48C under a
heavy load, possibly 49C, the exact range you
are looking at before you had cpu disconnect
working

or they turned cpu disconnect off without saying
anything, buying time, saving embarrassment

anyway it's probably off here since I have exactly
the same heat profile

I have 120mm fans one in one out, blowing air
across Zalman cpu and gpu heatsinks, no 80mm
extra Zalman fan. amd xp 3000+ 333mhz 1:1
arctic silver compound on heatsinks

Thermal 1: ok, 41.0 degrees C 105.8 degrees F
 - 41C in X, running realplayer
 - 48C compile a fat kernel or several heavy tasks

-Bob

Jesse Allen wrote:

> ....I compiled a new kernel without the disconnect off patch, or the 
> ack delay. These are the exact patches I used on 2.6.0-test11:
>
>patch-2.6.0-test11-bk8.bz2
>acpi-2.6.0t11.patch acpi bugfixes from Maciej.
>nforce-ioapic-timer-2.6t11.patch from Ross Dickson.  Timer patch.
>forcedeth.patch Patch stolen from -test10-mm1?  Unused.
>forcedeth-update-2.patch Same.
>
>Sure enough, under this kernel, no lockups.  Athcool reported Disconnect was "on".
>
>I decided to wait till this morning, to try the BIOS "C1 Disconnect" set to enabled.  Still no lockups under this kernel.  Tried a vanilla kernel, no lockups (but timer and watchdog messed up still).  Now that I read your message Bob, I understand what you are saying.  Luckily, the updated BIOS changelog states "Add C1 disconnect item."  And this exact version seems to have fixed it, and now we have an exact fix (another one?) to refer to.
>
>So the fix was absolutely a BIOS fix.
>
...but we're stuck looking at smoke and mirrors,
when the kernel might be able to work around
bioses that have not been "updated". Or to put
it another way, "voodoo" may be done by
kernel if not done by bios. Whatever is being
tweaked may be accessible to kernel code.

I can't read anything useful in my bios flash
file w6570nms.760 which is contained in--

>>http://download.msi.com.tw/support/bos_exe/6570v76.exe
>>

>>Nvidia X driver for ti4200 agp8 still locks up linux though,
>>but X nv works fine. agp8 3d may expose the timer issue.
>>
>>    
>>
>
>That's either an nvidia driver problem, or agpgart-nforce problem.  I'd try 4x agp, and or NVAGP (or agpgart, if already using NVAGP).  If you think it's the timer, try the timer patch, or with nolapic noapic.
>
>Jesse
>
Thanks, I've tried all of those except passing agp4 or agp2
to the nvidia X "nvidia" driver. Another clue that it's related
to interrupts or timing of access to interrupts is that before
I put another card on the pci bus I could get into X for a
few seconds with the nvidia driver before linux locked up,
now with an elan pcmcia 32-bit cardbus pci card that claims
it needs its own interrupt(can't give it one yet!) X just locks
up linux on load.


* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-13  5:16 Ross Dickson
@ 2003-12-13  6:04 ` Jesse Allen
  0 siblings, 0 replies; 17+ messages in thread
From: Jesse Allen @ 2003-12-13  6:04 UTC (permalink / raw)
  To: Ross Dickson; +Cc: AMartin, linux-kernel

On Sat, Dec 13, 2003 at 03:16:51PM +1000, Ross Dickson wrote:
> I wonder about the "voodoo" because my apic ack delay patch was developed
> without knowledge of the C1 disconnect bit and reports I have received so far
> are that the hard lockups go away when using it independent of the state of the 
> disconnect bit. Apparently the bit was on in my test systems. 
> 
> Ian Kumlien pointed out the linkage with the northbridge timing signals 
> to the CPU to do with the connect disconnect handshake 

This is what the item help for C1 Disconnect in my BIOS said:
 "Force En/Disabled
  or Auto mode:
  C17 IGP/SPP NB A03
  C18D SPP NB A01 (C01)
  enabled C1 disconnect
  otherwise disabled it"

I was thinking NB referred to northbridge.  SPP is the type of NForce chip.  IGP would be a graphics chip(?), though this board doesn't have that.

So yes, we do have at least some relationship with the northbridge and disconnect.  This BIOS update probably addressed that, and the BIOS changelog is just a summary.
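
One way to read that item help is as a small decision rule. The Python sketch below is only a guess at the behaviour, not anything known about the BIOS itself: the function name and the lookup table are hypothetical, the chip and northbridge revision pairs are copied from the help text, and the meaning of "(C01)" is unknown, so it is not modelled.

```python
# Hypothetical model of the "C1 Disconnect" setup item, based only on the
# wording of the BIOS item help. Nothing here is taken from vendor docs.

# (chip, northbridge revision) pairs for which Auto mode is assumed to
# enable C1 disconnect. "C17 IGP/SPP" is read as "C17 IGP or C17 SPP".
AUTO_ENABLE = {
    ("C17 IGP", "A03"),
    ("C17 SPP", "A03"),
    ("C18D SPP", "A01"),
}


def c1_disconnect_enabled(setting: str, chip: str, nb_rev: str) -> bool:
    """Return True if C1 disconnect would end up enabled under this reading."""
    choice = setting.strip().lower()
    if choice == "enabled":
        return True   # forced on, regardless of chip
    if choice == "disabled":
        return False  # forced off, regardless of chip
    if choice == "auto":
        # Enabled only for the listed chip/revision pairs, otherwise disabled.
        return (chip, nb_rev) in AUTO_ENABLE
    raise ValueError(f"unknown setting: {setting!r}")


print(c1_disconnect_enabled("Auto", "C18D SPP", "A01"))
```

If this reading is right, Auto on a listed northbridge revision behaves the same as Enabled, which would be consistent with athcool reporting Disconnect as "on" after Auto was selected.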

> so I now wonder just how
> programmable the nforce2 northbridge is? Is it a bit fpga'ish in that they may be
> using the bios boot to alter the handshake timing enough to accomplish what
> the ack delay does but like it should be - transparent to the OS?

Probably.  That's what I'm thinking too now.

> 
> Of course they -the makers- have access to knowledge we don't so it could be 
> something completely different that they are doing!
> 
> In short I agree with the suggestion that the new bios options do more behind
> the scenes than what the athcool and disconnect patches do. 

That's why I'm trying to contact shuttle.

> 
> I am pretty sure that I read somewhere that when the epox boards 
> were first released the epox 8rda bios started out with it (the disconnect bit) off
> then the 8rga+ came out with it on by default? So back then people were wanting
> to turn it on in the 8rda to lower their CPU temperature - now some want it off
> in search of stability? 

Ah, that reminds me.  The very first day I ran this board last week, I was very worried about how high the system temperature was getting -- above 40 deg C.  CPU was getting up to 49 deg C.  Not that it was locking up because of temperature - it would lock up on a cold boot - but I was experiencing lockups and higher than normal temperatures, which indicates to me now how poorly its thermal management was operating then.  Now with the new patches, and ultimately the BIOS update, system temperature is about 35 deg C, which ain't too bad =)

> Back then under win.... some experienced lockups depending
> on which IDE driver was used and which state the bit was in!

Good point!  I was reading some message boards discussing nforce2s yesterday.  And they pretty much unanimously said, don't use the NForce IDE driver, use the windows-provided IDE driver, because the NForce IDE driver _locks up_.  So windows does have the same problem after all.  I wouldn't know because I don't have windows... but you can find this same issue everywhere then.

> 
> Out of interest has anyone seen new disconnect bit options in the Pheonix bios or
> only in the award bios?

I have an award bios.

> 
> Finally I have done some more work and found that the ack delay patch on my
> system is about 13 apic timer counts, about half that required to write a byte 
> directly outb(0x00, 0x378) to the printer port at 28 apic timer counts. 
> So the ack delay is about twice as quick as writing a single EOI to the 8259 in
> XTPIC mode provided the 8259 accesses are not souped up under the hood.
> In other words whilst it is a timing hit it is not much of one and it won't be
> needed once this is all fixed by the respective manufacturers -lets hope they
> can do it on the hardware we have already bought.
> 
> Regards
> Ross Dickson
> 
> 

Good work.  Let's hope the hardware manufacturers come through.

Jesse


* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
@ 2003-12-13  5:16 Ross Dickson
  2003-12-13  6:04 ` Jesse Allen
  0 siblings, 1 reply; 17+ messages in thread
From: Ross Dickson @ 2003-12-13  5:16 UTC (permalink / raw)
  To: cbradney; +Cc: linux-kernel, AMartin, Ian Kumlien

<snip>
> > The thing that strikes me funny is that you get no crashes with the
> > updated BIOS and Disconnect on, but without the updated BIOS we have
> > to turn disconnect off with athcool or the patch? This makes me think
> > that there is some voodoo going on in the BIOS update that they aren't
> > saying, surprise surprise, or something is just slowing down the time
> > it takes for it to crash. I say this because I have gone 5+ days
> > without any of the patches from these threads, acpi apic lapic
> > enabled, and CPU disconnect on as stated by athcool. This was with
> > much stress testing, idle time, etc. One day I just ran a grep that I
> > have done probably 30 times and boom, hang.
> >
> > Good luck, hope the BIOS is the trick, now off to see how I can get
> > ASUS to put the C1 Disconnect in the next revision.
>
> Yes, thats how it was for me.. I was the only one here saying "no
> problems, la la la", then at about 5.25 days.. boom. Then the next day
> it crashed twice. Hopefully you make some progress with ASUS.. (for the
> A7N8X Deluxe as well as you mobo please :) ).
>
> Ive been playing with hardware in the past few days (new quieter Zalman
> PSU, and Zalman 7000 Cu fan etc) so no uptime to speak of here now. I
> did compile KDE 3.2 beta 2 last night though.. 6 hours of solid
> compilation.. no hassles. I have never turned off Disconnect either.
>
> Thanks to all you guys who are working on this one. Seems to be getting
> somewhere.
>
> Craig

I wonder about the "voodoo" because my apic ack delay patch was developed
without knowledge of the C1 disconnect bit and reports I have received so far
are that the hard lockups go away when using it independent of the state of the 
disconnect bit. Apparently the bit was on in my test systems. 

Ian Kumlien pointed out the linkage with the northbridge timing signals 
to the CPU to do with the connect disconnect handshake so I now wonder just how
programmable the nforce2 northbridge is? Is it a bit fpga'ish in that they may be
using the bios boot to alter the handshake timing enough to accomplish what
the ack delay does but like it should be - transparent to the OS?

Of course they -the makers- have access to knowledge we don't so it could be 
something completely different that they are doing!

In short I agree with the suggestion that the new bios options do more behind
the scenes than what the athcool and disconnect patches do. 

I am pretty sure that I read somewhere that when the epox boards 
were first released the epox 8rda bios started out with it (the disconnect bit) off
then the 8rga+ came out with it on by default? So back then people were wanting
to turn it on in the 8rda to lower their CPU temperature - now some want it off
in search of stability? Back then under win.... some experienced lockups depending
on which IDE driver was used and which state the bit was in!

Out of interest, has anyone seen new disconnect bit options in the Phoenix bios, or
only in the award bios?

Finally I have done some more work and found that the delay added by the ack
delay patch on my system is about 13 apic timer counts, about half of the 28
apic timer counts required to write a byte directly to the printer port with
outb(0x00, 0x378).
So the ack delay is about twice as quick as writing a single EOI to the 8259 in
XTPIC mode, provided the 8259 accesses are not souped up under the hood.
In other words, whilst it is a timing hit it is not much of one, and it won't be
needed once this is all fixed by the respective manufacturers - let's hope they
can do it on the hardware we have already bought.
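
As a rough check of those figures, here is a minimal Python sketch of the arithmetic. Only the two counts come from the measurement above; the constant and function names are made up for illustration, and treating an 8259 EOI write as costing about the same as the printer port write is the same assumption made in the comparison above.

```python
# Figures quoted above, in APIC timer counts.
ACK_DELAY_COUNTS = 13    # measured cost of the ack delay
PORT_WRITE_COUNTS = 28   # measured cost of outb(0x00, 0x378)


def relative_cost(delay_counts: int, reference_counts: int) -> float:
    """Return delay_counts as a fraction of reference_counts."""
    if reference_counts <= 0:
        raise ValueError("reference_counts must be positive")
    return delay_counts / reference_counts


ratio = relative_cost(ACK_DELAY_COUNTS, PORT_WRITE_COUNTS)
print(f"ack delay costs about {ratio:.0%} of one port write")
```

The ratio comes out a little under one half, which matches the "about half" estimate.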

Regards
Ross Dickson





* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-12 18:18               ` Josh McKinney
  2003-12-12 19:29                 ` Jesse Allen
  2003-12-12 21:42                 ` Craig Bradney
@ 2003-12-13  4:18                 ` Bob
  2 siblings, 0 replies; 17+ messages in thread
From: Bob @ 2003-12-13  4:18 UTC (permalink / raw)
  To: linux-kernel

Re: two instances of good but undocumented bios voodoo

Josh McKinney wrote:

>On approximately Fri, Dec 12, 2003 at 09:59:29AM -0700, Jesse Allen wrote:
>>On Fri, Dec 12, 2003 at 04:27:59AM -0500, Bob wrote:
>>>Jesse Allen wrote:
>>>>On Thu, Dec 11, 2003 at 06:52:41PM +0100, Ian Kumlien wrote:
>>>> ............
>>>>
>>>>but I checked
>>>>for a bios update first.  And sure enough like they read my mind, just
>>>>posted online today, an update.  Here are the details of fixes:
>>>>
>>>>" Checksum:   8B00H                         Date Code: 12/05/03
>>>>1.Support 0.18 micron AMD Duron (Palomino) CPU.
>>>>2.Add C1 disconnect item."..........Jesse
-Jesse got a bios update that gives him a cpu disconnect
option now in setup

>>>A bios update for MSI K7N2 MCP2-T nforce2 board
>>>fixed the crashing BEFORE these patches were developed,
>>>but there was no documentation that would relate or explain.
-Bob said that about his bios update fixing
the lockup problem entirely, but with no doc, and
needing no patch except to turn on the ioapic
edge timer (another clue: even without the ioapic
edge timer working, the bios update fixed this
nforce2 situation!). There is no clue as to whether the bios
update sets cpu disconnect one way or the other, and
no option to choose cpu disconnect in the new or old
setup.

Jesse continues--

>>Last night, I updated the bios to the 12-5-03 released yesterday (see above).  I looked at the new option under Advanced Chipset Features, "C1 Disconnect".  It has three selections: Auto, Enabled, Disabled.  There seems to be no default.  The item help says:
>>"Force En/Disabled 
>> or Auto mode:
>> C17 IGP/SPP NB A03
>> C18D SPP NM A01 (C01)
>> enabled C1 disconnect
>> otherwise disabled it"
>>
>>Auto sounded nice, so I selected that first.  I compiled a new kernel without the disconnect off patch, or the ack delay.  These are the exact patches I used on 2.6.0-test11:
>>patch-2.6.0-test11-bk8.bz2
>>acpi-2.6.0t11.patch acpi bugfixes from Maciej.
>>nforce-ioapic-timer-2.6t11.patch from Ross Dickson.  Timer patch.
>>forcedeth.patch Patch stolen from -test10-mm1?  Unused.
>>forcedeth-update-2.patch Same.
>>
>>Sure enough, under this kernel, no lockups.  Athcool reported Disconnect was "on".
Disconnect was ON!!!

> <snip>

...in one case the bios update fixed the problem
without needing cpu disconnect off; in the other case we
don't know how or whether cpu disconnect is on or
off now, but the bios update fixed nforce2 without turning
the ioapic edge timer on. I guess these two cases show that
neither cpu disconnect = on nor ioapic timer = off is
causing the problem directly.

>The thing that strikes me funny is that you get no crashes with the
>updated BIOS and Disconnect on, but without the updated BIOS we have
>to turn disconnect off with athcool or the patch?  This makes me think
>that there is some voodoo going on in the BIOS update that they aren't
>saying, surprise surprise, or something is just slowing down the time
>it takes for it to crash.  I say this because I have gone 5+ days
>without any of the patches from these threads, acpi apic lapic
>enabled, and CPU disconnect on as stated by athcool.  This was with
>much stress testing, idle time, etc.  One day I just ran a grep that I
>have done probably 30 times and boom, hang.  
>
>Good luck, hope the BIOS is the trick, now off to see how I can get
>ASUS to put the C1 Disconnect in the next revision.
>
...and at least two motherboard makers have voodoo
to fix the problem.



* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-12 18:18               ` Josh McKinney
  2003-12-12 19:29                 ` Jesse Allen
@ 2003-12-12 21:42                 ` Craig Bradney
  2003-12-13  4:18                 ` Bob
  2 siblings, 0 replies; 17+ messages in thread
From: Craig Bradney @ 2003-12-12 21:42 UTC (permalink / raw)
  To: Josh McKinney; +Cc: linux-kernel

On Fri, 2003-12-12 at 19:18, Josh McKinney wrote:
> On approximately Fri, Dec 12, 2003 at 09:59:29AM -0700, Jesse Allen wrote:
> > On Fri, Dec 12, 2003 at 04:27:59AM -0500, Bob wrote:
> > > Jesse Allen wrote:
> > > 
> > > >On Thu, Dec 11, 2003 at 06:52:41PM +0100, Ian Kumlien wrote:
> > > > 
> > > >
> > > >>Heh, yeah, the need for disconnect is somewhat dodgy, i haven't read up
> > > >>on th rest.
> > > >>
> > > >Hmm, weird.  I went to go look at the Shuttle motherboard maker's site - 
> > > >maybe so that I can bug them for a bios disconnect option - but I checked 
> > > >for a bios update first.  And sure enough like they read my mind, just 
> > > >posted online today, an update.  Here are the details of fixes:
> > > >
> > > >" Checksum:   8B00H                         Date Code: 12/05/03
> > > >1.Support 0.18 micron AMD Duron (Palomino) CPU.
> > > >2.Add C1 disconnect item."
> > > >
> > > >It's almost as they're reading this list.  This disconnect problem was 
> > > >discovered on the 5th (well the 5th in my timezone).  Perhaps they're 
> > > >aware of this issue...  I'm gonna talk to them.
> > > >
> > > >Jesse
> > > >
> > > A bios update for MSI K7N2 MCP2-T nforce2 board
> > > fixed the crashing BEFORE these patches were developed,
> > > but there was no documentation that would relate or explain.
> > 
> > Last night, I updated the bios to the 12-5-03 released yesterday (see above).  I looked at the new option under Advanced Chipset Features, "C1 Disconnect".  It has three selections: Auto, Enabled, Disabled.  There seems to be no default.  The item help says:
> > "Force En/Disabled 
> >  or Auto mode:
> >  C17 IGP/SPP NB A03
> >  C18D SPP NM A01 (C01)
> >  enabled C1 disconnect
> >  otherwise disabled it"
> > 
> > Auto sounded nice, so I selected that first.  I compiled a new kernel without the disconnect off patch, or the ack delay.  These are the exact patches I used on 2.6.0-test11:
> > patch-2.6.0-test11-bk8.bz2
> > acpi-2.6.0t11.patch acpi bugfixes from Maciej.
> > nforce-ioapic-timer-2.6t11.patch from Ross Dickson.  Timer patch.
> > forcedeth.patch Patch stolen from -test10-mm1?  Unused.
> > forcedeth-update-2.patch Same.
> > 
> > Sure enough, under this kernel, no lockups.  Athcool reported Disconnect was "on".
> > 
> <snip>
> > So the fix was absolutely a BIOS fix.  It seems a lot of people have buggy BIOSes on nforce2 boards.  Even some that have the option.  I guess I haven't proved that it was the BIOS fix, because I haven't stressed it for a long period of time.  But I don't believe I have to because I can do grep's and kernel compiles with disconnect on now, where before I couldn't (always been very easy to reproduce lockup).
> <snip>
> 
> The thing that strikes me funny is that you get no crashes with the
> updated BIOS and Disconnect on, but without the updated BIOS we have
> to turn disconnect off with athcool or the patch?  This makes me think
> that there is some voodoo going on in the BIOS update that they aren't
> saying, surprise surprise, or something is just slowing down the time
> it takes for it to crash.  I say this because I have gone 5+ days
> without any of the patches from these threads, acpi apic lapic
> enabled, and CPU disconnect on as stated by athcool.  This was with
> much stress testing, idle time, etc.  One day I just ran a grep that I
> have done probably 30 times and boom, hang.  
> 
> Good luck, hope the BIOS is the trick, now off to see how I can get
> ASUS to put the C1 Disconnect in the next revision.    


Yes, that's how it was for me.. I was the only one here saying "no
problems, la la la", then at about 5.25 days.. boom. Then the next day
it crashed twice. Hopefully you make some progress with ASUS.. (for the
A7N8X Deluxe as well as your mobo please :) ).

I've been playing with hardware in the past few days (new quieter Zalman
PSU, and Zalman 7000 Cu fan etc) so no uptime to speak of here now. I
did compile KDE 3.2 beta 2 last night though.. 6 hours of solid
compilation.. no hassles. I have never turned off Disconnect either.

Thanks to all you guys who are working on this one. Seems to be getting
somewhere.

Craig



* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-12 18:18               ` Josh McKinney
@ 2003-12-12 19:29                 ` Jesse Allen
  2003-12-12 21:42                 ` Craig Bradney
  2003-12-13  4:18                 ` Bob
  2 siblings, 0 replies; 17+ messages in thread
From: Jesse Allen @ 2003-12-12 19:29 UTC (permalink / raw)
  To: Josh McKinney; +Cc: linux-kernel

On Fri, Dec 12, 2003 at 01:18:27PM -0500, Josh McKinney wrote:
> 
> The thing that strikes me funny is that you get no crashes with the
> updated BIOS and Disconnect on, but without the updated BIOS we have
> to turn disconnect off with athcool or the patch?  This makes me think
> that there is some voodoo going on in the BIOS update that they aren't
> saying, surprise surprise, 

Yes, it is weird.  I've now asked shuttle for more information.

> or something is just slowing down the time
> it takes for it to crash.  I say this because I have gone 5+ days
> without any of the patches from these threads, acpi apic lapic
> enabled, and CPU disconnect on as stated by athcool.  This was with
> much stress testing, idle time, etc.  One day I just ran a grep that I
> have done probably 30 times and boom, hang.  

I hope this is not the case!  The one/two grep test worked flawlessly, but if the lockup is merely delayed, then I can't rely on that test anymore.

(but at least I have the bios option now! heh)

I suggest you reference the Shuttle AN35 12-05-2003 BIOS, and maybe Bob's MSI, when you talk to Asus.  If they can do it, then Asus should be able to as well.

Jesse


* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-12 16:59             ` Working nforce2, was " Jesse Allen
  2003-12-12 17:18               ` Jesse Allen
@ 2003-12-12 18:18               ` Josh McKinney
  2003-12-12 19:29                 ` Jesse Allen
                                   ` (2 more replies)
  2003-12-13  6:34               ` Bob
  2 siblings, 3 replies; 17+ messages in thread
From: Josh McKinney @ 2003-12-12 18:18 UTC (permalink / raw)
  To: linux-kernel

On approximately Fri, Dec 12, 2003 at 09:59:29AM -0700, Jesse Allen wrote:
> On Fri, Dec 12, 2003 at 04:27:59AM -0500, Bob wrote:
> > Jesse Allen wrote:
> > 
> > >On Thu, Dec 11, 2003 at 06:52:41PM +0100, Ian Kumlien wrote:
> > > 
> > >
> > >>Heh, yeah, the need for disconnect is somewhat dodgy, i haven't read up
> > >>on th rest.
> > >>
> > >Hmm, weird.  I went to go look at the Shuttle motherboard maker's site - 
> > >maybe so that I can bug them for a bios disconnect option - but I checked 
> > >for a bios update first.  And sure enough like they read my mind, just 
> > >posted online today, an update.  Here are the details of fixes:
> > >
> > >" Checksum:   8B00H                         Date Code: 12/05/03
> > >1.Support 0.18 micron AMD Duron (Palomino) CPU.
> > >2.Add C1 disconnect item."
> > >
> > >It's almost as they're reading this list.  This disconnect problem was 
> > >discovered on the 5th (well the 5th in my timezone).  Perhaps they're 
> > >aware of this issue...  I'm gonna talk to them.
> > >
> > >Jesse
> > >
> > A bios update for MSI K7N2 MCP2-T nforce2 board
> > fixed the crashing BEFORE these patches were developed,
> > but there was no documentation that would relate or explain.
> 
> Last night, I updated the bios to the 12-5-03 released yesterday (see above).  I looked at the new option under Advanced Chipset Features, "C1 Disconnect".  It has three selections: Auto, Enabled, Disabled.  There seems to be no default.  The item help says:
> "Force En/Disabled 
>  or Auto mode:
>  C17 IGP/SPP NB A03
>  C18D SPP NM A01 (C01)
>  enabled C1 disconnect
>  otherwise disabled it"
> 
> Auto sounded nice, so I selected that first.  I compiled a new kernel without the disconnect off patch, or the ack delay.  These are the exact patches I used on 2.6.0-test11:
> patch-2.6.0-test11-bk8.bz2
> acpi-2.6.0t11.patch acpi bugfixes from Maciej.
> nforce-ioapic-timer-2.6t11.patch from Ross Dickson.  Timer patch.
> forcedeth.patch Patch stolen from -test10-mm1?  Unused.
> forcedeth-update-2.patch Same.
> 
> Sure enough, under this kernel, no lockups.  Athcool reported Disconnect was "on".
> 
<snip>
> So the fix was absolutely a BIOS fix.  It seems a lot of people have buggy BIOSes on nforce2 boards.  Even some that have the option.  I guess I haven't proved that it was the BIOS fix, because I haven't stressed it for a long period of time.  But I don't believe I have to because I can do grep's and kernel compiles with disconnect on now, where before I couldn't (always been very easy to reproduce lockup).
<snip>

The thing that strikes me funny is that you get no crashes with the
updated BIOS and Disconnect on, but without the updated BIOS we have
to turn disconnect off with athcool or the patch?  This makes me think
that there is some voodoo going on in the BIOS update that they aren't
saying, surprise surprise, or something is just slowing down the time
it takes for it to crash.  I say this because I have gone 5+ days
without any of the patches from these threads, acpi apic lapic
enabled, and CPU disconnect on as stated by athcool.  This was with
much stress testing, idle time, etc.  One day I just ran a grep that I
have done probably 30 times and boom, hang.  

Good luck, hope the BIOS is the trick, now off to see how I can get
ASUS to put the C1 Disconnect in the next revision.    

-- 
Josh McKinney		     |	Webmaster: http://joshandangie.org
--------------------------------------------------------------------------
                             | They that can give up essential liberty
Linux, the choice       -o)  | to obtain a little temporary safety deserve 
of the GNU generation    /\  | neither liberty or safety. 
                        _\_v |                          -Benjamin Franklin


* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-12 16:59             ` Working nforce2, was " Jesse Allen
@ 2003-12-12 17:18               ` Jesse Allen
  2003-12-12 18:18               ` Josh McKinney
  2003-12-13  6:34               ` Bob
  2 siblings, 0 replies; 17+ messages in thread
From: Jesse Allen @ 2003-12-12 17:18 UTC (permalink / raw)
  To: linux-kernel

Oops, typo: NM supposed to be NB
On Fri, Dec 12, 2003 at 09:59:29AM -0700, Jesse Allen wrote:
> The item help says:
> "Force En/Disabled 
>  or Auto mode:
>  C17 IGP/SPP NB A03
>  C18D SPP NM A01 (C01)
  C18D SPP /NB/ A01 (C01)
>  enabled C1 disconnect
>  otherwise disabled it"
> 

Maybe NB means northbridge?


* Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-12  9:27           ` Bob
@ 2003-12-12 16:59             ` Jesse Allen
  2003-12-12 17:18               ` Jesse Allen
                                 ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Jesse Allen @ 2003-12-12 16:59 UTC (permalink / raw)
  To: linux-kernel

On Fri, Dec 12, 2003 at 04:27:59AM -0500, Bob wrote:
> Jesse Allen wrote:
> 
> >On Thu, Dec 11, 2003 at 06:52:41PM +0100, Ian Kumlien wrote:
> > 
> >
> >>Heh, yeah, the need for disconnect is somewhat dodgy, i haven't read up
> >>on th rest.
> >>
> >Hmm, weird.  I went to go look at the Shuttle motherboard maker's site - 
> >maybe so that I can bug them for a bios disconnect option - but I checked 
> >for a bios update first.  And sure enough like they read my mind, just 
> >posted online today, an update.  Here are the details of fixes:
> >
> >" Checksum:   8B00H                         Date Code: 12/05/03
> >1.Support 0.18 micron AMD Duron (Palomino) CPU.
> >2.Add C1 disconnect item."
> >
> >It's almost as they're reading this list.  This disconnect problem was 
> >discovered on the 5th (well the 5th in my timezone).  Perhaps they're 
> >aware of this issue...  I'm gonna talk to them.
> >
> >Jesse
> >
> A bios update for MSI K7N2 MCP2-T nforce2 board
> fixed the crashing BEFORE these patches were developed,
> but there was no documentation that would relate or explain.

Last night, I updated the bios to the 12-5-03 released yesterday (see above).  I looked at the new option under Advanced Chipset Features, "C1 Disconnect".  It has three selections: Auto, Enabled, Disabled.  There seems to be no default.  The item help says:
"Force En/Disabled 
 or Auto mode:
 C17 IGP/SPP NB A03
 C18D SPP NM A01 (C01)
 enabled C1 disconnect
 otherwise disabled it"

Auto sounded nice, so I selected that first.  I compiled a new kernel without the disconnect off patch, or the ack delay.  These are the exact patches I used on 2.6.0-test11:
patch-2.6.0-test11-bk8.bz2
acpi-2.6.0t11.patch acpi bugfixes from Maciej.
nforce-ioapic-timer-2.6t11.patch from Ross Dickson.  Timer patch.
forcedeth.patch Patch stolen from -test10-mm1?  Unused.
forcedeth-update-2.patch Same.

Sure enough, under this kernel, no lockups.  Athcool reported Disconnect was "on".

I decided to wait till this morning, to try the BIOS "C1 Disconnect" set to enabled.  Still no lockups under this kernel.  Tried a vanilla kernel, no lockups (but timer and watchdog messed up still).  Now that I read your message Bob, I understand what you are saying.  Luckily, the updated BIOS changelog states "Add C1 disconnect item."  And this exact version seems to have fixed it, and now we have an exact fix (another one?) to refer to.

So the fix was absolutely a BIOS fix.  It seems a lot of people have buggy BIOSes on nforce2 boards.  Even some that have the option.  I guess I haven't strictly proved that it was the BIOS fix, because I haven't stressed it for a long period of time.  But I don't believe I have to, because I can do greps and kernel compiles with disconnect on now, where before I couldn't (the lockup has always been very easy to reproduce).

> 
> http://www.msi.com.tw/program/support/bios/bos/spt_bos_detail.php?UID=436&kind=1
> http://download.msi.com.tw/support/bos_exe/6570v76.exe
> 
> Award 7.6 at the top of the list. Maybe somebody can figure
> out what they're doing.

I think I'll continue on contacting shuttle and ask them why they added the option, and how they added it.  Maybe that will give us the right information.

> 
> Nvidia X driver for ti4200 agp8 still locks up linux though,
> but X nv works fine. agp8 3d may expose the timer issue.
> 

That's either an nvidia driver problem, or an agpgart-nforce problem.  I'd try 4x agp, and/or NVAGP (or agpgart, if already using NVAGP).  If you think it's the timer, try the timer patch, or boot with nolapic noapic.

Jesse


Thread overview: 17+ messages
2003-12-15 14:30 Fwd: Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered Ross Dickson
2003-12-15 15:02 ` Craig Bradney
2003-12-15 15:56   ` Maciej W. Rozycki
2003-12-15 16:54   ` Ross Dickson
2003-12-16  6:07     ` Bob
     [not found] <200312132040.00875.ross@datscreative.com.au>
2003-12-13 12:00 ` Fwd: " Bob
2003-12-15 13:11   ` Maciej W. Rozycki
2003-12-16  7:18     ` Bob
  -- strict thread matches above, loose matches on Subject: below --
2003-12-13  9:20 Ross Dickson
2003-12-13  9:51 ` Bob
2003-12-13  5:16 Ross Dickson
2003-12-13  6:04 ` Jesse Allen
2003-12-07 13:12 Ross Dickson
2003-12-11  6:55 ` Ross Dickson
2003-12-11 11:47   ` Ian Kumlien
2003-12-11  9:12     ` Ross Dickson
2003-12-11 17:52       ` Ian Kumlien
2003-12-11 18:21         ` Jesse Allen
2003-12-12  9:27           ` Bob
2003-12-12 16:59             ` Working nforce2, was " Jesse Allen
2003-12-12 17:18               ` Jesse Allen
2003-12-12 18:18               ` Josh McKinney
2003-12-12 19:29                 ` Jesse Allen
2003-12-12 21:42                 ` Craig Bradney
2003-12-13  4:18                 ` Bob
2003-12-13  6:34               ` Bob
