* Re: Oops in 2.6.10-rc1 (almost solved)
@ 2004-11-13  3:45 Chuck Ebbert
  2004-11-13 14:28 ` Matt Domsch
  0 siblings, 1 reply; 21+ messages in thread
From: Chuck Ebbert @ 2004-11-13  3:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Matt Domsch

On Tue, 9 Nov 2004 at 17:01:10 -0800 Linus Torvalds <torvalds@osdl.org> wrote:

> > PS: do you have *any* idea how this could be related to the snd-es1371
> > driver (which is producing the oops then)?
>
> I bet it's overwriting some array, and just corrupting memory after it. 
> For example, the edd_info[] array only has 6 entries,

  That's almost certainly the problem.  There can be up to 16 EDD devices
as of the Jun 30 update to the EDD code.

  And sound_class is the next item after edd_info[] in my System.map...


--Chuck Ebbert  12-Nov-04  22:21:27


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-13  3:45 Oops in 2.6.10-rc1 (almost solved) Chuck Ebbert
@ 2004-11-13 14:28 ` Matt Domsch
  2004-11-13 18:55   ` Matt Domsch
  2004-11-14  2:58   ` Matt Domsch
  0 siblings, 2 replies; 21+ messages in thread
From: Matt Domsch @ 2004-11-13 14:28 UTC (permalink / raw)
  To: Chuck Ebbert, Christian Kujau; +Cc: Linus Torvalds, linux-kernel

On Fri, Nov 12, 2004 at 10:45:12PM -0500, Chuck Ebbert wrote:
> On Tue, 9 Nov 2004 at 17:01:10 -0800 Linus Torvalds <torvalds@osdl.org> wrote:
> 
> > > PS: do you have *any* idea how this could be related to the snd-es1371
> > > driver (which is producing the oops then)?
> >
> > I bet it's overwriting some array, and just corrupting memory after it. 
> > For example, the edd_info[] array only has 6 entries,
> 
>   That's almost certainly the problem.  There can be up to 16 EDD devices
> as of the Jun 30 update to the EDD code.

Bingo...  edd_devices[] was too short.  When we keep more
than 6 signatures, it overruns the end.  Also, I rewrote
edd_num_devices to be clearer about its goal.

This patch is necessary even after the last edd.S patch was reverted.

It still doesn't explain why Christian's BIOS reports more devices
than he has, that's still UI, so don't re-apply the edd.S patch just reverted.

Signed-off-by: Matt Domsch

-- 
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

===== drivers/firmware/edd.c 1.30 vs edited =====
--- 1.30/drivers/firmware/edd.c	2004-06-29 09:44:48 -05:00
+++ edited/drivers/firmware/edd.c	2004-11-13 07:56:00 -06:00
@@ -70,7 +70,7 @@
 static int edd_dev_is_type(struct edd_device *edev, const char *type);
 static struct pci_dev *edd_get_pci_dev(struct edd_device *edev);
 
-static struct edd_device *edd_devices[EDDMAXNR];
+static struct edd_device *edd_devices[EDD_MBR_SIG_MAX];
 
 #define EDD_DEVICE_ATTR(_name,_mode,_show,_test) \
 struct edd_attribute edd_attr_##_name = { 	\
@@ -728,9 +728,9 @@
 
 static inline int edd_num_devices(void)
 {
-	return min_t(unsigned char,
-		     max_t(unsigned char, edd.edd_info_nr, edd.mbr_signature_nr),
-		     max_t(unsigned char, EDD_MBR_SIG_MAX, EDDMAXNR));
+	return max_t(unsigned char,
+		     min_t(unsigned char, EDD_MBR_SIG_MAX, edd.mbr_signature_nr),
+		     min_t(unsigned char, EDDMAXNR, edd.edd_info_nr));
 }
 
 /**
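
The sizing argument in this patch can be checked with a small user-space sketch. It is not kernel code: min_uc()/max_uc() are hypothetical stand-ins for min_t()/max_t(unsigned char, ...), and the two counters are passed in as parameters instead of being read from the edd structure. EDDMAXNR = 6 and EDD_MBR_SIG_MAX = 16 are the values quoted elsewhere in this thread.

```c
#include <assert.h>

enum { EDDMAXNR = 6, EDD_MBR_SIG_MAX = 16 };

/* Hypothetical stand-ins for the kernel's min_t/max_t(unsigned char, ...). */
static unsigned char min_uc(unsigned char a, unsigned char b) { return a < b ? a : b; }
static unsigned char max_uc(unsigned char a, unsigned char b) { return a > b ? a : b; }

/* Same shape as the patched edd_num_devices(), with the counters passed in. */
static int patched_num_devices(unsigned char edd_info_nr,
                               unsigned char mbr_signature_nr)
{
    return max_uc(min_uc(EDD_MBR_SIG_MAX, mbr_signature_nr),
                  min_uc(EDDMAXNR, edd_info_nr));
}

/* Returns 1 if no pair of counter values can push the count past `slots`. */
static int count_fits_in(int slots)
{
    for (int info = 0; info <= 255; info++)
        for (int sig = 0; sig <= 255; sig++)
            if (patched_num_devices((unsigned char)info,
                                    (unsigned char)sig) > slots)
                return 0;
    return 1;
}
```

Under those assumptions the patched count can reach 16 but never exceed it, so a 16-entry edd_devices[] is large enough and a 6-entry one is not.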


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-13 14:28 ` Matt Domsch
@ 2004-11-13 18:55   ` Matt Domsch
  2004-11-14  2:58   ` Matt Domsch
  1 sibling, 0 replies; 21+ messages in thread
From: Matt Domsch @ 2004-11-13 18:55 UTC (permalink / raw)
  To: Chuck Ebbert, Christian Kujau; +Cc: Linus Torvalds, linux-kernel

On Sat, Nov 13, 2004 at 08:28:35AM -0600, Matt Domsch wrote:
> On Fri, Nov 12, 2004 at 10:45:12PM -0500, Chuck Ebbert wrote:
> > On Tue, 9 Nov 2004 at 17:01:10 -0800 Linus Torvalds <torvalds@osdl.org> wrote:
> > 
> > > > PS: do you have *any* idea how this could be related to the snd-es1371
> > > > driver (which is producing the oops then)?
> > >
> > > I bet it's overwriting some array, and just corrupting memory after it. 
> > > For example, the edd_info[] array only has 6 entries,
> > 
> >   That's almost certainly the problem.  There can be up to 16 EDD devices
> > as of the Jun 30 update to the EDD code.
> 
> Bingo...  edd_devices[] was too short.  When we keep more
> than 6 signatures, it overruns the end.

In particular, depending on your .config, with EDD=y it overwrites 40
bytes past the end of edd_devices (here I've already extended it by
the necessary amount, but the 40 bytes past its end are all subject to
be overwritten):
c043a880 b edd_devices
c043a8c0 b pci_bios_present
c043a8c4 B pci_mmcfg_base_addr
c043a8c8 b mmcfg_last_accessed_device
c043a8cc b called.0
c043a8d0 B pcibios_enable_irq
c043a8d4 b eisa_irq_mask.0
c043a8d8 b broken_hp_bios_irq9
c043a8dc b acer_tm360_irqrouting
c043a8e0 b pirq_table
c043a8e4 b pirq_router

hence the failure Christian saw and attributed to the sound drivers:

EIP is at 0xc15d5820
eax: 00000000   ebx: dff20400   ecx: c15d5820   edx: dff205c4
esi: ffffffed   edi: dff20400   ebp: dff20400   esp: c17a3e58
ds: 007b   es: 007b   ss: 0068
Process modprobe (pid: 178, threadinfo=c17a2000 task=dfcf05a0)
Stack: c01fa5c8 dff20400 000007ff dff20400 c01fa5ff dff20400 000007ff
c15ea400 
       e082729d dff20400 c15ea400 00000000 e08469df c15ea400 000001f8
       000000d0 
       000000d0 df45ed14 00000000 c018e14e c15ea400 ffffffed dff20400
       dff20400 
Call Trace:
 [<c01fa5c8>] pci_enable_device_bars+0x28/0x40
 [<c01fa5ff>] pci_enable_device+0x1f/0x40
 [<e082729d>] snd_ensoniq_create+0x1d/0x480 [snd_ens1371]
 [<e08469df>] snd_card_new+0x1cf/0x2c0 [snd]
 [<c018e14e>] sysfs_new_dirent+0x2e/0x90
 [<e0827867>] snd_audiopci_probe+0x87/0x1e0 [snd_ens1371]
 [<c01fb012>] pci_device_probe_static+0x52/0x70
 [<c01fb05c>] __pci_device_probe+0x2c/0x30
 [<c01fb08c>] pci_device_probe+0x2c/0x60
 [<c0258f4f>] driver_probe_device+0x2f/0x80
 [<c02590b2>] driver_attach+0x52/0xa0
 [<c02595f8>] bus_add_driver+0x98/0xe0
 [<c0259c5f>] driver_register+0x2f/0x40
 [<c01fb340>] pci_register_driver+0x40/0x50
 [<e08279cf>] alsa_card_ens137x_init+0xf/0x13 [snd_ens1371]
 [<c0134279>] sys_init_module+0x169/0x240
 [<c01041eb>] syscall_call+0x7/0xb


With CONFIG_EDD=m, there just wasn't anything interesting in memory
following edd_devices[] (thanks module loader for using whole pages I
believe).

-Matt

-- 
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com
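
The 40-byte figure follows from the array sizes, assuming 4-byte pointers on i386: 16 stored pointers minus 6 available slots, times 4 bytes. A rough sketch of that arithmetic, using hypothetical helper names and the addresses from the System.map excerpt above:

```c
#include <assert.h>
#include <stddef.h>

/* Values quoted in the thread; I386_PTR_SIZE is an assumption for i386. */
enum { EDDMAXNR = 6, EDD_MBR_SIG_MAX = 16, I386_PTR_SIZE = 4 };

/* Bytes written past an array of `slots` entries when `stored` are written. */
static size_t overrun_bytes(size_t slots, size_t stored, size_t entry_size)
{
    return stored > slots ? (stored - slots) * entry_size : 0;
}

/* Returns 1 if `addr` lies in the `len`-byte window starting at `start`. */
static int in_window(unsigned long addr, unsigned long start, size_t len)
{
    return addr >= start && addr - start < len;
}
```

With these numbers the window in the listing runs from c043a8c0 through c043a8e7, which includes pcibios_enable_irq. That would be consistent with the oops under pci_enable_device(), although the thread does not spell this step out.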


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-13 14:28 ` Matt Domsch
  2004-11-13 18:55   ` Matt Domsch
@ 2004-11-14  2:58   ` Matt Domsch
  2004-11-14  4:43     ` Linus Torvalds
                       ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Matt Domsch @ 2004-11-14  2:58 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Linus Torvalds, linux-kernel, Chuck Ebbert

On Sat, Nov 13, 2004 at 08:28:35AM -0600, Matt Domsch wrote:
> It still doesn't explain why Christian's BIOS reports more devices
> than he has, that's still UI, so don't re-apply the edd.S patch just reverted.

Alexander van Heukelum noted to me that addw here modifies CF, so I
think something like this should fix that.  Christian, if you're in a
position to test this, I'd really appreciate it.  You've been a
fantastic bug reporter / tester!

Not ready for Linus yet, and you'll need to re-apply the previous
edd.S patch which is now reverted in Linus's tree.  As your BIOS
reports via CHECK EXTENSIONS PRESENT that you've got more devices than
you actually have, hopefully the int13 EXTENDED READ won't succeed for
non-existent devices anymore, and then neither will the READ SECTORS
call.

-- 
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

===== arch/i386/boot/edd.S 1.3 vs edited =====
--- 1.3/arch/i386/boot/edd.S	2004-10-20 03:37:11 -05:00
+++ edited/arch/i386/boot/edd.S	2004-11-13 20:31:58 -06:00
@@ -58,8 +58,12 @@
 	sti					# work around buggy BIOSes
 	popw	%dx
 	popw	%si
-	addw	$EDD_DEV_ADDR_PACKET_LEN, %sp	# remove packet from stack
-	jnc   edd_mbr_store_sig
+	pushfl					# save EFLAGS into ebx	
+	popl	%ebx				# because addw modifies CF
+    	addw	$EDD_DEV_ADDR_PACKET_LEN, %sp	# remove packet from stack
+	pushl	%ebx				# get back right CF
+	popfl
+    	jnc	edd_mbr_store_sig
 	# otherwise, fall through to the legacy read function
 
 edd_mbr_read_sectors:


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-14  2:58   ` Matt Domsch
@ 2004-11-14  4:43     ` Linus Torvalds
  2004-11-14 11:45     ` Christian
  2004-11-14 20:02     ` Christian Kujau
  2 siblings, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2004-11-14  4:43 UTC (permalink / raw)
  To: Matt Domsch; +Cc: Christian Kujau, linux-kernel, Chuck Ebbert



On Sat, 13 Nov 2004, Matt Domsch wrote:
> 
> Not ready for Linus yet

Indeed. Please don't use pushfl/popfl to save the carry flag. There are 
tons of better ways.

For example, use "lea" instead of "add" to not write the flags (and add a 
comment). Or save the carry flag in a register with

	sbb %bx,%bx

and test %bx later. Or any of a million other _standard_ ways to handle
this problem.

		Linus
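
The flag problem under discussion can be illustrated with a toy model in C. This is only a sketch: struct toy_cpu and the toy_* helpers are invented for illustration, PACKET_LEN = 16 is an assumed value for EDD_DEV_ADDR_PACKET_LEN, and real x86 flag handling involves more than CF. The model shows an add overwriting a carry left by the preceding BIOS call, while a lea-style update leaves it alone and an sbb-style capture turns it into a testable register value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { PACKET_LEN = 16 };  /* assumed stand-in for EDD_DEV_ADDR_PACKET_LEN */

/* Toy state: a 16-bit stack pointer and the carry flag only. */
struct toy_cpu {
    uint16_t sp;
    bool cf;
};

/* Models an add to SP: updates SP and overwrites CF with the carry-out. */
static void toy_addw_sp(struct toy_cpu *c, uint16_t imm)
{
    uint32_t sum = (uint32_t)c->sp + imm;
    c->cf = sum > 0xFFFFu;
    c->sp = (uint16_t)sum;
}

/* Models a lea-style update of SP: same arithmetic, flags left untouched. */
static void toy_lea_sp(struct toy_cpu *c, uint16_t imm)
{
    c->sp = (uint16_t)(c->sp + imm);
}

/* Models the value left by "sbb %bx,%bx": 0xFFFF if CF was set, else 0. */
static uint16_t toy_sbb_self(const struct toy_cpu *c)
{
    return c->cf ? 0xFFFFu : 0;
}
```

In this model, a carry set by a failing call is lost after toy_addw_sp() unless the addition itself wraps, so a following jnc would be taken as if the call had succeeded. Note that Christian's later test in this thread still reported 16 devices with the pushfl/popfl variant, so the clobbered flag may not be the whole story.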


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-14  2:58   ` Matt Domsch
  2004-11-14  4:43     ` Linus Torvalds
@ 2004-11-14 11:45     ` Christian
  2004-11-14 20:02     ` Christian Kujau
  2 siblings, 0 replies; 21+ messages in thread
From: Christian @ 2004-11-14 11:45 UTC (permalink / raw)
  To: Matt Domsch; +Cc: Linus Torvalds, linux-kernel, Chuck Ebbert

Matt Domsch wrote:
> 
> Alexander van Heukelum noted to me that addw here modifies CF, so I
> think something like this should fix that.  Christian, if you're in a
> position to test this, I'd really appreciate it.  You've been a

yes, i'll do so. right now i am off (and late) to sth. else, but i'll 
test this in the evening.

thank you,
Christian.
-- 
BOFH excuse #318:

Your EMAIL is now being delivered by the USPS.


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-14  2:58   ` Matt Domsch
  2004-11-14  4:43     ` Linus Torvalds
  2004-11-14 11:45     ` Christian
@ 2004-11-14 20:02     ` Christian Kujau
  2004-11-14 21:55       ` Matt Domsch
  2 siblings, 1 reply; 21+ messages in thread
From: Christian Kujau @ 2004-11-14 20:02 UTC (permalink / raw)
  To: Matt Domsch; +Cc: Linus Torvalds, linux-kernel, Chuck Ebbert

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

sorry, took me a bit longer to get to the testing.


Matt Domsch wrote:
> 
> Not ready for Linus yet, and you'll need to re-apply the previous
> edd.S patch which is now reverted in Linus's tree.  As your BIOS

i've applied the patch to a pristine 2.6.10-rc1, so the (currently
reverted) EDD change is still there. tell me, if the patch had to be
applied to sth. else.

but for now i have to say, that it still oopses:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.10-rc1_edd-2.txt

...
BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
...

(oh, i've added an ide-disk yesterday, so hde will show up in dmesg.)

sorry,
Christian.
- --
BOFH excuse #401:

Sales staff sold a product we don't offer.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBl7nZ+A7rjkF8z0wRAvuHAKCX8TWiDt5DP25OqBEWKecfM6x3HwCeNRoM
1IzHqKpcbWOABXWJ4vC4d1w=
=FiKX
-----END PGP SIGNATURE-----


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-14 20:02     ` Christian Kujau
@ 2004-11-14 21:55       ` Matt Domsch
  2004-11-15 12:41         ` Oops in 2.6.10-rc1 (solved) Christian Kujau
  0 siblings, 1 reply; 21+ messages in thread
From: Matt Domsch @ 2004-11-14 21:55 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Linus Torvalds, linux-kernel, Chuck Ebbert

On Sun, Nov 14, 2004 at 09:02:33PM +0100, Christian Kujau wrote:
> > Not ready for Linus yet, and you'll need to re-apply the previous
> > edd.S patch which is now reverted in Linus's tree.  As your BIOS
> 
> i've applied the patch to a pristine 2.6.10-rc1, so the (currently
> reverted) EDD change is still there. tell me, if the patch had to be
> applied to sth. else.
> 
> but for now i have to say, that it still oopses:
> 
> http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.10-rc1_edd-2.txt

OK, the patch below (which Linus applied to his tree yesterday) should
fix the oopses.
 
> BIOS EDD facility v0.16 2004-Jun-25, 16 devices found

but the patch to edd.S doesn't resolve that EDD believes you've got 16
devices (I would expect it to report 2, as you have only 2 disks).

Thanks for the quick testing.  Back to the drawing board though for
this second part.

Thanks,
Matt

-- 
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

===== drivers/firmware/edd.c 1.30 vs edited =====
--- 1.30/drivers/firmware/edd.c	2004-06-29 09:44:48 -05:00
+++ edited/drivers/firmware/edd.c	2004-11-13 07:56:00 -06:00
@@ -70,7 +70,7 @@
 static int edd_dev_is_type(struct edd_device *edev, const char *type);
 static struct pci_dev *edd_get_pci_dev(struct edd_device *edev);
 
-static struct edd_device *edd_devices[EDDMAXNR];
+static struct edd_device *edd_devices[EDD_MBR_SIG_MAX];
 
 #define EDD_DEVICE_ATTR(_name,_mode,_show,_test) \
 struct edd_attribute edd_attr_##_name = { 	\
@@ -728,9 +728,9 @@
 
 static inline int edd_num_devices(void)
 {
-	return min_t(unsigned char,
-		     max_t(unsigned char, edd.edd_info_nr, edd.mbr_signature_nr),
-		     max_t(unsigned char, EDD_MBR_SIG_MAX, EDDMAXNR));
+	return max_t(unsigned char,
+		     min_t(unsigned char, EDD_MBR_SIG_MAX, edd.mbr_signature_nr),
+		     min_t(unsigned char, EDDMAXNR, edd.edd_info_nr));
 }
 
 /**


* Re: Oops in 2.6.10-rc1 (solved)
  2004-11-14 21:55       ` Matt Domsch
@ 2004-11-15 12:41         ` Christian Kujau
  0 siblings, 0 replies; 21+ messages in thread
From: Christian Kujau @ 2004-11-15 12:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: Matt Domsch, Linus Torvalds, Chuck Ebbert

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matt Domsch wrote:
> 
> OK, the patch below (which Linus applied to his tree yesterday) should
> fix the oopses.
>  

so i've compiled a pristine 2.6.10-rc1-bk24 as your patch should be
included there (i've tried to apply your patch with --dry-run -> it did
not succeed, -R *would* have been successful) and finally it works!

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.10-rc1-bk24.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1-bk24

snd_ens1371 is working fine, no oops, i can load/unload the drivers, no
problems ;-)

> 
>>BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
> 
> but the patch to edd.S doesn't resolve that EDD believes you've got 16
> devices (I would expect it to report 2, as you have only 2 disks).

but still:

BIOS EDD facility v0.16 2004-Jun-25, 6 devices found

i have 2 disks now (1 ide, 1 scsi), 2 cdrom drives (ide). as you can see
from the dmesg, i have an additional ide-controller onboard:

PDC20265: chipset revision 2
PDC20265: ROM enabled at 0xdffe0000
PDC20265: 100% native mode on irq 10
PDC20265: (U)DMA Burst Bit ENABLED Primary PCI Mode Secondary PCI Mode.
    ide2: BM-DMA at 0xb400-0xb407, BIOS settings: hde:DMA, hdf:DMA
    ide3: BM-DMA at 0xb408-0xb40f, BIOS settings: hdg:pio, hdh:pio
Probing IDE interface ide2...
hde: ST320413A, ATA DISK drive
ide2 at 0xbc00-0xbc07,0xb802 on irq 10
Probing IDE interface ide3...
Probing IDE interface ide1...
Probing IDE interface ide3...
Probing IDE interface ide4...
ide4: Wait for ready failed before probe !
Probing IDE interface ide5...
ide5: Wait for ready failed before probe !

but there are only 4 ide channels on my board (Gigabyte GA7ZXR):
ide0  - with hda+hdb connected (2x cdrom)
ide1  - none
ide2  - with hde connected (ST320413A)
ide3  - none

so it's probing for a non-existent ide4+ide5! but it did that even in -bk4
times, so it's not "new behaviour", i guess.

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.9-bk4.txt

anyway, it's working now, the oops is gone, but i can do further testing
regarding this EDD issue of course.

Thanks to all involved,
Christian.
- --
BOFH excuse #195:

We only support a 28000 bps connection.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBmKP4+A7rjkF8z0wRAooxAJ9dD5QEXsEPUJjlBNvtfhtPteGoNwCfdfCA
tsYq86N5Y/bpegSXYWS+nkw=
=kFOh
-----END PGP SIGNATURE-----


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-12  0:49                                   ` Linus Torvalds
@ 2004-11-12  1:27                                     ` Christian Kujau
  0 siblings, 0 replies; 21+ messages in thread
From: Christian Kujau @ 2004-11-12  1:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Matt Domsch, Kernel Mailing List, Pekka Enberg, Greg KH

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:
> 
> This is why I take random unexplained (but pinpointed) problems so 
> seriously. If it wasn't as apparently random, we could file it under 
> "known problem" and decide to try to fix it later. As it is, it's filed 
> under "known cause", but since we don't know _why_, it might cause totally 
> different problems on another machine, and that just makes it too painful 
> for words. 

just after sending my last mail i too (re)thought about this and i'd have
begged Matt to revert the patch if it was not *only* me having this issue.

but i can see your point here and i appreciate your decision.

> So the changeset is reverted for now in the current -bk tree, and I'll 
> make a -rc2 this weekend and hope that we can stabilize for 2.6.10.

yay!

thanks,
Christian.
- --
BOFH excuse #96:

Vendor no longer supports the product
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBlBFw+A7rjkF8z0wRAld5AJ40MjbzFbVXepXkJr1tLZCvYy7z2QCeMYCe
QQyekHBs1cjuebPZTEuPZZ0=
=wwF6
-----END PGP SIGNATURE-----


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-12  0:27                                 ` Christian Kujau
@ 2004-11-12  0:49                                   ` Linus Torvalds
  2004-11-12  1:27                                     ` Christian Kujau
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2004-11-12  0:49 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Matt Domsch, Kernel Mailing List, Pekka Enberg, Greg KH



On Fri, 12 Nov 2004, Christian Kujau wrote:
> 
> nevermind then. as nobody else seems to be bothered by this i am happy with
> the workaround (CONFIG_EDD=n) and since the lkml-archives exist we could
> get back to it when it's bothering more people (n>1)

The problem with that approach is that very few people are willing to 
spend the time and effort to really try to figure out where the problem 
triggers for them. Thanks again for testing lots of kernels, and different 
configurations.

Basically, if it's a problem that only happens for a smallish percentage
of people, and an even smaller percentage of those is willing to dig down
and find it, it's not a problem we can afford to ignore. Ignoring it just
means that there will be "a few" error reports that we will either waste
time on, or (even worse) we'll dismiss as "known problems" and then
possibly miss _another_ bug.

This is why I take random unexplained (but pinpointed) problems so 
seriously. If it wasn't as apparently random, we could file it under 
"known problem" and decide to try to fix it later. As it is, it's filed 
under "known cause", but since we don't know _why_, it might cause totally 
different problems on another machine, and that just makes it too painful 
for words. 

So the changeset is reverted for now in the current -bk tree, and I'll 
make a -rc2 this weekend and hope that we can stabilize for 2.6.10.

		Linus


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-11 22:43                               ` Matt Domsch
  2004-11-11 22:53                                 ` Linus Torvalds
@ 2004-11-12  0:27                                 ` Christian Kujau
  2004-11-12  0:49                                   ` Linus Torvalds
  1 sibling, 1 reply; 21+ messages in thread
From: Christian Kujau @ 2004-11-12  0:27 UTC (permalink / raw)
  To: Matt Domsch; +Cc: Kernel Mailing List, Linus Torvalds, Pekka Enberg, Greg KH

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matt Domsch wrote:
> 
> As Linus points out, those are the magic numbers in EDD for number of
> device entries stored.  Your BIOS seems to be reporting that it has
> more devices than it does, or the EDD assembly is horked in a way I
> have not yet deciphered.

actually, my BIOS is even too old for e.g. ACPI, with latest firmware
installed. i had no issues so far with the board/bios, but perhaps this is
no longer true. however, it's still strange that this thing is only
triggered with your change and CONFIG_EDD=y.

> 
> I haven't been able to find a solution to your problem yet, and given
> some external time constraints I've got, won't be able to look into
> this again for another week or more.

nevermind then. as nobody else seems to be bothered by this i am happy with
the workaround (CONFIG_EDD=n) and since the lkml-archives exist we could
get back to it when it's bothering more people (n>1)

thank you for your time,
Christian.
- --
BOFH excuse #396:

Mail server hit by UniSpammer.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBlAOE+A7rjkF8z0wRAkyLAJ4uy4LYBHWk8Wxwr/heQRVm7VOXfwCfW30C
Zv1RdMYf1VOBEGkUnkQ+k0Q=
=f2hG
-----END PGP SIGNATURE-----


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-11 22:53                                 ` Linus Torvalds
@ 2004-11-11 22:55                                   ` Matt Domsch
  0 siblings, 0 replies; 21+ messages in thread
From: Matt Domsch @ 2004-11-11 22:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Christian Kujau, Kernel Mailing List,
	Pekka Enberg, Greg KH

On Thu, Nov 11, 2004 at 02:53:15PM -0800, Linus Torvalds wrote:
> Matt, I'll revert the EXTENDED READ change for now, then. The random
> behaviour of the problem it causes makes me really dislike this bug, and
> I'd like to release a -rc2 and start calming down the 2.6.10 stuff, but
> having known random stuff happen really disturbs me.
> 
> We can re-do it once it's more obvious why it broke..

Good plan, thanks.

-- 
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-11 22:43                               ` Matt Domsch
@ 2004-11-11 22:53                                 ` Linus Torvalds
  2004-11-11 22:55                                   ` Matt Domsch
  2004-11-12  0:27                                 ` Christian Kujau
  1 sibling, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2004-11-11 22:53 UTC (permalink / raw)
  To: Matt Domsch, Andrew Morton
  Cc: Christian Kujau, Kernel Mailing List, Pekka Enberg, Greg KH



On Thu, 11 Nov 2004, Matt Domsch wrote:
> 
> I haven't been able to find a solution to your problem yet, and given
> some external time constraints I've got, won't be able to look into
> this again for another week or more.

Matt, I'll revert the EXTENDED READ change for now, then. The random
behaviour of the problem it causes makes me really dislike this bug, and
I'd like to release a -rc2 and start calming down the 2.6.10 stuff, but
having known random stuff happen really disturbs me.

We can re-do it once it's more obvious why it broke..

		Linus


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-09 23:40                             ` Matt Domsch
  2004-11-10  0:21                               ` Christian Kujau
@ 2004-11-11 22:43                               ` Matt Domsch
  2004-11-11 22:53                                 ` Linus Torvalds
  2004-11-12  0:27                                 ` Christian Kujau
  1 sibling, 2 replies; 21+ messages in thread
From: Matt Domsch @ 2004-11-11 22:43 UTC (permalink / raw)
  To: Christian Kujau
  Cc: Kernel Mailing List, Linus Torvalds, Pekka Enberg, Greg KH

On Tue, Nov 09, 2004 at 05:40:54PM -0600, Matt Domsch wrote:
> OK, thanks, that helps.  From the diff of those dmesg:
> 
> -BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
> +BIOS EDD facility v0.16 2004-Jun-25, 6 devices found

As Linus points out, those are the magic numbers in EDD for number of
device entries stored.  Your BIOS seems to be reporting that it has
more devices than it does, or the EDD assembly is horked in a way I
have not yet deciphered.
 
> I'll review the assembly again to see where I could have miscounted,
> and see how that may affect the EDD sysfs exports.  Likely no answer
> from me before tomorrow though.

I haven't been able to find a solution to your problem yet, and given
some external time constraints I've got, won't be able to look into
this again for another week or more.

Thanks,
Matt

-- 
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-10  0:21                               ` Christian Kujau
@ 2004-11-10  1:01                                 ` Linus Torvalds
  0 siblings, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2004-11-10  1:01 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Kernel Mailing List, Matt Domsch, Pekka Enberg, Greg KH



On Wed, 10 Nov 2004, Christian Kujau wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Matt Domsch wrote:
> > 
> > -BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
> > +BIOS EDD facility v0.16 2004-Jun-25, 6 devices found
> > 
> > So with the latest EDD patch noted above, it's finding more disks than
> > before.  How many disks do you actually have in the system?
> 
> i have one scsi disk (sda) and two atapi cdrom drives:

Interestingly, "16" is also EDD_MBR_SIG_MAX, so my suspicion is that it 
overflowed some EDD data area. edd_num_devices() (which is what reports 
the above number) does

	min_t(unsigned char,
		max_t(unsigned char, edd.edd_info_nr, edd.mbr_signature_nr),
		max_t(unsigned char, EDD_MBR_SIG_MAX, EDDMAXNR));

where EDDMAXNR is 6, and EDD_MBR_SIG_MAX is the afore-mentioned 16, so we 
know that either edd.edd_info_nr or edd.mbr_signature_nr is actually 
_at least_ 16.

Which is clearly totally bogus. In fact, even your old "6 devices found" 
thing looks suspiciously bogus.

> PS: do you have *any* idea how this could be related to the snd-es1371
> driver (which is producing the oops then)?

I bet it's overwriting some array, and just corrupting memory after it. 
For example, the edd_info[] array only has 6 entries, and for example, the 
EDD_MBR_SIG_BUFFER is quite close to where we save the E820MAP memory map 
at bootup, so if something stomps on that, the kernel might be confused 
about where PCI memory can be allocated or similar. Or it might have 
overwritten some ACPI memory data, who knows.

			Linus
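
The expression quoted above can be tried out in user space. A minimal sketch, assuming min_uc()/max_uc() as hypothetical stand-ins for the kernel's min_t()/max_t(unsigned char, ...) macros and passing the two counters in as arguments rather than reading them from the edd structure:

```c
#include <assert.h>

enum { EDDMAXNR = 6, EDD_MBR_SIG_MAX = 16 };

/* Hypothetical stand-ins for the kernel's min_t/max_t(unsigned char, ...). */
static unsigned char min_uc(unsigned char a, unsigned char b) { return a < b ? a : b; }
static unsigned char max_uc(unsigned char a, unsigned char b) { return a > b ? a : b; }

/* Same shape as the pre-patch edd_num_devices() quoted in the message. */
static int old_num_devices(unsigned char edd_info_nr,
                           unsigned char mbr_signature_nr)
{
    return min_uc(max_uc(edd_info_nr, mbr_signature_nr),
                  max_uc(EDD_MBR_SIG_MAX, EDDMAXNR));
}
```

The result is capped at 16, so a printed count of 16 only says that the larger of the two counters was 16 or more. Any result above 6 already exceeds a 6-entry array.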


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-09 23:40                             ` Matt Domsch
@ 2004-11-10  0:21                               ` Christian Kujau
  2004-11-10  1:01                                 ` Linus Torvalds
  2004-11-11 22:43                               ` Matt Domsch
  1 sibling, 1 reply; 21+ messages in thread
From: Christian Kujau @ 2004-11-10  0:21 UTC (permalink / raw)
  To: Kernel Mailing List; +Cc: Matt Domsch, Linus Torvalds, Pekka Enberg, Greg KH

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matt Domsch wrote:
> 
> -BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
> +BIOS EDD facility v0.16 2004-Jun-25, 6 devices found
> 
> So with the latest EDD patch noted above, it's finding more disks than
> before.  How many disks do you actually have in the system?

i have one scsi disk (sda) and two atapi cdrom drives:

hda: CRD-8483B, ATAPI CD/DVD-ROM drive
hdb: AOPEN CD-RW CRW3248 1.17 20020620, ATAPI CD/DVD-ROM drive
...
SCSI device sda: 35548320 512-byte hdwr sectors (18201 MB)
SCSI device sda: drive cache: write back

the "scsi0 : sym-2.1.18k" is on a pci card, the atapi devices are
connected onboard. if it helps:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/lspci-v.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/lspci-vv.txt

> I'll review the assembly again to see where I could have miscounted,
> and see how that may affect the EDD sysfs exports.  Likely no answer
> from me before tomorrow though.

that's ok, real life kicks in here too...

thanks,
Christian.

PS: do you have *any* idea how this could be related to the snd-es1371
driver (which is producing the oops then)?
- --
BOFH excuse #449:

greenpeace free'd the mallocs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBkV75+A7rjkF8z0wRAl67AJ9P+SF1WfRe7r2zoF9D/b/fyDeD0QCfe6/f
Uxt5DVlb/IzW9VSWuFJqLlI=
=Hpg9
-----END PGP SIGNATURE-----


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-09 23:30                           ` Christian Kujau
@ 2004-11-09 23:40                             ` Matt Domsch
  2004-11-10  0:21                               ` Christian Kujau
  2004-11-11 22:43                               ` Matt Domsch
  0 siblings, 2 replies; 21+ messages in thread
From: Matt Domsch @ 2004-11-09 23:40 UTC (permalink / raw)
  To: Christian Kujau
  Cc: Kernel Mailing List, Linus Torvalds, Pekka Enberg, Greg KH

On Wed, Nov 10, 2004 at 12:30:21AM +0100, Christian Kujau wrote:
> > 	ChangeSet@1.2000.5.108, 2004-10-20 08:36:22-07:00, Matt_Domsch@dell.com
> > 	  [PATCH] EDD: use EXTENDED READ command, add CONFIG_EDD_SKIP_MBR
> 
> and i say: good catch! that does it!
> 
> i did "bk undo -a1.2000.5.108" on a current tree, booting this still gives
> an oops:
> 
> http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.9_a1.2000.5.108.txt
> 
> excluding this single ChangeSet with "bk undo -r1.2118" does work with
> CONFIG_EDD=y:
> 
> http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.9_r1.2000.5.108.txt

OK, thanks, that helps.  From the diff of those dmesg:

-BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
+BIOS EDD facility v0.16 2004-Jun-25, 6 devices found

So with the latest EDD patch noted above, it's finding more disks than
before.  How many disks do you actually have in the system?

I'll review the assembly again to see where I could have miscounted,
and see how that may affect the EDD sysfs exports.  Likely no answer
from me before tomorrow though.

Thanks,
Matt

-- 
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-09 18:53                         ` Linus Torvalds
@ 2004-11-09 23:30                           ` Christian Kujau
  2004-11-09 23:40                             ` Matt Domsch
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Kujau @ 2004-11-09 23:30 UTC (permalink / raw)
  To: Kernel Mailing List; +Cc: Linus Torvalds, Pekka Enberg, Greg KH, Matt_Domsch

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds schrieb:
> 
> Very strange. There's not a lot of stuff that affects EDD directly that I 
> can see, but there is:
> 
> 	ChangeSet@1.2000.5.108, 2004-10-20 08:36:22-07:00, Matt_Domsch@dell.com
> 	  [PATCH] EDD: use EXTENDED READ command, add CONFIG_EDD_SKIP_MBR

and i say: good catch! that does it!

i did "bk undo -a1.2000.5.108" on a current tree, booting this still gives
an oops:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.9_a1.2000.5.108.txt

excluding this single ChangeSet with "bk undo -r1.2118" does work with
CONFIG_EDD=y:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.9_r1.2000.5.108.txt

(the filename here should really read "...r1.2118.txt" because that was
the number of the changeset representing the above [PATCH] *after* i did
"bk undo -a1.2000.5.108". right?)

> However, even that would just change the EDD _data_, it doesn't change the 
> code that actually runs in the kernel. And I _really_ don't see what EDD 
> has got to do with anything.

understanding a lot less of all this than you guys i also wonder why only
this single driver broke. i've always loaded a couple of drivers here,
maybe i could play around a bit e.g. CONFIG_SND_ENS1371=y instead of =m or
see if other hw drivers break too.

> I wonder if the EDD stuff corrupts the sysfs tree or something, and you're
> just seeing some strange kobject interference.

do userspace tools matter here? there is "sysfsutils-1.1.0-1" and
"libsysfs1-1.1.0-1" (both debian/unstable) installed here, /sys is mounted:

   sysfs on /sys type sysfs (rw)

> Christian, finding which change triggers this would be very good indeed. I 
> think the merge with greg is still a good place to start, although even 

i'll look again over the -bk magic you told me about and see what it gives.

thanks so far to all involved here, i really enjoyed "working" with you.
first class support at no charge...it's just incredible.

you guys rock,
Christian.
- --
BOFH excuse #112:

The monitor is plugged into the serial port
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBkVMN+A7rjkF8z0wRAqu4AKCtxZxE2spjZGgSnxTWzTTB0CWCkACgi2f3
RmHQXbnkcI1OEcLORhP1dmA=
=5Dot
-----END PGP SIGNATURE-----


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-09 17:26                       ` Oops in 2.6.10-rc1 (almost solved) Christian Kujau
@ 2004-11-09 18:53                         ` Linus Torvalds
  2004-11-09 23:30                           ` Christian Kujau
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2004-11-09 18:53 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Kernel Mailing List, Pekka Enberg, Greg KH, Matt_Domsch



On Tue, 9 Nov 2004, Christian Kujau wrote:
> 
> at least i finally found the "bad" .config option: it's CONFIG_EDD.
> when i disable this option (and only this option: i can use the same
> .config as usual, only disabling this very option. diff is my witness.)
> i can boot a current (!) 2.6.10-rc1-bk and get a working snd-ens1371!

Very strange. There's not a lot of stuff that affects EDD directly that I 
can see, but there is:

	ChangeSet@1.2000.5.108, 2004-10-20 08:36:22-07:00, Matt_Domsch@dell.com
	  [PATCH] EDD: use EXTENDED READ command, add CONFIG_EDD_SKIP_MBR
	  
	  Some controller BIOSes have problems with the legacy int13 fn02 READ
	  SECTORS command.  int13 fn42 EXTENDED READ is used in preference by most
	  boot loaders today, so lets use that.  If EXTENDED READ fails or isn't
	  supported, fall back to READ SECTORS.
	  
	  This hopefully resolves the three reports of BIOSes which would either
	  long-pause (30+ seconds) or hang completely on the legacy READ SECTORS
	  command.
	  
	  This also adds CONFIG_EDD_SKIP_MBR to eliminate reading the MBR on each
	  BIOS-presented disk, in case there are further problems in this area.
	  
	  Signed-off-by: Matt Domsch <Matt_Domsch@dell.com>
	  Signed-off-by: Andrew Morton <akpm@osdl.org>
	  Signed-off-by: Linus Torvalds <torvalds@osdl.org>

which might fit the bill.
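
The fallback order that changeset describes can be sketched roughly as below. This is only an illustration of "try EXTENDED READ, then fall back to READ SECTORS"; the real code is boot-time assembly, and `ReadError`, `read_mbr` and the two callables are hypothetical stand-ins, not real firmware interfaces.

```python
class ReadError(Exception):
    """Raised by a (simulated) BIOS read service that fails or is unsupported."""


def read_mbr(disk, extended_read, read_sectors):
    """Try EXTENDED READ (int13 fn42) first; on failure use READ SECTORS (fn02).

    Both callables are hypothetical stand-ins that take a BIOS disk number
    and return the sector bytes or raise ReadError.
    """
    try:
        return extended_read(disk), "fn42"
    except ReadError:
        return read_sectors(disk), "fn02"


def failing_extended_read(disk):
    # Simulates a BIOS without EXTENDED READ support.
    raise ReadError("EXTENDED READ not supported")


def legacy_read(disk):
    # Simulates a successful legacy read of one 512-byte sector.
    return b"\x00" * 510 + b"\x55\xaa"


data, used = read_mbr(0x80, failing_extended_read, legacy_read)
print(used, len(data))  # fn02 512
```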

However, even that would just change the EDD _data_, it doesn't change the 
code that actually runs in the kernel. And I _really_ don't see what EDD 
has got to do with anything.

I wonder if the EDD stuff corrupts the sysfs tree or something, and you're
just seeing some strange kobject interference. Greg, you'd likely still be
on the line for that one.

Christian, finding which change triggers this would be very good indeed. I 
think the merge with greg is still a good place to start, although even 
just doing the snapshot trees (from _before_ -rc1: ie the patches in 
/pub/linux/kernel/v2.6/snapshots/old: patch-2.6.9-bk*.gz) is actually also 
a good way to narrow things down.
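
Narrowing down over the snapshots amounts to a binary search. A minimal sketch, assuming the snapshots are ordered oldest to newest and that every snapshot after the first bad one is also bad; the snapshot names and the `is_bad` check below are made up for illustration (in reality `is_bad` means "build, boot, and see whether it oopses").

```python
def first_bad(snapshots, is_bad):
    """Return the first snapshot for which is_bad() is true, or None.

    Assumes snapshots are ordered oldest to newest and that once a snapshot
    is bad, all later ones are bad too.
    """
    if not snapshots or not is_bad(snapshots[-1]):
        return None
    lo, hi = 0, len(snapshots) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(snapshots[mid]):
            hi = mid          # first bad one is at mid or earlier
        else:
            lo = mid + 1      # first bad one is after mid
    return snapshots[lo]


# Hypothetical list of ten snapshots; the real number of -bk snapshots may differ.
snapshots = ["2.6.9-bk%d" % n for n in range(1, 11)]


def is_bad(name):
    # Pretend the oops first appears in bk7 (an invented threshold).
    return int(name.rsplit("bk", 1)[1]) >= 7


print(first_bad(snapshots, is_bad))  # 2.6.9-bk7
```

With ten snapshots this needs about four test boots instead of ten.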

		Linus


* Re: Oops in 2.6.10-rc1 (almost solved)
  2004-11-09 12:33                     ` Christian Kujau
@ 2004-11-09 17:26                       ` Christian Kujau
  2004-11-09 18:53                         ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Kujau @ 2004-11-09 17:26 UTC (permalink / raw)
  To: Christian Kujau, Kernel Mailing List
  Cc: Pekka Enberg, Linus Torvalds, Greg KH

On Tue, 09 Nov 2004 13:33:20 +0100, Christian Kujau wrote
> i've disabled *only* CONFIG_PREEMPT in another .config but it 
> still oopses:

at least i finally found the "bad" .config option: it's CONFIG_EDD.
when i disable this option (and only this option: i can use the same
.config as usual, only disabling this very option. diff is my witness.)
i can boot a current (!) 2.6.10-rc1-bk and get a working snd-ens1371!

i'll test with CONFIG_EDD=m later on. here a short summary:

2.6.9         CONFIG_EDD=y   - OK
2.6.10-rc1-bk CONFIG_EDD=y   - OOPS!
2.6.10-rc1-bk CONFIG_EDD=n   - OK
2.6.10-rc1-bk CONFIG_EDD=m   - ??

yes, i'll keep looking for the ChangeSet, but now i (and perhaps you
too, if you are as curious as me) know where to look.
i must admit that i was not entirely sure why i wanted to enable
CONFIG_EDD at all. if i had never enabled it, it'd have saved me a week
of bug chasing, but learning is fun, too.

thanks,
Christian.
-- 
BOFH excuse #209:

Only people with names beginning with 'A' are getting mail this week (a
la Microsoft)


end of thread, other threads:[~2004-11-15 12:41 UTC | newest]

Thread overview: 21+ messages
2004-11-13  3:45 Oops in 2.6.10-rc1 (almost solved) Chuck Ebbert
2004-11-13 14:28 ` Matt Domsch
2004-11-13 18:55   ` Matt Domsch
2004-11-14  2:58   ` Matt Domsch
2004-11-14  4:43     ` Linus Torvalds
2004-11-14 11:45     ` Christian
2004-11-14 20:02     ` Christian Kujau
2004-11-14 21:55       ` Matt Domsch
2004-11-15 12:41         ` Oops in 2.6.10-rc1 (solved) Christian Kujau
  -- strict thread matches above, loose matches on Subject: below --
2004-10-28 13:12 Oops in 2.6.10-rc1 Christian
2004-11-07 16:57 ` Linus Torvalds
2004-11-07 18:31   ` Christian Kujau
2004-11-07 23:45     ` Christian Kujau
2004-11-08  1:16       ` Linus Torvalds
2004-11-08 13:01         ` Christian Kujau
2004-11-08 18:13           ` Linus Torvalds
2004-11-08 20:59             ` Christian Kujau
2004-11-08 23:49               ` Christian Kujau
2004-11-09  1:31                 ` Christian Kujau
2004-11-09  7:40                   ` Pekka Enberg
2004-11-09 12:33                     ` Christian Kujau
2004-11-09 17:26                       ` Oops in 2.6.10-rc1 (almost solved) Christian Kujau
2004-11-09 18:53                         ` Linus Torvalds
2004-11-09 23:30                           ` Christian Kujau
2004-11-09 23:40                             ` Matt Domsch
2004-11-10  0:21                               ` Christian Kujau
2004-11-10  1:01                                 ` Linus Torvalds
2004-11-11 22:43                               ` Matt Domsch
2004-11-11 22:53                                 ` Linus Torvalds
2004-11-11 22:55                                   ` Matt Domsch
2004-11-12  0:27                                 ` Christian Kujau
2004-11-12  0:49                                   ` Linus Torvalds
2004-11-12  1:27                                     ` Christian Kujau
