* kernel update and dmraid causing grub errors
@ 2010-11-01 22:27 David C. Rankin
  2010-11-03 12:04 ` Heinz Mauelshagen
  0 siblings, 1 reply; 9+ messages in thread
From: David C. Rankin @ 2010-11-01 22:27 UTC (permalink / raw)
  To: dm-devel

dmraid devs,

	Over the past 8-9 months, I have had numerous dmraid-related boot failures with
the past 6-8 kernels. It seems like a Russian-roulette type of problem: some
kernels work with dmraid, some cause grub errors. The problem is most acute on
an MSI SLI Platinum based board (MS-7374), Phenom X4 (9850), with the following
PCI bus config:

[15:48 archangel:/home/david/bugs/aa] # lspci
00:00.0 RAM memory: nVidia Corporation MCP78S [GeForce 8200] Memory Controller
(rev a2)
00:01.0 ISA bridge: nVidia Corporation MCP78S [GeForce 8200] LPC Bridge (rev a2)
00:01.1 SMBus: nVidia Corporation MCP78S [GeForce 8200] SMBus (rev a1)
00:01.2 RAM memory: nVidia Corporation MCP78S [GeForce 8200] Memory Controller
(rev a1)
00:01.3 Co-processor: nVidia Corporation MCP78S [GeForce 8200] Co-Processor (rev a2)
00:01.4 RAM memory: nVidia Corporation MCP78S [GeForce 8200] Memory Controller
(rev a1)
00:02.0 USB Controller: nVidia Corporation MCP78S [GeForce 8200] OHCI USB 1.1
Controller (rev a1)
00:02.1 USB Controller: nVidia Corporation MCP78S [GeForce 8200] EHCI USB 2.0
Controller (rev a1)
00:04.0 USB Controller: nVidia Corporation MCP78S [GeForce 8200] OHCI USB 1.1
Controller (rev a1)
00:04.1 USB Controller: nVidia Corporation MCP78S [GeForce 8200] EHCI USB 2.0
Controller (rev a1)
00:06.0 IDE interface: nVidia Corporation MCP78S [GeForce 8200] IDE (rev a1)
00:07.0 Audio device: nVidia Corporation MCP72XE/MCP72P/MCP78U/MCP78S High
Definition Audio (rev a1)
00:08.0 PCI bridge: nVidia Corporation MCP78S [GeForce 8200] PCI Bridge (rev a1)
00:09.0 RAID bus controller: nVidia Corporation MCP78S [GeForce 8200] SATA
Controller (RAID mode) (rev a2)
00:0a.0 Ethernet controller: nVidia Corporation MCP77 Ethernet (rev a2)
00:10.0 PCI bridge: nVidia Corporation MCP78S [GeForce 8200] PCI Express Bridge
(rev a1)
00:12.0 PCI bridge: nVidia Corporation MCP78S [GeForce 8200] PCI Express Bridge
(rev a1)
00:13.0 PCI bridge: nVidia Corporation MCP78S [GeForce 8200] PCI Bridge (rev a1)
00:14.0 PCI bridge: nVidia Corporation MCP78S [GeForce 8200] PCI Bridge (rev a1)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64,
Sempron] HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64,
Sempron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64,
Sempron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64,
Sempron] Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64,
Sempron] Link Control
01:06.0 Serial controller: 3Com Corp, Modem Division 56K FaxModem Model 5610
(rev 01)
01:09.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6306/7/8 [Fire II(M)]
IEEE 1394 OHCI Controller (rev c0)
02:00.0 VGA compatible controller: nVidia Corporation G92 [GeForce 8800 GT] (rev a2)
04:00.0 SATA controller: JMicron Technology Corp. JMB362/JMB363 Serial ATA
Controller (rev 03)
04:00.1 IDE interface: JMicron Technology Corp. JMB362/JMB363 Serial ATA
Controller (rev 03)

full dmidecode information at:
  http://www.3111skyline.com/dl/Archlinux/bugs/aa-dmidecode.txt

	Booting the current Arch Linux kernel (2.6.35.8-1) fails and the boot hangs at
the very start. The kernel line I use hasn't changed in a long time:

  kernel /vmlinuz root=/dev/mapper/nvidia_baaccajap5 ro vga=0x31a

	Booting first stopped with the following error:

Booting 'Arch Linux on Archangel'

root (hd1,5)
  Filesystem type is ext2fs, Partition type 0x83
Kernel /vmlinuz26 root=/dev/mapper/nvidia_baaccajap5 ro vga=794

Error 24: Attempt to access block outside partition

Press any key to continue...

	Upgrading to device-mapper-2.02.75-1 completely changes the error to:

Error 5: Partition table invalid or corrupt

	Rebooting to 2.6.35.7-1, or 2.6.32.25-1 (the Arch LTS kernel) works just fine.
So the problem is not a partition or partition table problem. The Arch Linux
developer (Tobias Powalowski) has referred me here as the problem isn't a kernel
problem, but something strange that is happening with dmraid.

	The only guess I have is that it is a dmraid/GeForce controller issue that is
triggered when dmraid loads under certain circumstances.

	This box has 2 dmraid arrays:

[17:15 archangel:/home/david/bugs/aa] # dmraid -r
/dev/sdd: nvidia, "nvidia_baaccaja", mirror, ok, 1465149166 sectors, data@ 0
/dev/sda: nvidia, "nvidia_fdaacfde", mirror, ok, 976773166 sectors, data@ 0
/dev/sdb: nvidia, "nvidia_baaccaja", mirror, ok, 1465149166 sectors, data@ 0
/dev/sdc: nvidia, "nvidia_fdaacfde", mirror, ok, 976773166 sectors, data@ 0

[17:15 archangel:/home/david/bugs/aa] # dmraid -s
*** Active Set
name   : nvidia_baaccaja
size   : 1465149056
stride : 128
type   : mirror
status : ok
subsets: 0
devs   : 2
spares : 0
*** Active Set
name   : nvidia_fdaacfde
size   : 976773120
stride : 128
type   : mirror
status : ok
subsets: 0
devs   : 2
spares : 0
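
The two listings above can be cross-checked with a little arithmetic. The sketch below is an editorial illustration, not dmraid's own calculation: it assumes 512-byte sectors, and `round_down` is a hypothetical helper. It converts each set size to GiB and checks whether the set size equals the per-disk sector count rounded down to a multiple of the stride.

```python
# Sector counts copied from the "dmraid -r" and "dmraid -s" output above.
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors
STRIDE = 128        # the "stride" field reported by dmraid -s

raid_sets = {
    "nvidia_baaccaja": {"member_sectors": 1465149166, "set_sectors": 1465149056},
    "nvidia_fdaacfde": {"member_sectors": 976773166, "set_sectors": 976773120},
}


def round_down(sectors, multiple):
    """Round a sector count down to the nearest multiple (hypothetical helper)."""
    return sectors - (sectors % multiple)


for name, s in raid_sets.items():
    unused = s["member_sectors"] - s["set_sectors"]
    aligned = round_down(s["member_sectors"], STRIDE)
    gib = s["set_sectors"] * SECTOR_BYTES / 2**30
    print(f"{name}: {gib:.1f} GiB, {unused} sectors unused per disk, "
          f"stride-aligned size matches: {aligned == s['set_sectors']}")
```

For both sets the check holds, which is consistent with the set size being trimmed to a stride boundary, though the numbers alone do not show how dmraid arrives at it.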

	All disks check out fine with smartctl, so it isn't a disk-hardware problem.
The detailed information on the GeForce controller (lspci -vv) is:

00:09.0 RAID bus controller: nVidia Corporation MCP78S [GeForce 8200] SATA
Controller (RAID mode) (rev a2)
        Subsystem: Micro-Star International Co., Ltd. Device 7374
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR- INTx-
        Latency: 0 (750ns min, 250ns max)
        Interrupt: pin A routed to IRQ 28
        Region 0: I/O ports at b080 [size=8]
        Region 1: I/O ports at b000 [size=4]
        Region 2: I/O ports at ac00 [size=8]
        Region 3: I/O ports at a880 [size=4]
        Region 4: I/O ports at a800 [size=16]
        Region 5: Memory at f9e76000 (32-bit, non-prefetchable) [size=8K]
        Capabilities: [44] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [8c] SATA HBA v1.0 InCfgSpace
        Capabilities: [b0] MSI: Enable+ Count=1/8 Maskable- 64bit+
                Address: 00000000fee0f00c  Data: 4191
        Capabilities: [ec] HyperTransport: MSI Mapping Enable+ Fixed+
        Kernel driver in use: ahci
        Kernel modules: ahci


    Basically, I'm stumped here. Nothing has changed with this box in over a
year (same grub menu.lst, same hardware). The only oddity is that 4 of the
last 6 kernels or so have failed to boot with this weird grub error, which has
nothing to do with grub itself (because it boots all other kernels fine), but is
something that results from dmraid and the way it gets initialized (which I'm
clueless about).

    Let me know what you think and what data or testing you want from me;
I'll be happy to do it. I last filed this bug with Arch against 2.6.35-1.
The problem was never fixed, only "solved" by upgrading to the next testing
kernel, so the actual cause was never found. The URL of the closed
report is:

https://bugs.archlinux.org/task/20918?

    Thanks for any ideas or help you can give.

-- 
David C. Rankin, J.D.,P.E.
Rankin Law Firm, PLLC
510 Ochiltree Street
Nacogdoches, Texas 75961
Telephone: (936) 715-9333
Facsimile: (936) 715-9339
www.rankinlawfirm.com


* Re: kernel update and dmraid causing grub errors
  2010-11-01 22:27 kernel update and dmraid causing grub errors David C. Rankin
@ 2010-11-03 12:04 ` Heinz Mauelshagen
  2010-11-03 22:19   ` David C. Rankin
  2010-11-03 22:57   ` David C. Rankin
  0 siblings, 2 replies; 9+ messages in thread
From: Heinz Mauelshagen @ 2010-11-03 12:04 UTC (permalink / raw)
  To: device-mapper development


Hi David,

because you're able to access your config fine with some Arch LTS
kernels, it doesn't make sense to analyze your metadata upfront. The
following reasons may cause the failures:

- initramfs issue: ATARAID mappings not being activated properly via dmraid

- drivers needed to access the mappings are missing

- host protected area changes going together with the kernel changes
  (e.g. the "Error 24: Attempt to access block outside partition");
  try the libata.ignore_hpa kernel parameter described
  in the kernel source Documentation/kernel-parameters.txt
  to test for this one
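
One way to look for a host protected area effect is to compare the sector count the running kernel reports for each disk across boots. The following is a minimal sketch, assuming sysfs is mounted at /sys and that `/sys/block/<dev>/size` is expressed in 512-byte sectors; the function names are hypothetical and not part of dmraid or libata.

```python
from pathlib import Path


def visible_sectors(dev, sysfs_root="/sys"):
    """Return the size of a block device, in 512-byte sectors, as the running kernel sees it."""
    return int((Path(sysfs_root) / "block" / dev / "size").read_text().strip())


def snapshot(devs, sysfs_root="/sys"):
    """Record the visible size of each listed device, e.g. ["sda", "sdb"]."""
    return {dev: visible_sectors(dev, sysfs_root) for dev in devs}


def changed(before, after):
    """Return the devices whose visible size differs between two snapshots."""
    return {
        dev: (before[dev], after[dev])
        for dev in before
        if dev in after and before[dev] != after[dev]
    }
```

Taking one snapshot after booting a working kernel with libata.ignore_hpa=0 and another with libata.ignore_hpa=1, then passing both to `changed`, would show whether any member disk has a host protected area at all. An empty result would suggest the HPA is not involved.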

FYI: in general dmraid doesn't rely on a particular controller, just
metadata signatures it discovers. You could attach the disks to some
other SATA controller and still access your RAID sets.

Regards,
Heinz

On Mon, 2010-11-01 at 17:27 -0500, David C. Rankin wrote:
> full dmidecode information at:
>   http://www.3111skyline.com/dl/Archlinux/bugs/aa-dmidecode.txt

Not accessible.



* Re: kernel update and dmraid causing grub errors
  2010-11-03 12:04 ` Heinz Mauelshagen
@ 2010-11-03 22:19   ` David C. Rankin
  2010-11-03 22:57   ` David C. Rankin
  1 sibling, 0 replies; 9+ messages in thread
From: David C. Rankin @ 2010-11-03 22:19 UTC (permalink / raw)
  To: device-mapper development

On 11/03/2010 07:04 AM, Heinz Mauelshagen wrote:
> Hi David,
> 
> because you're able to access your config fine with some arch LTS
> kernels, it doesn't make sense to analyze your metadata upfront and the
> following reasons may cause the failures:
> 
> - initramfs issue not activating ATARAID mappings properly via dmraid
> 
> - drivers missing to access the mappings
> 
> - host protected area changes going together with the kernel changes
>   (eg. the "Error 24: Attempt to access block outside partition");
>   try the libata.ignore_hpa kernel parameter described
>   in the kernel source Documentation/kernel-parameters.txt
>   to test for this one
> 
> FYI: in general dmraid doesn't rely on a particular controller, just
> metadata signatures it discovers. You could attach the disks to some
> other SATA controller and still access your RAID sets.
> 
> Regards,
> Heinz

Heinz,

	Thank you for your reply. The LTS kernel is just Arch's long-term support
kernel (2.6.32.25-2). I can boot the SuSE kernels fine, and the immediately
prior Arch kernel (2.6.35.7-1) works just fine as well.

	I will try the libata.ignore_hpa setting and let you know and I'll pass this
info to the Arch devs to see if they can help as well.

	This has really been a strange issue. It seems like every other kernel (1 out
of every 2 new releases) just won't boot and stops at the very beginning with
either grub Error 24: or Error 5: (Nothing has changed otherwise and simply
booting another kernel works fine). I'll keep plugging away at trying to find an
answer.

	If you can think of any other tests or helpful diagnostics, please let me know.
Since this occurs at the very beginning of boot, there is no logging which makes
this tough to figure out. (not to mention dmraid is just voodoo to me. ;-)


* Re: kernel update and dmraid causing grub errors
  2010-11-03 12:04 ` Heinz Mauelshagen
  2010-11-03 22:19   ` David C. Rankin
@ 2010-11-03 22:57   ` David C. Rankin
  2010-11-04 12:32     ` Heinz Mauelshagen
  1 sibling, 1 reply; 9+ messages in thread
From: David C. Rankin @ 2010-11-03 22:57 UTC (permalink / raw)
  To: device-mapper development

On 11/03/2010 07:04 AM, Heinz Mauelshagen wrote:
> - host protected area changes going together with the kernel changes
>   (eg. the "Error 24: Attempt to access block outside partition");
>   try the libata.ignore_hpa kernel parameter described
>   in the kernel source Documentation/kernel-parameters.txt
>   to test for this one

Heinz,

	I have tested with both libata.ignore_hpa=0 (default) and libata.ignore_hpa=1
(ignore limits, using full disk), but there is no change. I still get grub Error
24: (this is also with a newer 2.6.36-3 kernel). So I'm stumped again. If you
have any other ideas, please let me know. I'm happy to test on this end.

	I have gone ahead and made the metadata available in case it will help. I have
also included the fdisk info as well:

  http://www.3111skyline.com/dl/bugs/dmraid/dmraid.nvidia/

  http://www.3111skyline.com/dl/bugs/dmraid/fdisk-l-info-20100817.txt

Thank you for your help.


* Re: kernel update and dmraid causing grub errors
  2010-11-03 22:57   ` David C. Rankin
@ 2010-11-04 12:32     ` Heinz Mauelshagen
  2010-11-04 16:17       ` David C. Rankin
  0 siblings, 1 reply; 9+ messages in thread
From: Heinz Mauelshagen @ 2010-11-04 12:32 UTC (permalink / raw)
  To: device-mapper development

On Wed, 2010-11-03 at 17:57 -0500, David C. Rankin wrote:
> On 11/03/2010 07:04 AM, Heinz Mauelshagen wrote:
> > - host protected area changes going together with the kernel changes
> >   (eg. the "Error 24: Attempt to access block outside partition");
> >   try the libata.ignore_hpa kernel parameter described
> >   in the kernel source Documentation/kernel-parameters.txt
> >   to test for this one
> 
> Heinz,
> 
> 	I have tested with both libata.ignore_hpa=0 (default) and libata.ignore_hpa=1
> (ignore limits, using full disk), but there is no change. I still get grub Error
> 24: (this is also with a newer 2.6.36-3 kernel). So I'm stumped again. If you
> have any other ideas, please let me know. I'm happy to test on this end.

I overlooked that you said it's a grub error, which occurs before any kernel
argument is processed. Hmm, this could then be a grub flaw: identifying the
disk size wrong or getting an offset wrong to load data from. Did grub change
when the kernel was installed?

Heinz

> 
> 	I have gone ahead and made the metadata available in case it will help. I have
> also included the fdisk info as well:
> 
>   http://www.3111skyline.com/dl/bugs/dmraid/dmraid.nvidia/
> 
>   http://www.3111skyline.com/dl/bugs/dmraid/fdisk-l-info-20100817.txt
> 
> Thank you for your help.
> 


* Re: kernel update and dmraid causing grub errors
  2010-11-04 12:32     ` Heinz Mauelshagen
@ 2010-11-04 16:17       ` David C. Rankin
  2010-11-09 17:55         ` David C. Rankin
  0 siblings, 1 reply; 9+ messages in thread
From: David C. Rankin @ 2010-11-04 16:17 UTC (permalink / raw)
  To: device-mapper development

On 11/04/2010 07:32 AM, Heinz Mauelshagen wrote:
> I overlooked you said it's a grub error, which occurs before any kernel
> argument is being processed. Hmm, this could be a grub flaw then
> identifying the disk size wrong or getting an offset wrong to load data
> from. Did grub change with the kernel installed?
> 

Hi Heinz,

	No, grub is (grub-0.97-17) and it hasn't changed since April 25, 2010. So
whatever is happening, isn't due to a grub change. Some of the Arch devs think
it might be a kernel issue. Last night, I posted the issue to the kernel list at
kernel.org and we will see what response we get back. The post to kernel.org was
pretty much the complete history of the issue, so I'll include the additional
information posted to the kernel list below for completeness:

Hardware:

  lspci -vv info

    http://www.3111skyline.com/dl/bugs/dmraid/lspci-vv.txt

  dmidecode info

    http://www.3111skyline.com/dl/bugs/Archlinux/aa-dmidecode.txt

  dmraid metadata and fdisk info

    http://www.3111skyline.com/dl/bugs/dmraid/dmraid.nvidia/

    http://www.3111skyline.com/dl/bugs/dmraid/fdisk-l-info-20100817.txt

Thoughts from the Arch Devs:

  post 1:

    These errors are semi-random, they probably depend on where the kernel and
initramfs files are physically located in the file system.

    Grub (and all other bootloaders for that matter) use BIOS calls to access
files on the hard drive - they rely on the BIOS (and in your case, the jmicron
dmraid BIOS) for file access. This access seems to fail for certain areas on
your file system.

  post 2:

    Aah, it just hit me: the problem may in fact be fairly random in that it may
depend on where the initramfs is stored.  So, if the BIOS is broken, you may be
lucky to be able to boot under one kernel, and the next upgrade places things in
a place on disk where the BIOS bug kicks in, and you're screwed.  So it has
nothing to do with the kernel version, grub or dmraid in this case.  Do I
understand this correctly?

  post 3:

    I guess there has been something changed in the kernel26 2.6.35.8 and above
which doesn't work with your BIOS or your RAID. Either this is a bug in kernel26
2.6.35.8 and newer or it is not a bug but a new feature or a change which
doesn't work with your probably outdated BIOS.

    I'd suggest asking kernel upstream by either filing a bug report at
kernel.org or asking on their mailing list.

    It definitely must have something to do with the kernel. Otherwise it
wouldn't work again after a kernel downgrade.

Further tests I've done:

    Per the suggestions of the dm-devel list, I have tested with both
libata.ignore_hpa=0 (default) and libata.ignore_hpa=1 (ignore limits, using full
disk), but there is no change. I still get grub Error 24: (this is with the
2.6.36-3 kernel)

    I did another test starting with 2.6.35.7 (working), upgraded to 2.6.35.8
(expected failure -- it did fail), then upgraded directly to 2.6.36-3 (expected
success if it was an initramfs location issue -- it failed too). Just to be
sure, I re-made the initramfs a couple of times and tried booting with each -
they all failed as well.

    Then I downgraded to 2.6.35.7 -> it works like a champ -- no matter what order
it gets installed in.
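
The location theory from post 1 and post 2 can also be checked by working out where a file's blocks sit on the disk. The sketch below is only an illustration: the partition start sector would come from the fdisk listing and the physical block number from a tool such as `filefrag -v`. The 28-bit LBA boundary used as a threshold is an assumption picked for the example, and nothing in this thread shows that this BIOS has such a limit. The function names are hypothetical.

```python
SECTOR_BYTES = 512  # assumption: 512-byte sectors

# Number of sectors addressable with 28-bit LBA (128 GiB at 512 bytes/sector).
# Used here only as an example threshold.
LBA28_LIMIT = 2**28


def absolute_sector(partition_start_sector, fs_block_size, physical_block):
    """Convert a filesystem block number into an absolute disk sector."""
    sectors_per_block = fs_block_size // SECTOR_BYTES
    return partition_start_sector + physical_block * sectors_per_block


def beyond_limit(sector, limit_sectors=LBA28_LIMIT):
    """True if the sector lies at or past the given addressing limit."""
    return sector >= limit_sectors
```

Running this for the kernel image and initramfs of a working and a failing install, and comparing the results, would show whether the failing files really sit in a different region of the disk.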

Just FYI, the BIOS is current for the board and information on both can be found at:


http://www.msi.com/index.php?func=downloaddetail&type=bios&maincat_no=1&prod_no=1443

direct download:

  http://www.msi.com/index.php?func=downloadfile&dno=12299&type=bios

    So, I'm stumped again. With your help, the help of the Arch developers, and
the help of the guys at kernel.org, we should be able to isolate the issue and
figure out where the problem is. If you have any other ideas, please let me know
and I'll pass that information back to Arch and the kernel list. With all the
smart people involved, I'll bet we get this thing sorted out. Thanks.




* Re: kernel update and dmraid causing grub errors
  2010-11-04 16:17       ` David C. Rankin
@ 2010-11-09 17:55         ` David C. Rankin
  2010-11-10  5:49           ` David C. Rankin
  2010-11-17 21:59           ` Heinz Mauelshagen
  0 siblings, 2 replies; 9+ messages in thread
From: David C. Rankin @ 2010-11-09 17:55 UTC (permalink / raw)
  To: device-mapper development

On 11/04/2010 11:17 AM, David C. Rankin wrote:
> Hi Heinz,
> 
> 	No, grub is (grub-0.97-17) and it hasn't changed since April 25, 2010. So
> whatever is happening, isn't due to a grub change. Some of the Arch devs think
> it might be a kernel issue. Last night, I posted the issue to the kernel list at
> kernel.org and we will see what response we get back. The post to kernel.org was
> pretty much the complete history of the issue, so I'll include the additional
> information posted to the kernel list below for completeness:

Heinz,

	Just as a follow-up, I didn't get a response from the kernel.org list on the
issue. In fact the only dm related post on the list in the past week was the CFQ
dm-crypt post that I also see was cc'ed here. I'll try the grub list and see if
they have any ideas. If I get a response, I'll let you know. If you have any
epiphanies on the issue, please let me know. Thanks.


* Re: kernel update and dmraid causing grub errors
  2010-11-09 17:55         ` David C. Rankin
@ 2010-11-10  5:49           ` David C. Rankin
  2010-11-17 21:59           ` Heinz Mauelshagen
  1 sibling, 0 replies; 9+ messages in thread
From: David C. Rankin @ 2010-11-10  5:49 UTC (permalink / raw)
  To: device-mapper development

On 11/09/2010 11:55 AM, David C. Rankin wrote:
> On 11/04/2010 11:17 AM, David C. Rankin wrote:
>> Hi Heinz,
>>
>> 	No, grub is (grub-0.97-17) and it hasn't changed since April 25, 2010. So
>> whatever is happening, isn't due to a grub change. Some of the Arch devs think
>> it might be a kernel issue. Last night, I posted the issue to the kernel list at
>> kernel.org and we will see what response we get back. The post to kernel.org was
>> pretty much the complete history of the issue, so I'll include the additional
>> information posted to the kernel list below for completeness:
> 
> Heinz,
> 
> 	Just as a follow-up, I didn't get a response from the kernel.org list on the
> issue. In fact the only dm related post on the list in the past week was the CFQ
> dm-crypt post that I also see was cc'ed here. I'll try the grub list and see if
> they have any ideas. If I get a response, I'll let you know. If you have any
> epiphanies on the issue, please let me know. Thanks.
> 

Heinz,

    I have one more piece of input and one more question. The issue may affect
more than just this one box. I have two x86_64 nv dmraid boxes at the house
(primary/backup servers). The one I have had the boot problems with is the MSI
K9N2 SLI Platinum (Award BIOS, running 2.6.35.7); the other is based on a
Tyan Tomcat K8e (Model: S2865, Phoenix BIOS, Opteron 180, running 2.6.35.8).
Both have similar nv dmraid setups (the MSI box has 2 RAID 1 arrays, the Tyan
box has 1 RAID 1 array).

    What I have noticed recently is that the Tyan box boots and experiences what
sounds like disk/drive controller "confusion." What is weird is that it depends
on how the box inits: the problem is either "there" or it "isn't".

    What I mean is that when the problem occurs on the Tyan box, it affects
the box from boot until shutdown. It behaves just like there is an interrupt
conflict or drive/controller fault. I can hear consistent read/write head
excursions (once every 2-3 secs.) and I get 15-30-60 second delays with
everything (type ls, then wait 30-60 seconds for the listing, or right-click on
the desktop and wait, and wait... for the context menu). It doesn't matter
whether I have a desktop running or boot to runlevel 3 -- it's a low-level
issue.

    Normally that is a "Hey stupid, you have a drive failing... go fix it"
issue. But it's not. smartctl is fine on all drives -- "no errors logged".
Nothing in syslog or dmesg, and the disks are clean.

    A shutdown or reboot will completely "fix" the problem. Although today I had
to shutdown/restart 3 times before it "fixed" itself. When the box "inits"
without having this problem - it never exhibits *any* problem until the next
boot when whatever it is strikes again.

    Since I rarely boot the box, I don't exactly know when this started, but it
has been within the past month -- which is consistent with the latest round of
boot failures on the MSI box moving from kernel 2.6.35.7 to .8.

    I don't know what to make of it. It seems like something has just gone
"flaky" with how dmraid is working (or grub or kernel or whatever), and it's
like some part of the setup is just confused. On the MSI box, it appears as some
attempt to read beyond the partition boundary or the box thinking there is a
corrupt partition table and booting fails with the latest kernels. On the Tyan
box, it appears as something that causes read/write head excursions and causes
the 15-60 second hangs like there is an interrupt conflict or some hardware
thing waiting on a timeout.

    One item that did catch my eye on the kernel list was a dmraid issue
concerning a "CFQ dm-crypt" problem. I have no idea what that is other than
gleaning it had to do with some type of dmraid queue/scheduler that was causing
problems. I don't know if that could point to some area of dmraid that might be
the culprit.

    If you have any ideas of any type of test and/or diagnostic I could use the
next time the Tyan box exhibits the problem -- to look at where the hang/timeout
issue is, I would appreciate your ideas. (that's an area where I have no clue...
how or what to look for)
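
One possible diagnostic for hangs like these is to list the processes that are stuck in uninterruptible sleep (state "D" in /proc), since that is where a process usually sits while waiting on disk I/O. The following is a minimal sketch, assuming a Linux /proc filesystem; `parse_stat` and `blocked_processes` are hypothetical helper names, and the script only reports which processes are blocked, not why.

```python
import glob


def parse_stat(line):
    """Return (pid, comm, state) from the contents of one /proc/<pid>/stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split around the last ')' instead of on whitespace.
    left, right = line.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    return int(pid), comm, right.split()[0]


def blocked_processes():
    """List (pid, comm) for processes currently in uninterruptible sleep."""
    blocked = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # the process exited while we were reading it
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in blocked_processes():
        print(pid, comm)
```

Running it every second or so while the box is hanging, and again after a "good" boot, would show whether the delays line up with processes blocked on I/O.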

    Thanks for all your continued help and willingness to provide ideas. I know
this is a weird issue, but now that I have two boxes showing some signs of a
similar problem -- hopefully that will help me narrow it down.



* Re: kernel update and dmraid causing grub errors
  2010-11-09 17:55         ` David C. Rankin
  2010-11-10  5:49           ` David C. Rankin
@ 2010-11-17 21:59           ` Heinz Mauelshagen
  1 sibling, 0 replies; 9+ messages in thread
From: Heinz Mauelshagen @ 2010-11-17 21:59 UTC (permalink / raw)
  To: device-mapper development

On Tue, 2010-11-09 at 11:55 -0600, David C. Rankin wrote:
> On 11/04/2010 11:17 AM, David C. Rankin wrote:
> > Hi Heinz,
> > 
> > 	No, grub is (grub-0.97-17) and it hasn't changed since April 25, 2010. So
> > whatever is happening, isn't due to a grub change. Some of the Arch devs think
> > it might be a kernel issue. Last night, I posted the issue to the kernel list at
> > kernel.org and we will see what response we get back. The post to kernel.org was
> > pretty much the complete history of the issue, so I'll include the additional
> > information posted to the kernel list below for completeness:
> 
> Heinz,
> 
> 	Just as a follow-up, I didn't get a response from the kernel.org list on the
> issue. In fact the only dm related post on the list in the past week was the CFQ
> dm-crypt post that I also see was cc'ed here. I'll try the grub list and see if
> they have any ideas. If I get a response, I'll let you know. If you have any
> epiphanies on the issue, please let me know. Thanks.

Will do, thanks.
