* grub rescue read or write sector outside of partition
@ 2015-06-26  0:33 Dale Carstensen
  2015-06-26  8:11 ` Andrei Borzenkov
  2015-06-26  8:21 ` Fajar A. Nugraha
  0 siblings, 2 replies; 6+ messages in thread
From: Dale Carstensen @ 2015-06-26  0:33 UTC (permalink / raw)
  To: grub-devel

I had a drive fail, and it is the one that had grub on it.
It had parts of two RAID-6 partitions, too.  So I bought a
new drive and added partitions on it to replace the failed
RAID-6 parts.  That was still booting OK from the failed
drive, but then I updated the kernel, and I decided to also
install a new grub on the new drive.

That seemed to go OK until I tried to reboot.  I landed in
grub rescue.  Fortunately I have several computers, so I can
look up documentation, etc. without my main desktop functioning.
Somewhere I found that grub rescue has only a few commands, none
of them "help" or a list of commands, and no TAB-expansions.
Well, they seem to be ls, set, unset and insmod.  Supposedly,
running insmod normal, then normal, will get back to the
fuller set of commands with help, but that's where it gets
the "outside of partition" error, it seems.

I can ls the /boot/grub/i386-pc/ directory, where normal.mod
is, so I would think grub rescue could find and read normal.mod,
too, but, I guess not.

So, set debug=all helped a little.  It expanded the message
from something terse like "read or write bad" (I'd have to
keep rebooting to get it verbatim) to the specific size of
the partition (in decimal, around 175 million 512-byte
blocks) and the sector it is trying to read (read.c:461)
(in hexadecimal), around 10 million.  And 10 million read
as hex really is larger than 175 million decimal.
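A quick sanity check of that comparison (figures are approximations from the description above, not the verbatim debug output):

```shell
# Illustrative only: the exact numbers from grub's debug output
# weren't recorded; these approximate the message described above.
partition_blocks=175000000            # partition size in 512-byte blocks (decimal)
sector=$((0x10000000))                # "10 million", read as a hex string
echo "$sector"                        # 268435456
[ "$sector" -gt "$partition_blocks" ] && echo "sector outside of partition"
```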

So maybe my BIOS has some limitation on how deep it can
read into this 2 TB drive, or maybe the drive having
hardware sectors of 4096 bytes replacing one with
512 confuses grub.  But the old drive with the failures
gets the same problem.

It's gentoo, grub2 (I could look up the version once it's
running again), 64-bit (although grub seems not to really
notice 32- vs 64-bit, or the kernel, so I'm not sure it's
just smart or really dumb), and, like I say, the / partition
is RAID-6, including /boot.  I'm going to try making a
non-RAID /boot, maybe later I'll try making it RAID-1,
to see if that helps.

Any advice?

Thanks.

--
Open WebMail Project (http://openwebmail.org)



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: grub rescue read or write sector outside of partition
  2015-06-26  0:33 grub rescue read or write sector outside of partition Dale Carstensen
@ 2015-06-26  8:11 ` Andrei Borzenkov
  2015-06-27 22:17   ` Dale Carstensen
  2015-06-26  8:21 ` Fajar A. Nugraha
  1 sibling, 1 reply; 6+ messages in thread
From: Andrei Borzenkov @ 2015-06-26  8:11 UTC (permalink / raw)
  To: Dale Carstensen; +Cc: grub-devel

On Thu, 25 Jun 2015 17:33:25 -0700,
"Dale Carstensen" <dlc@lampinc.com> wrote:

> I had a drive fail, and it is the one that had grub on it.
> It had parts of two RAID-6 partitions, too.  So I bought a
> new drive and added partitions on it to replace the failed
> RAID-6 parts.  That was still booting OK from the failed
> drive, but then I updated the kernel, and I decided to also
> install a new grub on the new drive.
> 

How? Please show exact commands you used as well as your disk
configuration.

> That seemed to go OK until I tried to reboot.  I landed in
> grub rescue.  Fortunately I have several computers, so I can
> look up documentation, etc. without my main desktop functioning.
> Somewhere I found that grub rescue has only a few commands, none
> of them "help" or a list of commands, and no TAB-expansions.
> Well, they seem to be ls, set, unset and insmod.  Supposedly,
> running insmod normal, then normal, will get back to the
> fuller set of commands with help, but that's where it gets
> the "outside of partition" error, it seems.
> 
> I can ls the /boot/grub/i386-pc/ directory, where normal.mod
> is, so I would think grub rescue could find and read normal.mod,
> too, but, I guess not.
> 

Please show output of "set" command at this point.

> So, set debug=all helped a little, expanding the message
> from just something like (I'd have to keep trying to
> reboot to get it verbatim) read or write bad, to
> the specific size of the partition (in decimal, around
> 175 million 512-byte blocks) and the sector it is trying
> to read (read.c:461) (in hexadecimal), around 10 million.
> But 10 million hex really is larger than 175 million
> decimal.
> 
> So maybe my BIOS has some limitation on how deep it can
> read into this 2 TB drive, or maybe the drive having
> hardware sectors of 4096 bytes replacing one with
> 512 confuses grub.  But the old drive with the failures
> gets the same problem.
> 
> It's gentoo, grub2 (I could look up the version once it's
> running again), 64-bit (although grub seems not to really
> notice 32- vs 64-bit, or the kernel, so I'm not sure it's
> just smart or really dumb), and, like I say, the / partition
> is RAID-6, including /boot.  I'm going to try making a
> non-RAID /boot, maybe later I'll try making it RAID-1,
> to see if that helps.
> 
> Any advice?
> 
> Thanks.
> 
> --
> Open WebMail Project (http://openwebmail.org)
> 
> 
> _______________________________________________
> Grub-devel mailing list
> Grub-devel@gnu.org
> https://lists.gnu.org/mailman/listinfo/grub-devel




* Re: grub rescue read or write sector outside of partition
  2015-06-26  0:33 grub rescue read or write sector outside of partition Dale Carstensen
  2015-06-26  8:11 ` Andrei Borzenkov
@ 2015-06-26  8:21 ` Fajar A. Nugraha
  1 sibling, 0 replies; 6+ messages in thread
From: Fajar A. Nugraha @ 2015-06-26  8:21 UTC (permalink / raw)
  To: The development of GNU GRUB

On Fri, Jun 26, 2015 at 7:33 AM, Dale Carstensen <dlc@lampinc.com> wrote:
> Somewhere I found that grub rescue has only a few commands, none
> of them "help" or a list of commands, and no TAB-expansions.
> Well, they seem to be ls, set, unset and insmod.  Supposedly,
> running insmod normal, then normal, will get back to the
> fuller set of commands with help, but that's where it gets
> the "outside of partition" error, it seems.

You could include the modules you want in the grub image, so you get
tab expansion even when you are dropped to the grub rescue prompt.  Or
include a whole bunch of modules so you can boot even without the
module directory.  Some modules are automatically included as
dependencies of other modules; for example, if you include "ls" then
"normal" will be included automatically.

This is what I use:

grub-install --modules="all_video help cat echo ls search test
part_gpt part_msdos loopback fat ntfscomp ext2 btrfs xfs zfsinfo
iso9660 configfile linux chain boot" /dev/sda
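The dependency pull-in described above is recorded in moddep.lst next to the modules. A sketch of resolving the transitive closure, using a made-up sample in that file's format (the real file is /boot/grub/i386-pc/moddep.lst and its entries differ):

```shell
# Build a tiny sample in moddep.lst format: "module: dep1 dep2 ...".
# Entries are illustrative, not grub's real dependency graph.
cat > /tmp/moddep.sample <<'EOF'
ls: normal
normal: terminal crypto
terminal:
crypto:
EOF

# Recursively collect a module plus everything it depends on.
resolve() {
  echo "$1"
  for dep in $(awk -F': *' -v m="$1" '$1 == m { print $2 }' /tmp/moddep.sample); do
    resolve "$dep"
  done
}

resolve ls | sort -u    # ls pulls in normal, which pulls in its own deps
```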

-- 
Fajar



* Re: grub rescue read or write sector outside of partition
  2015-06-26  8:11 ` Andrei Borzenkov
@ 2015-06-27 22:17   ` Dale Carstensen
  2015-06-28  6:01     ` Andrei Borzenkov
  0 siblings, 1 reply; 6+ messages in thread
From: Dale Carstensen @ 2015-06-27 22:17 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: grub-devel

TL;DR: it looks to me like grub has a problem when failed
mdadm RAID6 members are left lying around

Thanks to Fajar A. Nugraha for the advice about --modules for
grub-install (seems to me to be undocumented).  I managed to
stumble through without enhancing the commands for "grub rescue",
but it's good to know I could have.

I still have a question, though.

The grub.cfg file has menuentry nesting, with an outer name of
"Gentoo GNU/Linux", and inner names by version/recovery.  But
I can't find any documentation of how to navigate to choose,
say, 3.8.11, now that I've made 4.0.5 default.  Seems to me
all the lines used to show up.  Maybe I manually took out the
nesting before??

So what key(s) drill down into sub-menus on the grub menu?
Did I miss it in the info page / manual?
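For what it's worth, grub-mkconfig nests the per-kernel entries under a submenu unless GRUB_DISABLE_SUBMENU=y is set in /etc/default/grub; in the boot menu, Enter on the submenu line opens it and Esc goes back up. The generated grub.cfg looks roughly like this (titles illustrative):

```
submenu 'Advanced options for Gentoo GNU/Linux' {
        menuentry 'Gentoo GNU/Linux, with Linux 4.0.5' { ... }
        menuentry 'Gentoo GNU/Linux, with Linux 3.8.11' { ... }
}
```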

>Date:   Fri, 26 Jun 2015 11:11:14 +0300
>To:     "Dale Carstensen" <dlc@lampinc.com>
>cc:     grub-devel@gnu.org
>From:   Andrei Borzenkov <arvidjaar@gmail.com>
>Subject: Re: grub rescue read or write sector outside of partition

>On Thu, 25 Jun 2015 17:33:25 -0700,
>"Dale Carstensen" <dlc@lampinc.com> wrote:
>
>> I had a drive fail, and it is the one that had grub on it.
>> It had parts of two RAID-6 partitions, too.  So I bought a
>> new drive and added partitions on it to replace the failed
>> RAID-6 parts.  That was still booting OK from the failed
>> drive, but then I updated the kernel, and I decided to also
>> install a new grub on the new drive.
>
>How? Please show exact commands you used as well as your disk
>configuration.

The bash history is long gone.  My feeble memory is that it
was simply

 grub2-install /dev/sdf

and it responded there were no errors.

Eventually I booted from a DVD and used chroot to do

 grub2-install /dev/sdb

The disk configuration, as shown by /proc/mdstat, is:

md126 : active raid6 sdf8[5] sdd1[4] sdc1[3] sdb1[2] sda1[1]
      87836160 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
      
md127 : active raid6 sdf10[5] sdd3[4] sdc3[3] sdb3[2] sda3[1]
      840640512 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 1/3 pages [4KB], 65536KB chunk

/ is mounted from md126p1, /home from md127p1.
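As an aside, those /proc/mdstat lines are easy to pull apart with awk; a small sketch over the excerpt above (bitmap line omitted):

```shell
# Copy of the /proc/mdstat excerpt quoted above, minus the bitmap line.
cat > /tmp/mdstat.sample <<'EOF'
md126 : active raid6 sdf8[5] sdd1[4] sdc1[3] sdb1[2] sda1[1]
      87836160 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
md127 : active raid6 sdf10[5] sdd3[4] sdc3[3] sdb3[2] sda3[1]
      840640512 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
EOF

# Print each array's RAID level, member devices, and size in blocks.
awk '
/^md/ { name = $1
        printf "%s level=%s members=", name, $4
        for (i = 5; i <= NF; i++) { split($i, a, "["); printf "%s%s", a[1], (i < NF ? "," : "\n") } }
/blocks/ { printf "%s blocks=%s\n", name, $1 }
' /tmp/mdstat.sample
```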

The sdf8 and sdf10 partitions are on the replacement drive.
The former partitions those replaced are still on sde8 and
sde10.

Grub calls sde (hd0), sdf (hd1), md126 (md/1) and md127 (md/3).
The DVD boot calls sde sda, and sdf sdb.  All neatly made
consistent by those long UUID strings.  And grub calls
md126p1 (md/1,gpt1), but for command input seems to like
(md/1,1) without the label-type distinction.

Or maybe I have md/1 and md/3 swapped??  I hope not.

The command that replaced the bad drive with the good in RAID6
was

 mdadm --add /dev/md126 /dev/sdf8

Below, I'll show what I think has made it stable and useful
again.

>> That seemed to go OK until I tried to reboot.  I landed in
>> grub rescue.  Fortunately I have several computers, so I can
>> look up documentation, etc. without my main desktop functioning.
>> Somewhere I found that grub rescue has only a few commands, none
>> of them "help" or a list of commands, and no TAB-expansions.
>> Well, they seem to be ls, set, unset and insmod.  Supposedly,
>> running insmod normal, then normal, will get back to the
>> fuller set of commands with help, but that's where it gets
>> the "outside of partition" error, it seems.
>>
>> I can ls the /boot/grub/i386-pc/ directory, where normal.mod
>> is, so I would think grub rescue could find and read normal.mod,
>> too, but, I guess not.
>>
>
>Please show output of "set" command at this point.

In the original grub rescue event, I think set output this:

cmdpath=(hd0)
prefix=(mduuid/73fc9531-525f-05e9-6992-6654b5b95a33,1)/boot/grub
root=mduuid/73fc9531-525f-05e9-6992-6654b5b95a33,1

And the number 73fc...5a33 is the blkid for /dev/sdf8.  I think it
was just the three variables.

Note that I booted from (hd1), but somehow cmdpath got
diverted to (hd0), though the UUID for prefix and root were
still on (hd1).  Unless I misremember.
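For reference, when prefix points somewhere unreadable, the usual way out of the rescue shell is to repoint it by hand before loading normal (device names here follow this thread's layout; yours may differ):

```
grub rescue> set prefix=(md/1,1)/boot/grub
grub rescue> set root=(md/1,1)
grub rescue> insmod normal
grub rescue> normal
```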

>
>> So, set debug=all helped a little, expanding the message
>> from just something like (I'd have to keep trying to
>> reboot to get it verbatim) read or write bad, to
>> the specific size of the partition (in decimal, around
>> 175 million 512-byte blocks) and the sector it is trying
>> to read (read.c:461) (in hexadecimal), around 10 million.
>> But 10 million hex really is larger than 175 million
>> decimal.
>>
>> So maybe my BIOS has some limitation on how deep it can
>> read into this 2 TB drive, or maybe the drive having
>> hardware sectors of 4096 bytes replacing one with
>> 512 confuses grub.  But the old drive with the failures
>> gets the same problem.
>>
>> It's gentoo, grub2 (I could look up the version once it's
>> running again),

Part of the output of

 eix grub | cat

is

 [I] sys-boot/grub
  ...
 Installed versions:  2.02_beta2-r3(2)^t(07:25:12 03/13/15)(multislot nls sdl truetype -debug -device-mapper -doc -efiemu -libzfs -mount -static -test GRUB_PLATFORMS="-coreboot -efi-32 -efi-64 -emu -ieee1275 -loongson -multiboot -pc -qemu -qemu-mips -xen")

>> 64-bit (although grub seems not to really
>> notice 32- vs 64-bit, or the kernel, so I'm not sure it's
>> just smart or really dumb),

It is multilib, so 32- vs 64-bit appearance is nuanced.

>> and, like I say, the / partition
>> is RAID-6, including /boot.  I'm going to try making a
>> non-RAID /boot, maybe later I'll try making it RAID-1,
>> to see if that helps.
>>
>> Any advice?
>>
>> Thanks.

Well, it seems to work again.

The first baby step was to make a partition on (hd1)/sdb/sdf
starting at block 34 and ending at block 2047.  Partition 8
begins at block 2048, and originally I set it to type ef02.
Then I changed it to fd00 and made the block-34 partition (11)
type ef02.  I tried to make that partition 11 ext3 and
put some of /boot in it, but obviously it's way too small
for anything but grub's persistent environment.  So I used
dd with if=/dev/zero to clear it.  And I did grub2-install
with the --recheck option.  All while booted from DVD and
using chroot, keeping in mind the device was /dev/sdb.

That avoided "grub rescue", but the only kernel it found was
the old one, 3.8.11.

I stabbed in the dark through another 4 or 5 reboots, until
eventually setting pager=1 and using cat on /boot/grub/grub.cfg
showed that the only Linux menuentry in it was for
3.8.11, while I knew the latest grub.cfg I had also had
the new 4.0.5, as well as the older 3.8.9 and 3.8.7 ones.
I'm still not sure where that grub.cfg came from, but I
made the assumption that it had to do with grub being too
liberal about failed members of RAID6 partitions.

So I ran

 mdadm --zero-superblock /dev/sde8

and also for 10.

I think that fixed things.  Oh, I also had, before the
zero-superblock, changed /etc/default/grub to set the
default menu item to the long weird id for 4.0.5.

So, it's working, or at least appears to work.  I suppose
I should check whether cmdpath in grub is (hd1) or maybe
is still the incorrect (hd0).





* Re: grub rescue read or write sector outside of partition
  2015-06-27 22:17   ` Dale Carstensen
@ 2015-06-28  6:01     ` Andrei Borzenkov
  2015-06-28 22:54       ` Dale Carstensen
  0 siblings, 1 reply; 6+ messages in thread
From: Andrei Borzenkov @ 2015-06-28  6:01 UTC (permalink / raw)
  To: Dale Carstensen; +Cc: grub-devel

On Sat, 27 Jun 2015 15:17:38 -0700,
Dale Carstensen  <dlc@lampinc.com> wrote:

> 
> The bash history is long gone.  My feeble memory is that it
> was simply
> 
>  grub2-install /dev/sdf
> 
> and it responded there were no errors.
> 
> Eventually I booted from a DVD and used chroot to do
> 
>  grub2-install /dev/sdb
> 
> The disk configuration, as shown by /proc/mdstat, is:
> 
> md126 : active raid6 sdf8[5] sdd1[4] sdc1[3] sdb1[2] sda1[1]
>       87836160 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
>       
> md127 : active raid6 sdf10[5] sdd3[4] sdc3[3] sdb3[2] sda3[1]
>       840640512 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
>       bitmap: 1/3 pages [4KB], 65536KB chunk
> 
> / is mounted from md126p1, /home from md127p1.
> 

Do you really have partitioned MD RAID? Why? The only reason to have it
is firmware RAID, but here you already build the MD arrays from
partitions.  Having additional partitions on top of them just
complicates things.

> The sdf8 and sdf10 partitions are on the replacement drive.
> The former partitions those replaced are still on sde8 and
> sde10.
> 

Was drive sde still present in system after you had replaced it with
sdf in MD RAID? You said it failed - how exactly? Was it media failure
on some sectors or was it complete failure, which made drive
inaccessible?

> Grub calls sde (hd0), sdf (hd1), md126 (md/1) and md127 (md/3).
> The DVD boot calls sde sda, and sdf sdb.  All neatly made
> consistent by those long UUID strings.  And grub calls
> md126p1 (md/1,gpt1), but for command input seems to like
> (md/1,1) without the label-type distinction.
> 

Yes, you indeed have partitioned RAID. And grub even does work with it.
Wow! :)

...
> 
> Well, it seems to work again.
> 
> The first baby step was to make a partition on (hd1)/sdb/sdf
> starting at block 34 and ending at block 2047.  Partition 8
> begins at block 2048, and originally I set it to type ef02.

If at this stage you ran grub2-install /dev/sdf *and* it completed
without errors, it means it overwrote the beginning of /dev/sdf8.  I
wonder what we can do to check that a partition is not in use.  Maybe
opening it exclusively would help in this case.

> Then I changed it to fd00 and made the block 34 partition (11)
> type ef02.  I tried to make that partition 11 ext3 and
> put some of /boot in it, but obviously it's way too short
> for anything but grub's persistent environment.  So I used
> dd with if=/dev/zero to clear it.  And I did grub2-install
> with the --recheck option.  All while booted from DVD and
> using chroot, keeping in mind the device was /dev/sdb.
> 
> That avoided "grub rescue", but the only kernel it found was
> the old one, 3.8.11.
> 

I'm not sure I understand yet how that could fix "grub rescue".

> I stabbed in the dark through another 4 or 5 reboots, until
> eventually pager=1 and cat to look at /boot/grub/grub.cfg
> showed that the only menuentry in it for Linux was for
> 3.8.11, while I knew the latest grub.cfg I had also had
> the new 4.0.5, as well as older 3.8.9 and 3.8.7 ones.
> I'm still not sure where that grub.cfg came from, but I
> made the assumption that it had to do with grub being too
> liberal about failed members of RAID6 partitions.
> 
> So I ran
> 
>  mdadm --zero-superblock /dev/sde8
> 
> and also for 10.
> 
> I think that fixed things.

Yes, *that* is quite possible.  Unfortunately GRUB does not currently
check whether the disk superblock generations match each other, and it
stops scanning as soon as enough disks are found, so it could pick up
stale pieces from sde instead of the new ones from sdf.
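A sketch of the missing generation check, using hypothetical "Events" counters of the kind mdadm --examine reports for each member superblock:

```shell
# Hypothetical per-member "Events" counters (mdadm --examine output);
# the counter increments as the array changes, so a member replaced
# long ago lags behind and its superblock is stale.
cat > /tmp/events.sample <<'EOF'
sde8 11842
sdf8 23917
EOF

# Keep only the member with the newest superblock generation.
sort -k2,2n /tmp/events.sample | tail -1 | cut -d' ' -f1
```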

So to be on the safe side it is necessary to either remove the replaced
drive or zero it out, so it is not detected as part of the RAID.

Anyway, I'm happy you fixed it and thank you very much for sharing your
experience, it is quite helpful!



* Re: grub rescue read or write sector outside of partition
  2015-06-28  6:01     ` Andrei Borzenkov
@ 2015-06-28 22:54       ` Dale Carstensen
  0 siblings, 0 replies; 6+ messages in thread
From: Dale Carstensen @ 2015-06-28 22:54 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: grub-devel

Andrei,

Perhaps you know the answer to my question about accessing
nested grub menus from the menu presented at boot time.  Or
anybody else who knows, please chime in.  I guess I'll do
some ebuild commands to get the source and check there.

Here's how I stated the question:

>I still have a question, though.
>
>The grub.cfg file has menuentry nesting, with an outer name of
>"Gentoo GNU/Linux", and inner names by version/recovery.  But
>I can't find any documentation of how to navigate to choose,
>say, 3.8.11, now that I've made 4.0.5 default.  Seems to me
>all the lines used to show up.  Maybe I manually took out the
>nesting before??
>
>So what key(s) drill down into sub-menus on the grub menu?
>Did I miss it in the info page / manual?

More discussion about my RAID configuration, grub and the
history of this event:

>Date:   Sun, 28 Jun 2015 09:01:21 +0300
>To:     Dale Carstensen  <dlc@lampinc.com>
>cc:     grub-devel@gnu.org
>From:   Andrei Borzenkov <arvidjaar@gmail.com>
>Subject: Re: grub rescue read or write sector outside of partition

>On Sat, 27 Jun 2015 15:17:38 -0700,
>Dale Carstensen  <dlc@lampinc.com> wrote:
>

>> / is mounted from md126p1, /home from md127p1.
>>
>
>Do you really have partitioned MD RAID? Why?

Some time in the past I tried putting ext3 on an mdadm
device directly, instead of on a partition within it,
and it kept getting corrupted, until I made a gpt label
and a partition inside; then the corruption stopped.  I
don't remember what was doing that.  It probably wasn't
grub.

> The only reason to have it
>is firmware RAID, but here you already build MD from partitions. Having
>additional partitions on top of them just complicates things.

Well, sure, but once bitten twice shy.  Maybe simpler is
a little too simple.

>> The sdf8 and sdf10 partitions are on the replacement drive.
>> The former partitions those replaced are still on sde8 and
>> sde10.
>>
>
>Was drive sde still present in system after you had replaced it with
>sdf in MD RAID?

Yes, you are correct in your interpretation of what I wrote.

> You said it failed - how exactly?

Ah, now we're getting into a discussion of systemd,
with all that entails, Lennart Poettering, etc.

OK.

Well, about May 27, I was trying to remember how it used to
work for me to mail a web page to myself from lynx.  And I
looked for mail transfer agents and decided to try esmtp.
As I tried to get esmtp configured, I decided it might help
to look at some log files.  Used to be I could just use
less on /var/log/messages (or syslog, daemon, maillog, whatever).
But I decided to see what all the hoopla was about regarding
systemd.  So in March I had built a new gentoo installation
with, not openrc as is usual with gentoo, but with systemd.

So in diagnosing the esmtp configuration mystery after May
27, I ran journalctl.  That automatically uses a pager, maybe
it is less itself, or something very similar to less.  So
I looked at the last few lines of the log.  Boy, was I
surprised to see that the last three lines, on May 27, were:

May 13 16:12:53 lacnmp kernel: md/raid:md127: read error corrected (8 sectors 
at
May 13 16:12:05 lacnmp systemd[1]: systemd-journald.service watchdog timeout 
(li
May 13 16:12:53 lacnmp systemd-journal[1370]: Journal stopped

Oh, boy, was I happy I had improved my system by changing to
systemd.  Not.

Almost as much as when an Ubuntu system I admin, by using
upstart, another "improvement" over init scripts, had no
anacron running since a month before I checked.  And that system
is still running, with a new temporary cron entry to run
anacron daily, 18 months later.

Well, I'm still trying to figure out how to run logwatch
daily (and mail it to myself!), and a few other nits with
systemd, and it has claimed to have stopped the journal
again, yet somehow the logging continued.  Mysteries, linux
is full of them.  Especially things Lennart has invented,
it seems.  I think this is fact, not opinion, but I guess
eventually systemd will just work and we'll all be happy.
I'm not sure everybody will ever be happy with IPV6, on
the other hand.  Keep things in perspective.

So, anyway, the first error on sde had been April 13, and
at some point mdadm had banished it from both the RAID6
partitions.  It has other partitions, and there have been
some errors on those, too.  But it's not completely dead.
So I left it in the system while I added a replacement
for it.

> Was it media failure
>on some sectors or was it complete failure, which made drive
>inaccessible?

Some of the messages before May 12 do state "Medium Error".
I guess a checksum is probably not matching.  Too many to
keep adding to flaw maps to avoid the bad spots.

I had a drive go bad in 2010 by de-coupling a platter from
the spindle.  It cost me $1600 to get my data back, but
Kroll Ontrack could do it.  That was in a RAID that was
just slicing, no redundancy.  It took me a while to figure
out this mdadm / lvm2 stuff; you think I still haven't
because I use a partition within a RAID, and I think I
still haven't because it's still confusing to me.

>...
>>
>> Well, it seems to work again.
>>
>> The first baby step was to make a partition on (hd1)/sdb/sdf
>> starting at block 34 and ending at block 2047.  Partition 8
>> begins at block 2048, and originally I set it to type ef02.
>
>If at this state you run grub2-install /dev/sdf *and* it completed
>without errors, it means it overwrote beginning of /dev/sdf8.

Well, that's disturbing.  I guess mdadm or ext3 or whatever
is resilient enough to slough that off.

> I wonder
>what can we do to check that partition is not in use. May be opening it
>exclusively would help in this case.
>
>> Then I changed it to fd00 and made the block 34 partition (11)
>> type ef02.  I tried to make that partition 11 ext3 and
>> put some of /boot in it, but obviously it's way too short
>> for anything but grub's persistent environment.  So I used
>> dd with if=/dev/zero to clear it.  And I did grub2-install
>> with the --recheck option.  All while booted from DVD and
>> using chroot, keeping in mind the device was /dev/sdb.
>>
>> That avoided "grub rescue", but the only kernel it found was
>> the old one, 3.8.11.
>>
>
>Not sure I understand so far how it could fix "grub rescue".

I don't either, but I guess lingering data on sde sometimes
didn't steer grub out of bounds.  I don't recall ever having
a grub.cfg with only 3.8.11 in it, but that's what grub was
seeing (see next sentence).

>> I stabbed in the dark through another 4 or 5 reboots, until
>> eventually pager=1 and cat to look at /boot/grub/grub.cfg
>> showed that the only menuentry in it for Linux was for
>> 3.8.11, while I knew the latest grub.cfg I had also had
>> the new 4.0.5, as well as older 3.8.9 and 3.8.7 ones.
>> I'm still not sure where that grub.cfg came from, but I
>> made the assumption that it had to do with grub being too
>> liberal about failed members of RAID6 partitions.
>>
>> So I ran
>>
>>  mdadm --zero-superblock /dev/sde8
>>
>> and also for 10.
>>
>> I think that fixed things.
>
>Yes, *that* is quite possible. Unfortunately GRUB does not currently
>checks whether disk superblock generations match each other, and it
>stops scanning as soon as enough disks are found, so it could pick up
>stale pieces from sde instead of new one from sdf.

The kernel RAID support is more picky, and it ignored
sde, considering its partitions failed.

>So to be on safe side it is necessary to either remove replaced drive
>or zero it out, so it is not detected as part of RAID.
>
>Anyway, I'm happy you fixed it and thank you very much for sharing your
>experience, it is quite helpful!

What to do when a RAID element fails is not all that clear.
Maybe mdadm with --replace and --with arguments??  But
first doing something to add a spare??  The --add seemed
to work for me.  I've done this on another Ubuntu system (not
the one still running anacron from cron daily) that had a
SATA controller fail.  I think it restored the RAID to
fully working order.  At least I don't see any evidence to
the contrary.

I see in grub, you have all these things to consider
about optical drives, virtual machines, Hurd, other
CPU architectures.  So I see how a little detail about
RAID failed members could fall off the edge.  But it's
not very comfortable when that "grub rescue" prompt comes
up, and I thought, oh, what now?



