From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dale Carstensen
To: Andrei Borzenkov
Cc: grub-devel@gnu.org
Subject: Re: grub rescue read or write sector outside of partition
Date: Sun, 28 Jun 2015 15:54:39 -0700
Message-Id: <20150628225223.E23592A1@lacn.los-alamos.net>
In-reply-to: <20150628090121.465303f2@opensuse.site>
References: <20150626000953.M20169@lampinc.com>
 <20150626111114.247a6e3a@opensuse.site>
 <20150627221525.55BF529C@lacn.los-alamos.net>
 <20150628090121.465303f2@opensuse.site>

Andrei,

Perhaps you know the answer to my question about accessing nested grub
menus from the menu presented at boot time.  Or anybody else who knows,
please chime in.  I guess I'll do some ebuild commands to get the
source and check there.

Here's how I stated the question:

>I still have a question, though.
>
>The grub.cfg file has menuentry nesting, with an outer name of
>"Gentoo GNU/Linux", and inner names by version/recovery.  But
>I can't find any documentation of how to navigate to choose,
>say, 3.8.11, now that I've made 4.0.5 default.  Seems to me
>all the lines used to show up.  Maybe I manually took out the
>nesting before??
>
>So what key(s) drill down into sub-menus on the grub menu?
>Did I miss it in the info page / manual?
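To make the nesting concrete, what I mean looks roughly like this in
grub.cfg (I'm sketching from memory, so the titles and the number of
entries are only examples, not my exact file):

  submenu 'Gentoo GNU/Linux' {
          menuentry 'Gentoo GNU/Linux 4.0.5' {
                  # linux / initrd commands elided
          }
          menuentry 'Gentoo GNU/Linux 3.8.11' {
                  # ...
          }
          menuentry 'Gentoo GNU/Linux 3.8.11 (recovery mode)' {
                  # ...
          }
  }

If I'm reading the manual right, highlighting the submenu line and
pressing Enter is supposed to open the inner list, and Esc backs out to
the top level; and setting GRUB_DISABLE_SUBMENU=y in /etc/default/grub
before regenerating grub.cfg with grub2-mkconfig is supposed to flatten
everything back to one level, which may be what I had before.  I'd
appreciate confirmation either way.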
More discussion about my RAID configuration, grub and the history of
this event:

>Date: Sun, 28 Jun 2015 09:01:21 +0300
>To: Dale Carstensen
>cc: grub-devel@gnu.org
>From: Andrei Borzenkov
>Subject: Re: grub rescue read or write sector outside of partition
>
>Sat, 27 Jun 2015 15:17:38 -0700, Dale Carstensen:
>> / is mounted from md126p1, /home from md127p1.
>>
>
>Do you really have partitioned MD RAID? Why?

Some time in the past I tried putting ext3 on an mdadm device directly,
instead of on a partition within it, and it got corrupted.  Once I made
a gpt label and a partition inside, the corruption stopped.  I don't
remember what was doing that.  It probably wasn't grub.

> The only reason to have it
>is firmware RAID, but here you already build MD from partitions. Having
>additional partitions on top of them just complicates things.

Well, sure, but once bitten, twice shy.  Maybe simpler is a little too
simple.

>> The sdf8 and sdf10 partitions are on the replacement drive.
>> The former partitions those replaced are still on sde8 and
>> sde10.
>>
>
>Was drive sde still present in system after you had replaced it with
>sdf in MD RAID?

Yes, you are correct in your interpretation of what I wrote.

> You said it failed - how exactly?

Ah, now we're getting into a discussion of systemd, with all that
entails, Lennart Poettering, etc.  OK.

Well, about May 27, I was trying to remember how it used to work for me
to mail a web page to myself from lynx.  I looked for mail transfer
agents and decided to try esmtp.  As I tried to get esmtp configured, I
decided it might help to look at some log files.  It used to be that I
could just use less on /var/log/messages (or syslog, daemon, maillog,
whatever).  But I had decided to see what all the hoopla was about
regarding systemd, so in March I had built a new gentoo installation
not with openrc, as is usual with gentoo, but with systemd.

So, in diagnosing the esmtp configuration mystery after May 27, I ran
journalctl.  That automatically uses a pager, maybe less itself, or
something very similar.  I looked at the last few lines of the log.
Boy, was I surprised to see that the last three lines were these, from
back on May 13:

May 13 16:12:53 lacnmp kernel: md/raid:md127: read error corrected (8 sectors at
May 13 16:12:05 lacnmp systemd[1]: systemd-journald.service watchdog timeout (li
May 13 16:12:53 lacnmp systemd-journal[1370]: Journal stopped

Oh, boy, was I happy I had improved my system by changing to systemd.
Not.  Almost as happy as when an Ubuntu system I admin, which uses
upstart, another "improvement" over init scripts, turned out to have
run no anacron since a month before I checked.  And that system is
still running, 18 months later, with a temporary cron entry to run
anacron daily.

Well, I'm still trying to figure out how to run logwatch daily (and
mail it to myself!), and a few other nits with systemd, and it has
claimed to have stopped the journal again, yet somehow the logging
continued.  Mysteries, Linux is full of them.  Especially things
Lennart has invented, it seems.  I think this is fact, not opinion, but
I guess eventually systemd will just work and we'll all be happy.  I'm
not sure everybody will ever be happy with IPv6, on the other hand.
Keep things in perspective.

So, anyway, the first error on sde had been April 13, and at some point
mdadm had banished it from both the RAID6 partitions.  It has other
partitions, and there have been some errors on those, too.  But it's
not completely dead.  So I left it in the system while I added a
replacement for it.
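In case it's useful, checking which members the kernel has kicked out
of an array can be done with something like the following; md126/md127
and the sdX names here are just from my layout:

  # quick overview of array state; failed members show up marked (F)
  cat /proc/mdstat

  # per-array detail, including devices marked faulty or removed
  mdadm --detail /dev/md126
  mdadm --detail /dev/md127

  # per-member metadata, to see what a partition thinks it belongs to
  mdadm --examine /dev/sde8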
> Was it media failure
>on some sectors or was it complete failure, which made drive
>inaccessible?

Some of the messages before May 12 do state "Medium Error".  I guess a
checksum is probably not matching.  There were too many bad spots to
keep adding them to flaw maps.  I had a drive go bad in 2010 by
de-coupling a platter from the spindle.  It cost me $1600 to get my
data back, but Kroll Ontrack could do it.  That was in a RAID that was
just slicing, no redundancy.

It took me a while to figure out this mdadm / lvm2 stuff, and you think
I still don't understand it, because I use a partition within a RAID,
and I think I still don't, because it's still confusing to me.

>...
>>
>> Well, it seems to work again.
>>
>> The first baby step was to make a partition on (hd1)/sdb/sdf
>> starting at block 34 and ending at block 2047.  Partition 8
>> begins at block 2048, and originally I set it to type ef02.
>
>If at this stage you ran grub2-install /dev/sdf *and* it completed
>without errors, it means it overwrote the beginning of /dev/sdf8.

Well, that's disturbing.  I guess mdadm or ext3 or whatever is
resilient enough to slough that off.

> I wonder
>what we can do to check that a partition is not in use.  Maybe opening
>it exclusively would help in this case.
>
>> Then I changed it to fd00 and made the block 34 partition (11)
>> type ef02.  I tried to make that partition 11 ext3 and
>> put some of /boot in it, but obviously it's way too short
>> for anything but grub's persistent environment.  So I used
>> dd with if=/dev/zero to clear it.  And I did grub2-install
>> with the --recheck option.  All while booted from DVD and
>> using chroot, keeping in mind the device was /dev/sdb.
>>
>> That avoided "grub rescue", but the only kernel it found was
>> the old one, 3.8.11.
>>
>
>Not sure I understand so far how it could fix "grub rescue".

I don't either, but I guess lingering data on sde sometimes didn't
steer grub out of bounds.  I don't recall ever having a grub.cfg with
only 3.8.11 in it, but that's what grub was seeing (see next
paragraph).

>> I stabbed in the dark through another 4 or 5 reboots, until
>> eventually pager=1 and cat to look at /boot/grub/grub.cfg
>> showed that the only menuentry in it for Linux was for
>> 3.8.11, while I knew the latest grub.cfg I had also had
>> the new 4.0.5, as well as older 3.8.9 and 3.8.7 ones.
>> I'm still not sure where that grub.cfg came from, but I
>> made the assumption that it had to do with grub being too
>> liberal about failed members of RAID6 partitions.
>>
>> So I ran
>>
>>   mdadm --zero-superblock /dev/sde8
>>
>> and also for 10.
>>
>> I think that fixed things.
>
>Yes, *that* is quite possible. Unfortunately GRUB does not currently
>check whether disk superblock generations match each other, and it
>stops scanning as soon as enough disks are found, so it could pick up
>stale pieces from sde instead of new ones from sdf.

The kernel RAID support is more picky, and it ignored sde, considering
its partitions failed.

>So to be on the safe side it is necessary to either remove the
>replaced drive or zero it out, so it is not detected as part of the
>RAID.
>
>Anyway, I'm happy you fixed it and thank you very much for sharing
>your experience, it is quite helpful!

What to do when a RAID element fails is not all that clear.  Maybe
mdadm with the --replace and --with arguments?  But first doing
something to add a spare?  The --add approach seemed to work for me.
I've done this on another Ubuntu system (not the one still running
anacron from cron daily) that had a SATA controller fail.  I think it
restores the RAID to fully working.  At least I don't see any evidence
to the contrary.
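If I ever have to swap out a failed member again, the two sequences as
I understand them look roughly like this; the array and partition
names are only placeholders from my setup, and I haven't actually
tried the --replace form:

  # fail and remove the bad member, then add the new one and let the
  # array resync onto it
  mdadm /dev/md127 --fail /dev/sde10
  mdadm /dev/md127 --remove /dev/sde10
  mdadm /dev/md127 --add /dev/sdf10

  # or, with a recent enough mdadm: add the new disk as a spare first,
  # then rebuild onto it directly and drop the old member afterwards
  mdadm /dev/md127 --add /dev/sdf10
  mdadm /dev/md127 --replace /dev/sde10 --with /dev/sdf10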
I see that in grub you have all these things to consider: optical
drives, virtual machines, Hurd, other CPU architectures.  So I can see
how a little detail about failed RAID members could fall off the edge.
But it's not very comfortable when that "grub rescue" prompt comes up,
and I thought, oh, what now?
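P.S.  For anyone else who lands at that prompt, my understanding is
that rescue mode only gives you a handful of commands until normal.mod
is loaded, roughly along these lines (the device name is just an
example, not necessarily one that fits my RAID layout):

  grub rescue> ls
  grub rescue> set prefix=(hd0,gpt8)/boot/grub
  grub rescue> set root=(hd0,gpt8)
  grub rescue> insmod normal
  grub rescue> normal

and then, once the full grub shell is up, the kind of looking around I
mentioned above:

  grub> set pager=1
  grub> cat /boot/grub/grub.cfg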