From: Yann Dupont
Subject: Re: domino-style OSD crash
Date: Wed, 04 Jul 2012 19:53:27 +0200
Message-ID: <4FF48317.5030802@univ-nantes.fr>
References: <4FCC7573.3000704@univ-nantes.fr> <4FF2AFEB.1010403@univ-nantes.fr> <4FF35C01.4070400@univ-nantes.fr> <4FF3F98C.30602@univ-nantes.fr>
To: Gregory Farnum
Cc: Tommi Virtanen, Sam Just, ceph-devel

On 04/07/2012 18:21, Gregory Farnum wrote:
> On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:
>> On 03/07/2012 23:38, Tommi Virtanen wrote:
>>> On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont wrote:
>>>> In the case I could repair, do you think a crashed FS as it is right
>>>> now is valuable for you, for future reference, as I saw you can't
>>>> reproduce the problem? I can make an archive (or a btrfs dump?), but
>>>> it will be quite big.
>>>
>>> At this point, it's more about the upstream developers (of btrfs etc)
>>> than us; we're on good terms with them but not experts on the on-disk
>>> format(s). You might want to send an email to the relevant mailing
>>> lists before wiping the disks.
>>
>> Well, I probably wasn't clear enough. I talked about a crashed FS, but I
>> was talking about Ceph. The underlying FS (btrfs in that case) of 1 node
>> (and only one) has PROBABLY crashed in the past, causing corruption in
>> the Ceph data on this node, and then the subsequent crash of other nodes.
>>
>> RIGHT now btrfs on this node is OK. I can access the filesystem without
>> errors.
>>
>> For the moment, of the 8 nodes, 4 refuse to restart.
>> 1 of the 4 nodes was the crashed node; the 3 others didn't have problems
>> with the underlying fs as far as I can tell.
>>
>> So I think the scenario is:
>>
>> One node had a problem with btrfs, leading first to kernel problems,
>> probably corruption (on disk / in memory maybe?), and ultimately to a
>> kernel oops. Before that ultimate kernel oops, bad data was transmitted
>> to the other (sane) nodes, leading to ceph-osd crashes on those nodes.
> I don't think that's actually possible — the OSDs all do quite a lot of interpretation between what they get off the wire and what goes on disk. What you've got here are 4 corrupted LevelDB databases, and we pretty much can't do that through the interfaces we have. :/

ok, so as all nodes were identical, I probably hit a btrfs bug (like an
erroneous out of space) at more or less the same time. And when 1 osd
was out,
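For what it's worth, this is roughly how I'd check whether that node
really ran out of btrfs (metadata) space; just a sketch, assuming the OSD
data sits on the btrfs mount at /CEPH/data/osd.1 (the path from the log
further down):

  btrfs filesystem df /CEPH/data/osd.1     # data vs metadata allocation and usage
  dmesg | grep -iE 'btrfs|no space left'   # leftover ENOSPC / btrfs oops traces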
>>
>> If you think this scenario is highly improbable in real life (that is,
>> btrfs will probably be fixed for good, and then, corruption can't
>> happen), it's ok.
>>
>> But I wonder if this scenario can be triggered by other problems, and
>> bad data can be transmitted to other sane nodes (power outage, out of
>> memory condition, disk full... for example).
>>
>> That's why I proposed you a crashed ceph volume image (I shouldn't have
>> talked about a crashed fs, sorry for the confusion)
> I appreciate the offer, but I don't think this will help much — it's a disk state managed by somebody else, not our logical state, which has broken. If we could figure out how that state got broken that'd be good, but a "ceph image" won't really help in doing so.

ok, no problem. I'll restart from scratch, freshly formatted.

> I wonder if maybe there's a confounding factor here — are all your nodes similar to each other,

Yes. I designed the cluster that way. All nodes are identical hardware
(PowerEdge M610, 10G Intel Ethernet + Emulex Fibre Channel attached to
storage (1 array for 2 OSD nodes, 1 controller dedicated to each OSD)).

> or are they running on different kinds of hardware? How did you do your Ceph upgrades? What does ceph -s display when the cluster is running as best it can?

Ceph was running 0.47.2 at that time (Debian package for ceph). After
the crash I couldn't restart all the nodes. Tried 0.47.3 and now 0.48
without success.

Nothing particular for upgrades, because for the moment ceph is broken,
so just apt-get upgrade with the new version.

ceph -s shows this:

root@label5:~# ceph -s
   health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering; 32 pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck stale; 1092 pgs stuck unclean; recovery 267286/2491140 degraded (10.729%); 1814/1245570 unfound (0.146%)
   monmap e1: 3 mons at {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0}, election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa
   osdmap e2404: 8 osds: 3 up, 3 in
   pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5 active+recovering+remapped, 32 active+clean+replay, 11 active+recovering+degraded, 25 active+remapped, 710 down+peering, 222 active+degraded, 7 stale+active+recovering+degraded, 61 stale+down+peering, 20 stale+active+degraded, 6 down+remapped+peering, 8 stale+down+remapped+peering, 9 active+recovering; 4786 GB data, 7495 GB used, 7280 GB / 15360 GB avail; 267286/2491140 degraded (10.729%); 1814/1245570 unfound (0.146%)
   mdsmap e172: 1/1/1 up {0=karuizawa=up:replay}, 2 up:standby
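As a quick sanity check on those numbers (just arithmetic on the output
above, nothing ceph prints itself): 2491140 is exactly twice 1245570, so
the degraded percentage seems to be counted per object copy (2x
replication) while the unfound percentage is per object:

  awk 'BEGIN { printf "degraded: %.3f%%\n", 267286 * 100 / 2491140 }'   # -> 10.729%
  awk 'BEGIN { printf "unfound:  %.3f%%\n",   1814 * 100 / 1245570 }'   # ->  0.146%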
BTW, after the 0.48 upgrade, there was a disk format conversion. 1 of
the 4 surviving OSDs didn't complete it:

2012-07-04 10:13:27.291541 7f8711099780 -1 filestore(/CEPH/data/osd.1)
FileStore::mount : stale version stamp detected: 2. Proceeding,
do_update is set, performing disk format upgrade.
2012-07-04 10:13:27.291618 7f8711099780  0 filestore(/CEPH/data/osd.1)
mount found snaps <3744666,3746725>

Then, nothing happens for hours; iotop shows constant disk usage:

 6069 be/4 root  0.00 B/s  32.09 M/s  0.00 %  19.08 % ceph-osd -i 1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf

strace shows lots of syscalls like this:

[pid  6069] pread(25, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4101, 94950) = 4101
[pid  6069] pread(23, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4107, 49678) = 4107
[pid  6069] pread(36, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4110, 99797) = 4110
[pid  6069] pread(37, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4105, 8211) = 4105
[pid  6069] pread(25, "\0C\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4121, 99051) = 4121
[pid  6069] pread(36, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4173, 103907) = 4173
[pid  6069] pread(37, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4169, 12316) = 4169
[pid  6069] pread(37, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4130, 16485) = 4130
[pid  6069] pread(36, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4129, 108080) = 4129

It seems to loop indefinitely. But it's another problem, I guess, maybe
a consequence of the other problems.

Cheers.

--
Yann Dupont - Service IRTS, DSI
Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
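PS: in case anyone wants to reproduce this kind of capture on their own
stuck OSD, something like this should do it (a sketch; the pid-file path
is the one from the iotop line above):

  iotop -b -o -d 5                                            # batch mode, only processes doing I/O
  strace -f -p $(cat /var/run/ceph/osd.1.pid) -e trace=desc   # follow threads, file-descriptor syscalls only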