From: Gregory Farnum <greg@inktank.com>
To: Yann Dupont <Yann.Dupont@univ-nantes.fr>,
	Sam Just <sam.just@inktank.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: domino-style OSD crash
Date: Thu, 5 Jul 2012 14:32:08 -0700
Message-ID: <CAPYLRzg3vJNGYUwBrZ6e9G6x-URCodAxFLfXRL10T8yOqx+wVQ@mail.gmail.com>
In-Reply-To: <4FF48317.5030802@univ-nantes.fr>

On Wed, Jul 4, 2012 at 10:53 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> On 04/07/2012 18:21, Gregory Farnum wrote:
>
>> On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:
>>>
>>> On 03/07/2012 23:38, Tommi Virtanen wrote:
>>>>
>>>> On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>>>>>
>>>>> In the case I could repair, do you think a crashed FS as it is right
>>>>> now is valuable for you, for future reference, as I saw you can't
>>>>> reproduce the problem? I can make an archive (or a btrfs dump?), but it
>>>>> will be quite big.
>>>>
>>>>     At this point, it's more about the upstream developers (of btrfs
>>>> etc)
>>>> than us; we're on good terms with them but not experts on the on-disk
>>>> format(s). You might want to send an email to the relevant mailing
>>>> lists before wiping the disks.
>>>
>>> Well, I probably wasn't clear enough. I talked about a crashed FS, but I
>>> was talking about Ceph. The underlying FS (btrfs in that case) of 1 node
>>> (and only one) has PROBABLY crashed in the past, causing corruption in the
>>> Ceph data on this node, and then the subsequent crash of the other nodes.
>>> RIGHT now btrfs on this node is OK. I can access the filesystem without
>>> errors.
>>> For the moment, of the 8 nodes, 4 refuse to restart.
>>> 1 of the 4 was the crashed node; the 3 others didn't have problems with
>>> the underlying fs as far as I can tell.
>>> So I think the scenario is:
>>> One node had a problem with btrfs, leading first to a kernel problem,
>>> probably corruption (on disk / in memory maybe?), and ultimately to a
>>> kernel oops. Before that final kernel oops, bad data was transmitted to
>>> the other (sane) nodes, leading to ceph-osd crashes on those nodes.
>>
>> I don't think that's actually possible — the OSDs all do quite a lot of
>> interpretation between what they get off the wire and what goes on disk.
>> What you've got here are 4 corrupted LevelDB databases, and we pretty much
>> can't do that through the interfaces we have. :/
>
>
> ok, so as all the nodes were identical, I probably hit a btrfs bug (like an
> erroneous out-of-space) at more or less the same time. And when 1 osd was
> out,
>
>>
>>>
>>> If you think this scenario is highly improbable in real life (that is,
>>> btrfs will probably be fixed for good, and then corruption can't
>>> happen), that's ok.
>>> But I wonder whether this scenario could be triggered by other problems,
>>> with bad data transmitted to other sane nodes (power outage, out-of-memory
>>> condition, disk full... for example).
>>> That's why I offered you a crashed Ceph volume image (I shouldn't have
>>> talked about a crashed fs, sorry for the confusion).
>>
>> I appreciate the offer, but I don't think this will help much — it's a
>> disk state managed by somebody else, not our logical state, which has
>> broken. If we could figure out how that state got broken that'd be good, but
>> a "ceph image" won't really help in doing so.
>
> ok, no problem. I'll restart from scratch, freshly formatted.
>
>>
>> I wonder if maybe there's a confounding factor here — are all your nodes
>> similar to each other,
>
>
> Yes. I designed the cluster that way. All nodes are identical hardware
> (PowerEdge M610, 10G Intel Ethernet + Emulex Fibre Channel attached to
> storage; 1 array for 2 OSD nodes, 1 controller dedicated to each OSD).

Oh, interesting. Are the broken nodes all on the same set of arrays?
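
If it helps to cross-check, something like the following should show which
OSDs map to which hosts and which of them are currently down, so you can
compare that against the array cabling (standard commands, though the exact
output format may differ a bit on your version):

  # Show the CRUSH tree: which OSDs live on which hosts, and their up/down state
  ceph osd tree

  # Dump the OSD map and roughly filter for down OSDs
  ceph osd dump | grep down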


>
>
>>   or are they running on different kinds of hardware? How did you do your
>> Ceph upgrades? What's ceph -s display when the cluster is running as best it
>> can?
>
>
> Ceph was running 0.47.2 at that time (Debian package for Ceph). After the
> crash I couldn't restart all the nodes. Tried 0.47.3 and now 0.48, without
> success.
>
> Nothing particular for the upgrades: because Ceph is broken at the moment,
> it was just apt-get upgrade with the new version.
>
>
> ceph -s shows this:
>
> root@label5:~# ceph -s
>    health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering; 32
> pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck stale;
> 1092 pgs stuck unclean; recovery 267286/2491140 degraded (10.729%);
> 1814/1245570 unfound (0.146%)
>    monmap e1: 3 mons at
> {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0},
> election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa
>    osdmap e2404: 8 osds: 3 up, 3 in
>     pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5
> active+recovering+remapped, 32 active+clean+replay, 11
> active+recovering+degraded, 25 active+remapped, 710 down+peering, 222
> active+degraded, 7 stale+active+recovering+degraded, 61 stale+down+peering,
> 20 stale+active+degraded, 6 down+remapped+peering, 8
> stale+down+remapped+peering, 9 active+recovering; 4786 GB data, 7495 GB
> used, 7280 GB / 15360 GB avail; 267286/2491140 degraded (10.729%);
> 1814/1245570 unfound (0.146%)
>    mdsmap e172: 1/1/1 up {0=karuizawa=up:replay}, 2 up:standby

Okay, that looks about how I'd expect if half your OSDs are down.
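
To see exactly which PGs are stuck and which OSDs they're waiting on,
something along these lines can help (the exact subcommands vary a bit
between versions, and <pgid> below is a placeholder for one of the
problematic PG ids):

  # Per-PG detail behind the HEALTH_WARN summary
  ceph health detail

  # Full PG listing; filter for the down+peering ones and note the acting OSDs
  ceph pg dump | grep down+peering

  # Query a single problematic PG for its peering/recovery state
  ceph pg <pgid> query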

>
>
>
> BTW, after the 0.48 upgrade, there was a disk format conversion. 1 of the 4
> surviving OSDs didn't complete it:
>
> 2012-07-04 10:13:27.291541 7f8711099780 -1 filestore(/CEPH/data/osd.1)
> FileStore::mount : stale version stamp detected: 2. Proceeding, do_update is
> set, performing disk format upgrade.
> 2012-07-04 10:13:27.291618 7f8711099780  0 filestore(/CEPH/data/osd.1) mount
> found snaps <3744666,3746725>
>
> Then nothing happens for hours; iotop shows constant disk usage:

>  6069 be/4 root        0.00 B/s   32.09 M/s  0.00 % 19.08 % ceph-osd -i 1
> --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf
>
> strace shows lots of syscalls like this:
>
> [pid  6069] pread(25, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4101,
> 94950) = 4101
> [pid  6069] pread(23, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4107,
> 49678) = 4107
> [pid  6069] pread(36, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4110,
> 99797) = 4110
> [pid  6069] pread(37, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4105, 8211)
> = 4105
> [pid  6069] pread(25, "\0C\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4121,
> 99051) = 4121
> [pid  6069] pread(36, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4173,
> 103907) = 4173
> [pid  6069] pread(37, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4169,
> 12316) = 4169
> [pid  6069] pread(37, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4130,
> 16485) = 4130
> [pid  6069] pread(36, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4129,
> 108080) = 4129

Sam, does this look like something of ours to you?
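
In the meantime, one way to see what those reads are actually hitting is to
resolve the file descriptors from the strace output back to paths (this
assumes the ceph-osd pid is still 6069, as in the iotop line above):

  # Map the fds from the pread() calls back to the files they point at
  ls -l /proc/6069/fd/23 /proc/6069/fd/25 /proc/6069/fd/36 /proc/6069/fd/37

  # If they resolve to .sst files under the osd data dir (e.g. current/omap),
  # the conversion is walking the OSD's LevelDB-backed object map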

