From: Gregory Farnum <greg@inktank.com>
To: Yann Dupont <Yann.Dupont@univ-nantes.fr>,
	Sam Just <sam.just@inktank.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: domino-style OSD crash
Date: Thu, 5 Jul 2012 14:32:08 -0700
Message-ID: <CAPYLRzg3vJNGYUwBrZ6e9G6x-URCodAxFLfXRL10T8yOqx+wVQ@mail.gmail.com>
In-Reply-To: <4FF48317.5030802@univ-nantes.fr>

On Wed, Jul 4, 2012 at 10:53 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> On 04/07/2012 18:21, Gregory Farnum wrote:
>
>> On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:
>>>
>>> On 03/07/2012 23:38, Tommi Virtanen wrote:
>>>>
>>>> On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>>>>>
>>>>> In the case I could repair, do you think a crashed FS as it is right
>>>>> now is valuable for you, for future reference, as I saw you can't
>>>>> reproduce the problem? I can make an archive (or a btrfs dump?), but it
>>>>> will be quite big.
>>>>
>>>>     At this point, it's more about the upstream developers (of btrfs
>>>> etc)
>>>> than us; we're on good terms with them but not experts on the on-disk
>>>> format(s). You might want to send an email to the relevant mailing
>>>> lists before wiping the disks.
>>>
>>> Well, I probably wasn't clear enough. I talked about a crashed FS, but I
>>> was talking about Ceph. The underlying FS (btrfs in that case) of 1 node
>>> (and only one) has PROBABLY crashed in the past, causing corruption in the
>>> Ceph data on this node, and then the subsequent crash of the other nodes.
>>> RIGHT now btrfs on this node is OK. I can access the filesystem without
>>> errors.
>>> For the moment, of the 8 nodes, 4 refuse to restart.
>>> 1 of the 4 was the crashed node; the 3 others didn't have problems with
>>> the underlying fs as far as I can tell.
>>> So I think the scenario is:
>>> One node had a problem with btrfs, leading first to a kernel problem,
>>> probably corruption (on disk / in memory maybe?), and ultimately to a
>>> kernel oops. Before that final kernel oops, bad data was transmitted to
>>> the other (sane) nodes, leading to ceph-osd crashes on those nodes.
>>
>> I don't think that's actually possible — the OSDs all do quite a lot of
>> interpretation between what they get off the wire and what goes on disk.
>> What you've got here are 4 corrupted LevelDB databases, and we pretty much
>> can't do that through the interfaces we have. :/
>
>
> ok, so as all the nodes were identical, I probably hit a btrfs bug (like an
> erroneous out-of-space) at more or less the same time. And when 1 osd was
> out,
>
>>
>>>
>>> If you think this scenario is highly improbable in real life (that is,
>>> btrfs will probably be fixed for good, and then corruption can't
>>> happen), that's ok.
>>> But I wonder whether this scenario could be triggered by other problems,
>>> with bad data transmitted to other sane nodes (power outage, out-of-memory
>>> condition, disk full... for example).
>>> That's why I offered you a crashed Ceph volume image (I shouldn't have
>>> talked about a crashed fs, sorry for the confusion).
>>
>> I appreciate the offer, but I don't think this will help much — it's a
>> disk state managed by somebody else, not our logical state, which has
>> broken. If we could figure out how that state got broken that'd be good, but
>> a "ceph image" won't really help in doing so.
>
> ok, no problem. I'll restart from scratch, freshly formatted.
>
>>
>> I wonder if maybe there's a confounding factor here — are all your nodes
>> similar to each other,
>
>
> Yes. I designed the cluster that way. All nodes are identical hardware
> (PowerEdge M610, 10G Intel Ethernet + Emulex Fibre Channel attached to
> storage; 1 array for 2 OSD nodes, 1 controller dedicated to each OSD).

Oh, interesting. Are the broken nodes all on the same set of arrays?
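
If it helps to cross-check, something like the following should show which
OSDs map to which hosts and which of them are currently down, so you can
compare that against the array cabling (standard commands, though the exact
output format may differ a bit on your version):

  # Show the CRUSH tree: which OSDs live on which hosts, and their up/down state
  ceph osd tree

  # Dump the OSD map and roughly filter for down OSDs
  ceph osd dump | grep down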


>
>
>>   or are they running on different kinds of hardware? How did you do your
>> Ceph upgrades? What's ceph -s display when the cluster is running as best it
>> can?
>
>
> Ceph was running 0.47.2 at that time (Debian package for Ceph). After the
> crash I couldn't restart all the nodes. Tried 0.47.3 and now 0.48, without
> success.
>
> Nothing particular for the upgrades: because Ceph is broken at the moment,
> it was just apt-get upgrade with the new version.
>
>
> ceph -s shows this:
>
> root@label5:~# ceph -s
>    health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering; 32
> pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck stale;
> 1092 pgs stuck unclean; recovery 267286/2491140 degraded (10.729%);
> 1814/1245570 unfound (0.146%)
>    monmap e1: 3 mons at
> {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0},
> election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa
>    osdmap e2404: 8 osds: 3 up, 3 in
>     pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5
> active+recovering+remapped, 32 active+clean+replay, 11
> active+recovering+degraded, 25 active+remapped, 710 down+peering, 222
> active+degraded, 7 stale+active+recovering+degraded, 61 stale+down+peering,
> 20 stale+active+degraded, 6 down+remapped+peering, 8
> stale+down+remapped+peering, 9 active+recovering; 4786 GB data, 7495 GB
> used, 7280 GB / 15360 GB avail; 267286/2491140 degraded (10.729%);
> 1814/1245570 unfound (0.146%)
>    mdsmap e172: 1/1/1 up {0=karuizawa=up:replay}, 2 up:standby

Okay, that looks about how I'd expect if half your OSDs are down.
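
To see exactly which PGs are stuck and which OSDs they're waiting on,
something along these lines can help (the exact subcommands vary a bit
between versions, and <pgid> below is a placeholder for one of the
problematic PG ids):

  # Per-PG detail behind the HEALTH_WARN summary
  ceph health detail

  # Full PG listing; filter for the down+peering ones and note the acting OSDs
  ceph pg dump | grep down+peering

  # Query a single problematic PG for its peering/recovery state
  ceph pg <pgid> query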

>
>
>
> BTW, after the 0.48 upgrade, there was a disk format conversion. 1 of the 4
> surviving OSDs didn't complete it:
>
> 2012-07-04 10:13:27.291541 7f8711099780 -1 filestore(/CEPH/data/osd.1)
> FileStore::mount : stale version stamp detected: 2. Proceeding, do_update is
> set, performing disk format upgrade.
> 2012-07-04 10:13:27.291618 7f8711099780  0 filestore(/CEPH/data/osd.1) mount
> found snaps <3744666,3746725>
>
> Then nothing happens for hours; iotop shows constant disk usage:

>  6069 be/4 root        0.00 B/s   32.09 M/s  0.00 % 19.08 % ceph-osd -i 1
> --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf
>
> strace shows lots of syscalls like this:
>
> [pid  6069] pread(25, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4101,
> 94950) = 4101
> [pid  6069] pread(23, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4107,
> 49678) = 4107
> [pid  6069] pread(36, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4110,
> 99797) = 4110
> [pid  6069] pread(37, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4105, 8211)
> = 4105
> [pid  6069] pread(25, "\0C\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4121,
> 99051) = 4121
> [pid  6069] pread(36, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4173,
> 103907) = 4173
> [pid  6069] pread(37, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4169,
> 12316) = 4169
> [pid  6069] pread(37, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4130,
> 16485) = 4130
> [pid  6069] pread(36, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4129,
> 108080) = 4129

Sam, does this look like something of ours to you?
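
In the meantime, one way to see what those reads are actually hitting is to
resolve the file descriptors from the strace output back to paths (this
assumes the ceph-osd pid is still 6069, as in the iotop line above):

  # Map the fds from the pread() calls back to the files they point at
  ls -l /proc/6069/fd/23 /proc/6069/fd/25 /proc/6069/fd/36 /proc/6069/fd/37

  # If they resolve to .sst files under the osd data dir (e.g. current/omap),
  # the conversion is walking the OSD's LevelDB-backed object map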

