From mboxrd@z Thu Jan  1 00:00:00 1970
From: Yann Dupont
Subject: Re: domino-style OSD crash
Date: Fri, 06 Jul 2012 09:19:56 +0200
Message-ID: <4FF6919C.8080201@univ-nantes.fr>
References: <4FCC7573.3000704@univ-nantes.fr> <4FF2AFEB.1010403@univ-nantes.fr> <4FF35C01.4070400@univ-nantes.fr> <4FF3F98C.30602@univ-nantes.fr> <4FF48317.5030802@univ-nantes.fr>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
Received: from smtptls1-lmb.cpub.univ-nantes.fr ([193.52.103.110]:59605 "EHLO smtp-tls.univ-nantes.fr" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1750802Ab2GFIMP (ORCPT ); Fri, 6 Jul 2012 04:12:15 -0400
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Gregory Farnum
Cc: Sam Just, ceph-devel

On 05/07/2012 23:32, Gregory Farnum wrote:
[...]
>> ok, so as all nodes were identical, I probably have hit a btrfs bug (like an
>> erroneous out of space) at more or less the same time. And when 1 osd was
>> out,

Oh, I didn't finish the sentence... When 1 OSD was out, the missing data
was copied to other nodes, which probably accelerated the btrfs problem on
those nodes (I suspect erroneous out-of-space conditions).

I've reformatted the OSDs with xfs. Performance is slightly worse for the
moment (well, it depends on the workload, and maybe the lack of syncfs is to
blame -- a short sketch of that call follows at the end of this message),
but at least I hope the storage layer is now rock-solid. BTW, I've managed
to keep the faulty btrfs volumes.

[...]
>>> I wonder if maybe there's a confounding factor here -- are all your nodes
>>> similar to each other,
>> Yes. I designed the cluster that way. All nodes are identical hardware
>> (PowerEdge M610, 10G Intel Ethernet + Emulex Fibre Channel attached to
>> storage; 1 array for 2 OSD nodes, 1 controller dedicated to each OSD)
> Oh, interesting. Are the broken nodes all on the same set of arrays?

No. There are 4 completely independent RAID arrays, in 4 different
locations. They are similar (same brand & model, but slightly different
disks, and 1 has a different firmware), and all arrays are multipathed. I
don't think the RAID arrays are the problem. We have been using these
particular models for 2-3 years, and in the logs I don't see any problem
that could be caused by the storage itself (like SCSI or multipath errors).

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
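
About the syncfs remark above, a minimal sketch (not from the original
thread, and assuming Linux >= 2.6.39 with glibc >= 2.14): syncfs(2) flushes
only the filesystem backing one file descriptor, whereas the sync(2)
fallback used on older libcs flushes every mounted filesystem, which is the
performance concern. The directory argument is purely illustrative.

/* syncfs_check.c -- hedged example, not part of the original mail.
 * Flush only the filesystem that backs one directory instead of
 * calling sync(2) on every mounted filesystem. Requires Linux >= 2.6.39
 * and glibc >= 2.14. Build with: cc -o syncfs_check syncfs_check.c
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* Use the directory given on the command line, e.g. an OSD data dir. */
    const char *dir = (argc > 1) ? argv[1] : ".";
    int fd = open(dir, O_RDONLY | O_DIRECTORY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (syncfs(fd) != 0) {   /* only this one filesystem is flushed */
        perror("syncfs");    /* ENOSYS here means no kernel support */
        close(fd);
        return 1;
    }
    close(fd);
    puts("syncfs() succeeded");
    return 0;
}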