From mboxrd@z Thu Jan 1 00:00:00 1970
From: Samuel Just
Subject: Re: Reply: Reply: Reply: Reply: another peering stuck caused by net problem.
Date: Mon, 2 Nov 2015 18:15:10 -0800
To: "yangruifeng.09209@h3c.com"
Cc: Chenxiaowei, "Sage Weil (sweil@redhat.com)", "ceph-devel@vger.kernel.org"

Exactly what kernel are you using?
-Sam

On Mon, Nov 2, 2015 at 6:14 PM, Samuel Just wrote:
> Yeah, there's a heartbeat system and the messenger is reliable delivery.
> -Sam
>
> On Mon, Nov 2, 2015 at 5:41 PM, yangruifeng.09209@h3c.com wrote:
>> I will try my best to get the detailed log.
>> In the current version, can we be sure that the messages related to
>> peering are correctly received by the peers?
>>
>> thanks
>> Ruifeng Yang
>>
>> -----Original Message-----
>> From: Samuel Just [mailto:sjust@redhat.com]
>> Sent: November 3, 2015 9:28
>> To: yangruifeng 09209 (RD)
>> Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@redhat.com); ceph-devel@vger.kernel.org
>> Subject: Re: Reply: Reply: Reply: Reply: another peering stuck caused by net problem.
>>
>> Temporary network failures should be handled correctly.  The best solution
>> is to actually fix that bug, then.  Capture logging on all involved OSDs
>> while it is hung and open a bug:
>>
>> debug osd = 20
>> debug filestore = 20
>> debug ms = 1
>> -Sam
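For reference, those debug levels can also be raised on the already-running
OSDs without a restart via injectargs; a minimal sketch, where osd.3 and
osd.7 are placeholder ids for the OSDs involved in the stuck pg:

  # bump debug levels on the involved OSDs while the pg is stuck (ids are examples)
  ceph tell osd.3 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
  ceph tell osd.7 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

  # or, more bluntly, on every OSD at once
  ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

The resulting logs land in /var/log/ceph/ceph-osd.<id>.log by default; the
three "debug ..." lines above go into the [osd] section of ceph.conf if the
levels need to survive an OSD restart.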
>>
>> On Mon, Nov 2, 2015 at 5:24 PM, yangruifeng.09209@h3c.com wrote:
>>> A problem with an unknown cause leaves the pg stuck in peering; it may be
>>> a temporary network failure or some other bug.
>>> BUT it can be solved by a *manual* 'ceph osd down <osdid>'.
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Samuel Just
>>> Sent: November 3, 2015 9:12
>>> To: yangruifeng 09209 (RD)
>>> Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@redhat.com);
>>> ceph-devel@vger.kernel.org
>>> Subject: Re: Reply: Reply: Reply: another peering stuck caused by net problem.
>>>
>>> The problem is that peering shouldn't hang for no reason.  If you are
>>> seeing peering hang for a long time, either
>>> 1) you are hitting a peering bug which we need to track down and fix, or
>>> 2) peering actually cannot make progress.
>>>
>>> In case 1, it can be nice to have a workaround to force peering to restart
>>> and avoid the bug.  However, case 2 would not be helped by restarting
>>> peering; you'd just end up in the same place.  If you did it based on a
>>> timeout, you'd just increase load by a ton when in that situation.  What
>>> problem are you trying to solve?
>>> -Sam
>>>
>>> On Mon, Nov 2, 2015 at 5:05 PM, yangruifeng.09209@h3c.com wrote:
>>>> ok.
>>>>
>>>> thanks
>>>> Ruifeng Yang
>>>>
>>>> -----Original Message-----
>>>> From: Samuel Just [mailto:sjust@redhat.com]
>>>> Sent: November 3, 2015 9:03
>>>> To: yangruifeng 09209 (RD)
>>>> Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@redhat.com)
>>>> Subject: Re: Reply: Reply: another peering stuck caused by net problem.
>>>>
>>>> Would it be ok if I reply to the list as well?
>>>> -Sam
>>>>
>>>> On Mon, Nov 2, 2015 at 4:37 PM, yangruifeng.09209@h3c.com wrote:
>>>>> The cluster may stay in peering forever in some exceptional cases, but it
>>>>> can return to normal with a *manual* 'ceph osd down <osdid>'.  That is
>>>>> not convenient in a production environment, and it goes against the
>>>>> concept of RADOS.  Would it be reasonable to add a timeout mechanism to
>>>>> kick it, or to kick it when I/O hangs?
>>>>>
>>>>> thanks,
>>>>> Ruifeng Yang
>>>>>
>>>>> -----Original Message-----
>>>>> From: Samuel Just [mailto:sjust@redhat.com]
>>>>> Sent: November 3, 2015 2:21
>>>>> To: yangruifeng 09209 (RD)
>>>>> Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@redhat.com)
>>>>> Subject: Re: Reply: another peering stuck caused by net problem.
>>>>>
>>>>> I mean issue 'ceph osd down <osdid>' for the primary on the pg.  But that
>>>>> only causes peering to restart.  If peering stalled previously, it'll
>>>>> probably stall again.  What are you trying to accomplish?
>>>>> -Sam
>>>>>
>>>>> On Fri, Oct 30, 2015 at 5:51 PM, yangruifeng.09209@h3c.com wrote:
>>>>>> Do you mean restart the primary OSD?  Or some other command?
>>>>>>
>>>>>> thanks
>>>>>> Ruifeng Yang
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Samuel Just [mailto:sjust@redhat.com]
>>>>>> Sent: October 30, 2015 23:07
>>>>>> To: chenxiaowei 11245 (RD)
>>>>>> Cc: Sage Weil (sweil@redhat.com); yangruifeng 09209 (RD)
>>>>>> Subject: Re: another peering stuck caused by net problem.
>>>>>>
>>>>>> How would that help?  As a way to work around a possible bug?  You can
>>>>>> accomplish pretty much the same thing by setting the primary down.
>>>>>> -Sam
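To make the workaround concrete: it boils down to finding the stuck pg's
acting primary and marking it down so that peering restarts.  A minimal
sketch, where the pgid 2.3f and osd id 12 are placeholders rather than values
from this thread:

  # list pgs stuck inactive (this includes pgs stuck in peering)
  ceph health detail
  ceph pg dump_stuck inactive

  # see where an affected pg maps; the first entry of "acting" is the primary
  ceph pg map 2.3f

  # optionally inspect what peering is waiting on before kicking anything
  ceph pg 2.3f query

  # mark the acting primary down in the osdmap; it re-asserts itself and
  # peering restarts for the pgs it serves
  ceph osd down 12

As Sam notes, this only helps when peering is hung on a bug; if peering
genuinely cannot make progress, it will just stall again.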
>>>>>>
>>>>>> On Wed, Oct 28, 2015 at 8:22 PM, Chenxiaowei wrote:
>>>>>>> Hi, Samuel & Sage:
>>>>>>>     I am cxwshawn from H3C (part of HP).  The pg peering stuck problem
>>>>>>> is a serious one, especially in a production environment, so we came
>>>>>>> up with two solutions:
>>>>>>> if a pg is stuck in the Peering state for too long, we can check
>>>>>>> whether a timeout has been exceeded and force a transition from
>>>>>>> Peering to the Reset state, or we can add a command line to force a
>>>>>>> pg stuck in Peering into the Reset state.
>>>>>>>
>>>>>>> What's your advice?  Looking forward to your reply.
>>>>>>>
>>>>>>> Yours,
>>>>>>> shawn from Beijing, China.
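Purely as an illustration of the timeout idea above (and of Sam's caveat that
it only helps when peering is hung on a bug, not when it genuinely cannot make
progress), the proposal can be approximated outside Ceph today by wrapping the
manual workaround in a small watchdog.  A rough sketch; the pgid 2.3f, osd id
12, and the 300-second deadline are made-up placeholders:

  # watch one known-stuck pg; if it is still peering after the deadline,
  # kick it once by marking its acting primary down
  pgid=2.3f
  primary=12        # first entry of "acting" from 'ceph pg map $pgid'
  deadline=300      # seconds to wait before kicking

  sleep "$deadline"
  if ceph pg "$pgid" query | grep '"state"' | grep -q peering; then
      # still peering: restart peering by marking the acting primary down;
      # the osd re-asserts itself to the monitors and peering is retried
      ceph osd down "$primary"
  fi

Per the discussion above, a built-in timeout in the Peering state machine
itself would run into the same case-2 concern.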