From: "yangruifeng.09209@h3c.com" <yangruifeng.09209@h3c.com>
To: Samuel Just <sjust@redhat.com>
Cc: Chenxiaowei <chen.xiaowei@h3c.com>,
	"Sage Weil (sweil@redhat.com)" <sweil@redhat.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: Re: Re: Re: Re: another peering stuck caused by net problem.
Date: Tue, 3 Nov 2015 01:41:16 +0000
Message-ID: <0C3F6DA3760D0C4691E69E5BE224FF495E175AAD@H3CMLB12-EX.srv.huawei-3com.com>
In-Reply-To: <CAN=+7FVaQqkN27_7mENWY38a+5HcQOKVug1zR_B4ar+7jwe2Sg@mail.gmail.com>

I will do my best to get the detailed log.
In the current version, can we be sure that peering-related messages are correctly received by peers?

Thanks,
Ruifeng Yang

-----Original Message-----
From: Samuel Just [mailto:sjust@redhat.com]
Sent: November 3, 2015 9:28
To: yangruifeng 09209 (RD)
Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@redhat.com); ceph-devel@vger.kernel.org
Subject: Re: Re: Re: Re: Re: another peering stuck caused by net problem.

Temporary network failures should be handled correctly.  The best solution is to actually fix that bug then.  Capture logging on all involved osds while it is hung and open a bug:

debug osd = 20
debug filestore = 20
debug ms = 1
-Sam
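
The three settings above are written in ceph.conf style. As a minimal sketch, and not something proposed in this thread, the same levels could be raised on running OSDs with `ceph tell osd.N injectargs`, assuming that form is available on the cluster's release. The snippet below only builds the command lines and does not run them. The OSD ids 3, 7 and 12 and the helper names are hypothetical.

```python
import shlex

# Debug levels quoted in the message above (subsystem -> level).
DEBUG_LEVELS = {"osd": 20, "filestore": 20, "ms": 1}


def injectargs_command(osd_id, levels=DEBUG_LEVELS):
    """Build the argv for raising debug levels on one OSD at runtime.

    Assumes the `ceph tell osd.N injectargs '--debug-<subsys> <level>'`
    form; nothing is executed here.
    """
    args = " ".join(f"--debug-{name} {level}" for name, level in levels.items())
    return ["ceph", "tell", f"osd.{osd_id}", "injectargs", args]


def commands_for(osd_ids, levels=DEBUG_LEVELS):
    """Build one command per OSD involved in the stuck PG."""
    return [injectargs_command(osd_id, levels) for osd_id in osd_ids]


if __name__ == "__main__":
    # Hypothetical acting set for the stuck PG.
    for argv in commands_for([3, 7, 12]):
        print(shlex.join(argv))
```

Building the commands as argv lists keeps the quoting of the single `injectargs` argument explicit, so they can be reviewed before being run on each involved OSD.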

On Mon, Nov 2, 2015 at 5:24 PM, yangruifeng.09209@h3c.com <yangruifeng.09209@h3c.com> wrote:
> A problem of unknown cause leaves the PG stuck in peering; it may be a temporary network failure or some other bug.
> But it can be resolved by a *manual* 'ceph osd down <osdid>'.
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Samuel Just
> Sent: November 3, 2015 9:12
> To: yangruifeng 09209 (RD)
> Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@redhat.com); 
> ceph-devel@vger.kernel.org
> Subject: Re: Re: Re: Re: another peering stuck caused by net problem.
>
> The problem is that peering shouldn't hang for no reason.  If you are 
> seeing peering hang for a long time either
> 1) you are hitting a peering bug which we need to track down and fix
> 2) peering actually cannot make progress.
>
> In case 1, it can be nice to have a work around to force peering to restart and avoid the bug.  However, case 2 would not be helped by restarting peering, you'd just end up in the same place.  If you did it based on a timeout, you'd just increase load by a ton when in that situation.  What problem are you trying to solve?
> -Sam
>
> On Mon, Nov 2, 2015 at 5:05 PM, yangruifeng.09209@h3c.com <yangruifeng.09209@h3c.com> wrote:
>> ok.
>>
>> thanks
>> Ruifeng Yang
>>
>> -----Original Message-----
>> From: Samuel Just [mailto:sjust@redhat.com]
>> Sent: November 3, 2015 9:03
>> To: yangruifeng 09209 (RD)
>> Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@redhat.com)
>> Subject: Re: Re: Re: another peering stuck caused by net problem.
>>
>> Would it be ok if I reply to the list as well?
>> -Sam
>>
>> On Mon, Nov 2, 2015 at 4:37 PM, yangruifeng.09209@h3c.com <yangruifeng.09209@h3c.com> wrote:
>>> In some exceptional cases the cluster may stay in peering indefinitely, but 
>>> it can return to normal after a *manual* 'ceph osd down <osdid>'. This is 
>>> not convenient in a production environment, and it goes against the concept of RADOS.
>>> Would it be reasonable to add a timeout mechanism to kick it, or to kick it when I/O hangs?
>>>
>>> thanks,
>>> Ruifeng Yang
>>>
>>> -----Original Message-----
>>> From: Samuel Just [mailto:sjust@redhat.com]
>>> Sent: November 3, 2015 2:21
>>> To: yangruifeng 09209 (RD)
>>> Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@redhat.com)
>>> Subject: Re: Re: another peering stuck caused by net problem.
>>>
>>> I mean issue 'ceph osd down <osdid>' for the primary on the pg.  But that only causes peering to restart.  If peering stalled previously, it'll probably stall again.  What are you trying to accomplish?
>>> -Sam
>>>
>>> On Fri, Oct 30, 2015 at 5:51 PM, yangruifeng.09209@h3c.com <yangruifeng.09209@h3c.com> wrote:
>>>> Do you mean restarting the primary OSD, or some other command?
>>>>
>>>> thanks
>>>> Ruifeng Yang
>>>>
>>>> -----Original Message-----
>>>> From: Samuel Just [mailto:sjust@redhat.com]
>>>> Sent: October 30, 2015 23:07
>>>> To: chenxiaowei 11245 (RD)
>>>> Cc: Sage Weil (sweil@redhat.com); yangruifeng 09209 (RD)
>>>> Subject: Re: another peering stuck caused by net problem.
>>>>
>>>> How would that help?  As a way to work around a possible bug?  You can accomplish pretty much the same thing by setting the primary down.
>>>> -Sam
>>>>
>>>> On Wed, Oct 28, 2015 at 8:22 PM, Chenxiaowei <chen.xiaowei@h3c.com> wrote:
>>>>> Hi Samuel and Sage,
>>>>>         I am cxwshawn from H3C (part of HP). PGs getting stuck in peering 
>>>>> is a serious problem, especially in a production environment, so we came up with two possible solutions:
>>>>>         1) if a PG stays in the Peering state too long, detect the timeout 
>>>>> and force a transition from Peering to Reset; or
>>>>>         2) add a command to force a single stuck PG from Peering to Reset.
>>>>>
>>>>> What is your advice? We look forward to your reply.
>>>>>
>>>>> Yours
>>>>> shawn from Beijing, China.
