* Collection of strange lockups on 0.51
@ 2012-09-12 17:33 Andrey Korolyov
  2012-09-12 21:09 ` Tommi Virtanen
  0 siblings, 1 reply; 6+ messages in thread
From: Andrey Korolyov @ 2012-09-12 17:33 UTC (permalink / raw)
  To: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 855 bytes --]

Hi,

This is probably off-topic, but I'm asking here because only Ceph
triggers such a bug :).

With 0.51, the following happens: if I kill an OSD, one or more
neighboring nodes may hang with CPU lockups. These are not related to
temperature, overall interrupt count, or load average, and they
happen randomly across the 16-node cluster. I am almost sure that
Ceph is triggering some hardware bug, but I am not sure of its
origin. Also, shortly after resetting from such a crash, a new lockup
may be triggered by almost any action.

Before blaming system drivers and continuing to investigate the
problem, may I ask whether anyone has faced a similar problem? I am
using 802.3ad bonding on a pair of Intel I350 NICs for general
connectivity. I have attached some of the traces that were pushed to
netconsole (in some cases the machine died hard, without even sending
a final goodbye over netconsole, so the log is not complete).
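For reference, captures like the attached log are usually collected
with the kernel's netconsole module; a minimal sketch of loading it,
assuming hypothetical IP addresses, port, and interface name (the
real values depend on the log-receiving host):

```shell
# Send kernel messages over UDP to a remote syslog-style collector.
# Format: src-port@src-ip/interface,dst-port@dst-ip/dst-mac
modprobe netconsole netconsole=6666@10.0.0.2/eth0,514@10.0.0.1/00:11:22:33:44:55
```

On the receiving side something as simple as `nc -l -u -p 514` is
enough to record the stream.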

[-- Attachment #2: netcon.log.gz --]
[-- Type: application/x-gzip, Size: 12384 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Collection of strange lockups on 0.51
  2012-09-12 17:33 Collection of strange lockups on 0.51 Andrey Korolyov
@ 2012-09-12 21:09 ` Tommi Virtanen
  2012-09-12 21:43   ` Andrey Korolyov
  0 siblings, 1 reply; 6+ messages in thread
From: Tommi Virtanen @ 2012-09-12 21:09 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: ceph-devel

On Wed, Sep 12, 2012 at 10:33 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
> Hi,
> This is probably off-topic, but I'm asking here because only Ceph
> triggers such a bug :).
>
> With 0.51, the following happens: if I kill an OSD, one or more
> neighboring nodes may hang with CPU lockups. These are not related
> to temperature, overall interrupt count, or load average, and they
> happen randomly across the 16-node cluster. I am almost sure that
> Ceph is triggering some hardware bug, but I am not sure of its
> origin. Also, shortly after resetting from such a crash, a new
> lockup may be triggered by almost any action.

From the log, it looks like your ethernet driver is crapping out.

[172517.057886] NETDEV WATCHDOG: eth0 (igb): transmit queue 7 timed out
...
[172517.058622]  [<ffffffff812b2975>] ? netif_tx_lock+0x40/0x76

etc.

The later oopses mention paravirt_write_msr etc., which makes me
think you're using Xen? You probably don't want to run Ceph servers
inside virtualization (for production).

[172696.503900]  [<ffffffff8100d025>] ? paravirt_write_msr+0xb/0xe
[172696.503942]  [<ffffffff810325f3>] ? leave_mm+0x3e/0x3e

and *then* you get

[172695.041709] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
[172695.041745] megasas: [ 0]waiting for 35 commands to complete
[172696.045602] megaraid_sas: no pending cmds after reset
[172696.045644] megasas: reset successful

which just adds more awesomeness to the soup -- though I do wonder if
this could be caused by the soft hang from earlier.
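If the tx queue timeouts recur, the per-queue counters that the igb
driver exports can show whether a single queue is wedged; a sketch,
assuming the interface name eth0 from the traces (exact counter names
vary between driver versions):

```shell
# Dump NIC/driver statistics and pick out the per-queue tx counters.
ethtool -S eth0 | grep -i 'tx_queue'
# The watchdog fires when a queue stops draining; a queue whose
# packet counter stays frozen while others advance is the suspect.
```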

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Collection of strange lockups on 0.51
  2012-09-12 21:09 ` Tommi Virtanen
@ 2012-09-12 21:43   ` Andrey Korolyov
  2012-09-30 21:55     ` Andrey Korolyov
  0 siblings, 1 reply; 6+ messages in thread
From: Andrey Korolyov @ 2012-09-12 21:43 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On Thu, Sep 13, 2012 at 1:09 AM, Tommi Virtanen <tv@inktank.com> wrote:
> On Wed, Sep 12, 2012 at 10:33 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>> Hi,
>> This is probably off-topic, but I'm asking here because only Ceph
>> triggers such a bug :).
>>
>> With 0.51, the following happens: if I kill an OSD, one or more
>> neighboring nodes may hang with CPU lockups. These are not related
>> to temperature, overall interrupt count, or load average, and they
>> happen randomly across the 16-node cluster. I am almost sure that
>> Ceph is triggering some hardware bug, but I am not sure of its
>> origin. Also, shortly after resetting from such a crash, a new
>> lockup may be triggered by almost any action.
>
> From the log, it looks like your ethernet driver is crapping out.
>
> [172517.057886] NETDEV WATCHDOG: eth0 (igb): transmit queue 7 timed out
> ...
> [172517.058622]  [<ffffffff812b2975>] ? netif_tx_lock+0x40/0x76
>
> etc.
>
> The later oopses mention paravirt_write_msr etc., which makes me
> think you're using Xen? You probably don't want to run Ceph servers
> inside virtualization (for production).

Nope. Xen was my choice for almost five years, but I have now
replaced it with KVM everywhere due to the buggy 4.1 '-stable'; 4.0
has the same poor network performance as 3.x but can genuinely be
called stable. All of those backtraces come from bare hardware.

At the end you can see a nice backtrace that appeared soon after the
end of the boot sequence, when I manually typed 'modprobe rbd'; from
experience, it could have been almost any other command. Since I
don't know of any long-lasting state in the Intel cards, especially
state that would survive the IPMI reset button, I think the
first-sight complaint about igb may not be quite right. If these
cards can save some runtime state to EEPROM and restore it later,
then I'm wrong.

>
> [172696.503900]  [<ffffffff8100d025>] ? paravirt_write_msr+0xb/0xe
> [172696.503942]  [<ffffffff810325f3>] ? leave_mm+0x3e/0x3e
>
> and *then* you get
>
> [172695.041709] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
> [172695.041745] megasas: [ 0]waiting for 35 commands to complete
> [172696.045602] megaraid_sas: no pending cmds after reset
> [172696.045644] megasas: reset successful
>
> which just adds more awesomeness to the soup -- though I do wonder if
> this could be caused by the soft hang from earlier.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Collection of strange lockups on 0.51
  2012-09-12 21:43   ` Andrey Korolyov
@ 2012-09-30 21:55     ` Andrey Korolyov
  2012-10-01 16:42       ` Tommi Virtanen
  0 siblings, 1 reply; 6+ messages in thread
From: Andrey Korolyov @ 2012-09-30 21:55 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On Thu, Sep 13, 2012 at 1:43 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
> On Thu, Sep 13, 2012 at 1:09 AM, Tommi Virtanen <tv@inktank.com> wrote:
>> On Wed, Sep 12, 2012 at 10:33 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>> Hi,
>>> This is probably off-topic, but I'm asking here because only Ceph
>>> triggers such a bug :).
>>>
>>> With 0.51, the following happens: if I kill an OSD, one or more
>>> neighboring nodes may hang with CPU lockups. These are not
>>> related to temperature, overall interrupt count, or load average,
>>> and they happen randomly across the 16-node cluster. I am almost
>>> sure that Ceph is triggering some hardware bug, but I am not sure
>>> of its origin. Also, shortly after resetting from such a crash, a
>>> new lockup may be triggered by almost any action.
>>
>> From the log, it looks like your ethernet driver is crapping out.
>>
>> [172517.057886] NETDEV WATCHDOG: eth0 (igb): transmit queue 7 timed out
>> ...
>> [172517.058622]  [<ffffffff812b2975>] ? netif_tx_lock+0x40/0x76
>>
>> etc.
>>
>> The later oopses mention paravirt_write_msr etc., which makes me
>> think you're using Xen? You probably don't want to run Ceph
>> servers inside virtualization (for production).
>
> Nope. Xen was my choice for almost five years, but I have now
> replaced it with KVM everywhere due to the buggy 4.1 '-stable'; 4.0
> has the same poor network performance as 3.x but can genuinely be
> called stable. All of those backtraces come from bare hardware.
>
> At the end you can see a nice backtrace that appeared soon after
> the end of the boot sequence, when I manually typed 'modprobe rbd';
> from experience, it could have been almost any other command. Since
> I don't know of any long-lasting state in the Intel cards,
> especially state that would survive the IPMI reset button, I think
> the first-sight complaint about igb may not be quite right. If
> these cards can save some runtime state to EEPROM and restore it
> later, then I'm wrong.

A short post mortem: the EX3200 (Junos 12.1R2.9) may begin to drop
packets (this seems more likely under 0.51 traffic patterns, which is
very strange for L2 switching) when a number of 802.3ad pairs,
sixteen in my case, are exposed to extremely high load: a database
benchmark over 700+ rbd-backed VMs and a cluster rebalance at the
same time. This explains the post-reboot lockups in the igb driver
and all the types of lockups above. I would very much appreciate
suggestions, both off-list and in this thread, for switch models that
do not expose such behavior under the same conditions.
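When chasing switch-side drops like these, it also helps to watch the
per-interface error and drop counters on the hosts themselves; a
minimal sketch that reads the standard Linux sysfs statistics
(interface names will differ per machine):

```shell
#!/bin/sh
# Print rx/tx drop and error counters for every network interface,
# using the per-interface statistics exported under sysfs.
for iface in /sys/class/net/*; do
    name=$(basename "$iface")
    rx_dropped=$(cat "$iface/statistics/rx_dropped")
    tx_dropped=$(cat "$iface/statistics/tx_dropped")
    rx_errors=$(cat "$iface/statistics/rx_errors")
    tx_errors=$(cat "$iface/statistics/tx_errors")
    printf '%s rx_dropped=%s tx_dropped=%s rx_errors=%s tx_errors=%s\n' \
        "$name" "$rx_dropped" "$tx_dropped" "$rx_errors" "$tx_errors"
done
```

Run from cron and diffed over time, this shows whether drops grow on
the bonded ports during rebalance load.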

>
>>
>> [172696.503900]  [<ffffffff8100d025>] ? paravirt_write_msr+0xb/0xe
>> [172696.503942]  [<ffffffff810325f3>] ? leave_mm+0x3e/0x3e
>>
>> and *then* you get
>>
>> [172695.041709] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
>> [172695.041745] megasas: [ 0]waiting for 35 commands to complete
>> [172696.045602] megaraid_sas: no pending cmds after reset
>> [172696.045644] megasas: reset successful
>>
>> which just adds more awesomeness to the soup -- though I do wonder if
>> this could be caused by the soft hang from earlier.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Collection of strange lockups on 0.51
  2012-09-30 21:55     ` Andrey Korolyov
@ 2012-10-01 16:42       ` Tommi Virtanen
  2012-10-03 21:33         ` Andrey Korolyov
  0 siblings, 1 reply; 6+ messages in thread
From: Tommi Virtanen @ 2012-10-01 16:42 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: ceph-devel

On Sun, Sep 30, 2012 at 2:55 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
> A short post mortem: the EX3200 (Junos 12.1R2.9) may begin to drop
> packets (this seems more likely under 0.51 traffic patterns, which
> is very strange for L2 switching) when a number of 802.3ad pairs,
> sixteen in my case, are exposed to extremely high load: a database
> benchmark over 700+ rbd-backed VMs and a cluster rebalance at the
> same time. This explains the post-reboot lockups in the igb driver
> and all the types of lockups above. I would very much appreciate
> suggestions, both off-list and in this thread, for switch models
> that do not expose such behavior under the same conditions.

I don't see how a switch dropping packets would give an ethernet card
driver any excuse to crash, but I'm simultaneously happy to hear that
it doesn't seem like Ceph is at fault, and sorry for your troubles.

I don't have an up-to-date 1GbE card recommendation to share, but I
would recommend making sure you're using a recent Linux kernel.
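Checking which kernel and igb driver build a node is actually running
is straightforward; a small sketch (the igb module name comes from
the traces above, and the exact modinfo fields may vary by distro):

```shell
# Show the running kernel version.
uname -r
# Show the version string of the installed igb driver module, if any.
modinfo -F version igb 2>/dev/null || echo "igb module not found"
```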

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Collection of strange lockups on 0.51
  2012-10-01 16:42       ` Tommi Virtanen
@ 2012-10-03 21:33         ` Andrey Korolyov
  0 siblings, 0 replies; 6+ messages in thread
From: Andrey Korolyov @ 2012-10-03 21:33 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On Mon, Oct 1, 2012 at 8:42 PM, Tommi Virtanen <tv@inktank.com> wrote:
> On Sun, Sep 30, 2012 at 2:55 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>> A short post mortem: the EX3200 (Junos 12.1R2.9) may begin to
>> drop packets (this seems more likely under 0.51 traffic patterns,
>> which is very strange for L2 switching) when a number of 802.3ad
>> pairs, sixteen in my case, are exposed to extremely high load: a
>> database benchmark over 700+ rbd-backed VMs and a cluster
>> rebalance at the same time. This explains the post-reboot lockups
>> in the igb driver and all the types of lockups above. I would very
>> much appreciate suggestions, both off-list and in this thread, for
>> switch models that do not expose such behavior under the same
>> conditions.
>
> I don't see how a switch dropping packets would give an ethernet card
> driver any excuse to crash, but I'm simultaneously happy to hear that
> it doesn't seem like Ceph is at fault, and sorry for your troubles.
>
> I don't have an up-to-date 1GbE card recommendation to share, but I
> would recommend making sure you're using a recent Linux kernel.

I formulated the reason incorrectly: of course drops cannot cause a
lockup by themselves, but the switch may somehow create a
long-lasting `corrupt' state on the trunk ports that leads to such
lockups in the ethernet card. Of course I'll experiment with driver
versions and card/port settings, thanks for the suggestion :)

I'm still investigating the issue, since it is quite hard to
reproduce at the right moment, and I hope to capture this state using
tcpdump-like, i.e. software, methods; although if the card driver
locks up on something, that may prevent the problematic byte sequence
from being seen at the packet sniffer level at all.
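For a long-running capture of an intermittent event like this,
tcpdump's rotating ring buffer keeps disk usage bounded while
preserving the window around a lockup; a minimal sketch, assuming the
interface name eth0 from the traces (needs root):

```shell
# Capture full packets (-s 0) into a ring of 10 files of 100 MB each
# (-W 10 -C 100), overwriting the oldest, so the traffic immediately
# before a lockup survives on disk.
tcpdump -i eth0 -s 0 -W 10 -C 100 -w /var/log/lockup-capture.pcap
```

After a lockup, the newest files in the ring hold the traffic that
was on the wire when the card wedged.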

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2012-10-03 21:33 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-12 17:33 Collection of strange lockups on 0.51 Andrey Korolyov
2012-09-12 21:09 ` Tommi Virtanen
2012-09-12 21:43   ` Andrey Korolyov
2012-09-30 21:55     ` Andrey Korolyov
2012-10-01 16:42       ` Tommi Virtanen
2012-10-03 21:33         ` Andrey Korolyov
