* [Fwd: Crash in bonding]
@ 2009-11-02 22:43 Pradeep Satyanarayana
       [not found] ` <4AEF60AC.6030508-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Pradeep Satyanarayana @ 2009-11-02 22:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 43 bytes --]

Typo in email address. Resending.

Pradeep

[-- Attachment #2: Crash in bonding.eml --]
[-- Type: message/rfc822, Size: 5269 bytes --]

From: Pradeep Satyanarayana <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
To: monis-smomgflXvOZWk0Htik3J/w@public.gmane.org
Cc: EWG <Openfabrics-ewg-0P3JtQMG0aQdnm+yROfE0A@public.gmane.org>, linux-rdma-u79uwXL29TZpP82i2CBTzA@public.gmane.org, fubar-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org
Subject: Crash in bonding
Date: Mon, 02 Nov 2009 14:41:36 -0800
Message-ID: <4AEF6020.7000806-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

This crash was originally reported against RHEL 5.4. However, one can recreate this crash quite easily in OFED-1.5 too.
The steps to recreate the crash are as follows:

1. Run traffic (I used ping) on the IB interfaces through the bond master
2. ifdown ib0
3. ifdown ib1
4. modprobe -r ib_ipoib

Quite often, the crash stack trace seen is as follows:

PID: 0      TASK: ffff81087fc11820  CPU: 13  COMMAND: "swapper"
 #0 [ffff81010ff07ab0] crash_kexec at ffffffff800ac5b9
 #1 [ffff81010ff07b70] __die at ffffffff80065127
 #2 [ffff81010ff07bb0] do_page_fault at ffffffff80066da7
 #3 [ffff81010ff07ca0] error_exit at ffffffff8005dde9
 #4 [ffff81010ff07d58] neigh_connected_output at ffffffff8022cb87
 #5 [ffff81010ff07d88] ip_output at ffffffff800320ac
 #6 [ffff81010ff07db8] ip_queue_xmit at ffffffff8003464d
 #7 [ffff81010ff07e78] tcp_transmit_skb at ffffffff80021d73
 #8 [ffff81010ff07ec8] tcp_retransmit_skb at ffffffff80250ccd
 #9 [ffff81010ff07f08] tcp_write_timer at ffffffff80252652
#10 [ffff81010ff07f28] run_timer_softirq at ffffffff800968be
#11 [ffff81010ff07f58] __do_softirq at ffffffff8001235a
#12 [ffff81010ff07f88] call_softirq at ffffffff8005e2fc
#13 [ffff81010ff07fa0] do_softirq at ffffffff8006cb14
#14 [ffff81010ff07fb0] apic_timer_interrupt at ffffffff8005dc8e
--- <IRQ stack> ---
#15 [ffff81010ff03e48] apic_timer_interrupt at ffffffff8005dc8e
    [exception RIP: mwait_idle+54]
    RIP: ffffffff800571f4  RSP: ffff81010ff03ef0  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 000000000000000d  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000001  RDI: ffffffff80301698
    RBP: ffff81087fc11a10   R8: ffff81010ff02000   R9: 0000000000000032
    R10: ffff81048e0cc4f0  R11: ffff8103ebafcd18  R12: 0000000005f33f4d
    R13: 00000d12e63d7223  R14: ffff81047fe797a0  R15: ffff81087fc11820
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#16 [ffff81010ff03ef0] cpu_idle at ffffffff8004939e



I was able to set up some breakpoints, and the analysis follows.

cpu 0x1 stopped at breakpoint 0x1 (d000000000ec4214 .bond_release+0x0/0x4d0 [bonding])
        mflr    r0
enter ? for help
1:mon> t
[link register   ] d000000000ecdf80 .bonding_store_slaves+0x304/0x3f0 [bonding]
[c00000000fd97b00] d000000000ecdf70 .bonding_store_slaves+0x2f4/0x3f0 [bonding] (unreliable)
[c00000000fd97bd0] c00000000029a660 .class_device_attr_store+0x44/0x60
[c00000000fd97c40] c00000000015df9c .sysfs_write_file+0x134/0x1b8
[c00000000fd97cf0] c0000000000f8ec4 .vfs_write+0x118/0x200
[c00000000fd97d90] c0000000000f9634 .sys_write+0x4c/0x8c
[c00000000fd97e30] c0000000000086a4 syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 000000000ff11138
SP (ffd1f300) is in userspace

I did some basic sanity checks and confirmed that we hit a couple of breakpoints,
that the bond master was indeed bond0 as expected, and that the slave device
being released was ib1. After the breakpoints, we crashed:


Faulting instruction address: 0xc00000000034bddc
cpu 0x1: Vector: 300 (Data Access) at [c0000000e025b2b0]
    pc: c00000000034bddc: .neigh_resolve_output+0x28c/0x34c
    lr: c00000000034bdc0: .neigh_resolve_output+0x270/0x34c
    sp: c0000000e025b530
   msr: 8000000000009032
   dar: d000000000c6fe58
 dsisr: 40000000
  current = 0xc0000000e25f1aa0
  paca    = 0xc00000000053e280
    pid   = 3591, comm = ping
enter ? for help
1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c0000000e025b2b0]
    pc: c00000000034bddc: .neigh_resolve_output+0x28c/0x34c
    lr: c00000000034bdc0: .neigh_resolve_output+0x270/0x34c
    sp: c0000000e025b530
   msr: 8000000000009032
   dar: d000000000c6fe58
 dsisr: 40000000
  current = 0xc0000000e25f1aa0
  paca    = 0xc00000000053e280
    pid   = 3591, comm = ping
1:mon> t
[c0000000e025b5e0] c000000000376934 .ip_output+0x358/0x3c0
[c0000000e025b670] c000000000374a04 .ip_push_pending_frames+0x440/0x558
[c0000000e025b720] c000000000397f10 .raw_sendmsg+0x770/0x860
[c0000000e025b860] c0000000003a24f8 .inet_sendmsg+0x7c/0xa8
[c0000000e025b900] c00000000033031c .sock_sendmsg+0x114/0x1b8
[c0000000e025bb00] c000000000331878 .sys_sendmsg+0x218/0x2ac
[c0000000e025bd20] c000000000356314 .compat_sys_sendmsg+0x14/0x28
[c0000000e025bd90] c000000000357914 .compat_sys_socketcall+0x1e4/0x214
[c0000000e025be30] c0000000000086a4 syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 0000000007f03c98
SP (ffb6e570) is in userspace
1:mon>

I looked at the skb and confirmed that this was indeed against bond0.

One thing is apparent at this point: ping is still transmitting even though
bond_release() for ib1 (and of course ib0) happened long ago!
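
To make the failure mode concrete, here is a minimal sketch of the pattern
(the names are made up for illustration; this is not the actual RHEL 5.4
code): a cached neighbour entry keeps a pointer installed while ib1 was
still an active slave, nothing invalidates that entry when the slave is
released and ib_ipoib is unloaded, and the next transmit goes through the
stale pointer, which is the same shape as the Data Access fault seen above
in neigh_resolve_output().

#include <stdio.h>
#include <stdlib.h>

struct fake_neigh {
    const char *dev_name;                /* slave that created the entry      */
    int (*output)(const char *payload);  /* hook supplied by the slave driver */
};

static struct fake_neigh *neigh_cache;   /* entry created while the ping ran  */

static int ipoib_like_output(const char *payload)
{
    /* Stands in for code that, in the real case, lives in ib_ipoib. */
    return printf("ib1 xmit: %s\n", payload) < 0 ? -1 : 0;
}

static void enslave_ib1(void)
{
    neigh_cache = malloc(sizeof(*neigh_cache));
    if (!neigh_cache)
        exit(1);
    neigh_cache->dev_name = "ib1";
    neigh_cache->output = ipoib_like_output;
}

static void release_ib1_and_unload(void)
{
    /* Mirrors bond_release(ib1) plus "modprobe -r ib_ipoib": the driver goes
     * away, but the cached neighbour entry is never flushed.  In the kernel,
     * the stale pointer now targets unloaded module text/data. */
}

int main(void)
{
    enslave_ib1();
    release_ib1_and_unload();
    /* The still-running ping path dereferences the stale cache entry. */
    return neigh_cache->output("ICMP echo request");
}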

This is the reason for the crash. Any suggestions as to how to fix this?

Pradeep





* Re:  Crash in bonding
       [not found] ` <4AEF60AC.6030508-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2009-11-03  7:14   ` Or Gerlitz
       [not found]     ` <4AEFD866.1040106-smomgflXvOZWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Or Gerlitz @ 2009-11-03  7:14 UTC (permalink / raw)
  To: Pradeep Satyanarayana; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Moni Shoua

Pradeep Satyanarayana wrote:
> This crash was originally reported against Rhel5.4. However, one can recreate this crash quite easily in OFED-1.5 too. 
I understand that you get the crash when working with the RHEL 5.4
bonding driver, correct? Does it happen only with IPoIB devices acting
as the bonding slaves, or also with Ethernet devices? Please note that
with RHEL 5.4 there is no need to use the OFED-provided bonding module;
moreover, I believe that the distro-provided one is more stable and up
to date in this case. Moving forward, OFED bonding support for newer
distributions is to be removed. Moni, any reason to support bonding on
EL 5.4 in OFED?

Or.

> The steps to recreate the crash are as follows:
> 1. Run traffic (I used ping) on the IB interfaces through the bond master
> 2. ifdown ib0
> 3. ifdown ib1
> 4. modprobe -r ib_ipoib


* Re: Crash in bonding
       [not found]     ` <4AEFD866.1040106-smomgflXvOZWk0Htik3J/w@public.gmane.org>
@ 2009-11-03 16:38       ` Pradeep Satyanarayana
       [not found]         ` <4AF05C7E.3060804-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Pradeep Satyanarayana @ 2009-11-03 16:38 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Moni Shoua

Or Gerlitz wrote:
> Pradeep Satyanarayana wrote:
>> This crash was originally reported against Rhel5.4. However, one can
>> recreate this crash quite easily in OFED-1.5 too. 
> I understand that you get the crash when working with the RHEL5.4
> bonding driver, correct? 
We used the one that comes native with the RHEL 5.4 distro. I also installed
OFED-1.5 on RHEL 5.4 and still see the same problem.

> Does it happen only with IPoIB devices acting
> as the bonding slaves, or also with Ethernet devices?

I am not exactly sure what you are hinting at. The simplest case is to use IPoIB
devices as slaves; with the steps listed earlier, the crash occurs. The crash is
specific to IPoIB and does not happen with Ethernet slaves.
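
My reading of why Ethernet slaves survive the same sequence (a sketch with
made-up names, not checked against the RHEL 5.4 source): the hooks a cached
neighbour entry holds for an Ethernet slave live in the core kernel, while
the ones it holds for an IPoIB slave live in ib_ipoib, which is exactly what
step 4 (modprobe -r ib_ipoib) unloads.

#include <stdio.h>

/* Illustration only; these are not the real kernel structures.  What matters
 * is where the cached hook points once ib_ipoib has been unloaded. */

typedef int (*output_hook_t)(void);

static int core_eth_output(void)     { return 0; }  /* core kernel: always resident  */
static int module_ipoib_output(void) { return 0; }  /* ib_ipoib.ko: gone after rmmod */

struct cached_entry {
    const char    *slave;
    output_hook_t  output;
    int            hook_still_resident;  /* pretend flag for the illustration */
};

int main(void)
{
    struct cached_entry entries[] = {
        { "eth1", core_eth_output,     1 },  /* stale entry still lands on valid text     */
        { "ib1",  module_ipoib_output, 0 },  /* stale entry now points at unloaded memory */
    };

    for (size_t i = 0; i < sizeof(entries) / sizeof(entries[0]); i++) {
        if (entries[i].hook_still_resident)
            entries[i].output();
        else
            printf("%s: the cached hook would fault here, as in neigh_resolve_output()\n",
                   entries[i].slave);
    }
    return 0;
}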

> Please note that with RHEL 5.4 there is no need to use the OFED-provided
> bonding module; moreover, I believe that the distro-provided one is more
> stable and up to date in this case.

We used what is natively available with RHEL 5.4.

> Moving forward, OFED bonding support for newer distributions is to be
> removed. Moni, any reason to support bonding on EL 5.4 in OFED?

Can you explain why you plan to remove this from the newer distros? This is 
indeed news to me.

Pradeep


* Re: Crash in bonding
       [not found]         ` <4AF05C7E.3060804-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2009-11-10 11:44           ` Or Gerlitz
  0 siblings, 0 replies; 5+ messages in thread
From: Or Gerlitz @ 2009-11-10 11:44 UTC (permalink / raw)
  To: Pradeep Satyanarayana; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Moni Shoua

Pradeep Satyanarayana wrote:

> The crash is specific to IPoIB, and does not happen with Ethernet slaves.

okay

> Can you explain why you plan to remove this from the newer distros? This is 
> indeed news to me

We plan to remove bonding from --OFED-- because the distro-provided bonding supports IPoIB; simple as that. What isn't clear here?

Or.

* Re: Crash in bonding
       [not found]       ` <39C75744D164D948A170E9792AF8E7CA01AB635F-QfUkFaTmzUSUvQqKE/ONIwC/G2K4zDHf@public.gmane.org>
@ 2009-11-03 17:40         ` Pradeep Satyanarayana
  0 siblings, 0 replies; 5+ messages in thread
From: Pradeep Satyanarayana @ 2009-11-03 17:40 UTC (permalink / raw)
  To: Shiri Franchi
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Moni Shoua,
	fubar-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, EWG

Shiri Franchi wrote:
> Hi,
> 
> I did it exactly as you described:
> 1. ifdown ib0
> 2. ifdown ib1
> 3. modprobe -r ib_ipoib
> 
> And it did not reproduce.

It is a race that you may not be hitting. I have tried this across
different HCAs and platforms (x86_64, ppc64), and it seems to recreate
almost at will for me.

Pradeep

