* IPoIB oops
@ 2012-07-24 15:14 Yishai Hadas
       [not found] ` <500EBBF0.3020407-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 2+ messages in thread
From: Yishai Hadas @ 2012-07-24 15:14 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Roland,

Just encountered a kernel oops in IPoIB on upstream kernel 3.5.

GIT:  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
Branch: master



The scenario is reproducible by unloading/loading the ipoib module in a loop.
The oops happens in ipoib_mcast_join_task.

From initial analysis it seems the problem is in the line below, since
priv->broadcast is NULL (confirmed via printk):
priv->mcast_mtu =
IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));

It seems that a work queue task is still active while the module is going down.
Any idea about a potential problem here? Could it relate to your
assumption in commit a77a57a1a22afc31891d95879fe3cf2ab03838b0 that flushing
the workqueue is not mandatory in some cases?


Details to reproduce and dump are below
Thanks,
Yishai


Reproduction:

echo "alias ib0 ib_ipoib" > /etc/modprobe.d/ib_ipoib.conf
Run below script:

#!/bin/sh
# Loop forever: unload the module, then bring ib0 up again
# (the ib0 alias configured above makes ifconfig reload ib_ipoib).
while :
do
    modprobe -r ib_ipoib
    ifconfig ib0 1.1.1.104 netmask 255.255.255.0 up
done

Dump:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000027
IP: [<ffffffffa0750147>] ipoib_mcast_join_task+0x217/0x350 [ib_ipoib]
PGD 0
Oops: 0000 [#1] SMP
Modules linked in: ib_ipoib(-) netconsole configfs rdma_ucm ib_ucm 
rdma_cm iw_cm ib_addr ib_cm ib_sa ib_uverbs ib_umad mlx4_ib ib_mad 
ib_core mlx4_en mlx4_core ip6table_filter ip6_tables ebtable_nat 
ebtables ipt_REJECT xt_CHECKSUM nfsd exportfs autofs4 nfs lockd fscache 
auth_rpcgss nfs_acl sunrpc bridge stp llc ipv6 dm_mirror dm_region_hash 
dm_log dm_mod vhost_net macvtap macvlan tun iTCO_wdt iTCO_vendor_support 
dcdbas coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel 
aesni_intel cryptd aes_x86_64 aes_generic microcode ses enclosure sg 
serio_raw pcspkr lpc_ich mfd_core i7core_edac edac_core bnx2 ext3 jbd 
mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix 
megaraid_sas [last unloaded: ib_ipoib]
CPU 4
Pid: 8788, comm: kworker/u:1 Not tainted 3.5.0+ #1 Dell Inc. PowerEdge 
R710/0MD99X
RIP: 0010:[<ffffffffa0750147>]  [<ffffffffa0750147>] 
ipoib_mcast_join_task+0x217/0x350 [ib_ipoib]
RSP: 0018:ffff88085412fda0  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88045c59e900 RCX: ffff88045c59e878
RDX: ffff88045c59e878 RSI: ffff88045c59e810 RDI: ffff88045c59e7c0
RBP: ffff88085412fdf0 R08: ffff88085412fb98 R09: 0140000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88045c59e7c0
R13: ffff88045c59e000 R14: ffff88045c59eac0 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88087fc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000027 CR3: 0000000001a0b000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:1 (pid: 8788, threadinfo ffff88085412e000, task 
ffff88086995e140)
Stack:
  0000000500000004 0000008000000004 400000000251486a 0000000000000000
  0400000200020080 0000051002001200 ffff880868e59940 ffffffff81d538c0
  ffff88045b33bc00 ffffffffa074ff30 ffff88085412fe50 ffffffff8106e872
Call Trace:
  [<ffffffffa074ff30>] ? ipoib_mcast_join+0x200/0x200 [ib_ipoib]
  [<ffffffff8106e872>] process_one_work+0x132/0x450
  [<ffffffff8107067b>] worker_thread+0x17b/0x3c0
  [<ffffffff81070500>] ? manage_workers+0x120/0x120
  [<ffffffff810757be>] kthread+0x9e/0xb0
  [<ffffffff81522664>] kernel_thread_helper+0x4/0x10
  [<ffffffff81075720>] ? kthread_freezable_should_stop+0x70/0x70
  [<ffffffff81522660>] ? gs_change+0x13/0x13
Code: 66 83 83 c0 fe ff ff 01 fb 66 66 90 66 66 90 48 8b b3 70 ff ff ff 
e9 d3 fe ff ff 66 0f 1f 84 00 00 00 00 00 48 8b 83 70 ff ff ff <0f> b6 
50 27 b8 fb ff ff ff 83 ea 01 83 fa 04 77 0c 89 d2 8b 04
RIP  [<ffffffffa0750147>] ipoib_mcast_join_task+0x217/0x350 [ib_ipoib]
  RSP <ffff88085412fda0>
CR2: 0000000000000027
---[ end trace a8af87e7ad29e6a9 ]---



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re:  ipoib race in multicast flow (was: IPoIB oops)
       [not found] ` <500EBBF0.3020407-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2012-07-25 12:19   ` Or Gerlitz
  0 siblings, 0 replies; 2+ messages in thread
From: Or Gerlitz @ 2012-07-25 12:19 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 24/07/2012 18:14, Yishai Hadas wrote:
>  Just encountered a kernel oops in IPoIB on upstream kernel 3.5 [...] 
> oops happened in ipoib_mcast_join_task.

Roland,

I reviewed the issue Yishai raised and took a look at a few related
commits in that area. As you wrote in a77a57a1a ("IPoIB: Fix
deadlock on RTNL in ipoib_stop()"), commit c8c2afe3 ("IPoIB: Use rtnl
lock/unlock when changing device flags") added a call to rtnl_lock() in
ipoib_mcast_join_task(), which runs from the ipoib_workqueue, and
hence we can't flush the workqueue from the context in which ipoib_stop()
is called.

HOWEVER, that very same ipoib_stop() context, which doesn't flush the
workqueue, calls ipoib_mcast_dev_flush(), which goes and deletes all the
multicast entries. This flow now runs without any synchronization with
possibly running instances of ipoib_mcast_join_task() that relate to
the SAME ipoib device. Yishai's test stepped on the broadcast pointer
being NULL, but this race can hold for any group this device has joined.

What would you suggest here? Changing the ipoib_stop() flow to flush the
workqueue while doing rtnl_trylock() instead of rtnl_lock() in
ipoib_mcast_join_task() doesn't seem applicable, since when the trylock
fails we can't tell who holds the lock: an arbitrary context that wants
to apply changes to the device, or the ipoib_stop() one.

I see that this code is executed unconditionally whenever the mcast join task runs:

>    priv->mcast_mtu = 
> IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
>
>         if (!ipoib_cm_admin_enabled(dev)) {
>                 rtnl_lock();
>                 dev_set_mtu(dev, min(priv->mcast_mtu, priv->admin_mtu));
>                 rtnl_unlock();
>         }

Maybe, if we were smarter and ran it only after actually joining the
broadcast group, rather than on every run, that could help solve the
race? And/or we could move the code that does the dev_set_mtu() call to
be executed in another context?

Or.

