All of lore.kernel.org
 help / color / mirror / Atom feed
* A soft lockup in vxlan module
@ 2013-08-07  1:23 Cong Wang
  2013-08-07  2:18 ` Stephen Hemminger
  2013-08-07  4:13 ` Stephen Hemminger
  0 siblings, 2 replies; 5+ messages in thread
From: Cong Wang @ 2013-08-07  1:23 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

Hi, Stephen

You introduced a soft lockup in vxlan module in

commit fe5c3561e6f0ac7c9546209f01351113c1b77ec8
Author: stephen hemminger <stephen@networkplumber.org>
Date:   Sat Jul 13 10:18:18 2013 -0700

    vxlan: add necessary locking on device removal

The problem is that vxlan_dellink(), which is called with RTNL lock
held, tries to flush the workqueue synchronously, but apparently
igmp_join and igmp_leave work need to hold RTNL lock too, therefore we
have a soft lockup! This is 100% reproducible on my 2.6.32 backport
while running `modprobe -r vxlan`.

A quick but perhaps ugly fix is just releasing RTNL lock before calling
flush_workqueue():

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 8bf31d9..581d3d5 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1837,7 +1837,9 @@ static void vxlan_dellink(struct net_device *dev,
struct list_head *head)
        struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id);
        struct vxlan_dev *vxlan = netdev_priv(dev);
 
+       rtnl_unlock();
        flush_workqueue(vxlan_wq);
+       rtnl_lock();
 
        spin_lock(&vn->sock_lock);
        hlist_del_rcu(&vxlan->hlist);

However, I think a better way is still what I did, that is, removing
RTNL lock from ip_mc_join_group() and ip_mc_leave_group().

What do you think? Any other idea to fix it?

Thanks.

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: A soft lockup in vxlan module
  2013-08-07  1:23 A soft lockup in vxlan module Cong Wang
@ 2013-08-07  2:18 ` Stephen Hemminger
  2013-08-07  2:34   ` Cong Wang
  2013-08-07  4:13 ` Stephen Hemminger
  1 sibling, 1 reply; 5+ messages in thread
From: Stephen Hemminger @ 2013-08-07  2:18 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev

Calling unlock in dellink is not safe.
can you reproduce with 3.10 or 3.11-rc?

On Tue, Aug 6, 2013 at 6:23 PM, Cong Wang <amwang@redhat.com> wrote:
> Hi, Stephen
>
> You introduced a soft lockup in vxlan module in
>
> commit fe5c3561e6f0ac7c9546209f01351113c1b77ec8
> Author: stephen hemminger <stephen@networkplumber.org>
> Date:   Sat Jul 13 10:18:18 2013 -0700
>
>     vxlan: add necessary locking on device removal
>
> The problem is that vxlan_dellink(), which is called with RTNL lock
> held, tries to flush the workqueue synchronously, but apparently
> igmp_join and igmp_leave work need to hold RTNL lock too, therefore we
> have a soft lockup! This is 100% reproducible on my 2.6.32 backport
> while running `modprobe -r vxlan`.
>
> A quick but perhaps ugly fix is just releasing RTNL lock before calling
> flush_workqueue():
>
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 8bf31d9..581d3d5 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -1837,7 +1837,9 @@ static void vxlan_dellink(struct net_device *dev,
> struct list_head *head)
>         struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id);
>         struct vxlan_dev *vxlan = netdev_priv(dev);
>
> +       rtnl_unlock();
>         flush_workqueue(vxlan_wq);
> +       rtnl_lock();
>
>         spin_lock(&vn->sock_lock);
>         hlist_del_rcu(&vxlan->hlist);
>
> However, I think a better way is still what I did, that is, removing
> RTNL lock from ip_mc_join_group() and ip_mc_leave_group().
>
> What do you think? Any other idea to fix it?
>
> Thanks.
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: A soft lockup in vxlan module
  2013-08-07  2:18 ` Stephen Hemminger
@ 2013-08-07  2:34   ` Cong Wang
  0 siblings, 0 replies; 5+ messages in thread
From: Cong Wang @ 2013-08-07  2:34 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

On Tue, 2013-08-06 at 19:18 -0700, Stephen Hemminger wrote:
> Calling unlock in dellink is not safe.

This is my feeling as well.

> can you reproduce with 3.10 or 3.11-rc?
> 

For net-next, I always compile modules as builtin, therefore can't test
it. But clearly the code is same.

The steps to reproduce it are:

1) ping over vxlan0
2) modprobe -r vxlan

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: A soft lockup in vxlan module
  2013-08-07  1:23 A soft lockup in vxlan module Cong Wang
  2013-08-07  2:18 ` Stephen Hemminger
@ 2013-08-07  4:13 ` Stephen Hemminger
  2013-08-07  6:47   ` Cong Wang
  1 sibling, 1 reply; 5+ messages in thread
From: Stephen Hemminger @ 2013-08-07  4:13 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev

On Wed, 07 Aug 2013 09:23:54 +0800
Cong Wang <amwang@redhat.com> wrote:

> Hi, Stephen
> 
> You introduced a soft lockup in vxlan module in
> 
> commit fe5c3561e6f0ac7c9546209f01351113c1b77ec8
> Author: stephen hemminger <stephen@networkplumber.org>
> Date:   Sat Jul 13 10:18:18 2013 -0700
> 
>     vxlan: add necessary locking on device removal
> 
> The problem is that vxlan_dellink(), which is called with RTNL lock
> held, tries to flush the workqueue synchronously, but apparently
> igmp_join and igmp_leave work need to hold RTNL lock too, therefore we
> have a soft lockup! This is 100% reproducible on my 2.6.32 backport
> while running `modprobe -r vxlan`.
> 
> A quick but perhaps ugly fix is just releasing RTNL lock before calling
> flush_workqueue():
> 
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 8bf31d9..581d3d5 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -1837,7 +1837,9 @@ static void vxlan_dellink(struct net_device *dev,
> struct list_head *head)
>         struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id);
>         struct vxlan_dev *vxlan = netdev_priv(dev);
>  
> +       rtnl_unlock();
>         flush_workqueue(vxlan_wq);
> +       rtnl_lock();
>  
>         spin_lock(&vn->sock_lock);
>         hlist_del_rcu(&vxlan->hlist);
> 
> However, I think a better way is still what I did, that is, removing
> RTNL lock from ip_mc_join_group() and ip_mc_leave_group().
> 
> What do you think? Any other idea to fix it?
> 
> Thanks.
> 

Probably the flush_workqueue can just be removed and let the normal
refcounting work. The workqueue has a reference to device and socket,
therefore the cleanups should work correctly.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: A soft lockup in vxlan module
  2013-08-07  4:13 ` Stephen Hemminger
@ 2013-08-07  6:47   ` Cong Wang
  0 siblings, 0 replies; 5+ messages in thread
From: Cong Wang @ 2013-08-07  6:47 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

On Tue, 2013-08-06 at 21:13 -0700, Stephen Hemminger wrote:
> 
> Probably the flush_workqueue can just be removed and let the normal
> refcounting work. The workqueue has a reference to device and socket,
> therefore the cleanups should work correctly.

Sounds reasonable, I am testing the following patch:

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 8bf31d9..c51ef9b 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1837,8 +1837,6 @@ static void vxlan_dellink(struct net_device *dev,
struct list_head *head)
        struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id);
        struct vxlan_dev *vxlan = netdev_priv(dev);
 
-       flush_workqueue(vxlan_wq);
-
        spin_lock(&vn->sock_lock);
        hlist_del_rcu(&vxlan->hlist);
        spin_unlock(&vn->sock_lock);

^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2013-08-07  6:47 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-07  1:23 A soft lockup in vxlan module Cong Wang
2013-08-07  2:18 ` Stephen Hemminger
2013-08-07  2:34   ` Cong Wang
2013-08-07  4:13 ` Stephen Hemminger
2013-08-07  6:47   ` Cong Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.