All of lore.kernel.org
 help / color / mirror / Atom feed
* dst_ifdown breaks infiniband?
@ 2007-03-18 15:55 ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 15:55 UTC (permalink / raw)
  To: Linux Kernel Mailing List, netdev, general, Roland Dreier,
	Alexey Kuznetsov

Alexey, Roland,
In debugging kernel lockup that occurs with IP over InfiniBand in 2.6.21-rc4:
( https://bugs.openfabrics.org/show_bug.cgi?id=402 )

I noticed the following code in dst_ifdown:

/* Dirty hack. We did it in 2.2 (in __dst_free),
 * we have _very_ good reasons not to repeat
 * this mistake in 2.3, but we have no choice
 * now. _It_ _is_ _explicit_ _deliberate_
 * _race_ _condition_.
 *
 * Commented and originally written by Alexey.
 */
static inline void dst_ifdown(struct dst_entry *dst, struct net_device *dev,
                              int unregister)
{
        if (dst->ops->ifdown)
                dst->ops->ifdown(dst, dev, unregister);

        if (dev != dst->dev)
                return;

        if (!unregister) {
                dst->input = dst_discard_in;
                dst->output = dst_discard_out;
        } else {
                dst->dev = &loopback_dev;
                dev_hold(&loopback_dev);
                dev_put(dev);
                if (dst->neighbour && dst->neighbour->dev == dev) {
                        dst->neighbour->dev = &loopback_dev;
                        dev_put(dev);
                        dev_hold(&loopback_dev);
                }
        }
}

The line dst->neighbour->dev = &loopback_dev breaks IP over InfiniBand,
simply because neighbour->parms still points to an entry that has
been set up with dev->neigh_setup call from IPoIB neighbour device.

So when neighbour->parms->neigh_destructor is called,
we get to ipoib_neigh_destructor in drivers/infiniband/ulp/ipoib/ipoib_main.c,
and that in turn crashes since it needs an infiniband device
in neighbour dev pointer.

This is not new code, and should have triggered long time ago,
so I am not sure how come we are triggering this only now,
but somehow this did not lead to crashes in 2.6.20, but does now in 2.6.21-rc4.

Ideas on how to fix this?

Why is neighbour->dev changed here?

Can dst->neighbour be changed to point to NULL instead, and the neighbour
released?

Thanks very much,
-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] dst_ifdown breaks infiniband?
@ 2007-03-18 15:55 ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 15:55 UTC (permalink / raw)
  To: Linux Kernel Mailing List, netdev, general, Roland Dreier,
	Alexey Kuznetsov

Alexey, Roland,
In debugging kernel lockup that occurs with IP over InfiniBand in 2.6.21-rc4:
( https://bugs.openfabrics.org/show_bug.cgi?id=402 )

I noticed the following code in dst_ifdown:

/* Dirty hack. We did it in 2.2 (in __dst_free),
 * we have _very_ good reasons not to repeat
 * this mistake in 2.3, but we have no choice
 * now. _It_ _is_ _explicit_ _deliberate_
 * _race_ _condition_.
 *
 * Commented and originally written by Alexey.
 */
static inline void dst_ifdown(struct dst_entry *dst, struct net_device *dev,
                              int unregister)
{
        if (dst->ops->ifdown)
                dst->ops->ifdown(dst, dev, unregister);

        if (dev != dst->dev)
                return;

        if (!unregister) {
                dst->input = dst_discard_in;
                dst->output = dst_discard_out;
        } else {
                dst->dev = &loopback_dev;
                dev_hold(&loopback_dev);
                dev_put(dev);
                if (dst->neighbour && dst->neighbour->dev == dev) {
                        dst->neighbour->dev = &loopback_dev;
                        dev_put(dev);
                        dev_hold(&loopback_dev);
                }
        }
}

The line dst->neighbour->dev = &loopback_dev breaks IP over InfiniBand,
simply because neighbour->parms still points to an entry that has
been set up with dev->neigh_setup call from IPoIB neighbour device.

So when neighbour->parms->neigh_destructor is called,
we get to ipoib_neigh_destructor in drivers/infiniband/ulp/ipoib/ipoib_main.c,
and that in turn crashes since it needs an infiniband device
in neighbour dev pointer.

This is not new code, and should have triggered long time ago,
so I am not sure how come we are triggering this only now,
but somehow this did not lead to crashes in 2.6.20, but does now in 2.6.21-rc4.

Ideas on how to fix this?

Why is neighbour->dev changed here?

Can dst->neighbour be changed to point to NULL instead, and the neighbour
released?

Thanks very much,
-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-18 15:55 ` [ofa-general] " Michael S. Tsirkin
@ 2007-03-18 19:12   ` Alexey Kuznetsov
  -1 siblings, 0 replies; 59+ messages in thread
From: Alexey Kuznetsov @ 2007-03-18 19:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Linux Kernel Mailing List, netdev, general, Roland Dreier

Hello!

> This is not new code, and should have triggered long time ago,
> so I am not sure how come we are triggering this only now,
> but somehow this did not lead to crashes in 2.6.20

I see. I guess this was plain luck.


> Why is neighbour->dev changed here?

It holds reference to device and prevents its destruction.
If dst is held somewhere, we cannot destroy the device and deadlock
while unregister.

We could not invalidate dst->neighbour but it looked safe to invalidate
neigh->dev after quiescent state. Obviosuly, it is not and it never was safe.
Was supposed to be repaired asap, but this did not happen. :-(


> Can dst->neighbour be changed to point to NULL instead, and the neighbour
> released?

It should be cleared and we should be sure it will not be destroyed
before quiescent state.

Seems, this is the only correct solution, but to do this we have
to audit all the places where dst->neighbour is dereferenced for
RCU safety.

Actually, it is very good you caught this eventually, the bug was
so _disgusting_ that it was "forgotten" all the time, waiting for
someone who will point out that the king is naked. :-)

Alexey

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-18 19:12   ` Alexey Kuznetsov
  0 siblings, 0 replies; 59+ messages in thread
From: Alexey Kuznetsov @ 2007-03-18 19:12 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, Linux Kernel Mailing List, general

Hello!

> This is not new code, and should have triggered long time ago,
> so I am not sure how come we are triggering this only now,
> but somehow this did not lead to crashes in 2.6.20

I see. I guess this was plain luck.


> Why is neighbour->dev changed here?

It holds reference to device and prevents its destruction.
If dst is held somewhere, we cannot destroy the device and deadlock
while unregister.

We could not invalidate dst->neighbour but it looked safe to invalidate
neigh->dev after quiescent state. Obviosuly, it is not and it never was safe.
Was supposed to be repaired asap, but this did not happen. :-(


> Can dst->neighbour be changed to point to NULL instead, and the neighbour
> released?

It should be cleared and we should be sure it will not be destroyed
before quiescent state.

Seems, this is the only correct solution, but to do this we have
to audit all the places where dst->neighbour is dereferenced for
RCU safety.

Actually, it is very good you caught this eventually, the bug was
so _disgusting_ that it was "forgotten" all the time, waiting for
someone who will point out that the king is naked. :-)

Alexey

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-18 19:12   ` [ofa-general] " Alexey Kuznetsov
@ 2007-03-18 19:46     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 19:46 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Michael S. Tsirkin, Linux Kernel Mailing List, netdev, general,
	Roland Dreier

> Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> Hello!
> 
> > This is not new code, and should have triggered long time ago,
> > so I am not sure how come we are triggering this only now,
> > but somehow this did not lead to crashes in 2.6.20
> 
> I see. I guess this was plain luck.

Hmm. Something I don't understand: does the code
in question not run on *each* device unregister?

Why do I only see this under stress?

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-18 19:46     ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 19:46 UTC (permalink / raw)
  To: Alexey Kuznetsov; +Cc: netdev, Linux Kernel Mailing List, general

> Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> Hello!
> 
> > This is not new code, and should have triggered long time ago,
> > so I am not sure how come we are triggering this only now,
> > but somehow this did not lead to crashes in 2.6.20
> 
> I see. I guess this was plain luck.

Hmm. Something I don't understand: does the code
in question not run on *each* device unregister?

Why do I only see this under stress?

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-18 19:12   ` [ofa-general] " Alexey Kuznetsov
@ 2007-03-18 19:53     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 19:53 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Michael S. Tsirkin, Linux Kernel Mailing List, netdev, general,
	Roland Dreier

Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
Subject: Re: dst_ifdown breaks infiniband?
> 
> Hello!
> 
> > This is not new code, and should have triggered long time ago,
> > so I am not sure how come we are triggering this only now,
> > but somehow this did not lead to crashes in 2.6.20
> 
> I see. I guess this was plain luck.
> 
> 
> > Why is neighbour->dev changed here?
> 
> It holds reference to device and prevents its destruction.
> If dst is held somewhere, we cannot destroy the device and deadlock
> while unregister.
> 
> We could not invalidate dst->neighbour but it looked safe to invalidate
> neigh->dev after quiescent state. Obviosuly, it is not and it never was safe.
> Was supposed to be repaired asap, but this did not happen. :-(
> 
> > Can dst->neighbour be changed to point to NULL instead, and the neighbour
> > released?
> 
> It should be cleared and we should be sure it will not be destroyed
> before quiescent state.

I'm confused. didn't you say dst_ifdown is called after quiescent state?

> Seems, this is the only correct solution, but to do this we have
> to audit all the places where dst->neighbour is dereferenced for
> RCU safety.
> 
> Actually, it is very good you caught this eventually, the bug was
> so _disgusting_ that it was "forgotten" all the time, waiting for
> someone who will point out that the king is naked. :-)
> 
> Alexey

This does not sound like something that's likely to be accepted in 2.6.21, right?

Any simpler ideas?

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-18 19:53     ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 19:53 UTC (permalink / raw)
  To: Alexey Kuznetsov; +Cc: netdev, Linux Kernel Mailing List, general

Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
Subject: Re: dst_ifdown breaks infiniband?
> 
> Hello!
> 
> > This is not new code, and should have triggered long time ago,
> > so I am not sure how come we are triggering this only now,
> > but somehow this did not lead to crashes in 2.6.20
> 
> I see. I guess this was plain luck.
> 
> 
> > Why is neighbour->dev changed here?
> 
> It holds reference to device and prevents its destruction.
> If dst is held somewhere, we cannot destroy the device and deadlock
> while unregister.
> 
> We could not invalidate dst->neighbour but it looked safe to invalidate
> neigh->dev after quiescent state. Obviosuly, it is not and it never was safe.
> Was supposed to be repaired asap, but this did not happen. :-(
> 
> > Can dst->neighbour be changed to point to NULL instead, and the neighbour
> > released?
> 
> It should be cleared and we should be sure it will not be destroyed
> before quiescent state.

I'm confused. didn't you say dst_ifdown is called after quiescent state?

> Seems, this is the only correct solution, but to do this we have
> to audit all the places where dst->neighbour is dereferenced for
> RCU safety.
> 
> Actually, it is very good you caught this eventually, the bug was
> so _disgusting_ that it was "forgotten" all the time, waiting for
> someone who will point out that the king is naked. :-)
> 
> Alexey

This does not sound like something that's likely to be accepted in 2.6.21, right?

Any simpler ideas?

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-18 19:46     ` [ofa-general] " Michael S. Tsirkin
@ 2007-03-18 19:55       ` Alexey Kuznetsov
  -1 siblings, 0 replies; 59+ messages in thread
From: Alexey Kuznetsov @ 2007-03-18 19:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Linux Kernel Mailing List, netdev, general, Roland Dreier

Hello!

> Hmm. Something I don't understand: does the code
> in question not run on *each* device unregister?

It does.


> Why do I only see this under stress?

You should have some referenced destination entries to trigger bad path.
This should happen not only under stress.

F.e. just try to ssh to something via this device. And unregister it.
Seems, the crash is inevitable. If you do not see crash, I will be puzzled.

Alexey

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-18 19:55       ` Alexey Kuznetsov
  0 siblings, 0 replies; 59+ messages in thread
From: Alexey Kuznetsov @ 2007-03-18 19:55 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, Linux Kernel Mailing List, general

Hello!

> Hmm. Something I don't understand: does the code
> in question not run on *each* device unregister?

It does.


> Why do I only see this under stress?

You should have some referenced destination entries to trigger bad path.
This should happen not only under stress.

F.e. just try to ssh to something via this device. And unregister it.
Seems, the crash is inevitable. If you do not see crash, I will be puzzled.

Alexey

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-18 19:53     ` [ofa-general] " Michael S. Tsirkin
  (?)
@ 2007-03-18 20:18     ` Alexey Kuznetsov
  2007-03-18 20:29         ` [ofa-general] " Michael S. Tsirkin
  2007-03-19  9:36         ` [ofa-general] " Michael S. Tsirkin
  -1 siblings, 2 replies; 59+ messages in thread
From: Alexey Kuznetsov @ 2007-03-18 20:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Linux Kernel Mailing List, netdev, general, Roland Dreier

Hello!

> > It should be cleared and we should be sure it will not be destroyed
> > before quiescent state.
> 
> I'm confused. didn't you say dst_ifdown is called after quiescent state?

Quiescent state should happen after dst->neighbour is invalidated.
And this implies that all the users of dst->neighbour check validity
after dereference and do not use it after quiescent state.


> This does not sound like something that's likely to be accepted in 2.6.21, right?
> 
> Any simpler ideas?

Well, if inifiniband destructor really needs to take that lock... no.
Right now I do not see.

Alexey

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-18 19:55       ` [ofa-general] " Alexey Kuznetsov
@ 2007-03-18 20:24         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 20:24 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Michael S. Tsirkin, Linux Kernel Mailing List, netdev, general,
	Roland Dreier

> Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> Hello!
> 
> > Hmm. Something I don't understand: does the code
> > in question not run on *each* device unregister?
> 
> It does.
> 
> 
> > Why do I only see this under stress?
> 
> You should have some referenced destination entries to trigger bad path.
> This should happen not only under stress.
> 
> F.e. just try to ssh to something via this device. And unregister it.
> Seems, the crash is inevitable. If you do not see crash, I will be puzzled.


I did this.
What happens is:

	neigh_setup is called
	dst_ifdown changes the neigh->dev to loopback device

But the funny thing is that this neighbour can thinkably hang
around indefinitely now, and if it does destructor won't be called
and there won't be a crash.

To trigger a crash, I did simply
	ifconfig lo down; ifconfig lo 127.0.0.1

and sure enough it crashes in drivers/infiniband/ulp/ipoib/ipoib_main.c.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-18 20:24         ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 20:24 UTC (permalink / raw)
  To: Alexey Kuznetsov; +Cc: netdev, Linux Kernel Mailing List, general

> Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> Hello!
> 
> > Hmm. Something I don't understand: does the code
> > in question not run on *each* device unregister?
> 
> It does.
> 
> 
> > Why do I only see this under stress?
> 
> You should have some referenced destination entries to trigger bad path.
> This should happen not only under stress.
> 
> F.e. just try to ssh to something via this device. And unregister it.
> Seems, the crash is inevitable. If you do not see crash, I will be puzzled.


I did this.
What happens is:

	neigh_setup is called
	dst_ifdown changes the neigh->dev to loopback device

But the funny thing is that this neighbour can thinkably hang
around indefinitely now, and if it does destructor won't be called
and there won't be a crash.

To trigger a crash, I did simply
	ifconfig lo down; ifconfig lo 127.0.0.1

and sure enough it crashes in drivers/infiniband/ulp/ipoib/ipoib_main.c.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-18 19:12   ` [ofa-general] " Alexey Kuznetsov
                     ` (2 preceding siblings ...)
  (?)
@ 2007-03-18 20:25   ` Michael S. Tsirkin
  2007-03-18 22:24       ` Eric W. Biederman
  -1 siblings, 1 reply; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 20:25 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Michael S. Tsirkin, Linux Kernel Mailing List, netdev, general,
	Roland Dreier

> > Why is neighbour->dev changed here?
> 
> It holds reference to device and prevents its destruction.
> If dst is held somewhere, we cannot destroy the device and deadlock
> while unregister.

BTW, can this ever happen for the loopback device itself?
Is it ever unregistered?

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-18 20:18     ` Alexey Kuznetsov
@ 2007-03-18 20:29         ` Michael S. Tsirkin
  2007-03-19  9:36         ` [ofa-general] " Michael S. Tsirkin
  1 sibling, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 20:29 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Michael S. Tsirkin, Linux Kernel Mailing List, netdev, general,
	Roland Dreier

> > > It should be cleared and we should be sure it will not be destroyed
> > > before quiescent state.
> > 
> > I'm confused. didn't you say dst_ifdown is called after quiescent state?
> 
> Quiescent state should happen after dst->neighbour is invalidated.
> And this implies that all the users of dst->neighbour check validity
> after dereference and do not use it after quiescent state.
> 
> 
> > This does not sound like something that's likely to be accepted in 2.6.21, right?
> > 
> > Any simpler ideas?
> 
> Well, if inifiniband destructor really needs to take that lock... no.
> Right now I do not see.

OK then.
If you post some patches I'll test them.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-18 20:29         ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 20:29 UTC (permalink / raw)
  To: Alexey Kuznetsov; +Cc: netdev, Linux Kernel Mailing List, general

> > > It should be cleared and we should be sure it will not be destroyed
> > > before quiescent state.
> > 
> > I'm confused. didn't you say dst_ifdown is called after quiescent state?
> 
> Quiescent state should happen after dst->neighbour is invalidated.
> And this implies that all the users of dst->neighbour check validity
> after dereference and do not use it after quiescent state.
> 
> 
> > This does not sound like something that's likely to be accepted in 2.6.21, right?
> > 
> > Any simpler ideas?
> 
> Well, if inifiniband destructor really needs to take that lock... no.
> Right now I do not see.

OK then.
If you post some patches I'll test them.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-18 19:12   ` [ofa-general] " Alexey Kuznetsov
                     ` (3 preceding siblings ...)
  (?)
@ 2007-03-18 20:33   ` Michael S. Tsirkin
  2007-03-18 21:06       ` [ofa-general] " Michael S. Tsirkin
  -1 siblings, 1 reply; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 20:33 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Michael S. Tsirkin, Linux Kernel Mailing List, netdev, general,
	Roland Dreier

Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
Subject: Re: dst_ifdown breaks infiniband?
> > Can dst->neighbour be changed to point to NULL instead, and the neighbour
> > released?
> 
> It should be cleared and we should be sure it will not be destroyed
> before quiescent state.
> 
> Seems, this is the only correct solution, but to do this we have
> to audit all the places where dst->neighbour is dereferenced for
> RCU safety.
> 
> Actually, it is very good you caught this eventually, the bug was
> so _disgusting_ that it was "forgotten" all the time, waiting for
> someone who will point out that the king is naked. :-)

Actually that might not be too bad:
$grep -rIi 'dst->neighbour' net/ | wc -l
36

I'll try to do it.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-18 20:33   ` Michael S. Tsirkin
@ 2007-03-18 21:06       ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 21:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexey Kuznetsov, Linux Kernel Mailing List, netdev, general,
	Roland Dreier

> Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> Subject: Re: dst_ifdown breaks infiniband?
> > > Can dst->neighbour be changed to point to NULL instead, and the neighbour
> > > released?
> > 
> > It should be cleared and we should be sure it will not be destroyed
> > before quiescent state.
> > 
> > Seems, this is the only correct solution, but to do this we have
> > to audit all the places where dst->neighbour is dereferenced for
> > RCU safety.
> > 
> > Actually, it is very good you caught this eventually, the bug was
> > so _disgusting_ that it was "forgotten" all the time, waiting for
> > someone who will point out that the king is naked. :-)
> 
> Actually that might not be too bad:
> $grep -rIi 'dst->neighbour' net/ | wc -l
> 36
> 
> I'll try to do it.

Here's the list. Looks OK to me. What do you think?

$grep rIi 'dst->neighbour' net/

./atm/clip.c:395:       if (!skb->dst->neighbour) {
./atm/clip.c:397:               skb->dst->neighbour = clip_find_neighbour(skb->dst, 1);
./atm/clip.c:398:               if (!skb->dst->neighbour) {
./atm/clip.c:409:       entry = NEIGH2ENTRY(skb->dst->neighbour);
./atm/clip.c:426:       DPRINTK("using neighbour %p, vcc %p\n", skb->dst->neighbour, vcc);

The above are all in hard_start_xmit - output routine
so should be OK (atomic) wrt RCU

./core/dst.c:186:       neigh = dst->neighbour;
./core/dst.c:195:               dst->neighbour = NULL;

Looks OK.

./core/dst.c:252:               if (dst->neighbour && dst->neighbour->dev == dev) {
./core/dst.c:253:                       dst->neighbour->dev = &loopback_dev;

This is our boy.

./core/neighbour.c:1045:                        /* On shaper/eql skb->dst->neighbour != neigh :( */
./core/neighbour.c:1046:                        if (skb->dst && skb->dst->neighbour)
./core/neighbour.c:1047:                                n1 = skb->dst->neighbour;

neigh_update - seems to be always called after neigh_lookup
so there is a reference to neighbour.

./core/neighbour.c:1144:        if (!dst || !(neigh = dst->neighbour))

neigh_resolve_output - looks safe

./core/neighbour.c:1174:                      dst, dst ? dst->neighbour : NULL);

merely prints a pointer

./core/neighbour.c:1187:        struct neighbour *neigh = dst->neighbour;

neigh_connected_output - looks safe

./decnet/dn_neigh.c:208:        struct neighbour *neigh = dst->neighbour;
./decnet/dn_neigh.c:226:        struct neighbour *neigh = dst->neighbour;
./decnet/dn_neigh.c:272:        struct neighbour *neigh = dst->neighbour;
./decnet/dn_neigh.c:315:        struct neighbour *neigh = dst->neighbour;
./decnet/dn_route.c:228:        struct dn_dev *dn = dst->neighbour ?
./decnet/dn_route.c:229:                            (struct dn_dev *)dst->neighbour->dev->dn_ptr : NULL;
./decnet/dn_route.c:693:        if ((neigh = dst->neighbour) == NULL)
./decnet/dn_route.c:727:        struct neighbour *neigh = dst->neighbour;

output routines, except
line 228 is dn_dst_update_pmtu, which looks OK as well.

./ipv4/arp.c:445: *     It is very UGLY routine: it DOES NOT use skb->dst->neighbour,
./ipv4/arp.c:508:       struct neighbour *n = dst->neighbour;
./ipv4/arp.c:523:               dst->neighbour = n;

Looks safe.

./ipv4/ip_gre.c:714:                    struct neighbour *neigh = skb->dst->neighbour;
./ipv4/ip_output.c:186: else if (dst->neighbour)
./ipv4/ip_output.c:187:         return dst->neighbour->output(skb);
./ipv6/ip6_output.c:79: else if (dst->neighbour)
./ipv6/ip6_output.c:80:         return dst->neighbour->output(skb);
./ipv6/ip6_output.c:431:        if (skb->dev == dst->dev && dst->neighbour && opt->srcrt == 0) {
./ipv6/ip6_output.c:434:                struct neighbour *n = dst->neighbour;
./ipv6/sit.c:459:                       neigh = skb->dst->neighbour;

These are all output routines

./sched/sch_teql.c:235: struct neighbour *mn = skb->dst->neighbour;

Looks ok - takes reference on the neighbour.

./sched/sch_teql.c:269:     skb->dst->neighbour == NULL)

Looks ok.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-18 21:06       ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 21:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexey Kuznetsov, Linux Kernel Mailing List, general, netdev

> Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> Subject: Re: dst_ifdown breaks infiniband?
> > > Can dst->neighbour be changed to point to NULL instead, and the neighbour
> > > released?
> > 
> > It should be cleared and we should be sure it will not be destroyed
> > before quiescent state.
> > 
> > Seems, this is the only correct solution, but to do this we have
> > to audit all the places where dst->neighbour is dereferenced for
> > RCU safety.
> > 
> > Actually, it is very good you caught this eventually, the bug was
> > so _disgusting_ that it was "forgotten" all the time, waiting for
> > someone who will point out that the king is naked. :-)
> 
> Actually that might not be too bad:
> $grep -rIi 'dst->neighbour' net/ | wc -l
> 36
> 
> I'll try to do it.

Here's the list. Looks OK to me. What do you think?

$grep rIi 'dst->neighbour' net/

./atm/clip.c:395:       if (!skb->dst->neighbour) {
./atm/clip.c:397:               skb->dst->neighbour = clip_find_neighbour(skb->dst, 1);
./atm/clip.c:398:               if (!skb->dst->neighbour) {
./atm/clip.c:409:       entry = NEIGH2ENTRY(skb->dst->neighbour);
./atm/clip.c:426:       DPRINTK("using neighbour %p, vcc %p\n", skb->dst->neighbour, vcc);

The above are all in hard_start_xmit - output routine
so should be OK (atomic) wrt RCU

./core/dst.c:186:       neigh = dst->neighbour;
./core/dst.c:195:               dst->neighbour = NULL;

Looks OK.

./core/dst.c:252:               if (dst->neighbour && dst->neighbour->dev == dev) {
./core/dst.c:253:                       dst->neighbour->dev = &loopback_dev;

This is our boy.

./core/neighbour.c:1045:                        /* On shaper/eql skb->dst->neighbour != neigh :( */
./core/neighbour.c:1046:                        if (skb->dst && skb->dst->neighbour)
./core/neighbour.c:1047:                                n1 = skb->dst->neighbour;

neigh_update - seems to be always called after neigh_lookup
so there is a reference to neighbour.

./core/neighbour.c:1144:        if (!dst || !(neigh = dst->neighbour))

neigh_resolve_output - looks safe

./core/neighbour.c:1174:                      dst, dst ? dst->neighbour : NULL);

merely prints a pointer

./core/neighbour.c:1187:        struct neighbour *neigh = dst->neighbour;

neigh_connected_output - looks safe

./decnet/dn_neigh.c:208:        struct neighbour *neigh = dst->neighbour;
./decnet/dn_neigh.c:226:        struct neighbour *neigh = dst->neighbour;
./decnet/dn_neigh.c:272:        struct neighbour *neigh = dst->neighbour;
./decnet/dn_neigh.c:315:        struct neighbour *neigh = dst->neighbour;
./decnet/dn_route.c:228:        struct dn_dev *dn = dst->neighbour ?
./decnet/dn_route.c:229:                            (struct dn_dev *)dst->neighbour->dev->dn_ptr : NULL;
./decnet/dn_route.c:693:        if ((neigh = dst->neighbour) == NULL)
./decnet/dn_route.c:727:        struct neighbour *neigh = dst->neighbour;

output routines, except
line 228 is dn_dst_update_pmtu, which looks OK as well.

./ipv4/arp.c:445: *     It is very UGLY routine: it DOES NOT use skb->dst->neighbour,
./ipv4/arp.c:508:       struct neighbour *n = dst->neighbour;
./ipv4/arp.c:523:               dst->neighbour = n;

Looks safe.

./ipv4/ip_gre.c:714:                    struct neighbour *neigh = skb->dst->neighbour;
./ipv4/ip_output.c:186: else if (dst->neighbour)
./ipv4/ip_output.c:187:         return dst->neighbour->output(skb);
./ipv6/ip6_output.c:79: else if (dst->neighbour)
./ipv6/ip6_output.c:80:         return dst->neighbour->output(skb);
./ipv6/ip6_output.c:431:        if (skb->dev == dst->dev && dst->neighbour && opt->srcrt == 0) {
./ipv6/ip6_output.c:434:                struct neighbour *n = dst->neighbour;
./ipv6/sit.c:459:                       neigh = skb->dst->neighbour;

These are all output routines

./sched/sch_teql.c:235: struct neighbour *mn = skb->dst->neighbour;

Looks ok - takes reference on the neighbour.

./sched/sch_teql.c:269:     skb->dst->neighbour == NULL)

Looks ok.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-18 21:06       ` [ofa-general] " Michael S. Tsirkin
@ 2007-03-18 21:20         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 21:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexey Kuznetsov, Linux Kernel Mailing List, netdev, general,
	Roland Dreier

> Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> > Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> > Subject: Re: dst_ifdown breaks infiniband?
> > 
> > Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> > Subject: Re: dst_ifdown breaks infiniband?
> > > > Can dst->neighbour be changed to point to NULL instead, and the neighbour
> > > > released?
> > > 
> > > It should be cleared and we should be sure it will not be destroyed
> > > before quiescent state.
> > > 
> > > Seems, this is the only correct solution, but to do this we have
> > > to audit all the places where dst->neighbour is dereferenced for
> > > RCU safety.
> > > 
> > > Actually, it is very good you caught this eventually, the bug was
> > > so _disgusting_ that it was "forgotten" all the time, waiting for
> > > someone who will point out that the king is naked. :-)
> > 
> > Actually that might not be too bad:
> > $grep -rIi 'dst->neighbour' net/ | wc -l
> > 36
> > 
> > I'll try to do it.
> 
> Here's the list. Looks OK to me. What do you think?
> 

So Alexey, how does the following (lightly tested) patch look?
Is this what you had in mind?

-----------------------------

Fix dst_ifdown for infiniband.

Changing dst->neighbour->dev is unsafe because neigh->parms callbacks
are set up for specific device.
We should drop the dst->neighbour reference instead.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>

---

diff --git a/net/core/dst.c b/net/core/dst.c
index 764bccb..27091a5 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -15,6 +15,7 @@
 #include <linux/skbuff.h>
 #include <linux/string.h>
 #include <linux/types.h>
+#include <linux/if_arp.h>
 
 #include <net/dst.h>
 
@@ -235,6 +236,8 @@ again:
 static inline void dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 			      int unregister)
 {
+	struct neighbour *neigh;
+
 	if (dst->ops->ifdown)
 		dst->ops->ifdown(dst, dev, unregister);
 
@@ -245,13 +248,13 @@ static inline void dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 		dst->input = dst_discard_in;
 		dst->output = dst_discard_out;
 	} else {
+		neigh = dst->neighbour;
 		dst->dev = &loopback_dev;
 		dev_hold(&loopback_dev);
 		dev_put(dev);
-		if (dst->neighbour && dst->neighbour->dev == dev) {
-			dst->neighbour->dev = &loopback_dev;
-			dev_put(dev);
-			dev_hold(&loopback_dev);
+		if (neigh && neigh->dev == dev) {
+			dst->neighbour = NULL;
+			neigh_release(neigh);
 		}
 	}
 }


-- 
MST

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-18 21:20         ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 21:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexey Kuznetsov, Linux Kernel Mailing List, general, netdev

> Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> > Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> > Subject: Re: dst_ifdown breaks infiniband?
> > 
> > Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> > Subject: Re: dst_ifdown breaks infiniband?
> > > > Can dst->neighbour be changed to point to NULL instead, and the neighbour
> > > > released?
> > > 
> > > It should be cleared and we should be sure it will not be destroyed
> > > before quiescent state.
> > > 
> > > Seems, this is the only correct solution, but to do this we have
> > > to audit all the places where dst->neighbour is dereferenced for
> > > RCU safety.
> > > 
> > > Actually, it is very good you caught this eventually, the bug was
> > > so _disgusting_ that it was "forgotten" all the time, waiting for
> > > someone who will point out that the king is naked. :-)
> > 
> > Actually that might not be too bad:
> > $grep -rIi 'dst->neighbour' net/ | wc -l
> > 36
> > 
> > I'll try to do it.
> 
> Here's the list. Looks OK to me. What do you think?
> 

So Alexey, how does the following (lightly tested) patch look?
Is this what you had in mind?

-----------------------------

Fix dst_ifdown for infiniband.

Changing dst->neighbour->dev is unsafe because neigh->parms callbacks
are set up for specific device.
We should drop the dst->neighbour reference instead.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>

---

diff --git a/net/core/dst.c b/net/core/dst.c
index 764bccb..27091a5 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -15,6 +15,7 @@
 #include <linux/skbuff.h>
 #include <linux/string.h>
 #include <linux/types.h>
+#include <linux/if_arp.h>
 
 #include <net/dst.h>
 
@@ -235,6 +236,8 @@ again:
 static inline void dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 			      int unregister)
 {
+	struct neighbour *neigh;
+
 	if (dst->ops->ifdown)
 		dst->ops->ifdown(dst, dev, unregister);
 
@@ -245,13 +248,13 @@ static inline void dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 		dst->input = dst_discard_in;
 		dst->output = dst_discard_out;
 	} else {
+		neigh = dst->neighbour;
 		dst->dev = &loopback_dev;
 		dev_hold(&loopback_dev);
 		dev_put(dev);
-		if (dst->neighbour && dst->neighbour->dev == dev) {
-			dst->neighbour->dev = &loopback_dev;
-			dev_put(dev);
-			dev_hold(&loopback_dev);
+		if (neigh && neigh->dev == dev) {
+			dst->neighbour = NULL;
+			neigh_release(neigh);
 		}
 	}
 }


-- 
MST

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
  2007-03-18 20:25   ` Michael S. Tsirkin
@ 2007-03-18 22:24       ` Eric W. Biederman
  0 siblings, 0 replies; 59+ messages in thread
From: Eric W. Biederman @ 2007-03-18 22:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexey Kuznetsov, netdev, Linux Kernel Mailing List, general

"Michael S. Tsirkin" <mst@dev.mellanox.co.il> writes:

>> > Why is neighbour->dev changed here?
>> 
>> It holds reference to device and prevents its destruction.
>> If dst is held somewhere, we cannot destroy the device and deadlock
>> while unregister.
>
> BTW, can this ever happen for the loopback device itself?
> Is it ever unregistered?

Well I don't think the loopback device is currently but as soon
as we get network namespace support we will have multiple loopback
devices and they will get unregistered when we remove the network
namespace.

Eric

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-18 22:24       ` Eric W. Biederman
  0 siblings, 0 replies; 59+ messages in thread
From: Eric W. Biederman @ 2007-03-18 22:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexey Kuznetsov, Linux Kernel Mailing List, general, netdev

"Michael S. Tsirkin" <mst@dev.mellanox.co.il> writes:

>> > Why is neighbour->dev changed here?
>> 
>> It holds reference to device and prevents its destruction.
>> If dst is held somewhere, we cannot destroy the device and deadlock
>> while unregister.
>
> BTW, can this ever happen for the loopback device itself?
> Is it ever unregistered?

Well I don't think the loopback device is currently but as soon
as we get network namespace support we will have multiple loopback
devices and they will get unregistered when we remove the network
namespace.

Eric

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
  2007-03-18 22:24       ` Eric W. Biederman
@ 2007-03-18 22:36         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 22:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael S. Tsirkin, Alexey Kuznetsov, netdev,
	Linux Kernel Mailing List, general

> Quoting Eric W. Biederman <ebiederman@lnxi.com>:
> Subject: Re: [ofa-general] Re: dst_ifdown breaks infiniband?
> 
> "Michael S. Tsirkin" <mst@dev.mellanox.co.il> writes:
> 
> >> > Why is neighbour->dev changed here?
> >> 
> >> It holds reference to device and prevents its destruction.
> >> If dst is held somewhere, we cannot destroy the device and deadlock
> >> while unregister.
> >
> > BTW, can this ever happen for the loopback device itself?
> > Is it ever unregistered?
> 
> Well I don't think the loopback device is currently but as soon
> as we get network namespace support we will have multiple loopback
> devices and they will get unregistered when we remove the network
> namespace.

Hmm. Then the code moving dst->dev to point to the loopback
device will have to be fixed too. I'll post a patch a bit later.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-18 22:36         ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 22:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexey Kuznetsov, Linux Kernel Mailing List, general, netdev

> Quoting Eric W. Biederman <ebiederman@lnxi.com>:
> Subject: Re: [ofa-general] Re: dst_ifdown breaks infiniband?
> 
> "Michael S. Tsirkin" <mst@dev.mellanox.co.il> writes:
> 
> >> > Why is neighbour->dev changed here?
> >> 
> >> It holds reference to device and prevents its destruction.
> >> If dst is held somewhere, we cannot destroy the device and deadlock
> >> while unregister.
> >
> > BTW, can this ever happen for the loopback device itself?
> > Is it ever unregistered?
> 
> Well I don't think the loopback device is currently but as soon
> as we get network namespace support we will have multiple loopback
> devices and they will get unregistered when we remove the network
> namespace.

Hmm. Then the code moving dst->dev to point to the loopback
device will have to be fixed too. I'll post a patch a bit later.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
  2007-03-18 22:36         ` Michael S. Tsirkin
  (?)
@ 2007-03-18 22:42         ` Michael S. Tsirkin
  2007-03-19  0:13           ` David Miller
  2007-03-19  9:24             ` Alexey Kuznetsov
  -1 siblings, 2 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-18 22:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric W. Biederman, Alexey Kuznetsov, netdev,
	Linux Kernel Mailing List, general

> Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> Subject: Re: [ofa-general] Re: dst_ifdown breaks infiniband?
> 
> > Quoting Eric W. Biederman <ebiederman@lnxi.com>:
> > Subject: Re: [ofa-general] Re: dst_ifdown breaks infiniband?
> > 
> > "Michael S. Tsirkin" <mst@dev.mellanox.co.il> writes:
> > 
> > >> > Why is neighbour->dev changed here?
> > >> 
> > >> It holds reference to device and prevents its destruction.
> > >> If dst is held somewhere, we cannot destroy the device and deadlock
> > >> while unregister.
> > >
> > > BTW, can this ever happen for the loopback device itself?
> > > Is it ever unregistered?
> > 
> > Well I don't think the loopback device is currently but as soon
> > as we get network namespace support we will have multiple loopback
> > devices and they will get unregistered when we remove the network
> > namespace.
> 
> Hmm. Then the code moving dst->dev to point to the loopback
> device will have to be fixed too. I'll post a patch a bit later.

Does this look sane (untested)?

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>


diff --git a/net/core/dst.c b/net/core/dst.c
index 764bccb..8283158 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -235,6 +236,8 @@ again:
 static inline void dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 			      int unregister)
 {
+	struct neighbour *neigh;
+
 	if (dst->ops->ifdown)
 		dst->ops->ifdown(dst, dev, unregister);
 
@@ -245,14 +248,13 @@ static inline void dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 		dst->input = dst_discard_in;
 		dst->output = dst_discard_out;
 	} else {
-		dst->dev = &loopback_dev;
-		dev_hold(&loopback_dev);
-		dev_put(dev);
-		if (dst->neighbour && dst->neighbour->dev == dev) {
-			dst->neighbour->dev = &loopback_dev;
-			dev_put(dev);
-			dev_hold(&loopback_dev);
+		neigh = dst->neighbour;
+		if (neigh && neigh->dev == dev) {
+			dst->neighbour = NULL;
+			neigh_release(neigh);
 		}
+		dst->dev = NULL;
+		dev_put(dev);
 	}
 }
 

-- 
MST

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
  2007-03-18 22:42         ` Michael S. Tsirkin
@ 2007-03-19  0:13           ` David Miller
  2007-03-19  5:19               ` Michael S. Tsirkin
  2007-03-19  5:30               ` Eric W. Biederman
  2007-03-19  9:24             ` Alexey Kuznetsov
  1 sibling, 2 replies; 59+ messages in thread
From: David Miller @ 2007-03-19  0:13 UTC (permalink / raw)
  To: mst; +Cc: ebiederman, kuznet, netdev, linux-kernel, general

From: "Michael S. Tsirkin" <mst@dev.mellanox.co.il>
Date: Mon, 19 Mar 2007 00:42:34 +0200

> > Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> > Subject: Re: [ofa-general] Re: dst_ifdown breaks infiniband?
> > 
> > > Quoting Eric W. Biederman <ebiederman@lnxi.com>:
> > > Subject: Re: [ofa-general] Re: dst_ifdown breaks infiniband?
> > > 
> > > "Michael S. Tsirkin" <mst@dev.mellanox.co.il> writes:
> > > 
> > > >> > Why is neighbour->dev changed here?
> > > >> 
> > > >> It holds reference to device and prevents its destruction.
> > > >> If dst is held somewhere, we cannot destroy the device and deadlock
> > > >> while unregister.
> > > >
> > > > BTW, can this ever happen for the loopback device itself?
> > > > Is it ever unregistered?
> > > 
> > > Well I don't think the loopback device is currently but as soon
> > > as we get network namespace support we will have multiple loopback
> > > devices and they will get unregistered when we remove the network
> > > namespace.
> > 
> > Hmm. Then the code moving dst->dev to point to the loopback
> > device will have to be fixed too. I'll post a patch a bit later.
> 
> Does this look sane (untested)?
> 
> Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>

You can't point it at NULL, we don't point it at loopback
just for fun.

There can be asynchronous paths elsewhere in the networking still
referencing the neigh or dst and they will (correctly) feel free to
derefence whatever device is hanging there.  So transitioning
to NULL is invalid.

You guys will need to come up with a better solution for this silly
situation with network namespaces.  Loopback is always available to
point dead routes and neighbour entries at, and this assumption is
massively rooted in the networking.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-18 21:20         ` [ofa-general] " Michael S. Tsirkin
@ 2007-03-19  5:15           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-19  5:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexey Kuznetsov, Linux Kernel Mailing List, netdev, general,
	Roland Dreier

> Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> > Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> > Subject: Re: dst_ifdown breaks infiniband?
> > 
> > > Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> > > Subject: Re: dst_ifdown breaks infiniband?
> > > 
> > > Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> > > Subject: Re: dst_ifdown breaks infiniband?
> > > > > Can dst->neighbour be changed to point to NULL instead, and the neighbour
> > > > > released?
> > > > 
> > > > It should be cleared and we should be sure it will not be destroyed
> > > > before quiescent state.
> > > > 
> > > > Seems, this is the only correct solution, but to do this we have
> > > > to audit all the places where dst->neighbour is dereferenced for
> > > > RCU safety.
> > > > 
> > > > Actually, it is very good you caught this eventually, the bug was
> > > > so _disgusting_ that it was "forgotten" all the time, waiting for
> > > > someone who will point out that the king is naked. :-)
> > > 
> > > Actually that might not be too bad:
> > > $grep -rIi 'dst->neighbour' net/ | wc -l
> > > 36
> > > 
> > > I'll try to do it.
> > 
> > Here's the list. Looks OK to me. What do you think?
> > 
> 
> So Alexey, how does the following (lightly tested) patch look?
> Is this what you had in mind?
> 
> -----------------------------
> 
> Fix dst_ifdown for infiniband.
> 
> Changing dst->neighbour->dev is unsafe because neigh->parms callbacks
> are set up for specific device.
> We should drop the dst->neighbour reference instead.
> 
> Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>

Ugh, looked again and this looks obviously broken.
Note to self - stop writing code at 23:00.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-19  5:15           ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-19  5:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexey Kuznetsov, Linux Kernel Mailing List, general, netdev

> Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> > Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> > Subject: Re: dst_ifdown breaks infiniband?
> > 
> > > Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> > > Subject: Re: dst_ifdown breaks infiniband?
> > > 
> > > Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> > > Subject: Re: dst_ifdown breaks infiniband?
> > > > > Can dst->neighbour be changed to point to NULL instead, and the neighbour
> > > > > released?
> > > > 
> > > > It should be cleared and we should be sure it will not be destroyed
> > > > before quiescent state.
> > > > 
> > > > Seems, this is the only correct solution, but to do this we have
> > > > to audit all the places where dst->neighbour is dereferenced for
> > > > RCU safety.
> > > > 
> > > > Actually, it is very good you caught this eventually, the bug was
> > > > so _disgusting_ that it was "forgotten" all the time, waiting for
> > > > someone who will point out that the king is naked. :-)
> > > 
> > > Actually that might not be too bad:
> > > $grep -rIi 'dst->neighbour' net/ | wc -l
> > > 36
> > > 
> > > I'll try to do it.
> > 
> > Here's the list. Looks OK to me. What do you think?
> > 
> 
> So Alexey, how does the following (lightly tested) patch look?
> Is this what you had in mind?
> 
> -----------------------------
> 
> Fix dst_ifdown for infiniband.
> 
> Changing dst->neighbour->dev is unsafe because neigh->parms callbacks
> are set up for specific device.
> We should drop the dst->neighbour reference instead.
> 
> Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>

Ugh, looked again and this looks obviously broken.
Note to self - stop writing code at 23:00.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
  2007-03-19  0:13           ` David Miller
@ 2007-03-19  5:19               ` Michael S. Tsirkin
  2007-03-19  5:30               ` Eric W. Biederman
  1 sibling, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-19  5:19 UTC (permalink / raw)
  To: David Miller; +Cc: mst, ebiederman, kuznet, netdev, linux-kernel, general

> Quoting David Miller <davem@davemloft.net>:
> Subject: Re: [ofa-general] Re: dst_ifdown breaks infiniband?
> 
> From: "Michael S. Tsirkin" <mst@dev.mellanox.co.il>
> Date: Mon, 19 Mar 2007 00:42:34 +0200
> 
> > > Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> > > Subject: Re: [ofa-general] Re: dst_ifdown breaks infiniband?
> > > 
> > > > Quoting Eric W. Biederman <ebiederman@lnxi.com>:
> > > > Subject: Re: [ofa-general] Re: dst_ifdown breaks infiniband?
> > > > 
> > > > "Michael S. Tsirkin" <mst@dev.mellanox.co.il> writes:
> > > > 
> > > > >> > Why is neighbour->dev changed here?
> > > > >> 
> > > > >> It holds reference to device and prevents its destruction.
> > > > >> If dst is held somewhere, we cannot destroy the device and deadlock
> > > > >> while unregister.
> > > > >
> > > > > BTW, can this ever happen for the loopback device itself?
> > > > > Is it ever unregistered?
> > > > 
> > > > Well I don't think the loopback device is currently but as soon
> > > > as we get network namespace support we will have multiple loopback
> > > > devices and they will get unregistered when we remove the network
> > > > namespace.
> > > 
> > > Hmm. Then the code moving dst->dev to point to the loopback
> > > device will have to be fixed too. I'll post a patch a bit later.
> > 
> > Does this look sane (untested)?
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
> 
> You can't point it at NULL, we don't point it at loopback
> just for fun.
> 
> There can be asynchronous paths elsewhere in the networking still
> referencing the neigh or dst and they will (correctly) feel free to
> derefence whatever device is hanging there.  So transitioning
> to NULL is invalid.
> 
> You guys will need to come up with a better solution for this silly
> situation with network namespaces.  Loopback is always available to
> point dead routes and neighbour entries at, and this assumption is
> massively rooted in the networking.

Yes, I see this now.

I guess it's best to focus on the original problem with dst_ifdown breaking
infiniband for now.

For that, we have to audit all the places where dst->neighbour is dereferenced for
RCU safety, and this is already a massive task.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-19  5:19               ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-19  5:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel, general, kuznet

> Quoting David Miller <davem@davemloft.net>:
> Subject: Re: [ofa-general] Re: dst_ifdown breaks infiniband?
> 
> From: "Michael S. Tsirkin" <mst@dev.mellanox.co.il>
> Date: Mon, 19 Mar 2007 00:42:34 +0200
> 
> > > Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> > > Subject: Re: [ofa-general] Re: dst_ifdown breaks infiniband?
> > > 
> > > > Quoting Eric W. Biederman <ebiederman@lnxi.com>:
> > > > Subject: Re: [ofa-general] Re: dst_ifdown breaks infiniband?
> > > > 
> > > > "Michael S. Tsirkin" <mst@dev.mellanox.co.il> writes:
> > > > 
> > > > >> > Why is neighbour->dev changed here?
> > > > >> 
> > > > >> It holds reference to device and prevents its destruction.
> > > > >> If dst is held somewhere, we cannot destroy the device and deadlock
> > > > >> while unregister.
> > > > >
> > > > > BTW, can this ever happen for the loopback device itself?
> > > > > Is it ever unregistered?
> > > > 
> > > > Well I don't think the loopback device is currently but as soon
> > > > as we get network namespace support we will have multiple loopback
> > > > devices and they will get unregistered when we remove the network
> > > > namespace.
> > > 
> > > Hmm. Then the code moving dst->dev to point to the loopback
> > > device will have to be fixed too. I'll post a patch a bit later.
> > 
> > Does this look sane (untested)?
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
> 
> You can't point it at NULL, we don't point it at loopback
> just for fun.
> 
> There can be asynchronous paths elsewhere in the networking still
> referencing the neigh or dst and they will (correctly) feel free to
> derefence whatever device is hanging there.  So transitioning
> to NULL is invalid.
> 
> You guys will need to come up with a better solution for this silly
> situation with network namespaces.  Loopback is always available to
> point dead routes and neighbour entries at, and this assumption is
> massively rooted in the networking.

Yes, I see this now.

I guess it's best to focus on the original problem with dst_ifdown breaking
infiniband for now.

For that, we have to audit all the places where dst->neighbour is dereferenced for
RCU safety, and this is already a massive task.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
  2007-03-19  0:13           ` David Miller
@ 2007-03-19  5:30               ` Eric W. Biederman
  2007-03-19  5:30               ` Eric W. Biederman
  1 sibling, 0 replies; 59+ messages in thread
From: Eric W. Biederman @ 2007-03-19  5:30 UTC (permalink / raw)
  To: David Miller; +Cc: mst, ebiederman, kuznet, netdev, linux-kernel, general

David Miller <davem@davemloft.net> writes:

> From: "Michael S. Tsirkin" <mst@dev.mellanox.co.il>
> Date: Mon, 19 Mar 2007 00:42:34 +0200

>> > Hmm. Then the code moving dst->dev to point to the loopback
>> > device will have to be fixed too. I'll post a patch a bit later.
>> 
>> Does this look sane (untested)?
>> 
>> Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
>
> You can't point it at NULL, we don't point it at loopback
> just for fun.
>
> There can be asynchronous paths elsewhere in the networking still
> referencing the neigh or dst and they will (correctly) feel free to
> derefence whatever device is hanging there.  So transitioning
> to NULL is invalid.
>
> You guys will need to come up with a better solution for this silly
> situation with network namespaces.  Loopback is always available to
> point dead routes and neighbour entries at, and this assumption is
> massively rooted in the networking.

Sure.  In the network namespace case I think the careful ordering of the
shutdown handles that case.   Even with per network namespace lo
unregistered it still existed until the network namespace actually
exited.  And it only happened on exit.  

So while there may be a tiny race there it hasn't been an issue yet
in practice.

I wasn't proposing that we fix it this way.  I was simply saying that
there was the possibility for the case to exist.  The existence of
a per network namespace loopback device is fairly fundamental to the
network namespace concept.  Heck I think Herbert has been looking at
it for vserver which almost totally socket isolation.

Eric

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-19  5:30               ` Eric W. Biederman
  0 siblings, 0 replies; 59+ messages in thread
From: Eric W. Biederman @ 2007-03-19  5:30 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel, general, kuznet

David Miller <davem@davemloft.net> writes:

> From: "Michael S. Tsirkin" <mst@dev.mellanox.co.il>
> Date: Mon, 19 Mar 2007 00:42:34 +0200

>> > Hmm. Then the code moving dst->dev to point to the loopback
>> > device will have to be fixed too. I'll post a patch a bit later.
>> 
>> Does this look sane (untested)?
>> 
>> Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
>
> You can't point it at NULL, we don't point it at loopback
> just for fun.
>
> There can be asynchronous paths elsewhere in the networking still
> referencing the neigh or dst and they will (correctly) feel free to
> derefence whatever device is hanging there.  So transitioning
> to NULL is invalid.
>
> You guys will need to come up with a better solution for this silly
> situation with network namespaces.  Loopback is always available to
> point dead routes and neighbour entries at, and this assumption is
> massively rooted in the networking.

Sure.  In the network namespace case I think the careful ordering of the
shutdown handles that case.   Even with per network namespace lo
unregistered it still existed until the network namespace actually
exited.  And it only happened on exit.  

So while there may be a tiny race there it hasn't been an issue yet
in practice.

I wasn't proposing that we fix it this way.  I was simply saying that
there was the possibility for the case to exist.  The existence of
a per network namespace loopback device is fairly fundamental to the
network namespace concept.  Heck I think Herbert has been looking at
it for vserver which almost totally socket isolation.

Eric

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
  2007-03-19  5:30               ` Eric W. Biederman
  (?)
@ 2007-03-19  6:13               ` David Miller
  2007-03-19  9:34                 ` Alexey Kuznetsov
  2007-03-19 15:10                   ` Eric W. Biederman
  -1 siblings, 2 replies; 59+ messages in thread
From: David Miller @ 2007-03-19  6:13 UTC (permalink / raw)
  To: ebiederman; +Cc: mst, kuznet, netdev, linux-kernel, general

From: ebiederman@lnxi.com (Eric W. Biederman)
Date: Sun, 18 Mar 2007 23:30:39 -0600

> Sure.  In the network namespace case I think the careful ordering of the
> shutdown handles that case.   Even with per network namespace lo
> unregistered it still existed until the network namespace actually
> exited.  And it only happened on exit.  
> 
> So while there may be a tiny race there it hasn't been an issue yet
> in practice.

I think the thing to do is to just leave the loopback references
in place, try to unregister the per-namespace loopback device,
and that will safely wait for all the references to go away.

If you do it that way, you should need absolutely no changes to
the other code in this area.

As per Herbert, I think he works on Xen rather than vserver :-)
Perhaps you're thinking of Alexey Kuznetsov or another one of the
vserver guys.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
  2007-03-18 22:24       ` Eric W. Biederman
  (?)
  (?)
@ 2007-03-19  9:20       ` Alexey Kuznetsov
  -1 siblings, 0 replies; 59+ messages in thread
From: Alexey Kuznetsov @ 2007-03-19  9:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael S. Tsirkin, netdev, Linux Kernel Mailing List, general

Hello!

> Well I don't think the loopback device is currently but as soon
> as we get network namespace support we will have multiple loopback
> devices and they will get unregistered when we remove the network
> namespace.

There is no logical difference. At the moment when namespace is gone
there is nobody who can hold unrevokable references to this loopback.

Alexey

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
  2007-03-18 22:42         ` Michael S. Tsirkin
@ 2007-03-19  9:24             ` Alexey Kuznetsov
  2007-03-19  9:24             ` Alexey Kuznetsov
  1 sibling, 0 replies; 59+ messages in thread
From: Alexey Kuznetsov @ 2007-03-19  9:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric W. Biederman, netdev, Linux Kernel Mailing List, general

Hello!

> Does this look sane (untested)?

It does not, unfortunately.

Instead of regular crash in infiniband you will get numerous
random NULL pointer dereferences both due to dst->neighbour
and due to dst->dev.

Alexey

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-19  9:24             ` Alexey Kuznetsov
  0 siblings, 0 replies; 59+ messages in thread
From: Alexey Kuznetsov @ 2007-03-19  9:24 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, general, Linux Kernel Mailing List

Hello!

> Does this look sane (untested)?

It does not, unfortunately.

Instead of regular crash in infiniband you will get numerous
random NULL pointer dereferences both due to dst->neighbour
and due to dst->dev.

Alexey

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
  2007-03-19  9:24             ` Alexey Kuznetsov
  (?)
@ 2007-03-19  9:33             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-19  9:33 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Michael S. Tsirkin, Eric W. Biederman, netdev,
	Linux Kernel Mailing List, general

> Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> Subject: Re: [ofa-general] Re: dst_ifdown breaks infiniband?
> 
> > Does this look sane (untested)?
> 
> It does not, unfortunately.
> 
> Instead of regular crash in infiniband you will get numerous
> random NULL pointer dereferences both due to dst->neighbour
> and due to dst->dev.

Right, I saw this clearly in the morning. Thanks.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
  2007-03-19  6:13               ` David Miller
@ 2007-03-19  9:34                 ` Alexey Kuznetsov
  2007-03-19 15:10                   ` Eric W. Biederman
  1 sibling, 0 replies; 59+ messages in thread
From: Alexey Kuznetsov @ 2007-03-19  9:34 UTC (permalink / raw)
  To: David Miller; +Cc: ebiederman, mst, netdev, linux-kernel, general

Hello!

> I think the thing to do is to just leave the loopback references
> in place, try to unregister the per-namespace loopback device,
> and that will safely wait for all the references to go away.

Yes, it is exactly how it works in openvz. All the sockets are killed,
queues are cleared, nobody holds references and virtual loopback
can be unregistered just like another device.

Alexey

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-18 20:18     ` Alexey Kuznetsov
@ 2007-03-19  9:36         ` Michael S. Tsirkin
  2007-03-19  9:36         ` [ofa-general] " Michael S. Tsirkin
  1 sibling, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-19  9:36 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Michael S. Tsirkin, Linux Kernel Mailing List, netdev, general,
	Roland Dreier, David Miller

> > Any simpler ideas?
> 
> Well, if inifiniband destructor really needs to take that lock... no.
> Right now I do not see.

OK, this is actually not hard to fix - for infiniband, we can just look at
neighbour->dev->type or compare neighbour->dev and
neighbour->parms->dev - if they are different, device is being unregistered,
so we do not need to do anything in the destructor.

I'll send a patch to openfabrics, shortly.

However, after implementing this fix, I hit what could be use after
free at module unloading. Dave, Alexey, Roland, could you take a look at
the following please?

Works fine for me (survived a couple of hours of crazy device
loading/unloading/up/down/hotplug + link data and state activity)
and seems to fix the issue.

---------

If a device driver sets neigh_destructor in neigh_params, this could
get called after the device has been unregistered and the driver module
removed.

This is an old bug, but apparently, started to get triggered more infiniband
after recent multicast and connected mode changes.

Fix this by delaying dev_put until the neigh_params object is removed.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 3183142..cb34f1a 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1313,8 +1313,6 @@ void neigh_parms_release(struct neigh_table *tbl, struct neigh_parms *parms)
 			*p = parms->next;
 			parms->dead = 1;
 			write_unlock_bh(&tbl->lock);
-			if (parms->dev)
-				dev_put(parms->dev);
 			call_rcu(&parms->rcu_head, neigh_rcu_free_parms);
 			return;
 		}
@@ -1325,6 +1323,8 @@ void neigh_parms_release(struct neigh_table *tbl, struct neigh_parms *parms)
 
 void neigh_parms_destroy(struct neigh_parms *parms)
 {
+	if (parms->dev)
+		dev_put(parms->dev);
 	kfree(parms);
 }
 

-- 
MST

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-19  9:36         ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-19  9:36 UTC (permalink / raw)
  To: Alexey Kuznetsov; +Cc: netdev, Linux Kernel Mailing List, general, David Miller

> > Any simpler ideas?
> 
> Well, if inifiniband destructor really needs to take that lock... no.
> Right now I do not see.

OK, this is actually not hard to fix - for infiniband, we can just look at
neighbour->dev->type or compare neighbour->dev and
neighbour->parms->dev - if they are different, device is being unregistered,
so we do not need to do anything in the destructor.

I'll send a patch to openfabrics, shortly.

However, after implementing this fix, I hit what could be use after
free at module unloading. Dave, Alexey, Roland, could you take a look at
the following please?

Works fine for me (survived a couple of hours of crazy device
loading/unloading/up/down/hotplug + link data and state activity)
and seems to fix the issue.

---------

If a device driver sets neigh_destructor in neigh_params, this could
get called after the device has been unregistered and the driver module
removed.

This is an old bug, but apparently, started to get triggered more infiniband
after recent multicast and connected mode changes.

Fix this by delaying dev_put until the neigh_params object is removed.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 3183142..cb34f1a 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1313,8 +1313,6 @@ void neigh_parms_release(struct neigh_table *tbl, struct neigh_parms *parms)
 			*p = parms->next;
 			parms->dead = 1;
 			write_unlock_bh(&tbl->lock);
-			if (parms->dev)
-				dev_put(parms->dev);
 			call_rcu(&parms->rcu_head, neigh_rcu_free_parms);
 			return;
 		}
@@ -1325,6 +1323,8 @@ void neigh_parms_release(struct neigh_table *tbl, struct neigh_parms *parms)
 
 void neigh_parms_destroy(struct neigh_parms *parms)
 {
+	if (parms->dev)
+		dev_put(parms->dev);
 	kfree(parms);
 }
 

-- 
MST

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-19  9:36         ` [ofa-general] " Michael S. Tsirkin
@ 2007-03-19  9:55           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-19  9:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexey Kuznetsov, Linux Kernel Mailing List, netdev, general,
	Roland Dreier, David Miller

> Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> > > Any simpler ideas?
> > 
> > Well, if inifiniband destructor really needs to take that lock... no.
> > Right now I do not see.
> 
> OK, this is actually not hard to fix - for infiniband, we can just look at
> neighbour->dev->type or compare neighbour->dev and
> neighbour->parms->dev - if they are different, device is being unregistered,
> so we do not need to do anything in the destructor.
> 
> I'll send a patch to openfabrics, shortly.
> 
> However, after implementing this fix, I hit what could be use after
> free at module unloading. Dave, Alexey, Roland, could you take a look at
> the following please?
> 
> Works fine for me (survived a couple of hours of crazy device
> loading/unloading/up/down/hotplug + link data and state activity)
> and seems to fix the issue.
> 
> ---------
> 
> If a device driver sets neigh_destructor in neigh_params, this could
> get called after the device has been unregistered and the driver module
> removed.
> 
> This is an old bug, but apparently, started to get triggered more infiniband
> after recent multicast and connected mode changes.
> 
> Fix this by delaying dev_put until the neigh_params object is removed.
> 
> Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>

The problem seems real enough but the fix seems no good -
device unregister gets blocked with
	unregister_netdevice: waiting for ib0 to become free. Usage count = 1

It seems the parms object can survive indefinitely after device is removed.

How about creating a new parms object in dst_ifdown, and pointing neighbour
to this?

Would that work?
The advantage of this approach is that neigh->parms is already protected
by RCU.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-19  9:55           ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-19  9:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, Linux Kernel Mailing List, general, Alexey Kuznetsov,
	David Miller

> Quoting Michael S. Tsirkin <mst@dev.mellanox.co.il>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> > > Any simpler ideas?
> > 
> > Well, if inifiniband destructor really needs to take that lock... no.
> > Right now I do not see.
> 
> OK, this is actually not hard to fix - for infiniband, we can just look at
> neighbour->dev->type or compare neighbour->dev and
> neighbour->parms->dev - if they are different, device is being unregistered,
> so we do not need to do anything in the destructor.
> 
> I'll send a patch to openfabrics, shortly.
> 
> However, after implementing this fix, I hit what could be use after
> free at module unloading. Dave, Alexey, Roland, could you take a look at
> the following please?
> 
> Works fine for me (survived a couple of hours of crazy device
> loading/unloading/up/down/hotplug + link data and state activity)
> and seems to fix the issue.
> 
> ---------
> 
> If a device driver sets neigh_destructor in neigh_params, this could
> get called after the device has been unregistered and the driver module
> removed.
> 
> This is an old bug, but apparently, started to get triggered more infiniband
> after recent multicast and connected mode changes.
> 
> Fix this by delaying dev_put until the neigh_params object is removed.
> 
> Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>

The problem seems real enough but the fix seems no good -
device unregister gets blocked with
	unregister_netdevice: waiting for ib0 to become free. Usage count = 1

It seems the parms object can survive indefinitely after device is removed.

How about creating a new parms object in dst_ifdown, and pointing neighbour
to this?

Would that work?
The advantage of this approach is that neigh->parms is already protected
by RCU.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-19  9:36         ` [ofa-general] " Michael S. Tsirkin
@ 2007-03-19 12:05           ` Alexey Kuznetsov
  -1 siblings, 0 replies; 59+ messages in thread
From: Alexey Kuznetsov @ 2007-03-19 12:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Linux Kernel Mailing List, netdev, general, Roland Dreier, David Miller

Hello!

> If a device driver sets neigh_destructor in neigh_params, this could
> get called after the device has been unregistered and the driver module
> removed.

It is the same problem: if dst->neighbour holds neighbour, it should
not hold device. parms->dev is not supposed to be used after
neigh_parms_release(). F.e. set parms->dev to NULL to catch bad references.


Do you search for a way to find real inifiniband device in
ipoib_neigh_destructor()? I guess you will not be able.
The problem is logical: if destructor needs device, neighbour entry
_somehow_ have to hold reference to the device (via neigh->dev, neigh->parms,
whatever). Hence, if we hold neighbour entry, unregister cannot be completed.
Therefore, destructor cannot refer to device. Q.E.D. :-)

Seems, releasing dst->neighbour is inevitable.

Alexey

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-19 12:05           ` Alexey Kuznetsov
  0 siblings, 0 replies; 59+ messages in thread
From: Alexey Kuznetsov @ 2007-03-19 12:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, Linux Kernel Mailing List, general, David Miller

Hello!

> If a device driver sets neigh_destructor in neigh_params, this could
> get called after the device has been unregistered and the driver module
> removed.

It is the same problem: if dst->neighbour holds neighbour, it should
not hold device. parms->dev is not supposed to be used after
neigh_parms_release(). F.e. set parms->dev to NULL to catch bad references.


Do you search for a way to find real inifiniband device in
ipoib_neigh_destructor()? I guess you will not be able.
The problem is logical: if destructor needs device, neighbour entry
_somehow_ have to hold reference to the device (via neigh->dev, neigh->parms,
whatever). Hence, if we hold neighbour entry, unregister cannot be completed.
Therefore, destructor cannot refer to device. Q.E.D. :-)

Seems, releasing dst->neighbour is inevitable.

Alexey

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-19 12:05           ` [ofa-general] " Alexey Kuznetsov
@ 2007-03-19 12:12             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-19 12:12 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Michael S. Tsirkin, Linux Kernel Mailing List, netdev, general,
	Roland Dreier, David Miller

> Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> Hello!
> 
> > If a device driver sets neigh_destructor in neigh_params, this could
> > get called after the device has been unregistered and the driver module
> > removed.
> 
> It is the same problem: if dst->neighbour holds neighbour, it should
> not hold device. parms->dev is not supposed to be used after
> neigh_parms_release(). F.e. set parms->dev to NULL to catch bad references.

Yes. I fixed that - simply checking that neighbour->dev is a
loopback device is sufficient to detect the fact that
the device is being unregistered.

> Do you search for a way to find real inifiniband device in
> ipoib_neigh_destructor()?

No, not anymore.

> I guess you will not be able.

I agree it's not possible.

> The problem is logical: if destructor needs device, neighbour entry
> _somehow_ have to hold reference to the device (via neigh->dev, neigh->parms,
> whatever). Hence, if we hold neighbour entry, unregister cannot be completed.
> Therefore, destructor cannot refer to device. Q.E.D. :-)
> 
> Seems, releasing dst->neighbour is inevitable.

infiniband sets parm->neigh_destructor, and I search for a way to prevent
this destructor from being called after the module has been unloaded.
Ideas?

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-19 12:12             ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-19 12:12 UTC (permalink / raw)
  To: Alexey Kuznetsov; +Cc: netdev, Linux Kernel Mailing List, general, David Miller

> Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> Hello!
> 
> > If a device driver sets neigh_destructor in neigh_params, this could
> > get called after the device has been unregistered and the driver module
> > removed.
> 
> It is the same problem: if dst->neighbour holds neighbour, it should
> not hold device. parms->dev is not supposed to be used after
> neigh_parms_release(). F.e. set parms->dev to NULL to catch bad references.

Yes. I fixed that - simply checking that neighbour->dev is a
loopback device is sufficient to detect the fact that
the device is being unregistered.

> Do you search for a way to find real inifiniband device in
> ipoib_neigh_destructor()?

No, not anymore.

> I guess you will not be able.

I agree it's not possible.

> The problem is logical: if destructor needs device, neighbour entry
> _somehow_ have to hold reference to the device (via neigh->dev, neigh->parms,
> whatever). Hence, if we hold neighbour entry, unregister cannot be completed.
> Therefore, destructor cannot refer to device. Q.E.D. :-)
> 
> Seems, releasing dst->neighbour is inevitable.

infiniband sets parm->neigh_destructor, and I search for a way to prevent
this destructor from being called after the module has been unloaded.
Ideas?

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-19 12:05           ` [ofa-general] " Alexey Kuznetsov
@ 2007-03-19 12:13             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-19 12:13 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Michael S. Tsirkin, Linux Kernel Mailing List, netdev, general,
	Roland Dreier, David Miller

> Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> Hello!
> 
> > If a device driver sets neigh_destructor in neigh_params, this could
> > get called after the device has been unregistered and the driver module
> > removed.
> 
> It is the same problem: if dst->neighbour holds neighbour, it should
> not hold device. parms->dev is not supposed to be used after
> neigh_parms_release(). F.e. set parms->dev to NULL to catch bad references.

Yes. I fixed that - simply checking that neighbour->dev is a
loopback device is sufficient to detect the fact that
the device is being unregistered.

> Do you search for a way to find real inifiniband device in
> ipoib_neigh_destructor()?

No, not anymore.

> I guess you will not be able.

I agree it's not possible.

> The problem is logical: if destructor needs device, neighbour entry
> _somehow_ have to hold reference to the device (via neigh->dev, neigh->parms,
> whatever). Hence, if we hold neighbour entry, unregister cannot be completed.
> Therefore, destructor cannot refer to device. Q.E.D. :-)
> 
> Seems, releasing dst->neighbour is inevitable.

infiniband sets parm->neigh_destructor, and I search for a way to prevent
this destructor from being called after the module has been unloaded.
Ideas?

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-19 12:13             ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-19 12:13 UTC (permalink / raw)
  To: Alexey Kuznetsov; +Cc: netdev, Linux Kernel Mailing List, general, David Miller

> Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> Hello!
> 
> > If a device driver sets neigh_destructor in neigh_params, this could
> > get called after the device has been unregistered and the driver module
> > removed.
> 
> It is the same problem: if dst->neighbour holds neighbour, it should
> not hold device. parms->dev is not supposed to be used after
> neigh_parms_release(). F.e. set parms->dev to NULL to catch bad references.

Yes. I fixed that - simply checking that neighbour->dev is a
loopback device is sufficient to detect the fact that
the device is being unregistered.

> Do you search for a way to find real inifiniband device in
> ipoib_neigh_destructor()?

No, not anymore.

> I guess you will not be able.

I agree it's not possible.

> The problem is logical: if destructor needs device, neighbour entry
> _somehow_ have to hold reference to the device (via neigh->dev, neigh->parms,
> whatever). Hence, if we hold neighbour entry, unregister cannot be completed.
> Therefore, destructor cannot refer to device. Q.E.D. :-)
> 
> Seems, releasing dst->neighbour is inevitable.

infiniband sets parm->neigh_destructor, and I search for a way to prevent
this destructor from being called after the module has been unloaded.
Ideas?

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-19 12:12             ` [ofa-general] " Michael S. Tsirkin
@ 2007-03-19 12:59               ` Alexey Kuznetsov
  -1 siblings, 0 replies; 59+ messages in thread
From: Alexey Kuznetsov @ 2007-03-19 12:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Linux Kernel Mailing List, netdev, general, Roland Dreier, David Miller

Hello!

> infiniband sets parm->neigh_destructor, and I search for a way to prevent
> this destructor from being called after the module has been unloaded.
> Ideas?

It must be called in any case to update/release internal ipoib structures.

The idea is to move call of parm->neigh_destructor from neighbour destructor
to the moment when it is unhashed, right after n->dead is set.

infiniband is the only user (atm clip uses it too, but that use is obviously
dummy), so that nobody will be harmed.

But ipoib will have to check for validity of skb->dst->neighbour before
attempt to reinitialize private data on dead (n->dead != 0) neighbour.

Alexey

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-19 12:59               ` Alexey Kuznetsov
  0 siblings, 0 replies; 59+ messages in thread
From: Alexey Kuznetsov @ 2007-03-19 12:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, Linux Kernel Mailing List, general, David Miller

Hello!

> infiniband sets parm->neigh_destructor, and I search for a way to prevent
> this destructor from being called after the module has been unloaded.
> Ideas?

It must be called in any case to update/release internal ipoib structures.

The idea is to move call of parm->neigh_destructor from neighbour destructor
to the moment when it is unhashed, right after n->dead is set.

infiniband is the only user (atm clip uses it too, but that use is obviously
dummy), so that nobody will be harmed.

But ipoib will have to check for validity of skb->dst->neighbour before
attempt to reinitialize private data on dead (n->dead != 0) neighbour.

Alexey

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
  2007-03-19  6:13               ` David Miller
@ 2007-03-19 15:10                   ` Eric W. Biederman
  2007-03-19 15:10                   ` Eric W. Biederman
  1 sibling, 0 replies; 59+ messages in thread
From: Eric W. Biederman @ 2007-03-19 15:10 UTC (permalink / raw)
  To: David Miller; +Cc: mst, kuznet, netdev, linux-kernel, general

David Miller <davem@davemloft.net> writes:


> I think the thing to do is to just leave the loopback references
> in place, try to unregister the per-namespace loopback device,
> and that will safely wait for all the references to go away.

Right.  The only thing I have found that needs to be changed so
far in this area is specifying which loopback device I want
to replace it with.

> If you do it that way, you should need absolutely no changes to
> the other code in this area.
>
> As per Herbert, I think he works on Xen rather than vserver :-)
> Perhaps you're thinking of Alexey Kuznetsov or another one of the
> vserver guys.

I think you are thinking of a different Herbert.  I was thinking of
Herbert Poetzl the vserver maintainer.  Alexey works on OpenVZ.

Until we get the basic architecture merged they are rival projects.

Eric

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-19 15:10                   ` Eric W. Biederman
  0 siblings, 0 replies; 59+ messages in thread
From: Eric W. Biederman @ 2007-03-19 15:10 UTC (permalink / raw)
  To: David Miller; +Cc: kuznet, linux-kernel, general, netdev

David Miller <davem@davemloft.net> writes:


> I think the thing to do is to just leave the loopback references
> in place, try to unregister the per-namespace loopback device,
> and that will safely wait for all the references to go away.

Right.  The only thing I have found that needs to be changed so
far in this area is specifying which loopback device I want
to replace it with.

> If you do it that way, you should need absolutely no changes to
> the other code in this area.
>
> As per Herbert, I think he works on Xen rather than vserver :-)
> Perhaps you're thinking of Alexey Kuznetsov or another one of the
> vserver guys.

I think you are thinking of a different Herbert.  I was thinking of
Herbert Poetzl the vserver maintainer.  Alexey works on OpenVZ.

Until we get the basic architecture merged they are rival projects.

Eric

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-19 12:59               ` [ofa-general] " Alexey Kuznetsov
@ 2007-03-19 15:13                 ` Michael S. Tsirkin
  -1 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-19 15:13 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Michael S. Tsirkin, Linux Kernel Mailing List, netdev, general,
	Roland Dreier, David Miller

> Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> Hello!
> 
> > infiniband sets parm->neigh_destructor, and I search for a way to prevent
> > this destructor from being called after the module has been unloaded.
> > Ideas?
> 
> It must be called in any case to update/release internal ipoib structures.

I don't think there's a problem.
All we do in destructor is release the ipoib_neigh resource.
And on device unregister we release all resources anyway.

> The idea is to move call of parm->neigh_destructor from neighbour destructor
> to the moment when it is unhashed, right after n->dead is set.
>
> infiniband is the only user (atm clip uses it too, but that use is obviously
> dummy), so that nobody will be harmed.

This might work. Could you post a patch to better show what you mean to do?

> But ipoib will have to check for validity of skb->dst->neighbour before
> attempt to reinitialize private data on dead (n->dead != 0) neighbour.

We set a flag before unregister_netdev and test it in start_xmit so
that's covered I think.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-19 15:13                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-19 15:13 UTC (permalink / raw)
  To: Alexey Kuznetsov; +Cc: netdev, Linux Kernel Mailing List, general, David Miller

> Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> Hello!
> 
> > infiniband sets parm->neigh_destructor, and I search for a way to prevent
> > this destructor from being called after the module has been unloaded.
> > Ideas?
> 
> It must be called in any case to update/release internal ipoib structures.

I don't think there's a problem.
All we do in destructor is release the ipoib_neigh resource.
And on device unregister we release all resources anyway.

> The idea is to move call of parm->neigh_destructor from neighbour destructor
> to the moment when it is unhashed, right after n->dead is set.
>
> infiniband is the only user (atm clip uses it too, but that use is obviously
> dummy), so that nobody will be harmed.

This might work. Could you post a patch to better show what you mean to do?

> But ipoib will have to check for validity of skb->dst->neighbour before
> attempt to reinitialize private data on dead (n->dead != 0) neighbour.

We set a flag before unregister_netdev and test it in start_xmit so
that's covered I think.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-19 15:13                 ` [ofa-general] " Michael S. Tsirkin
  (?)
@ 2007-03-19 23:20                 ` Alexey Kuznetsov
  2007-03-20 16:02                   ` Michael S. Tsirkin
  -1 siblings, 1 reply; 59+ messages in thread
From: Alexey Kuznetsov @ 2007-03-19 23:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Linux Kernel Mailing List, netdev, general, Roland Dreier, David Miller

Hello!

> This might work. Could you post a patch to better show what you mean to do?

Here it is.

->neigh_destructor() is killed (not used), replaced with ->neigh_cleanup(),
which is called when neighbor entry goes to dead state. At this point
everything is still valid: neigh->dev, neigh->parms etc.

The device should guarantee that dead neighbor entries (neigh->dead != 0)
do not get private part initialized, otherwise nobody will cleanup it.

I think this is enough for ipoib which is the only user of this thing.
Initialization private part of neighbor entries happens in ipib
start_xmit routine, which is not reached when device is down.
But it would be better to add explicit test for neigh->dead
in any case.


diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index f9dbc6f..2b5c297 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -814,7 +814,7 @@ static void ipoib_set_mcast_list(struct 
 	queue_work(ipoib_workqueue, &priv->restart_task);
 }
 
-static void ipoib_neigh_destructor(struct neighbour *n)
+static void ipoib_neigh_cleanup(struct neighbour *n)
 {
 	struct ipoib_neigh *neigh;
 	struct ipoib_dev_priv *priv = netdev_priv(n->dev);
@@ -822,7 +822,7 @@ static void ipoib_neigh_destructor(struc
 	struct ipoib_ah *ah = NULL;
 
 	ipoib_dbg(priv,
-		  "neigh_destructor for %06x " IPOIB_GID_FMT "\n",
+		  "neigh_cleanup for %06x " IPOIB_GID_FMT "\n",
 		  IPOIB_QPN(n->ha),
 		  IPOIB_GID_RAW_ARG(n->ha + 4));
 
@@ -874,7 +874,7 @@ void ipoib_neigh_free(struct net_device 
 
 static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms)
 {
-	parms->neigh_destructor = ipoib_neigh_destructor;
+	parms->neigh_cleanup = ipoib_neigh_cleanup;
 
 	return 0;
 }
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 3725b93..ad7fe11 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -36,7 +36,7 @@ struct neigh_parms
 	struct net_device *dev;
 	struct neigh_parms *next;
 	int	(*neigh_setup)(struct neighbour *);
-	void	(*neigh_destructor)(struct neighbour *);
+	void	(*neigh_cleanup)(struct neighbour *);
 	struct neigh_table *tbl;
 
 	void	*sysctl_table;
diff --git a/net/atm/clip.c b/net/atm/clip.c
index ebb5d0c..8c38258 100644
--- a/net/atm/clip.c
+++ b/net/atm/clip.c
@@ -261,14 +261,6 @@ static void clip_pop(struct atm_vcc *vcc
 	spin_unlock_irqrestore(&PRIV(dev)->xoff_lock, flags);
 }
 
-static void clip_neigh_destroy(struct neighbour *neigh)
-{
-	DPRINTK("clip_neigh_destroy (neigh %p)\n", neigh);
-	if (NEIGH2ENTRY(neigh)->vccs)
-		printk(KERN_CRIT "clip_neigh_destroy: vccs != NULL !!!\n");
-	NEIGH2ENTRY(neigh)->vccs = (void *) NEIGHBOR_DEAD;
-}
-
 static void clip_neigh_solicit(struct neighbour *neigh, struct sk_buff *skb)
 {
 	DPRINTK("clip_neigh_solicit (neigh %p, skb %p)\n", neigh, skb);
@@ -342,7 +334,6 @@ static struct neigh_table clip_tbl = {
 	/* parameters are copied from ARP ... */
 	.parms = {
 		.tbl 			= &clip_tbl,
-		.neigh_destructor	= clip_neigh_destroy,
 		.base_reachable_time 	= 30 * HZ,
 		.retrans_time 		= 1 * HZ,
 		.gc_staletime 		= 60 * HZ,
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 3183142..cfc6001 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -140,6 +140,8 @@ static int neigh_forced_gc(struct neigh_
 				n->dead = 1;
 				shrunk	= 1;
 				write_unlock(&n->lock);
+				if (n->parms->neigh_cleanup)
+					n->parms->neigh_cleanup(n);
 				neigh_release(n);
 				continue;
 			}
@@ -211,6 +213,8 @@ static void neigh_flush_dev(struct neigh
 				NEIGH_PRINTK2("neigh %p is stray.\n", n);
 			}
 			write_unlock(&n->lock);
+			if (n->parms->neigh_cleanup)
+				n->parms->neigh_cleanup(n);
 			neigh_release(n);
 		}
 	}
@@ -582,9 +586,6 @@ void neigh_destroy(struct neighbour *nei
 			kfree(hh);
 	}
 
-	if (neigh->parms->neigh_destructor)
-		(neigh->parms->neigh_destructor)(neigh);
-
 	skb_queue_purge(&neigh->arp_queue);
 
 	dev_put(neigh->dev);
@@ -675,6 +676,8 @@ static void neigh_periodic_timer(unsigne
 			*np = n->next;
 			n->dead = 1;
 			write_unlock(&n->lock);
+			if (n->parms->neigh_cleanup)
+				n->parms->neigh_cleanup(n);
 			neigh_release(n);
 			continue;
 		}
@@ -2088,8 +2091,11 @@ void __neigh_for_each_release(struct nei
 			} else
 				np = &n->next;
 			write_unlock(&n->lock);
-			if (release)
+			if (release) {
+				if (n->parms->neigh_cleanup)
+					n->parms->neigh_cleanup(n);
 				neigh_release(n);
+			}
 		}
 	}
 }

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-19 23:20                 ` Alexey Kuznetsov
@ 2007-03-20 16:02                   ` Michael S. Tsirkin
  2007-03-20 23:34                       ` [ofa-general] " David Miller
  0 siblings, 1 reply; 59+ messages in thread
From: Michael S. Tsirkin @ 2007-03-20 16:02 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Michael S. Tsirkin, Linux Kernel Mailing List, netdev, general,
	Roland Dreier, David Miller

> Quoting Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>:
> Subject: Re: dst_ifdown breaks infiniband?
> 
> Hello!
> 
> > This might work. Could you post a patch to better show what you mean to do?
> 
> Here it is.
> 
> ->neigh_destructor() is killed (not used), replaced with ->neigh_cleanup(),
> which is called when neighbor entry goes to dead state. At this point
> everything is still valid: neigh->dev, neigh->parms etc.
> 
> The device should guarantee that dead neighbor entries (neigh->dead != 0)
> do not get private part initialized, otherwise nobody will cleanup it.

OK, I stress-tested this for about 9 hours - apparently this resolves the issues
I was seeing both with hotplug device unregister and module removal.

This is an old bug, but somehow it did not trigger on
older kernels - some code restructuring in infiniband is probably
the reason - so from that POV it's a regression in 2.6.21.
So now several people are experiencing these crashes.

David, Alexey, what do you think about this patch? Is it right?
Could this patch be considered for 2.6.21?

Acked-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>

> 
> I think this is enough for ipoib which is the only user of this thing.
> Initialization private part of neighbor entries happens in ipib
> start_xmit routine, which is not reached when device is down.
> But it would be better to add explicit test for neigh->dead
> in any case.

Additionally, ip over infiniband actually tests a separate flag
IPOIB_FLAG_ADMIN_UP before looking at an skb.
This flag is cleared before the device goes down.
Taken together this should be sufficient I think.

-- 
MST

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: dst_ifdown breaks infiniband?
  2007-03-20 16:02                   ` Michael S. Tsirkin
@ 2007-03-20 23:34                       ` David Miller
  0 siblings, 0 replies; 59+ messages in thread
From: David Miller @ 2007-03-20 23:34 UTC (permalink / raw)
  To: mst; +Cc: kuznet, linux-kernel, netdev, general, rolandd

From: "Michael S. Tsirkin" <mst@dev.mellanox.co.il>
Date: Tue, 20 Mar 2007 18:02:17 +0200

> David, Alexey, what do you think about this patch? Is it right?
> Could this patch be considered for 2.6.21?
> 
> Acked-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>

I plan to apply it and merge.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [ofa-general] Re: dst_ifdown breaks infiniband?
@ 2007-03-20 23:34                       ` David Miller
  0 siblings, 0 replies; 59+ messages in thread
From: David Miller @ 2007-03-20 23:34 UTC (permalink / raw)
  To: mst; +Cc: kuznet, linux-kernel, general, netdev

From: "Michael S. Tsirkin" <mst@dev.mellanox.co.il>
Date: Tue, 20 Mar 2007 18:02:17 +0200

> David, Alexey, what do you think about this patch? Is it right?
> Could this patch be considered for 2.6.21?
> 
> Acked-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>

I plan to apply it and merge.

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2007-03-20 23:35 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-18 15:55 dst_ifdown breaks infiniband? Michael S. Tsirkin
2007-03-18 15:55 ` [ofa-general] " Michael S. Tsirkin
2007-03-18 19:12 ` Alexey Kuznetsov
2007-03-18 19:12   ` [ofa-general] " Alexey Kuznetsov
2007-03-18 19:46   ` Michael S. Tsirkin
2007-03-18 19:46     ` [ofa-general] " Michael S. Tsirkin
2007-03-18 19:55     ` Alexey Kuznetsov
2007-03-18 19:55       ` [ofa-general] " Alexey Kuznetsov
2007-03-18 20:24       ` Michael S. Tsirkin
2007-03-18 20:24         ` [ofa-general] " Michael S. Tsirkin
2007-03-18 19:53   ` Michael S. Tsirkin
2007-03-18 19:53     ` [ofa-general] " Michael S. Tsirkin
2007-03-18 20:18     ` Alexey Kuznetsov
2007-03-18 20:29       ` Michael S. Tsirkin
2007-03-18 20:29         ` [ofa-general] " Michael S. Tsirkin
2007-03-19  9:36       ` Michael S. Tsirkin
2007-03-19  9:36         ` [ofa-general] " Michael S. Tsirkin
2007-03-19  9:55         ` Michael S. Tsirkin
2007-03-19  9:55           ` [ofa-general] " Michael S. Tsirkin
2007-03-19 12:05         ` Alexey Kuznetsov
2007-03-19 12:05           ` [ofa-general] " Alexey Kuznetsov
2007-03-19 12:12           ` Michael S. Tsirkin
2007-03-19 12:12             ` [ofa-general] " Michael S. Tsirkin
2007-03-19 12:59             ` Alexey Kuznetsov
2007-03-19 12:59               ` [ofa-general] " Alexey Kuznetsov
2007-03-19 15:13               ` Michael S. Tsirkin
2007-03-19 15:13                 ` [ofa-general] " Michael S. Tsirkin
2007-03-19 23:20                 ` Alexey Kuznetsov
2007-03-20 16:02                   ` Michael S. Tsirkin
2007-03-20 23:34                     ` David Miller
2007-03-20 23:34                       ` [ofa-general] " David Miller
2007-03-19 12:13           ` Michael S. Tsirkin
2007-03-19 12:13             ` [ofa-general] " Michael S. Tsirkin
2007-03-18 20:25   ` Michael S. Tsirkin
2007-03-18 22:24     ` [ofa-general] " Eric W. Biederman
2007-03-18 22:24       ` Eric W. Biederman
2007-03-18 22:36       ` Michael S. Tsirkin
2007-03-18 22:36         ` Michael S. Tsirkin
2007-03-18 22:42         ` Michael S. Tsirkin
2007-03-19  0:13           ` David Miller
2007-03-19  5:19             ` Michael S. Tsirkin
2007-03-19  5:19               ` Michael S. Tsirkin
2007-03-19  5:30             ` Eric W. Biederman
2007-03-19  5:30               ` Eric W. Biederman
2007-03-19  6:13               ` David Miller
2007-03-19  9:34                 ` Alexey Kuznetsov
2007-03-19 15:10                 ` Eric W. Biederman
2007-03-19 15:10                   ` Eric W. Biederman
2007-03-19  9:24           ` Alexey Kuznetsov
2007-03-19  9:24             ` Alexey Kuznetsov
2007-03-19  9:33             ` Michael S. Tsirkin
2007-03-19  9:20       ` Alexey Kuznetsov
2007-03-18 20:33   ` Michael S. Tsirkin
2007-03-18 21:06     ` Michael S. Tsirkin
2007-03-18 21:06       ` [ofa-general] " Michael S. Tsirkin
2007-03-18 21:20       ` Michael S. Tsirkin
2007-03-18 21:20         ` [ofa-general] " Michael S. Tsirkin
2007-03-19  5:15         ` Michael S. Tsirkin
2007-03-19  5:15           ` [ofa-general] " Michael S. Tsirkin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.