All of lore.kernel.org
 help / color / mirror / Atom feed
* problem with rtnetlink 'reference' count
@ 2017-10-23 14:25 Peter Zijlstra
  2017-10-23 15:32 ` Florian Westphal
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2017-10-23 14:25 UTC (permalink / raw)
  To: David Miller; +Cc: fw, netdev

Hi,

I just ran across commit:

  019a316992ee ("rtnetlink: add reference counting to prevent module unload while dump is in progress")

And that commit is _completely_ broken.

 1) it not in fact a refcount, so using refcount_t is silly
 2) there is a distinct lack of memory barriers, so we can easily
    observe the decrement while the msg_handler is still in progress.
 3) waiting with a schedule()/yield() loop is complete crap and subject
    life-locks, imagine doing that rtnl_unregister_all() from a RT task.

Please fix..

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: problem with rtnetlink 'reference' count
  2017-10-23 14:25 problem with rtnetlink 'reference' count Peter Zijlstra
@ 2017-10-23 15:32 ` Florian Westphal
  2017-10-23 16:20   ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Westphal @ 2017-10-23 15:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: David Miller, fw, netdev

Peter Zijlstra <peterz@infradead.org> wrote:
>   019a316992ee ("rtnetlink: add reference counting to prevent module unload while dump is in progress")
> 
> And that commit is _completely_ broken.
> 
>  1) it not in fact a refcount, so using refcount_t is silly

Your suggestion is...?

>  2) there is a distinct lack of memory barriers, so we can easily
>     observe the decrement while the msg_handler is still in progress.

I guess you mean it needs:

+	smp_mb__before_atomic();
	refcount_dec(&rtnl_msg_handlers_ref[family]);
?

However, this refcount_dec is misplaced anyway as it would need
to occur from nlcb->done() (the handler function gets stored in socket for
use by next recvmsg), so this change is indeed not helpful at all.

>  3) waiting with a schedule()/yield() loop is complete crap and subject
>     life-locks, imagine doing that rtnl_unregister_all() from a RT task.

Whats the alternative?

Only one I see is to pass THIS_MODULE to hook reg/unreg so we know what module registered the
family for purpose of module get/put but thats going to be messy.

Alternatively we can of course sleep instead of schedule() but that
doesn't appear too appealing either (albeit it is a lot less intrusive).

Any other idea?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: problem with rtnetlink 'reference' count
  2017-10-23 15:32 ` Florian Westphal
@ 2017-10-23 16:20   ` Peter Zijlstra
  2017-10-23 16:37     ` Florian Westphal
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2017-10-23 16:20 UTC (permalink / raw)
  To: Florian Westphal; +Cc: David Miller, netdev

On Mon, Oct 23, 2017 at 05:32:00PM +0200, Florian Westphal wrote:

> >  1) it not in fact a refcount, so using refcount_t is silly
> 
> Your suggestion is...?

Normal atomic_t

> >  2) there is a distinct lack of memory barriers, so we can easily
> >     observe the decrement while the msg_handler is still in progress.
> 
> I guess you mean it needs:
> 
> +	smp_mb__before_atomic();
> 	refcount_dec(&rtnl_msg_handlers_ref[family]);
> ?

Yes, but also:

	atomic_inc();
	smp_mb__after_atomic();

To avoid the problem of te inc being observed late.

> However, this refcount_dec is misplaced anyway as it would need
> to occur from nlcb->done() (the handler function gets stored in socket for
> use by next recvmsg), so this change is indeed not helpful at all.
> 
> >  3) waiting with a schedule()/yield() loop is complete crap and subject
> >     life-locks, imagine doing that rtnl_unregister_all() from a RT task.

> Alternatively we can of course sleep instead of schedule() but that
> doesn't appear too appealing either (albeit it is a lot less intrusive).

That is much better than a yield loop.

> Any other idea?

This rtnetlink_rcv_msg() is called from softirq-context, right? Also,
all that stuff happens with rcu_read_lock() held.

So why isn't that synchronize_net() call sufficient? You first clear
rtnl_msg_handlers[protocol], and then you do synchronize_net() which
will wait for all concurrent softirq handlers to complete. Which, if
rtnetlink_rcv_msg() is called from softir, guarantees nobody still uses
it.


Also, if that is all softirq, you should maybe use rcu_read_lock_bh(),
alternatively you should use synchronize_rcu(), as is its a bit
inconsistent.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: problem with rtnetlink 'reference' count
  2017-10-23 16:20   ` Peter Zijlstra
@ 2017-10-23 16:37     ` Florian Westphal
  2017-10-23 18:31       ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Westphal @ 2017-10-23 16:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Florian Westphal, David Miller, netdev

Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Oct 23, 2017 at 05:32:00PM +0200, Florian Westphal wrote:
> 
> > >  1) it not in fact a refcount, so using refcount_t is silly
> > 
> > Your suggestion is...?
> 
> Normal atomic_t

Why?  refcount_t gives debug options to catch leaks/underflows,
atomic_t does not.

Is refcount_t only supposed to be used with dec_and_test patterns?

> To avoid the problem of te inc being observed late.
> 
> > However, this refcount_dec is misplaced anyway as it would need
> > to occur from nlcb->done() (the handler function gets stored in socket for
> > use by next recvmsg), so this change is indeed not helpful at all.
> > 
> > >  3) waiting with a schedule()/yield() loop is complete crap and subject
> > >     life-locks, imagine doing that rtnl_unregister_all() from a RT task.
> 
> > Alternatively we can of course sleep instead of schedule() but that
> > doesn't appear too appealing either (albeit it is a lot less intrusive).
> 
> That is much better than a yield loop.
> 
> > Any other idea?
> 
> This rtnetlink_rcv_msg() is called from softirq-context, right? Also,
> all that stuff happens with rcu_read_lock() held.

No, its called from process context.

I need to run now but plan to test and submit something like this:

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -263,7 +263,7 @@ void rtnl_unregister_all(int protocol)
 	synchronize_net();
 
 	while (refcount_read(&rtnl_msg_handlers_ref[protocol]) > 1)
-		schedule();
+		msleep(1);
 	kfree(handlers);
 }
 EXPORT_SYMBOL_GPL(rtnl_unregister_all);
@@ -4149,6 +4149,16 @@ static int rtnl_stats_dump(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }
 
+
+static int rtnl_dumper_done(struct netlink_callback *cb)
+{
+	unsigned int family = (unsigned long)cb->data;
+
+	refcount_dec(&rtnl_msg_handlers_ref[family]);
+	smp_mb__after_atomic();
+	return 0;
+}
+
 /* Process one rtnetlink message. */
 
 static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
@@ -4207,6 +4217,7 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
 		}
 
 		refcount_inc(&rtnl_msg_handlers_ref[family]);
+		smp_mb__after_atomic();
 
 		if (type == RTM_GETLINK - RTM_BASE)
 			min_dump_alloc = rtnl_calcit(skb, nlh);
@@ -4217,11 +4228,12 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
 		{
 			struct netlink_dump_control c = {
 				.dump		= dumpit,
+				.done		= rtnl_dumper_done,
 				.min_dump_alloc	= min_dump_alloc,
+				.data		= (void *)(unsigned long)family,
 			};
 			err = netlink_dump_start(rtnl, skb, nlh, &c);
 		}
-		refcount_dec(&rtnl_msg_handlers_ref[family]);
 		return err;
 	}
 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: problem with rtnetlink 'reference' count
  2017-10-23 16:37     ` Florian Westphal
@ 2017-10-23 18:31       ` Peter Zijlstra
  2017-10-23 19:37         ` Florian Westphal
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2017-10-23 18:31 UTC (permalink / raw)
  To: Florian Westphal; +Cc: David Miller, netdev

On Mon, Oct 23, 2017 at 06:37:44PM +0200, Florian Westphal wrote:

> Is refcount_t only supposed to be used with dec_and_test patterns?

Yes, for reference counting objects.

> > This rtnetlink_rcv_msg() is called from softirq-context, right? Also,
> > all that stuff happens with rcu_read_lock() held.
> 
> No, its called from process context.

OK, so then why not do something like so?


---
 net/core/rtnetlink.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index d4bcdcc68e92..473cabd0a551 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -128,7 +128,6 @@ bool lockdep_rtnl_is_held(void)
 #endif /* #ifdef CONFIG_PROVE_LOCKING */
 
 static struct rtnl_link __rcu *rtnl_msg_handlers[RTNL_FAMILY_MAX + 1];
-static refcount_t rtnl_msg_handlers_ref[RTNL_FAMILY_MAX + 1];
 
 static inline int rtm_msgindex(int msgtype)
 {
@@ -260,10 +259,18 @@ void rtnl_unregister_all(int protocol)
 	RCU_INIT_POINTER(rtnl_msg_handlers[protocol], NULL);
 	rtnl_unlock();
 
+	/*
+	 * XXX explain what this is for...
+	 */
 	synchronize_net();
 
-	while (refcount_read(&rtnl_msg_handlers_ref[protocol]) > 1)
-		schedule();
+	/*
+	 * This serializes against the rcu_read_lock() section in
+	 * rtnetlink_rcv_msg() such that after this, all prior instances have
+	 * completed and future instances must observe the NULL written above.
+	 */
+	synchronize_rcu();
+
 	kfree(handlers);
 }
 EXPORT_SYMBOL_GPL(rtnl_unregister_all);
@@ -4203,8 +4210,6 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
 				goto err_unlock;
 		}
 
-		refcount_inc(&rtnl_msg_handlers_ref[family]);
-
 		if (type == RTM_GETLINK - RTM_BASE)
 			min_dump_alloc = rtnl_calcit(skb, nlh);
 
@@ -4218,7 +4223,6 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
 			};
 			err = netlink_dump_start(rtnl, skb, nlh, &c);
 		}
-		refcount_dec(&rtnl_msg_handlers_ref[family]);
 		return err;
 	}
 
@@ -4230,12 +4234,10 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	flags = READ_ONCE(handlers[type].flags);
 	if (flags & RTNL_FLAG_DOIT_UNLOCKED) {
-		refcount_inc(&rtnl_msg_handlers_ref[family]);
 		doit = READ_ONCE(handlers[type].doit);
 		rcu_read_unlock();
 		if (doit)
 			err = doit(skb, nlh, extack);
-		refcount_dec(&rtnl_msg_handlers_ref[family]);
 		return err;
 	}
 
@@ -4333,9 +4335,6 @@ void __init rtnetlink_init(void)
 {
 	int i;
 
-	for (i = 0; i < ARRAY_SIZE(rtnl_msg_handlers_ref); i++)
-		refcount_set(&rtnl_msg_handlers_ref[i], 1);
-
 	if (register_pernet_subsys(&rtnetlink_net_ops))
 		panic("rtnetlink_init: cannot initialize rtnetlink\n");
 

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: problem with rtnetlink 'reference' count
  2017-10-23 18:31       ` Peter Zijlstra
@ 2017-10-23 19:37         ` Florian Westphal
  2017-10-24  8:33           ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Westphal @ 2017-10-23 19:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Florian Westphal, David Miller, netdev

Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Oct 23, 2017 at 06:37:44PM +0200, Florian Westphal wrote:
> 
> > Is refcount_t only supposed to be used with dec_and_test patterns?
> 
> Yes, for reference counting objects.

Hmm, I still feel its appropriate, but anyway:

> > > This rtnetlink_rcv_msg() is called from softirq-context, right? Also,
> > > all that stuff happens with rcu_read_lock() held.
> > 
> > No, its called from process context.
> 
> OK, so then why not do something like so?
> @@ -260,10 +259,18 @@ void rtnl_unregister_all(int protocol)
>  	RCU_INIT_POINTER(rtnl_msg_handlers[protocol], NULL);
>  	rtnl_unlock();
>  
> +	/*
> +	 * XXX explain what this is for...
> +	 */
>  	synchronize_net();
>  
> -	while (refcount_read(&rtnl_msg_handlers_ref[protocol]) > 1)
> -		schedule();
> +	/*
> +	 * This serializes against the rcu_read_lock() section in
> +	 * rtnetlink_rcv_msg() such that after this, all prior instances have
> +	 * completed and future instances must observe the NULL written above.
> +	 */
> +	synchronize_rcu();

Yes, but that won't help with running dumpers, see below.

> @@ -4218,7 +4223,6 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
>  			};
>  			err = netlink_dump_start(rtnl, skb, nlh, &c);

This will copy .dumper function address to nlh->cb for later invocation
when dump gets resumed (its called from netlink_recvmsg()),
so this can return to userspace and dump can be resumed on next recv().

Because the dumper function was stored in the socket, NULLing the
rtnl_msg_handlers[] only prevents new dumps from starting but not
already set-up dumps from resuming.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: problem with rtnetlink 'reference' count
  2017-10-23 19:37         ` Florian Westphal
@ 2017-10-24  8:33           ` Peter Zijlstra
  2017-10-24  9:10             ` Florian Westphal
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2017-10-24  8:33 UTC (permalink / raw)
  To: Florian Westphal; +Cc: David Miller, netdev

On Mon, Oct 23, 2017 at 09:37:03PM +0200, Florian Westphal wrote:

> > OK, so then why not do something like so?
> > @@ -260,10 +259,18 @@ void rtnl_unregister_all(int protocol)
> >  	RCU_INIT_POINTER(rtnl_msg_handlers[protocol], NULL);
> >  	rtnl_unlock();
> >  
> > +	/*
> > +	 * XXX explain what this is for...
> > +	 */
> >  	synchronize_net();
> >  
> > -	while (refcount_read(&rtnl_msg_handlers_ref[protocol]) > 1)
> > -		schedule();
> > +	/*
> > +	 * This serializes against the rcu_read_lock() section in
> > +	 * rtnetlink_rcv_msg() such that after this, all prior instances have
> > +	 * completed and future instances must observe the NULL written above.
> > +	 */
> > +	synchronize_rcu();
> 
> Yes, but that won't help with running dumpers, see below.
> 
> > @@ -4218,7 +4223,6 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
> >  			};
> >  			err = netlink_dump_start(rtnl, skb, nlh, &c);
> 
> This will copy .dumper function address to nlh->cb for later invocation
> when dump gets resumed (its called from netlink_recvmsg()),
> so this can return to userspace and dump can be resumed on next recv().
> 
> Because the dumper function was stored in the socket, NULLing the
> rtnl_msg_handlers[] only prevents new dumps from starting but not
> already set-up dumps from resuming.

but netlink_dump_start() will actually grab a reference on the module;
but it does so too late.

Would it not be sufficient to put that try_module_get() under the
rcu_read_lock()?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: problem with rtnetlink 'reference' count
  2017-10-24  8:33           ` Peter Zijlstra
@ 2017-10-24  9:10             ` Florian Westphal
  0 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2017-10-24  9:10 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Florian Westphal, David Miller, netdev

Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Oct 23, 2017 at 09:37:03PM +0200, Florian Westphal wrote:
> 
> > > OK, so then why not do something like so?
> > > @@ -260,10 +259,18 @@ void rtnl_unregister_all(int protocol)
> > >  	RCU_INIT_POINTER(rtnl_msg_handlers[protocol], NULL);
> > >  	rtnl_unlock();
> > >  
> > > +	/*
> > > +	 * XXX explain what this is for...
> > > +	 */
> > >  	synchronize_net();
> > >  
> > > -	while (refcount_read(&rtnl_msg_handlers_ref[protocol]) > 1)
> > > -		schedule();
> > > +	/*
> > > +	 * This serializes against the rcu_read_lock() section in
> > > +	 * rtnetlink_rcv_msg() such that after this, all prior instances have
> > > +	 * completed and future instances must observe the NULL written above.
> > > +	 */
> > > +	synchronize_rcu();
> > 
> > Yes, but that won't help with running dumpers, see below.
> > 
> > > @@ -4218,7 +4223,6 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
> > >  			};
> > >  			err = netlink_dump_start(rtnl, skb, nlh, &c);
> > 
> > This will copy .dumper function address to nlh->cb for later invocation
> > when dump gets resumed (its called from netlink_recvmsg()),
> > so this can return to userspace and dump can be resumed on next recv().
> > 
> > Because the dumper function was stored in the socket, NULLing the
> > rtnl_msg_handlers[] only prevents new dumps from starting but not
> > already set-up dumps from resuming.
> 
> but netlink_dump_start() will actually grab a reference on the module;
> but it does so too late.

Yes, but rtnl_register() doesn't pass a module pointer, i.e. in
rtnetlink_rcv_msg we can't set up the module pointer to the correct one.

We'd have to pass the module pointer to rtnl_register() and store it,
or add a new api that passes it, or, yet another option, avoid use
of .dumpit from modular users and keep the dump handling fully within
these modules.

(the api is from ancient times when it was only used from builtin
 code paths).

I'll try a few things tomorrow to see what makes most sense or if there
are any other options.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-10-24  9:11 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-23 14:25 problem with rtnetlink 'reference' count Peter Zijlstra
2017-10-23 15:32 ` Florian Westphal
2017-10-23 16:20   ` Peter Zijlstra
2017-10-23 16:37     ` Florian Westphal
2017-10-23 18:31       ` Peter Zijlstra
2017-10-23 19:37         ` Florian Westphal
2017-10-24  8:33           ` Peter Zijlstra
2017-10-24  9:10             ` Florian Westphal

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.