netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Freeing 'temporary' IPv4 route table entries.
@ 2020-01-31 10:26 David Laight
  2020-01-31 15:53 ` Eric Dumazet
  0 siblings, 1 reply; 4+ messages in thread
From: David Laight @ 2020-01-31 10:26 UTC (permalink / raw)
  To: netdev

If I call sendmsg() on a raw socket (or probably
an unconnected UDP one) rt_dst_alloc() is called
in the bowels of ip_route_output_flow() to hold
the remote address.

Much later __dev_queue_xmit() calls dst_release()
to delete the 'dst' referenced from the skb.

Prior to f8864972 it did just that.
Afterwards the actual delete is 'laundered' through the
rcu callbacks.
This is probably ok for dst that are actually attached
to sockets or tunnels (which aren't freed very often).
But it leads to horrid long rcu callback sequences
when a lot of messages are sent.
(A sample of 1 gave nearly 100 deletes in one go.)
There is also the additional cost of deferring the free
(and the extra retpoline etc).

ISTM that the dst_alloc() done during a send should
set a flag so that the 'dst' can be immediately
freed since it is known that no one can be picking up
a reference as it is being freed.

Thoughts?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Freeing 'temporary' IPv4 route table entries.
  2020-01-31 10:26 Freeing 'temporary' IPv4 route table entries David Laight
@ 2020-01-31 15:53 ` Eric Dumazet
  2020-01-31 16:48   ` David Laight
  2020-02-03 12:12   ` David Laight
  0 siblings, 2 replies; 4+ messages in thread
From: Eric Dumazet @ 2020-01-31 15:53 UTC (permalink / raw)
  To: David Laight, netdev



On 1/31/20 2:26 AM, David Laight wrote:
> If I call sendmsg() on a raw socket (or probably
> an unconnected UDP one) rt_dst_alloc() is called
> in the bowels of ip_route_output_flow() to hold
> the remote address.
> 
> Much later __dev_queue_xmit() calls dst_release()
> to delete the 'dst' referenced from the skb.
> 
> Prior to f8864972 it did just that.
> Afterwards the actual delete is 'laundered' through the
> rcu callbacks.
> This is probably ok for dst that are actually attached
> to sockets or tunnels (which aren't freed very often).
> But it leads to horrid long rcu callback sequences
> when a lot of messages are sent.
> (A sample of 1 gave nearly 100 deletes in one go.)
> There is also the additional cost of deferring the free
> (and the extra retpoline etc).
> 
> ISTM that the dst_alloc() done during a send should
> set a flag so that the 'dst' can be immediately
> freed since it is known that no one can be picking up
> a reference as it is being freed.
> 
> Thoughts?
> 

I thought these routes were cached in per-cpu caches.

At least for UDP I do not see rcu callbacks being queueed.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: Freeing 'temporary' IPv4 route table entries.
  2020-01-31 15:53 ` Eric Dumazet
@ 2020-01-31 16:48   ` David Laight
  2020-02-03 12:12   ` David Laight
  1 sibling, 0 replies; 4+ messages in thread
From: David Laight @ 2020-01-31 16:48 UTC (permalink / raw)
  To: 'Eric Dumazet', netdev

From: Eric Dumazet
> Sent: 31 January 2020 15:54
> 
> On 1/31/20 2:26 AM, David Laight wrote:
> > If I call sendmsg() on a raw socket (or probably
> > an unconnected UDP one) rt_dst_alloc() is called
> > in the bowels of ip_route_output_flow() to hold
> > the remote address.
> >
> > Much later __dev_queue_xmit() calls dst_release()
> > to delete the 'dst' referenced from the skb.
> >
> > Prior to f8864972 it did just that.
> > Afterwards the actual delete is 'laundered' through the
> > rcu callbacks.
> > This is probably ok for dst that are actually attached
> > to sockets or tunnels (which aren't freed very often).
> > But it leads to horrid long rcu callback sequences
> > when a lot of messages are sent.
> > (A sample of 1 gave nearly 100 deletes in one go.)
> > There is also the additional cost of deferring the free
> > (and the extra retpoline etc).
> >
> > ISTM that the dst_alloc() done during a send should
> > set a flag so that the 'dst' can be immediately
> > freed since it is known that no one can be picking up
> > a reference as it is being freed.
> >
> > Thoughts?
> >
> 
> I thought these routes were cached in per-cpu caches.
> 
> At least for UDP I do not see rcu callbacks being queueed.

For rawip (actually sending UDP) all the sends go:

282710.407759 |   0) pid-29158  |               |          inet_sendmsg() {
282710.407759 |   0) pid-29158  |   0.093 us    |            inet_send_prepare();
282710.407759 |   0) pid-29158  |               |            raw_sendmsg() {
282710.407759 |   0) pid-29158  |   0.094 us    |              security_sk_classify_flow();
282710.407760 |   0) pid-29158  |               |              ip_route_output_flow() {
282710.407760 |   0) pid-29158  |               |                ip_route_output_key_hash() {
282710.407760 |   0) pid-29158  |               |                  ip_route_output_key_hash_rcu() {
282710.407760 |   0) pid-29158  |   0.160 us    |                    fib_table_lookup();
282710.407760 |   0) pid-29158  |               |                    fib_select_path() {
282710.407760 |   0) pid-29158  |   0.102 us    |                      fib_result_prefsrc();
282710.407760 |   0) pid-29158  |   0.302 us    |                    } /* fib_select_path */
282710.407761 |   0) pid-29158  |   0.103 us    |                    find_exception();
282710.407761 |   0) pid-29158  |               |                    rt_dst_alloc() {
282710.407761 |   0) pid-29158  |               |                      dst_alloc() {
282710.407761 |   0) pid-29158  |               |                        kmem_cache_alloc() {
282710.407761 |   0) pid-29158  |   0.102 us    |                          should_failslab();
282710.407761 |   0) pid-29158  |   0.362 us    |                        } /* kmem_cache_alloc */
282710.407761 |   0) pid-29158  |   0.098 us    |                        dst_init();
282710.407762 |   0) pid-29158  |   0.747 us    |                      } /* dst_alloc */
282710.407762 |   0) pid-29158  |   0.936 us    |                    } /* rt_dst_alloc */

Then dst_release() gets called from __dev_queue_xmit().
In this test all the destinations are the same, but in real life
there will be 100s of different addresses.

Are the addresses inserted in the cache at the start of the send and
then deleted once the message is sent?

I'm not sure there are any 'connected' sockets to that IP address.
(This is SIP + RTP testing.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: Freeing 'temporary' IPv4 route table entries.
  2020-01-31 15:53 ` Eric Dumazet
  2020-01-31 16:48   ` David Laight
@ 2020-02-03 12:12   ` David Laight
  1 sibling, 0 replies; 4+ messages in thread
From: David Laight @ 2020-02-03 12:12 UTC (permalink / raw)
  To: 'Eric Dumazet', netdev

From: Eric Dumazet
> Sent: 31 January 2020 15:54
> On 1/31/20 2:26 AM, David Laight wrote:
> > If I call sendmsg() on a raw socket (or probably
> > an unconnected UDP one) rt_dst_alloc() is called
> > in the bowels of ip_route_output_flow() to hold
> > the remote address.
> >
> > Much later __dev_queue_xmit() calls dst_release()
> > to delete the 'dst' referenced from the skb.
> >
> > Prior to f8864972 it did just that.
> > Afterwards the actual delete is 'laundered' through the
> > rcu callbacks.
> > This is probably ok for dst that are actually attached
> > to sockets or tunnels (which aren't freed very often).
> > But it leads to horrid long rcu callback sequences
> > when a lot of messages are sent.
> > (A sample of 1 gave nearly 100 deletes in one go.)
> > There is also the additional cost of deferring the free
> > (and the extra retpoline etc).
> >
> > ISTM that the dst_alloc() done during a send should
> > set a flag so that the 'dst' can be immediately
> > freed since it is known that no one can be picking up
> > a reference as it is being freed.
> >
> > Thoughts?
> >
> 
> I thought these routes were cached in per-cpu caches.
> 
> At least for UDP I do not see rcu callbacks being queueed.

I've done a bit more investigation.

For raw_ip sockets with inet->hdrincl set (ie the application
builds the IPv4 header) flowi4_flags has FLOWI_FLAG_KNOWN_NH set.

This is detected inside __mkroute_output() (in ipv4/route.c)
and forces rt_dst_alloc() be called instead of using the
'dst' from (I think) fib_select_path().
(Is this basically the arp table entry?)

rt_set_nexthop() then calls rt_add_uncached_list().

I suspect that dst_release() is called after every
transmit - but normally just decrements the ref count.
However for raw sends it frees the uncached route.

I think the 'fault' is down to c27c9322d which fixed
an issue where the code was using the IP address from the
pre-built packet instead of the one from the destination
address.

I think there are two issues:
1) if __mkroute_output() creates an 'uncached' route
   it can be freed without waiting for rcu grace.
2) if a raw packets destination address matches then
   the cached route can be used.

Oh - nothing seems to check DST_HOST any more

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-02-03 12:12 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-31 10:26 Freeing 'temporary' IPv4 route table entries David Laight
2020-01-31 15:53 ` Eric Dumazet
2020-01-31 16:48   ` David Laight
2020-02-03 12:12   ` David Laight

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).