* Performance regression on kernels 3.10 and newer
@ 2014-08-14 18:19 Alexander Duyck
  2014-08-14 18:46 ` Eric Dumazet
  0 siblings, 1 reply; 57+ messages in thread
From: Alexander Duyck @ 2014-08-14 18:19 UTC (permalink / raw)
  To: David Miller, Eric Dumazet; +Cc: netdev

Yesterday I tripped over a bit of an issue and it seems like we are
seeing significant cache thrash on kernels 3.10 and newer when running
multiple streams of small packet stress on multiple NUMA nodes for a
single NIC.

I did some bisection and found that I was able to trace it back to
upstream commit 093162553c33e9479283e107b4431378271c735d (tcp: force a
dst refcount when prequeue packet).

Recreating this issue is pretty straightforward.  All I did was set up 2
dual-socket Xeon systems connected back to back with ixgbe and run the
following script after disabling tcp_autocork on the transmitting system:
  for i in `seq 0 19`
  do
    for j in `seq 0 2`
    do
      netperf -H 192.168.10.1 -t TCP_STREAM \
              -l 10 -c -C -T $i,$i -P 0 -- \
              -m 64 -s 64K -D
    done
  done

The current net tree as-is will give me about 2Gb/s of data w/ 100% CPU
utilization on the receiving system, and with the patch above reverted
on that system it gives me about 4Gb/s with only 21% CPU utilization.
If I set tcp_low_latency=1 I can get the CPU utilization down to about
12% on the same test with about 4Gb/s of throughput.

I'm still working on determining the exact root cause, but it looks to
me like there is some significant cache thrashing going on with regard
to the dst entries.

Below is a quick breakdown of the top CPU users for tcp_low_latency
on/off using perf top:

tcp_low_latency = 0
 36.49%  [kernel]                [k] ipv4_dst_check
 19.45%  [kernel]                [k] dst_release
 16.07%  [kernel]                [k] _raw_spin_lock
  9.84%  [kernel]                [k] tcp_prequeue
  2.13%  [kernel]                [k] tcp_v4_rcv
  1.38%  [kernel]                [k] memcpy
  1.04%  [ixgbe]                 [k] ixgbe_clean_rx_irq
  0.82%  [kernel]                [k] ip_rcv_finish
  0.54%  [kernel]                [k] dev_gro_receive
  0.51%  [kernel]                [k] build_skb
  0.51%  [kernel]                [k] __netif_receive_skb_core
  0.50%  [kernel]                [k] tcp_rcv_established
  0.46%  [kernel]                [k] sock_def_readable
  0.42%  [kernel]                [k] __slab_free
  0.38%  [kernel]                [k] __inet_lookup_established
  0.36%  [kernel]                [k] ip_rcv
  0.34%  [kernel]                [k] copy_user_enhanced_fast_string
  0.30%  [kernel]                [k] __netdev_alloc_frag
  0.29%  [kernel]                [k] kmem_cache_alloc
  0.27%  [kernel]                [k] inet_gro_receive
  0.25%  [kernel]                [k] put_compound_page
  0.24%  [kernel]                [k] tcp_v4_do_rcv
  0.24%  [kernel]                [k] napi_gro_receive
  0.22%  [kernel]                [k] tcp_event_data_recv
  0.20%  [kernel]                [k] tcp_gro_receive
  0.17%  [kernel]                [k] tcp_v4_early_demux
  0.16%  [kernel]                [k] kmem_cache_free
  0.14%  [ixgbe]                 [k] ixgbe_poll
  0.14%  [kernel]                [k] eth_type_trans
  0.13%  [kernel]                [k] tcp_prequeue_process
  0.13%  [kernel]                [k] tcp_send_delayed_ack
  0.13%  [kernel]                [k] mod_timer
  0.12%  [kernel]                [k] skb_copy_datagram_iovec
  0.12%  [kernel]                [k] irq_entries_start
  0.12%  [kernel]                [k] inet_ehashfn
  0.12%  [kernel]                [k] __tcp_ack_snd_check
  0.12%  [ixgbe]                 [k] ixgbe_xmit_frame_ring

tcp_low_latency = 1
  7.77%  [kernel]                 [k] memcpy
  6.13%  [ixgbe]                  [k] ixgbe_clean_rx_irq
  3.54%  [kernel]                 [k] skb_try_coalesce
  3.22%  [kernel]                 [k] dev_gro_receive
  3.21%  [kernel]                 [k] tcp_v4_rcv
  2.91%  [kernel]                 [k] __netif_receive_skb_core
  2.64%  [kernel]                 [k] build_skb
  2.59%  [kernel]                 [k] acpi_processor_ffh_cstate_enter
  2.53%  [kernel]                 [k] sock_def_readable
  2.26%  [kernel]                 [k] _raw_spin_lock
  2.20%  [kernel]                 [k] tcp_rcv_established
  2.07%  [kernel]                 [k] __inet_lookup_established
  1.95%  [kernel]                 [k] ip_rcv
  1.82%  [kernel]                 [k] kmem_cache_free
  1.76%  [kernel]                 [k] copy_user_enhanced_fast_string
  1.56%  [kernel]                 [k] tcp_try_coalesce
  1.53%  [kernel]                 [k] __netdev_alloc_frag
  1.53%  [kernel]                 [k] inet_gro_receive
  1.51%  [kernel]                 [k] napi_gro_receive
  1.29%  [kernel]                 [k] kmem_cache_alloc
  1.18%  [kernel]                 [k] tcp_gro_receive
  1.09%  [kernel]                 [k] put_compound_page
  0.98%  [kernel]                 [k] ip_local_deliver_finish
  0.97%  [kernel]                 [k] tcp_send_delayed_ack
  0.95%  [kernel]                 [k] tcp_event_data_recv
  0.90%  [kernel]                 [k] inet_ehashfn
  0.88%  [kernel]                 [k] ip_rcv_finish
  0.78%  [kernel]                 [k] tcp_v4_do_rcv
  0.77%  [kernel]                 [k] tcp_v4_early_demux
  0.76%  [kernel]                 [k] __switch_to
  0.76%  [kernel]                 [k] eth_type_trans
  0.75%  [kernel]                 [k] tcp_queue_rcv
  0.74%  [kernel]                 [k] __schedule
  0.72%  [kernel]                 [k] skb_copy_datagram_iovec
  0.71%  [ixgbe]                  [k] ixgbe_xmit_frame_ring
  0.68%  [kernel]                 [k] __tcp_ack_snd_check
  0.68%  [ixgbe]                  [k] ixgbe_poll
  0.67%  [kernel]                 [k] mod_timer
  0.64%  [kernel]                 [k] lapic_next_deadline

Any input/advice on where I should look or patches to possibly test
would be appreciated.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Performance regression on kernels 3.10 and newer
  2014-08-14 18:19 Performance regression on kernels 3.10 and newer Alexander Duyck
@ 2014-08-14 18:46 ` Eric Dumazet
  2014-08-14 19:50   ` Eric Dumazet
                     ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Eric Dumazet @ 2014-08-14 18:46 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: David Miller, netdev

On Thu, 2014-08-14 at 11:19 -0700, Alexander Duyck wrote:
> Yesterday I tripped over a bit of an issue and it seems like we are
> seeing significant cache thrash on kernels 3.10 and newer when running
> multiple streams of small packet stress on multiple NUMA nodes for a
> single NIC.
> 
> I did some bisection and found that I was able to trace it back to
> upstream commit 093162553c33e9479283e107b4431378271c735d (tcp: force a
> dst refcount when prequeue packet).
> 
> Recreating this issue is pretty straightforward.  All I did was set up 2
> dual-socket Xeon systems connected back to back with ixgbe and run the
> following script after disabling tcp_autocork on the transmitting system:
>   for i in `seq 0 19`
>   do
>     for j in `seq 0 2`
>     do
>       netperf -H 192.168.10.1 -t TCP_STREAM \
>               -l 10 -c -C -T $i,$i -P 0 -- \
>               -m 64 -s 64K -D
>     done
>   done
> 
> The current net tree as-is will give me about 2Gb/s of data w/ 100% CPU
> utilization on the receiving system, and with the patch above reverted
> on that system it gives me about 4Gb/s with only 21% CPU utilization.
> If I set tcp_low_latency=1 I can get the CPU utilization down to about
> 12% on the same test with about 4Gb/s of throughput.
> 
> I'm still working on determining the exact root cause, but it looks to
> me like there is some significant cache thrashing going on with regard
> to the dst entries.
> 
> Below is a quick breakdown of the top CPU users for tcp_low_latency
> on/off using perf top:
> 
> tcp_low_latency = 0

> 
> tcp_low_latency = 1
> Any input/advice on where I should look or patches to possibly test
> would be appreciated.


I believe you answered your own question: prequeue mode does not work
very well when one host has hundreds of active TCP flows to another.

In real life, applications do not use prequeue, because nobody wants one
thread per flow.

Each socket has its own dst now that the route cache has been removed,
but if your netperf migrates CPUs (and NUMA nodes), we do not detect
that the dst should be re-created on a different NUMA node.

But really, I am not sure we want to care about prequeue, as modern
applications use epoll()/poll()/select() instead of blocking on
recvmsg().


* Re: Performance regression on kernels 3.10 and newer
  2014-08-14 18:46 ` Eric Dumazet
@ 2014-08-14 19:50   ` Eric Dumazet
  2014-08-14 19:59   ` Rick Jones
  2014-08-14 23:16   ` Alexander Duyck
  2 siblings, 0 replies; 57+ messages in thread
From: Eric Dumazet @ 2014-08-14 19:50 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: David Miller, netdev

On Thu, 2014-08-14 at 11:46 -0700, Eric Dumazet wrote:

> I believe you answered your own question: prequeue mode does not work
> very well when one host has hundreds of active TCP flows to another.
> 
> In real life, applications do not use prequeue, because nobody wants one
> thread per flow.
> 
> Each socket has its own dst now that the route cache has been removed,
> but if your netperf migrates CPUs (and NUMA nodes), we do not detect
> that the dst should be re-created on a different NUMA node.
> 
> But really, I am not sure we want to care about prequeue, as modern
> applications use epoll()/poll()/select() instead of blocking on
> recvmsg().

BTW it looks like you do not use RPS/RFS/RSS, so you have different cpus
doing the dst refcount increment and decrement.

With RFS, your workload just works fine.


* Re: Performance regression on kernels 3.10 and newer
  2014-08-14 18:46 ` Eric Dumazet
  2014-08-14 19:50   ` Eric Dumazet
@ 2014-08-14 19:59   ` Rick Jones
  2014-08-14 20:31     ` Alexander Duyck
  2014-08-14 20:46     ` Eric Dumazet
  2014-08-14 23:16   ` Alexander Duyck
  2 siblings, 2 replies; 57+ messages in thread
From: Rick Jones @ 2014-08-14 19:59 UTC (permalink / raw)
  To: Eric Dumazet, Alexander Duyck; +Cc: David Miller, netdev

On 08/14/2014 11:46 AM, Eric Dumazet wrote:

> I believe you answered your own question: prequeue mode does not work
> very well when one host has hundreds of active TCP flows to another.
>
> In real life, applications do not use prequeue, because nobody wants one
> thread per flow.
>
> Each socket has its own dst now that the route cache has been removed,
> but if your netperf migrates CPUs (and NUMA nodes), we do not detect
> that the dst should be re-created on a different NUMA node.

Presumably, the -T $i,$i option in Alex's netperf command lines will 
have bound netperf and netserver to a specific CPU where they will have 
remained.

rick jones


* Re: Performance regression on kernels 3.10 and newer
  2014-08-14 19:59   ` Rick Jones
@ 2014-08-14 20:31     ` Alexander Duyck
  2014-08-14 20:51       ` Eric Dumazet
  2014-08-14 20:46     ` Eric Dumazet
  1 sibling, 1 reply; 57+ messages in thread
From: Alexander Duyck @ 2014-08-14 20:31 UTC (permalink / raw)
  To: Rick Jones, Eric Dumazet; +Cc: David Miller, netdev

On 08/14/2014 12:59 PM, Rick Jones wrote:
> On 08/14/2014 11:46 AM, Eric Dumazet wrote:
> 
>> I believe you answered your own question: prequeue mode does not work
>> very well when one host has hundreds of active TCP flows to another.
>>
>> In real life, applications do not use prequeue, because nobody wants one
>> thread per flow.

My concern here is that netperf is a standard tool to use for testing
network performance, and the kernel default is to run with
tcp_low_latency disabled.  As such, the prequeue is part of the
standard path, is it not?  If the prequeue isn't really useful anymore,
should we consider pulling it out of the kernel, or disabling it by
making tcp_low_latency the default?

>> Each socket has its own dst now that the route cache has been removed,
>> but if your netperf migrates CPUs (and NUMA nodes), we do not detect
>> that the dst should be re-created on a different NUMA node.
> 
> Presumably, the -T $i,$i option in Alex's netperf command lines will
> have bound netperf and netserver to a specific CPU where they will have
> remained.
> 
> rick jones

Yes, my test was affinitized per CPU.  I was originally trying to test
some local vs. remote NUMA performance numbers.  Also, as I mentioned, I
was using the ixgbe driver with an 82599, and I had ATR enabled, so the
receive flow was affinitized to the queue as well.  We shouldn't have
had any cross-node chatter as a result of that.

Thanks,

Alex


* Re: Performance regression on kernels 3.10 and newer
  2014-08-14 19:59   ` Rick Jones
  2014-08-14 20:31     ` Alexander Duyck
@ 2014-08-14 20:46     ` Eric Dumazet
  1 sibling, 0 replies; 57+ messages in thread
From: Eric Dumazet @ 2014-08-14 20:46 UTC (permalink / raw)
  To: Rick Jones; +Cc: Alexander Duyck, David Miller, netdev

On Thu, 2014-08-14 at 12:59 -0700, Rick Jones wrote:

> Presumably, the -T $i,$i option in Alex's netperf command lines will 
> have bound netperf and netserver to a specific CPU where they will have 
> remained.

That's the problem.

Softirqs are receiving packets from other cpus (RX queue X is handling
packets, and one cpu is generally draining RX queue X and feeding the
stack with packets).

With RFS, the RX path of the IP and TCP stack would be handled by cpu Y
(where the user thread is bound).


* Re: Performance regression on kernels 3.10 and newer
  2014-08-14 20:31     ` Alexander Duyck
@ 2014-08-14 20:51       ` Eric Dumazet
  0 siblings, 0 replies; 57+ messages in thread
From: Eric Dumazet @ 2014-08-14 20:51 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Rick Jones, David Miller, netdev

On Thu, 2014-08-14 at 13:31 -0700, Alexander Duyck wrote:

> My concern here is that netperf is a standard tool to use for testing
> network performance, and the kernel default is to run with
> tcp_low_latency disabled.  As such, the prequeue is part of the
> standard path, is it not?  If the prequeue isn't really useful anymore,
> should we consider pulling it out of the kernel, or disabling it by
> making tcp_low_latency the default?

Feel free to use netperf, but please do not complain if the Linux kernel
is not optimized for netperf, which uses a 20-year-old application
design (one thread per flow).

prequeue only works when few TCP flows are active.

For dozens of active flows, you probably need some tweaks, like the ones
in Documentation/networking/scaling.txt.

Then, prequeue or not, you would not notice the difference.
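For reference, the kind of RFS tweak scaling.txt describes looks roughly
like this (a sketch: the device name, queue layout, and table sizes are
illustrative, and it must be run as root on the receiving host):

```shell
# Global flow table, then per-RX-queue share of it
# (here 32768 split across 2 queues of a NIC assumed to be eth0).
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
for q in /sys/class/net/eth0/queues/rx-*; do
    echo 16384 > "$q/rps_flow_cnt"
done
```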


* Re: Performance regression on kernels 3.10 and newer
  2014-08-14 18:46 ` Eric Dumazet
  2014-08-14 19:50   ` Eric Dumazet
  2014-08-14 19:59   ` Rick Jones
@ 2014-08-14 23:16   ` Alexander Duyck
  2014-08-14 23:20     ` David Miller
  2014-08-14 23:48     ` Eric Dumazet
  2 siblings, 2 replies; 57+ messages in thread
From: Alexander Duyck @ 2014-08-14 23:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On 08/14/2014 11:46 AM, Eric Dumazet wrote:
> In real life, applications do not use prequeue, because nobody wants one
> thread per flow.

I still say this is just an argument to remove it.  It looks like you
submitted a patch to allow stripping it from the build about 7 years ago.
I assume it was rejected.

> Each socket has its own dst now that the route cache has been removed,
> but if your netperf migrates CPUs (and NUMA nodes), we do not detect
> that the dst should be re-created on a different NUMA node.

Are you sure about each socket having its own dst?  Everything I see
seems to indicate it is somehow associated with the IP address.  For
example, I can actually work around the issue by setting up a second
subnet on the same port and then running the tests with each subnet
affinitized to a specific node.

From what I can tell, the issue is that the patch made tcp_prequeue
force the skb to take a reference on the dst via an atomic increment,
which is later dropped with an atomic decrement when the skb is freed.
I believe these two transactions are the source of my cacheline
bouncing, though I am still not sure where ipv4_dst_check comes into
play in all this, since it shows up as the top item in perf but should
be in a separate cacheline entirely.  Perhaps it is the result of some
sort of false sharing.

Since my test was back to back with only one IP on each end, it used the
same dst for all of the queues/CPUs (or at least that is what I am
ass-u-me-ing).  So as a result, 1 NUMA node looks okay, as things only
get evicted out to LLC for the locked transaction; when you go to 2
sockets it is completely evicted from LLC and things get very ugly.

I don't believe that this will scale on SMP setups.  If I am missing
something obvious please let me know, but being over 10x worse in terms
of throughput per unit of CPU utilization is enough to make me want to
just scrap it.  I'm open to any suggestions on where having this enabled
might give us gains.  I have tried testing with a single-threaded setup
and it still hurt performance to have things going through the prequeue.
I figure if I cannot find a benefit for it, maybe I should just submit a
patch to strip it and the tcp_low_latency sysctl out.

Thanks,

Alex


* Re: Performance regression on kernels 3.10 and newer
  2014-08-14 23:16   ` Alexander Duyck
@ 2014-08-14 23:20     ` David Miller
  2014-08-14 23:25       ` Tom Herbert
  2014-08-15 17:15       ` Performance regression on kernels 3.10 and newer Alexander Duyck
  2014-08-14 23:48     ` Eric Dumazet
  1 sibling, 2 replies; 57+ messages in thread
From: David Miller @ 2014-08-14 23:20 UTC (permalink / raw)
  To: alexander.h.duyck; +Cc: eric.dumazet, netdev

From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Thu, 14 Aug 2014 16:16:36 -0700

> Are you sure about each socket having its own dst?  Everything I see
> seems to indicate it is somehow associated with the IP address.

Right, it should be, unless you have exception entries created by path
MTU discovery or redirects.

WRT prequeue, it does the right thing for dumb apps that block in
receive.  But because it causes the packet to cross domains as it does,
we can't do a lot of the tricks we normally can, and that's why the
refcounting on the dst is there now.

Perhaps we can find a clever way to elide that refcount, who knows.


* Re: Performance regression on kernels 3.10 and newer
  2014-08-14 23:20     ` David Miller
@ 2014-08-14 23:25       ` Tom Herbert
  2014-08-21 23:24         ` David Miller
  2014-08-15 17:15       ` Performance regression on kernels 3.10 and newer Alexander Duyck
  1 sibling, 1 reply; 57+ messages in thread
From: Tom Herbert @ 2014-08-14 23:25 UTC (permalink / raw)
  To: David Miller; +Cc: Alexander Duyck, Eric Dumazet, Linux Netdev List

On Thu, Aug 14, 2014 at 4:20 PM, David Miller <davem@davemloft.net> wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> Date: Thu, 14 Aug 2014 16:16:36 -0700
>
>> Are you sure about each socket having its own dst?  Everything I see
>> seems to indicate it is somehow associated with the IP address.
>
> Right it should be, unless you have exception entries created by path
> MTU or redirects.
>
> WRT prequeue, it does the right thing for dumb apps that block in
> receive.  But because it causes the packet to cross domains as it
> does, we can't do a lot of tricks which we normally can do, and that's
> why the refcounting on the dst is there now.
>
> Perhaps we can find a clever way to elide that refcount, who knows.

I don't know if it's the same problem, but I did post a patch back in
January that would resolve false sharing of dst->__refcnt and rt_genid
in the same cacheline. We could revisit that if it helps.

Tom


* Re: Performance regression on kernels 3.10 and newer
  2014-08-14 23:16   ` Alexander Duyck
  2014-08-14 23:20     ` David Miller
@ 2014-08-14 23:48     ` Eric Dumazet
  2014-08-15  0:33       ` Rick Jones
  1 sibling, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2014-08-14 23:48 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: David Miller, netdev

On Thu, 2014-08-14 at 16:16 -0700, Alexander Duyck wrote:
> On 08/14/2014 11:46 AM, Eric Dumazet wrote:
> > In real life, applications do not use prequeue, because nobody wants one
> > thread per flow.
> 
> I still say this is just an argument to remove it.  It looks like you
> submitted a patch to allow stripping it from the build about 7 years ago.
> I assume it was rejected.
> 
> > Each socket has its own dst now that the route cache has been removed,
> > but if your netperf migrates CPUs (and NUMA nodes), we do not detect
> > that the dst should be re-created on a different NUMA node.
> 
> Are you sure about each socket having its own dst?  Everything I see
> seems to indicate it is somehow associated with the IP address.  For
> example, I can actually work around the issue by setting up a second
> subnet on the same port and then running the tests with each subnet
> affinitized to a specific node.

Here is the thing:

nh_pcpu_rth_output is a per-cpu cache.

But nh_rth_input is not, because we did not yet optimize the case where
the dst has to be refcounted.

The rationale is explained in d26b3a7c4b3b26319f18bb645de93eba8f4bdcd5
("ipv4: percpu nh_rth_output cache").

If you guys really believe we should have a percpu dst, go for it, but
again, if the softirq handler runs on a different cpu than the
application thread, it won't work.

And, given that 72-core servers are now on the way, we would consume
more RAM, just for netperf users.


* Re: Performance regression on kernels 3.10 and newer
  2014-08-14 23:48     ` Eric Dumazet
@ 2014-08-15  0:33       ` Rick Jones
  0 siblings, 0 replies; 57+ messages in thread
From: Rick Jones @ 2014-08-15  0:33 UTC (permalink / raw)
  To: Eric Dumazet, Alexander Duyck; +Cc: David Miller, netdev

On 08/14/2014 04:48 PM, Eric Dumazet wrote:
> Here is the thing:
>
> nh_pcpu_rth_output is a per-cpu cache.
>
> But nh_rth_input is not, because we did not yet optimize the case where
> the dst has to be refcounted.
>
> The rationale is explained in d26b3a7c4b3b26319f18bb645de93eba8f4bdcd5
> ("ipv4: percpu nh_rth_output cache").
>
> If you guys really believe we should have a percpu dst, go for it, but
> again, if the softirq handler runs on a different cpu than the
> application thread, it won't work.
>
> And, given that 72-core servers are now on the way, we would consume
> more RAM, just for netperf users.

Heck, we've had >= 72 core servers for *years* - go back to the 
Itanium-based Superdome servers for example, and I suspect some of their 
Power-based contemporaries.  They just weren't plentiful.  After that, 
there were/are the 8-socket x86 boxes from various vendors, and now the 
BL920s.

I will not claim that netperf represents all apps.  Neither will I claim 
it represents most apps.  But I won't accept that it represents no apps :)

And if there is a reasonably clean way to introduce epoll/poll/select 
into netperf I'm willing to consider it.

happy benchmarking,

rick jones


* Re: Performance regression on kernels 3.10 and newer
  2014-08-14 23:20     ` David Miller
  2014-08-14 23:25       ` Tom Herbert
@ 2014-08-15 17:15       ` Alexander Duyck
  2014-08-15 17:59         ` Eric Dumazet
                           ` (2 more replies)
  1 sibling, 3 replies; 57+ messages in thread
From: Alexander Duyck @ 2014-08-15 17:15 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

On 08/14/2014 04:20 PM, David Miller wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> Date: Thu, 14 Aug 2014 16:16:36 -0700
> 
>> Are you sure about each socket having its own dst?  Everything I see
>> seems to indicate it is somehow associated with the IP address.
> 
> Right it should be, unless you have exception entries created by path
> MTU or redirects.
> 
> WRT prequeue, it does the right thing for dumb apps that block in
> receive.  But because it causes the packet to cross domains as it
> does, we can't do a lot of tricks which we normally can do, and that's
> why the refcounting on the dst is there now.
> 
> Perhaps we can find a clever way to elide that refcount, who knows.

Actually, I would consider the refcount issue just the final nail in the
coffin for all of this.  It seems like there are multiple issues that
have been there for some time, and they are just getting worse with the
refcount change in 3.10.

With the prequeue disabled what happens is that the frames are making it
up and hitting tcp_rcv_established before being pushed into the backlog
queues and coalesced there.  I believe the lack of coalescing on the
prequeue path is one of the reasons why it is twice as expensive as the
non-prequeue path CPU wise even if you eliminate the refcount issue.

I realize most of my data is anecdotal as I only have the ixgbe/igb
adapters and netperf to work with.  This is one of the reasons why I
keep asking if someone can tell me what the use case is for this where
it performs well.  From what I can tell it might have had some value
back in the day before the introduction of things such as RPS/RFS where
some of the socket processing would be offloaded to other CPUs for a
single queue device, but even that use case is now deprecated since
RPS/RFS are there and function better than this.  What I am basically
looking for is a way to weigh the gains against the penalties to
determine if this code is even viable anymore.

In the meantime I think I will put together a patch to default
tcp_low_latency to 1 for net and stable, and if we cannot find a good
reason for keeping it then I can submit a patch to net-next that will
strip it out since I don't see any benefit to having this code.

Thanks,

Alex


* Re: Performance regression on kernels 3.10 and newer
  2014-08-15 17:15       ` Performance regression on kernels 3.10 and newer Alexander Duyck
@ 2014-08-15 17:59         ` Eric Dumazet
  2014-08-15 18:49         ` Tom Herbert
  2014-08-21 23:51         ` David Miller
  2 siblings, 0 replies; 57+ messages in thread
From: Eric Dumazet @ 2014-08-15 17:59 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: David Miller, netdev

On Fri, 2014-08-15 at 10:15 -0700, Alexander Duyck wrote:

> I realize most of my data is anecdotal as I only have the ixgbe/igb
> adapters and netperf to work with.  This is one of the reasons why I
> keep asking if someone can tell me what the use case is for this where
> it performs well.  From what I can tell it might have had some value
> back in the day before the introduction of things such as RPS/RFS where
> some of the socket processing would be offloaded to other CPUs for a
> single queue device, but even that use case is now deprecated since
> RPS/RFS are there and function better than this.  What I am basically
> looking for is a way to weigh the gains against the penalties to
> determine if this code is even viable anymore.
> 
> In the meantime I think I will put together a patch to default
> tcp_low_latency to 1 for net and stable, and if we cannot find a good
> reason for keeping it then I can submit a patch to net-next that will
> strip it out since I don't see any benefit to having this code.

prequeue is useful on low-end hosts, because it allows an application to
delay ACK packets if it is slow.  You get nicer TCP behavior, because it
is more tied to scheduling glitches, especially if TCP windows are not
autotuned.  I guess some embedded platforms benefit from this.

I would keep it the way it is, because modern high-performance
applications do not trigger the prequeue anyway.

It is trivial to add a poll() before recvmsg() into netperf to avoid the
prequeue, or to start your benchmarks with the appropriate sysctl.

It is certainly not a 'stable' candidate change; it has been there for
years.

Note that we have include/linux/percpu-refcount.h now; we might
autodetect that some dsts are heavily used by different cpus and switch
to a percpu refcount.


* Re: Performance regression on kernels 3.10 and newer
  2014-08-15 17:15       ` Performance regression on kernels 3.10 and newer Alexander Duyck
  2014-08-15 17:59         ` Eric Dumazet
@ 2014-08-15 18:49         ` Tom Herbert
  2014-08-15 19:10           ` Alexander Duyck
  2014-08-21 23:51         ` David Miller
  2 siblings, 1 reply; 57+ messages in thread
From: Tom Herbert @ 2014-08-15 18:49 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: David Miller, Eric Dumazet, Linux Netdev List

On Fri, Aug 15, 2014 at 10:15 AM, Alexander Duyck
<alexander.h.duyck@intel.com> wrote:
> On 08/14/2014 04:20 PM, David Miller wrote:
>> From: Alexander Duyck <alexander.h.duyck@intel.com>
>> Date: Thu, 14 Aug 2014 16:16:36 -0700
>>
>>> Are you sure about each socket having its own dst?  Everything I see
>>> seems to indicate it is somehow associated with the IP address.
>>
>> Right it should be, unless you have exception entries created by path
>> MTU or redirects.
>>
>> WRT prequeue, it does the right thing for dumb apps that block in
>> receive.  But because it causes the packet to cross domains as it
>> does, we can't do a lot of tricks which we normally can do, and that's
>> why the refcounting on the dst is there now.
>>
>> Perhaps we can find a clever way to elide that refcount, who knows.
>
> Actually, I would consider the refcount issue just the final nail in the
> coffin for all of this.  It seems like there are multiple issues that
> have been there for some time, and they are just getting worse with the
> refcount change in 3.10.
>
> With the prequeue disabled what happens is that the frames are making it
> up and hitting tcp_rcv_established before being pushed into the backlog
> queues and coalesced there.  I believe the lack of coalescing on the
> prequeue path is one of the reasons why it is twice as expensive as the
> non-prequeue path CPU wise even if you eliminate the refcount issue.
>
> I realize most of my data is anecdotal as I only have the ixgbe/igb
> adapters and netperf to work with.  This is one of the reasons why I
> keep asking if someone can tell me what the use case is for this where
> it performs well.  From what I can tell it might have had some value
> back in the day before the introduction of things such as RPS/RFS where
> some of the socket processing would be offloaded to other CPUs for a
> single queue device, but even that use case is now deprecated since
> RPS/RFS are there and function better than this.  What I am basically
> looking for is a way to weigh the gains against the penalties to
> determine if this code is even viable anymore.
>
Alex, I tried to repro your problem by running your script (on bnx2x).
I didn't see the issue, and in fact ipv4_dst_check did not appear among
the top functions in perf.  I assume this is more related to the
steering configuration than to the device (although flow director
might be a fundamental difference).

> In the meantime I think I will put together a patch to default
> tcp_low_latency to 1 for net and stable, and if we cannot find a good
> reason for keeping it then I can submit a patch to net-next that will
> strip it out since I don't see any benefit to having this code.
>
> Thanks,
>
> Alex

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Performance regression on kernels 3.10 and newer
  2014-08-15 18:49         ` Tom Herbert
@ 2014-08-15 19:10           ` Alexander Duyck
  2014-08-15 22:16             ` Tom Herbert
  0 siblings, 1 reply; 57+ messages in thread
From: Alexander Duyck @ 2014-08-15 19:10 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Eric Dumazet, Linux Netdev List

On 08/15/2014 11:49 AM, Tom Herbert wrote:
> Alex, I tried to repro your problem running your script (on bnx2x).
> Didn't see the issue, and in fact ipv4_dst_check did not appear among
> the top functions in the perf profile. I assume this is more related
> to the steering configuration than to the device (although flow
> director might be a fundamental difference).
> 

So the original script I put out had a typo.  It was supposed to run all
60 at the same time, not one at a time.  So make sure you add an
ampersand to the end of the netperf command line if you run the test so
that it is 60 at once, not 60 in series.

Also one other thing I had to do was disable tcp_autocork.  Without that
the test is a large packets test instead of a small packet test.
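Put differently, here is a dry-run sketch of the corrected loop (gen_cmds
is just an illustrative helper I'm using here, not part of the real test;
it only echoes the commands so the layout can be checked anywhere -- drop
the echoes and run the sysctl as root on the transmitter to actually
drive traffic):

```shell
# Corrected reproducer sketch: disable autocork on the transmitter and
# note the trailing '&' so all 60 netperf instances run concurrently
# instead of back to back.
gen_cmds() {
  echo "sysctl -w net.ipv4.tcp_autocork=0"
  for i in $(seq 0 19); do
    for j in $(seq 0 2); do
      echo "netperf -H 192.168.10.1 -t TCP_STREAM -l 10 -c -C -T $i,$i -P 0 -- -m 64 -s 64K -D &"
    done
  done
}
gen_cmds
```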

Thanks,

Alex


* Re: Performance regression on kernels 3.10 and newer
  2014-08-15 19:10           ` Alexander Duyck
@ 2014-08-15 22:16             ` Tom Herbert
  2014-08-15 23:23               ` Alexander Duyck
  0 siblings, 1 reply; 57+ messages in thread
From: Tom Herbert @ 2014-08-15 22:16 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: David Miller, Eric Dumazet, Linux Netdev List

On Fri, Aug 15, 2014 at 12:10 PM, Alexander Duyck
<alexander.h.duyck@intel.com> wrote:
> On 08/15/2014 11:49 AM, Tom Herbert wrote:
>> Alex, I tried to repro your problem running your script (on bnx2x).
>> Didn't see the issue, and in fact ipv4_dst_check did not appear among
>> the top functions in the perf profile. I assume this is more related
>> to the steering configuration than to the device (although flow
>> director might be a fundamental difference).
>>
>
> So the original script I put out had a typo.  It was supposed to run all
> 60 at the same time, not one at a time.  So make sure you add an
> ampersand to the end of the netperf command line if you run the test so
> that it is 60 at once, not 60 in series.
>
> Also one other thing I had to do was disable tcp_autocork.  Without that
> the test is a large packets test instead of a small packet test.
>
Okay, by running netperf in the background, disabling autocork, and
turning off RPS/RFS I'm able to get ipv4_dst_check to come up in perf;
but it's not nearly as bad as what you've reported, only about 1.5%.
When I applied the patch to move rt_genid to a different cacheline,
ipv4_dst_check goes away (ipv4: move rt_genid to different cache
line). Can you try this patch in your setup?

Thanks,
Tom

> Thanks,
>
> Alex
>


* Re: Performance regression on kernels 3.10 and newer
  2014-08-15 22:16             ` Tom Herbert
@ 2014-08-15 23:23               ` Alexander Duyck
  2014-08-18  9:03                 ` David Laight
  0 siblings, 1 reply; 57+ messages in thread
From: Alexander Duyck @ 2014-08-15 23:23 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Eric Dumazet, Linux Netdev List, Rick Jones

On 08/15/2014 03:16 PM, Tom Herbert wrote:
> On Fri, Aug 15, 2014 at 12:10 PM, Alexander Duyck
> <alexander.h.duyck@intel.com> wrote:
>> On 08/15/2014 11:49 AM, Tom Herbert wrote:
>>> Alex, I tried to repro your problem running your script (on bnx2x).
>>> Didn't see the issue, and in fact ipv4_dst_check did not appear among
>>> the top functions in the perf profile. I assume this is more related
>>> to the steering configuration than to the device (although flow
>>> director might be a fundamental difference).
>>>
>>
>> So the original script I put out had a typo.  It was supposed to run all
>> 60 at the same time, not one at a time.  So make sure you add an
>> ampersand to the end of the netperf command line if you run the test so
>> that it is 60 at once, not 60 in series.
>>
>> Also one other thing I had to do was disable tcp_autocork.  Without that
>> the test is a large packets test instead of a small packet test.
>>
> Okay, by running netperf in the background, disabling autocork, and
> turning off RPS/RFS I'm able to get ipv4_dst_check to come up in perf;
> but it's not nearly as bad as what you've reported, only about 1.5%.
> When I applied the patch to move rt_genid to a different cacheline,
> ipv4_dst_check goes away (ipv4: move rt_genid to different cache
> line). Can you try this patch in your setup?

The issue doesn't occur for me until I start running netperf on both
CPU sockets with the same IP address on both ends.  Then I see the dst
bouncing between the two NUMA nodes and the CPU utilization skyrockets.
If I am only on one node the dst bouncing is tolerable as it doesn't go
any further than the LLC.

With your patch applied I see ipv4_dst_check drop off to 5% CPU
utilization from the 36% that it was.  However ip_rcv_finish has climbed
up to about 16% so it isn't as though much was saved.  It just pushed it
to the next item to hit in that cacheline.  Throughput was 2.5Gb/s with
100% CPU utilization on the receiver.

Even if the refcount issue is fixed the performance still suffers
compared to the low_latency path in my testing.  When I reverted the
refcount change the CPU utilization dropped from 100% to about 25%, but
that is still double the 12% I am seeing when tcp_low_latency is set.
That is one of the reasons why I am not all that interested in the
refcount fix, as I am still likely going to have to work around other
issues in the prequeue path.

Another test I tried was to hack the nettest_bsd.c file in netperf to
perform a poll() based receive.  That resolved the issue and had all the
performance of the tcp_low_latency case.  I may see if I can work with
Rick to push something like that into netperf as I really would prefer
to avoid having to advise everyone on how to setup the sysctl for
tcp_low_latency.
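For the curious, the hack boils down to something like this (a sketch
only, not the actual nettest_bsd.c change; poll_recv is a made-up helper
name): block in poll() rather than in recvmsg(), so the thread is never
asleep inside recvmsg() and incoming segments take the normal receive
path instead of the prequeue:

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Wait for readability with poll(), then do a non-blocking recv().
 * Because the reader never sleeps inside recvmsg(), tcp_prequeue()
 * never queues the skb, so the shared dst refcount is left alone.
 */
static ssize_t poll_recv(int fd, void *buf, size_t len)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	if (poll(&pfd, 1, -1) < 0)	/* block here, not in recvmsg() */
		return -1;
	return recv(fd, buf, len, MSG_DONTWAIT);
}
```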

Thanks,

Alex


* RE: Performance regression on kernels 3.10 and newer
  2014-08-15 23:23               ` Alexander Duyck
@ 2014-08-18  9:03                 ` David Laight
  2014-08-18 15:22                   ` Alexander Duyck
  0 siblings, 1 reply; 57+ messages in thread
From: David Laight @ 2014-08-18  9:03 UTC (permalink / raw)
  To: 'Alexander Duyck', Tom Herbert
  Cc: David Miller, Eric Dumazet, Linux Netdev List, Rick Jones

From: Alexander Duyck
> ...
> Another test I tried was to hack the nettest_bsd.c file in netperf to
> perform a poll() based receive.  That resolved the issue and had all the
> performance of the tcp_low_latency case.  I may see if I can work with
> Rick to push something like that into netperf as I really would prefer
> to avoid having to advise everyone on how to setup the sysctl for
> tcp_low_latency.

Doesn't that generate 2 system calls per receive?
Unless it now returns more data per receive, I'm surprised that
it is actually faster.

OTOH I've some code that runs a lot better when I run while :; do :; done
for all but one of the cpus.
I think that is because the processes spinning in userspace don't
get pre-empted.

	David



* Re: Performance regression on kernels 3.10 and newer
  2014-08-18  9:03                 ` David Laight
@ 2014-08-18 15:22                   ` Alexander Duyck
  2014-08-18 15:29                     ` Rick Jones
  0 siblings, 1 reply; 57+ messages in thread
From: Alexander Duyck @ 2014-08-18 15:22 UTC (permalink / raw)
  To: David Laight, Tom Herbert
  Cc: David Miller, Eric Dumazet, Linux Netdev List, Rick Jones

On 08/18/2014 02:03 AM, David Laight wrote:
> From: Alexander Duyck
>> ...
>> Another test I tried was to hack the nettest_bsd.c file in netperf to
>> perform a poll() based receive.  That resolved the issue and had all the
>> performance of the tcp_low_latency case.  I may see if I can work with
>> Rick to push something like that into netperf as I really would prefer
>> to avoid having to advise everyone on how to setup the sysctl for
>> tcp_low_latency.
> 
> Doesn't that generate 2 system calls per receive?
> Unless it now returns more data per receive, I'm surprised that
> it is actually faster.

If you haven't been keeping up with the thread: what I am gaining by
doing this is avoiding a significant cache thrash issue with the dst
entry, as the prequeue path involves updating a reference count that is
shared by all of my CPUs.

By using poll to wait for it I don't load frames onto the TCP prequeue
and thereby avoid it.

> OTOH I've some code that runs a lot better when I run while :; do :; done
> for all but one of the cpus.
> I think that is because the processes spinning in userspace don't
> get pre-empted.
> 
> 	David
> 

What you are probably seeing is that the CPU doesn't go into a deep
sleep state so it likely runs better.  You might try the same thing with
a kernel booted with idle=poll and you would probably see the same result.

Thanks,

Alex


* Re: Performance regression on kernels 3.10 and newer
  2014-08-18 15:22                   ` Alexander Duyck
@ 2014-08-18 15:29                     ` Rick Jones
  0 siblings, 0 replies; 57+ messages in thread
From: Rick Jones @ 2014-08-18 15:29 UTC (permalink / raw)
  To: Alexander Duyck, David Laight, Tom Herbert
  Cc: David Miller, Eric Dumazet, Linux Netdev List

Alex -

Why don't you go ahead and shoot me a patch to netperf to conditionally 
put a poll() or select() in front of receive.  I guess the "omni" code 
would be the place to do it.  The netperf-talk or netperf-dev lists 
would be the place to send it I suspect.

happy benchmarking,

rick jones


* Re: Performance regression on kernels 3.10 and newer
  2014-08-14 23:25       ` Tom Herbert
@ 2014-08-21 23:24         ` David Miller
  2014-09-06 14:45           ` Eric Dumazet
  0 siblings, 1 reply; 57+ messages in thread
From: David Miller @ 2014-08-21 23:24 UTC (permalink / raw)
  To: therbert; +Cc: alexander.h.duyck, eric.dumazet, netdev

From: Tom Herbert <therbert@google.com>
Date: Thu, 14 Aug 2014 16:25:53 -0700

> I don't know if it's the same problem, but I did post a patch back in
> January that would resolve false sharing of dst->__refcnt and rt_genid
> in the same cacheline. We could revisit that if it helps.

I think that regardless of what happens in the discussion here, you
should repost that patch.


* Re: Performance regression on kernels 3.10 and newer
  2014-08-15 17:15       ` Performance regression on kernels 3.10 and newer Alexander Duyck
  2014-08-15 17:59         ` Eric Dumazet
  2014-08-15 18:49         ` Tom Herbert
@ 2014-08-21 23:51         ` David Miller
  2 siblings, 0 replies; 57+ messages in thread
From: David Miller @ 2014-08-21 23:51 UTC (permalink / raw)
  To: alexander.h.duyck; +Cc: eric.dumazet, netdev

From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Fri, 15 Aug 2014 10:15:15 -0700

> This is one of the reasons why I keep asking if someone can tell me
> what the use case is for this where it performs well.

It allows the TCP connection to be paced by the process scheduler's
ability to put the thread doing the I/O on a cpu.


* Re: Performance regression on kernels 3.10 and newer
  2014-08-21 23:24         ` David Miller
@ 2014-09-06 14:45           ` Eric Dumazet
  2014-09-06 15:27             ` Eric Dumazet
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2014-09-06 14:45 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, alexander.h.duyck, netdev

On Thu, 2014-08-21 at 16:24 -0700, David Miller wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Thu, 14 Aug 2014 16:25:53 -0700
> 
> > I don't know if it's the same problem, but I did post a patch back in
> > January that would resolve false sharing of dst->__refcnt and rt_genid
> > in the same cacheline. We could revisit that if it helps.
> 
> I think that regardless of what happens in the discussion here, you
> should repost that patch.

I believe this won't be necessary; the real meat is elsewhere.

There are very few cases where we really need to keep the dst attached
to skb.

It's mostly because of IP early demux, and it's possible to take care
of the fast path (where sk->sk_rx_dst is set) and simply drop the dst
before prequeue in this case.

Same goes for the backlog, used when the socket is owned by the user.

I'll send a patch.


* Re: Performance regression on kernels 3.10 and newer
  2014-09-06 14:45           ` Eric Dumazet
@ 2014-09-06 15:27             ` Eric Dumazet
  2014-09-06 15:46               ` Eric Dumazet
  2014-09-08 15:06               ` [PATCH v2 net-next] tcp: remove dst refcount false sharing for prequeue mode Eric Dumazet
  0 siblings, 2 replies; 57+ messages in thread
From: Eric Dumazet @ 2014-09-06 15:27 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, alexander.h.duyck, netdev

From: Eric Dumazet <edumazet@google.com>

Alexander Duyck reported high false sharing on dst refcount in tcp stack
when prequeue is used. prequeue is the mechanism used when a thread is
blocked in recvmsg()/read() on the TCP socket.

We already try to use RCU in input path as much as possible, but we were
forced to take a refcount on the dst when skb escaped RCU protected
region. When/if the user thread runs on different cpu, dst_release()
will then touch dst refcount again.

Commit 093162553c33 (tcp: force a dst refcount when prequeue packet)
was an example of a race fix.

It turns out the only remaining usage of skb->dst for a packet stored
in a TCP socket prequeue is IP early demux.

We can add a logic to detect when IP early demux is probably going
to use skb->dst.

In IPv4 stack, we also can drop dst when skb has to be queued into
socket backlog. IPv6 needs extra care before doing the same.

Many thanks to Alexander for providing a nice bug report, git bisection,
and reproducer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 net/ipv4/tcp_ipv4.c |   27 ++++++++++++++++++---------
 net/ipv6/tcp_ipv6.c |   15 +++++++++------
 2 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 3f9bc3f0bba0..5683505553bf 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1559,7 +1559,11 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
 	    skb_queue_len(&tp->ucopy.prequeue) == 0)
 		return false;
 
-	skb_dst_force(skb);
+	if (likely(sk->sk_rx_dst))
+		skb_dst_drop(skb);
+	else
+		skb_dst_force(skb);
+
 	__skb_queue_tail(&tp->ucopy.prequeue, skb);
 	tp->ucopy.memory += skb->truesize;
 	if (tp->ucopy.memory > sk->sk_rcvbuf) {
@@ -1682,11 +1686,14 @@ process:
 			if (!tcp_prequeue(sk, skb))
 				ret = tcp_v4_do_rcv(sk, skb);
 		}
-	} else if (unlikely(sk_add_backlog(sk, skb,
-					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
-		bh_unlock_sock(sk);
-		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
-		goto discard_and_relse;
+	} else {
+		skb_dst_drop(skb);
+		if (unlikely(sk_add_backlog(sk, skb,
+					    sk->sk_rcvbuf + sk->sk_sndbuf))) {
+			bh_unlock_sock(sk);
+			NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
+			goto discard_and_relse;
+		}
 	}
 	bh_unlock_sock(sk);
 
@@ -1765,9 +1772,11 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
 
-	dst_hold(dst);
-	sk->sk_rx_dst = dst;
-	inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
+	if (dst) {
+		dst_hold(dst);
+		sk->sk_rx_dst = dst;
+		inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
+	}
 }
 EXPORT_SYMBOL(inet_sk_rx_dst_set);
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 5b3c70ff7a72..1835480336ac 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -93,13 +93,16 @@ static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(struct sock *sk,
 static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
-	const struct rt6_info *rt = (const struct rt6_info *)dst;
 
-	dst_hold(dst);
-	sk->sk_rx_dst = dst;
-	inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
-	if (rt->rt6i_node)
-		inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
+	if (dst) {
+		const struct rt6_info *rt = (const struct rt6_info *)dst;
+
+		dst_hold(dst);
+		sk->sk_rx_dst = dst;
+		inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
+		if (rt->rt6i_node)
+			inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
+	}
 }
 
 static void tcp_v6_hash(struct sock *sk)


* Re: Performance regression on kernels 3.10 and newer
  2014-09-06 15:27             ` Eric Dumazet
@ 2014-09-06 15:46               ` Eric Dumazet
  2014-09-06 16:38                 ` Eric Dumazet
  2014-09-08 15:06               ` [PATCH v2 net-next] tcp: remove dst refcount false sharing for prequeue mode Eric Dumazet
  1 sibling, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2014-09-06 15:46 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, alexander.h.duyck, netdev

On Sat, 2014-09-06 at 08:27 -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Alexander Duyck reported high false sharing on dst refcount in tcp stack
> when prequeue is used. prequeue is the mechanism used when a thread is
> blocked in recvmsg()/read() on the TCP socket.
> 
> We already try to use RCU in input path as much as possible, but we were
> forced to take a refcount on the dst when skb escaped RCU protected
> region. When/if the user thread runs on different cpu, dst_release()
> will then touch dst refcount again.
> 
> Commit 093162553c33 (tcp: force a dst refcount when prequeue packet)
> was an example of a race fix.
> 
> It turns out the only remaining usage of skb->dst for a packet stored
> in a TCP socket prequeue is IP early demux.
> 
> We can add a logic to detect when IP early demux is probably going
> to use skb->dst.
> 
> In IPv4 stack, we also can drop dst when skb has to be queued into
> socket backlog. IPv6 needs extra care before doing the same.
> 
> Many thanks to Alexander for providing a nice bug report, git bisection,
> and reproducer.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>

Works fine on IPv4 but not on IPv6; I will submit a v2 (with a proper
title, btw).


* Re: Performance regression on kernels 3.10 and newer
  2014-09-06 15:46               ` Eric Dumazet
@ 2014-09-06 16:38                 ` Eric Dumazet
  2014-09-06 18:21                   ` Eric Dumazet
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2014-09-06 16:38 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, alexander.h.duyck, netdev

On Sat, 2014-09-06 at 08:46 -0700, Eric Dumazet wrote:

> Works fine on IPv4, not on IPv6, I will submit a v2 (with a proper title
> btw)

Oh well, we first need to fix IPv6 early demux.

Current net-next kernel profile

+  20.95%      netserver  [kernel.kallsyms]    [k] dst_release     
+  19.33%      netserver  [kernel.kallsyms]    [k] ip6_pol_route.isra.46
+  11.75%      netserver  [kernel.kallsyms]    [k] inet6_sk_rx_dst_set
+   3.72%      netserver  [kernel.kallsyms]    [k] ip6_input_finish

So we repeat over and over calls to inet6_sk_rx_dst_set(),
something is surely wrong.


* Re: Performance regression on kernels 3.10 and newer
  2014-09-06 16:38                 ` Eric Dumazet
@ 2014-09-06 18:21                   ` Eric Dumazet
  2014-09-07 19:05                     ` [PATCH net] ipv6: refresh rt6i_genid in ip6_pol_route() Eric Dumazet
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2014-09-06 18:21 UTC (permalink / raw)
  To: David Miller, Nicolas Dichtel; +Cc: therbert, alexander.h.duyck, netdev

On Sat, 2014-09-06 at 09:38 -0700, Eric Dumazet wrote:
> On Sat, 2014-09-06 at 08:46 -0700, Eric Dumazet wrote:
> 
> > Works fine on IPv4, not on IPv6, I will submit a v2 (with a proper title
> > btw)
> 
> Oh well, we first need to fix IPv6 early demux.
> 
> Current net-next kernel profile
> 
> +  20.95%      netserver  [kernel.kallsyms]    [k] dst_release     
> +  19.33%      netserver  [kernel.kallsyms]    [k] ip6_pol_route.isra.46
> +  11.75%      netserver  [kernel.kallsyms]    [k] inet6_sk_rx_dst_set
> +   3.72%      netserver  [kernel.kallsyms]    [k] ip6_input_finish
> 
> So we repeat over and over calls to inet6_sk_rx_dst_set(),
> something is surely wrong.
> 

Nicolas, it seems that after your commit
(6f3118b571b8a4c06c7985dc3172c3526cb86253 "ipv6: use net->rt_genid to
check dst validity")  IP early demux is broken.


Basically, we now hit the following condition in ip6_dst_check()

if (rt->rt6i_genid != rt_genid_ipv6(dev_net(rt->dst.dev))) 

In my traces, rt->rt6i_genid stays at 2, while rt_genid_ipv6()
increments every time a route is added/deleted.

So we keep calling dst_release(), and inet6_sk_rx_dst_set(),
but the route we keep installing is already wrong, as its genid is too
old.

Does the following fix make any sense?

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f74b0417bd60..a139a16298e3 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -947,9 +947,10 @@ restart:
 		nrt = rt6_alloc_cow(rt, &fl6->daddr, &fl6->saddr);
 	else if (!(rt->dst.flags & DST_HOST))
 		nrt = rt6_alloc_clone(rt, &fl6->daddr);
-	else
+	else {
+		rt->rt6i_genid = rt_genid_ipv6(net);
 		goto out2;
-
+	}
 	ip6_rt_put(rt);
 	rt = nrt ? : net->ipv6.ip6_null_entry;
 


* [PATCH net] ipv6: refresh rt6i_genid in ip6_pol_route()
  2014-09-06 18:21                   ` Eric Dumazet
@ 2014-09-07 19:05                     ` Eric Dumazet
  2014-09-07 22:54                       ` David Miller
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2014-09-07 19:05 UTC (permalink / raw)
  To: David Miller; +Cc: Nicolas Dichtel, therbert, alexander.h.duyck, netdev

From: Eric Dumazet <edumazet@google.com>

While tracking IPv6 poor performance, I found that IPv6 early demux
was broken in recent kernels. perf profiles show inet6_sk_rx_dst_set()
being called for every incoming TCP segment :

  20.95%      netserver  [kernel.kallsyms]    [k] dst_release     
  19.33%      netserver  [kernel.kallsyms]    [k] ip6_pol_route
  11.75%      netserver  [kernel.kallsyms]    [k] inet6_sk_rx_dst_set
   3.72%      netserver  [kernel.kallsyms]    [k] ip6_input_finish

Regression came in linux-3.6 with commit 6f3118b571b8 ("ipv6: use
net->rt_genid to check dst validity")

When a route found in ip6_pol_route() is cloned (either using
rt6_alloc_cow() or rt6_alloc_clone()), copy gets an updated rt6i_genid.

But when original route is selected, we need to refresh its rt6i_genid
that could be obsolete. If we do not refresh rt6i_genid, ip6_dst_check()
will fail.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 6f3118b571b8 ("ipv6: use net->rt_genid to check dst validity")
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/route.c |   15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f23fbd28a501..1e76c3c5b87b 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -944,13 +944,20 @@ restart:
 	dst_hold(&rt->dst);
 	read_unlock_bh(&table->tb6_lock);
 
-	if (!(rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)))
+	if (!(rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY))) {
 		nrt = rt6_alloc_cow(rt, &fl6->daddr, &fl6->saddr);
-	else if (!(rt->dst.flags & DST_HOST))
+	} else if (!(rt->dst.flags & DST_HOST)) {
 		nrt = rt6_alloc_clone(rt, &fl6->daddr);
-	else
-		goto out2;
+	} else {
+		u32 genid = rt_genid_ipv6(net);
 
+		/* We must refresh rt6i_genid, but only if needed
+		 * to avoid false sharing.
+		 */
+		if (rt->rt6i_genid != genid)
+			rt->rt6i_genid = genid;
+		goto out2;
+	}
 	ip6_rt_put(rt);
 	rt = nrt ? : net->ipv6.ip6_null_entry;
 


* Re: [PATCH net] ipv6: refresh rt6i_genid in ip6_pol_route()
  2014-09-07 19:05                     ` [PATCH net] ipv6: refresh rt6i_genid in ip6_pol_route() Eric Dumazet
@ 2014-09-07 22:54                       ` David Miller
  2014-09-08  4:18                         ` Eric Dumazet
  0 siblings, 1 reply; 57+ messages in thread
From: David Miller @ 2014-09-07 22:54 UTC (permalink / raw)
  To: eric.dumazet; +Cc: nicolas.dichtel, therbert, alexander.h.duyck, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 07 Sep 2014 12:05:03 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> While tracking IPv6 poor performance, I found that IPv6 early demux
> was broken in recent kernels. perf profiles show inet6_sk_rx_dst_set()
> being called for every incoming TCP segment :
> 
>   20.95%      netserver  [kernel.kallsyms]    [k] dst_release     
>   19.33%      netserver  [kernel.kallsyms]    [k] ip6_pol_route
>   11.75%      netserver  [kernel.kallsyms]    [k] inet6_sk_rx_dst_set
>    3.72%      netserver  [kernel.kallsyms]    [k] ip6_input_finish
> 
> Regression came in linux-3.6 with commit 6f3118b571b8 ("ipv6: use
> net->rt_genid to check dst validity")
> 
> When a route found in ip6_pol_route() is cloned (either using
> rt6_alloc_cow() or rt6_alloc_clone()), copy gets an updated rt6i_genid.
> 
> But when original route is selected, we need to refresh its rt6i_genid
> that could be obsolete. If we do not refresh rt6i_genid, ip6_dst_check()
> will fail.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Fixes: 6f3118b571b8 ("ipv6: use net->rt_genid to check dst validity")

This might be broken.

We are dealing here with persistent entries in the ipv6 routing trie.

If you just bump the genid on the next person to look it up, other
sockets and cached entities might not have validated the route yet,
and now will falsely see the route as valid.  We have to ensure that
they too drop this route and perform a relookup.


* Re: [PATCH net] ipv6: refresh rt6i_genid in ip6_pol_route()
  2014-09-07 22:54                       ` David Miller
@ 2014-09-08  4:18                         ` Eric Dumazet
  2014-09-08  4:27                           ` David Miller
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2014-09-08  4:18 UTC (permalink / raw)
  To: David Miller; +Cc: nicolas.dichtel, therbert, alexander.h.duyck, netdev

On Sun, 2014-09-07 at 15:54 -0700, David Miller wrote:

> This might be broken.
> 
> We are dealing here with persistent entries in the ipv6 routing trie.
> 
> If you just bump the genid on the next person to look it up, other
> sockets and cached entities might not have validated the route yet,
> and now will falsely see the route as valid.  We have to ensure that
> they too drop this route and perform a relookup.

I am confused, I thought it was the role of the cookie.

(Ie socket has to store its own cookie to be able to validate a route)

Before the 6f3118b571b8 patch, how was this done anyway?

If persistent routes cannot refresh the genid, then they are useless?


* Re: [PATCH net] ipv6: refresh rt6i_genid in ip6_pol_route()
  2014-09-08  4:18                         ` Eric Dumazet
@ 2014-09-08  4:27                           ` David Miller
  2014-09-08  4:43                             ` Eric Dumazet
  0 siblings, 1 reply; 57+ messages in thread
From: David Miller @ 2014-09-08  4:27 UTC (permalink / raw)
  To: eric.dumazet; +Cc: nicolas.dichtel, therbert, alexander.h.duyck, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 07 Sep 2014 21:18:25 -0700

> On Sun, 2014-09-07 at 15:54 -0700, David Miller wrote:
> 
>> This might be broken.
>> 
>> We are dealing here with persistent entries in the ipv6 routing trie.
>> 
>> If you just bump the genid on the next person to look it up, other
>> sockets and cached entities might not have validated the route yet,
>> and now will falsely see the route as valid.  We have to ensure that
>> they too drop this route and perform a relookup.
> 
> I am confused, I thought it was the role of the cookie.
> 
> (Ie socket has to store its own cookie to be able to validate a route)
> 
> Before 6f3118b571b8 patch, how was this done anyway ?
> 
> If persistent routes cannot refresh the genid, then they are useless ?

I just speak about the genid aspect.

I understand that the cookie (via node->fn_sernum) invalidates the path
in the fib_trie, but the genid protects against other circumstances
(a matching IPsec rule, for example).

You have to make sure all other sockets did a full route lookup
(including IPSEC) before you can safely adjust the genid.

I could be wrong, recheck my analysis please :-)


* Re: [PATCH net] ipv6: refresh rt6i_genid in ip6_pol_route()
  2014-09-08  4:27                           ` David Miller
@ 2014-09-08  4:43                             ` Eric Dumazet
  2014-09-08  4:59                               ` David Miller
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2014-09-08  4:43 UTC (permalink / raw)
  To: David Miller; +Cc: nicolas.dichtel, therbert, alexander.h.duyck, netdev

On Sun, 2014-09-07 at 21:27 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sun, 07 Sep 2014 21:18:25 -0700
> 
> > On Sun, 2014-09-07 at 15:54 -0700, David Miller wrote:
> > 
> >> This might be broken.
> >> 
> >> We are dealing here with persistent entries in the ipv6 routing trie.
> >> 
> >> If you just bump the genid on the next person to look it up, other
> >> sockets and cached entities might not have validated the route yet,
> >> and now will falsely see the route as valid.  We have to ensure that
> >> they too drop this route and perform a relookup.
> > 
> > I am confused, I thought it was the role of the cookie.
> > 
> > (Ie socket has to store its own cookie to be able to validate a route)
> > 
> > Before 6f3118b571b8 patch, how was this done anyway ?
> > 
> > If persistent routes cannot refresh the genid, then they are useless ?
> 
> I just speak about the genid aspect.
> 
> I understand that cookie (via node->fn_sernum) invalidates the path
> in the fib_trie, but the genid protects against other circumstances
> (matching IPSEC rule, f.e.)
> 
> You have to make sure all other sockets did a full route lookup
> (including IPSEC) before you can safely adjust the genid.
> 
> I could be wrong, recheck my analysis please :-)

So this whole genid protection cannot work unless we make sure a
socket cannot share a route with another socket.

This means we have to clone all routes.


* Re: [PATCH net] ipv6: refresh rt6i_genid in ip6_pol_route()
  2014-09-08  4:43                             ` Eric Dumazet
@ 2014-09-08  4:59                               ` David Miller
  2014-09-08  5:07                                 ` Eric Dumazet
  0 siblings, 1 reply; 57+ messages in thread
From: David Miller @ 2014-09-08  4:59 UTC (permalink / raw)
  To: eric.dumazet; +Cc: nicolas.dichtel, therbert, alexander.h.duyck, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 07 Sep 2014 21:43:54 -0700

> On Sun, 2014-09-07 at 21:27 -0700, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Sun, 07 Sep 2014 21:18:25 -0700
>> 
>> > On Sun, 2014-09-07 at 15:54 -0700, David Miller wrote:
>> > 
>> >> This might be broken.
>> >> 
>> >> We are dealing here with persistent entries in the ipv6 routing trie.
>> >> 
>> >> If you just bump the genid on the next person to look it up, other
>> >> sockets and cached entities might not have validated the route yet,
>> >> and now will falsely see the route as valid.  We have to ensure that
>> >> they too drop this route and perform a relookup.
>> > 
>> > I am confused, I thought it was the role of the cookie.
>> > 
>> > (Ie socket has to store its own cookie to be able to validate a route)
>> > 
>> > Before 6f3118b571b8 patch, how was this done anyway ?
>> > 
>> > If persistent routes cannot refresh the genid, then they are useless ?
>> 
>> I just speak about the genid aspect.
>> 
>> I understand that cookie (via node->fn_sernum) invalidates the path
>> in the fib_trie, but the genid protects against other circumstances
>> (matching IPSEC rule, f.e.)
>> 
>> You have to make sure all other sockets did a full route lookup
>> (including IPSEC) before you can safely adjust the genid.
>> 
>> I could be wrong, recheck my analysis please :-)
> 
> So this whole genid protection can not work, unless we make sure a
> socket cannot share a route with another socket.
> 
> This means we have to clone all routes.

I'm willing to revert the change in question if you think that's the
sanest way forward.

The bug fix for more obscure use cases (IPSEC) is pointless if it
breaks more common things (TCP input route caching).

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] ipv6: refresh rt6i_genid in ip6_pol_route()
  2014-09-08  4:59                               ` David Miller
@ 2014-09-08  5:07                                 ` Eric Dumazet
  2014-09-08  8:11                                   ` Nicolas Dichtel
                                                     ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Eric Dumazet @ 2014-09-08  5:07 UTC (permalink / raw)
  To: David Miller; +Cc: nicolas.dichtel, therbert, alexander.h.duyck, netdev

On Sun, 2014-09-07 at 21:59 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sun, 07 Sep 2014 21:43:54 -0700
> 
> > On Sun, 2014-09-07 at 21:27 -0700, David Miller wrote:
> >> From: Eric Dumazet <eric.dumazet@gmail.com>
> >> Date: Sun, 07 Sep 2014 21:18:25 -0700
> >> 
> >> > On Sun, 2014-09-07 at 15:54 -0700, David Miller wrote:
> >> > 
> >> >> This might be broken.
> >> >> 
> >> >> We are dealing here with persistent entries in the ipv6 routing trie.
> >> >> 
> >> >> If you just bump the genid on the next person to look it up, other
> >> >> sockets and cached entities might not have validated the route yet,
> >> >> and now will falsely see the route as valid.  We have to ensure that
> >> >> they too drop this route and perform a relookup.
> >> > 
> >> > I am confused, I thought it was the role of the cookie.
> >> > 
> >> > (Ie socket has to store its own cookie to be able to validate a route)
> >> > 
> >> > Before 6f3118b571b8 patch, how was this done anyway ?
> >> > 
> >> > If persistent routes cannot refresh the genid, then they are useless ?
> >> 
> >> I just speak about the genid aspect.
> >> 
> >> I understand that cookie (via node->fn_sernum) invalidates the path
> >> in the fib_trie, but the genid protects against other circumstances
> >> (matching IPSEC rule, f.e.)
> >> 
> >> You have to make sure all other sockets did a full route lookup
> >> (including IPSEC) before you can safely adjust the genid.
> >> 
> >> I could be wrong, recheck my analysis please :-)
> > 
> > So this whole genid protection can not work, unless we make sure a
> > socket cannot share a route with another socket.
> > 
> > This means we have to clone all routes.
> 
> I'm willing to revert the change in question if you think that's the
> sanest way forward.
> 
> The bug fix for more obscure use cases (IPSEC) is pointless if it
> breaks more common things (TCP input route caching).

Let's wait for input from Nicolas and/or Hannes, they might have some ideas...

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] ipv6: refresh rt6i_genid in ip6_pol_route()
  2014-09-08  5:07                                 ` Eric Dumazet
@ 2014-09-08  8:11                                   ` Nicolas Dichtel
  2014-09-08 10:28                                     ` Eric Dumazet
  2014-09-08 18:48                                   ` Vlad Yasevich
  2014-09-09 12:58                                   ` Hannes Frederic Sowa
  2 siblings, 1 reply; 57+ messages in thread
From: Nicolas Dichtel @ 2014-09-08  8:11 UTC (permalink / raw)
  To: Eric Dumazet, David Miller; +Cc: therbert, alexander.h.duyck, netdev

Le 08/09/2014 07:07, Eric Dumazet a écrit :
> On Sun, 2014-09-07 at 21:59 -0700, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Sun, 07 Sep 2014 21:43:54 -0700
>>
>>> On Sun, 2014-09-07 at 21:27 -0700, David Miller wrote:
>>>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>>> Date: Sun, 07 Sep 2014 21:18:25 -0700
>>>>
>>>>> On Sun, 2014-09-07 at 15:54 -0700, David Miller wrote:
>>>>>
>>>>>> This might be broken.
>>>>>>
>>>>>> We are dealing here with persistent entries in the ipv6 routing trie.
>>>>>>
>>>>>> If you just bump the genid on the next person to look it up, other
>>>>>> sockets and cached entities might not have validated the route yet,
>>>>>> and now will falsely see the route as valid.  We have to ensure that
>>>>>> they too drop this route and perform a relookup.
>>>>>
>>>>> I am confused, I thought it was the role of the cookie.
>>>>>
>>>>> (Ie socket has to store its own cookie to be able to validate a route)
>>>>>
>>>>> Before 6f3118b571b8 patch, how was this done anyway ?
>>>>>
>>>>> If persistent routes cannot refresh the genid, then they are useless ?
>>>>
>>>> I just speak about the genid aspect.
>>>>
>>>> I understand that cookie (via node->fn_sernum) invalidates the path
>>>> in the fib_trie, but the genid protects against other circumstances
>>>> (matching IPSEC rule, f.e.)
>>>>
>>>> You have to make sure all other sockets did a full route lookup
>>>> (including IPSEC) before you can safely adjust the genid.
>>>>
>>>> I could be wrong, recheck my analysis please :-)
>>>
>>> So this whole genid protection can not work, unless we make sure a
>>> socket cannot share a route with another socket.
>>>
>>> This means we have to clone all routes.
>>
>> I'm willing to revert the change in question if you think that's the
>> sanest way forward.
>>
>> The bug fix for more obscure use cases (IPSEC) is pointless if it
>> breaks more common things (TCP input route caching).
>
> Let's wait for input from Nicolas and/or Hannes, they might have some ideas...

The initial problem was in SCTP. Here is the thread after the v1 patch:
http://patchwork.ozlabs.org/patch/182235/

Before the patch, SCTP stored the IPv6 route in its cache and if an IPsec
rule was inserted after that operation, SCTP never invalidated the cached
route to use a new xfrm route.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] ipv6: refresh rt6i_genid in ip6_pol_route()
  2014-09-08  8:11                                   ` Nicolas Dichtel
@ 2014-09-08 10:28                                     ` Eric Dumazet
  2014-09-08 12:16                                       ` Nicolas Dichtel
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2014-09-08 10:28 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: David Miller, therbert, alexander.h.duyck, netdev

On Mon, 2014-09-08 at 10:11 +0200, Nicolas Dichtel wrote:
> Le 08/09/2014 07:07, Eric Dumazet a écrit :
> > On Sun, 2014-09-07 at 21:59 -0700, David Miller wrote:
> >> From: Eric Dumazet <eric.dumazet@gmail.com>
> >> Date: Sun, 07 Sep 2014 21:43:54 -0700
> >>
> >>> On Sun, 2014-09-07 at 21:27 -0700, David Miller wrote:
> >>>> From: Eric Dumazet <eric.dumazet@gmail.com>
> >>>> Date: Sun, 07 Sep 2014 21:18:25 -0700
> >>>>
> >>>>> On Sun, 2014-09-07 at 15:54 -0700, David Miller wrote:
> >>>>>
> >>>>>> This might be broken.
> >>>>>>
> >>>>>> We are dealing here with persistent entries in the ipv6 routing trie.
> >>>>>>
> >>>>>> If you just bump the genid on the next person to look it up, other
> >>>>>> sockets and cached entities might not have validated the route yet,
> >>>>>> and now will falsely see the route as valid.  We have to ensure that
> >>>>>> they too drop this route and perform a relookup.
> >>>>>
> >>>>> I am confused, I thought it was the role of the cookie.
> >>>>>
> >>>>> (Ie socket has to store its own cookie to be able to validate a route)
> >>>>>
> >>>>> Before 6f3118b571b8 patch, how was this done anyway ?
> >>>>>
> >>>>> If persistent routes cannot refresh the genid, then they are useless ?
> >>>>
> >>>> I just speak about the genid aspect.
> >>>>
> >>>> I understand that cookie (via node->fn_sernum) invalidates the path
> >>>> in the fib_trie, but the genid protects against other circumstances
> >>>> (matching IPSEC rule, f.e.)
> >>>>
> >>>> You have to make sure all other sockets did a full route lookup
> >>>> (including IPSEC) before you can safely adjust the genid.
> >>>>
> >>>> I could be wrong, recheck my analysis please :-)
> >>>
> >>> So this whole genid protection can not work, unless we make sure a
> >>> socket cannot share a route with another socket.
> >>>
> >>> This means we have to clone all routes.
> >>
> >> I'm willing to revert the change in question if you think that's the
> >> sanest way forward.
> >>
> >> The bug fix for more obscure use cases (IPSEC) is pointless if it
> >> breaks more common things (TCP input route caching).
> >
> > Let's wait for input from Nicolas and/or Hannes, they might have some ideas...
> 
> The initial problem was in SCTP. Here is the thread after the v1 patch:
> http://patchwork.ozlabs.org/patch/182235/
> 
> Before the patch, SCTP stored the IPv6 route in its cache and if an IPsec
> rule was inserted after that operation, SCTP never invalidated the cached
> route to use a new xfrm route.

This thread mentions output route.

The problem I currently have with IPv6 early demux is for input routes.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] ipv6: refresh rt6i_genid in ip6_pol_route()
  2014-09-08 10:28                                     ` Eric Dumazet
@ 2014-09-08 12:16                                       ` Nicolas Dichtel
  0 siblings, 0 replies; 57+ messages in thread
From: Nicolas Dichtel @ 2014-09-08 12:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, therbert, alexander.h.duyck, netdev

Le 08/09/2014 12:28, Eric Dumazet a écrit :
> On Mon, 2014-09-08 at 10:11 +0200, Nicolas Dichtel wrote:
>> Le 08/09/2014 07:07, Eric Dumazet a écrit :
>>> On Sun, 2014-09-07 at 21:59 -0700, David Miller wrote:
>>>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>>> Date: Sun, 07 Sep 2014 21:43:54 -0700
>>>>
>>>>> On Sun, 2014-09-07 at 21:27 -0700, David Miller wrote:
>>>>>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>>>>> Date: Sun, 07 Sep 2014 21:18:25 -0700
>>>>>>
>>>>>>> On Sun, 2014-09-07 at 15:54 -0700, David Miller wrote:
>>>>>>>
>>>>>>>> This might be broken.
>>>>>>>>
>>>>>>>> We are dealing here with persistent entries in the ipv6 routing trie.
>>>>>>>>
>>>>>>>> If you just bump the genid on the next person to look it up, other
>>>>>>>> sockets and cached entities might not have validated the route yet,
>>>>>>>> and now will falsely see the route as valid.  We have to ensure that
>>>>>>>> they too drop this route and perform a relookup.
>>>>>>>
>>>>>>> I am confused, I thought it was the role of the cookie.
>>>>>>>
>>>>>>> (Ie socket has to store its own cookie to be able to validate a route)
>>>>>>>
>>>>>>> Before 6f3118b571b8 patch, how was this done anyway ?
>>>>>>>
>>>>>>> If persistent routes cannot refresh the genid, then they are useless ?
>>>>>>
>>>>>> I just speak about the genid aspect.
>>>>>>
>>>>>> I understand that cookie (via node->fn_sernum) invalidates the path
>>>>>> in the fib_trie, but the genid protects against other circumstances
>>>>>> (matching IPSEC rule, f.e.)
>>>>>>
>>>>>> You have to make sure all other sockets did a full route lookup
>>>>>> (including IPSEC) before you can safely adjust the genid.
>>>>>>
>>>>>> I could be wrong, recheck my analysis please :-)
>>>>>
>>>>> So this whole genid protection can not work, unless we make sure a
>>>>> socket cannot share a route with another socket.
>>>>>
>>>>> This means we have to clone all routes.
>>>>
>>>> I'm willing to revert the change in question if you think that's the
>>>> sanest way forward.
>>>>
>>>> The bug fix for more obscure use cases (IPSEC) is pointless if it
>>>> breaks more common things (TCP input route caching).
>>>
>>> Let's wait for input from Nicolas and/or Hannes, they might have some ideas...
>>
>> The initial problem was in SCTP. Here is the thread after the v1 patch:
>> http://patchwork.ozlabs.org/patch/182235/
>>
>> Before the patch, SCTP stored the IPv6 route in its cache and if an IPsec
>> rule was inserted after that operation, SCTP never invalidated the cached
>> route to use a new xfrm route.
>
> This thread mentions output route.
Yes, it was the target.

>
> The problem I currently have with IPv6 early demux is for input routes.
It's clearly a regression.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v2 net-next] tcp: remove dst refcount false sharing for prequeue mode
  2014-09-06 15:27             ` Eric Dumazet
  2014-09-06 15:46               ` Eric Dumazet
@ 2014-09-08 15:06               ` Eric Dumazet
  2014-09-08 21:21                 ` David Miller
  1 sibling, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2014-09-08 15:06 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, alexander.h.duyck, netdev

From: Eric Dumazet <edumazet@google.com>

Alexander Duyck reported high false sharing on dst refcount in tcp stack
when prequeue is used. prequeue is the mechanism used when a thread is
blocked in recvmsg()/read() on a TCP socket, using a blocking model
rather than select()/poll()/epoll() non blocking one.

We already try to use RCU in input path as much as possible, but we were
forced to take a refcount on the dst when skb escaped RCU protected
region. When/if the user thread runs on different cpu, dst_release()
will then touch dst refcount again.

Commit 093162553c33 (tcp: force a dst refcount when prequeue packet)
was an example of a race fix.

It turns out the only remaining usage of skb->dst for a packet stored
in a TCP socket prequeue is IP early demux.

We can add logic to detect when IP early demux is probably going
to use skb->dst. Because we do an optimistic check rather than duplicate
existing logic, we need to guard inet_sk_rx_dst_set() and
inet6_sk_rx_dst_set() against using a NULL dst.

Many thanks to Alexander for providing a nice bug report, git bisection,
and reproducer.

Tested using Alexander's script on a 40Gb NIC, 8 RX queues.
Hosts have 24 cores, 48 hyper threads.

echo 0 >/proc/sys/net/ipv4/tcp_autocorking

for i in `seq 0 47`
do
  for j in `seq 0 2`
  do
     netperf -H $DEST -t TCP_STREAM -l 1000 \
             -c -C -T $i,$i -P 0 -- \
             -m 64 -s 64K -D &
  done
done

Before patch : ~6Mpps and ~95% cpu usage on receiver
After patch : ~9Mpps and ~35% cpu usage on receiver.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
v2: remove the skb_dst_drop() before enqueue to backlog, as this is
currently valid only for ESTABLISHED sockets.
 net/ipv4/tcp_ipv4.c |   20 ++++++++++++++++----
 net/ipv6/tcp_ipv6.c |   15 +++++++++------
 2 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 3f9bc3f0bba0..7881b96d2b72 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1559,7 +1559,17 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
 	    skb_queue_len(&tp->ucopy.prequeue) == 0)
 		return false;
 
-	skb_dst_force(skb);
+	/* Before escaping RCU protected region, we need to take care of skb
+	 * dst. Prequeue is only enabled for established sockets.
+	 * For such sockets, we might need the skb dst only to set sk->sk_rx_dst
+	 * Instead of doing a full sk_rx_dst validity check here, let's perform
+	 * an optimistic check.
+	 */
+	if (likely(sk->sk_rx_dst))
+		skb_dst_drop(skb);
+	else
+		skb_dst_force(skb);
+
 	__skb_queue_tail(&tp->ucopy.prequeue, skb);
 	tp->ucopy.memory += skb->truesize;
 	if (tp->ucopy.memory > sk->sk_rcvbuf) {
@@ -1765,9 +1775,11 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
 
-	dst_hold(dst);
-	sk->sk_rx_dst = dst;
-	inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
+	if (dst) {
+		dst_hold(dst);
+		sk->sk_rx_dst = dst;
+		inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
+	}
 }
 EXPORT_SYMBOL(inet_sk_rx_dst_set);
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 5b3c70ff7a72..1835480336ac 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -93,13 +93,16 @@ static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(struct sock *sk,
 static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
-	const struct rt6_info *rt = (const struct rt6_info *)dst;
 
-	dst_hold(dst);
-	sk->sk_rx_dst = dst;
-	inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
-	if (rt->rt6i_node)
-		inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
+	if (dst) {
+		const struct rt6_info *rt = (const struct rt6_info *)dst;
+
+		dst_hold(dst);
+		sk->sk_rx_dst = dst;
+		inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
+		if (rt->rt6i_node)
+			inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
+	}
 }
 
 static void tcp_v6_hash(struct sock *sk)

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH net] ipv6: refresh rt6i_genid in ip6_pol_route()
  2014-09-08  5:07                                 ` Eric Dumazet
  2014-09-08  8:11                                   ` Nicolas Dichtel
@ 2014-09-08 18:48                                   ` Vlad Yasevich
  2014-09-09 12:58                                   ` Hannes Frederic Sowa
  2 siblings, 0 replies; 57+ messages in thread
From: Vlad Yasevich @ 2014-09-08 18:48 UTC (permalink / raw)
  To: Eric Dumazet, David Miller
  Cc: nicolas.dichtel, therbert, alexander.h.duyck, netdev

On 09/08/2014 01:07 AM, Eric Dumazet wrote:
> On Sun, 2014-09-07 at 21:59 -0700, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Sun, 07 Sep 2014 21:43:54 -0700
>>
>>> On Sun, 2014-09-07 at 21:27 -0700, David Miller wrote:
>>>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>>> Date: Sun, 07 Sep 2014 21:18:25 -0700
>>>>
>>>>> On Sun, 2014-09-07 at 15:54 -0700, David Miller wrote:
>>>>>
>>>>>> This might be broken.
>>>>>>
>>>>>> We are dealing here with persistent entries in the ipv6 routing trie.
>>>>>>
>>>>>> If you just bump the genid on the next person to look it up, other
>>>>>> sockets and cached entities might not have validated the route yet,
>>>>>> and now will falsely see the route as valid.  We have to ensure that
>>>>>> they too drop this route and perform a relookup.
>>>>>
>>>>> I am confused, I thought it was the role of the cookie.
>>>>>
>>>>> (Ie socket has to store its own cookie to be able to validate a route)
>>>>>
>>>>> Before 6f3118b571b8 patch, how was this done anyway ?
>>>>>
>>>>> If persistent routes cannot refresh the genid, then they are useless ?
>>>>
>>>> I just speak about the genid aspect.
>>>>
>>>> I understand that cookie (via node->fn_sernum) invalidates the path
>>>> in the fib_trie, but the genid protects against other circumstances
>>>> (matching IPSEC rule, f.e.)
>>>>
>>>> You have to make sure all other sockets did a full route lookup
>>>> (including IPSEC) before you can safely adjust the genid.
>>>>
>>>> I could be wrong, recheck my analysis please :-)
>>>
>>> So this whole genid protection can not work, unless we make sure a
>>> socket cannot share a route with another socket.
>>>
>>> This means we have to clone all routes.
>>
>> I'm willing to revert the change in question if you think that's the
>> sanest way forward.
>>
>> The bug fix for more obscure use cases (IPSEC) is pointless if it
>> breaks more common things (TCP input route caching).
> 
> Let's wait for input from Nicolas and/or Hannes, they might have some ideas...
> 
> 

So, looking at the differences between ipv4 and ipv6, it looks like the
ipv4 genid is only bumped by xfrm, while the ipv6 one is bumped every time
__ipv6_ifa_notify() is called, which happens in a lot of places...

So, my guess is that the ipv4 genid doesn't really grow on non-xfrm
configured systems, while the ipv6 one can grow very fast.  That is
probably why we don't see it with ipv4 and do with ipv6.

-vlad



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 net-next] tcp: remove dst refcount false sharing for prequeue mode
  2014-09-08 15:06               ` [PATCH v2 net-next] tcp: remove dst refcount false sharing for prequeue mode Eric Dumazet
@ 2014-09-08 21:21                 ` David Miller
  2014-09-08 21:30                   ` Eric Dumazet
  0 siblings, 1 reply; 57+ messages in thread
From: David Miller @ 2014-09-08 21:21 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, alexander.h.duyck, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 08 Sep 2014 08:06:07 -0700

> @@ -1559,7 +1559,17 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
>  	    skb_queue_len(&tp->ucopy.prequeue) == 0)
>  		return false;
>  
> -	skb_dst_force(skb);
> +	/* Before escaping RCU protected region, we need to take care of skb
> +	 * dst. Prequeue is only enabled for established sockets.
> +	 * For such sockets, we might need the skb dst only to set sk->sk_rx_dst
> +	 * Instead of doing full sk_rx_dst validity here, let's perform
> +	 * an optimistic check.
> +	 */
> +	if (likely(sk->sk_rx_dst))
> +		skb_dst_drop(skb);
> +	else
> +		skb_dst_force(skb);
> +

This might not be a strong enough test.

We have to also make all of the checks that would cause the input
path to invalidate sk->sk_rx_dst too.

Otherwise, if it does, we'll crash when we try to do a dst_hold()
on skb_dst(skb) in sk->sk_rx_dst_set().

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 net-next] tcp: remove dst refcount false sharing for prequeue mode
  2014-09-08 21:21                 ` David Miller
@ 2014-09-08 21:30                   ` Eric Dumazet
  2014-09-08 22:41                     ` David Miller
  2014-09-09 23:56                     ` David Miller
  0 siblings, 2 replies; 57+ messages in thread
From: Eric Dumazet @ 2014-09-08 21:30 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, alexander.h.duyck, netdev

On Mon, 2014-09-08 at 14:21 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 08 Sep 2014 08:06:07 -0700
> 
> > @@ -1559,7 +1559,17 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
> >  	    skb_queue_len(&tp->ucopy.prequeue) == 0)
> >  		return false;
> >  
> > -	skb_dst_force(skb);
> > +	/* Before escaping RCU protected region, we need to take care of skb
> > +	 * dst. Prequeue is only enabled for established sockets.
> > +	 * For such sockets, we might need the skb dst only to set sk->sk_rx_dst
> > +	 * Instead of doing full sk_rx_dst validity here, let's perform
> > +	 * an optimistic check.
> > +	 */
> > +	if (likely(sk->sk_rx_dst))
> > +		skb_dst_drop(skb);
> > +	else
> > +		skb_dst_force(skb);
> > +
> 
> This might not be a strong enough test.
> 
> We have to also make all of the checks that would cause the input
> path to invalidate sk->sk_rx_dst too.
> 
> Otherwise, if it does, we'll crash when we try to do a dst_hold()
> on skb_dst(skb) in sk->sk_rx_dst_set().

I thought I gave enough details in this comment and changelog. Maybe I
had been too verbose  :(

In the worst case, here is what is happening :

sk_rx_dst is checked and invalidated in tcp_v4_do_rcv()

Next packets coming from prequeue might then have a NULL dst, and
we'll do nothing special (sk_rx_dst will stay NULL), because we do
handle NULL dst properly in inet_sk_rx_dst_set() and
inet6_sk_rx_dst_set() 

But next packet to be processed (either in non prequeue mode or
prequeue) will carry skb->dst and we will set sk->sk_rx_dst

I decided to not copy/paste the tests we do in the family dependent
parts, because it was not worth the pain.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 net-next] tcp: remove dst refcount false sharing for prequeue mode
  2014-09-08 21:30                   ` Eric Dumazet
@ 2014-09-08 22:41                     ` David Miller
  2014-09-09 23:56                     ` David Miller
  1 sibling, 0 replies; 57+ messages in thread
From: David Miller @ 2014-09-08 22:41 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, alexander.h.duyck, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 08 Sep 2014 14:30:42 -0700

> In the worst case, here is what is happening :
> 
> sk_rx_dst is checked and invalidated in tcp_v4_do_rcv()
> 
> Next packets coming from prequeue might then have a NULL dst, and
> we'll do nothing special (sk_rx_dst will stay NULL), because we do
> handle NULL dst properly in inet_sk_rx_dst_set() and
> inet6_sk_rx_dst_set() 
> 
> But next packet to be processed (either in non prequeue mode or
> prequeue) will carry skb->dst and we will set sk->sk_rx_dst
> 
> I decided to not copy/paste the tests we do in the family dependent
> parts, because it was not worth the pain.

I must have mis-read some of these code paths, let me look over it
again, thanks!

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] ipv6: refresh rt6i_genid in ip6_pol_route()
  2014-09-08  5:07                                 ` Eric Dumazet
  2014-09-08  8:11                                   ` Nicolas Dichtel
  2014-09-08 18:48                                   ` Vlad Yasevich
@ 2014-09-09 12:58                                   ` Hannes Frederic Sowa
  2014-09-10  9:31                                     ` [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid Hannes Frederic Sowa
  2 siblings, 1 reply; 57+ messages in thread
From: Hannes Frederic Sowa @ 2014-09-09 12:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, nicolas.dichtel, therbert, alexander.h.duyck, netdev

On So, 2014-09-07 at 22:07 -0700, Eric Dumazet wrote:
> On Sun, 2014-09-07 at 21:59 -0700, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Sun, 07 Sep 2014 21:43:54 -0700
> > 
> > > On Sun, 2014-09-07 at 21:27 -0700, David Miller wrote:
> > >> From: Eric Dumazet <eric.dumazet@gmail.com>
> > >> Date: Sun, 07 Sep 2014 21:18:25 -0700
> > >> 
> > >> > On Sun, 2014-09-07 at 15:54 -0700, David Miller wrote:
> > >> > 
> > >> >> This might be broken.
> > >> >> 
> > >> >> We are dealing here with persistent entries in the ipv6 routing trie.
> > >> >> 
> > >> >> If you just bump the genid on the next person to look it up, other
> > >> >> sockets and cached entities might not have validated the route yet,
> > >> >> and now will falsely see the route as valid.  We have to ensure that
> > >> >> they too drop this route and perform a relookup.
> > >> > 
> > >> > I am confused, I thought it was the role of the cookie.
> > >> > 
> > >> > (Ie socket has to store its own cookie to be able to validate a route)
> > >> > 
> > >> > Before 6f3118b571b8 patch, how was this done anyway ?
> > >> > 
> > >> > If persistent routes cannot refresh the genid, then they are useless ?
> > >> 
> > >> I just speak about the genid aspect.
> > >> 
> > >> I understand that cookie (via node->fn_sernum) invalidates the path
> > >> in the fib_trie, but the genid protects against other circumstances
> > >> (matching IPSEC rule, f.e.)
> > >> 
> > >> You have to make sure all other sockets did a full route lookup
> > >> (including IPSEC) before you can safely adjust the genid.
> > >> 
> > >> I could be wrong, recheck my analysis please :-)
> > > 
> > > So this whole genid protection can not work, unless we make sure a
> > > socket cannot share a route with another socket.
> > > 
> > > This means we have to clone all routes.
> > 
> > I'm willing to revert the change in question if you think that's the
> > sanest way forward.
> > 
> > The bug fix for more obscure use cases (IPSEC) is pointless if it
> > breaks more common things (TCP input route caching).
> 
> Let's wait for input from Nicolas and/or Hannes, they might have some ideas...

My first idea was to remove the rt_genid check in ip6_dst_check completely
and have rt_genid_bump_ipv6() walk the trie to increase fib6_sernum in the
rt6i_nodes. I'll try this.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 net-next] tcp: remove dst refcount false sharing for prequeue mode
  2014-09-08 21:30                   ` Eric Dumazet
  2014-09-08 22:41                     ` David Miller
@ 2014-09-09 23:56                     ` David Miller
  1 sibling, 0 replies; 57+ messages in thread
From: David Miller @ 2014-09-09 23:56 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, alexander.h.duyck, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 08 Sep 2014 14:30:42 -0700

> In the worst case, here is what is happening :
> 
> sk_rx_dst is checked and invalidated in tcp_v4_do_rcv()
> 
> Next packets coming from prequeue might then have a NULL dst, and
> we'll do nothing special (sk_rx_dst will stay NULL), because we do
> handle NULL dst properly in inet_sk_rx_dst_set() and
> inet6_sk_rx_dst_set() 
> 
> But next packet to be processed (either in non prequeue mode or
> prequeue) will carry skb->dst and we will set sk->sk_rx_dst
> 
> I decided to not copy/paste the tests we do in the family dependent
> parts, because it was not worth the pain.

And, now applied.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid
  2014-09-09 12:58                                   ` Hannes Frederic Sowa
@ 2014-09-10  9:31                                     ` Hannes Frederic Sowa
  2014-09-10 13:26                                       ` Vlad Yasevich
  2014-09-10 20:09                                       ` David Miller
  0 siblings, 2 replies; 57+ messages in thread
From: Hannes Frederic Sowa @ 2014-09-10  9:31 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet, Nicolas Dichtel

Some special routes will never be cloned in IPv6. Those are mainly
DST_HOST routes without RTF_NONEXTHOP or RTF_GATEWAY flags, thus mostly
routes handling the input path.

rt_genid depends on rt6_info clones being removed from the trie and
recreated with current rt6i_genid, thus bumping the genid would invalidate
all routes cached in the sockets and relookup will recreate them.  But in
case a non-cloned route ends up in the per-socket cache, it will never
be recreated, thus will never get a current rt_genid.

After the rt_genid is incremented for the first time all those routes
will always get invalidated by ip6_dst_check on the next check, thus
making early socket demultiplexing absolutely pointless for IPv6. It is
not possible to solve this with rt6i_genid, thus remove it.

In case we need to force the sockets to relookup the routes we now
increase the fn_sernum on all fibnodes in the routing tree. This is a
costly operation but should only happen if we have major routing/policy
changes in the kernel (e.g. manual route adding/removal, xfrm policy
changes). This patch also optimizes the walk over the trie a bit: we
don't touch every rt6_info but only the fib6_nodes.

Thanks to Eric Dumazet who noticed this problem.

Fixes: 6f3118b571b8 ("ipv6: use net->rt_genid to check dst validity")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---

Notes:
    I based this patch on net-next, because it "only" fixes a performance
    problem so far, and I want it to get a bit more exposure to testing
    before a possible submission to stable.

 include/net/ip6_fib.h       |  5 ++--
 include/net/net_namespace.h |  4 +++-
 net/core/dst.c              |  8 +++++++
 net/ipv6/ip6_fib.c          | 57 ++++++++++++++++++++++++++++++++++++---------
 net/ipv6/route.c            |  4 ----
 5 files changed, 59 insertions(+), 19 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 9bcb220..a5e09b8 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -119,8 +119,6 @@ struct rt6_info {
 	struct inet6_dev		*rt6i_idev;
 	unsigned long			_rt6i_peer;
 
-	u32				rt6i_genid;
-
 	/* more non-fragment space at head required */
 	unsigned short			rt6i_nfheader_len;
 
@@ -281,7 +279,8 @@ struct fib6_node *fib6_locate(struct fib6_node *root,
 			      const struct in6_addr *daddr, int dst_len,
 			      const struct in6_addr *saddr, int src_len);
 
-void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *arg),
+void fib6_clean_all(struct net *net,
+		    int (*rt_visit)(struct rt6_info *, void *arg),
 		    void *arg);
 
 int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 361d260..e0b0476 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -358,9 +358,11 @@ static inline int rt_genid_ipv6(struct net *net)
 	return atomic_read(&net->ipv6.rt_genid);
 }
 
+extern void (*__rt_genid_bump_ipv6)(struct net *net);
 static inline void rt_genid_bump_ipv6(struct net *net)
 {
-	atomic_inc(&net->ipv6.rt_genid);
+	if (__rt_genid_bump_ipv6)
+		__rt_genid_bump_ipv6(net);
 }
 #else
 static inline int rt_genid_ipv6(struct net *net)
diff --git a/net/core/dst.c b/net/core/dst.c
index a028409..995183b 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -23,6 +23,14 @@
 
 #include <net/dst.h>
 
+/* as soon as the ipv6 module registers, this function is used to
+ * force all cached routing lookups to relookup
+ */
+#if IS_ENABLED(CONFIG_IPV6)
+void (*__rt_genid_bump_ipv6)(struct net *net);
+EXPORT_SYMBOL(__rt_genid_bump_ipv6);
+#endif
+
 /*
  * Theory of operations:
  * 1) We use a list, protected by a spinlock, to add
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 76b7f5e..d23b76f 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -106,10 +106,15 @@ static inline void fib6_walker_unlink(struct fib6_walker_t *w)
 }
 static __inline__ u32 fib6_new_sernum(void)
 {
-	u32 n = ++rt_sernum;
-	if ((__s32)n <= 0)
-		rt_sernum = n = 1;
-	return n;
+	u32 new, old;
+
+	do {
+		old = ACCESS_ONCE(rt_sernum);
+		new = old + 1;
+		if ((s32)new <= 0)
+			new = 1;
+	} while (cmpxchg(&rt_sernum, old, new) != old);
+	return new;
 }
 
 /*
@@ -1553,25 +1558,28 @@ static int fib6_clean_node(struct fib6_walker_t *w)
  */
 
 static void fib6_clean_tree(struct net *net, struct fib6_node *root,
-			    int (*func)(struct rt6_info *, void *arg),
+			    int (*node_visit)(struct fib6_walker_t *),
+			    int (*rt_visit)(struct rt6_info *, void *arg),
 			    int prune, void *arg)
 {
 	struct fib6_cleaner_t c;
 
 	c.w.root = root;
-	c.w.func = fib6_clean_node;
+	c.w.func = node_visit;
 	c.w.prune = prune;
 	c.w.count = 0;
 	c.w.skip = 0;
-	c.func = func;
+	c.func = rt_visit;
 	c.arg = arg;
 	c.net = net;
 
 	fib6_walk(&c.w);
 }
 
-void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *arg),
-		    void *arg)
+static void __fib6_clean_all(struct net *net,
+			     int (*node_visit)(struct fib6_walker_t *),
+			     int (*rt_visit)(struct rt6_info *, void *arg),
+			     void *arg)
 {
 	struct fib6_table *table;
 	struct hlist_head *head;
@@ -1583,13 +1591,20 @@ void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *arg),
 		hlist_for_each_entry_rcu(table, head, tb6_hlist) {
 			write_lock_bh(&table->tb6_lock);
 			fib6_clean_tree(net, &table->tb6_root,
-					func, 0, arg);
+					node_visit, rt_visit, 0, arg);
 			write_unlock_bh(&table->tb6_lock);
 		}
 	}
 	rcu_read_unlock();
 }
 
+void fib6_clean_all(struct net *net,
+		    int (*rt_visit)(struct rt6_info *, void *arg),
+		    void *arg)
+{
+	__fib6_clean_all(net, fib6_clean_node, rt_visit, arg);
+}
+
 static int fib6_prune_clone(struct rt6_info *rt, void *arg)
 {
 	if (rt->rt6i_flags & RTF_CACHE) {
@@ -1602,7 +1617,25 @@ static int fib6_prune_clone(struct rt6_info *rt, void *arg)
 
 static void fib6_prune_clones(struct net *net, struct fib6_node *fn)
 {
-	fib6_clean_tree(net, fn, fib6_prune_clone, 1, NULL);
+	fib6_clean_tree(net, fn, fib6_clean_node, fib6_prune_clone, 1, NULL);
+}
+
+static int fib6_upd_sernum(struct fib6_walker_t *w)
+{
+	struct fib6_cleaner_t *c = container_of(w, struct fib6_cleaner_t, w);
+	int *sernum = c->arg;
+
+	w->node->fn_sernum = *sernum;
+	w->leaf = NULL;
+	return 0;
+}
+
+static void fib6_genid_bump(struct net *net)
+{
+	int sernum;
+
+	sernum = fib6_new_sernum();
+	__fib6_clean_all(net, fib6_upd_sernum, NULL, &sernum);
 }
 
 /*
@@ -1788,6 +1821,8 @@ int __init fib6_init(void)
 			      NULL);
 	if (ret)
 		goto out_unregister_subsys;
+
+	__rt_genid_bump_ipv6 = fib6_genid_bump;
 out:
 	return ret;
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f74b041..a318dd89 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -314,7 +314,6 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
 
 		memset(dst + 1, 0, sizeof(*rt) - sizeof(*dst));
 		rt6_init_peer(rt, table ? &table->tb6_peers : net->ipv6.peers);
-		rt->rt6i_genid = rt_genid_ipv6(net);
 		INIT_LIST_HEAD(&rt->rt6i_siblings);
 	}
 	return rt;
@@ -1096,9 +1095,6 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 	 * DST_OBSOLETE_FORCE_CHK which forces validation calls down
 	 * into this function always.
 	 */
-	if (rt->rt6i_genid != rt_genid_ipv6(dev_net(rt->dst.dev)))
-		return NULL;
-
 	if (!rt->rt6i_node || (rt->rt6i_node->fn_sernum != cookie))
 		return NULL;
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid
  2014-09-10  9:31                                     ` [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid Hannes Frederic Sowa
@ 2014-09-10 13:26                                       ` Vlad Yasevich
  2014-09-10 13:42                                         ` Hannes Frederic Sowa
  2014-09-10 20:09                                       ` David Miller
  1 sibling, 1 reply; 57+ messages in thread
From: Vlad Yasevich @ 2014-09-10 13:26 UTC (permalink / raw)
  To: Hannes Frederic Sowa, netdev; +Cc: Eric Dumazet, Nicolas Dichtel

On 09/10/2014 05:31 AM, Hannes Frederic Sowa wrote:
> Some special routes will never be cloned in IPv6. Those are mainly
> DST_HOST routes without RTF_NONEXTHOP or RTF_GATEWAY flags, thus mostly
> routes handling the input path.
> 
> rt_genid depends on rt6_info clones being removed from the trie and
> recreated with current rt6i_genid, thus bumping the genid would invalidate
> all routes cached in the sockets and relookup will recreate them.  But in
> case a non-cloned route ends up in the per-socket cache, it will never
> be recreated, thus will never get a current rt_genid.
> 
> After the rt_genid is incremented for the first time all those routes
> will always get invalidated by ip6_dst_check on the next check, thus
> making early socket demultiplexing absolutely pointless for IPv6. It is
> not possible to solve this with rt6i_genid, thus remove it.
> 
> In case we need to force the sockets to relookup the routes we now
> increase the fn_sernum on all fibnodes in the routing tree. This is a
> costly operation but should only happen if we have major routing/policy
> changes in the kernel (e.g. manual route adding/removal, xfrm policy
> changes). This patch also optimizes the walk over the trie a bit: we
> don't touch every rt6_info but only the fib6_nodes.
> 
> Thanks to Eric Dumazet who noticed this problem.

Hi Hannes

I also noticed that ipv6_ifa_notify() is called a lot with event being
0.  This only triggers an rtnl notification and no routing table
changes, but it does trigger a call to bump the gen_id, which now
performs a rather expensive clean-table operation.

In particular the loop in manage_tempaddrs() is very scary as it can
bump the rev multiple times.

I think the genid bump in __ipv6_ifa_notify() should only happen
if there was an actual address change.

-vlad

> 
> Fixes: 6f3118b571b8 ("ipv6: use net->rt_genid to check dst validity")
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> ---
> 
> Notes:
>     I based this patch on net-next, because it "only" fixes a performance
>     problem so far, and I want it to get a bit more exposure to testing
>     before a possible submission to stable.
> 
>  include/net/ip6_fib.h       |  5 ++--
>  include/net/net_namespace.h |  4 +++-
>  net/core/dst.c              |  8 +++++++
>  net/ipv6/ip6_fib.c          | 57 ++++++++++++++++++++++++++++++++++++---------
>  net/ipv6/route.c            |  4 ----
>  5 files changed, 59 insertions(+), 19 deletions(-)
> 
> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> index 9bcb220..a5e09b8 100644
> --- a/include/net/ip6_fib.h
> +++ b/include/net/ip6_fib.h
> @@ -119,8 +119,6 @@ struct rt6_info {
>  	struct inet6_dev		*rt6i_idev;
>  	unsigned long			_rt6i_peer;
>  
> -	u32				rt6i_genid;
> -
>  	/* more non-fragment space at head required */
>  	unsigned short			rt6i_nfheader_len;
>  
> @@ -281,7 +279,8 @@ struct fib6_node *fib6_locate(struct fib6_node *root,
>  			      const struct in6_addr *daddr, int dst_len,
>  			      const struct in6_addr *saddr, int src_len);
>  
> -void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *arg),
> +void fib6_clean_all(struct net *net,
> +		    int (*rt_visit)(struct rt6_info *, void *arg),
>  		    void *arg);
>  
>  int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index 361d260..e0b0476 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -358,9 +358,11 @@ static inline int rt_genid_ipv6(struct net *net)
>  	return atomic_read(&net->ipv6.rt_genid);
>  }
>  
> +extern void (*__rt_genid_bump_ipv6)(struct net *net);
>  static inline void rt_genid_bump_ipv6(struct net *net)
>  {
> -	atomic_inc(&net->ipv6.rt_genid);
> +	if (__rt_genid_bump_ipv6)
> +		__rt_genid_bump_ipv6(net);
>  }
>  #else
>  static inline int rt_genid_ipv6(struct net *net)
> diff --git a/net/core/dst.c b/net/core/dst.c
> index a028409..995183b 100644
> --- a/net/core/dst.c
> +++ b/net/core/dst.c
> @@ -23,6 +23,14 @@
>  
>  #include <net/dst.h>
>  
> +/* as soon as the ipv6 module registers, this function is used to
> + * force all cached routing lookups to relookup
> + */
> +#if IS_ENABLED(CONFIG_IPV6)
> +void (*__rt_genid_bump_ipv6)(struct net *net);
> +EXPORT_SYMBOL(__rt_genid_bump_ipv6);
> +#endif
> +
>  /*
>   * Theory of operations:
>   * 1) We use a list, protected by a spinlock, to add
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index 76b7f5e..d23b76f 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -106,10 +106,15 @@ static inline void fib6_walker_unlink(struct fib6_walker_t *w)
>  }
>  static __inline__ u32 fib6_new_sernum(void)
>  {
> -	u32 n = ++rt_sernum;
> -	if ((__s32)n <= 0)
> -		rt_sernum = n = 1;
> -	return n;
> +	u32 new, old;
> +
> +	do {
> +		old = ACCESS_ONCE(rt_sernum);
> +		new = old + 1;
> +		if ((s32)new <= 0)
> +			new = 1;
> +	} while (cmpxchg(&rt_sernum, old, new) != old);
> +	return new;
>  }
>  
>  /*
> @@ -1553,25 +1558,28 @@ static int fib6_clean_node(struct fib6_walker_t *w)
>   */
>  
>  static void fib6_clean_tree(struct net *net, struct fib6_node *root,
> -			    int (*func)(struct rt6_info *, void *arg),
> +			    int (*node_visit)(struct fib6_walker_t *),
> +			    int (*rt_visit)(struct rt6_info *, void *arg),
>  			    int prune, void *arg)
>  {
>  	struct fib6_cleaner_t c;
>  
>  	c.w.root = root;
> -	c.w.func = fib6_clean_node;
> +	c.w.func = node_visit;
>  	c.w.prune = prune;
>  	c.w.count = 0;
>  	c.w.skip = 0;
> -	c.func = func;
> +	c.func = rt_visit;
>  	c.arg = arg;
>  	c.net = net;
>  
>  	fib6_walk(&c.w);
>  }
>  
> -void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *arg),
> -		    void *arg)
> +static void __fib6_clean_all(struct net *net,
> +			     int (*node_visit)(struct fib6_walker_t *),
> +			     int (*rt_visit)(struct rt6_info *, void *arg),
> +			     void *arg)
>  {
>  	struct fib6_table *table;
>  	struct hlist_head *head;
> @@ -1583,13 +1591,20 @@ void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *arg),
>  		hlist_for_each_entry_rcu(table, head, tb6_hlist) {
>  			write_lock_bh(&table->tb6_lock);
>  			fib6_clean_tree(net, &table->tb6_root,
> -					func, 0, arg);
> +					node_visit, rt_visit, 0, arg);
>  			write_unlock_bh(&table->tb6_lock);
>  		}
>  	}
>  	rcu_read_unlock();
>  }
>  
> +void fib6_clean_all(struct net *net,
> +		    int (*rt_visit)(struct rt6_info *, void *arg),
> +		    void *arg)
> +{
> +	__fib6_clean_all(net, fib6_clean_node, rt_visit, arg);
> +}
> +
>  static int fib6_prune_clone(struct rt6_info *rt, void *arg)
>  {
>  	if (rt->rt6i_flags & RTF_CACHE) {
> @@ -1602,7 +1617,25 @@ static int fib6_prune_clone(struct rt6_info *rt, void *arg)
>  
>  static void fib6_prune_clones(struct net *net, struct fib6_node *fn)
>  {
> -	fib6_clean_tree(net, fn, fib6_prune_clone, 1, NULL);
> +	fib6_clean_tree(net, fn, fib6_clean_node, fib6_prune_clone, 1, NULL);
> +}
> +
> +static int fib6_upd_sernum(struct fib6_walker_t *w)
> +{
> +	struct fib6_cleaner_t *c = container_of(w, struct fib6_cleaner_t, w);
> +	int *sernum = c->arg;
> +
> +	w->node->fn_sernum = *sernum;
> +	w->leaf = NULL;
> +	return 0;
> +}
> +
> +static void fib6_genid_bump(struct net *net)
> +{
> +	int sernum;
> +
> +	sernum = fib6_new_sernum();
> +	__fib6_clean_all(net, fib6_upd_sernum, NULL, &sernum);
>  }
>  
>  /*
> @@ -1788,6 +1821,8 @@ int __init fib6_init(void)
>  			      NULL);
>  	if (ret)
>  		goto out_unregister_subsys;
> +
> +	__rt_genid_bump_ipv6 = fib6_genid_bump;
>  out:
>  	return ret;
>  
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index f74b041..a318dd89 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -314,7 +314,6 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
>  
>  		memset(dst + 1, 0, sizeof(*rt) - sizeof(*dst));
>  		rt6_init_peer(rt, table ? &table->tb6_peers : net->ipv6.peers);
> -		rt->rt6i_genid = rt_genid_ipv6(net);
>  		INIT_LIST_HEAD(&rt->rt6i_siblings);
>  	}
>  	return rt;
> @@ -1096,9 +1095,6 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
>  	 * DST_OBSOLETE_FORCE_CHK which forces validation calls down
>  	 * into this function always.
>  	 */
> -	if (rt->rt6i_genid != rt_genid_ipv6(dev_net(rt->dst.dev)))
> -		return NULL;
> -
>  	if (!rt->rt6i_node || (rt->rt6i_node->fn_sernum != cookie))
>  		return NULL;
>  
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid
  2014-09-10 13:26                                       ` Vlad Yasevich
@ 2014-09-10 13:42                                         ` Hannes Frederic Sowa
  0 siblings, 0 replies; 57+ messages in thread
From: Hannes Frederic Sowa @ 2014-09-10 13:42 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: netdev, Eric Dumazet, Nicolas Dichtel

Hi Vlad,

On Mi, 2014-09-10 at 09:26 -0400, Vlad Yasevich wrote:
> On 09/10/2014 05:31 AM, Hannes Frederic Sowa wrote:
> > Some special routes will never be cloned in IPv6. Those are mainly
> > DST_HOST routes without RTF_NONEXTHOP or RTF_GATEWAY flags, thus mostly
> > routes handling the input path.
> > 
> > rt_genid depends on rt6_info clones being removed from the trie and
> > recreated with current rt6i_genid, thus bumping the genid would invalidate
> > all routes cached in the sockets and relookup will recreate them.  But in
> > case a non-cloned route ends up in the per-socket cache, it will never
> > be recreated, thus will never get a current rt_genid.
> > 
> > After the rt_genid is incremented for the first time all those routes
> > will always get invalidated by ip6_dst_check on the next check, thus
> > making early socket demultiplexing absolutely pointless for IPv6. It is
> > not possible to solve this with rt6i_genid, thus remove it.
> > 
> > In case we need to force the sockets to relookup the routes we now
> > increase the fn_sernum on all fibnodes in the routing tree. This is a
> > costly operation but should only happen if we have major routing/policy
> > changes in the kernel (e.g. manual route adding/removal, xfrm policy
> > changes). This patch also optimizes the walk over the trie a bit: we
> > don't touch every rt6_info but only the fib6_nodes.
> > 
> > Thanks to Eric Dumazet who noticed this problem.
>
> I also noticed that ipv6_ifa_notify() is called a lot with event being
> 0.  This only triggers an rtnl notification and no routing table
> changes, but it does trigger a call to bump the gen_id, which now
> performs a rather expensive clean-table operation.
> 
> In particular the loop in manage_tempaddrs() is very scary as it can
> bump the rev multiple times.
> 
> I think the genid bump in __ipv6_ifa_notify() should only happen
> if there was an actual address change.

Yes, maybe we don't even need to bump the id in ipv6_ifa_notify() at
all; I am still investigating. :)

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid
  2014-09-10  9:31                                     ` [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid Hannes Frederic Sowa
  2014-09-10 13:26                                       ` Vlad Yasevich
@ 2014-09-10 20:09                                       ` David Miller
  2014-09-11  8:30                                         ` Hannes Frederic Sowa
  2014-09-11 12:05                                         ` Hannes Frederic Sowa
  1 sibling, 2 replies; 57+ messages in thread
From: David Miller @ 2014-09-10 20:09 UTC (permalink / raw)
  To: hannes; +Cc: netdev, eric.dumazet, nicolas.dichtel

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Wed, 10 Sep 2014 11:31:28 +0200

> In case we need to force the sockets to relookup the routes we now
> increase the fn_sernum on all fibnodes in the routing tree. This is a
> costly operation but should only happen if we have major routing/policy
> changes in the kernel (e.g. manual route adding/removal, xfrm policy
> changes).

Core routers can process thousands of route updates per second, and they
do this via what you refer to as "manual route adding/removal".

I don't think we want to put such a scalability problem into the tree.

There has to be a lightweight way to address this.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid
  2014-09-10 20:09                                       ` David Miller
@ 2014-09-11  8:30                                         ` Hannes Frederic Sowa
  2014-09-11 12:22                                           ` Vlad Yasevich
  2014-09-11 12:05                                         ` Hannes Frederic Sowa
  1 sibling, 1 reply; 57+ messages in thread
From: Hannes Frederic Sowa @ 2014-09-11  8:30 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, eric.dumazet, nicolas.dichtel

On Mi, 2014-09-10 at 13:09 -0700, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Wed, 10 Sep 2014 11:31:28 +0200
> 
> > In case we need to force the sockets to relookup the routes we now
> > increase the fn_sernum on all fibnodes in the routing tree. This is a
> > costly operation but should only happen if we have major routing/policy
> > changes in the kernel (e.g. manual route adding/removal, xfrm policy
> > changes).
> 
> Core routers can process thousands of route updates per second, and they
> do this via what you refer to as "manual route adding/removal".

Sorry, I was too unspecific here. I meant route changes caused by
address removal/addition on the local stack.

The reason why we do the bump_id here is that we want to flush all the
socket caches in case we have either lost a source address or gained
access to a new one.

If you think about e.g. BGP routers which update lots of routes, they
aren't affected and the flush won't happen on every route change.

> I don't think we want to put such a scalability problem into the tree.
> 
> There has to be a lightweight way to address this.

I am still investigating why this bump_id actually happened. It seems
the reason is only SCTP on top of IPv6, and maybe we can build something
much more lightweight, yes.

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid
  2014-09-10 20:09                                       ` David Miller
  2014-09-11  8:30                                         ` Hannes Frederic Sowa
@ 2014-09-11 12:05                                         ` Hannes Frederic Sowa
  2014-09-11 14:19                                           ` Vlad Yasevich
  1 sibling, 1 reply; 57+ messages in thread
From: Hannes Frederic Sowa @ 2014-09-11 12:05 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, eric.dumazet, nicolas.dichtel

On Mi, 2014-09-10 at 13:09 -0700, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Wed, 10 Sep 2014 11:31:28 +0200
> 
> > In case we need to force the sockets to relookup the routes we now
> > increase the fn_sernum on all fibnodes in the routing tree. This is a
> > costly operation but should only happen if we have major routing/policy
> > changes in the kernel (e.g. manual route adding/removal, xfrm policy
> > changes).
> 
> > Core routers can process thousands of route updates per second, and they
> do this via what you refer to as "manual route adding/removal".
> 
> I don't think we want to put such a scalability problem into the tree.
> 
> There has to be a lightweight way to address this.

Here is an alternative approach that avoids traversing the routing
table; the downside is that each newly inserted route (even cached-only
ones) might bump all other routes out of the per-socket caches:

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 9bcb220..a7e45b9 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -119,8 +119,6 @@ struct rt6_info {
 	struct inet6_dev		*rt6i_idev;
 	unsigned long			_rt6i_peer;
 
-	u32				rt6i_genid;
-
 	/* more non-fragment space at head required */
 	unsigned short			rt6i_nfheader_len;
 
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 361d260..428fdcb 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -358,18 +358,28 @@ static inline int rt_genid_ipv6(struct net *net)
 	return atomic_read(&net->ipv6.rt_genid);
 }
 
-static inline void rt_genid_bump_ipv6(struct net *net)
+static inline int rt_genid_bump_ipv6(struct net *net)
 {
-	atomic_inc(&net->ipv6.rt_genid);
+	int new, old;
+
+	do {
+		old = atomic_read(&net->ipv6.rt_genid);
+		new = old + 1;
+		if (new <= 0)
+			new = 1;
+	} while (atomic_cmpxchg(&net->ipv6.rt_genid, old, new) != old);
+	return new;
+
 }
 #else
 static inline int rt_genid_ipv6(struct net *net)
 {
-	return 0;
+	return 1;
 }
 
-static inline void rt_genid_bump_ipv6(struct net *net)
+static inline int rt_genid_bump_ipv6(struct net *net)
 {
+	return 1;
 }
 #endif
 
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 76b7f5e..4a2f130 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -84,7 +84,10 @@ static int fib6_walk_continue(struct fib6_walker_t *w);
  *	result of redirects, path MTU changes, etc.
  */
 
-static __u32 rt_sernum;
+static int fib6_new_sernum(struct net *net)
+{
+	return rt_genid_bump_ipv6(net);
+}
 
 static void fib6_gc_timer_cb(unsigned long arg);
 
@@ -104,13 +107,6 @@ static inline void fib6_walker_unlink(struct fib6_walker_t *w)
 	list_del(&w->lh);
 	write_unlock_bh(&fib6_walker_lock);
 }
-static __inline__ u32 fib6_new_sernum(void)
-{
-	u32 n = ++rt_sernum;
-	if ((__s32)n <= 0)
-		rt_sernum = n = 1;
-	return n;
-}
 
 /*
  *	Auxiliary address test functions for the radix tree.
@@ -421,16 +417,15 @@ out:
  */
 
 static struct fib6_node *fib6_add_1(struct fib6_node *root,
-				     struct in6_addr *addr, int plen,
-				     int offset, int allow_create,
-				     int replace_required)
+				    struct in6_addr *addr, int plen,
+				    int offset, int allow_create,
+				    int replace_required, int sernum)
 {
 	struct fib6_node *fn, *in, *ln;
 	struct fib6_node *pn = NULL;
 	struct rt6key *key;
 	int	bit;
 	__be32	dir = 0;
-	__u32	sernum = fib6_new_sernum();
 
 	RT6_TRACE("fib6_add_1\n");
 
@@ -844,6 +839,7 @@ void fib6_force_start_gc(struct net *net)
 int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
 	     struct nlattr *mx, int mx_len)
 {
+	struct net *net = dev_net(rt->dst.dev);
 	struct fib6_node *fn, *pn = NULL;
 	int err = -ENOMEM;
 	int allow_create = 1;
@@ -860,7 +856,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
 
 	fn = fib6_add_1(root, &rt->rt6i_dst.addr, rt->rt6i_dst.plen,
 			offsetof(struct rt6_info, rt6i_dst), allow_create,
-			replace_required);
+			replace_required, fib6_new_sernum(net));
 	if (IS_ERR(fn)) {
 		err = PTR_ERR(fn);
 		fn = NULL;
@@ -894,14 +890,15 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
 			sfn->leaf = info->nl_net->ipv6.ip6_null_entry;
 			atomic_inc(&info->nl_net->ipv6.ip6_null_entry->rt6i_ref);
 			sfn->fn_flags = RTN_ROOT;
-			sfn->fn_sernum = fib6_new_sernum();
+			sfn->fn_sernum = fib6_new_sernum(net);
 
 			/* Now add the first leaf node to new subtree */
 
 			sn = fib6_add_1(sfn, &rt->rt6i_src.addr,
 					rt->rt6i_src.plen,
 					offsetof(struct rt6_info, rt6i_src),
-					allow_create, replace_required);
+					allow_create, replace_required,
+					fib6_new_sernum(net));
 
 			if (IS_ERR(sn)) {
 				/* If it is failed, discard just allocated
@@ -920,7 +917,8 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
 			sn = fib6_add_1(fn->subtree, &rt->rt6i_src.addr,
 					rt->rt6i_src.plen,
 					offsetof(struct rt6_info, rt6i_src),
-					allow_create, replace_required);
+					allow_create, replace_required,
+					fib6_new_sernum(net));
 
 			if (IS_ERR(sn)) {
 				err = PTR_ERR(sn);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f74b041..54b7d81 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -314,7 +314,6 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
 
 		memset(dst + 1, 0, sizeof(*rt) - sizeof(*dst));
 		rt6_init_peer(rt, table ? &table->tb6_peers : net->ipv6.peers);
-		rt->rt6i_genid = rt_genid_ipv6(net);
 		INIT_LIST_HEAD(&rt->rt6i_siblings);
 	}
 	return rt;
@@ -1096,10 +1095,7 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 	 * DST_OBSOLETE_FORCE_CHK which forces validation calls down
 	 * into this function always.
 	 */
-	if (rt->rt6i_genid != rt_genid_ipv6(dev_net(rt->dst.dev)))
-		return NULL;
-
-	if (!rt->rt6i_node || (rt->rt6i_node->fn_sernum != cookie))
+	if (!rt->rt6i_node || rt_genid_ipv6(dev_net(rt->dst.dev)) != cookie)
 		return NULL;
 
 	if (rt6_check_expired(rt))

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid
  2014-09-11  8:30                                         ` Hannes Frederic Sowa
@ 2014-09-11 12:22                                           ` Vlad Yasevich
  2014-09-11 12:40                                             ` Hannes Frederic Sowa
  0 siblings, 1 reply; 57+ messages in thread
From: Vlad Yasevich @ 2014-09-11 12:22 UTC (permalink / raw)
  To: Hannes Frederic Sowa, David Miller; +Cc: netdev, eric.dumazet, nicolas.dichtel

On 09/11/2014 04:30 AM, Hannes Frederic Sowa wrote:
> On Mi, 2014-09-10 at 13:09 -0700, David Miller wrote:
>> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> Date: Wed, 10 Sep 2014 11:31:28 +0200
>>
>>> In case we need to force the sockets to relookup the routes we now
>>> increase the fn_sernum on all fibnodes in the routing tree. This is a
>>> costly operation but should only happen if we have major routing/policy
>>> changes in the kernel (e.g. manual route adding/removal, xfrm policy
>>> changes).
>>
>> Core routers can process thousands of route updates per second, and they
>> do this via what you refer to as "manual route adding/removal".
> 
> Sorry, I was too unspecific here. I meant route changes caused by
> address removal/addition on the local stack.
> 
> The reason why we do the bump_id here is that we want to flush all the
> socket caches in case we have either lost a source address or gained
> access to a new one.
> 
> If you think about e.g. BGP routers which update lots of routes, they
> aren't affected and the flush won't happen on every route change.
> 
>> I don't think we want to put such a scalability problem into the tree.
>>
>> There has to be a lightweight way to address this.
> 
> I am still investigating why this bump_id actually happened. It seems
> the reason is only SCTP on top of IPv6, and maybe we can build something
> much more lightweight, yes.

No.  It was proven that a regular TCP socket could continue sending
traffic using an address that was removed.

-vlad
> 
> Thanks,
> Hannes
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid
  2014-09-11 12:22                                           ` Vlad Yasevich
@ 2014-09-11 12:40                                             ` Hannes Frederic Sowa
  0 siblings, 0 replies; 57+ messages in thread
From: Hannes Frederic Sowa @ 2014-09-11 12:40 UTC (permalink / raw)
  To: Vlad Yasevich, David Miller; +Cc: netdev, eric.dumazet, nicolas.dichtel

On Thu, Sep 11, 2014, at 14:22, Vlad Yasevich wrote:
> On 09/11/2014 04:30 AM, Hannes Frederic Sowa wrote:
> > On Mi, 2014-09-10 at 13:09 -0700, David Miller wrote:
> >> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> >> Date: Wed, 10 Sep 2014 11:31:28 +0200
> >>
> >>> In case we need to force the sockets to relookup the routes we now
> >>> increase the fn_sernum on all fibnodes in the routing tree. This is a
> >>> costly operation but should only happen if we have major routing/policy
> >>> changes in the kernel (e.g. manual route adding/removal, xfrm policy
> >>> changes).
> >>
> >> Core routers can update thousands of route updates per second, and they
> >> do this via what you refer to as "manual route adding/removal".
> > 
> > Sorry, I was too unspecific here. Route changes because of address
> > removal/addition on the local stack.
> > 
> > The reason why we do the bump_id here is that we want to flush all the
> > socket caches in case we have either lost or gained access to a new
> > source address.
> > 
> > If you think about e.g. BGP routers which update lots of routes, they
> > aren't affected and the flush won't happen on every route change.
> > 
> >> I don't think we want to put such a scalability problem into the tree.
> >>
> >> There has to be a lightweight way to address this.
> > 
> > I am still investigating why this bump_id actually happened. Seems the
> > reason is only sctp on top of IPv6, and maybe we can build something much
> > more lightweight, yes.
> 
> No.  It was proven that a regular TCP socket could continue sending
> traffic using an address that was removed.

I have seen the original discussion regarding SCTP. I still think the
reason for this is that we don't update fn_sernum when pruning clones
from the trie.

Bye,
Hannes


* Re: [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid
  2014-09-11 12:05                                         ` Hannes Frederic Sowa
@ 2014-09-11 14:19                                           ` Vlad Yasevich
  2014-09-11 14:32                                             ` Hannes Frederic Sowa
  0 siblings, 1 reply; 57+ messages in thread
From: Vlad Yasevich @ 2014-09-11 14:19 UTC (permalink / raw)
  To: Hannes Frederic Sowa, David Miller; +Cc: netdev, eric.dumazet, nicolas.dichtel

On 09/11/2014 08:05 AM, Hannes Frederic Sowa wrote:
> On Mi, 2014-09-10 at 13:09 -0700, David Miller wrote:
>> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> Date: Wed, 10 Sep 2014 11:31:28 +0200
>>
>>> In case we need to force the sockets to relookup the routes we now
>>> increase the fn_sernum on all fibnodes in the routing tree. This is a
>>> costly operation but should only happen if we have major routing/policy
>>> changes in the kernel (e.g. manual route adding/removal, xfrm policy
>>> changes).
>>
>> Core routers can update thousands of route updates per second, and they
>> do this via what you refer to as "manual route adding/removal".
>>
>> I don't think we want to put such a scalability problem into the tree.
>>
>> There has to be a lightweight way to address this.
> 
> An alternative approach without traversing the routing table, but each
> newly inserted route (even only cached ones) might bump all other routes
> out of the per-socket caches:
> 
> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> index 9bcb220..a7e45b9 100644
> --- a/include/net/ip6_fib.h
> +++ b/include/net/ip6_fib.h
> @@ -119,8 +119,6 @@ struct rt6_info {
>  	struct inet6_dev		*rt6i_idev;
>  	unsigned long			_rt6i_peer;
>  
> -	u32				rt6i_genid;
> -
>  	/* more non-fragment space at head required */
>  	unsigned short			rt6i_nfheader_len;
>  
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index 361d260..428fdcb 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -358,18 +358,28 @@ static inline int rt_genid_ipv6(struct net *net)
>  	return atomic_read(&net->ipv6.rt_genid);
>  }
>  
> -static inline void rt_genid_bump_ipv6(struct net *net)
> +static inline int rt_genid_bump_ipv6(struct net *net)
>  {
> -	atomic_inc(&net->ipv6.rt_genid);
> +	int new, old;
> +
> +	do {
> +		old = atomic_read(&net->ipv6.rt_genid);
> +		new = old + 1;
> +		if (new <= 0)
> +			new = 1;
> +	} while (atomic_cmpxchg(&net->ipv6.rt_genid, old, new) != old);
> +	return new;
> +
>  }
>  #else
>  static inline int rt_genid_ipv6(struct net *net)
>  {
> -	return 0;
> +	return 1;
>  }
>  
> -static inline void rt_genid_bump_ipv6(struct net *net)
> +static inline int rt_genid_bump_ipv6(struct net *net)
>  {
> +	return 1;
>  }
>  #endif
>  
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index 76b7f5e..4a2f130 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -84,7 +84,10 @@ static int fib6_walk_continue(struct fib6_walker_t *w);
>   *	result of redirects, path MTU changes, etc.
>   */
>  
> -static __u32 rt_sernum;
> +static int fib6_new_sernum(struct net *net)
> +{
> +	return rt_genid_bump_ipv6(net);
> +}
>  
>  static void fib6_gc_timer_cb(unsigned long arg);
>  
> @@ -104,13 +107,6 @@ static inline void fib6_walker_unlink(struct fib6_walker_t *w)
>  	list_del(&w->lh);
>  	write_unlock_bh(&fib6_walker_lock);
>  }
> -static __inline__ u32 fib6_new_sernum(void)
> -{
> -	u32 n = ++rt_sernum;
> -	if ((__s32)n <= 0)
> -		rt_sernum = n = 1;
> -	return n;
> -}
>  
>  /*
>   *	Auxiliary address test functions for the radix tree.
> @@ -421,16 +417,15 @@ out:
>   */
>  
>  static struct fib6_node *fib6_add_1(struct fib6_node *root,
> -				     struct in6_addr *addr, int plen,
> -				     int offset, int allow_create,
> -				     int replace_required)
> +				    struct in6_addr *addr, int plen,
> +				    int offset, int allow_create,
> +				    int replace_required, int sernum)
>  {
>  	struct fib6_node *fn, *in, *ln;
>  	struct fib6_node *pn = NULL;
>  	struct rt6key *key;
>  	int	bit;
>  	__be32	dir = 0;
> -	__u32	sernum = fib6_new_sernum();
>  
>  	RT6_TRACE("fib6_add_1\n");
>  
> @@ -844,6 +839,7 @@ void fib6_force_start_gc(struct net *net)
>  int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
>  	     struct nlattr *mx, int mx_len)
>  {
> +	struct net *net = dev_net(rt->dst.dev);
>  	struct fib6_node *fn, *pn = NULL;
>  	int err = -ENOMEM;
>  	int allow_create = 1;
> @@ -860,7 +856,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
>  
>  	fn = fib6_add_1(root, &rt->rt6i_dst.addr, rt->rt6i_dst.plen,
>  			offsetof(struct rt6_info, rt6i_dst), allow_create,
> -			replace_required);
> +			replace_required, fib6_new_sernum(net));
>  	if (IS_ERR(fn)) {
>  		err = PTR_ERR(fn);
>  		fn = NULL;
> @@ -894,14 +890,15 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
>  			sfn->leaf = info->nl_net->ipv6.ip6_null_entry;
>  			atomic_inc(&info->nl_net->ipv6.ip6_null_entry->rt6i_ref);
>  			sfn->fn_flags = RTN_ROOT;
> -			sfn->fn_sernum = fib6_new_sernum();
> +			sfn->fn_sernum = fib6_new_sernum(net);
>  
>  			/* Now add the first leaf node to new subtree */
>  
>  			sn = fib6_add_1(sfn, &rt->rt6i_src.addr,
>  					rt->rt6i_src.plen,
>  					offsetof(struct rt6_info, rt6i_src),
> -					allow_create, replace_required);
> +					allow_create, replace_required,
> +					fib6_new_sernum(net));
>  
>  			if (IS_ERR(sn)) {
>  				/* If it is failed, discard just allocated
> @@ -920,7 +917,8 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
>  			sn = fib6_add_1(fn->subtree, &rt->rt6i_src.addr,
>  					rt->rt6i_src.plen,
>  					offsetof(struct rt6_info, rt6i_src),
> -					allow_create, replace_required);
> +					allow_create, replace_required,
> +					fib6_new_sernum(net));
>  
>  			if (IS_ERR(sn)) {
>  				err = PTR_ERR(sn);
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index f74b041..54b7d81 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -314,7 +314,6 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
>  
>  		memset(dst + 1, 0, sizeof(*rt) - sizeof(*dst));
>  		rt6_init_peer(rt, table ? &table->tb6_peers : net->ipv6.peers);
> -		rt->rt6i_genid = rt_genid_ipv6(net);
>  		INIT_LIST_HEAD(&rt->rt6i_siblings);
>  	}
>  	return rt;
> @@ -1096,10 +1095,7 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
>  	 * DST_OBSOLETE_FORCE_CHK which forces validation calls down
>  	 * into this function always.
>  	 */
> -	if (rt->rt6i_genid != rt_genid_ipv6(dev_net(rt->dst.dev)))
> -		return NULL;
> -
> -	if (!rt->rt6i_node || (rt->rt6i_node->fn_sernum != cookie))
> +	if (!rt->rt6i_node || rt_genid_ipv6(dev_net(rt->dst.dev)) != cookie)
>  		return NULL;
>  
>  	if (rt6_check_expired(rt))
> 

Ok, so now we bump the gen_id every time we add a route and use that as the
fn_sernum for that route.  But this doesn't solve the problem we are seeing:
a re-lookup of the route still returns an older route with an older gen_id.

Or am I missing something?

-vlad


* Re: [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid
  2014-09-11 14:19                                           ` Vlad Yasevich
@ 2014-09-11 14:32                                             ` Hannes Frederic Sowa
  2014-09-11 14:44                                               ` Vlad Yasevich
  0 siblings, 1 reply; 57+ messages in thread
From: Hannes Frederic Sowa @ 2014-09-11 14:32 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: David Miller, netdev, eric.dumazet, nicolas.dichtel

On Do, 2014-09-11 at 10:19 -0400, Vlad Yasevich wrote:
> On 09/11/2014 08:05 AM, Hannes Frederic Sowa wrote:
> > On Mi, 2014-09-10 at 13:09 -0700, David Miller wrote:
> >> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> >> Date: Wed, 10 Sep 2014 11:31:28 +0200
> >>
> >>> In case we need to force the sockets to relookup the routes we now
> >>> increase the fn_sernum on all fibnodes in the routing tree. This is a
> >>> costly operation but should only happen if we have major routing/policy
> >>> changes in the kernel (e.g. manual route adding/removal, xfrm policy
> >>> changes).
> >>
> >> Core routers can update thousands of route updates per second, and they
> >> do this via what you refer to as "manual route adding/removal".
> >>
> >> I don't think we want to put such a scalability problem into the tree.
> >>
> >> There has to be a lightweight way to address this.
> > 
> > An alternative approach without traversing the routing table, but each
> > newly inserted route (even only cached ones) might bump all other routes
> > out of the per-socket caches:
> > 
> > diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> > index 9bcb220..a7e45b9 100644
> > --- a/include/net/ip6_fib.h
> > +++ b/include/net/ip6_fib.h
> > @@ -119,8 +119,6 @@ struct rt6_info {
> >  	struct inet6_dev		*rt6i_idev;
> >  	unsigned long			_rt6i_peer;
> >  
> > -	u32				rt6i_genid;
> > -
> >  	/* more non-fragment space at head required */
> >  	unsigned short			rt6i_nfheader_len;
> >  
> > diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> > index 361d260..428fdcb 100644
> > --- a/include/net/net_namespace.h
> > +++ b/include/net/net_namespace.h
> > @@ -358,18 +358,28 @@ static inline int rt_genid_ipv6(struct net *net)
> >  	return atomic_read(&net->ipv6.rt_genid);
> >  }
> >  
> > -static inline void rt_genid_bump_ipv6(struct net *net)
> > +static inline int rt_genid_bump_ipv6(struct net *net)
> >  {
> > -	atomic_inc(&net->ipv6.rt_genid);
> > +	int new, old;
> > +
> > +	do {
> > +		old = atomic_read(&net->ipv6.rt_genid);
> > +		new = old + 1;
> > +		if (new <= 0)
> > +			new = 1;
> > +	} while (atomic_cmpxchg(&net->ipv6.rt_genid, old, new) != old);
> > +	return new;
> > +
> >  }
> >  #else
> >  static inline int rt_genid_ipv6(struct net *net)
> >  {
> > -	return 0;
> > +	return 1;
> >  }
> >  
> > -static inline void rt_genid_bump_ipv6(struct net *net)
> > +static inline int rt_genid_bump_ipv6(struct net *net)
> >  {
> > +	return 1;
> >  }
> >  #endif
> >  
> > diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> > index 76b7f5e..4a2f130 100644
> > --- a/net/ipv6/ip6_fib.c
> > +++ b/net/ipv6/ip6_fib.c
> > @@ -84,7 +84,10 @@ static int fib6_walk_continue(struct fib6_walker_t *w);
> >   *	result of redirects, path MTU changes, etc.
> >   */
> >  
> > -static __u32 rt_sernum;
> > +static int fib6_new_sernum(struct net *net)
> > +{
> > +	return rt_genid_bump_ipv6(net);
> > +}
> >  
> >  static void fib6_gc_timer_cb(unsigned long arg);
> >  
> > @@ -104,13 +107,6 @@ static inline void fib6_walker_unlink(struct fib6_walker_t *w)
> >  	list_del(&w->lh);
> >  	write_unlock_bh(&fib6_walker_lock);
> >  }
> > -static __inline__ u32 fib6_new_sernum(void)
> > -{
> > -	u32 n = ++rt_sernum;
> > -	if ((__s32)n <= 0)
> > -		rt_sernum = n = 1;
> > -	return n;
> > -}
> >  
> >  /*
> >   *	Auxiliary address test functions for the radix tree.
> > @@ -421,16 +417,15 @@ out:
> >   */
> >  
> >  static struct fib6_node *fib6_add_1(struct fib6_node *root,
> > -				     struct in6_addr *addr, int plen,
> > -				     int offset, int allow_create,
> > -				     int replace_required)
> > +				    struct in6_addr *addr, int plen,
> > +				    int offset, int allow_create,
> > +				    int replace_required, int sernum)
> >  {
> >  	struct fib6_node *fn, *in, *ln;
> >  	struct fib6_node *pn = NULL;
> >  	struct rt6key *key;
> >  	int	bit;
> >  	__be32	dir = 0;
> > -	__u32	sernum = fib6_new_sernum();
> >  
> >  	RT6_TRACE("fib6_add_1\n");
> >  
> > @@ -844,6 +839,7 @@ void fib6_force_start_gc(struct net *net)
> >  int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
> >  	     struct nlattr *mx, int mx_len)
> >  {
> > +	struct net *net = dev_net(rt->dst.dev);
> >  	struct fib6_node *fn, *pn = NULL;
> >  	int err = -ENOMEM;
> >  	int allow_create = 1;
> > @@ -860,7 +856,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
> >  
> >  	fn = fib6_add_1(root, &rt->rt6i_dst.addr, rt->rt6i_dst.plen,
> >  			offsetof(struct rt6_info, rt6i_dst), allow_create,
> > -			replace_required);
> > +			replace_required, fib6_new_sernum(net));
> >  	if (IS_ERR(fn)) {
> >  		err = PTR_ERR(fn);
> >  		fn = NULL;
> > @@ -894,14 +890,15 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
> >  			sfn->leaf = info->nl_net->ipv6.ip6_null_entry;
> >  			atomic_inc(&info->nl_net->ipv6.ip6_null_entry->rt6i_ref);
> >  			sfn->fn_flags = RTN_ROOT;
> > -			sfn->fn_sernum = fib6_new_sernum();
> > +			sfn->fn_sernum = fib6_new_sernum(net);
> >  
> >  			/* Now add the first leaf node to new subtree */
> >  
> >  			sn = fib6_add_1(sfn, &rt->rt6i_src.addr,
> >  					rt->rt6i_src.plen,
> >  					offsetof(struct rt6_info, rt6i_src),
> > -					allow_create, replace_required);
> > +					allow_create, replace_required,
> > +					fib6_new_sernum(net));
> >  
> >  			if (IS_ERR(sn)) {
> >  				/* If it is failed, discard just allocated
> > @@ -920,7 +917,8 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
> >  			sn = fib6_add_1(fn->subtree, &rt->rt6i_src.addr,
> >  					rt->rt6i_src.plen,
> >  					offsetof(struct rt6_info, rt6i_src),
> > -					allow_create, replace_required);
> > +					allow_create, replace_required,
> > +					fib6_new_sernum(net));
> >  
> >  			if (IS_ERR(sn)) {
> >  				err = PTR_ERR(sn);
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index f74b041..54b7d81 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -314,7 +314,6 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
> >  
> >  		memset(dst + 1, 0, sizeof(*rt) - sizeof(*dst));
> >  		rt6_init_peer(rt, table ? &table->tb6_peers : net->ipv6.peers);
> > -		rt->rt6i_genid = rt_genid_ipv6(net);
> >  		INIT_LIST_HEAD(&rt->rt6i_siblings);
> >  	}
> >  	return rt;
> > @@ -1096,10 +1095,7 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
> >  	 * DST_OBSOLETE_FORCE_CHK which forces validation calls down
> >  	 * into this function always.
> >  	 */
> > -	if (rt->rt6i_genid != rt_genid_ipv6(dev_net(rt->dst.dev)))
> > -		return NULL;
> > -
> > -	if (!rt->rt6i_node || (rt->rt6i_node->fn_sernum != cookie))
> > +	if (!rt->rt6i_node || rt_genid_ipv6(dev_net(rt->dst.dev)) != cookie)
> >  		return NULL;
> >  
> >  	if (rt6_check_expired(rt))
> > 
> 
> Ok, so now we bump the gen_id every time we add a route and use that as the
> fn_sernum for that route.  But this doesn't solve the problem we are seeing:
> a re-lookup of the route still returns an older route with an older gen_id.

Hmm, this patch completely removes rt6i_genid from rt6_info (first
hunk). We decide if we need to do the relookup based on the socket's
cookie and the per-netns serial number of the ipv6 routing tables.

Bye,
Hannes


* Re: [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid
  2014-09-11 14:32                                             ` Hannes Frederic Sowa
@ 2014-09-11 14:44                                               ` Vlad Yasevich
  2014-09-11 14:47                                                 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 57+ messages in thread
From: Vlad Yasevich @ 2014-09-11 14:44 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: David Miller, netdev, eric.dumazet, nicolas.dichtel

On 09/11/2014 10:32 AM, Hannes Frederic Sowa wrote:
> On Do, 2014-09-11 at 10:19 -0400, Vlad Yasevich wrote:
>> On 09/11/2014 08:05 AM, Hannes Frederic Sowa wrote:
>>> On Mi, 2014-09-10 at 13:09 -0700, David Miller wrote:
>>>> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
>>>> Date: Wed, 10 Sep 2014 11:31:28 +0200
>>>>
>>>>> In case we need to force the sockets to relookup the routes we now
>>>>> increase the fn_sernum on all fibnodes in the routing tree. This is a
>>>>> costly operation but should only happen if we have major routing/policy
>>>>> changes in the kernel (e.g. manual route adding/removal, xfrm policy
>>>>> changes).
>>>>
>>>> Core routers can update thousands of route updates per second, and they
>>>> do this via what you refer to as "manual route adding/removal".
>>>>
>>>> I don't think we want to put such a scalability problem into the tree.
>>>>
>>>> There has to be a lightweight way to address this.
>>>
>>> An alternative approach without traversing the routing table, but each
>>> newly inserted route (even only cached ones) might bump all other routes
>>> out of the per-socket caches:
>>>
>>> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
>>> index 9bcb220..a7e45b9 100644
>>> --- a/include/net/ip6_fib.h
>>> +++ b/include/net/ip6_fib.h
>>> @@ -119,8 +119,6 @@ struct rt6_info {
>>>  	struct inet6_dev		*rt6i_idev;
>>>  	unsigned long			_rt6i_peer;
>>>  
>>> -	u32				rt6i_genid;
>>> -
>>>  	/* more non-fragment space at head required */
>>>  	unsigned short			rt6i_nfheader_len;
>>>  
>>> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
>>> index 361d260..428fdcb 100644
>>> --- a/include/net/net_namespace.h
>>> +++ b/include/net/net_namespace.h
>>> @@ -358,18 +358,28 @@ static inline int rt_genid_ipv6(struct net *net)
>>>  	return atomic_read(&net->ipv6.rt_genid);
>>>  }
>>>  
>>> -static inline void rt_genid_bump_ipv6(struct net *net)
>>> +static inline int rt_genid_bump_ipv6(struct net *net)
>>>  {
>>> -	atomic_inc(&net->ipv6.rt_genid);
>>> +	int new, old;
>>> +
>>> +	do {
>>> +		old = atomic_read(&net->ipv6.rt_genid);
>>> +		new = old + 1;
>>> +		if (new <= 0)
>>> +			new = 1;
>>> +	} while (atomic_cmpxchg(&net->ipv6.rt_genid, old, new) != old);
>>> +	return new;
>>> +
>>>  }
>>>  #else
>>>  static inline int rt_genid_ipv6(struct net *net)
>>>  {
>>> -	return 0;
>>> +	return 1;
>>>  }
>>>  
>>> -static inline void rt_genid_bump_ipv6(struct net *net)
>>> +static inline int rt_genid_bump_ipv6(struct net *net)
>>>  {
>>> +	return 1;
>>>  }
>>>  #endif
>>>  
>>> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
>>> index 76b7f5e..4a2f130 100644
>>> --- a/net/ipv6/ip6_fib.c
>>> +++ b/net/ipv6/ip6_fib.c
>>> @@ -84,7 +84,10 @@ static int fib6_walk_continue(struct fib6_walker_t *w);
>>>   *	result of redirects, path MTU changes, etc.
>>>   */
>>>  
>>> -static __u32 rt_sernum;
>>> +static int fib6_new_sernum(struct net *net)
>>> +{
>>> +	return rt_genid_bump_ipv6(net);
>>> +}
>>>  
>>>  static void fib6_gc_timer_cb(unsigned long arg);
>>>  
>>> @@ -104,13 +107,6 @@ static inline void fib6_walker_unlink(struct fib6_walker_t *w)
>>>  	list_del(&w->lh);
>>>  	write_unlock_bh(&fib6_walker_lock);
>>>  }
>>> -static __inline__ u32 fib6_new_sernum(void)
>>> -{
>>> -	u32 n = ++rt_sernum;
>>> -	if ((__s32)n <= 0)
>>> -		rt_sernum = n = 1;
>>> -	return n;
>>> -}
>>>  
>>>  /*
>>>   *	Auxiliary address test functions for the radix tree.
>>> @@ -421,16 +417,15 @@ out:
>>>   */
>>>  
>>>  static struct fib6_node *fib6_add_1(struct fib6_node *root,
>>> -				     struct in6_addr *addr, int plen,
>>> -				     int offset, int allow_create,
>>> -				     int replace_required)
>>> +				    struct in6_addr *addr, int plen,
>>> +				    int offset, int allow_create,
>>> +				    int replace_required, int sernum)
>>>  {
>>>  	struct fib6_node *fn, *in, *ln;
>>>  	struct fib6_node *pn = NULL;
>>>  	struct rt6key *key;
>>>  	int	bit;
>>>  	__be32	dir = 0;
>>> -	__u32	sernum = fib6_new_sernum();
>>>  
>>>  	RT6_TRACE("fib6_add_1\n");
>>>  
>>> @@ -844,6 +839,7 @@ void fib6_force_start_gc(struct net *net)
>>>  int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
>>>  	     struct nlattr *mx, int mx_len)
>>>  {
>>> +	struct net *net = dev_net(rt->dst.dev);
>>>  	struct fib6_node *fn, *pn = NULL;
>>>  	int err = -ENOMEM;
>>>  	int allow_create = 1;
>>> @@ -860,7 +856,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
>>>  
>>>  	fn = fib6_add_1(root, &rt->rt6i_dst.addr, rt->rt6i_dst.plen,
>>>  			offsetof(struct rt6_info, rt6i_dst), allow_create,
>>> -			replace_required);
>>> +			replace_required, fib6_new_sernum(net));
>>>  	if (IS_ERR(fn)) {
>>>  		err = PTR_ERR(fn);
>>>  		fn = NULL;
>>> @@ -894,14 +890,15 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
>>>  			sfn->leaf = info->nl_net->ipv6.ip6_null_entry;
>>>  			atomic_inc(&info->nl_net->ipv6.ip6_null_entry->rt6i_ref);
>>>  			sfn->fn_flags = RTN_ROOT;
>>> -			sfn->fn_sernum = fib6_new_sernum();
>>> +			sfn->fn_sernum = fib6_new_sernum(net);
>>>  
>>>  			/* Now add the first leaf node to new subtree */
>>>  
>>>  			sn = fib6_add_1(sfn, &rt->rt6i_src.addr,
>>>  					rt->rt6i_src.plen,
>>>  					offsetof(struct rt6_info, rt6i_src),
>>> -					allow_create, replace_required);
>>> +					allow_create, replace_required,
>>> +					fib6_new_sernum(net));
>>>  
>>>  			if (IS_ERR(sn)) {
>>>  				/* If it is failed, discard just allocated
>>> @@ -920,7 +917,8 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
>>>  			sn = fib6_add_1(fn->subtree, &rt->rt6i_src.addr,
>>>  					rt->rt6i_src.plen,
>>>  					offsetof(struct rt6_info, rt6i_src),
>>> -					allow_create, replace_required);
>>> +					allow_create, replace_required,
>>> +					fib6_new_sernum(net));
>>>  
>>>  			if (IS_ERR(sn)) {
>>>  				err = PTR_ERR(sn);
>>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>>> index f74b041..54b7d81 100644
>>> --- a/net/ipv6/route.c
>>> +++ b/net/ipv6/route.c
>>> @@ -314,7 +314,6 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
>>>  
>>>  		memset(dst + 1, 0, sizeof(*rt) - sizeof(*dst));
>>>  		rt6_init_peer(rt, table ? &table->tb6_peers : net->ipv6.peers);
>>> -		rt->rt6i_genid = rt_genid_ipv6(net);
>>>  		INIT_LIST_HEAD(&rt->rt6i_siblings);
>>>  	}
>>>  	return rt;
>>> @@ -1096,10 +1095,7 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
>>>  	 * DST_OBSOLETE_FORCE_CHK which forces validation calls down
>>>  	 * into this function always.
>>>  	 */
>>> -	if (rt->rt6i_genid != rt_genid_ipv6(dev_net(rt->dst.dev)))
>>> -		return NULL;
>>> -
>>> -	if (!rt->rt6i_node || (rt->rt6i_node->fn_sernum != cookie))
>>> +	if (!rt->rt6i_node || rt_genid_ipv6(dev_net(rt->dst.dev)) != cookie)
>>>  		return NULL;
>>>  
>>>  	if (rt6_check_expired(rt))
>>>
>>
>> Ok, so now we bump the gen_id every time we add a route and use that as the
>> fn_sernum for that route.  But this doesn't solve the problem we are seeing:
>> a re-lookup of the route still returns an older route with an older gen_id.
> 
> Hmm, this patch completely removes rt6i_genid from rt6_info (first
> hunk). We decide if we need to do the relookup based on the socket's
> cookie and the per-netns serial number of the ipv6 routing tables.
> 

Hi Hannes

Right, but you still compare the per-netns serial number to the cookie.  The cookie is
typically taken from the route, and the route takes it from the per-netns serial number.
So, if you add a route with serial=1, then when the socket looks up that route, it
stashes 1 into the cookie.
Then you add another route and serial = 2.  Now the dst_check on the socket fails
(cookie != rt_genid_ipv6(..)) and the socket has to re-lookup a route.  It gets the same
old route as before with fn_sernum = 1.  That's stashed in the cookie again.  The next
dst_check is done, and the route is invalidated again!

Am I missing something?

Thanks
-vlad

> Bye,
> Hannes
> 
> 


* Re: [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid
  2014-09-11 14:44                                               ` Vlad Yasevich
@ 2014-09-11 14:47                                                 ` Hannes Frederic Sowa
  0 siblings, 0 replies; 57+ messages in thread
From: Hannes Frederic Sowa @ 2014-09-11 14:47 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: David Miller, netdev, eric.dumazet, nicolas.dichtel

On Do, 2014-09-11 at 10:44 -0400, Vlad Yasevich wrote:
> On 09/11/2014 10:32 AM, Hannes Frederic Sowa wrote:
> > On Do, 2014-09-11 at 10:19 -0400, Vlad Yasevich wrote:
> >> On 09/11/2014 08:05 AM, Hannes Frederic Sowa wrote:
> >>> On Mi, 2014-09-10 at 13:09 -0700, David Miller wrote:
> >>>> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> >>>> Date: Wed, 10 Sep 2014 11:31:28 +0200
> >>>>
> >>>>> In case we need to force the sockets to relookup the routes we now
> >>>>> increase the fn_sernum on all fibnodes in the routing tree. This is a
> >>>>> costly operation but should only happen if we have major routing/policy
> >>>>> changes in the kernel (e.g. manual route adding/removal, xfrm policy
> >>>>> changes).
> >>>>
> >>>> Core routers can update thousands of route updates per second, and they
> >>>> do this via what you refer to as "manual route adding/removal".
> >>>>
> >>>> I don't think we want to put such a scalability problem into the tree.
> >>>>
> >>>> There has to be a lightweight way to address this.
> >>>
> >>> An alternative approach without traversing the routing table, but each
> >>> newly inserted route (even only cached ones) might bump all other routes
> >>> out of the per-socket caches:
> >>>
> >>> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> >>> index 9bcb220..a7e45b9 100644
> >>> --- a/include/net/ip6_fib.h
> >>> +++ b/include/net/ip6_fib.h
> >>> @@ -119,8 +119,6 @@ struct rt6_info {
> >>>  	struct inet6_dev		*rt6i_idev;
> >>>  	unsigned long			_rt6i_peer;
> >>>  
> >>> -	u32				rt6i_genid;
> >>> -
> >>>  	/* more non-fragment space at head required */
> >>>  	unsigned short			rt6i_nfheader_len;
> >>>  
> >>> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> >>> index 361d260..428fdcb 100644
> >>> --- a/include/net/net_namespace.h
> >>> +++ b/include/net/net_namespace.h
> >>> @@ -358,18 +358,28 @@ static inline int rt_genid_ipv6(struct net *net)
> >>>  	return atomic_read(&net->ipv6.rt_genid);
> >>>  }
> >>>  
> >>> -static inline void rt_genid_bump_ipv6(struct net *net)
> >>> +static inline int rt_genid_bump_ipv6(struct net *net)
> >>>  {
> >>> -	atomic_inc(&net->ipv6.rt_genid);
> >>> +	int new, old;
> >>> +
> >>> +	do {
> >>> +		old = atomic_read(&net->ipv6.rt_genid);
> >>> +		new = old + 1;
> >>> +		if (new <= 0)
> >>> +			new = 1;
> >>> +	} while (atomic_cmpxchg(&net->ipv6.rt_genid, old, new) != old);
> >>> +	return new;
> >>> +
> >>>  }
> >>>  #else
> >>>  static inline int rt_genid_ipv6(struct net *net)
> >>>  {
> >>> -	return 0;
> >>> +	return 1;
> >>>  }
> >>>  
> >>> -static inline void rt_genid_bump_ipv6(struct net *net)
> >>> +static inline int rt_genid_bump_ipv6(struct net *net)
> >>>  {
> >>> +	return 1;
> >>>  }
> >>>  #endif
> >>>  
> >>> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> >>> index 76b7f5e..4a2f130 100644
> >>> --- a/net/ipv6/ip6_fib.c
> >>> +++ b/net/ipv6/ip6_fib.c
> >>> @@ -84,7 +84,10 @@ static int fib6_walk_continue(struct fib6_walker_t *w);
> >>>   *	result of redirects, path MTU changes, etc.
> >>>   */
> >>>  
> >>> -static __u32 rt_sernum;
> >>> +static int fib6_new_sernum(struct net *net)
> >>> +{
> >>> +	return rt_genid_bump_ipv6(net);
> >>> +}
> >>>  
> >>>  static void fib6_gc_timer_cb(unsigned long arg);
> >>>  
> >>> @@ -104,13 +107,6 @@ static inline void fib6_walker_unlink(struct fib6_walker_t *w)
> >>>  	list_del(&w->lh);
> >>>  	write_unlock_bh(&fib6_walker_lock);
> >>>  }
> >>> -static __inline__ u32 fib6_new_sernum(void)
> >>> -{
> >>> -	u32 n = ++rt_sernum;
> >>> -	if ((__s32)n <= 0)
> >>> -		rt_sernum = n = 1;
> >>> -	return n;
> >>> -}
> >>>  
> >>>  /*
> >>>   *	Auxiliary address test functions for the radix tree.
> >>> @@ -421,16 +417,15 @@ out:
> >>>   */
> >>>  
> >>>  static struct fib6_node *fib6_add_1(struct fib6_node *root,
> >>> -				     struct in6_addr *addr, int plen,
> >>> -				     int offset, int allow_create,
> >>> -				     int replace_required)
> >>> +				    struct in6_addr *addr, int plen,
> >>> +				    int offset, int allow_create,
> >>> +				    int replace_required, int sernum)
> >>>  {
> >>>  	struct fib6_node *fn, *in, *ln;
> >>>  	struct fib6_node *pn = NULL;
> >>>  	struct rt6key *key;
> >>>  	int	bit;
> >>>  	__be32	dir = 0;
> >>> -	__u32	sernum = fib6_new_sernum();
> >>>  
> >>>  	RT6_TRACE("fib6_add_1\n");
> >>>  
> >>> @@ -844,6 +839,7 @@ void fib6_force_start_gc(struct net *net)
> >>>  int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
> >>>  	     struct nlattr *mx, int mx_len)
> >>>  {
> >>> +	struct net *net = dev_net(rt->dst.dev);
> >>>  	struct fib6_node *fn, *pn = NULL;
> >>>  	int err = -ENOMEM;
> >>>  	int allow_create = 1;
> >>> @@ -860,7 +856,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
> >>>  
> >>>  	fn = fib6_add_1(root, &rt->rt6i_dst.addr, rt->rt6i_dst.plen,
> >>>  			offsetof(struct rt6_info, rt6i_dst), allow_create,
> >>> -			replace_required);
> >>> +			replace_required, fib6_new_sernum(net));
> >>>  	if (IS_ERR(fn)) {
> >>>  		err = PTR_ERR(fn);
> >>>  		fn = NULL;
> >>> @@ -894,14 +890,15 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
> >>>  			sfn->leaf = info->nl_net->ipv6.ip6_null_entry;
> >>>  			atomic_inc(&info->nl_net->ipv6.ip6_null_entry->rt6i_ref);
> >>>  			sfn->fn_flags = RTN_ROOT;
> >>> -			sfn->fn_sernum = fib6_new_sernum();
> >>> +			sfn->fn_sernum = fib6_new_sernum(net);
> >>>  
> >>>  			/* Now add the first leaf node to new subtree */
> >>>  
> >>>  			sn = fib6_add_1(sfn, &rt->rt6i_src.addr,
> >>>  					rt->rt6i_src.plen,
> >>>  					offsetof(struct rt6_info, rt6i_src),
> >>> -					allow_create, replace_required);
> >>> +					allow_create, replace_required,
> >>> +					fib6_new_sernum(net));
> >>>  
> >>>  			if (IS_ERR(sn)) {
> >>>  				/* If it is failed, discard just allocated
> >>> @@ -920,7 +917,8 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
> >>>  			sn = fib6_add_1(fn->subtree, &rt->rt6i_src.addr,
> >>>  					rt->rt6i_src.plen,
> >>>  					offsetof(struct rt6_info, rt6i_src),
> >>> -					allow_create, replace_required);
> >>> +					allow_create, replace_required,
> >>> +					fib6_new_sernum(net));
> >>>  
> >>>  			if (IS_ERR(sn)) {
> >>>  				err = PTR_ERR(sn);
> >>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> >>> index f74b041..54b7d81 100644
> >>> --- a/net/ipv6/route.c
> >>> +++ b/net/ipv6/route.c
> >>> @@ -314,7 +314,6 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
> >>>  
> >>>  		memset(dst + 1, 0, sizeof(*rt) - sizeof(*dst));
> >>>  		rt6_init_peer(rt, table ? &table->tb6_peers : net->ipv6.peers);
> >>> -		rt->rt6i_genid = rt_genid_ipv6(net);
> >>>  		INIT_LIST_HEAD(&rt->rt6i_siblings);
> >>>  	}
> >>>  	return rt;
> >>> @@ -1096,10 +1095,7 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
> >>>  	 * DST_OBSOLETE_FORCE_CHK which forces validation calls down
> >>>  	 * into this function always.
> >>>  	 */
> >>> -	if (rt->rt6i_genid != rt_genid_ipv6(dev_net(rt->dst.dev)))
> >>> -		return NULL;
> >>> -
> >>> -	if (!rt->rt6i_node || (rt->rt6i_node->fn_sernum != cookie))
> >>> +	if (!rt->rt6i_node || rt_genid_ipv6(dev_net(rt->dst.dev)) != cookie)
> >>>  		return NULL;
> >>>  
> >>>  	if (rt6_check_expired(rt))
> >>>
> >>
> >> Ok, so now we bump the gen_id every time we add a route and use that as fn_sernum for
> >> that route.  But this doesn't solve the problem that we are seeing in that a re-lookup
> >> of the route still gives us an older route with an older gen_id.
> > 
> > Hmm, this patch completely removes rt6i_genid from rt6_info (first
> > hunk). We decide if we need to do the relookup based on the socket's
> > cookie and the per-netns serial number of the ipv6 routing tables.
> > 
> 
> Hi Hannes
> 
> Right, but you still compare the per-netns serial number to the cookie.  The cookie is
> typically taken from the route.  The route takes it from the per-netns serial number.
> So, if you add a route with serial=1 and when the socket looks up that route, it
> stashes 1 into the cookie.
> Then you add another route and serial = 2.  Now the dst_check on the socket fails
> (cookie != rt_genid_ipv6(..)) and the socket has to re-lookup a route.  It gets the same old
> route as before with fn_sernum = 1.  That's stashed in the cookie again.  The next dst_check
> is done, and the route is invalidated again!
> 
> Am I missing something?

No, you're right. I wanted to initialize the socket cookie with
rt_genid_ipv6() in this patch, not with the value from the tree.

Thanks! :)


Thread overview: 57+ messages
2014-08-14 18:19 Performance regression on kernels 3.10 and newer Alexander Duyck
2014-08-14 18:46 ` Eric Dumazet
2014-08-14 19:50   ` Eric Dumazet
2014-08-14 19:59   ` Rick Jones
2014-08-14 20:31     ` Alexander Duyck
2014-08-14 20:51       ` Eric Dumazet
2014-08-14 20:46     ` Eric Dumazet
2014-08-14 23:16   ` Alexander Duyck
2014-08-14 23:20     ` David Miller
2014-08-14 23:25       ` Tom Herbert
2014-08-21 23:24         ` David Miller
2014-09-06 14:45           ` Eric Dumazet
2014-09-06 15:27             ` Eric Dumazet
2014-09-06 15:46               ` Eric Dumazet
2014-09-06 16:38                 ` Eric Dumazet
2014-09-06 18:21                   ` Eric Dumazet
2014-09-07 19:05                     ` [PATCH net] ipv6: refresh rt6i_genid in ip6_pol_route() Eric Dumazet
2014-09-07 22:54                       ` David Miller
2014-09-08  4:18                         ` Eric Dumazet
2014-09-08  4:27                           ` David Miller
2014-09-08  4:43                             ` Eric Dumazet
2014-09-08  4:59                               ` David Miller
2014-09-08  5:07                                 ` Eric Dumazet
2014-09-08  8:11                                   ` Nicolas Dichtel
2014-09-08 10:28                                     ` Eric Dumazet
2014-09-08 12:16                                       ` Nicolas Dichtel
2014-09-08 18:48                                   ` Vlad Yasevich
2014-09-09 12:58                                   ` Hannes Frederic Sowa
2014-09-10  9:31                                     ` [PATCH net-next] ipv6: implement rt_genid_bump_ipv6 with fn_sernum and remove rt6i_genid Hannes Frederic Sowa
2014-09-10 13:26                                       ` Vlad Yasevich
2014-09-10 13:42                                         ` Hannes Frederic Sowa
2014-09-10 20:09                                       ` David Miller
2014-09-11  8:30                                         ` Hannes Frederic Sowa
2014-09-11 12:22                                           ` Vlad Yasevich
2014-09-11 12:40                                             ` Hannes Frederic Sowa
2014-09-11 12:05                                         ` Hannes Frederic Sowa
2014-09-11 14:19                                           ` Vlad Yasevich
2014-09-11 14:32                                             ` Hannes Frederic Sowa
2014-09-11 14:44                                               ` Vlad Yasevich
2014-09-11 14:47                                                 ` Hannes Frederic Sowa
2014-09-08 15:06               ` [PATCH v2 net-next] tcp: remove dst refcount false sharing for prequeue mode Eric Dumazet
2014-09-08 21:21                 ` David Miller
2014-09-08 21:30                   ` Eric Dumazet
2014-09-08 22:41                     ` David Miller
2014-09-09 23:56                     ` David Miller
2014-08-15 17:15       ` Performance regression on kernels 3.10 and newer Alexander Duyck
2014-08-15 17:59         ` Eric Dumazet
2014-08-15 18:49         ` Tom Herbert
2014-08-15 19:10           ` Alexander Duyck
2014-08-15 22:16             ` Tom Herbert
2014-08-15 23:23               ` Alexander Duyck
2014-08-18  9:03                 ` David Laight
2014-08-18 15:22                   ` Alexander Duyck
2014-08-18 15:29                     ` Rick Jones
2014-08-21 23:51         ` David Miller
2014-08-14 23:48     ` Eric Dumazet
2014-08-15  0:33       ` Rick Jones