* IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
@ 2020-03-05  8:17 Alarig Le Lay
  2020-03-08  0:52 ` David Ahern
  0 siblings, 1 reply; 16+ messages in thread
From: Alarig Le Lay @ 2020-03-05  8:17 UTC (permalink / raw)
  To: netdev

Hi,

On the bird-users ML, we discussed a bug we’re facing when holding a
full table: from time to time all IPv6 traffic is dropped (and all
neighbors are invalidated); after a while it comes back, then a few
minutes later it’s dropped again, and so on.

Basil Fillan determined that it comes from the commit
3b6761d18bc11f2af2a6fc494e9026d39593f22c.

Here are the complete archives of the discussion:

https://bird.network.cz/pipermail/bird-users/2019-June/013509.html
https://bird.network.cz/pipermail/bird-users/2019-November/013992.html
https://bird.network.cz/pipermail/bird-users/2019-December/014011.html
https://bird.network.cz/pipermail/bird-users/2020-February/014269.html

Regards,
Alarig Le Lay

----- Forwarded message from Basil Fillan <jack@basilfillan.uk> -----

From: Basil Fillan <jack@basilfillan.uk>
To: bird-users@network.cz
Date: Wed, 26 Feb 2020 22:54:57 +0000
Subject: Re: IPv6 BGP & kernel 4.19
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.4.1

Hi,

We've also experienced this after upgrading a few routers to Debian Buster.
With a kernel bisect we found that a bug was introduced in the following
commit:

3b6761d18bc11f2af2a6fc494e9026d39593f22c

This bug was still present in master as of a few weeks ago.

It appears entries are added to the IPv6 route cache which aren't visible from
"ip -6 route show cache", but are causing the route cache garbage collection
system to trigger extremely often (every packet?) once it exceeds the value of
net.ipv6.route.max_size. Our original symptom was extreme forwarding jitter
caused within the ip6_dst_gc function (identified by some spelunking with
systemtap & perf) worsening as the size of the cache increased. This was due
to our max_size sysctl inadvertently being set to 1 million. Reducing this
value to the default 4096 broke IPv6 forwarding entirely on our test system
under affected kernels. Our documentation had this sysctl marked as the
maximum number of IPv6 routes, so it looks like its use changed at some point.

We've rolled our routers back to kernel 4.9 (with the sysctl set to 4096) for
now, which fixed our immediate issue.

You can reproduce this by adding more than 4096 routes (the default value of
the sysctl) to the kernel and running "ip route get" for each of them. Once the
route cache is filled, the error "RTNETLINK answers: Network is unreachable"
will be received for each subsequent "ip route get" incantation, and v6
connectivity will be interrupted.
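
As a concrete sketch of that reproducer (the prefix, next hop and device
below are placeholders to adapt; the default sysctl value is assumed):

    sysctl net.ipv6.route.max_size                 # 4096 by default
    for net in {1..5000}; do
        ip -6 route add 2001:db8:ffff:${net}::/64 via fe80::1 dev eth0
    done
    for net in {1..5000}; do
        ip -6 route get 2001:db8:ffff:${net}::1
    done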

Thanks,

Basil


On 26/02/2020 20:38, Clément Guivy wrote:
> Hi, did anyone find a solution or workaround regarding this issue?
> Considering a router use case.
> I have looked at rt6_stats, total route count is around 78k (full view),
> and around 4100 entries in the cache at the moment on my first router
> (forwarding a few Mb/s) and around 2500 entries on my second router
> (forwarding less than 1 Mb/s).
> I have reread the entire thread. At first, Alarig's research seemed to
> point to a neighbor management problem; my understanding is that the
> route cache is something else entirely - or is it related somehow?
> 
> 
> On 03/12/2019 19:29, Alarig Le Lay wrote:
> > We agree then, and I act as a router on all those machines.
> > 
> > On December 3, 2019 7:27:11 PM GMT+01:00, Vincent Bernat
> > <vincent@bernat.ch> wrote:
> > 
> >     This is the result of PMTUd. But when you are a router, you don't
> >     need to do that, so it's mostly a problem for end hosts.
> > 
> >     On December 3, 2019 7:05:49 PM GMT+01:00, Alarig Le Lay
> >     <alarig@swordarmor.fr> wrote:
> > 
> >         On 03/12/2019 14:16, Vincent Bernat wrote:
> > 
> >             The information needs to be stored somewhere.
> > 
> > 
> >         Why does it have to be stored? It’s not really my problem if
> >         someone else has a non-standard MTU and can’t do TCP-MSS or
> >         PMTUd.

----- End forwarded message -----


* Re: IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
  2020-03-05  8:17 IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c Alarig Le Lay
@ 2020-03-08  0:52 ` David Ahern
  2020-03-08 10:57   ` Alarig Le Lay
  2020-09-27 15:35   ` Baptiste Jonglez
  0 siblings, 2 replies; 16+ messages in thread
From: David Ahern @ 2020-03-08  0:52 UTC (permalink / raw)
  To: Alarig Le Lay, netdev, jack, Vincent Bernat

On 3/5/20 1:17 AM, Alarig Le Lay wrote:
> Hi,
> 
> On the bird-users ML, we discussed a bug we’re facing when holding a
> full table: from time to time all IPv6 traffic is dropped (and all
> neighbors are invalidated); after a while it comes back, then a few
> minutes later it’s dropped again, and so on.

Kernel version?

you are monitoring neighbor states with 'ip monitor' or something else?


> 
> Basil Fillan determined that it comes from the commit
> 3b6761d18bc11f2af2a6fc494e9026d39593f22c.
> 


...

> We've also experienced this after upgrading a few routers to Debian Buster.
> With a kernel bisect we found that a bug was introduced in the following
> commit:
> 
> 3b6761d18bc11f2af2a6fc494e9026d39593f22c
> 
> This bug was still present in master as of a few weeks ago.
> 
> It appears entries are added to the IPv6 route cache which aren't visible from
> "ip -6 route show cache", but are causing the route cache garbage collection
> system to trigger extremely often (every packet?) once it exceeds the value of
> net.ipv6.route.max_size. Our original symptom was extreme forwarding jitter
> caused within the ip6_dst_gc function (identified by some spelunking with
> systemtap & perf) worsening as the size of the cache increased. This was due
> to our max_size sysctl inadvertently being set to 1 million. Reducing this
> value to the default 4096 broke IPv6 forwarding entirely on our test system
> under affected kernels. Our documentation had this sysctl marked as the
> maximum number of IPv6 routes, so it looks like its use changed at some point.
> 
> We've rolled our routers back to kernel 4.9 (with the sysctl set to 4096) for
> now, which fixed our immediate issue.
> 
> You can reproduce this by adding more than 4096 routes (the default value of
> the sysctl) to the kernel and running "ip route get" for each of them. Once the
> route cache is filled, the error "RTNETLINK answers: Network is unreachable"
> will be received for each subsequent "ip route get" incantation, and v6
> connectivity will be interrupted.
> 

The above does not reproduce for me on 5.6 or 4.19, and I would have
been really surprised if it had, so I have to question the git bisect
result.

There is no limit on fib entries, and the number of FIB entries has no
impact on the sysctl in question, net.ipv6.route.max_size. That sysctl
limits the number of dst_entry instances. When the threshold is exceeded
(and the gc_thresh for ipv6 defaults to 1024), each new alloc attempts to
free one via gc. There are many legitimate reasons why 4k entries
have been created - mtu exceptions, redirects, per-cpu caching, vrfs, ...
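
For reference, the two knobs and the visible cache can be inspected like
this (plain iproute2/sysctl usage, nothing version-specific assumed):

    sysctl net.ipv6.route.max_size net.ipv6.route.gc_thresh
    ip -6 route show cache | wc -l     # note: not every dst shows up here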

In 4.9 FIB entries are created as an rt6_info which is a v6 wrapper
around dst_entry. That changed in 4.15 or 4.16 - I forget which now, and
the commit you reference above is part of the refactoring to make IPv6
more like IPv4 with a different, smaller data structure for fib entries.
A lot of other changes have also gone into IPv6 between 4.9 and top of
tree, and at this point the whole gc thing can probably go away for v6
like it was removed for ipv4.

Try the 5.4 LTS and see if you still hit a problem.


* Re: IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
  2020-03-08  0:52 ` David Ahern
@ 2020-03-08 10:57   ` Alarig Le Lay
  2020-03-09  2:15     ` David Ahern
  2020-09-27 15:35   ` Baptiste Jonglez
  1 sibling, 1 reply; 16+ messages in thread
From: Alarig Le Lay @ 2020-03-08 10:57 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, jack, Vincent Bernat

On Sat  7 Mar 17:52:10 2020, David Ahern wrote:
> On 3/5/20 1:17 AM, Alarig Le Lay wrote:
> Kernel version?

In my experience I’ve seen this from 4.19 on; it works at least up to 4.15.

> you are monitoring neighbor states with 'ip monitor' or something else?

Yes, 'ip -ts monitor neigh' to be exact.

> The above does not reproduce for me on 5.6 or 4.19, and I would have
> been really surprised if it had, so I have to question the git bisect
> result.

My personal experience is that, when routing is activated (and a full
view is loaded - I don’t have any soft router without one), the
neighbors flap, thus causing a blackhole.
It doesn’t happen when the amount of traffic processed is limited. The
threshold is around 20 Mbps from what I can see.

> There is no limit on fib entries, and the number of FIB entries has no
> impact on the sysctl in question, net.ipv6.route.max_size. That sysctl
> limits the number of dst_entry instances. When the threshold is exceeded
> (and the gc_thresh for ipv6 defaults to 1024), each new alloc attempts to
> free one via gc. There are many legitimate reasons why 4k entries
> have been created - mtu exceptions, redirects, per-cpu caching, vrfs, ...

I also tried to find a sysctl to ignore the MTU exceptions; as a router
it’s not my problem at all: if the packet can be fragmented, I will do it,
otherwise I will drop it. MTU negotiation is the duty of the endpoints.
I’m not taking MSS clamping into account as it’s done with iptables
and the mangle table is empty (but CONFIG_IP_NF_MANGLE is enabled).

> In 4.9 FIB entries are created as an rt6_info which is a v6 wrapper
> around dst_entry. That changed in 4.15 or 4.16 - I forget which now, and
> the commit you reference above is part of the refactoring to make IPv6
> more like IPv4 with a different, smaller data structure for fib entries.
> A lot of other changes have also gone into IPv6 between 4.9 and top of
> tree, and at this point the whole gc thing can probably go away for v6
> like it was removed for ipv4.

As it works on 4.15 (the kernel shipped with Proxmox 5), I would say
that it was introduced in 4.16. But I haven’t checked each version
myself.

> Try the 5.4 LTS and see if you still hit a problem.

I have the problem with 5.3 (Proxmox 6), so unless FIB handling has
changed since then, I doubt that it will work, but I will try on
Monday.

Regards,
-- 
Alarig


* Re: IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
  2020-03-08 10:57   ` Alarig Le Lay
@ 2020-03-09  2:15     ` David Ahern
  2020-03-09  8:59       ` Fabian Grünbichler
  2020-03-10 10:35       ` Alarig Le Lay
  0 siblings, 2 replies; 16+ messages in thread
From: David Ahern @ 2020-03-09  2:15 UTC (permalink / raw)
  To: Alarig Le Lay; +Cc: netdev, jack, Vincent Bernat

On 3/8/20 4:57 AM, Alarig Le Lay wrote:
> On Sat  7 Mar 17:52:10 2020, David Ahern wrote:
>> On 3/5/20 1:17 AM, Alarig Le Lay wrote:
>> Kernel version?
> 
> In my experience I’ve seen this from 4.19 on; it works at least up to 4.15.
> 
>> you are monitoring neighbor states with 'ip monitor' or something else?
> 
> Yes, 'ip -ts monitor neigh' to be exact.
> 
>> The above does not reproduce for me on 5.6 or 4.19, and I would have
>> been really surprised if it had, so I have to question the git bisect
>> result.
> 
> My personal experience is that, when routing is activated (and a full
> view is loaded - I don’t have any soft router without one), the
> neighbors flap, thus causing a blackhole.
> It doesn’t happen when the amount of traffic processed is limited. The
> threshold is around 20 Mbps from what I can see.

If you are using an x86-based CPU you can do this:
    perf probe ip6_dst_alloc%return ret=%ax

    perf record -e probe:* -a -g -- sleep 10
    --> run this during the flapping

    perf script

this will show if the flapping is due to dst alloc failures.

Other things to try:
    perf probe ip6_dst_gc
    perf stat -e probe:* -a -I 1000
    --> will show calls/sec to running dst gc


    perf probe __ip6_rt_update_pmtu
    perf stat -e probe:* -a -I 1000
    --> will show calls/sec to mtu updating


    perf probe rt6_insert_exception
    perf stat -e probe:* -a -I 1000
    --> shows calls/sec to inserting exceptions

(in each you can remove the previous probe using 'perf probe -d <name>'
or use -e <exact name> to only see data for the one event).
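
For convenience, a sketch that sets up the probes above in one go and
removes them all afterwards (same probe names, standard perf usage):

    for fn in ip6_dst_gc __ip6_rt_update_pmtu rt6_insert_exception; do
        perf probe "$fn"
    done
    perf stat -e 'probe:*' -a -I 1000

    # when finished, delete all dynamic probes
    perf probe -d '*'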

> I have the problem with 5.3 (Proxmox 6), so unless FIB handling has
> changed since then, I doubt that it will work, but I will try on
> Monday.
> 

a fair amount of changes went in through 5.4 including improvements to
neighbor handling. 5.4 (I think) also had changes around dumping the
route cache.


* Re: IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
  2020-03-09  2:15     ` David Ahern
@ 2020-03-09  8:59       ` Fabian Grünbichler
  2020-03-09 10:47         ` Alarig Le Lay
  2020-03-10 10:35       ` Alarig Le Lay
  1 sibling, 1 reply; 16+ messages in thread
From: Fabian Grünbichler @ 2020-03-09  8:59 UTC (permalink / raw)
  To: Alarig Le Lay, David Ahern; +Cc: Vincent Bernat, jack, netdev

On March 9, 2020 3:15 am, David Ahern wrote:
> On 3/8/20 4:57 AM, Alarig Le Lay wrote:
>> On Sat  7 Mar 17:52:10 2020, David Ahern wrote:
>> I have the problem with 5.3 (Proxmox 6), so unless FIB handling has
>> changed since then, I doubt that it will work, but I will try on
>> Monday.
>> 
> 
> a fair amount of changes went in through 5.4 including improvements to
> neighbor handling. 5.4 (I think) also had changes around dumping the
> route cache.

FWIW, there is a 5.4-based kernel preview package available in the 
pvetest repository for Proxmox VE 6.x:

http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/pve-kernel-5.4.22-1-pve_5.4.22-1_amd64.deb



* Re: IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
  2020-03-09  8:59       ` Fabian Grünbichler
@ 2020-03-09 10:47         ` Alarig Le Lay
  2020-03-09 11:35           ` Fabian Grünbichler
  0 siblings, 1 reply; 16+ messages in thread
From: Alarig Le Lay @ 2020-03-09 10:47 UTC (permalink / raw)
  To: Fabian Grünbichler; +Cc: David Ahern, Vincent Bernat, jack, netdev

On Mon  9 Mar 09:59:30 2020, Fabian Grünbichler wrote:
> On March 9, 2020 3:15 am, David Ahern wrote:
> > On 3/8/20 4:57 AM, Alarig Le Lay wrote:
> >> On Sat  7 Mar 17:52:10 2020, David Ahern wrote:
> >> I have the problem with 5.3 (Proxmox 6), so unless FIB handling has
> >> changed since then, I doubt that it will work, but I will try on
> >> Monday.
> >> 
> > 
> > a fair amount of changes went in through 5.4 including improvements to
> > neighbor handling. 5.4 (I think) also had changes around dumping the
> > route cache.
> 
> FWIW, there is a 5.4-based kernel preview package available in the 
> pvetest repository for Proxmox VE 6.x:
> 
> http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/pve-kernel-5.4.22-1-pve_5.4.22-1_amd64.deb

Thanks for the kernel!

I’m still having some issues with 5.4:
root@hv03:~# ip -6 -ts monitor neigh | grep -P 'INCOMPLETE|FAILED'
[2020-03-09T11:35:51.276329] 2a00:5884:0:6::2 dev vmbr12  router FAILED
[2020-03-09T11:36:03.308359] fe80::5287:89ff:fef0:ce81 dev vmbr12  router FAILED
[2020-03-09T11:36:21.996250] 2a00:5884:0:6::1 dev vmbr12  router FAILED
[2020-03-09T11:36:32.524389] 2a00:5884:0:6::1 dev vmbr12  router FAILED
[2020-03-09T11:36:34.800303] 2a00:5884:0:6::2 dev vmbr12  router FAILED
[2020-03-09T11:36:36.588333] 2a00:5884:0:6::1 dev vmbr12  router FAILED
[2020-03-09T11:36:41.196351] 2a00:5884:0:6::1 dev vmbr12  router FAILED

And BGP sessions are flapping as well:
root@hv03:~# birdc6 sh pr | grep bgp
ibgp_asbr01_ipv6 BGP      master   up     11:40:28    Established
ibgp_asbr02_ipv6 BGP      master   up     11:41:09    Established
root@hv03:~# ps -o lstart -p $(pgrep bird6)
                 STARTED
Mon Mar  9 11:14:44 2020


Don’t you build linux-perf? I can only find it in the Debian repo but
not in the Proxmox one nor at
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/

root@hv03:~# apt search linux-perf
Sorting... Done
Full Text Search... Done
linux-perf/stable 4.19+105+deb10u3 all
  Performance analysis tools for Linux (meta-package)

linux-perf-4.19/stable 4.19.98-1 amd64
  Performance analysis tools for Linux 4.19

root@hv03:~# apt policy linux-perf/stable
N: Unable to locate package linux-perf/stable
root@hv03:~# apt policy linux-perf
linux-perf:
  Installed: (none)
  Candidate: 4.19+105+deb10u3
  Version table:
     4.19+105+deb10u3 500
        500 http://mirror.grifon.fr/debian buster/main amd64 Packages
root@hv03:~# apt policy linux-perf-4.19
linux-perf-4.19:
  Installed: (none)
  Candidate: 4.19.98-1
  Version table:
     4.19.98-1 500
        500 http://mirror.grifon.fr/debian buster/main amd64 Packages
     4.19.67-2+deb10u2 500
        500 http://mirror.grifon.fr/debian-security buster/updates/main amd64 Packages
root@hv03:~#

Regards,
-- 
Alarig


* Re: IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
  2020-03-09 10:47         ` Alarig Le Lay
@ 2020-03-09 11:35           ` Fabian Grünbichler
  0 siblings, 0 replies; 16+ messages in thread
From: Fabian Grünbichler @ 2020-03-09 11:35 UTC (permalink / raw)
  To: Alarig Le Lay; +Cc: Vincent Bernat, David Ahern, jack, netdev, t.lamprecht

On March 9, 2020 11:47 am, Alarig Le Lay wrote:
> On Mon  9 Mar 09:59:30 2020, Fabian Grünbichler wrote:
>> On March 9, 2020 3:15 am, David Ahern wrote:
>> > On 3/8/20 4:57 AM, Alarig Le Lay wrote:
>> >> On Sat  7 Mar 17:52:10 2020, David Ahern wrote:
>> >> I have the problem with 5.3 (Proxmox 6), so unless FIB handling has
>> >> changed since then, I doubt that it will work, but I will try on
>> >> Monday.
>> >> 
>> > 
>> > a fair amount of changes went in through 5.4 including improvements to
>> > neighbor handling. 5.4 (I think) also had changes around dumping the
>> > route cache.
>> 
>> FWIW, there is a 5.4-based kernel preview package available in the 
>> pvetest repository for Proxmox VE 6.x:
>> 
>> http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/pve-kernel-5.4.22-1-pve_5.4.22-1_amd64.deb
> 
> Thanks for the kernel!
> 
> I’m still having some issues with 5.4:
> root@hv03:~# ip -6 -ts monitor neigh | grep -P 'INCOMPLETE|FAILED'
> [2020-03-09T11:35:51.276329] 2a00:5884:0:6::2 dev vmbr12  router FAILED
> [2020-03-09T11:36:03.308359] fe80::5287:89ff:fef0:ce81 dev vmbr12  router FAILED
> [2020-03-09T11:36:21.996250] 2a00:5884:0:6::1 dev vmbr12  router FAILED
> [2020-03-09T11:36:32.524389] 2a00:5884:0:6::1 dev vmbr12  router FAILED
> [2020-03-09T11:36:34.800303] 2a00:5884:0:6::2 dev vmbr12  router FAILED
> [2020-03-09T11:36:36.588333] 2a00:5884:0:6::1 dev vmbr12  router FAILED
> [2020-03-09T11:36:41.196351] 2a00:5884:0:6::1 dev vmbr12  router FAILED
> 
> And BGP sessions are flapping as well:
> root@hv03:~# birdc6 sh pr | grep bgp
> ibgp_asbr01_ipv6 BGP      master   up     11:40:28    Established
> ibgp_asbr02_ipv6 BGP      master   up     11:41:09    Established
> root@hv03:~# ps -o lstart -p $(pgrep bird6)
>                  STARTED
> Mon Mar  9 11:14:44 2020
> 
> 
> Don’t you build linux-perf? I can only find it in the Debian repo but
> not in the Proxmox one nor at
> http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/

It's called 'linux-tools-$KERNEL_ABI', e.g. linux-tools-5.4, in PVE.



* Re: IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
  2020-03-09  2:15     ` David Ahern
  2020-03-09  8:59       ` Fabian Grünbichler
@ 2020-03-10 10:35       ` Alarig Le Lay
  2020-03-10 15:27         ` David Ahern
  1 sibling, 1 reply; 16+ messages in thread
From: Alarig Le Lay @ 2020-03-10 10:35 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, jack, Vincent Bernat

On dim.  8 mars 20:15:14 2020, David Ahern wrote:
> If you are using an x86-based CPU you can do this:
>     perf probe ip6_dst_alloc%return ret=%ax
> 
>     perf record -e probe:* -a -g -- sleep 10
>     --> run this during the flapping
> 
>     perf script

For this probe I see the following: https://paste.swordarmor.fr/raw/pt9b

> this will show if the flapping is due to dst alloc failures.
> 
> Other things to try:
>     perf probe ip6_dst_gc
>     perf stat -e probe:* -a -I 1000
>     --> will show calls/sec to running dst gc

https://paste.swordarmor.fr/raw/uBnm

>     perf probe __ip6_rt_update_pmtu
>     perf stat -e probe:* -a -I 1000
>     --> will show calls/sec to mtu updating

This probe always stays at 0, even when NDP is failing.
 
>     perf probe rt6_insert_exception
>     perf stat -e probe:* -a -I 1000
>     --> shows calls/sec to inserting exceptions

Same as the last one.

> (in each you can remove the previous probe using 'perf probe -d <name>'
> or use -e <exact name> to only see data for the one event).
> 
> > I have the problem with 5.3 (Proxmox 6), so unless FIB handling has
> > changed since then, I doubt that it will work, but I will try on
> > Monday.
> > 
> 
> a fair amount of changes went in through 5.4 including improvements to
> neighbor handling. 5.4 (I think) also had changes around dumping the
> route cache.

Regards,
-- 
Alarig


* Re: IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
  2020-03-10 10:35       ` Alarig Le Lay
@ 2020-03-10 15:27         ` David Ahern
  2020-03-29 14:09           ` Alarig Le Lay
  0 siblings, 1 reply; 16+ messages in thread
From: David Ahern @ 2020-03-10 15:27 UTC (permalink / raw)
  To: Alarig Le Lay; +Cc: netdev, jack, Vincent Bernat

On 3/10/20 4:35 AM, Alarig Le Lay wrote:
> On Sun  8 Mar 20:15:14 2020, David Ahern wrote:
>> If you are using an x86-based CPU you can do this:
>>     perf probe ip6_dst_alloc%return ret=%ax
>>
>>     perf record -e probe:* -a -g -- sleep 10
>>     --> run this during the flapping
>>
>>     perf script
> 
> For this probe I see the following: https://paste.swordarmor.fr/raw/pt9b

none of the dst allocations are failing.

Are the failing windows always ~30 seconds long?


> 
>> this will show if the flapping is due to dst alloc failures.
>>
>> Other things to try:
>>     perf probe ip6_dst_gc
>>     perf stat -e probe:* -a -I 1000
>>     --> will show calls/sec to running dst gc
> 
> https://paste.swordarmor.fr/raw/uBnm

This is not lining up with the dst allocations above. The gc function is
only invoked from dst_alloc, and for ipv6 all dst allocations go through
ip6_dst_alloc.

> 
>>     perf probe __ip6_rt_update_pmtu
>>     perf stat -e probe:* -a -I 1000
>>     --> will show calls/sec to mtu updating
> 
> This probe always stays at 0, even when NDP is failing.
>  
>>     perf probe rt6_insert_exception
>>     perf stat -e probe:* -a -I 1000
>>     --> shows calls/sec to inserting exceptions
> 
> Same as the last one.

so no exception handling.

How many ipv6 sockets are open? (ss -6tpn)

How many ipv6 neighbor entries exist? (ip -6 neigh sh)


* Re: IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
  2020-03-10 15:27         ` David Ahern
@ 2020-03-29 14:09           ` Alarig Le Lay
  0 siblings, 0 replies; 16+ messages in thread
From: Alarig Le Lay @ 2020-03-29 14:09 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, jack, Vincent Bernat

On Tue 10 Mar 09:27:53 2020, David Ahern wrote:
> Are the failing windows always ~30 seconds long?

It seems so. At least it’s a relatively fixed amount of time.

> How many ipv6 sockets are open? (ss -6tpn)

Only the BGP daemon.

root@hv03:~# ss -6tpn
State           Recv-Q           Send-Q                          Local Address:Port                            Peer Address:Port                                                     
ESTAB           0                0                          [2a00:5884:0:6::a]:55418                     [2a00:5884:0:6::2]:179            users:(("bird6",pid=1824,fd=10))          
ESTAB           0                0                          [2a00:5884:0:6::a]:56892                     [2a00:5884:0:6::1]:179            users:(("bird6",pid=1824,fd=11))          


> How many ipv6 neighbor entries exist? (ip -6 neigh sh)

root@hv03:~# ip -6 neigh sh
fe80::21b:21ff:fe48:6899 dev vmbr12 lladdr 00:1b:21:48:68:99 router DELAY
fe80::c8fd:83ff:fe88:7052 dev tap116i0 lladdr ca:fd:83:88:70:52 PERMANENT
2a00:5884:8204::1 dev vmbr1 lladdr ca:fd:83:88:70:52 STALE
fe80::5476:43ff:fe0f:209d dev vmbr0 lladdr 56:76:43:0f:20:9d STALE
fe80::3c19:f7ff:fe18:f9ca dev vmbr0 lladdr 3e:19:f7:18:f9:ca STALE
fe80::ecc0:e7ff:fe97:b4d9 dev vmbr0 lladdr ee:c0:e7:97:b4:d9 STALE
fe80::7a2b:cbff:fe4c:d537 dev vmbr13 lladdr 78:2b:cb:4c:d5:37 STALE
fe80::a093:a1ff:fe14:8c8a dev vmbr0 lladdr a2:93:a1:14:8c:8a STALE
fe80::9a4b:e1ff:fe64:b90 dev vmbr0 lladdr 98:4b:e1:64:0b:90 STALE
2a00:5884:0:6::2 dev vmbr12 lladdr 00:1b:21:48:68:99 router REACHABLE
fe80::5287:89ff:fef0:ce81 dev vmbr12 lladdr 50:87:89:f0:ce:81 router REACHABLE
fe80::c8fd:83ff:fe88:7052 dev vmbr1 lladdr ca:fd:83:88:70:52 STALE
2a00:5884:0:6::1 dev vmbr12 lladdr 50:87:89:f0:ce:81 router REACHABLE
fe80::7a2b:cbff:fe4c:d537 dev vmbr12 lladdr 78:2b:cb:4c:d5:37 STALE
fe80::7a2b:cbff:fe4c:d537 dev vmbr8 lladdr 78:2b:cb:4c:d5:37 STALE
fe80::5054:ff:fef9:192d dev vmbr0 lladdr 52:54:00:f9:19:2d STALE

Not that many either.


But the good news is that I have a workaround: adding
`net.ipv6.route.gc_thresh = -1` to sysctl.conf.

I don’t know exactly what it does as it’s not documented; I just picked
the IPv4 value.
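
For reference, applying it immediately and persisting it across reboots
(plain sysctl usage):

    sysctl -w net.ipv6.route.gc_thresh=-1
    echo 'net.ipv6.route.gc_thresh = -1' >> /etc/sysctl.conf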

-- 
Alarig


* Re: IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
  2020-03-08  0:52 ` David Ahern
  2020-03-08 10:57   ` Alarig Le Lay
@ 2020-09-27 15:35   ` Baptiste Jonglez
  2020-09-27 16:10     ` Baptiste Jonglez
  1 sibling, 1 reply; 16+ messages in thread
From: Baptiste Jonglez @ 2020-09-27 15:35 UTC (permalink / raw)
  To: David Ahern; +Cc: Alarig Le Lay, netdev, jack, Vincent Bernat, Oliver


Hi,

We are seeing the same issue; more information below.

On 07-03-20, David Ahern wrote:
> On 3/5/20 1:17 AM, Alarig Le Lay wrote:
> > Hi,
> > 
> > On the bird-users ML, we discussed a bug we’re facing when holding a
> > full table: from time to time all IPv6 traffic is dropped (and all
> > neighbors are invalidated); after a while it comes back, then a few
> > minutes later it’s dropped again, and so on.
> 
> Kernel version?

We are seeing the issue with 4.19 (debian stable) and 5.4 (debian
stable backports from a few months ago).  Others reported still seeing
the issue with 5.7:

  http://trubka.network.cz/pipermail/bird-users/2020-September/014877.html
  http://trubka.network.cz/pipermail/bird-users/2020-September/014881.html


Interestingly, the issue manifests itself in several different ways:

1) failing IPv6 neighbours, which is what Alarig reported.  We are seeing
   this on a full-view BGP router with a rather low amount of IPv6 traffic
   (around 10-20 Mbps)


2) high jitter when forwarding IPv6 traffic: this was in the original
   report from Basil and also here: http://trubka.network.cz/pipermail/bird-users/2020-September/014877.html


3) system lockup: the system becomes unresponsive, with messages like:

     watchdog: BUG: soft lockup - CPU#X stuck for XXs!

   and messages about transmit timeouts from the NIC driver.

   This happened to us on a router that has a BGP full view and
   handles around 50-100 Mbps of IPv6 traffic, which probably means
   lots of route lookups.  It happened with both 4.19 and 5.4.  On the
   other hand, kernel 4.9 runs fine on that exact same router (we are
   running debian buster with the old kernel from debian stretch).


When we can't use an older kernel, our current workaround is the
following sysctl config:

    net.ipv6.route.gc_thresh = 100000
    net.ipv6.route.max_size = 400000

From my understanding, this works because it basically disables the gc
in most cases.

However, the "fib_rt_alloc" field from /proc/net/rt6_stats (6th field)
is steadily increasing: after 2 days of uptime it's at 67k.  At some
point it will hit the gc threshold; we'll see what happens.
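
In case it helps, a one-liner to watch that counter (the rt6_stats
fields are printed in hex; GNU awk's strtonum is assumed):

    watch -n 60 "awk '{ print strtonum(\"0x\" \$6) }' /proc/net/rt6_stats"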

I am also trying to reproduce the issue locally.

Thanks,
Baptiste


* Re: IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
  2020-09-27 15:35   ` Baptiste Jonglez
@ 2020-09-27 16:10     ` Baptiste Jonglez
  2020-09-28  3:38       ` David Ahern
  0 siblings, 1 reply; 16+ messages in thread
From: Baptiste Jonglez @ 2020-09-27 16:10 UTC (permalink / raw)
  To: David Ahern; +Cc: Alarig Le Lay, netdev, jack, Vincent Bernat, Oliver


On 27-09-20, Baptiste Jonglez wrote:
> 1) failing IPv6 neighbours, which is what Alarig reported.  We are seeing
>    this on a full-view BGP router with a rather low amount of IPv6 traffic
>    (around 10-20 Mbps)

Ok, I found a quick way to reproduce this issue:

    # for net in {1..9999}; do ip -6 route add 2001:db8:ffff:${net}::/64 via fe80::4242 dev lo; done

and then:

    # for net in {1..9999}; do ping -c1 2001:db8:ffff:${net}::1; done

This quickly gets to a situation where ping fails early with:

    ping: connect: Network is unreachable

At this point, IPv6 connectivity is broken.  The kernel is no longer
replying to IPv6 neighbor solicitation from other hosts on local
networks.

When this happens, the "fib_rt_alloc" field from /proc/net/rt6_stats
is roughly equal to net.ipv6.route.max_size (a bit more in my tests).
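
To compare the two directly while reproducing (the rt6_stats fields are
hex, so convert before comparing):

    sysctl -n net.ipv6.route.max_size
    cut -d' ' -f6 /proc/net/rt6_stats    # hex value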

Interestingly, the system appears to stay in this broken state
indefinitely, even without trying to send new IPv6 traffic.  The
fib_rt_alloc statistic does not decrease.

Hope this helps,
Baptiste


* Re: IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
  2020-09-27 16:10     ` Baptiste Jonglez
@ 2020-09-28  3:38       ` David Ahern
  2020-09-28  5:39         ` Vincent Bernat
  2020-09-28  6:48         ` Baptiste Jonglez
  0 siblings, 2 replies; 16+ messages in thread
From: David Ahern @ 2020-09-28  3:38 UTC (permalink / raw)
  To: Baptiste Jonglez; +Cc: Alarig Le Lay, netdev, jack, Vincent Bernat, Oliver

On 9/27/20 9:10 AM, Baptiste Jonglez wrote:
> On 27-09-20, Baptiste Jonglez wrote:
>> 1) failing IPv6 neighbours, which is what Alarig reported.  We are seeing
>>    this on a full-view BGP router with a rather low amount of IPv6 traffic
>>    (around 10-20 Mbps)
> 
> Ok, I found a quick way to reproduce this issue:
> 
>     # for net in {1..9999}; do ip -6 route add 2001:db8:ffff:${net}::/64 via fe80::4242 dev lo; done
> 
> and then:
> 
>     # for net in {1..9999}; do ping -c1 2001:db8:ffff:${net}::1; done
> 
> This quickly gets to a situation where ping fails early with:
> 
>     ping: connect: Network is unreachable
> 
> At this point, IPv6 connectivity is broken.  The kernel is no longer
> replying to IPv6 neighbor solicitation from other hosts on local
> networks.
> 
> When this happens, the "fib_rt_alloc" field from /proc/net/rt6_stats
> is roughly equal to net.ipv6.route.max_size (a bit more in my tests).
> 
> Interestingly, the system appears to stay in this broken state
> indefinitely, even without trying to send new IPv6 traffic.  The
> fib_rt_alloc statistics does not decrease.
> 

fib_rt_alloc is incremented by calls to ip6_dst_alloc. Each of your
9,999 pings is to a unique address and hence causes a dst to be
allocated and the counter to be incremented. It is never decremented.
That is standard operating procedure.


* Re: IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
  2020-09-28  3:38       ` David Ahern
@ 2020-09-28  5:39         ` Vincent Bernat
  2020-09-28  6:48         ` Baptiste Jonglez
  1 sibling, 0 replies; 16+ messages in thread
From: Vincent Bernat @ 2020-09-28  5:39 UTC (permalink / raw)
  To: David Ahern; +Cc: Baptiste Jonglez, Alarig Le Lay, netdev, jack, Oliver

 ❦ September 27, 2020 20:38 -07, David Ahern:

> fib_rt_alloc is incremented by calls to ip6_dst_alloc. Each of your
> 9,999 pings is to a unique address and hence causes a dst to be
> allocated and the counter to be incremented. It is never decremented.
> That is standard operating procedure.

At some point, only PMTU exceptions would create a dst entry.
-- 
Program defensively.
            - The Elements of Programming Style (Kernighan & Plauger)


* Re: IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
  2020-09-28  3:38       ` David Ahern
  2020-09-28  5:39         ` Vincent Bernat
@ 2020-09-28  6:48         ` Baptiste Jonglez
  2020-09-29  3:39           ` David Ahern
  1 sibling, 1 reply; 16+ messages in thread
From: Baptiste Jonglez @ 2020-09-28  6:48 UTC (permalink / raw)
  To: David Ahern; +Cc: Alarig Le Lay, netdev, jack, Vincent Bernat, Oliver


On 27-09-20, David Ahern wrote:
> On 9/27/20 9:10 AM, Baptiste Jonglez wrote:
> > On 27-09-20, Baptiste Jonglez wrote:
> >> 1) failing IPv6 neighbours, which is what Alarig reported.  We are seeing
> >>    this on a full-view BGP router with a rather low amount of IPv6 traffic
> >>    (around 10-20 Mbps)
> > 
> > Ok, I found a quick way to reproduce this issue:
> > 
> >     # for net in {1..9999}; do ip -6 route add 2001:db8:ffff:${net}::/64 via fe80::4242 dev lo; done
> > 
> > and then:
> > 
> >     # for net in {1..9999}; do ping -c1 2001:db8:ffff:${net}::1; done
> > 
> > This quickly gets to a situation where ping fails early with:
> > 
> >     ping: connect: Network is unreachable
> > 
> > At this point, IPv6 connectivity is broken.  The kernel is no longer
> > replying to IPv6 neighbor solicitation from other hosts on local
> > networks.
> > 
> > When this happens, the "fib_rt_alloc" field from /proc/net/rt6_stats
> > is roughly equal to net.ipv6.route.max_size (a bit more in my tests).
> > 
> > Interestingly, the system appears to stay in this broken state
> > indefinitely, even without trying to send new IPv6 traffic.  The
> > fib_rt_alloc statistics does not decrease.
> > 
> 
> fib_rt_alloc is incremented by calls to ip6_dst_alloc. Each of your
> 9,999 pings is to a unique address and hence causes a dst to be
> allocated and the counter to be incremented. It is never decremented.
> That is standard operating procedure.

Ok, then this is a change in behaviour.  Here is a graph of fib_rt_alloc
on a busy router (IPv6 full view, moderate IPv6 traffic) with 4.9 kernel:

  https://files.polyno.me/tmp/rt6_stats_fib_rt_alloc_4.9.png

It varies quite a lot and stays around 50, so clearly it can be
decremented in regular operation.

On 4.19 and later, it does seem to be decremented only when a route is
removed (ip -6 route delete).  Here is the same graph on a router with a
4.19 kernel and a large net.ipv6.route.max_size:

   https://files.polyno.me/tmp/rt6_stats_fib_rt_alloc_4.19.png

Overall, do you mean that fib_rt_alloc is a red herring and is not a good
marker of the issue?

Thanks,
Baptiste


* Re: IPv6 regression introduced by commit 3b6761d18bc11f2af2a6fc494e9026d39593f22c
  2020-09-28  6:48         ` Baptiste Jonglez
@ 2020-09-29  3:39           ` David Ahern
  0 siblings, 0 replies; 16+ messages in thread
From: David Ahern @ 2020-09-29  3:39 UTC (permalink / raw)
  To: Baptiste Jonglez; +Cc: Alarig Le Lay, netdev, jack, Vincent Bernat, Oliver

On 9/27/20 11:48 PM, Baptiste Jonglez wrote:
> On 27-09-20, David Ahern wrote:
>> On 9/27/20 9:10 AM, Baptiste Jonglez wrote:
>>> On 27-09-20, Baptiste Jonglez wrote:
>>>> 1) failing IPv6 neighbours, which is what Alarig reported.  We are seeing
>>>>    this on a full-view BGP router with a rather low amount of IPv6 traffic
>>>>    (around 10-20 Mbps)
>>>
>>> Ok, I found a quick way to reproduce this issue:
>>>
>>>     # for net in {1..9999}; do ip -6 route add 2001:db8:ffff:${net}::/64 via fe80::4242 dev lo; done
>>>
>>> and then:
>>>
>>>     # for net in {1..9999}; do ping -c1 2001:db8:ffff:${net}::1; done
>>>
>>> This quickly gets to a situation where ping fails early with:
>>>
>>>     ping: connect: Network is unreachable
>>>
>>> At this point, IPv6 connectivity is broken.  The kernel is no longer
>>> replying to IPv6 neighbor solicitation from other hosts on local
>>> networks.
>>>
>>> When this happens, the "fib_rt_alloc" field from /proc/net/rt6_stats
>>> is roughly equal to net.ipv6.route.max_size (a bit more in my tests).
>>>
>>> Interestingly, the system appears to stay in this broken state
>>> indefinitely, even without trying to send new IPv6 traffic.  The
>>> fib_rt_alloc statistics does not decrease.
>>>
>>
>> fib_rt_alloc is incremented by calls to ip6_dst_alloc. Each of your
>> 9,999 pings is to a unique address and hence causes a dst to be
>> allocated and the counter to be incremented. It is never decremented.
>> That is standard operating procedure.
> 
> Ok, then this is a change in behaviour.  Here is a graph of fib_rt_alloc
> on a busy router (IPv6 full view, moderate IPv6 traffic) with 4.9 kernel:
> 
>   https://files.polyno.me/tmp/rt6_stats_fib_rt_alloc_4.9.png
> 
> It varies quite a lot and stays around 50, so clearly it can be
> decremented in regular operation.
> 
> On 4.19 and later, it does seem to be decremented only when a route is
> removed (ip -6 route delete).  Here is the same graph on a router with a
> 4.19 kernel and a large net.ipv6.route.max_size:
> 
>    https://files.polyno.me/tmp/rt6_stats_fib_rt_alloc_4.19.png
> 
> Overall, do you mean that fib_rt_alloc is a red herring and is not a good
> marker of the issue?
> 

$ git checkout v4.9
$ egrep -r fib_rt_alloc include/ net/
include//net/ip6_fib.h:	__u32		fib_rt_alloc;		/* permanent routes	*/
net//ipv6/route.c:		   net->ipv6.rt6_stats->fib_rt_alloc,

The first declares it; the second prints it. That's it: no other users,
so I have no idea why it shows any changes in your v4.9 graph.

Looking at git history shows that Wei actually wired up the stats with
commit 81eb8447daae3b62247aa66bb17b82f8fef68249

Author: Wei Wang <weiwan@google.com>
Date:   Fri Oct 6 12:06:11 2017 -0700

    ipv6: take care of rt6_stats

That patch adds an inc but no dec for this stat, which is what you show
in your 4.19 graph.

Coming back to the bigger problem: fib_rt_alloc has *no* bearing on the
ability to create dst entries, which is what the route max_size sysctl
affects (not FIB entries, which are now unbounded, but dst_entry
instances, which are created when a FIB entry has been hit and used in
the datapath to move packets).

Eric investigated a similar problem recently which resulted in

commit d8882935fcae28bceb5f6f56f09cded8d36d85e6
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 8 07:34:14 2020 -0700

    ipv6: use DST_NOCOUNT in ip6_rt_pcpu_alloc()

and I believe it was released in v5.8.
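
One way to check whether a given kernel carries that change, assuming a
local clone of the kernel tree:

    git describe --contains d8882935fcae28bceb5f6f56f09cded8d36d85e6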

