* Any reason why arp monitor keeps emitting netlink failover events?
From: Jonathan Toppins @ 2022-06-14 2:59 UTC (permalink / raw)
To: netdev; +Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, Hangbin Liu
On net-next/master from today, I see netlink failover events being
emitted from an active-backup bond. In the ip monitor dump you can see
the bond is up (according to the link status) but keeps emitting
failover events and I am not sure why. This appears to be an issue also
on Fedora 35 and CentOS 8 kernels. The configuration appears to be
correct, though I could be missing something. Thoughts?
-Jon
Upstream Commit:
c04245328dd7 net: make __sys_accept4_file() static
Console Log:
[root@fedora ~]# cat ./bond-bz2094911.sh
#!/bin/sh
set -e
set -x
dmesg -C
ip -all netns delete || true
ip link del link1_1 || true
ip link del link2_1 || true
modprobe -r bonding
ip link add name link1_1 type veth peer name link1_2
ip link add name link2_1 type veth peer name link2_2
ip netns add ns1
ip netns exec ns1 ip link add name br0 type bridge vlan_filtering 1
ip link set link1_2 up netns ns1
ip link set link2_2 up netns ns1
ip netns exec ns1 ip link set link1_2 master br0
ip netns exec ns1 ip link set link2_2 master br0
ip netns exec ns1 ip addr add 192.168.30.5/24 dev br0
ip netns exec ns1 ip addr add 192.168.30.7/24 dev br0
ip netns exec ns1 ip link set br0 up
ip link add name bond0 type bond mode active-backup arp_all_targets any \
arp_ip_target "192.168.30.7,192.168.30.5" arp_interval 1000 \
fail_over_mac follow arp_validate active primary_reselect always \
primary link1_1
ip link set bond0 up
ip addr add 192.168.30.10/24 dev bond0
ifenslave bond0 link1_1 link2_1
#ip -ts -o monitor link
[root@fedora ~]# ip -ts -o monitor link dev bond0
[2022-06-13T22:20:35.289034] [2022-06-13T22:20:35.289846]
[2022-06-13T22:20:35.289978] [2022-06-13T22:20:35.290089]
[2022-06-13T22:20:35.290200] [2022-06-13T22:20:35.290311] 14: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event BONDING FAILOVER \ link/ether fe:5b:52:88:61:68
brd ff:ff:ff:ff:ff:ff
[2022-06-13T22:20:35.291055] 14: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event NOTIFY PEERS \ link/ether fe:5b:52:88:61:68 brd
ff:ff:ff:ff:ff:ff
[2022-06-13T22:20:35.324494] 14: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event RESEND IGMP \ link/ether fe:5b:52:88:61:68 brd
ff:ff:ff:ff:ff:ff
[2022-06-13T22:20:36.312078] [2022-06-13T22:20:36.312296]
[2022-06-13T22:20:36.312549] [2022-06-13T22:20:36.312670]
[2022-06-13T22:20:36.312782] [2022-06-13T22:20:36.312893] 14: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event BONDING FAILOVER \ link/ether fe:5b:52:88:61:68
brd ff:ff:ff:ff:ff:ff
[2022-06-13T22:20:36.313134] 14: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event NOTIFY PEERS \ link/ether fe:5b:52:88:61:68 brd
ff:ff:ff:ff:ff:ff
[2022-06-13T22:20:36.346157] 14: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event RESEND IGMP \ link/ether fe:5b:52:88:61:68 brd
ff:ff:ff:ff:ff:ff
[root@fedora tests]# ip -d link show dev bond0
14: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc
noqueue state UP mode DEFAULT group default qlen 1000
link/ether fe:5b:52:88:61:68 brd ff:ff:ff:ff:ff:ff promiscuity 0
minmtu 68 maxmtu 65535
bond mode active-backup active_slave link1_1 miimon 0 updelay 0
downdelay 0 peer_notify_delay 0 use_carrier 1 arp_interval 1000
arp_missed_max 2 arp_ip_target 192.168.30.7,192.168.30.5 arp_validate
active arp_all_targets any primary link1_1 primary_reselect always
fail_over_mac follow xmit_hash_policy layer2 resend_igmp 1 num_grat_arp
1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1
lacp_active on lacp_rate slow ad_select stable tlb_dynamic_lb 1
addrgenmode eui64 numtxqueues 16 numrxqueues 16 gso_max_size 65536
gso_max_segs 65535 gro_max_size 65536
[root@fedora tests]#
[root@fedora tests]# ip -d -s link show dev link1_1
11: link1_1@if10: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc
noqueue master bond0 state UP mode DEFAULT group default qlen 1000
link/ether aa:48:a3:a3:2b:2b brd ff:ff:ff:ff:ff:ff link-netns ns1
promiscuity 0 minmtu 68 maxmtu 65535
veth
bond_slave state BACKUP mii_status DOWN link_failure_count 466
perm_hwaddr b6:19:b6:e3:29:c5 queue_id 0 addrgenmode eui64 numtxqueues 1
numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 gro_max_size 65536
RX: bytes packets errors dropped missed mcast
295004 5622 0 0 0 0
TX: bytes packets errors dropped carrier collsns
254824 4678 0 0 0 0
[root@fedora tests]# ip -d -s link show dev link2_1
13: link2_1@if12: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc
noqueue master bond0 state UP mode DEFAULT group default qlen 1000
link/ether aa:48:a3:a3:2b:2b brd ff:ff:ff:ff:ff:ff link-netns ns1
promiscuity 0 minmtu 68 maxmtu 65535
veth
bond_slave state BACKUP mii_status UP link_failure_count 0
perm_hwaddr aa:48:a3:a3:2b:2b queue_id 0 addrgenmode eui64 numtxqueues 1
numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 gro_max_size 65536
RX: bytes packets errors dropped missed mcast
303452 5776 0 0 0 0
TX: bytes packets errors dropped carrier collsns
179592 2866 0 0 0 0
* Re: Any reason why arp monitor keeps emitting netlink failover events?
From: Jay Vosburgh @ 2022-06-14 15:29 UTC (permalink / raw)
To: Jonathan Toppins; +Cc: netdev, Veaceslav Falico, Andy Gospodarek, Hangbin Liu
Jonathan Toppins <jtoppins@redhat.com> wrote:
>On net-next/master from today, I see netlink failover events being emitted
>from an active-backup bond. In the ip monitor dump you can see the bond is
>up (according to the link status) but keeps emitting failover events and I
>am not sure why. This appears to be an issue also on Fedora 35 and CentOS
>8 kernels. The configuration appears to be correct, though I could be
>missing something. Thoughts?
Anything showing up in the dmesg? There's only one place that
generates the FAILOVER notifier, and it ought to have a corresponding
message in the kernel log.
Also, I note that the link1_1 veth has a lot of failures:
> bond_slave state BACKUP mii_status DOWN link_failure_count 466
Perhaps the bridge within netns ns1 is disabling the port for
some reason?
-J
* Re: Any reason why arp monitor keeps emitting netlink failover events?
From: Jonathan Toppins @ 2022-06-14 17:07 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: netdev, Veaceslav Falico, Andy Gospodarek, Hangbin Liu
On 6/14/22 11:29, Jay Vosburgh wrote:
>
> Anything showing up in the dmesg? There's only one place that
> generates the FAILOVER notifier, and it ought to have a corresponding
> message in the kernel log.
>
> Also, I note that the link1_1 veth has a lot of failures:
Yes, all of those failures are created by the setup; I waited about 5
minutes before dumping the link info. The failover occurs about once a
second. Note this is just a representation of a physical network so that
others can run the setup. The script `bond-bz2094911.sh`, dumped via cat
in the original email, easily reproduces the issue.
Here is the kernel log; I have dynamic debug enabled for the entire
bonding module:
[66898.926972] bond0: bond_netdev_event received NETDEV_BONDING_FAILOVER
[66898.926978] bond0: bond_master_netdev_event called
[66898.928717] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66898.928744] bond0: bond_netdev_event received NETDEV_NOTIFY_PEERS
[66898.928746] bond0: bond_master_netdev_event called
[66898.928839] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66898.928853] bond0: (slave link1_1): bond_arp_send_all: target
192.168.30.7
[66898.928867] bond0: (slave link1_1): arp 1 on slave: dst 192.168.30.7
src 192.168.30.10
[66898.928915] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66898.928919] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66898.928920] bond0: (slave link1_1): bond_arp_rcv: link1_1/0 av 1 sv 1
sip 192.168.30.7 tip 192.168.30.10
[66898.928948] bond0: (slave link1_1): bond_arp_send_all: target
192.168.30.5
[66898.928954] bond0: (slave link1_1): arp 1 on slave: dst 192.168.30.5
src 192.168.30.10
[66898.929045] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66898.929050] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66898.929051] bond0: (slave link1_1): bond_arp_rcv: link1_1/0 av 1 sv 1
sip 192.168.30.5 tip 192.168.30.10
[66898.932441] bond0: (slave link1_1): bond_slave_netdev_event called
[66898.932449] bond0: (slave link2_1): bond_slave_netdev_event called
[66898.934145] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66898.935030] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66898.935043] bond0: bond_netdev_event received NETDEV_RESEND_IGMP
[66898.935045] bond0: bond_master_netdev_event called
[66899.982620] bond0: (slave link1_1): bond_arp_send_all: target
192.168.30.7
[66899.982661] bond0: (slave link1_1): arp 1 on slave: dst 192.168.30.7
src 192.168.30.10
[66899.982894] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66899.982909] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66899.982913] bond0: (slave link1_1): bond_arp_rcv: link1_1/0 av 1 sv 1
sip 192.168.30.7 tip 192.168.30.10
[66899.983110] bond0: (slave link1_1): bond_arp_send_all: target
192.168.30.5
[66899.983130] bond0: (slave link1_1): arp 1 on slave: dst 192.168.30.5
src 192.168.30.10
[66899.983247] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66899.983258] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66899.983262] bond0: (slave link1_1): bond_arp_rcv: link1_1/0 av 1 sv 1
sip 192.168.30.5 tip 192.168.30.10
[66901.006314] bond0: (slave link1_1): bond_arp_send_all: target
192.168.30.7
[66901.006347] bond0: (slave link1_1): arp 1 on slave: dst 192.168.30.7
src 192.168.30.10
[66901.006495] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66901.006507] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66901.006511] bond0: (slave link1_1): bond_arp_rcv: link1_1/0 av 1 sv 1
sip 192.168.30.7 tip 192.168.30.10
[66901.006573] bond0: (slave link1_1): bond_arp_send_all: target
192.168.30.5
[66901.006587] bond0: (slave link1_1): arp 1 on slave: dst 192.168.30.5
src 192.168.30.10
[66901.006679] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66901.006688] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66901.006691] bond0: (slave link1_1): bond_arp_rcv: link1_1/0 av 1 sv 1
sip 192.168.30.5 tip 192.168.30.10
[66902.031207] bond0: (slave link1_1): bond_slave_netdev_event called
[66902.033041] bond0: (slave link1_1): bond_slave_netdev_event called
[66902.033045] bond0: (slave link1_1): link status definitely down,
disabling slave
[66902.033052] bond0: (slave link2_1): making interface the new active one
[66902.035774] bond0: (slave link2_1): bond_slave_netdev_event called
[66902.035790] bond0: (slave link2_1): bond_slave_netdev_event called
[66902.038924] bond0: (slave link2_1): bond_slave_netdev_event called
[66902.038954] bond0: (slave link1_1): bond_slave_netdev_event called
[66902.042318] bond0: (slave link1_1): bond_slave_netdev_event called
[66902.042343] bond0: bond_should_notify_peers: slave link2_1
[66902.044818] bond0: bond_netdev_event received NETDEV_BONDING_FAILOVER
[66902.044822] bond0: bond_master_netdev_event called
[66902.046611] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66902.046623] bond0: bond_netdev_event received NETDEV_NOTIFY_PEERS
[66902.046626] bond0: bond_master_netdev_event called
[66902.046710] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66902.046723] bond0: (slave link2_1): bond_arp_send_all: target
192.168.30.7
[66902.046754] bond0: (slave link2_1): arp 1 on slave: dst 192.168.30.7
src 192.168.30.10
[66902.046807] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66902.046812] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66902.046813] bond0: (slave link2_1): bond_arp_rcv: link2_1/0 av 1 sv 1
sip 192.168.30.7 tip 192.168.30.10
[66902.046841] bond0: (slave link2_1): bond_arp_send_all: target
192.168.30.5
[66902.046848] bond0: (slave link2_1): arp 1 on slave: dst 192.168.30.5
src 192.168.30.10
[66902.046892] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66902.046896] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66902.046897] bond0: (slave link2_1): bond_arp_rcv: link2_1/0 av 1 sv 1
sip 192.168.30.5 tip 192.168.30.10
[66902.050879] bond0: (slave link1_1): bond_slave_netdev_event called
[66902.050888] bond0: (slave link2_1): bond_slave_netdev_event called
[66902.053036] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66902.054391] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66902.054402] bond0: bond_netdev_event received NETDEV_RESEND_IGMP
[66902.054405] bond0: bond_master_netdev_event called
[66903.056566] bond0: (slave link1_1): bond_slave_netdev_event called
[66903.056573] bond0: (slave link1_1): link status definitely up
[66903.056585] bond0: (slave link1_1): making interface the new active one
[66903.063511] bond0: (slave link2_1): bond_slave_netdev_event called
[66903.067482] bond0: (slave link1_1): bond_slave_netdev_event called
[66903.067495] bond0: (slave link1_1): bond_slave_netdev_event called
[66903.075563] bond0: (slave link1_1): bond_slave_netdev_event called
[66903.075622] bond0: (slave link2_1): bond_slave_netdev_event called
[66903.081868] bond0: (slave link2_1): bond_slave_netdev_event called
[66903.081912] bond0: bond_should_notify_peers: slave link1_1
[66903.085588] bond0: bond_netdev_event received NETDEV_BONDING_FAILOVER
[66903.085593] bond0: bond_master_netdev_event called
[66903.088050] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66903.088075] bond0: bond_netdev_event received NETDEV_NOTIFY_PEERS
[66903.088077] bond0: bond_master_netdev_event called
[66903.088132] bond0: (slave link1_1): bond_arp_send_all: target
192.168.30.7
[66903.088147] bond0: (slave link1_1): arp 1 on slave: dst 192.168.30.7
src 192.168.30.10
[66903.088159] bond0: (slave link1_1): bond_arp_send_all: target
192.168.30.5
[66903.088165] bond0: (slave link1_1): arp 1 on slave: dst 192.168.30.5
src 192.168.30.10
[66903.089132] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66903.089139] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66903.089142] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66903.089143] bond0: (slave link1_1): bond_arp_rcv: link1_1/0 av 1 sv 1
sip 192.168.30.7 tip 192.168.30.10
[66903.089170] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66903.089173] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66903.089174] bond0: (slave link1_1): bond_arp_rcv: link1_1/0 av 1 sv 1
sip 192.168.30.5 tip 192.168.30.10
[66903.092128] bond0: (slave link1_1): bond_slave_netdev_event called
[66903.092137] bond0: (slave link2_1): bond_slave_netdev_event called
[66903.092976] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66903.094791] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66903.094803] bond0: bond_netdev_event received NETDEV_RESEND_IGMP
[66903.094805] bond0: bond_master_netdev_event called
[66904.142116] bond0: (slave link1_1): bond_arp_send_all: target
192.168.30.7
[66904.142135] bond0: (slave link1_1): arp 1 on slave: dst 192.168.30.7
src 192.168.30.10
[66904.142214] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66904.142219] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66904.142221] bond0: (slave link1_1): bond_arp_rcv: link1_1/0 av 1 sv 1
sip 192.168.30.7 tip 192.168.30.10
[66904.142256] bond0: (slave link1_1): bond_arp_send_all: target
192.168.30.5
[66904.142263] bond0: (slave link1_1): arp 1 on slave: dst 192.168.30.5
src 192.168.30.10
[66904.142307] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66904.142312] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66904.142313] bond0: (slave link1_1): bond_arp_rcv: link1_1/0 av 1 sv 1
sip 192.168.30.5 tip 192.168.30.10
[66905.166309] bond0: (slave link1_1): bond_arp_send_all: target
192.168.30.7
[66905.166356] bond0: (slave link1_1): arp 1 on slave: dst 192.168.30.7
src 192.168.30.10
[66905.166568] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66905.166586] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66905.166591] bond0: (slave link1_1): bond_arp_rcv: link1_1/0 av 1 sv 1
sip 192.168.30.7 tip 192.168.30.10
[66905.166683] bond0: (slave link1_1): bond_arp_send_all: target
192.168.30.5
[66905.166703] bond0: (slave link1_1): arp 1 on slave: dst 192.168.30.5
src 192.168.30.10
[66905.166865] bond0: (slave link2_1): bond_rcv_validate: skb->dev link2_1
[66905.166878] bond0: (slave link1_1): bond_rcv_validate: skb->dev link1_1
[66905.166883] bond0: (slave link1_1): bond_arp_rcv: link1_1/0 av 1 sv 1
sip 192.168.30.5 tip 192.168.30.10
[66906.193475] bond0: (slave link1_1): bond_slave_netdev_event called
[66906.198304] bond0: (slave link1_1): bond_slave_netdev_event called
[66906.198312] bond0: (slave link1_1): link status definitely down,
disabling slave
[66906.198328] bond0: (slave link2_1): making interface the new active one
[66906.205523] bond0: (slave link2_1): bond_slave_netdev_event called
[66906.205557] bond0: (slave link2_1): bond_slave_netdev_event called
[66906.212198] bond0: (slave link2_1): bond_slave_netdev_event called
[66906.212258] bond0: (slave link1_1): bond_slave_netdev_event called
[66906.218529] bond0: (slave link1_1): bond_slave_netdev_event called
[66906.218572] bond0: bond_should_notify_peers: slave link2_1
[66906.223263] bond0: bond_netdev_event received NETDEV_BONDING_FAILOVER
>
>> bond_slave state BACKUP mii_status DOWN link_failure_count 466
>
> Perhaps the bridge within netns ns1 is disabling the port for
> some reason?
>
I do not see a specific issue on the bridge port:
# ip netns exec ns1 ip -d link show link1_2
10: link1_2@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
noqueue master br0 state UP mode DEFAULT group default qlen 1000
link/ether 6a:0d:5b:0c:e8:ab brd ff:ff:ff:ff:ff:ff link-netnsid 0
promiscuity 1 minmtu 68 maxmtu 65535
veth
bridge_slave state forwarding priority 32 cost 2 hairpin off guard
off root_block off fastleave off learning on flood on port_id 0x8001
port_no 0x1 designated_port 32769 designated_cost 0 designated_bridge
8000.6a:d:5b:c:e8:ab designated_root 8000.6a:d:5b:c:e8:ab hold_timer
0.00 message_age_timer 0.00 forward_delay_timer 0.00
topology_change_ack 0 config_pending 0 proxy_arp off proxy_arp_wifi off
mcast_router 1 mcast_fast_leave off mcast_flood on bcast_flood on
mcast_to_unicast off neigh_suppress off group_fwd_mask 0
group_fwd_mask_str 0x0 vlan_tunnel off isolated off locked off
addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536
gso_max_segs 65535 gro_max_size 65536
-Jon
* Re: Any reason why arp monitor keeps emitting netlink failover events?
From: Jay Vosburgh @ 2022-06-15 0:26 UTC (permalink / raw)
To: Jonathan Toppins; +Cc: netdev, Veaceslav Falico, Andy Gospodarek, Hangbin Liu
Jonathan Toppins <jtoppins@redhat.com> wrote:
>On 6/14/22 11:29, Jay Vosburgh wrote:
>> Anything showing up in the dmesg? There's only one place that
>> generates the FAILOVER notifier, and it ought to have a corresponding
>> message in the kernel log.
>> Also, I note that the link1_1 veth has a lot of failures:
>
>Yes all those failures are created by the setup, I waited about 5 minutes
>before dumping the link info. The failover occurs about every second. Note
>this is just a representation of a physical network so that others can run
>the setup. The script `bond-bz2094911.sh`, easily reproduces the issue
>which I dumped with cat below in the original email.
>
>Here is the kernel log, I have dynamic debug enabled for the entire
>bonding module:
I set up the test and added some additional instrumentation to
bond_ab_arp_inspect; what seems to be going on is that dev_trans_start
for link1_1 never updates. The "down to up"
transition for the ARP monitor only checks last_rx, but the "up to down"
check for the active interface requires both TX and RX recently
("recently" being within the past missed_max * arp_interval).
This looks to be due to HARD_TX_LOCK not actually locking for
NETIF_F_LLTX devices:
#define HARD_TX_LOCK(dev, txq, cpu) {			\
	if ((dev->features & NETIF_F_LLTX) == 0) {	\
		__netif_tx_lock(txq, cpu);		\
	} else {					\
		__netif_tx_acquire(txq);		\
	}						\
}
in combination with
static inline void txq_trans_update(struct netdev_queue *txq)
{
	if (txq->xmit_lock_owner != -1)
		WRITE_ONCE(txq->trans_start, jiffies);
}
causes the trans_start update to be skipped on veth devices.
And, sure enough, if I apply the following, the test case
appears to work:
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 466da01ba2e3..2cb833b3006a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -312,6 +312,7 @@ static bool veth_skb_is_eligible_for_gro(const struct net_device *dev,
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	struct netdev_queue *queue = NULL;
 	struct veth_rq *rq = NULL;
 	struct net_device *rcv;
 	int length = skb->len;
@@ -329,6 +330,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	rxq = skb_get_queue_mapping(skb);
 	if (rxq < rcv->real_num_rx_queues) {
 		rq = &rcv_priv->rq[rxq];
+		queue = netdev_get_tx_queue(dev, rxq);
 
 		/* The napi pointer is available when an XDP program is
 		 * attached or when GRO is enabled
@@ -340,6 +342,8 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	skb_tx_timestamp(skb);
 	if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
+		if (queue)
+			txq_trans_cond_update(queue);
 		if (!use_napi)
 			dev_lstats_add(dev, length);
 	} else {
I'm not entirely sure this is the best way to get the
trans_start updated in veth, but LLTX devices need to handle it
internally (and others do, e.g., tun).
Could you test the above and see if it resolves the problem in
your environment as well?
-J
---
-Jay Vosburgh, jay.vosburgh@canonical.com
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: Any reason why arp monitor keeps emitting netlink failover events?
2022-06-15 0:26 ` Jay Vosburgh
@ 2022-06-15 15:51 ` Jonathan Toppins
2022-06-16 18:52 ` Jay Vosburgh
0 siblings, 1 reply; 6+ messages in thread
From: Jonathan Toppins @ 2022-06-15 15:51 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: netdev, Veaceslav Falico, Andy Gospodarek, Hangbin Liu
On 6/14/22 20:26, Jay Vosburgh wrote:
> Jonathan Toppins <jtoppins@redhat.com> wrote:
>
>> On 6/14/22 11:29, Jay Vosburgh wrote:
>> [...]
>
> I set up the test, and added some additional instrumentation to
> bond_ab_arp_inspect, and what seems to be going on is that the
> dev_trans_start for link1_1 is never updating. The "down to up"
> transition for the ARP monitor only checks last_rx, but the "up to down"
> check for the active interface requires both TX and RX recently
> ("recently" being within the past missed_max * arp_interval).
>
> [...]
>
> I'm not entirely sure this is the best way to get the
> trans_start updated in veth, but LLTX devices need to handle it
> internally (and others do, e.g., tun).
>
> Could you test the above and see if it resolves the problem in
> your environment as well?
>
> -J
>
> ---
> -Jay Vosburgh, jay.vosburgh@canonical.com
>
Hi Jay,
This patch appears to work; you can apply my Tested-by.
Tested-by: Jonathan Toppins <jtoppins@redhat.com>
Now this exposes an easily reproducible bonding issue with
bond_should_notify_peers(): every second the bond issues a
NOTIFY_PEERS event. This notify-peers issue has also been observed with
physical hardware (tg3, i40e, igb) drivers. I have not traced the code
yet, but wanted to point this out. Run the same reproducer script and
start monitoring the bond:
[root@fedora ~]# ip -ts -o monitor link dev bond0
[2022-06-15T11:30:44.337568] 9: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event NOTIFY PEERS \ link/ether ce:d3:22:ef:13:d0 brd
ff:ff:ff:ff:ff:ff
[2022-06-15T11:30:45.361381] 9: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event NOTIFY PEERS \ link/ether ce:d3:22:ef:13:d0 brd
ff:ff:ff:ff:ff:ff
[.. trimmed ..]
[2022-06-15T11:30:56.618621] [2022-06-15T11:30:56.622657] 9: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event NOTIFY PEERS \ link/ether ce:d3:22:ef:13:d0 brd
ff:ff:ff:ff:ff:ff
[2022-06-15T11:30:57.647644] 9: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event NOTIFY PEERS \ link/ether ce:d3:22:ef:13:d0 brd
ff:ff:ff:ff:ff:ff
In another shell take down the active interface:
# ip link set link1_1 down
we get the failover below, as expected.
[2022-06-15T11:30:58.671501] [2022-06-15T11:30:58.671576]
[2022-06-15T11:30:58.671611] [2022-06-15T11:30:58.671643]
[2022-06-15T11:30:58.671676] [2022-06-15T11:30:58.671709] 9: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event BONDING FAILOVER \ link/ether ce:d3:22:ef:13:d0
brd ff:ff:ff:ff:ff:ff
[2022-06-15T11:30:58.671782] 9: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event NOTIFY PEERS \ link/ether ce:d3:22:ef:13:d0 brd
ff:ff:ff:ff:ff:ff
[2022-06-15T11:30:58.676862] 9: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event NOTIFY PEERS \ link/ether ce:d3:22:ef:13:d0 brd
ff:ff:ff:ff:ff:ff
[2022-06-15T11:30:58.676948] 9: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event RESEND IGMP \ link/ether ce:d3:22:ef:13:d0 brd
ff:ff:ff:ff:ff:ff
Now bring link1_1 back up and notice there is no longer a NOTIFY_PEERS
event every second. The issue actually stops with the first failover; I
brought the primary back up just for completeness.
# ip link set link1_1 up
[2022-06-15T11:31:01.629256] [2022-06-15T11:31:01.630275]
[2022-06-15T11:31:01.742927] [2022-06-15T11:31:01.742991]
[2022-06-15T11:31:01.743021] [2022-06-15T11:31:01.743045]
[2022-06-15T11:31:01.743070] [2022-06-15T11:31:01.743094] 9: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event BONDING FAILOVER \ link/ether ce:d3:22:ef:13:d0
brd ff:ff:ff:ff:ff:ff
[2022-06-15T11:31:01.743151] 9: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event NOTIFY PEERS \ link/ether ce:d3:22:ef:13:d0 brd
ff:ff:ff:ff:ff:ff
[2022-06-15T11:31:01.746412] 9: bond0:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
group default event RESEND IGMP \ link/ether ce:d3:22:ef:13:d0 brd
ff:ff:ff:ff:ff:ff
-Jon
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Any reason why arp monitor keeps emitting netlink failover events?
2022-06-15 15:51 ` Jonathan Toppins
@ 2022-06-16 18:52 ` Jay Vosburgh
0 siblings, 0 replies; 6+ messages in thread
From: Jay Vosburgh @ 2022-06-16 18:52 UTC (permalink / raw)
To: Jonathan Toppins; +Cc: netdev, Veaceslav Falico, Andy Gospodarek, Hangbin Liu
Jonathan Toppins <jtoppins@redhat.com> wrote:
[...]
>Now this exposes an easily reproducible bonding issue with
>bond_should_notify_peers() which is every second the bond issues a
>NOTIFY_PEERS event. This notify peers event issue has been observed on
>physical hardware (tg3, i40e, igb) drivers. I have not traced the code
>yet, wanted to point this out. Run the same reproducer script and start
>monitoring the bond;
>
>[root@fedora ~]# ip -ts -o monitor link dev bond0
>[2022-06-15T11:30:44.337568] 9: bond0:
><BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
>group default event NOTIFY PEERS \ link/ether ce:d3:22:ef:13:d0 brd
>ff:ff:ff:ff:ff:ff
>[2022-06-15T11:30:45.361381] 9: bond0:
><BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
>group default event NOTIFY PEERS \ link/ether ce:d3:22:ef:13:d0 brd
>ff:ff:ff:ff:ff:ff
[...]
This one is pretty straightforward: the ARP monitor logic never
decrements the counter for the number of notifications to send (but the
change-active logic decrements it once, so a failover makes the events
stop if the number of notifications to send is one). I'll submit a
patch in a bit.
-J
---
-Jay Vosburgh, jay.vosburgh@canonical.com
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2022-06-16 18:53 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-14 2:59 Any reason why arp monitor keeps emitting netlink failover events? Jonathan Toppins
2022-06-14 15:29 ` Jay Vosburgh
2022-06-14 17:07 ` Jonathan Toppins
2022-06-15 0:26 ` Jay Vosburgh
2022-06-15 15:51 ` Jonathan Toppins
2022-06-16 18:52 ` Jay Vosburgh