intel-wired-lan.lists.osuosl.org archive mirror
* [Intel-wired-lan] Non-functional ixgbe driver between Intel X553 chipset and Cisco switch via kernel >=6.1 under Debian
@ 2024-02-15 11:02 kernel.org-fo5k2w
  2024-05-03 13:01 ` kernel.org-fo5k2w
  2024-05-03 18:37 ` Jacob Keller
  0 siblings, 2 replies; 9+ messages in thread
From: kernel.org-fo5k2w @ 2024-02-15 11:02 UTC (permalink / raw)
  To: jesse.brandeburg, anthony.l.nguyen; +Cc: netdev, intel-wired-lan

Hello,

(Please note that I don't speak English; sorry if the translation is not faithful to your language.)

Following Bjorn Helgaas's advice (https://bugzilla.kernel.org/show_bug.cgi?id=218050#c14), I'm coming to you in the hope of finding a solution to a problem encountered by several users of the ixgbe driver. The subject has been discussed in the messages and comments on the following pages:
https://marc.info/?l=linux-netdev&m=170118007007901&w=2
https://forum.proxmox.com/threads/intel-x553-sfp-ixgbe-no-go-on-pve8.135129/
https://www.servethehome.com/the-everything-fanless-home-server-firewall-router-and-nas-appliance-qotom-qnap-teamgroup/
https://www.servethehome.com/intel-x553-networking-and-proxmox-ve-8-1-3/?unapproved=518173&moderation-hash=e57a05288058d3ff253ceb42e9ada905
https://forum.proxmox.com/threads/proxmox-8-kernel-6-2-16-4-pve-ixgbe-driver-fails-to-load-due-to-pci-device-probing-failure.131203/
https://bugzilla.kernel.org/show_bug.cgi?id=218491
https://bugzilla.kernel.org/show_bug.cgi?id=218050

Having decided to purchase a Qotom Q20332G9-S10 machine with the X553 chipset for testing purposes, I can confirm the connection problem between the PC's X553 SFP+ port and a Cisco switch's SFP+ port. In my case, it happens under GNU/Linux Debian 12 (kernel 6.1.76) and Sid (kernel 6.6.13), so it's not specific to Proxmox.
I should point out that under GNU/Linux Debian 11 - kernel 5.10, the network card (X553 via ixgbe) works without problems. So this is a relatively "recent" bug.

Here's my test environment:
- 1 Qotom Q20332G9-S10 (I used a 16GB Intel Optane M10 M.2 SSD with a fresh GNU/Linux Debian 12)
- 1 Cisco DAC cable (tested with a 1M and a 3M)
- 1 PC with Mellanox Connectx-3 2x SFP+ network card (running GNU/Linux Debian SID installed several years ago)
- 1 Cisco 3560CX-12PD-S switch (2 SFP+ ports) with IOS 15.2(7)E2

Connecting the Qotom Q20332G9-S10 (X553) to the Mellanox Connectx-3 works without a hitch and without any special handling (the linux-image-6.1.0-17-amd64 ixgbe driver works in this configuration). An "iperf" run between the two reaches the full 10 Gbps.

At this stage, I've ruled out a hardware incompatibility (OSI layer 1), since the DAC works with the X553. So there's no need for compatibility tricks such as the "allow_unsupported_sfp=1" parameter suggested in the comments on the linked pages. It makes no difference in the following tests (I've checked).

Where it gets tricky is when you connect it (the Qotom) to the Cisco switch.
Before an "ip link set eno1 up", the Cisco raises the link on its side, but Debian doesn't (link DOWN). After the "ip link set eno1 up", the link drops and never comes back. There does seem to be a driver problem in recent kernels (GNU/Linux Debian Stable and Sid).

After compiling the driver manually (https://downloadmirror.intel.com/812532/ixgbe-5.19.9.tar.gz) following the documentation already shared by others (https://www.xmodulo.com/download-install-ixgbe-driver-ubuntu-debian.html), it works with the Cisco (after a "shut/no shut" of the latter's 10gbe port).

So we end up with a working machine (I even configured and used the SR-IOV successfully right afterwards).

PS: I also tested with Debian Sid

Finally, I also tried the commands you gave Skyler, without any result (rmmod ixgbe; modprobe ixgbe; ethtool -S eno1 | grep fault).

For the moment, the Qotom machine is dedicated to testing, so I'm available to carry out any tests you may wish to run to move this forward.
Can we work on diagnosing this problem so that the next stable release of Debian is fully functional with this Intel network card?

Best regards.

⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Yohan Charbi
⢿⡄⠘⠷⠚⠋⠀ Cordialement
⠈⠳⣄⠀


* Re: [Intel-wired-lan] Non-functional ixgbe driver between Intel X553 chipset and Cisco switch via kernel >=6.1 under Debian
  2024-02-15 11:02 [Intel-wired-lan] Non-functional ixgbe driver between Intel X553 chipset and Cisco switch via kernel >=6.1 under Debian kernel.org-fo5k2w
@ 2024-05-03 13:01 ` kernel.org-fo5k2w
  2024-05-03 18:37 ` Jacob Keller
  1 sibling, 0 replies; 9+ messages in thread
From: kernel.org-fo5k2w @ 2024-05-03 13:01 UTC (permalink / raw)
  To: kernel.org-fo5k2w
  Cc: intel-wired-lan, debian-kernel, anthony.l.nguyen, netdev

Hello,

I have not yet received a reply from you. This problem is blocking many 
users and is a major handicap when using Intel network cards with X553 
chips.

I'm still available to carry out any tests you may consider useful for
resolving this ixgbe driver bug in Linux kernels > 6.0.

As a reminder, here is the link to my original message:
https://lore.kernel.org/all/8267673cce94022974bcf35b2bf0f6545105d03e@ycharbi.fr/

Best regards.


* Re: [Intel-wired-lan] Non-functional ixgbe driver between Intel X553 chipset and Cisco switch via kernel >=6.1 under Debian
  2024-02-15 11:02 [Intel-wired-lan] Non-functional ixgbe driver between Intel X553 chipset and Cisco switch via kernel >=6.1 under Debian kernel.org-fo5k2w
  2024-05-03 13:01 ` kernel.org-fo5k2w
@ 2024-05-03 18:37 ` Jacob Keller
  2024-05-04 13:29   ` kernel.org-fo5k2w
  1 sibling, 1 reply; 9+ messages in thread
From: Jacob Keller @ 2024-05-03 18:37 UTC (permalink / raw)
  To: kernel.org-fo5k2w, jesse.brandeburg, anthony.l.nguyen
  Cc: netdev, intel-wired-lan



On 2/15/2024 3:02 AM, kernel.org-fo5k2w@ycharbi.fr wrote:
> Hello,
> 
> (Please note that I don't speak English, sorry if the traction is not faithful to your language)
> 

Hi,

I haven't touched the ixgbe driver and hardware in many years, but I'll
try to see what I can do to help.

> Following Bjorn Helgaas's advice (https://bugzilla.kernel.org/show_bug.cgi?id=218050#c14), I'm coming to you in the hope of finding a solution to a problem encountered by several users of the ixgbe driver. The subject has been discussed in the messages and comments on the following pages:
> https://marc.info/?l=linux-netdev&m=170118007007901&w=2
> https://forum.proxmox.com/threads/intel-x553-sfp-ixgbe-no-go-on-pve8.135129/
> https://www.servethehome.com/the-everything-fanless-home-server-firewall-router-and-nas-appliance-qotom-qnap-teamgroup/
> https://www.servethehome.com/intel-x553-networking-and-proxmox-ve-8-1-3/?unapproved=518173&moderation-hash=e57a05288058d3ff253ceb42e9ada905
> https://forum.proxmox.com/threads/proxmox-8-kernel-6-2-16-4-pve-ixgbe-driver-fails-to-load-due-to-pci-device-probing-failure.131203/
> https://bugzilla.kernel.org/show_bug.cgi?id=218491
> https://bugzilla.kernel.org/show_bug.cgi?id=218050
> 
> Having myself decided to purchase a Qotom Q20332G9-S10 machine with X553 chipset for testing purposes, I can see the effectiveness of the connection problem between the PC's X553 SFP+ and a Cisco switch SFP+. For my part, this happens under GNU/Linux Debian 12 - kernel 6.1.76 and Sid - kernel 6.6.13. So it's not specific to Proxmox.
> I should point out that under GNU/Linux Debian 11 - kernel 5.10, the network card (X553 via ixgbe) works without problems. So this is a relatively "recent" bug.
> 
> Here's my test environment:
> - 1 Qotom Q20332G9-S10 (I used a 16GB Intel Optane M10 M.2 SSD with a fresh GNU/Linux Debian 12)
> - 1 Cisco DAC cable (tested with a 1M and a 3M)
> - 1 PC with Mellanox Connectx-3 2x SFP+ network card (running GNU/Linux Debian SID installed several years ago)
> - 1 Cisco 3560CX-12PD-S switch (2 SFP+ ports) with IOS 15.2(7)E2
> 
> Connecting the Qotom Q20332G9-S10 (X553) to the Mellanox Connectx-3 works without a hitch and without any special handling (the linux-image-6.1.0-17-amd64 ixgbe driver works in this configuration). Full 10gbps speeds between the two with an "iperf".
> 

So everything works when connected back to back with the Connectx-3. Ok.

> At this stage, I've ruled out a hardware incompatibility (OSI level 1) since the DAC works with the X553. So there's no need to use compatibility tricks as suggested in the link comments with the "allow_unsupported_sfp=1" parameter. This will be useless in the following tests (I've checked).
> 

To confirm, you use the same cable in both cases?

> Where it gets tricky is when you connect it (the Qotom) to the Cisco switch.
> Before an "ip link eno1 up", the Cisco raises the link on its side, but the Debian doesn't (link DOWN). After the "ip link eno1 up", the link drops and never comes back. There does seem to be a driver problem in recent kernels (GNU/Linux Debian Stable and Sid).
> 

But on the switch, the link is reported up until we bring the interface
up in ixgbe, and then link drops and stays down indefinitely?

> After compiling the driver manually (https://downloadmirror.intel.com/812532/ixgbe-5.19.9.tar.gz) following the documentation already shared by others (https://www.xmodulo.com/download-install-ixgbe-driver-ubuntu-debian.html), it works with the Cisco (after a "shut/no shut" of the latter's 10gbe port).
> 
> So we end up with a working machine (I even configured and used the SR-IOV successfully right afterwards).
> 

But if you use the out-of-tree ixgbe driver everything works. Hmm.

> PS: I also tested with Debian Sid
> 
> I've finally tried the commands you were giving Skyler without any result (rmmod ixgbe; modprobe ixgbe; ethtool -S eno1 | grep fault).
> 
> For the moment, the Qotom machine is dedicated to testing, so I'm available to carry out any manipulations you may wish to make to advance the subject.
> Can we work on diagnosing this problem so that the next stable release of Debian is fully functional with this Intel network card?
> 
> Best regards.

I tried checking the out-of-tree versions to see if there were any
obvious fixes. I didn't find anything. The code between the in-kernel
and out-of-tree is so different that it is hard to track down. At first
I wondered if this might be a regression due to recent changes to
support new hardware, but it appears that v6.1 is from before a lot of
that work went in.

It may be helpful if you could provide some more information from the
system in the Cisco switch case:

1. The kernel message logs from when you bring up the interface. You can
get this from dmesg or journalctl -k if you have systemd.

2. "ethtool eno1" after you bring the interface up to see what it
reports about link

3. "ethtool -S eno1" to see if any other stats are reported that might
help us isolate what's going on.
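
The three items above could be gathered in one go with something like the sketch below. It is only an illustration: the function name collect_link_diag and the output file names are made up for this example, and it assumes dmesg and ethtool are installed and that it is run as root after "ip link set eno1 up".

```shell
#!/bin/sh
# Collect the three requested pieces of information into one directory.
# Function name and file layout are illustrative, not an established tool.
collect_link_diag() {
    iface="$1"
    outdir="$2"
    mkdir -p "$outdir" || return 1

    # 1. Kernel messages (journalctl -k is an alternative on systemd hosts).
    dmesg > "$outdir/dmesg.txt" 2>&1 || true

    # 2. Link settings as reported by the driver.
    ethtool "$iface" > "$outdir/ethtool.txt" 2>&1 || true

    # 3. Driver and hardware counters.
    ethtool -S "$iface" > "$outdir/ethtool-stats.txt" 2>&1 || true

    # Print where the files ended up.
    echo "$outdir"
}

# Example:
#   collect_link_diag eno1 /tmp/ixgbe-diag
```

Errors from a missing tool or insufficient privileges end up in the corresponding file rather than aborting the run, so a partial collection is still usable.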


Do you happen to know if any particular in-kernel driver version worked?
It would help limit the search for regressing commits. Ideally, if you
could use git bisect on the setup that could efficiently locate what
regressed the behavior.

Regards,
Jake

> 
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢠⠒⠀⣿⡁ Yohan Charbi
> ⢿⡄⠘⠷⠚⠋⠀ Cordialement
> ⠈⠳⣄⠀
> 


* Re: [Intel-wired-lan] Non-functional ixgbe driver between Intel X553 chipset and Cisco switch via kernel >=6.1 under Debian
  2024-05-03 18:37 ` Jacob Keller
@ 2024-05-04 13:29   ` kernel.org-fo5k2w
  2024-05-06 21:18     ` Jacob Keller
  2024-05-09 13:26     ` kernel.org-fo5k2w
  0 siblings, 2 replies; 9+ messages in thread
From: kernel.org-fo5k2w @ 2024-05-04 13:29 UTC (permalink / raw)
  To: jacob.e.keller
  Cc: intel-wired-lan, kernel.org-fo5k2w, anthony.l.nguyen, netdev

Hi,

 > I haven't touched the ixgbe driver and hardware in many years, but I'll
 > try to see what I can do to help.

Thank you very much for your reply. I'll answer you point by point.
I upgraded the Qotom to Debian 13 (testing) with kernel 6.6.15 (amd64)
to be even more up to date.
A quick test with Fedora 40 shows the same problem.


 > So everything works when connected back to back with the Connectx-3. Ok.

Yes, exactly. Everything works as expected with the Connectx-3.


 > To confirm, you use the same cable in both cases?

Yes, the same cable. I tested two different models:
- 1 Cisco SFP-H10GB-CU1M (1 meter)
- 1 Cisco SFP-H10GB-CU3M (3 meters)

For convenience, I'm using only the SFP-H10GB-CU3M for the rest of the tests.


 > But on the switch, the link is reported up until we bring the interface
 > up in ixgbe, and then link drops and stays down indefinitely?

After initial start-up of the Qotom:
# Port 10Gbe LEDs are green (please note that the MAC address OUI,
# 20:7c:14, is registered to Qotom, not Intel).
ip link show dev eno1
7: eno1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
     link/ether 20:7c:14:xx:xx:xx brd ff:ff:ff:ff:ff:ff
     altname enp11s0f0

# Cisco (green LEDs, port up)
show running-config | section interface TenGigabitEthernet1/0/1
interface TenGigabitEthernet1/0/1
  no cdp enable

show interface status | include Te1/0/1
Te1/0/1   --- Vers Qotom --- connected    trunk        full    10G SFP-10GBase-CX1

show ip interface brief | include Te1/0/1 | Status
Interface              IP-Address      OK? Method Status                Protocol
Te1/0/1                unassigned      YES unset up                    up

The Cisco and Qotom ports are lit and flashing as if they were 
exchanging ARP or STP traffic. A mirror port on the Cisco's 10Gbe 
interface, however, shows no frame exchange. I connected a PC to port 
g1/0/13 with Wireshark for this test.

monitor session 1 source interface t1/0/1 both
monitor session 1 destination interface g1/0/13

Port bring-up test:
# Starting up the Qotom 10Gbe network interface
ip link set eno1 up
[ 1770.476075] pps pps5: new PPS source ptp5
[ 1770.480784] ixgbe 0000:0b:00.0: registered PHC device on eno1
[ 1770.575496] ixgbe 0000:0b:00.0 eno1: detected SFP+: 3

# The ports on both devices switch off immediately.
# There's no going back:
ip link set eno1 down
[ 1831.329797] ixgbe 0000:0b:00.0: removed PHC on eno1

# The ports stay off on both sides, even after unloading the ixgbe kernel
# module and unplugging/replugging the Cisco SFP-H10GB-CU3M:
rmmod ixgbe
[ 1872.503663] ixgbe 0000:0d:00.1: complete
[ 1872.547628] ixgbe 0000:0d:00.0: complete
[ 1872.591645] ixgbe 0000:0b:00.1: complete
[ 1872.631725] ixgbe 0000:0b:00.0: complete

Only a reboot brings the ports back to their initial lit state.
On startup, the Cisco switch displays the following logs (the date is 
not configured):
Sep 30 14:33:00: %LINK-3-UPDOWN: Interface TenGigabitEthernet1/0/1, changed state to up
Sep 30 14:33:01: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/0/1, changed state to up


 > But if you use the out-of-tree ixgbe driver everything works. Hmm.

Yes, that's exactly it. The driver on the Intel site works perfectly.

 > I tried checking the out-of-tree versions to see if there were any
 > obvious fixes. I didn't find anything. The code between the in-kernel
 > and out-of-tree is so different that it is hard to track down. At first
 > I wondered if this might be a regression due to recent changes to
 > support new hardware, but it appears that v6.1 is from before a lot of
 > that work went in.

If it helps, vesalius' post of December 3, 2023 on one of the links in 
my original post 
(https://forum.proxmox.com/threads/intel-x553-sfp-ixgbe-no-go-on-pve8.135129/post-612291) 
reports that the following commit has been suspected as the culprit: 
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.1.63&id=565736048bd5f9888990569993c6b6bfdf6dcb6d

I quote the end of his message:
"An amazon employee states reverting this commit and recompiling the 
kernel allows their similar network hardware to use the current in-tree 
6.1 ixgbe driver. Otherwise as stated in the VyOS forum thread linked 
above compiling the linux kernel with the out-of-tree intel ixgbe driver 
5.19.6 works too."


 > 1. The kernel message logs from when you bring up the interface. You can
 > get this from dmesg or journalctl -k if you have systemd.

The kernel logs only the following three lines after an "ip link set
eno1 up":
mai 04 12:01:21 servyo kernel: pps pps5: new PPS source ptp5
mai 04 12:01:21 servyo kernel: ixgbe 0000:0b:00.0: registered PHC device on eno1
mai 04 12:01:21 servyo kernel: ixgbe 0000:0b:00.0 eno1: detected SFP+: 3

 > 2. "ethtool eno1" after you bring the interface up to see what it
 > reports about link

ethtool eno1
Settings for eno1:
     Supported ports: [ FIBRE ]
     Supported link modes:   10000baseT/Full
     Supported pause frame use: Symmetric
     Supports auto-negotiation: No
     Supported FEC modes: Not reported
     Advertised link modes:  10000baseT/Full
     Advertised pause frame use: Symmetric
     Advertised auto-negotiation: No
     Advertised FEC modes: Not reported
     Speed: Unknown!
     Duplex: Unknown! (255)
     Auto-negotiation: off
     Port: Direct Attach Copper
     PHYAD: 0
     Transceiver: internal
     Supports Wake-on: d
     Wake-on: d
         Current message level: 0x00000007 (7)
                                drv probe link
     Link detected: no


 > 3. "ethtool -S eno1" to see if any other stats are reported that might
 > help us isolate what's going on.

ethtool -S eno1
NIC statistics:
      rx_packets: 0
      tx_packets: 0
      rx_bytes: 0
      tx_bytes: 0
      rx_pkts_nic: 0
      tx_pkts_nic: 0
      rx_bytes_nic: 0
      tx_bytes_nic: 0
      lsc_int: 1
      tx_busy: 0
      non_eop_descs: 0
      rx_errors: 0
      tx_errors: 0
      rx_dropped: 0
      tx_dropped: 0
      multicast: 0
      broadcast: 0
      rx_no_buffer_count: 0
      collisions: 0
      rx_over_errors: 0
      rx_crc_errors: 0
      rx_frame_errors: 0
      hw_rsc_aggregated: 0
      hw_rsc_flushed: 0
      fdir_match: 0
      fdir_miss: 0
      fdir_overflow: 0
      rx_fifo_errors: 0
      rx_missed_errors: 0
      tx_aborted_errors: 0
      tx_carrier_errors: 0
      tx_fifo_errors: 0
      tx_heartbeat_errors: 0
      tx_timeout_count: 0
      tx_restart_queue: 0
      rx_length_errors: 0
      rx_long_length_errors: 0
      rx_short_length_errors: 0
      tx_flow_control_xon: 0
      rx_flow_control_xon: 0
      tx_flow_control_xoff: 0
      rx_flow_control_xoff: 0
      rx_csum_offload_errors: 0
      alloc_rx_page: 4088
      alloc_rx_page_failed: 0
      alloc_rx_buff_failed: 0
      rx_no_dma_resources: 0
      os2bmc_rx_by_bmc: 0
      os2bmc_tx_by_bmc: 0
      os2bmc_tx_by_host: 0
      os2bmc_rx_by_host: 0
      tx_hwtstamp_timeouts: 0
      tx_hwtstamp_skipped: 0
      rx_hwtstamp_cleared: 0
      tx_ipsec: 0
      rx_ipsec: 0
      fcoe_bad_fccrc: 0
      rx_fcoe_dropped: 0
      rx_fcoe_packets: 0
      rx_fcoe_dwords: 0
      fcoe_noddp: 0
      fcoe_noddp_ext_buff: 0
      tx_fcoe_packets: 0
      tx_fcoe_dwords: 0
      tx_queue_0_packets: 0
      tx_queue_0_bytes: 0
      tx_queue_1_packets: 0
      tx_queue_1_bytes: 0
      tx_queue_2_packets: 0
      tx_queue_2_bytes: 0
      tx_queue_3_packets: 0
      tx_queue_3_bytes: 0
      tx_queue_4_packets: 0
      tx_queue_4_bytes: 0
      tx_queue_5_packets: 0
      tx_queue_5_bytes: 0
      tx_queue_6_packets: 0
      tx_queue_6_bytes: 0
      tx_queue_7_packets: 0
      tx_queue_7_bytes: 0
      tx_queue_8_packets: 0
      tx_queue_8_bytes: 0
      tx_queue_9_packets: 0
      tx_queue_9_bytes: 0
      tx_queue_10_packets: 0
      tx_queue_10_bytes: 0
      tx_queue_11_packets: 0
      tx_queue_11_bytes: 0
      tx_queue_12_packets: 0
      tx_queue_12_bytes: 0
      tx_queue_13_packets: 0
      tx_queue_13_bytes: 0
      tx_queue_14_packets: 0
      tx_queue_14_bytes: 0
      tx_queue_15_packets: 0
      tx_queue_15_bytes: 0
      tx_queue_16_packets: 0
      tx_queue_16_bytes: 0
      tx_queue_17_packets: 0
      tx_queue_17_bytes: 0
      tx_queue_18_packets: 0
      tx_queue_18_bytes: 0
      tx_queue_19_packets: 0
      tx_queue_19_bytes: 0
      tx_queue_20_packets: 0
      tx_queue_20_bytes: 0
      tx_queue_21_packets: 0
      tx_queue_21_bytes: 0
      tx_queue_22_packets: 0
      tx_queue_22_bytes: 0
      tx_queue_23_packets: 0
      tx_queue_23_bytes: 0
      tx_queue_24_packets: 0
      tx_queue_24_bytes: 0
      tx_queue_25_packets: 0
      tx_queue_25_bytes: 0
      tx_queue_26_packets: 0
      tx_queue_26_bytes: 0
      tx_queue_27_packets: 0
      tx_queue_27_bytes: 0
      tx_queue_28_packets: 0
      tx_queue_28_bytes: 0
      tx_queue_29_packets: 0
      tx_queue_29_bytes: 0
      tx_queue_30_packets: 0
      tx_queue_30_bytes: 0
      tx_queue_31_packets: 0
      tx_queue_31_bytes: 0
      tx_queue_32_packets: 0
      tx_queue_32_bytes: 0
      tx_queue_33_packets: 0
      tx_queue_33_bytes: 0
      tx_queue_34_packets: 0
      tx_queue_34_bytes: 0
      tx_queue_35_packets: 0
      tx_queue_35_bytes: 0
      tx_queue_36_packets: 0
      tx_queue_36_bytes: 0
      tx_queue_37_packets: 0
      tx_queue_37_bytes: 0
      tx_queue_38_packets: 0
      tx_queue_38_bytes: 0
      tx_queue_39_packets: 0
      tx_queue_39_bytes: 0
      tx_queue_40_packets: 0
      tx_queue_40_bytes: 0
      tx_queue_41_packets: 0
      tx_queue_41_bytes: 0
      tx_queue_42_packets: 0
      tx_queue_42_bytes: 0
      tx_queue_43_packets: 0
      tx_queue_43_bytes: 0
      tx_queue_44_packets: 0
      tx_queue_44_bytes: 0
      tx_queue_45_packets: 0
      tx_queue_45_bytes: 0
      tx_queue_46_packets: 0
      tx_queue_46_bytes: 0
      tx_queue_47_packets: 0
      tx_queue_47_bytes: 0
      tx_queue_48_packets: 0
      tx_queue_48_bytes: 0
      tx_queue_49_packets: 0
      tx_queue_49_bytes: 0
      tx_queue_50_packets: 0
      tx_queue_50_bytes: 0
      tx_queue_51_packets: 0
      tx_queue_51_bytes: 0
      tx_queue_52_packets: 0
      tx_queue_52_bytes: 0
      tx_queue_53_packets: 0
      tx_queue_53_bytes: 0
      tx_queue_54_packets: 0
      tx_queue_54_bytes: 0
      tx_queue_55_packets: 0
      tx_queue_55_bytes: 0
      tx_queue_56_packets: 0
      tx_queue_56_bytes: 0
      tx_queue_57_packets: 0
      tx_queue_57_bytes: 0
      tx_queue_58_packets: 0
      tx_queue_58_bytes: 0
      tx_queue_59_packets: 0
      tx_queue_59_bytes: 0
      tx_queue_60_packets: 0
      tx_queue_60_bytes: 0
      tx_queue_61_packets: 0
      tx_queue_61_bytes: 0
      tx_queue_62_packets: 0
      tx_queue_62_bytes: 0
      tx_queue_63_packets: 0
      tx_queue_63_bytes: 0
      rx_queue_0_packets: 0
      rx_queue_0_bytes: 0
      rx_queue_1_packets: 0
      rx_queue_1_bytes: 0
      rx_queue_2_packets: 0
      rx_queue_2_bytes: 0
      rx_queue_3_packets: 0
      rx_queue_3_bytes: 0
      rx_queue_4_packets: 0
      rx_queue_4_bytes: 0
      rx_queue_5_packets: 0
      rx_queue_5_bytes: 0
      rx_queue_6_packets: 0
      rx_queue_6_bytes: 0
      rx_queue_7_packets: 0
      rx_queue_7_bytes: 0
      rx_queue_8_packets: 0
      rx_queue_8_bytes: 0
      rx_queue_9_packets: 0
      rx_queue_9_bytes: 0
      rx_queue_10_packets: 0
      rx_queue_10_bytes: 0
      rx_queue_11_packets: 0
      rx_queue_11_bytes: 0
      rx_queue_12_packets: 0
      rx_queue_12_bytes: 0
      rx_queue_13_packets: 0
      rx_queue_13_bytes: 0
      rx_queue_14_packets: 0
      rx_queue_14_bytes: 0
      rx_queue_15_packets: 0
      rx_queue_15_bytes: 0
      rx_queue_16_packets: 0
      rx_queue_16_bytes: 0
      rx_queue_17_packets: 0
      rx_queue_17_bytes: 0
      rx_queue_18_packets: 0
      rx_queue_18_bytes: 0
      rx_queue_19_packets: 0
      rx_queue_19_bytes: 0
      rx_queue_20_packets: 0
      rx_queue_20_bytes: 0
      rx_queue_21_packets: 0
      rx_queue_21_bytes: 0
      rx_queue_22_packets: 0
      rx_queue_22_bytes: 0
      rx_queue_23_packets: 0
      rx_queue_23_bytes: 0
      rx_queue_24_packets: 0
      rx_queue_24_bytes: 0
      rx_queue_25_packets: 0
      rx_queue_25_bytes: 0
      rx_queue_26_packets: 0
      rx_queue_26_bytes: 0
      rx_queue_27_packets: 0
      rx_queue_27_bytes: 0
      rx_queue_28_packets: 0
      rx_queue_28_bytes: 0
      rx_queue_29_packets: 0
      rx_queue_29_bytes: 0
      rx_queue_30_packets: 0
      rx_queue_30_bytes: 0
      rx_queue_31_packets: 0
      rx_queue_31_bytes: 0
      rx_queue_32_packets: 0
      rx_queue_32_bytes: 0
      rx_queue_33_packets: 0
      rx_queue_33_bytes: 0
      rx_queue_34_packets: 0
      rx_queue_34_bytes: 0
      rx_queue_35_packets: 0
      rx_queue_35_bytes: 0
      rx_queue_36_packets: 0
      rx_queue_36_bytes: 0
      rx_queue_37_packets: 0
      rx_queue_37_bytes: 0
      rx_queue_38_packets: 0
      rx_queue_38_bytes: 0
      rx_queue_39_packets: 0
      rx_queue_39_bytes: 0
      rx_queue_40_packets: 0
      rx_queue_40_bytes: 0
      rx_queue_41_packets: 0
      rx_queue_41_bytes: 0
      rx_queue_42_packets: 0
      rx_queue_42_bytes: 0
      rx_queue_43_packets: 0
      rx_queue_43_bytes: 0
      rx_queue_44_packets: 0
      rx_queue_44_bytes: 0
      rx_queue_45_packets: 0
      rx_queue_45_bytes: 0
      rx_queue_46_packets: 0
      rx_queue_46_bytes: 0
      rx_queue_47_packets: 0
      rx_queue_47_bytes: 0
      rx_queue_48_packets: 0
      rx_queue_48_bytes: 0
      rx_queue_49_packets: 0
      rx_queue_49_bytes: 0
      rx_queue_50_packets: 0
      rx_queue_50_bytes: 0
      rx_queue_51_packets: 0
      rx_queue_51_bytes: 0
      rx_queue_52_packets: 0
      rx_queue_52_bytes: 0
      rx_queue_53_packets: 0
      rx_queue_53_bytes: 0
      rx_queue_54_packets: 0
      rx_queue_54_bytes: 0
      rx_queue_55_packets: 0
      rx_queue_55_bytes: 0
      rx_queue_56_packets: 0
      rx_queue_56_bytes: 0
      rx_queue_57_packets: 0
      rx_queue_57_bytes: 0
      rx_queue_58_packets: 0
      rx_queue_58_bytes: 0
      rx_queue_59_packets: 0
      rx_queue_59_bytes: 0
      rx_queue_60_packets: 0
      rx_queue_60_bytes: 0
      rx_queue_61_packets: 0
      rx_queue_61_bytes: 0
      rx_queue_62_packets: 0
      rx_queue_62_bytes: 0
      rx_queue_63_packets: 0
      rx_queue_63_bytes: 0
      tx_pb_0_pxon: 0
      tx_pb_0_pxoff: 0
      tx_pb_1_pxon: 0
      tx_pb_1_pxoff: 0
      tx_pb_2_pxon: 0
      tx_pb_2_pxoff: 0
      tx_pb_3_pxon: 0
      tx_pb_3_pxoff: 0
      tx_pb_4_pxon: 0
      tx_pb_4_pxoff: 0
      tx_pb_5_pxon: 0
      tx_pb_5_pxoff: 0
      tx_pb_6_pxon: 0
      tx_pb_6_pxoff: 0
      tx_pb_7_pxon: 0
      tx_pb_7_pxoff: 0
      rx_pb_0_pxon: 0
      rx_pb_0_pxoff: 0
      rx_pb_1_pxon: 0
      rx_pb_1_pxoff: 0
      rx_pb_2_pxon: 0
      rx_pb_2_pxoff: 0
      rx_pb_3_pxon: 0
      rx_pb_3_pxoff: 0
      rx_pb_4_pxon: 0
      rx_pb_4_pxoff: 0
      rx_pb_5_pxon: 0
      rx_pb_5_pxoff: 0
      rx_pb_6_pxon: 0
      rx_pb_6_pxoff: 0
      rx_pb_7_pxon: 0
      rx_pb_7_pxoff: 0
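
Almost every counter in a dump like this one is zero, so it can help to show only the non-zero ones. The following is a small sketch of such a filter; the function name nonzero_stats is made up for the example, and it assumes the usual "name: value" layout of "ethtool -S" output.

```shell
#!/bin/sh
# Print only the counters whose value is not zero.
# Reads `ethtool -S <iface>` output on stdin; the "NIC statistics:" header
# has no "name: value" pair and is skipped.
nonzero_stats() {
    awk -F': ' 'NF == 2 && $2 + 0 != 0 { sub(/^[ \t]+/, ""); print }'
}

# Example:
#   ethtool -S eno1 | nonzero_stats
```

Applied to the output above, this would leave just the lsc_int and alloc_rx_page lines.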


 > Do you happen to know if any particular in-kernel driver version worked?
 > It would help limit the search for regressing commits.

I can't retrieve the driver version itself via a “modinfo ixgbe” (no 
field mentions it) but the driver built into Debian 11 kernel 
5.10.0-10-amd64 works perfectly. Debian 12's 6.1.76-amd64 and Debian 
13's 6.6.15-amd64 are problematic. If you have a method of retrieving 
more precise information, I'd be delighted to provide it.
The problem therefore appeared somewhere between Linux 5.10 and 6.1.
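
One command that may give slightly more detail than modinfo is "ethtool -i", which reports the driver name, a version string and the NIC firmware version. The sketch below just picks those three lines out of its output; the function name driver_summary is made up for the example. With in-tree drivers on recent kernels the version field typically just repeats the kernel release, so the firmware version is likely the more useful value.

```shell
#!/bin/sh
# Extract driver, version and firmware-version from `ethtool -i <iface>`
# output read on stdin, printed as key=value pairs.
driver_summary() {
    awk -F': ' '
        $1 == "driver" || $1 == "version" || $1 == "firmware-version" {
            value = $0
            sub(/^[^:]*: /, "", value)   # keep everything after the first ": "
            print $1 "=" value
        }'
}

# Example:
#   ethtool -i eno1 | driver_summary
```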

On Linux 5.10.0-10, an ethtool returns this (the port works):
ethtool eno1
Settings for eno1:
     Supported ports: [ FIBRE ]
     Supported link modes:   10000baseT/Full
     Supported pause frame use: Symmetric
     Supports auto-negotiation: No
     Supported FEC modes: Not reported
     Advertised link modes:  10000baseT/Full
     Advertised pause frame use: Symmetric
     Advertised auto-negotiation: No
     Advertised FEC modes: Not reported
     Speed: 10000Mb/s
     Duplex: Full
     Auto-negotiation: off
     Port: Direct Attach Copper
     PHYAD: 0
     Transceiver: internal
     Supports Wake-on: d
     Wake-on: d
         Current message level: 0x00000007 (7)
                                drv probe link
     Link detected: yes


 > Ideally, if you could use git bisect on the setup that could
 > efficiently locate what regressed the behavior.

I'd really like to, but I have no idea how to go about it. Could you
send me the commands to run?
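
For reference, a bisect of this kind would go roughly as in the sketch below. It rests on several assumptions: a clone of the mainline kernel tree, mainline v5.10 and v6.1 behaving like the Debian 5.10 and 6.1 kernels, a Debian build environment, and the regression living inside the ixgbe directory (drop the path restriction if it does not converge). The function names and the 15-second wait are arbitrary choices for the example.

```shell
#!/bin/sh
# Map `ethtool <iface>` output (stdin) to the verdict git bisect expects.
link_verdict() {
    if grep -q 'Link detected: yes'; then
        echo good
    else
        echo bad
    fi
}

# One-time setup, run inside the kernel clone.
bisect_setup() {
    git bisect start -- drivers/net/ethernet/intel/ixgbe &&
    git bisect good v5.10 &&
    git bisect bad v6.1
}

# Build the commit git has checked out as Debian packages, reusing the
# running kernel's configuration. Install the linux-image package, reboot.
bisect_build() {
    cp "/boot/config-$(uname -r)" .config &&
    make olddefconfig &&
    make -j"$(nproc)" bindeb-pkg
}

# After booting the test kernel: bring the port up, wait, report the result.
bisect_report() {
    iface="$1"
    ip link set "$iface" up
    sleep 15   # arbitrary settle time before checking the link
    git bisect "$(ethtool "$iface" | link_verdict)"
}

# Typical loop (as root, from the kernel clone):
#   bisect_setup
#   bisect_build    # then install, reboot into the new kernel
#   bisect_report eno1
#   ...repeat build/report until git names the first bad commit.
```

Since the port only recovers after a reboot, each step needs a fresh boot anyway, which fits the build, install, reboot, test cycle.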


Best regards.



* Re: [Intel-wired-lan] Non-functional ixgbe driver between Intel X553 chipset and Cisco switch via kernel >=6.1 under Debian
  2024-05-04 13:29   ` kernel.org-fo5k2w
@ 2024-05-06 21:18     ` Jacob Keller
  2024-05-07  6:31       ` Linux regression tracking (Thorsten Leemhuis)
  2024-05-21  1:59       ` Jeff Daly
  2024-05-09 13:26     ` kernel.org-fo5k2w
  1 sibling, 2 replies; 9+ messages in thread
From: Jacob Keller @ 2024-05-06 21:18 UTC (permalink / raw)
  To: kernel.org-fo5k2w, Jeff Daly; +Cc: intel-wired-lan, anthony.l.nguyen, netdev



On 5/4/2024 6:29 AM, kernel.org-fo5k2w@ycharbi.fr wrote:
> Hi,
> 
>  > I haven't touched the ixgbe driver and hardware in many years, but I'll
> try to see what I can do to help.
> 
> Thank you very much for your reply. I'll answer you point by point.
> I upgraded the Qoton to Debian 13 (testing) with kernel 6.6.15 (amd64) 
> to be even more up to date.
> A quick test with Fedora 40 shows the same problem.
> 
> 

Thanks for the detailed information.

>  > So everything works when connected back to back with the Connectx-3. Ok.
> 
> Yes, exactly. Everything works as expected with the Connectx-3.
> 
> 
>  > To confirm, you use the same cable in both cases?
> 
> Yes, the same cable. I tested two different models:
> - 1 Cisco SFP-H10GB-CU1M (1 mètre)
> - 1 Cisco SFP-H10GB-CU3M (3 mètres)
> 
> I'm only using the SFP-H10GB-CU3M for the rest for convenience.
> 
> 
>  > But on the switch, the link is reported up until we bring the interface
>  > up in ixgbe, and then link drops and stays down indefinitely?
> 
> After initial start-up of the Qotom :
> # Port 10Gbe LEDs are green (please note that the MAC address OID - 
> 20:7c:14 - is registered to Qotom, not Intel).
> ip link show dev eno1
> 7: eno1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode 
> DEFAULT group default qlen 1000
>      link/ether 20:7c:14:xx:xx:xx brd ff:ff:ff:ff:ff:ff
>      altname enp11s0f0
> 
> # Cisco (Green LEDs - port mounted)
> show running-config | section interface TenGigabitEthernet1/0/1
> interface TenGigabitEthernet1/0/1
>   no cdp enable
> 
> show interface status | include Te1/0/1
> Te1/0/1   --- Vers Qotom --- connected    trunk        full    10G 
> SFP-10GBase-CX1
> 
> show ip interface brief | include Te1/0/1 | Status
> Interface              IP-Address      OK? Method Status                
> Protocol
> Te1/0/1                unassigned      YES unset up                    up
> 
> The Cisco and Qotom ports are lit and flashing as if they were 
> exchanging ARP or STP traffic. A mirror port on the Cisco's 10Gbe 
> interface, however, shows no frame exchange. I connected a PC to port 
> g1/0/13 with Wireshark for this test.
> 
> monitor session 1 source interface t1/0/1 both
> monitor session 1 destination interface g1/0/13
> 
> Port switch-on test :
> # Starting up the Qotom 10Gbe network interface
> ip link set eno1 up
> [ 1770.476075] pps pps5: new PPS source ptp5
> [ 1770.480784] ixgbe 0000:0b:00.0: registered PHC device on eno1
> [ 1770.575496] ixgbe 0000:0b:00.0 eno1: detected SFP+: 3
> 
> # The ports on both devices switch off immediately.
> # There's no going back:
> ip link set eno1 down
> [ 1831.329797] ixgbe 0000:0b:00.0: removed PHC on eno1
> 
> # The ports are always off on both sides even when unloading the ixgbe 
> core module and plugging/unplugging the Cisco SFP-H10GB-CU3M :
> rmmod ixgbe
> [ 1872.503663] ixgbe 0000:0d:00.1: complete
> [ 1872.547628] ixgbe 0000:0d:00.0: complete
> [ 1872.591645] ixgbe 0000:0b:00.1: complete
> [ 1872.631725] ixgbe 0000:0b:00.0: complete
> 
> A reboot is the only way to restore this port switch-on state.
> On startup, the Cisco switch displays the following logs (the date is 
> not configured):
> Sep 30 14:33:00: %LINK-3-UPDOWN: Interface TenGigabitEthernet1/0/1, 
> changed state to up
> Sep 30 14:33:01: %LINEPROTO-5-UPDOWN: Line protocol on Interface 
> TenGigabitEthernet1/0/1, changed state to up
> 
> 
>  > But if you use the out-of-tree ixgbe driver everything works. Hmm.
> 
> Yes, that's exactly it. The driver on the Intel site works perfectly.
> 
>  > I tried checking the out-of-tree versions to see if there were any
>  > obvious fixes. I didn't find anything. The code between the in-kernel
>  > and out-of-tree is so different that it is hard to track down. At first
>  > I wondered if this might be a regression due to recent changes to
>  > support new hardware, but it appears that v6.1 is from before a lot of
>  > that work went in.
> 
> If it helps, vesalius' post of December 3, 2023 on one of the links in 
> my original post 
> (https://forum.proxmox.com/threads/intel-x553-sfp-ixgbe-no-go-on-pve8.135129/post-612291) 
> reports that the following commit has been suspected as the culprit: 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.1.63&id=565736048bd5f9888990569993c6b6bfdf6dcb6d
> 

I'm taking a look at this commit. I see that it was done by someone from
Silicom, and says the following:

> ixgbe: Manual AN-37 for troublesome link partners for X550 SFI
> Some (Juniper MX5) SFP link partners exhibit a disinclination to
> autonegotiate with X550 configured in SFI mode.  This patch enables
> a manual AN-37 restart to work around the problem.

So it appears that it's disabling autonegotiation.

> I quote the end of his message:
> "An amazon employee states reverting this commit and recompiling the 
> kernel allows their similar network hardware to use the current in-tree 
> 6.1 ixgbe driver. Otherwise as stated in the VyOS forum thread linked 
> above compiling the linux kernel with the out-of-tree intel ixgbe driver 
> 5.19.6 works too."
> 
> 
>  > 1. The kernel message logs from when you bring up the interface. You can
> get this from dmesg or journalctl -k if you have systemd.
> 
> The kernel returns only the following three lines after a "ip link set 
> eno1 up" :
> mai 04 12:01:21 servyo kernel: pps pps5: new PPS source ptp5
> mai 04 12:01:21 servyo kernel: ixgbe 0000:0b:00.0: registered PHC device 
> on eno1
> mai 04 12:01:21 servyo kernel: ixgbe 0000:0b:00.0 eno1: detected SFP+: 3
> 

The logs show the device coming up and it detects the SFP, but we don't
see a link up status. Ok.

>  > 2. "ethtool eno1" after you bring the interface up to see what it
> reports about link
> 
> ethtool eno1
> Settings for eno1:
>      Supported ports: [ FIBRE ]
>      Supported link modes:   10000baseT/Full
>      Supported pause frame use: Symmetric
>      Supports auto-negotiation: No
>      Supported FEC modes: Not reported
>      Advertised link modes:  10000baseT/Full
>      Advertised pause frame use: Symmetric
>      Advertised auto-negotiation: No
>      Advertised FEC modes: Not reported
>      Speed: Unknown!
>      Duplex: Unknown! (255)
>      Auto-negotiation: off
>      Port: Direct Attach Copper
>      PHYAD: 0
>      Transceiver: internal
>      Supports Wake-on: d
>      Wake-on: d
>          Current message level: 0x00000007 (7)
>                                 drv probe link
>      Link detected: no
> 

No link detected, but it does detect this is a 10GBaseT cable.
Interesting it doesn't report FEC or autonegotiation. Hmm.
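
For repeated checks while testing different kernels, the two fields that
matter here can be filtered out of the ethtool output. This is only a
sketch: the helper name link_summary is made up for illustration, and it
assumes the "Field: value" layout shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of ethtool): read "ethtool <iface>" output
# on stdin and print only the Speed and "Link detected" values.
link_summary() {
    awk -F': *' '
        /Speed:/         { speed = $2 }
        /Link detected:/ { link = $2 }
        END              { printf "speed=%s link=%s\n", speed, link }
    '
}

# Intended use on the affected machine (assumes the interface is eno1):
#   ethtool eno1 | link_summary
```

On the failing kernel this should report an unknown speed and no link; on
a working kernel, 10000Mb/s and link yes.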

> 
>  > 3. "ethtool -S eno1" to see if any other stats are reported that might
> help us isolate whats going on.
> 
> ethtool -S eno1
> NIC statistics:

Snipped the stats. It looks like there wasn't much useful there. No
traffic was sent, and there is only this lsc_int count of 1, which
indicates that a link status change (LSC) interrupt fired, but it was
only triggered once.
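
To watch a single counter such as lsc_int without scrolling through the
full statistics dump, a small filter along these lines could be used. It
is a sketch only: stat_value is a hypothetical helper name, and it assumes
the usual "name: value" lines that "ethtool -S" prints.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one named counter from
# "ethtool -S <iface>" output read on stdin.
stat_value() {
    awk -v name="$1" -F': *' '
        { sub(/^[ \t]+/, "", $1) }   # strip the leading indentation
        $1 == name { print $2 }
    '
}

# Intended use (assumes the interface is eno1):
#   ethtool -S eno1 | stat_value lsc_int
```

Running it before and after "ip link set eno1 up" would show whether the
counter ever moves past 1.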


> 
>  > Do you happen to know if any particular in-kernel driver version worked?
>  > It would help limit the search for regressing commits.
> 
> I can't retrieve the driver version itself via a “modinfo ixgbe” (no 
> field mentions it) but the driver built into Debian 11 kernel 
> 5.10.0-10-amd64 works perfectly. Debian 12's 6.1.76-amd64 and Debian 
> 13's 6.6.15-amd64 are problematic. If you have a method of retrieving 
> more precise information, I'd be delighted to provide it.
> The problem therefore “spread” between the release of Linux >5.10 and >=6.1.
> 

Knowing the kernel version is the important part; we don't have specific
versioning of drivers in the kernel anymore.

> On Linux 5.10.0-10, an ethtool returns this (the port works):
> ethtool eno1
> Settings for eno1:
>      Supported ports: [ FIBRE ]
>      Supported link modes:   10000baseT/Full
>      Supported pause frame use: Symmetric
>      Supports auto-negotiation: No
>      Supported FEC modes: Not reported
>      Advertised link modes:  10000baseT/Full
>      Advertised pause frame use: Symmetric
>      Advertised auto-negotiation: No

Interestingly, this does appear to still list autonegotiation as disabled.

>      Advertised FEC modes: Not reported
>      Speed: 10000Mb/s
>      Duplex: Full
>      Auto-negotiation: off
>      Port: Direct Attach Copper
>      PHYAD: 0
>      Transceiver: internal
>      Supports Wake-on: d
>      Wake-on: d
>          Current message level: 0x00000007 (7)
>                                 drv probe link
>      Link detected: yes
> 
> 
>  > Ideally, if you could use git bisect on the setup that could
>  > efficiently locate what regressed the behavior.
> 
> I really want to, but I have no idea how to go about it. Can you write 
> me the command lines to satisfy your request?
> 

The steps would require that you build the kernel manually. I can
outline the steps I would take here:

1. get the kernel source from git.kernel.org. I place it in $HOME/git/linux
2. switch to v5.10 with 'git switch --detach v5.10'
3. copy the debian 5.10 config file to $HOME/git/linux/.config
4. build kernel with 'make -j24' (adjust -j depending on how much CPU
you want to spend building the kernel)
5. install with 'sudo make -j24 modules_install && sudo make install'
6. reboot and select the v5.10 kernel, double check it works.
7. in $HOME/git/linux run 'git bisect start' to initiate the bisect session.
8. First, label the current v5.10 commit as good with 'git bisect good'
9. Second, label the v6.1 commit as bad with 'git bisect bad v6.1'

This will initiate a bisect session and will checkout the kernel
approximately halfway between v5.10 and v6.1. For each bisection point
it checks, run the following steps:

1. 'make olddefconfig' to update the configuration for this version
2. 'make -j24' to rebuild with the current version
3. 'sudo make -j24 modules_install && sudo make install' to install this
version.
4. reboot into that version and check its behavior.
5. If it works properly then run 'git bisect good'
6. If it works incorrectly, then run 'git bisect bad'

A new commit will be selected. It will pick one between the latest good
point and the closest bad point, essentially homing in on the commit
that introduced the incorrect behavior.

If for any reason a commit can't be built or tested, you can use "git
bisect skip" and it will skip around a bit to find another point that
can be tried.
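
The per-commit loop described above can be wrapped in two small shell
helpers to cut down on typing. This is only a sketch: the function names
build_and_install and mark, and the JOBS variable, are made up for
illustration, and the reboot and link test in between remain manual.

```shell
#!/bin/sh
# Hypothetical helper: build and install the kernel at the current
# bisection point. Run from the top of the kernel tree.
build_and_install() {
    make olddefconfig &&
    make -j"${JOBS:-4}" &&
    sudo make -j"${JOBS:-4}" modules_install &&
    sudo make install
}

# Hypothetical helper: record the verdict after rebooting into the new
# kernel and testing the link. Usage: mark good|bad|skip
mark() {
    case "$1" in
        good|bad|skip) git bisect "$1" ;;
        *) echo "usage: mark good|bad|skip" >&2; return 2 ;;
    esac
}
```

A typical round would then be build_and_install, reboot, test the port,
and finally "mark good", "mark bad", or "mark skip" from the kernel tree.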

It's a lot, but it would help us home in on the exact failure. I think
it's OK if you can't do that. I am checking the out-of-tree and upstream
contents around that AN-37 commit.

The upstream implementation of ixgbe_setup_sfi_x550a is:

> static int ixgbe_setup_sfi_x550a(struct ixgbe_hw *hw, ixgbe_link_speed *speed)
> {
>         struct ixgbe_mac_info *mac = &hw->mac;
>         u32 reg_val;
>         int status;
> 
>         /* Disable all AN and force speed to 10G Serial. */
>         status = mac->ops.read_iosf_sb_reg(hw,
>                                 IXGBE_KRM_PMD_FLX_MASK_ST20(hw->bus.lan_id),
>                                 IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
>         if (status)
>                 return status;
> 
>         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_AN_EN;
>         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_AN37_EN;
>         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_SGMII_EN;
>         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_SPEED_MASK;
> 
>         /* Select forced link speed for internal PHY. */
>         switch (*speed) {
>         case IXGBE_LINK_SPEED_10GB_FULL:
>                 reg_val |= IXGBE_KRM_PMD_FLX_MASK_ST20_SPEED_10G;
>                 break;
>         case IXGBE_LINK_SPEED_1GB_FULL:
>                 reg_val |= IXGBE_KRM_PMD_FLX_MASK_ST20_SPEED_1G;
>                 break;
>         default:
>                 /* Other link speeds are not supported by internal PHY. */
>                 return -EINVAL;
>         }
> 
>         (void)mac->ops.write_iosf_sb_reg(hw,
>                         IXGBE_KRM_PMD_FLX_MASK_ST20(hw->bus.lan_id),
>                         IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
> 
>         /* change mode enforcement rules to hybrid */
>         (void)mac->ops.read_iosf_sb_reg(hw,
>                         IXGBE_KRM_FLX_TMRS_CTRL_ST31(hw->bus.lan_id),
>                         IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
>         reg_val |= 0x0400;
> 
>         (void)mac->ops.write_iosf_sb_reg(hw,
>                         IXGBE_KRM_FLX_TMRS_CTRL_ST31(hw->bus.lan_id),
>                         IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
> 
>         /* manually control the config */
>         (void)mac->ops.read_iosf_sb_reg(hw,
>                         IXGBE_KRM_LINK_CTRL_1(hw->bus.lan_id),
>                         IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
>         reg_val |= 0x20002240;
> 
>         (void)mac->ops.write_iosf_sb_reg(hw,
>                         IXGBE_KRM_LINK_CTRL_1(hw->bus.lan_id),
>                         IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
> 
>         /* move the AN base page values */
>         (void)mac->ops.read_iosf_sb_reg(hw,
>                         IXGBE_KRM_PCS_KX_AN(hw->bus.lan_id),
>                         IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
>         reg_val |= 0x1;
>         (void)mac->ops.write_iosf_sb_reg(hw,
>                         IXGBE_KRM_PCS_KX_AN(hw->bus.lan_id),
>                         IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
> 
>         /* set the AN37 over CB mode */
>         (void)mac->ops.read_iosf_sb_reg(hw,
>                         IXGBE_KRM_AN_CNTL_4(hw->bus.lan_id),
>                         IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
>         reg_val |= 0x20000000;
> 
>         (void)mac->ops.write_iosf_sb_reg(hw,
>                         IXGBE_KRM_AN_CNTL_4(hw->bus.lan_id),
>                         IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
> 
>         /* restart AN manually */
>         (void)mac->ops.read_iosf_sb_reg(hw,
>                         IXGBE_KRM_LINK_CTRL_1(hw->bus.lan_id),
>                         IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
>         reg_val |= IXGBE_KRM_LINK_CTRL_1_TETH_AN_RESTART;
> 
>         (void)mac->ops.write_iosf_sb_reg(hw,
>                         IXGBE_KRM_LINK_CTRL_1(hw->bus.lan_id),
>                         IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
> 
>         /* Toggle port SW reset by AN reset. */
>         status = ixgbe_restart_an_internal_phy_x550em(hw);
> 
>         return status;
> }


The out-of-tree implementation appears to lack that change done by the
Silicom folks.

> static s32 ixgbe_setup_sfi_x550a(struct ixgbe_hw *hw, ixgbe_link_speed *speed)
> {
>         struct ixgbe_mac_info *mac = &hw->mac;
>         s32 status;
>         u32 reg_val;
> 
>         /* Disable all AN and force speed to 10G Serial. */
>         status = mac->ops.read_iosf_sb_reg(hw,
>                                 IXGBE_KRM_PMD_FLX_MASK_ST20(hw->bus.lan_id),
>                                 IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
>         if (status != 0)
>                 return status;
> 
>         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_AN_EN;
>         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_AN37_EN;
>         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_SGMII_EN;
>         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_SPEED_MASK;
> 
>         /* Select forced link speed for internal PHY. */
>         switch (*speed) {
>         case IXGBE_LINK_SPEED_10GB_FULL:
>                 reg_val |= IXGBE_KRM_PMD_FLX_MASK_ST20_SPEED_10G;
>                 break;
>         case IXGBE_LINK_SPEED_1GB_FULL:
>                 reg_val |= IXGBE_KRM_PMD_FLX_MASK_ST20_SPEED_1G;
>                 break;
>         default:
>                 /* Other link speeds are not supported by internal PHY. */
>                 return IXGBE_ERR_LINK_SETUP;
>         }
> 
>         status = mac->ops.write_iosf_sb_reg(hw,
>                                 IXGBE_KRM_PMD_FLX_MASK_ST20(hw->bus.lan_id),
>                                 IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
> 
>         /* Toggle port SW reset by AN reset. */
>         status = ixgbe_restart_an_internal_phy_x550em(hw);
> 
>         return status;
> }

I suspect those changes must have broken the Cisco switch link behavior.
I unfortunately do not know enough about this hardware or the SFI
configuration to understand why this causes it.

If you don't want to try bisect, I would suggest trying to revert that
commit or simply replace the ixgbe_setup_sfi_x550a function with the one
from out-of-tree here. If you do that, you can rebuild just ixgbe with
"make M=drivers/net/ethernet/intel/ixgbe" and then insert the module
with "insmod drivers/net/ethernet/intel/ixgbe/ixgbe.ko".
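
Put together, the revert-and-rebuild route might look like the sketch
below. The function name rebuild_ixgbe is made up for illustration, and
the sketch assumes the tree in $HOME/git/linux is already configured and
built for the running kernel, and that the modules ixgbe depends on are
already loaded (insmod does not resolve dependencies).

```shell
#!/bin/sh
# Hypothetical helper: revert the suspected commit, rebuild only ixgbe,
# and swap the module. Runs in a subshell so the caller's working
# directory is left untouched.
rebuild_ixgbe() (
    src=${1:-$HOME/git/linux}
    cd "$src" || exit 1
    git revert --no-edit 565736048bd5 &&
    make M=drivers/net/ethernet/intel/ixgbe &&
    { sudo rmmod ixgbe 2>/dev/null || true; } &&   # unload the old module if present
    sudo insmod drivers/net/ethernet/intel/ixgbe/ixgbe.ko
)

# Intended use:
#   rebuild_ixgbe                # uses $HOME/git/linux
#   rebuild_ixgbe /path/to/tree  # or an explicit tree
```

Since the port does not recover once the link has dropped, a reboot may
still be needed before the rebuilt module can be tested.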

It seems likely that this change had an unintended side effect which broke
the Cisco switch linking.

I've added Jeff Daly, in the hopes that he could provide more details on
the change.

@Jeff, it seems likely that the change you made at 565736048bd5 ("ixgbe:
Manual AN-37 for troublesome link partners for X550 SFI") is breaking
some other switches. It would help if you could shed some light on this
change as otherwise we might need to revert it and once again break the
setup you fixed.

Thanks,
Jake


* Re: [Intel-wired-lan] Non-functional ixgbe driver between Intel X553 chipset and Cisco switch via kernel >=6.1 under Debian
  2024-05-06 21:18     ` Jacob Keller
@ 2024-05-07  6:31       ` Linux regression tracking (Thorsten Leemhuis)
  2024-05-21  1:59       ` Jeff Daly
  1 sibling, 0 replies; 9+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-05-07  6:31 UTC (permalink / raw)
  To: Jacob Keller, kernel.org-fo5k2w, Jeff Daly
  Cc: anthony.l.nguyen, intel-wired-lan, netdev

On 06.05.24 23:18, Jacob Keller wrote:
> On 5/4/2024 6:29 AM, kernel.org-fo5k2w@ycharbi.fr wrote:
>>  > Ideally, if you could use git bisect on the setup that could
>>  > efficiently locate what regressed the behavior.
>> I really want to, but I have no idea how to go about it. Can you write 
>> me the command lines to satisfy your request?
> The steps would require that you build the kernel manually. I can
> outline the steps i would take here

TWIMC, there is a document on bisection in the kernel now, see
Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst or
https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html

Ciao, Thorsten


* Re: [Intel-wired-lan] Non-functional ixgbe driver between Intel X553 chipset and Cisco switch via kernel >=6.1 under Debian
  2024-05-04 13:29   ` kernel.org-fo5k2w
  2024-05-06 21:18     ` Jacob Keller
@ 2024-05-09 13:26     ` kernel.org-fo5k2w
  2024-05-21  0:02       ` Jacob Keller
  1 sibling, 1 reply; 9+ messages in thread
From: kernel.org-fo5k2w @ 2024-05-09 13:26 UTC (permalink / raw)
  To: Jacob Keller, kernel.org-fo5k2w, Jeff Daly
  Cc: intel-wired-lan, regressions, anthony.l.nguyen, netdev

Hi,

> No link detected, but it does detect this is a 10GBaseT cable.
> Interesting it doesn't report FEC or autonegotiation. Hmm.

In fact, I personally find it strange that the "Supported link modes" is "10000baseT/Full". A DAC is not an SFP+ 8P8C (RJ45) module. Wouldn't it be more logical if the modes reported were the same as those obtained by an "ethtool eth2" on the Connectx-3 side?

Settings for eth2:
	Supported ports: [ FIBRE ]
	Supported link modes:   10000baseKX4/Full
	                        1000baseX/Full
	                        10000baseCR/Full
	                        10000baseSR/Full
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: No
	Supported FEC modes: Not reported
	Advertised link modes:  10000baseKX4/Full
	                        1000baseX/Full
	                        10000baseCR/Full
	                        10000baseSR/Full
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: No
	Advertised FEC modes: Not reported
	Speed: 10000Mb/s
	Duplex: Full
	Auto-negotiation: off
	Port: Direct Attach Copper
	PHYAD: 0
	Transceiver: internal
	Supports Wake-on: d
	Wake-on: d
        Current message level: 0x00000014 (20)
                               link ifdown
	Link detected: yes


In other words, isn't the fact that the reported mode is "10000baseT/Full" a bug in itself?
 
> Knowing the kernel is the important part, we don't have specific
> versioning of drivers in the kernel anymore.

Ok. I take note of this information.

> The steps would require that you build the kernel manually. I can
> outline the steps i would take here
> 
> 1. get the kernel source from git.kernel.org. I place it in $HOME/git/linux
> 2. switch to v5.10 with 'git switch --detach v5.10'
> 2. copy the debian 5.10 config file to $HOME/git/linux/.config
> 3. build kernel with 'make -j24' (adjust -j depending on how much CPU
> you want to spend building the kernel)
> 4. install with 'sudo make -j24 modules_install && sudo make install'
> 5. reboot and select the v5.10 kernel, double check it works.
> 6. in $HOME/git/linux run 'git bisect start' to initiate the bisect session.
> 7. First, label the current v5.10 commit as good with 'git bisect good'
> 8. Second, label the v6.1 commit as bad with 'git bisect bad v6.1'
> 
> This will initiate a bisect session and will checkout the kernel
> approximately halfway between v5.10 and v6.1. For each bisection point
> it checks, run the following steps:
> 
> 1. 'make olddefconfig' to update the configuration for this version
> 2. 'make -j24' to rebuild with the current version
> 3. 'sudo make -j24 modules_install && sudo make install' to install this
> version.
> 4. reboot into that version and check its behavior.
> 5. If it works properly then run 'git bisect good'
> 6. If it works incorrectly, then run 'git bisect bad'
> 
> A new commit will be selected. It will pick one between the latest good
> point and the closest bad point, essentially honing in towards the
> incorrect behavior.
> 
> If for any reason a commit can't be built or tested, you can use "git
> bisect skip" and it will skip around a bit to find another point that
> can be tried.

Thank you for your and Thorsten Leemhuis's advice. I don't know whether the following bisect log will be of any help to you. However, I have determined precisely that the problem was introduced with version 6.1. If I boot into 6.0, it works perfectly, so there are fewer differences to search through. Here's the bisect log, but I'm still dubious about its relevance because the "git bisect bad v6.1" command returned "7614896350aa20764c5eca527262d9eb0a57da63 était à la fois good et bad" (in English: "was both good and bad"). I didn't really understand how it all worked:

git bisect start
# good: [4fe89d07dcc2804c8b562f6c7896a45643d34b2f] Linux 6.0
git bisect good 45eb8ae5370d5df1ee8236f45df3f29103ba6e12
# bad: [830b3c68c1fb1e9176028d02ef86f3cf76aa2476] Linux 6.1
git bisect bad 7614896350aa20764c5eca527262d9eb0a57da63

I should point out that I had to switch back to Debian 11 because 12 and 13 refuse to compile these old kernels... Anyway, I compiled the versions successively and came across the difference in operation between 6.0 and 6.1.

> I suspect those changes must have broken the Cisco switch link behavior.
> I unfortunately do not know enough about this hardware or the SFI
> configuration to understand why this causes it.
> 
> If you don't want to try bisect, I would suggest trying to revert that
> commit or simply replace the ixgbe_setup_sfi_x550a function with the one
> from out-of-tree here. If you do that, you can rebuild just ixgbe with
> "make M=drivers/net/ethernet/intel/ixgbe" and then insert the module
> with "insmod drivers/net/ethernet/intel/ixgbe/ixgbe.ko".
> 
> It seems likely that this change had unintended side effect which broke
> the Cisco switch linking.


If I do a "git revert 565736048bd5f9888990569993c6b6bfdf6dcb6d" to undo the suspected problem commit, compile kernel 6.1 and boot into it, it works perfectly.
So it turns out that this commit is the source of the malfunction and was introduced with Linux 6.1.

 
> I've added Jeff Daly, in the hopes that he could provide more details on
> the change.
> 
> @Jeff, it seems likely that the change you made at 565736048bd5 ("ixgbe:
> Manual AN-37 for troublesome link partners for X550 SFI") is breaking
> some other switches. It would help if you could shed some light on this
> change as otherwise we might need to revert it and once again break the
> setup you fixed.
> 
> Thanks,
> Jake

Let me know if you need more information. I'll be happy to help!

Best regards.

⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Yohan Charbi
⢿⡄⠘⠷⠚⠋⠀ Cordialement
⠈⠳⣄⠀


* Re: [Intel-wired-lan] Non-functional ixgbe driver between Intel X553 chipset and Cisco switch via kernel >=6.1 under Debian
  2024-05-09 13:26     ` kernel.org-fo5k2w
@ 2024-05-21  0:02       ` Jacob Keller
  0 siblings, 0 replies; 9+ messages in thread
From: Jacob Keller @ 2024-05-21  0:02 UTC (permalink / raw)
  To: kernel.org-fo5k2w, Jeff Daly
  Cc: intel-wired-lan, regressions, anthony.l.nguyen, netdev



On 5/9/2024 6:26 AM, kernel.org-fo5k2w@ycharbi.fr wrote:
> Hi,
> 
>> No link detected, but it does detect this is a 10GBaseT cable.
>> Interesting it doesn't report FEC or autonegotiation. Hmm.
> 
> In fact, I personally find it strange that the "Supported link modes" is "10000baseT/Full". A DAC is not a SFP+ 8P8C (RJ45) module. Wouldn't it be more logical if the modes reported were the same as those obtained by an "ethtool eth2" on the Connectx-3 side? :
> 
> Settings for eth2:
> 	Supported ports: [ FIBRE ]
> 	Supported link modes:   10000baseKX4/Full
> 	                        1000baseX/Full
> 	                        10000baseCR/Full
> 	                        10000baseSR/Full
> 	Supported pause frame use: Symmetric Receive-only
> 	Supports auto-negotiation: No
> 	Supported FEC modes: Not reported
> 	Advertised link modes:  10000baseKX4/Full
> 	                        1000baseX/Full
> 	                        10000baseCR/Full
> 	                        10000baseSR/Full
> 	Advertised pause frame use: Symmetric
> 	Advertised auto-negotiation: No
> 	Advertised FEC modes: Not reported
> 	Speed: 10000Mb/s
> 	Duplex: Full
> 	Auto-negotiation: off
> 	Port: Direct Attach Copper
> 	PHYAD: 0
> 	Transceiver: internal
> 	Supports Wake-on: d
> 	Wake-on: d
>         Current message level: 0x00000014 (20)
>                                link ifdown
> 	Link detected: yes
> 
> 
> In other words, isn't the fact that the reported mode is "10000baseT/Full" a bug in itself?
>  

Possibly, though I am not familiar enough to know for sure.

>> Knowing the kernel is the important part, we don't have specific
>> versioning of drivers in the kernel anymore.
> 
> Ok. I take note of this information.
> 
>> The steps would require that you build the kernel manually. I can
>> outline the steps i would take here
>>
>> 1. get the kernel source from git.kernel.org. I place it in $HOME/git/linux
>> 2. switch to v5.10 with 'git switch --detach v5.10'
>> 2. copy the debian 5.10 config file to $HOME/git/linux/.config
>> 3. build kernel with 'make -j24' (adjust -j depending on how much CPU
>> you want to spend building the kernel)
>> 4. install with 'sudo make -j24 modules_install && sudo make install'
>> 5. reboot and select the v5.10 kernel, double check it works.
>> 6. in $HOME/git/linux run 'git bisect start' to initiate the bisect session.
>> 7. First, label the current v5.10 commit as good with 'git bisect good'
>> 8. Second, label the v6.1 commit as bad with 'git bisect bad v6.1'
>>
>> This will initiate a bisect session and will checkout the kernel
>> approximately halfway between v5.10 and v6.1. For each bisection point
>> it checks, run the following steps:
>>
>> 1. 'make olddefconfig' to update the configuration for this version
>> 2. 'make -j24' to rebuild with the current version
>> 3. 'sudo make -j24 modules_install && sudo make install' to install this
>> version.
>> 4. reboot into that version and check its behavior.
>> 5. If it works properly then run 'git bisect good'
>> 6. If it works incorrectly, then run 'git bisect bad'
>>
>> A new commit will be selected. It will pick one between the latest good
>> point and the closest bad point, essentially honing in towards the
>> incorrect behavior.
>>
>> If for any reason a commit can't be built or tested, you can use "git
>> bisect skip" and it will skip around a bit to find another point that
>> can be tried.
> 
> Thank you for your and Thorsten Leemhuis's advice. I don't know whether the following Bisect log will be of any help to you. However, I have determined precisely that the problem was introduced with version 6.1. If I boot into 6.0, it works perfectly. So there are fewer differences to search for the problem. Here's the feedback from Bisect, but I'm still dubious about the relevance of this log because the “git bisect bad v6.1” command returned "7614896350aa20764c5eca527262d9eb0a57da63 était à la fois good et bad"... I didn't really understand how it all worked... :
> 
> git bisect start
> # good: [4fe89d07dcc2804c8b562f6c7896a45643d34b2f] Linux 6.0
> git bisect good 45eb8ae5370d5df1ee8236f45df3f29103ba6e12
> # bad: [830b3c68c1fb1e9176028d02ef86f3cf76aa2476] Linux 6.1
> git bisect bad 7614896350aa20764c5eca527262d9eb0a57da63
> 
> I should point out that I had to switch back to Debian 11 because 12 and 13 refuse to compile these old kernels... Anyway, I compiled the versions successively and came across the difference in operation between 6.0 and 6.1.
> 

From this point, it would check out a commit and then you would build and
verify if it works, and then run "git bisect good" if the test passed,
and "git bisect bad" if the test failed, and it would then pick a new
test candidate. It uses a bisection process (hence the name) to
reduce the range by ~half each time to quickly home in on the
offending commit.
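
To give a feel for how quickly the halving converges, the sketch below
estimates the number of build-and-test rounds for a given number of
candidate commits. The helper name bisect_steps and the figure of 16000
commits are assumptions for illustration only; the real count for a range
can be read with "git rev-list --count v6.0..v6.1".

```shell
#!/bin/sh
# Hypothetical helper: estimate how many halvings are needed to narrow
# N candidate commits down to one (roughly log2(N), rounded up).
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))     # each round discards about half the range
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

# Assumed example: a range of 16000 commits
bisect_steps 16000    # prints 14
```

So even a full release cycle of commits needs on the order of a dozen or
so kernel builds, not thousands.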

>> I suspect those changes must have broken the Cisco switch link behavior.
>> I unfortunately do not know enough about this hardware or the SFI
>> configuration to understand why this causes it.
>>
>> If you don't want to try bisect, I would suggest trying to revert that
>> commit or simply replace the ixgbe_setup_sfi_x550a function with the one
>> from out-of-tree here. If you do that, you can rebuild just ixgbe with
>> "make M=drivers/net/ethernet/intel/ixgbe" and then insert the module
>> with "insmod drivers/net/ethernet/intel/ixgbe/ixgbe.ko".
>>
>> It seems likely that this change had unintended side effect which broke
>> the Cisco switch linking.
> 
> 
> If I do a "git revert 565736048bd5f9888990569993c6b6bfdf6dcb6d" to go back before the state of the suspected problem commit, compile kernel 6.1 and boot on it, it works perfectly.
> So it turns out that this is the source of the malfunction and was introduced with Linux 6.1.
> 

That is sufficient for me. I did some further digging here and it looks
like this was originally some workaround for a specific switch. Per the
commit message, this was submitted to our driver by Jeff Daly from
Silicom. I suspect that the fix happens to resolve issues on a
particular switch. It clearly breaks other switches, and in a pretty
bad way.

From what I can understand, the Cisco switch you are using sees the AN37
and basically decides the port is confused and gives up and won't
attempt to link again without a power cycle.

>  
>> I've added Jeff Daly, in the hopes that he could provide more details on
>> the change.
>>
>> @Jeff, it seems likely that the change you made at 565736048bd5 ("ixgbe:
>> Manual AN-37 for troublesome link partners for X550 SFI") is breaking
>> some other switches. It would help if you could shed some light on this
>> change as otherwise we might need to revert it and once again break the
>> setup you fixed.
>>
>> Thanks,
>> Jake
> 
> Let me know if you need more information. I'll be happy to help!
> 

I believe the correct thing to do here is to revert this change. It
helps some cases, but clearly broke others. Neither Jeff nor anyone else
from Silicom has responded with clarifications. We (Intel) do not have
this change in the related out-of-tree driver. Although this did get
tested by us, I suspect we simply did not test it with the correct range
of devices and switches.

I don't know enough about the standards to know which switch is at fault
or behaving incorrectly here, but I am inclined to revert this fix
because it breaks, and the original authors of the fix aren't commenting.

> Best regards.
> 
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢠⠒⠀⣿⡁ Yohan Charbi
> ⢿⡄⠘⠷⠚⠋⠀ Cordialement
> ⠈⠳⣄⠀


* Re: [Intel-wired-lan] Non-functional ixgbe driver between Intel X553 chipset and Cisco switch via kernel >=6.1 under Debian
  2024-05-06 21:18     ` Jacob Keller
  2024-05-07  6:31       ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-05-21  1:59       ` Jeff Daly
  1 sibling, 0 replies; 9+ messages in thread
From: Jeff Daly @ 2024-05-21  1:59 UTC (permalink / raw)
  To: Jacob Keller, kernel.org-fo5k2w; +Cc: intel-wired-lan, anthony.l.nguyen, netdev

Hi Jacob, I've only just gotten around to looking at this, as it seems to have fallen off my list of emails to read. I've been in the middle of an Amston Lake platform bringup here at Silicom.

I'll have to dig back through my memory because this was a while back and I was mostly the shepherd to get this patch through; it was originally from another source. If I recall correctly, it was (as stated in the patch) a specific workaround for a specific Juniper switch with certain SFPs in conjunction with the Denverton implementation of the X550 IP.

I wasn't aware that there were conflicts with other switches after the kernel updates.

> -----Original Message-----
> From: Jacob Keller <jacob.e.keller@intel.com>
> Sent: Monday, May 6, 2024 5:19 PM
> To: kernel.org-fo5k2w@ycharbi.fr; Jeff Daly <jeffd@silicom-usa.com>
> Cc: anthony.l.nguyen@intel.com; intel-wired-lan@lists.osuosl.org;
> jesse.brandeburg@intel.com; netdev@vger.kernel.org
> Subject: Re: Non-functional ixgbe driver between Intel X553 chipset and
> Cisco switch via kernel >=6.1 under Debian
> 
> 
> On 5/4/2024 6:29 AM, kernel.org-fo5k2w@ycharbi.fr wrote:
> > Hi,
> >
> >  > I haven't touched the ixgbe driver and hardware in many years, but
> > I'll try to see what I can do to help.
> >
> > Thank you very much for your reply. I'll answer you point by point.
> > I upgraded the Qotom to Debian 13 (testing) with kernel 6.6.15 (amd64)
> > to be even more up to date.
> > A quick test with Fedora 40 shows the same problem.
> >
> >
> 
> Thanks for the detailed information.
> 
> >  > So everything works when connected back to back with the Connectx-3. Ok.
> >
> > Yes, exactly. Everything works as expected with the Connectx-3.
> >
> >
> >  > To confirm, you use the same cable in both cases?
> >
> > Yes, the same cable. I tested two different models:
> > - 1 Cisco SFP-H10GB-CU1M (1 meter)
> > - 1 Cisco SFP-H10GB-CU3M (3 meters)
> >
> > I'm only using the SFP-H10GB-CU3M for the rest for convenience.
> >
> >
> >  > But on the switch, the link is reported up until we bring the interface
> >  > up in ixgbe, and then link drops and stays down indefinitely?
> >
> > After initial start-up of the Qotom :
> > # Port 10Gbe LEDs are green (please note that the MAC address OUI -
> > # 20:7c:14 - is registered to Qotom, not Intel).
> > ip link show dev eno1
> > 7: eno1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
> >      link/ether 20:7c:14:xx:xx:xx brd ff:ff:ff:ff:ff:ff
> >      altname enp11s0f0
> >
> > # Cisco (Green LEDs - port mounted)
> > show running-config | section interface TenGigabitEthernet1/0/1
> > interface TenGigabitEthernet1/0/1
> >   no cdp enable
> >
> > show interface status | include Te1/0/1
> > Te1/0/1   --- Vers Qotom --- connected    trunk        full    10G SFP-10GBase-CX1
> >
> > show ip interface brief | include Te1/0/1 | Status
> > Interface              IP-Address      OK? Method Status                Protocol
> > Te1/0/1                unassigned      YES unset  up                    up
> >
> > The Cisco and Qotom ports are lit and flashing as if they were
> > exchanging ARP or STP traffic. A mirror port on the Cisco's 10Gbe
> > interface, however, shows no frame exchange. I connected a PC to port
> > g1/0/13 with Wireshark for this test.
> >
> > monitor session 1 source interface t1/0/1 both
> > monitor session 1 destination interface g1/0/13
> >
> > Port switch-on test :
> > # Starting up the Qotom 10Gbe network interface
> > ip link set eno1 up
> > [ 1770.476075] pps pps5: new PPS source ptp5
> > [ 1770.480784] ixgbe 0000:0b:00.0: registered PHC device on eno1
> > [ 1770.575496] ixgbe 0000:0b:00.0 eno1: detected SFP+: 3
> >
> > # The ports on both devices switch off immediately.
> > # There's no going back:
> > ip link set eno1 down
> > [ 1831.329797] ixgbe 0000:0b:00.0: removed PHC on eno1
> >
> > # The ports remain off on both sides, even after unloading the ixgbe
> > # module and unplugging/replugging the Cisco SFP-H10GB-CU3M:
> > rmmod ixgbe
> > [ 1872.503663] ixgbe 0000:0d:00.1: complete
> > [ 1872.547628] ixgbe 0000:0d:00.0: complete
> > [ 1872.591645] ixgbe 0000:0b:00.1: complete
> > [ 1872.631725] ixgbe 0000:0b:00.0: complete
> >
> > A reboot is the only way to restore this port switch-on state.
> > On startup, the Cisco switch displays the following logs (the date is
> > not configured):
> > Sep 30 14:33:00: %LINK-3-UPDOWN: Interface TenGigabitEthernet1/0/1, changed state to up
> > Sep 30 14:33:01: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/0/1, changed state to up
> >
> >
> >  > But if you use the out-of-tree ixgbe driver everything works. Hmm.
> >
> > Yes, that's exactly it. The driver on the Intel site works perfectly.
> >
> >  > I tried checking the out-of-tree versions to see if there were any
> >  > obvious fixes. I didn't find anything. The code between the in-kernel
> >  > and out-of-tree is so different that it is hard to track down. At first
> >  > I wondered if this might be a regression due to recent changes to
> >  > support new hardware, but it appears that v6.1 is from before a lot of
> >  > that work went in.
> >
> > If it helps, vesalius' post of December 3, 2023 on one of the links in
> > my original post
> > (https://forum.proxmox.com/threads/intel-x553-sfp-ixgbe-no-go-on-pve8.135129/post-612291)
> > reports that the following commit has been suspected as the culprit:
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.1.63&id=565736048bd5f9888990569993c6b6bfdf6dcb6d
> >
> 
> I'm taking a look at this commit. I see that it was done by someone from
> Silicom, and says the following:
> 
> > ixgbe: Manual AN-37 for troublesome link partners for X550 SFI
> >
> > Some (Juniper MX5) SFP link partners exhibit a disinclination to
> > autonegotiate with X550 configured in SFI mode.  This patch enables a
> > manual AN-37 restart to work around the problem.
> 
> So it appears that it's disabling autonegotiation.
> 
> > I quote the end of his message:
> > "An amazon employee states reverting this commit and recompiling the
> > kernel allows their similar network hardware to use the current in-tree
> > 6.1 ixgbe driver. Otherwise as stated in the VyOS forum thread linked
> > above compiling the linux kernel with the out-of-tree intel ixgbe driver
> > 5.19.6 works too."
> >
> >
> >  > 1. The kernel message logs from when you bring up the interface.
> > You can get this from dmesg or journalctl -k if you have systemd.
> >
> > The kernel returns only the following three lines after a "ip link set
> > eno1 up" :
> > mai 04 12:01:21 servyo kernel: pps pps5: new PPS source ptp5
> > mai 04 12:01:21 servyo kernel: ixgbe 0000:0b:00.0: registered PHC device on eno1
> > mai 04 12:01:21 servyo kernel: ixgbe 0000:0b:00.0 eno1: detected SFP+: 3
> >
> 
> The logs show the device coming up and it detects the SFP, but we don't see
> a link up status. Ok.
> 
> >  > 2. "ethtool eno1" after you bring the interface up to see what it
> > reports about link
> >
> > ethtool eno1
> > Settings for eno1:
> >      Supported ports: [ FIBRE ]
> >      Supported link modes:   10000baseT/Full
> >      Supported pause frame use: Symmetric
> >      Supports auto-negotiation: No
> >      Supported FEC modes: Not reported
> >      Advertised link modes:  10000baseT/Full
> >      Advertised pause frame use: Symmetric
> >      Advertised auto-negotiation: No
> >      Advertised FEC modes: Not reported
> >      Speed: Unknown!
> >      Duplex: Unknown! (255)
> >      Auto-negotiation: off
> >      Port: Direct Attach Copper
> >      PHYAD: 0
> >      Transceiver: internal
> >      Supports Wake-on: d
> >      Wake-on: d
> >          Current message level: 0x00000007 (7)
> >                                 drv probe link
> >      Link detected: no
> >
> 
> No link detected, but it does detect this is a 10GBaseT cable.
> Interesting it doesn't report FEC or autonegotiation. Hmm.
> 
> >
> >  > 3. "ethtool -S eno1" to see if any other stats are reported that
> > might help us isolate whats going on.
> >
> > ethtool -S eno1
> > NIC statistics:
> 
> Snipped the stats. It looks like there wasn't much useful there. No traffic was
> sent, and there is only this lsc_int count of 1, which indicates that a check link
> status interrupt was fired, but it's only triggered once.
> 
> 
> >
> >  > Do you happen to know if any particular in-kernel driver version worked?
> >  > It would help limit the search for regressing commits.
> >
> > I can't retrieve the driver version itself via a “modinfo ixgbe” (no
> > field mentions it) but the driver built into Debian 11 kernel
> > 5.10.0-10-amd64 works perfectly. Debian 12's 6.1.76-amd64 and Debian
> > 13's 6.6.15-amd64 are problematic. If you have a method of retrieving
> > more precise information, I'd be delighted to provide it.
> > The problem was therefore introduced somewhere after Linux 5.10 and no later
> > than 6.1.
> >
> 
> Knowing the kernel version is the important part; we don't have specific
> versioning of drivers in the kernel anymore.
> 
> > On Linux 5.10.0-10, an ethtool returns this (the port works):
> > ethtool eno1
> > Settings for eno1:
> >      Supported ports: [ FIBRE ]
> >      Supported link modes:   10000baseT/Full
> >      Supported pause frame use: Symmetric
> >      Supports auto-negotiation: No
> >      Supported FEC modes: Not reported
> >      Advertised link modes:  10000baseT/Full
> >      Advertised pause frame use: Symmetric
> >      Advertised auto-negotiation: No
> 
> Interestingly, this does appear to still list autonegotiation as disabled.
> 
> >      Advertised FEC modes: Not reported
> >      Speed: 10000Mb/s
> >      Duplex: Full
> >      Auto-negotiation: off
> >      Port: Direct Attach Copper
> >      PHYAD: 0
> >      Transceiver: internal
> >      Supports Wake-on: d
> >      Wake-on: d
> >          Current message level: 0x00000007 (7)
> >                                 drv probe link
> >      Link detected: yes
> >
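The two ethtool outputs quoted above differ only in the Speed, Duplex and Link detected fields. A minimal sketch of a filter that pulls out just those fields, so the two kernels can be compared at a glance. The helper name link_summary is hypothetical (it is not part of ethtool), and the sample input is copied from the failing output above:

```shell
#!/bin/sh
# link_summary is a hypothetical helper (not part of ethtool): it reads
# `ethtool <iface>` output on stdin and prints only the three fields that
# differ between the failing and the working kernel.
link_summary() {
    awk -F': *' '
        { sub(/^[ \t]+/, "", $1) }
        $1 == "Speed" || $1 == "Duplex" || $1 == "Link detected" {
            print $1 "=" $2
        }
    '
}

# Demo on lines copied from the failing output quoted above.
link_summary <<'EOF'
     Supported ports: [ FIBRE ]
     Speed: Unknown!
     Duplex: Unknown! (255)
     Auto-negotiation: off
     Link detected: no
EOF
```

On a live system the same filter could be fed with "ethtool eno1 | link_summary" once the function is defined in the current shell.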
> >
> >  > Ideally, if you could use git bisect on the setup that could
> >  > efficiently locate what regressed the behavior.
> >
> > I really want to, but I have no idea how to go about it. Can you write
> > me the command lines to satisfy your request?
> >
> 
> The steps would require that you build the kernel manually. I can outline the
> steps I would take here:
> 
> 1. Get the kernel source from git.kernel.org. I place it in $HOME/git/linux.
> 2. Switch to v5.10 with 'git switch --detach v5.10'.
> 3. Copy the Debian 5.10 config file to $HOME/git/linux/.config.
> 4. Build the kernel with 'make -j24' (adjust -j depending on how much CPU you
>    want to spend building the kernel).
> 5. Install with 'sudo make -j24 modules_install && sudo make install'.
> 6. Reboot and select the v5.10 kernel; double check it works.
> 7. In $HOME/git/linux run 'git bisect start' to initiate the bisect session.
> 8. First, label the current v5.10 commit as good with 'git bisect good'.
> 9. Second, label the v6.1 commit as bad with 'git bisect bad v6.1'.
> 
> This will initiate a bisect session and will checkout the kernel approximately
> halfway between v5.10 and v6.1. For each bisection point it checks, run the
> following steps:
> 
> 1. 'make olddefconfig' to update the configuration for this version.
> 2. 'make -j24' to rebuild with the current version.
> 3. 'sudo make -j24 modules_install && sudo make install' to install this version.
> 4. reboot into that version and check its behavior.
> 5. If it works properly then run 'git bisect good'
> 6. If it works incorrectly, then run 'git bisect bad'
> 
> A new commit will be selected. It will pick one between the latest good point
> and the closest bad point, essentially honing in towards the incorrect
> behavior.
> 
> If for any reason a commit can't be built or tested, you can use "git bisect
> skip" and it will skip around a bit to find another point that can be tried.
> 
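For readers who have never run a bisect, the good/bad loop described above can be tried safely on a throwaway repository first. The following is only a toy sketch: the repository, the file names (state, counter) and the "link=up" marker are invented for illustration, and "git bisect run" stands in for the manual reboot-and-test step that real hardware requires.

```shell
#!/bin/sh
# Toy demonstration of the bisect loop on a throwaway repository.
# Nothing here touches a kernel tree; file names and contents are made up.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "bisect-demo@example.invalid"
git config user.name "Bisect Demo"

# Eight commits; from commit 5 onward the pretend "link" is broken.
i=1
while [ "$i" -le 8 ]; do
    if [ "$i" -ge 5 ]; then
        echo "link=down" > state
    else
        echo "link=up" > state
    fi
    echo "$i" > counter
    git add state counter
    git commit -q -m "commit $i"
    i=$((i + 1))
done

good=$(git rev-list --max-parents=0 HEAD)   # commit 1 plays the role of v5.10
git bisect start HEAD "$good" >/dev/null    # HEAD plays the role of v6.1

# `git bisect run` answers good/bad automatically from the exit status:
# 0 means good, 1 means bad. With real hardware this step is a manual test.
git bisect run grep -q "link=up" state >/dev/null

first_bad=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1
echo "first bad: $first_bad"
```

On the kernel tree the same session would be driven by hand with 'git bisect good' and 'git bisect bad' after each reboot, exactly as in the numbered steps.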
> It's a lot, but it would help us hone in on the exact failure. I think it's OK if you
> can't do that. I am checking the out-of-tree and upstream contents around
> that AN-37 commit.
> 
> The upstream implementation of ixgbe_setup_sfi_x550a is:
> 
> > static int ixgbe_setup_sfi_x550a(struct ixgbe_hw *hw, ixgbe_link_speed *speed)
> > {
> >         struct ixgbe_mac_info *mac = &hw->mac;
> >         u32 reg_val;
> >         int status;
> >
> >         /* Disable all AN and force speed to 10G Serial. */
> >         status = mac->ops.read_iosf_sb_reg(hw,
> >                                 IXGBE_KRM_PMD_FLX_MASK_ST20(hw->bus.lan_id),
> >                                 IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
> >         if (status)
> >                 return status;
> >
> >         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_AN_EN;
> >         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_AN37_EN;
> >         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_SGMII_EN;
> >         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_SPEED_MASK;
> >
> >         /* Select forced link speed for internal PHY. */
> >         switch (*speed) {
> >         case IXGBE_LINK_SPEED_10GB_FULL:
> >                 reg_val |= IXGBE_KRM_PMD_FLX_MASK_ST20_SPEED_10G;
> >                 break;
> >         case IXGBE_LINK_SPEED_1GB_FULL:
> >                 reg_val |= IXGBE_KRM_PMD_FLX_MASK_ST20_SPEED_1G;
> >                 break;
> >         default:
> >                 /* Other link speeds are not supported by internal PHY. */
> >                 return -EINVAL;
> >         }
> >
> >         (void)mac->ops.write_iosf_sb_reg(hw,
> >                         IXGBE_KRM_PMD_FLX_MASK_ST20(hw->bus.lan_id),
> >                         IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
> >
> >         /* change mode enforcement rules to hybrid */
> >         (void)mac->ops.read_iosf_sb_reg(hw,
> >                         IXGBE_KRM_FLX_TMRS_CTRL_ST31(hw->bus.lan_id),
> >                         IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
> >         reg_val |= 0x0400;
> >
> >         (void)mac->ops.write_iosf_sb_reg(hw,
> >                         IXGBE_KRM_FLX_TMRS_CTRL_ST31(hw->bus.lan_id),
> >                         IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
> >
> >         /* manually control the config */
> >         (void)mac->ops.read_iosf_sb_reg(hw,
> >                         IXGBE_KRM_LINK_CTRL_1(hw->bus.lan_id),
> >                         IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
> >         reg_val |= 0x20002240;
> >
> >         (void)mac->ops.write_iosf_sb_reg(hw,
> >                         IXGBE_KRM_LINK_CTRL_1(hw->bus.lan_id),
> >                         IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
> >
> >         /* move the AN base page values */
> >         (void)mac->ops.read_iosf_sb_reg(hw,
> >                         IXGBE_KRM_PCS_KX_AN(hw->bus.lan_id),
> >                         IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
> >         reg_val |= 0x1;
> >         (void)mac->ops.write_iosf_sb_reg(hw,
> >                         IXGBE_KRM_PCS_KX_AN(hw->bus.lan_id),
> >                         IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
> >
> >         /* set the AN37 over CB mode */
> >         (void)mac->ops.read_iosf_sb_reg(hw,
> >                         IXGBE_KRM_AN_CNTL_4(hw->bus.lan_id),
> >                         IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
> >         reg_val |= 0x20000000;
> >
> >         (void)mac->ops.write_iosf_sb_reg(hw,
> >                         IXGBE_KRM_AN_CNTL_4(hw->bus.lan_id),
> >                         IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
> >
> >         /* restart AN manually */
> >         (void)mac->ops.read_iosf_sb_reg(hw,
> >                         IXGBE_KRM_LINK_CTRL_1(hw->bus.lan_id),
> >                         IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
> >         reg_val |= IXGBE_KRM_LINK_CTRL_1_TETH_AN_RESTART;
> >
> >         (void)mac->ops.write_iosf_sb_reg(hw,
> >                         IXGBE_KRM_LINK_CTRL_1(hw->bus.lan_id),
> >                         IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
> >
> >         /* Toggle port SW reset by AN reset. */
> >         status = ixgbe_restart_an_internal_phy_x550em(hw);
> >
> >         return status;
> > }
> 
> 
> The out-of-tree implementation appears to lack the change made by the
> Silicom folks.
> 
> > static s32 ixgbe_setup_sfi_x550a(struct ixgbe_hw *hw, ixgbe_link_speed *speed)
> > {
> >         struct ixgbe_mac_info *mac = &hw->mac;
> >         s32 status;
> >         u32 reg_val;
> >
> >         /* Disable all AN and force speed to 10G Serial. */
> >         status = mac->ops.read_iosf_sb_reg(hw,
> >                                 IXGBE_KRM_PMD_FLX_MASK_ST20(hw->bus.lan_id),
> >                                 IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
> >         if (status != 0)
> >                 return status;
> >
> >         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_AN_EN;
> >         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_AN37_EN;
> >         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_SGMII_EN;
> >         reg_val &= ~IXGBE_KRM_PMD_FLX_MASK_ST20_SPEED_MASK;
> >
> >         /* Select forced link speed for internal PHY. */
> >         switch (*speed) {
> >         case IXGBE_LINK_SPEED_10GB_FULL:
> >                 reg_val |= IXGBE_KRM_PMD_FLX_MASK_ST20_SPEED_10G;
> >                 break;
> >         case IXGBE_LINK_SPEED_1GB_FULL:
> >                 reg_val |= IXGBE_KRM_PMD_FLX_MASK_ST20_SPEED_1G;
> >                 break;
> >         default:
> >                 /* Other link speeds are not supported by internal PHY. */
> >                 return IXGBE_ERR_LINK_SETUP;
> >         }
> >
> >         status = mac->ops.write_iosf_sb_reg(hw,
> >                                 IXGBE_KRM_PMD_FLX_MASK_ST20(hw->bus.lan_id),
> >                                 IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
> >
> >         /* Toggle port SW reset by AN reset. */
> >         status = ixgbe_restart_an_internal_phy_x550em(hw);
> >
> >         return status;
> > }
> 
> I suspect those changes must have broken the Cisco switch link behavior.
> I unfortunately do not know enough about this hardware or the SFI
> configuration to understand why this causes it.
> 
> If you don't want to try bisect, I would suggest trying to revert that commit or
> simply replacing the ixgbe_setup_sfi_x550a function with the out-of-tree one
> shown here. If you do that, you can rebuild just ixgbe with "make
> M=drivers/net/ethernet/intel/ixgbe" and then insert the module with
> "insmod drivers/net/ethernet/intel/ixgbe/ixgbe.ko".
> 
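The revert-and-rebuild suggestion above can be collected into one small script. This is a sketch under assumptions, not a tested procedure: it expects to run from the top of an already configured kernel tree matching the running kernel, it assumes the abbreviated hash quoted in this thread resolves in that tree (check with "git log" and substitute if not), and the run/rebuild_ixgbe helpers and the DRY_RUN switch are made up for illustration. By default it only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them unless
# DRY_RUN=0 is set. Run from the top of a configured kernel source tree.
set -eu

COMMIT=565736048bd5                        # commit named in this thread
MODDIR=drivers/net/ethernet/intel/ixgbe    # path given in the message above
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

rebuild_ixgbe() {
    run git revert --no-edit "$COMMIT"     # undo the AN-37 change
    run make M="$MODDIR"                   # rebuild only the ixgbe module
    run sudo rmmod ixgbe                   # unload the currently loaded module
    run sudo insmod "$MODDIR/ixgbe.ko"     # load the freshly built one
}

rebuild_ixgbe
```

Keeping the default as a dry run means the plan can be reviewed before anything is reverted or unloaded; setting DRY_RUN=0 executes the same four commands.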
> It seems likely that this change had an unintended side effect which broke the
> Cisco switch linking.
> 
> I've added Jeff Daly, in the hopes that he could provide more details on the
> change.
> 
> @Jeff, it seems likely that the change you made at 565736048bd5 ("ixgbe:
> Manual AN-37 for troublesome link partners for X550 SFI") is breaking some
> other switches. It would help if you could shed some light on this change as
> otherwise we might need to revert it and once again break the setup you
> fixed.
> 
> Thanks,
> Jake

end of thread, other threads:[~2024-05-21  1:59 UTC | newest]

Thread overview: 9+ messages
2024-02-15 11:02 [Intel-wired-lan] Non-functional ixgbe driver between Intel X553 chipset and Cisco switch via kernel >=6.1 under Debian kernel.org-fo5k2w
2024-05-03 13:01 ` kernel.org-fo5k2w
2024-05-03 18:37 ` Jacob Keller
2024-05-04 13:29   ` kernel.org-fo5k2w
2024-05-06 21:18     ` Jacob Keller
2024-05-07  6:31       ` Linux regression tracking (Thorsten Leemhuis)
2024-05-21  1:59       ` Jeff Daly
2024-05-09 13:26     ` kernel.org-fo5k2w
2024-05-21  0:02       ` Jacob Keller
