* mlx5_core 5.10 stable series regression starting at 5.10.65
From: Patrick.Mclean @ 2021-09-20 20:22 UTC
To: stable
Cc: regressions, ayal, saeedm, netdev, leonro, Aaron.U'ren, Russell.Brown, Victor.Payno

In 5.10 stable kernels since 5.10.65 certain mlx5 cards are no longer usable (relevant dmesg logs and lspci output are pasted below).

Bisecting the problem tracks the problem down to this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=fe6322774ca28669868a7e231e173e09f7422118

Here is how lspci -nn identifies the cards:
41:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
41:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]

Here are the relevant dmesg logs:
[ 13.409473] mlx5_core 0000:41:00.0: firmware version: 16.31.1014
[ 13.415944] mlx5_core 0000:41:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 13.707425] mlx5_core 0000:41:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[ 13.718221] mlx5_core 0000:41:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[ 13.740607] mlx5_core 0000:41:00.0: Port module event: module 0, Cable plugged
[ 13.759857] mlx5_core 0000:41:00.0: mlx5_pcie_event:294:(pid 586): PCIe slot advertised sufficient power (75W).
[ 17.986973] mlx5_core 0000:41:00.0: E-Switch: cleanup
[ 18.686204] mlx5_core 0000:41:00.0: init_one:1371:(pid 803): mlx5_load_one failed with error code -22
[ 18.701352] mlx5_core: probe of 0000:41:00.0 failed with error -22
[ 18.727364] mlx5_core 0000:41:00.1: firmware version: 16.31.1014
[ 18.743853] mlx5_core 0000:41:00.1: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 19.015349] mlx5_core 0000:41:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[ 19.025157] mlx5_core 0000:41:00.1: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[ 19.053569] mlx5_core 0000:41:00.1: Port module event: module 1, Cable unplugged
[ 19.062093] mlx5_core 0000:41:00.1: mlx5_pcie_event:294:(pid 591): PCIe slot advertised sufficient power (75W).
[ 22.826932] mlx5_core 0000:41:00.1: E-Switch: cleanup
[ 23.544747] mlx5_core 0000:41:00.1: init_one:1371:(pid 803): mlx5_load_one failed with error code -22
[ 23.555071] mlx5_core: probe of 0000:41:00.1 failed with error -22

Please let me know if I can provide any further information.

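For readers triaging a similar log: the "-22" in the probe lines above is a negated errno value, and 22 is EINVAL on Linux. The following is a minimal sketch, not a tool used in this thread, that pulls failed probes out of dmesg text and names the error. The function name find_probe_failures and the SAMPLE string are illustrative assumptions; the regular expression only assumes the "probe of <device> failed with error <code>" wording seen in the report.

```python
import errno
import re

# Matches the line printed when a device probe fails, e.g.
# "mlx5_core: probe of 0000:41:00.0 failed with error -22"
PROBE_FAIL = re.compile(
    r"(?P<driver>\S+): probe of (?P<device>\S+) failed with error (?P<code>-?\d+)"
)


def find_probe_failures(dmesg_text):
    """Return (driver, device, errno_name) for each failed probe in the log."""
    failures = []
    for line in dmesg_text.splitlines():
        match = PROBE_FAIL.search(line)
        if match is None:
            continue
        # The kernel reports errors as negative errno values.
        code = abs(int(match.group("code")))
        name = errno.errorcode.get(code, "unknown")
        failures.append((match.group("driver"), match.group("device"), name))
    return failures


# Three lines copied from the report; only the "probe of" lines should match.
SAMPLE = """\
[ 18.686204] mlx5_core 0000:41:00.0: init_one:1371:(pid 803): mlx5_load_one failed with error code -22
[ 18.701352] mlx5_core: probe of 0000:41:00.0 failed with error -22
[ 23.555071] mlx5_core: probe of 0000:41:00.1 failed with error -22
"""

for driver, device, name in find_probe_failures(SAMPLE):
    print(f"{driver} {device}: {name}")
```

In practice the input would be the full output of dmesg rather than a pasted excerpt; the excerpt keeps the sketch self-contained.
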
* Re: mlx5_core 5.10 stable series regression starting at 5.10.65
From: Greg KH @ 2021-09-21 6:31 UTC
To: Patrick.Mclean
Cc: stable, regressions, ayal, saeedm, netdev, leonro, Aaron.U'ren, Russell.Brown, Victor.Payno

On Mon, Sep 20, 2021 at 08:22:44PM +0000, Patrick.Mclean@sony.com wrote:
> In 5.10 stable kernels since 5.10.65 certain mlx5 cards are no longer usable (relevant dmesg logs and lspci output are pasted below).
>
> Bisecting the problem tracks the problem down to this commit:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=fe6322774ca28669868a7e231e173e09f7422118
>
> Here is how lspci -nn identifies the cards:
> 41:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
> 41:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
>
> Here are the relevant dmesg logs:
> [ 13.409473] mlx5_core 0000:41:00.0: firmware version: 16.31.1014
> [ 13.415944] mlx5_core 0000:41:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
> [ 13.707425] mlx5_core 0000:41:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
> [ 13.718221] mlx5_core 0000:41:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
> [ 13.740607] mlx5_core 0000:41:00.0: Port module event: module 0, Cable plugged
> [ 13.759857] mlx5_core 0000:41:00.0: mlx5_pcie_event:294:(pid 586): PCIe slot advertised sufficient power (75W).
> [ 17.986973] mlx5_core 0000:41:00.0: E-Switch: cleanup
> [ 18.686204] mlx5_core 0000:41:00.0: init_one:1371:(pid 803): mlx5_load_one failed with error code -22
> [ 18.701352] mlx5_core: probe of 0000:41:00.0 failed with error -22
> [ 18.727364] mlx5_core 0000:41:00.1: firmware version: 16.31.1014
> [ 18.743853] mlx5_core 0000:41:00.1: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
> [ 19.015349] mlx5_core 0000:41:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
> [ 19.025157] mlx5_core 0000:41:00.1: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
> [ 19.053569] mlx5_core 0000:41:00.1: Port module event: module 1, Cable unplugged
> [ 19.062093] mlx5_core 0000:41:00.1: mlx5_pcie_event:294:(pid 591): PCIe slot advertised sufficient power (75W).
> [ 22.826932] mlx5_core 0000:41:00.1: E-Switch: cleanup
> [ 23.544747] mlx5_core 0000:41:00.1: init_one:1371:(pid 803): mlx5_load_one failed with error code -22
> [ 23.555071] mlx5_core: probe of 0000:41:00.1 failed with error -22
>
> Please let me know if I can provide any further information.

If you revert that single change, do things work properly?

Do newer kernels (5.14, 5.15-rc2) work properly for you as well?

thanks,

greg k-h

* Re: mlx5_core 5.10 stable series regression starting at 5.10.65
From: Patrick.Mclean @ 2021-09-21 22:22 UTC
To: greg
Cc: stable, regressions, ayal, saeedm, netdev, leonro, Aaron.U'ren, Russell.Brown, Victor.Payno

> On Mon, Sep 20, 2021 at 08:22:44PM +0000, Patrick.Mclean@sony.com wrote:
> > In 5.10 stable kernels since 5.10.65 certain mlx5 cards are no longer usable (relevant dmesg logs and lspci output are pasted below).
> >
> > Bisecting the problem tracks the problem down to this commit:
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=fe6322774ca28669868a7e231e173e09f7422118
> >
> > Here is how lspci -nn identifies the cards:
> > 41:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
> > 41:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
> >
> > Here are the relevant dmesg logs:
> > [ 13.409473] mlx5_core 0000:41:00.0: firmware version: 16.31.1014
> > [ 13.415944] mlx5_core 0000:41:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
> > [ 13.707425] mlx5_core 0000:41:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
> > [ 13.718221] mlx5_core 0000:41:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
> > [ 13.740607] mlx5_core 0000:41:00.0: Port module event: module 0, Cable plugged
> > [ 13.759857] mlx5_core 0000:41:00.0: mlx5_pcie_event:294:(pid 586): PCIe slot advertised sufficient power (75W).
> > [ 17.986973] mlx5_core 0000:41:00.0: E-Switch: cleanup
> > [ 18.686204] mlx5_core 0000:41:00.0: init_one:1371:(pid 803): mlx5_load_one failed with error code -22
> > [ 18.701352] mlx5_core: probe of 0000:41:00.0 failed with error -22
> > [ 18.727364] mlx5_core 0000:41:00.1: firmware version: 16.31.1014
> > [ 18.743853] mlx5_core 0000:41:00.1: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
> > [ 19.015349] mlx5_core 0000:41:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
> > [ 19.025157] mlx5_core 0000:41:00.1: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
> > [ 19.053569] mlx5_core 0000:41:00.1: Port module event: module 1, Cable unplugged
> > [ 19.062093] mlx5_core 0000:41:00.1: mlx5_pcie_event:294:(pid 591): PCIe slot advertised sufficient power (75W).
> > [ 22.826932] mlx5_core 0000:41:00.1: E-Switch: cleanup
> > [ 23.544747] mlx5_core 0000:41:00.1: init_one:1371:(pid 803): mlx5_load_one failed with error code -22
> > [ 23.555071] mlx5_core: probe of 0000:41:00.1 failed with error -22
> >
> > Please let me know if I can provide any further information.
>
> If you revert that single change, do things work properly?

Yes, things work properly after reverting that single change (tested with 5.10.67).

> Do newer kernels (5.14, 5.15-rc2) work properly for you as well?

We tested 5.14.6, and it works as expected.

* Re: mlx5_core 5.10 stable series regression starting at 5.10.65
From: Leon Romanovsky @ 2021-09-22 6:21 UTC
To: Patrick.Mclean
Cc: greg, stable, regressions, ayal, saeedm, netdev, Aaron.U'ren, Russell.Brown, Victor.Payno

On Tue, Sep 21, 2021 at 10:22:57PM +0000, Patrick.Mclean@sony.com wrote:
> > On Mon, Sep 20, 2021 at 08:22:44PM +0000, Patrick.Mclean@sony.com wrote:
> > > In 5.10 stable kernels since 5.10.65 certain mlx5 cards are no longer usable (relevant dmesg logs and lspci output are pasted below).
> > >
> > > Bisecting the problem tracks the problem down to this commit:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=fe6322774ca28669868a7e231e173e09f7422118
> > >
> > > Here is how lspci -nn identifies the cards:
> > > 41:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
> > > 41:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
> > >
> > > Here are the relevant dmesg logs:
> > > [ 13.409473] mlx5_core 0000:41:00.0: firmware version: 16.31.1014
> > > [ 13.415944] mlx5_core 0000:41:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
> > > [ 13.707425] mlx5_core 0000:41:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
> > > [ 13.718221] mlx5_core 0000:41:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
> > > [ 13.740607] mlx5_core 0000:41:00.0: Port module event: module 0, Cable plugged
> > > [ 13.759857] mlx5_core 0000:41:00.0: mlx5_pcie_event:294:(pid 586): PCIe slot advertised sufficient power (75W).
> > > [ 17.986973] mlx5_core 0000:41:00.0: E-Switch: cleanup
> > > [ 18.686204] mlx5_core 0000:41:00.0: init_one:1371:(pid 803): mlx5_load_one failed with error code -22
> > > [ 18.701352] mlx5_core: probe of 0000:41:00.0 failed with error -22
> > > [ 18.727364] mlx5_core 0000:41:00.1: firmware version: 16.31.1014
> > > [ 18.743853] mlx5_core 0000:41:00.1: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
> > > [ 19.015349] mlx5_core 0000:41:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
> > > [ 19.025157] mlx5_core 0000:41:00.1: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
> > > [ 19.053569] mlx5_core 0000:41:00.1: Port module event: module 1, Cable unplugged
> > > [ 19.062093] mlx5_core 0000:41:00.1: mlx5_pcie_event:294:(pid 591): PCIe slot advertised sufficient power (75W).
> > > [ 22.826932] mlx5_core 0000:41:00.1: E-Switch: cleanup
> > > [ 23.544747] mlx5_core 0000:41:00.1: init_one:1371:(pid 803): mlx5_load_one failed with error code -22
> > > [ 23.555071] mlx5_core: probe of 0000:41:00.1 failed with error -22
> > >
> > > Please let me know if I can provide any further information.
> >
> > If you revert that single change, do things work properly?
>
> Yes, things work properly after reverting that single change (tested with 5.10.67).

The stable@ kernel is missing commit 3d347b1b19da ("net/mlx5: Add support for devlink traps
in mlx5 core driver"), which added mlx5 devlink callbacks (.trap_init and .trap_fini).

I don't know why the commit that you reverted was added to stable@ in
the first place. It doesn't fix any bug and has no Fixes tag.

Thanks

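To illustrate why a missing pair of callbacks could surface as "-22" at probe time, here is a toy model in Python. It is an assumption about the failure mode, not the kernel's code: the thread only states that the callbacks are absent from the stable tree and that the probe fails with -22. DevlinkOps and register_traps are hypothetical names; only trap_init and trap_fini come from the message above.

```python
import errno
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DevlinkOps:
    """Hypothetical stand-in for a driver's callback table."""
    trap_init: Optional[Callable[[], int]] = None
    trap_fini: Optional[Callable[[], None]] = None


def register_traps(ops: DevlinkOps) -> int:
    """Toy registration: return -EINVAL when a required callback is absent."""
    if ops.trap_init is None or ops.trap_fini is None:
        return -errno.EINVAL
    return ops.trap_init()


# Callback table as it might look without the callbacks from 3d347b1b19da.
without_callbacks = DevlinkOps()
# Callback table that provides both callbacks.
with_callbacks = DevlinkOps(trap_init=lambda: 0, trap_fini=lambda: None)

print(register_traps(without_callbacks))
print(register_traps(with_callbacks))
```

Under this model, a backported change that starts calling the registration path on a tree whose driver never gained the callbacks would fail on every load, which matches the symptom reported for both PCI functions.
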
* Re: mlx5_core 5.10 stable series regression starting at 5.10.65
From: Greg KH @ 2021-09-23 11:04 UTC
To: Leon Romanovsky
Cc: Patrick.Mclean, stable, regressions, ayal, saeedm, netdev, Aaron.U'ren, Russell.Brown, Victor.Payno

On Wed, Sep 22, 2021 at 09:21:48AM +0300, Leon Romanovsky wrote:
> On Tue, Sep 21, 2021 at 10:22:57PM +0000, Patrick.Mclean@sony.com wrote:
> > > On Mon, Sep 20, 2021 at 08:22:44PM +0000, Patrick.Mclean@sony.com wrote:
> > > > In 5.10 stable kernels since 5.10.65 certain mlx5 cards are no longer usable (relevant dmesg logs and lspci output are pasted below).
> > > >
> > > > Bisecting the problem tracks the problem down to this commit:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=fe6322774ca28669868a7e231e173e09f7422118
> > > >
> > > > Here is how lspci -nn identifies the cards:
> > > > 41:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
> > > > 41:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
> > > >
> > > > Here are the relevant dmesg logs:
> > > > [ 13.409473] mlx5_core 0000:41:00.0: firmware version: 16.31.1014
> > > > [ 13.415944] mlx5_core 0000:41:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
> > > > [ 13.707425] mlx5_core 0000:41:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
> > > > [ 13.718221] mlx5_core 0000:41:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
> > > > [ 13.740607] mlx5_core 0000:41:00.0: Port module event: module 0, Cable plugged
> > > > [ 13.759857] mlx5_core 0000:41:00.0: mlx5_pcie_event:294:(pid 586): PCIe slot advertised sufficient power (75W).
> > > > [ 17.986973] mlx5_core 0000:41:00.0: E-Switch: cleanup
> > > > [ 18.686204] mlx5_core 0000:41:00.0: init_one:1371:(pid 803): mlx5_load_one failed with error code -22
> > > > [ 18.701352] mlx5_core: probe of 0000:41:00.0 failed with error -22
> > > > [ 18.727364] mlx5_core 0000:41:00.1: firmware version: 16.31.1014
> > > > [ 18.743853] mlx5_core 0000:41:00.1: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
> > > > [ 19.015349] mlx5_core 0000:41:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
> > > > [ 19.025157] mlx5_core 0000:41:00.1: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
> > > > [ 19.053569] mlx5_core 0000:41:00.1: Port module event: module 1, Cable unplugged
> > > > [ 19.062093] mlx5_core 0000:41:00.1: mlx5_pcie_event:294:(pid 591): PCIe slot advertised sufficient power (75W).
> > > > [ 22.826932] mlx5_core 0000:41:00.1: E-Switch: cleanup
> > > > [ 23.544747] mlx5_core 0000:41:00.1: init_one:1371:(pid 803): mlx5_load_one failed with error code -22
> > > > [ 23.555071] mlx5_core: probe of 0000:41:00.1 failed with error -22
> > > >
> > > > Please let me know if I can provide any further information.
> > >
> > > If you revert that single change, do things work properly?
> >
> > Yes, things work properly after reverting that single change (tested with 5.10.67).
>
> The stable@ kernel is missing commit 3d347b1b19da ("net/mlx5: Add support for devlink traps
> in mlx5 core driver"), which added mlx5 devlink callbacks (.trap_init and .trap_fini).

Ok, will go revert this now, thanks for confirming it and letting me know.

> I don't know why the commit that you reverted was added to stable@ in
> the first place. It doesn't fix any bug and has no Fixes tag.

Looks like it was brought in as a dependency for another fix that required it, as the revert was not clean and I had to do it "by hand".

thanks,

greg k-h