regressions.lists.linux.dev archive mirror
* Potential regression/bug in net/mlx5 driver
@ 2023-03-28 23:08 Paul Moore
  2023-03-29 22:20 ` Saeed Mahameed
  2023-03-31 13:10 ` Linux regression tracking #adding (Thorsten Leemhuis)
  0 siblings, 2 replies; 24+ messages in thread
From: Paul Moore @ 2023-03-28 23:08 UTC
  To: Shay Drory, Saeed Mahameed; +Cc: netdev, regressions, selinux

Hello all,

Starting with the v6.3-rcX kernel releases I noticed that my
InfiniBand devices were no longer present under /sys/class/infiniband,
causing some of my automated testing to fail.  It took me a while to
find the time to bisect the issue, but I eventually identified the
problematic commit:

  commit fe998a3c77b9f989a30a2a01fb00d3729a6d53a4
  Author: Shay Drory <shayd@nvidia.com>
  Date:   Wed Jun 29 11:38:21 2022 +0300

   net/mlx5: Enable management PF initialization

   Enable initialization of DPU Management PF, which is a new loopback PF
   designed for communication with BMC.
   For now Management PF doesn't support nor require most upper layer
   protocols so avoid them.

   Signed-off-by: Shay Drory <shayd@nvidia.com>
   Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
   Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
   Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

I'm not a mlx5 driver expert so I can't really offer much in the way
of a fix, but as a quick test I did remove the
'mlx5_core_is_management_pf(...)' calls in mlx5/core/dev.c and
everything seemed to work okay on my test system (or rather the tests
ran without problem).

If you need any additional information, or would like me to test a
patch, please let me know.
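
The check described above (whether any device shows up under /sys/class/infiniband) can be sketched as a small shell helper. This is only an illustrative sketch, not the script used for the report: the function name `ib_devices_present` is hypothetical, and the optional directory argument exists purely so the helper can be tried against a scratch directory.

```shell
# Hypothetical helper (not from the original report): succeed when the
# given directory exists and holds at least one entry.
# Defaults to /sys/class/infiniband, the path named in the report.
ib_devices_present() {
    dir=${1:-/sys/class/infiniband}
    # A missing directory counts as "no devices".
    [ -d "$dir" ] || return 1
    # ls -A omits "." and ".."; non-empty output means something is registered.
    [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}
```

On a booted test kernel, `ib_devices_present && echo good || echo bad` would give the verdict to hand to `git bisect good` or `git bisect bad` for that step.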

-- 
paul-moore.com


* Re: Potential regression/bug in net/mlx5 driver
  2023-03-28 23:08 Potential regression/bug in net/mlx5 driver Paul Moore
@ 2023-03-29 22:20 ` Saeed Mahameed
  2023-03-30  1:27   ` Paul Moore
  2023-03-31 13:10 ` Linux regression tracking #adding (Thorsten Leemhuis)
  1 sibling, 1 reply; 24+ messages in thread
From: Saeed Mahameed @ 2023-03-29 22:20 UTC
  To: Paul Moore; +Cc: Shay Drory, Saeed Mahameed, netdev, regressions, selinux

On 28 Mar 19:08, Paul Moore wrote:
>Hello all,
>
>Starting with the v6.3-rcX kernel releases I noticed that my
>InfiniBand devices were no longer present under /sys/class/infiniband,
>causing some of my automated testing to fail.  It took me a while to
>find the time to bisect the issue, but I eventually identified the
>problematic commit:
>
>  commit fe998a3c77b9f989a30a2a01fb00d3729a6d53a4
>  Author: Shay Drory <shayd@nvidia.com>
>  Date:   Wed Jun 29 11:38:21 2022 +0300
>
>   net/mlx5: Enable management PF initialization
>
>   Enable initialization of DPU Management PF, which is a new loopback PF
>   designed for communication with BMC.
>   For now Management PF doesn't support nor require most upper layer
>   protocols so avoid them.
>
>   Signed-off-by: Shay Drory <shayd@nvidia.com>
>   Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
>   Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
>   Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>
>I'm not a mlx5 driver expert so I can't really offer much in the way
>of a fix, but as a quick test I did remove the
>'mlx5_core_is_management_pf(...)' calls in mlx5/core/dev.c and
>everything seemed to work okay on my test system (or rather the tests
>ran without problem).
>
>If you need any additional information, or would like me to test a
>patch, please let me know.
>

Hi Paul, 

Our team is looking into this; the current theory is that you have an old
FW that doesn't have the correct capabilities set.

Can you please provide the FW version and the ConnectX device you are
testing?

$ devlink dev info
$ lspci -s <pci_dev> -vv
since boot:
$ dmesg 

>-- 
>paul-moore.com


* Re: Potential regression/bug in net/mlx5 driver
  2023-03-29 22:20 ` Saeed Mahameed
@ 2023-03-30  1:27   ` Paul Moore
  2023-04-09  8:48     ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 24+ messages in thread
From: Paul Moore @ 2023-03-30  1:27 UTC
  To: Saeed Mahameed; +Cc: Shay Drory, Saeed Mahameed, netdev, regressions, selinux

On Wed, Mar 29, 2023 at 6:20 PM Saeed Mahameed <saeed@kernel.org> wrote:
> On 28 Mar 19:08, Paul Moore wrote:
> >Hello all,
> >
> >Starting with the v6.3-rcX kernel releases I noticed that my
> >InfiniBand devices were no longer present under /sys/class/infiniband,
> >causing some of my automated testing to fail.  It took me a while to
> >find the time to bisect the issue, but I eventually identified the
> >problematic commit:
> >
> >  commit fe998a3c77b9f989a30a2a01fb00d3729a6d53a4
> >  Author: Shay Drory <shayd@nvidia.com>
> >  Date:   Wed Jun 29 11:38:21 2022 +0300
> >
> >   net/mlx5: Enable management PF initialization
> >
> >   Enable initialization of DPU Management PF, which is a new loopback PF
> >   designed for communication with BMC.
> >   For now Management PF doesn't support nor require most upper layer
> >   protocols so avoid them.
> >
> >   Signed-off-by: Shay Drory <shayd@nvidia.com>
> >   Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
> >   Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
> >   Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> >
> >I'm not a mlx5 driver expert so I can't really offer much in the way
> >of a fix, but as a quick test I did remove the
> >'mlx5_core_is_management_pf(...)' calls in mlx5/core/dev.c and
> >everything seemed to work okay on my test system (or rather the tests
> >ran without problem).
> >
> >If you need any additional information, or would like me to test a
> >patch, please let me know.
>
> Hi Paul,
>
> Our team is looking into this, the current theory is that you have an old
> FW that doesn't have the correct capabilities set.

That's very possible; I installed this card many years ago and haven't
updated the FW once.  I'm happy to update the FW (do you have a
pointer/how-to?), but it might be good to identify a fix first as I'm
guessing there will be others like me ...

> Can you please provide the FW version and the ConnectX device you are
> testing ?
>
> $ devlink dev info

% devlink dev info; echo $?
0

No output and no error code.  However, I do see the following in dmesg:

[  255.251124] mlx5_core 0000:00:08.0: mlx5_fw_version_query:823:(pid 959): fw query isn't supported by the FW

... which appears to support your theory about ancient hardware.
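
Because `devlink dev info` exited with status 0 yet printed nothing, a script that checks only the exit code would not notice anything unusual. A minimal sketch of telling the three cases apart follows; the helper name `classify_run` and its output words are assumptions made for illustration, not part of any existing tool.

```shell
# Hypothetical helper: run a command and print one of
#   failed  - the command exited non-zero
#   empty   - the command succeeded but wrote nothing to stdout
#   ok      - the command succeeded and produced output
classify_run() {
    # The assignment's exit status is that of the command substitution.
    if out=$("$@" 2>/dev/null); then
        if [ -z "$out" ]; then
            echo empty
        else
            echo ok
        fi
    else
        echo failed
    fi
}
```

For the case in this message, `classify_run devlink dev info` would be expected to print `empty`.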

> $ lspci -s <pci_dev> -vv

While there is only one physical card, there are two PCI devices (it's
a dual port card).  I'm only copying the first device since I'm
guessing that's really all you need:

% lspci -s 00:07.0 -vv
00:07.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
       Subsystem: Mellanox Technologies Device 0010
       Physical Slot: 7
       Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
                Stepping- SERR+ FastB2B- DisINTx+
       Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
               <TAbort- <MAbort- >SERR- <PERR- INTx-
       Latency: 0, Cache Line Size: 64 bytes
       Interrupt: pin A routed to IRQ 11
       Region 0: Memory at fa000000 (64-bit, prefetchable) [size=32M]
       Expansion ROM at fe900000 [disabled] [size=1M]
       Capabilities: [60] Express (v2) Endpoint, MSI 00
               DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
                       unlimited, L1 unlimited
                       ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                       SlotPowerLimit 25W
               DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                       RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                       MaxPayload 256 bytes, MaxReadReq 512 bytes
               DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr-
                       TransPend-
               LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
                       ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
               LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                       ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
               LnkSta: Speed 8GT/s, Width x8
                       TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
               DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP-
                        LTR- 10BitTagComp+ 10BitTagReq- OBFF Not Supported,
                        ExtFmt- EETLPPrefix- EmergencyPowerReduction
                        Not Supported, EmergencyPowerReductionInit-
                        FRS- TPHComp- ExtTPHComp-
               AtomicOpsCap: 32bit- 64bit- 128bitCAS-
               DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR-
                        10BitTagReq- OBFF Disabled,
               AtomicOpsCtl: ReqEn-
               LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+
                        EqualizationPhase1+ EqualizationPhase2+
                        EqualizationPhase3+ LinkEqualizationRequest-
                        Retimer- 2Retimers- CrosslinkRes: unsupported
       Capabilities: [48] Vital Product Data
               Product Name: CX454A - ConnectX-4 QSFP28
               Read-only fields:
                       [PN] Part number: MCX454A-FCAT
                       [EC] Engineering changes: AB
                       [SN] Serial number: MT1730X05081
                       [V0] Vendor specific: PCIeGen3 x8
                       [RV] Reserved: checksum good, 0 byte(s) reserved
               End
       Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
               Vector table: BAR=0 offset=00002000
               PBA: BAR=0 offset=00003000
       Capabilities: [c0] Vendor Specific Information: Len=18 <?>
       Capabilities: [40] Power Management version 3
               Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
                      PME(D0-,D1-,D2-,D3hot-,D3cold+)
               Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
       Kernel driver in use: mlx5_core
       Kernel modules: mlx5_core

> since boot:
> $ dmesg

% devlink dev info
% dmesg | grep mlx5
[    4.739691] mlx5_core 0000:00:07.0: firmware version: 12.18.1000
[    4.740134] mlx5_core 0000:00:07.0: 63.008 Gb/s available PCIe bandwidth (8.0GT/s PCIe x8 link)
[    7.048567] mlx5_core 0000:00:07.0: Port module event: module 0, Cable plugged
[    7.211879] mlx5_core 0000:00:08.0: firmware version: 12.18.1000
[    7.212309] mlx5_core 0000:00:08.0: 63.008 Gb/s available PCIe bandwidth (8.0GT/s PCIe x8 link)
[    7.897218] mlx5_core 0000:00:08.0: Port module event: module 1, Cable plugged
[   10.875388] mlx5_core 0000:00:07.0 ibs7: renamed from ib0
[   10.995115] mlx5_core 0000:00:08.0 ibs8: renamed from ib0
[  181.471663] mlx5_core 0000:00:07.0: mlx5_fw_version_query:823:(pid 918): fw query isn't supported by the FW
[  181.472286] mlx5_core 0000:00:08.0: mlx5_fw_version_query:823:(pid 918): fw query isn't supported by the FW
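
Since `devlink dev info` returns nothing on this firmware, the "firmware version:" lines in the kernel log are the remaining source for the FW version. A small sketch of pulling them out is shown here; the function name `mlx5_fw_versions` is hypothetical, and the pattern assumes the exact message format seen in the log above.

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "<pci-address> <firmware-version>" for every mlx5_core
# "firmware version:" message; all other lines are dropped.
mlx5_fw_versions() {
    sed -n 's/.*mlx5_core \([0-9a-f:.]*\): firmware version: \(.*\)$/\1 \2/p'
}
```

A typical invocation would be `dmesg | mlx5_fw_versions`, which prints one line per PCI function.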

-- 
paul-moore.com


* Re: Potential regression/bug in net/mlx5 driver
  2023-03-28 23:08 Potential regression/bug in net/mlx5 driver Paul Moore
  2023-03-29 22:20 ` Saeed Mahameed
@ 2023-03-31 13:10 ` Linux regression tracking #adding (Thorsten Leemhuis)
  1 sibling, 0 replies; 24+ messages in thread
From: Linux regression tracking #adding (Thorsten Leemhuis) @ 2023-03-31 13:10 UTC
  To: Paul Moore, Shay Drory, Saeed Mahameed; +Cc: netdev, regressions, selinux

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 29.03.23 01:08, Paul Moore wrote:
> 
> Starting with the v6.3-rcX kernel releases I noticed that my
> InfiniBand devices were no longer present under /sys/class/infiniband,
> causing some of my automated testing to fail.  It took me a while to
> find the time to bisect the issue, but I eventually identified the
> problematic commit:
> 
>   commit fe998a3c77b9f989a30a2a01fb00d3729a6d53a4
>   Author: Shay Drory <shayd@nvidia.com>
>   Date:   Wed Jun 29 11:38:21 2022 +0300
> 
>    net/mlx5: Enable management PF initialization
> 
>    Enable initialization of DPU Management PF, which is a new loopback PF
>    designed for communication with BMC.
>    For now Management PF doesn't support nor require most upper layer
>    protocols so avoid them.
> 
>    Signed-off-by: Shay Drory <shayd@nvidia.com>
>    Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
>    Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
>    Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> 
> I'm not a mlx5 driver expert so I can't really offer much in the way
> of a fix, but as a quick test I did remove the
> 'mlx5_core_is_management_pf(...)' calls in mlx5/core/dev.c and
> everything seemed to work okay on my test system (or rather the tests
> ran without problem).
> 
> If you need any additional information, or would like me to test a
> patch, please let me know.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced fe998a3c77b9f989a30a2a01fb00d3729a6d53a4
#regzbot title net: mlx5: InfiniBand devices were no longer present
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.



* Re: Potential regression/bug in net/mlx5 driver
  2023-03-30  1:27   ` Paul Moore
@ 2023-04-09  8:48     ` Linux regression tracking (Thorsten Leemhuis)
  2023-04-09 23:50       ` Paul Moore
  0 siblings, 1 reply; 24+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-04-09  8:48 UTC
  To: Paul Moore, Saeed Mahameed
  Cc: Shay Drory, Saeed Mahameed, netdev, regressions, selinux

On 30.03.23 03:27, Paul Moore wrote:
> On Wed, Mar 29, 2023 at 6:20 PM Saeed Mahameed <saeed@kernel.org> wrote:
>> On 28 Mar 19:08, Paul Moore wrote:
>>>
>>> Starting with the v6.3-rcX kernel releases I noticed that my
>>> InfiniBand devices were no longer present under /sys/class/infiniband,
>>> causing some of my automated testing to fail.  It took me a while to
>>> find the time to bisect the issue, but I eventually identified the
>>> problematic commit:
>>>
>>>  commit fe998a3c77b9f989a30a2a01fb00d3729a6d53a4
>>>  Author: Shay Drory <shayd@nvidia.com>
>>>  Date:   Wed Jun 29 11:38:21 2022 +0300
>>>
>>>   net/mlx5: Enable management PF initialization
>>>
>>>   Enable initialization of DPU Management PF, which is a new loopback PF
>>>   designed for communication with BMC.
>>>   For now Management PF doesn't support nor require most upper layer
>>>   protocols so avoid them.
>>>
>>>   Signed-off-by: Shay Drory <shayd@nvidia.com>
>>>   Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
>>>   Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
>>>   Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>>>
>>> I'm not a mlx5 driver expert so I can't really offer much in the way
>>> of a fix, but as a quick test I did remove the
>>> 'mlx5_core_is_management_pf(...)' calls in mlx5/core/dev.c and
>>> everything seemed to work okay on my test system (or rather the tests
>>> ran without problem).
>>>
>>> If you need any additional information, or would like me to test a
>>> patch, please let me know.
>>
>> Our team is looking into this, the current theory is that you have an old
>> FW that doesn't have the correct capabilities set.
> 
> That's very possible; I installed this card many years ago and haven't
> updated the FW once.
>
>  I'm happy to update the FW (do you have a
> pointer/how-to?), but it might be good to identify a fix first as I'm
> guessing there will be others like me ...

Nothing happened here for about ten days afaics (or was there progress
and I just missed it?). That made me wonder: how sound is Paul's guess
that there will be others who might run into this? If that's likely, it
would afaics be good to get this regression fixed before the release,
which is just two or three weeks away.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

>> Can you please provide the FW version and the ConnectX device you are
>> testing ?
>>
>> $ devlink dev info
> 
> % devlink dev info; echo $?
> 0
> 
> No output and no error code.  However, I do see the following in dmesg:
> 
> [  255.251124] mlx5_core 0000:00:08.0: mlx5_fw_version_query:823:(pid 959): fw query isn't supported by the FW
> 
> ... which appears to support your theory about ancient hardware.
> 
>> $ lspci -s <pci_dev> -vv
> 
> While there is only one physical card, there are two PCI devices (it's
> a dual port card).  I'm only copying the first device since I'm
> guessing that's really all you need:
> 
> % lspci -s 00:07.0 -vv
> 00:07.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
>        Subsystem: Mellanox Technologies Device 0010
>        Physical Slot: 7
>        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
>                 Stepping- SERR+ FastB2B- DisINTx+
>        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>                <TAbort- <MAbort- >SERR- <PERR- INTx-
>        Latency: 0, Cache Line Size: 64 bytes
>        Interrupt: pin A routed to IRQ 11
>        Region 0: Memory at fa000000 (64-bit, prefetchable) [size=32M]
>        Expansion ROM at fe900000 [disabled] [size=1M]
>        Capabilities: [60] Express (v2) Endpoint, MSI 00
>                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
>                        unlimited, L1 unlimited
>                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
>                        SlotPowerLimit 25W
>                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
>                        RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
>                        MaxPayload 256 bytes, MaxReadReq 512 bytes
>                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr-
>                        TransPend-
>                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
>                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
>                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
>                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                LnkSta: Speed 8GT/s, Width x8
>                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP-
>                         LTR- 10BitTagComp+ 10BitTagReq- OBFF Not Supported,
>                         ExtFmt- EETLPPrefix- EmergencyPowerReduction
>                         Not Supported, EmergencyPowerReductionInit-
>                         FRS- TPHComp- ExtTPHComp-
>                AtomicOpsCap: 32bit- 64bit- 128bitCAS-
>                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR-
>                         10BitTagReq- OBFF Disabled,
>                AtomicOpsCtl: ReqEn-
>                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+
>                         EqualizationPhase1+ EqualizationPhase2+
>                         EqualizationPhase3+ LinkEqualizationRequest-
>                         Retimer- 2Retimers- CrosslinkRes: unsupported
>        Capabilities: [48] Vital Product Data
>                Product Name: CX454A - ConnectX-4 QSFP28
>                Read-only fields:
>                        [PN] Part number: MCX454A-FCAT
>                        [EC] Engineering changes: AB
>                        [SN] Serial number: MT1730X05081
>                        [V0] Vendor specific: PCIeGen3 x8
>                        [RV] Reserved: checksum good, 0 byte(s) reserved
>                End
>        Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
>                Vector table: BAR=0 offset=00002000
>                PBA: BAR=0 offset=00003000
>        Capabilities: [c0] Vendor Specific Information: Len=18 <?>
>        Capabilities: [40] Power Management version 3
>                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
>                       PME(D0-,D1-,D2-,D3hot-,D3cold+)
>                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>        Kernel driver in use: mlx5_core
>        Kernel modules: mlx5_core
> 
>> since boot:
>> $ dmesg
> 
> % devlink dev info
> % dmesg | grep mlx5
> [    4.739691] mlx5_core 0000:00:07.0: firmware version: 12.18.1000
> [    4.740134] mlx5_core 0000:00:07.0: 63.008 Gb/s available PCIe bandwidth (8.0GT/s PCIe x8 link)
> [    7.048567] mlx5_core 0000:00:07.0: Port module event: module 0, Cable plugged
> [    7.211879] mlx5_core 0000:00:08.0: firmware version: 12.18.1000
> [    7.212309] mlx5_core 0000:00:08.0: 63.008 Gb/s available PCIe bandwidth (8.0GT/s PCIe x8 link)
> [    7.897218] mlx5_core 0000:00:08.0: Port module event: module 1, Cable plugged
> [   10.875388] mlx5_core 0000:00:07.0 ibs7: renamed from ib0
> [   10.995115] mlx5_core 0000:00:08.0 ibs8: renamed from ib0
> [  181.471663] mlx5_core 0000:00:07.0: mlx5_fw_version_query:823:(pid 918): fw query isn't supported by the FW
> [  181.472286] mlx5_core 0000:00:08.0: mlx5_fw_version_query:823:(pid 918): fw query isn't supported by the FW
> 


* Re: Potential regression/bug in net/mlx5 driver
  2023-04-09  8:48     ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-04-09 23:50       ` Paul Moore
  2023-04-10  5:46         ` Leon Romanovsky
  0 siblings, 1 reply; 24+ messages in thread
From: Paul Moore @ 2023-04-09 23:50 UTC
  To: Linux regressions mailing list
  Cc: Saeed Mahameed, Shay Drory, Saeed Mahameed, netdev, selinux

On Sun, Apr 9, 2023 at 4:48 AM Linux regression tracking (Thorsten
Leemhuis) <regressions@leemhuis.info> wrote:
> On 30.03.23 03:27, Paul Moore wrote:
> > On Wed, Mar 29, 2023 at 6:20 PM Saeed Mahameed <saeed@kernel.org> wrote:
> >> On 28 Mar 19:08, Paul Moore wrote:
> >>>
> >>> Starting with the v6.3-rcX kernel releases I noticed that my
> >>> InfiniBand devices were no longer present under /sys/class/infiniband,
> >>> causing some of my automated testing to fail.  It took me a while to
> >>> find the time to bisect the issue, but I eventually identified the
> >>> problematic commit:
> >>>
> >>>  commit fe998a3c77b9f989a30a2a01fb00d3729a6d53a4
> >>>  Author: Shay Drory <shayd@nvidia.com>
> >>>  Date:   Wed Jun 29 11:38:21 2022 +0300
> >>>
> >>>   net/mlx5: Enable management PF initialization
> >>>
> >>>   Enable initialization of DPU Management PF, which is a new loopback PF
> >>>   designed for communication with BMC.
> >>>   For now Management PF doesn't support nor require most upper layer
> >>>   protocols so avoid them.
> >>>
> >>>   Signed-off-by: Shay Drory <shayd@nvidia.com>
> >>>   Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
> >>>   Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
> >>>   Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> >>>
> >>> I'm not a mlx5 driver expert so I can't really offer much in the way
> >>> of a fix, but as a quick test I did remove the
> >>> 'mlx5_core_is_management_pf(...)' calls in mlx5/core/dev.c and
> >>> everything seemed to work okay on my test system (or rather the tests
> >>> ran without problem).
> >>>
> >>> If you need any additional information, or would like me to test a
> >>> patch, please let me know.
> >>
> >> Our team is looking into this, the current theory is that you have an old
> >> FW that doesn't have the correct capabilities set.
> >
> > That's very possible; I installed this card many years ago and haven't
> > updated the FW once.
> >
> >  I'm happy to update the FW (do you have a
> > pointer/how-to?), but it might be good to identify a fix first as I'm
> > guessing there will be others like me ...
>
> Nothing happened here for about ten days afaics (or was there progress
> and I just missed it?). That made me wonder: how sound is Paul's guess
> that there will be others that might run into this? If that's likely it
> afaics would be good to get this regression fixed before the release,
> which is just two or three weeks away.
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot poke

I haven't seen any updates from the mlx5 driver folks, although I may
not have been CC'd?

I did revert that commit on my automated testing kernels and things
are working correctly again, although I'm pretty sure that's not a
good long term solution.  I did also dig up the information on
updating the card's firmware, but I'm holding off on that in case the
driver devs want me to test a fix.

-- 
paul-moore.com


* Re: Potential regression/bug in net/mlx5 driver
  2023-04-09 23:50       ` Paul Moore
@ 2023-04-10  5:46         ` Leon Romanovsky
  2023-04-13 13:49           ` Linux regression tracking (Thorsten Leemhuis)
  2023-04-13 14:54           ` Jakub Kicinski
  0 siblings, 2 replies; 24+ messages in thread
From: Leon Romanovsky @ 2023-04-10  5:46 UTC
  To: Paul Moore
  Cc: Linux regressions mailing list, Saeed Mahameed, Shay Drory,
	Saeed Mahameed, netdev, selinux

On Sun, Apr 09, 2023 at 07:50:34PM -0400, Paul Moore wrote:
> On Sun, Apr 9, 2023 at 4:48 AM Linux regression tracking (Thorsten
> Leemhuis) <regressions@leemhuis.info> wrote:
> > On 30.03.23 03:27, Paul Moore wrote:
> > > On Wed, Mar 29, 2023 at 6:20 PM Saeed Mahameed <saeed@kernel.org> wrote:
> > >> On 28 Mar 19:08, Paul Moore wrote:
> > >>>
> > >>> Starting with the v6.3-rcX kernel releases I noticed that my
> > >>> InfiniBand devices were no longer present under /sys/class/infiniband,
> > >>> causing some of my automated testing to fail.  It took me a while to
> > >>> find the time to bisect the issue, but I eventually identified the
> > >>> problematic commit:
> > >>>
> > >>>  commit fe998a3c77b9f989a30a2a01fb00d3729a6d53a4
> > >>>  Author: Shay Drory <shayd@nvidia.com>
> > >>>  Date:   Wed Jun 29 11:38:21 2022 +0300
> > >>>
> > >>>   net/mlx5: Enable management PF initialization
> > >>>
> > >>>   Enable initialization of DPU Management PF, which is a new loopback PF
> > >>>   designed for communication with BMC.
> > >>>   For now Management PF doesn't support nor require most upper layer
> > >>>   protocols so avoid them.
> > >>>
> > >>>   Signed-off-by: Shay Drory <shayd@nvidia.com>
> > >>>   Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
> > >>>   Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
> > >>>   Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> > >>>
> > >>> I'm not a mlx5 driver expert so I can't really offer much in the way
> > >>> of a fix, but as a quick test I did remove the
> > >>> 'mlx5_core_is_management_pf(...)' calls in mlx5/core/dev.c and
> > >>> everything seemed to work okay on my test system (or rather the tests
> > >>> ran without problem).
> > >>>
> > >>> If you need any additional information, or would like me to test a
> > >>> patch, please let me know.
> > >>
> > >> Our team is looking into this, the current theory is that you have an old
> > >> FW that doesn't have the correct capabilities set.
> > >
> > > That's very possible; I installed this card many years ago and haven't
> > > updated the FW once.
> > >
> > >  I'm happy to update the FW (do you have a
> > > pointer/how-to?), but it might be good to identify a fix first as I'm
> > > guessing there will be others like me ...
> >
> > Nothing happened here for about ten days afaics (or was there progress
> > and I just missed it?). That made me wonder: how sound is Paul's guess
> > that there will be others that might run into this? If that's likely it
> > afaics would be good to get this regression fixed before the release,
> > which is just two or three weeks away.
> >
> > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> > --
> > Everything you wanna know about Linux kernel regression tracking:
> > https://linux-regtracking.leemhuis.info/about/#tldr
> > If I did something stupid, please tell me, as explained on that page.
> >
> > #regzbot poke
> 
> I haven't seen any updates from the mlx5 driver folks, although I may
> not have been CC'd?

We are extremely slow these days due to a combination of holidays
(Easter, Passover, Ramadan, spring break, etc.).

Thanks


* Re: Potential regression/bug in net/mlx5 driver
  2023-04-10  5:46         ` Leon Romanovsky
@ 2023-04-13 13:49           ` Linux regression tracking (Thorsten Leemhuis)
  2023-04-13 14:54           ` Jakub Kicinski
  1 sibling, 0 replies; 24+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-04-13 13:49 UTC
  To: Leon Romanovsky, Paul Moore
  Cc: Linux regressions mailing list, Saeed Mahameed, Shay Drory,
	Saeed Mahameed, netdev, selinux



On 10.04.23 07:46, Leon Romanovsky wrote:
> On Sun, Apr 09, 2023 at 07:50:34PM -0400, Paul Moore wrote:
>> On Sun, Apr 9, 2023 at 4:48 AM Linux regression tracking (Thorsten
>> Leemhuis) <regressions@leemhuis.info> wrote:
>>> On 30.03.23 03:27, Paul Moore wrote:
>>>> On Wed, Mar 29, 2023 at 6:20 PM Saeed Mahameed <saeed@kernel.org> wrote:
>>>>> On 28 Mar 19:08, Paul Moore wrote:
>>>>>>
>>>>>> Starting with the v6.3-rcX kernel releases I noticed that my
>>>>>> InfiniBand devices were no longer present under /sys/class/infiniband,
>>>>>> causing some of my automated testing to fail.  It took me a while to
>>>>>> find the time to bisect the issue, but I eventually identified the
>>>>>> problematic commit:
>>>>>>
>>>>>>  commit fe998a3c77b9f989a30a2a01fb00d3729a6d53a4
>>>>>>  Author: Shay Drory <shayd@nvidia.com>
>>>>>>  Date:   Wed Jun 29 11:38:21 2022 +0300
>>>>>>
>>>>>>   net/mlx5: Enable management PF initialization
>>>>>>
>>>>>>   Enable initialization of DPU Management PF, which is a new loopback PF
>>>>>>   designed for communication with BMC.
>>>>>>   For now Management PF doesn't support nor require most upper layer
>>>>>>   protocols so avoid them.
>>>>>>
>>>>>>   Signed-off-by: Shay Drory <shayd@nvidia.com>
>>>>>>   Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
>>>>>>   Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
>>>>>>   Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>>>>>>
>>>>>> I'm not a mlx5 driver expert so I can't really offer much in the way
>>>>>> of a fix, but as a quick test I did remove the
>>>>>> 'mlx5_core_is_management_pf(...)' calls in mlx5/core/dev.c and
>>>>>> everything seemed to work okay on my test system (or rather the tests
>>>>>> ran without problem).
>>>>>>
>>>>>> If you need any additional information, or would like me to test a
>>>>>> patch, please let me know.
>>>>>
>>>>> Our team is looking into this, the current theory is that you have an old
>>>>> FW that doesn't have the correct capabilities set.
>>>>
>>>> That's very possible; I installed this card many years ago and haven't
>>>> updated the FW once.
>>>>
>>>>  I'm happy to update the FW (do you have a
>>>> pointer/how-to?), but it might be good to identify a fix first as I'm
>>>> guessing there will be others like me ...
>>>
>>> Nothing happened here for about ten days afaics (or was there progress
>>> and I just missed it?). That made me wonder: how sound is Paul's guess
>>> that there will be others that might run into this? If that's likely it
>>> afaics would be good to get this regression fixed before the release,
>>> which is just two or three weeks away.
>>>
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>> --
>>> Everything you wanna know about Linux kernel regression tracking:
>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>> If I did something stupid, please tell me, as explained on that page.
>>>
>>> #regzbot poke
>>
>> I haven't seen any updates from the mlx5 driver folks, although I may
>> not have been CC'd?
> 
> We are extremely slow these days due to combination of holidays
> (Easter, Passover, Ramadan, spring break e.t.c).

That's how it is sometimes, no worries. But well, rc7 is only three
days away and 6.3 thus might be out in 10 days already. Hence allow me
to ask: is it possible to fix this by reverting the culprit now (and
reapplying it later in fixed form)? If that's an option I'd say "go for
it", to ensure that the revert makes it into rc7 and thus is tested at least
one week before the final (or two, if Linus decides to do an rc8).

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-10  5:46         ` Leon Romanovsky
  2023-04-13 13:49           ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-04-13 14:54           ` Jakub Kicinski
  2023-04-13 15:19             ` Paul Moore
  1 sibling, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2023-04-13 14:54 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Paul Moore, Linux regressions mailing list, Saeed Mahameed,
	Shay Drory, Saeed Mahameed, netdev, selinux, Tariq Toukan

On Mon, 10 Apr 2023 08:46:05 +0300 Leon Romanovsky wrote:
> > I haven't seen any updates from the mlx5 driver folks, although I may
> > not have been CC'd?  
> 
> We are extremely slow these days due to combination of holidays
> (Easter, Passover, Ramadan, spring break e.t.c).

Let's get this fixed ASAP, please. I understand that there are
holidays, but it's been over 2 weeks, and addressing regressions
should be highest priority for any maintainer! :(

From what I gather all we need here is to throw in an extra condition
for "FW is hella old" into mlx5_core_is_management_pf(), no?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-13 14:54           ` Jakub Kicinski
@ 2023-04-13 15:19             ` Paul Moore
  2023-04-13 21:12               ` Saeed Mahameed
  0 siblings, 1 reply; 24+ messages in thread
From: Paul Moore @ 2023-04-13 15:19 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Leon Romanovsky, Linux regressions mailing list, Saeed Mahameed,
	Shay Drory, Saeed Mahameed, netdev, selinux, Tariq Toukan

On Thu, Apr 13, 2023 at 10:54 AM Jakub Kicinski <kuba@kernel.org> wrote:
> On Mon, 10 Apr 2023 08:46:05 +0300 Leon Romanovsky wrote:
> > > I haven't seen any updates from the mlx5 driver folks, although I may
> > > not have been CC'd?
> >
> > We are extremely slow these days due to combination of holidays
> > (Easter, Passover, Ramadan, spring break e.t.c).
>
> Let's get this fixed ASAP, please. I understand that there are
> holidays, but it's been over 2 weeks, and addressing regressions
> should be highest priority for any maintainer! :(
>
> From what I gather all we need here is to throw in an extra condition
> for "FW is hella old" into mlx5_core_is_management_pf(), no?

That's my gut feeling too, at least for a quick solution.  I'd offer
to cobble together a fix, but my kernel expertise ends well before I
get to the mlx5 driver :)

I have been running for a while now with that small patch reverted on
my test machines (so I can keep my tests running) and everything seems
to be okay, but there may be other issues caused by the revert that
I'm not seeing.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-13 15:19             ` Paul Moore
@ 2023-04-13 21:12               ` Saeed Mahameed
  2023-04-13 22:21                 ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: Saeed Mahameed @ 2023-04-13 21:12 UTC (permalink / raw)
  To: Paul Moore
  Cc: Jakub Kicinski, Leon Romanovsky, Linux regressions mailing list,
	Saeed Mahameed, Shay Drory, netdev, selinux, Tariq Toukan

On 13 Apr 11:19, Paul Moore wrote:
>On Thu, Apr 13, 2023 at 10:54 AM Jakub Kicinski <kuba@kernel.org> wrote:
>> On Mon, 10 Apr 2023 08:46:05 +0300 Leon Romanovsky wrote:
>> > > I haven't seen any updates from the mlx5 driver folks, although I may
>> > > not have been CC'd?
>> >
>> > We are extremely slow these days due to combination of holidays
>> > (Easter, Passover, Ramadan, spring break e.t.c).
>>
>> Let's get this fixed ASAP, please. I understand that there are
>> holidays, but it's been over 2 weeks, and addressing regressions
>> should be highest priority for any maintainer! :(
>>
>> From what I gather all we need here is to throw in an extra condition
>> for "FW is hella old" into mlx5_core_is_management_pf(), no?
>

Hi, Jakub and Paul
This is a high priority and we are working on this, unfortunately for mlx5
we don't check FW versions since we support more than 6 different devices
already, with different FW production lines. 

So we believe that this bug is very hard to solve without breaking backward
compatibility with the currently supported working FWs, the issue exists only
on very old firmwares and we will recommend a firmware upgrade to resolve this
issue.

>That's my gut feeling too, at least for a quick solution.  I'd offer
>to cobble together a fix, but my kernel expertise ends well before I
>get to the mlx5 driver :)
>
>I have been running for a while now with that small patch reverted on
>my test machines (so I can keep my tests running) and everything seems
>to be okay, but there may be other issues caused by the revert that
>I'm not seeing.
>

Paul, is it possible to upgrade your device's FW? Your current FW is 6 years
old and we officially don't support FWs this old.

Here's a link to start your upgrade:
https://network.nvidia.com/support/firmware/connectx4ib/

Let me know if you need any further assistance.

>-- 
>paul-moore.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-13 21:12               ` Saeed Mahameed
@ 2023-04-13 22:21                 ` Jakub Kicinski
  2023-04-13 22:34                   ` Saeed Mahameed
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2023-04-13 22:21 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Paul Moore, Leon Romanovsky, Linux regressions mailing list,
	Saeed Mahameed, Shay Drory, netdev, selinux, Tariq Toukan

On Thu, 13 Apr 2023 14:12:49 -0700 Saeed Mahameed wrote:
> This is a high priority and we are working on this, unfortunately for mlx5
> we don't check FW versions since we support more than 6 different devices
> already, with different FW production lines. 
> 
> So we believe that this bug is very hard to solve without breaking backward
> compatibility with the currently supported working FWs, the issue exists only
> on very old firmwares and we will recommend a firmware upgrade to resolve this
> issue.

On a closer read I don't like what this patch is doing at all.
I'm not sure we have precedent for "management connection" functions.
This requires a larger discussion. And after looking up the patch set
it went in, it seems to have been one of the hastily merged ones.
I'm sending a revert.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-13 22:21                 ` Jakub Kicinski
@ 2023-04-13 22:34                   ` Saeed Mahameed
  2023-04-13 22:51                     ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: Saeed Mahameed @ 2023-04-13 22:34 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Paul Moore, Leon Romanovsky, Linux regressions mailing list,
	Saeed Mahameed, Shay Drory, netdev, selinux, Tariq Toukan

On 13 Apr 15:21, Jakub Kicinski wrote:
>On Thu, 13 Apr 2023 14:12:49 -0700 Saeed Mahameed wrote:
>> This is a high priority and we are working on this, unfortunately for mlx5
>> we don't check FW versions since we support more than 6 different devices
>> already, with different FW production lines.
>>
>> So we believe that this bug is very hard to solve without breaking backward
>> compatibility with the currently supported working FWs, the issue exists only
>> on very old firmwares and we will recommend a firmware upgrade to resolve this
>> issue.
>
>On a closer read I don't like what this patch is doing at all.
>I'm not sure we have precedent for "management connection" functions.
>This requires a larger discussion. And after looking up the patch set

But this management connection function has the same architecture as other
"Normal" mlx5 functions, from the driver pov. The same way mlx5 
doesn't care if the underlying function is CX4/5/6 we don't care if it was
a "management function".

We are currently working on enabling a subset of netdev functionality using
the same mlx5 constructs and current mlx5e code to load up a mlx5e netdev
on it.. 

>it went in, it seems to have been one of the hastily merged ones.
>I'm sending a revert.

But let's discuss what's wrong with it, and what are your thoughts ? 
the fact that it breaks a 6 years OLD FW, doesn't make it so horrible.

The patchset is a bug fix where previous mlx5 load on such function failed 
with some nasty kernel log messages, so the patchset only provides a fix to
make mlx5 load on such function go smooth and avoid loading any interface
on that function until we provide the patches for that which is a WIP right
now.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-13 22:34                   ` Saeed Mahameed
@ 2023-04-13 22:51                     ` Jakub Kicinski
  2023-04-14  3:03                       ` Saeed Mahameed
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2023-04-13 22:51 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Paul Moore, Leon Romanovsky, Linux regressions mailing list,
	Saeed Mahameed, Shay Drory, netdev, selinux, Tariq Toukan

On Thu, 13 Apr 2023 15:34:21 -0700 Saeed Mahameed wrote:
> >On a closer read I don't like what this patch is doing at all.
> >I'm not sure we have precedent for "management connection" functions.
> >This requires a larger discussion. And after looking up the patch set  
> 
> But this management connection function has the same architecture as other
> "Normal" mlx5 functions, from the driver pov. The same way mlx5 
> doesn't care if the underlying function is CX4/5/6 we don't care if it was
> a "management function".

Yes, and that's why every single IPU implementation thinks that it's 
a great idea. Because it's easy to implement. But what is it for
architecturally? Running what is effectively FW commands over TCP?

> We are currently working on enabling a subset of netdev functionality using
> the same mlx5 constructs and current mlx5e code to load up a mlx5e netdev
> on it.. 
> 
> >it went in, it seems to have been one of the hastily merged ones.
> >I'm sending a revert.  
> 
> But let's discuss what's wrong with it, and what are your thoughts ? 
> the fact that it breaks a 6 years OLD FW, doesn't make it so horrible.

Right, the breakage is a separate topic.

You say 6 years old but the part is EOL, right? The part is old and
stable, AFAIU the breakage stems from development work for parts which
are 3 or so generations newer.

The question is who's supposed to be paying the price of mlx5 being
used for old and new parts? What is fair to expect from the user
when the FW Paul has presumably works just fine for him?

> The patchset is a bug fix where previous mlx5 load on such function failed 
> with some nasty kernel log messages, so the patchset only provides a fix to
> make mlx5 load on such function go smooth and avoid loading any interface
> on that function until we provide the patches for that which is a WIP right
> now.

Ah, that's probably why I wasn't screaming at it when it was
posted. I must have understood it then. The commit title is quite
confusing by itself - "_Enable_ management PF initialization".

Why is it hard to exclude anything older than CX6 from this condition?
That part I'm still not understanding.. can you add more color?
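
To make the question concrete, here is a minimal standalone sketch of what
such an exclusion could look like. It is not mlx5 code: `enum hyp_gen`,
`struct hyp_dev` and `hyp_is_management_pf()` are hypothetical names, and the
single boolean stands in for whatever capability-based test the driver
actually performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device generations; these are not mlx5 definitions. */
enum hyp_gen {
	HYP_GEN_CX4 = 4,
	HYP_GEN_CX5 = 5,
	HYP_GEN_CX6 = 6,
	HYP_GEN_CX7 = 7,
};

/* Hypothetical, heavily reduced view of a probed device. */
struct hyp_dev {
	enum hyp_gen gen;            /* generation, assumed known at probe time */
	bool caps_look_like_mgmt_pf; /* stand-in for the capability-based test */
};

/*
 * Treat a function as a management PF only if the capability test matches
 * AND the device is new enough to have such a function at all.
 */
bool hyp_is_management_pf(const struct hyp_dev *dev)
{
	assert(dev != NULL);

	/* Parts older than CX6 are excluded regardless of what FW reports. */
	if (dev->gen < HYP_GEN_CX6)
		return false;

	return dev->caps_look_like_mgmt_pf;
}
```

With this shape, an older part whose firmware happens to report matching
capability bits is never classified as a management PF, at the cost of making
the predicate device-aware.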

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-13 22:51                     ` Jakub Kicinski
@ 2023-04-14  3:03                       ` Saeed Mahameed
  2023-04-14  3:26                         ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: Saeed Mahameed @ 2023-04-14  3:03 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Paul Moore, Leon Romanovsky, Linux regressions mailing list,
	Saeed Mahameed, Shay Drory, netdev, selinux, Tariq Toukan

On 13 Apr 15:51, Jakub Kicinski wrote:
>On Thu, 13 Apr 2023 15:34:21 -0700 Saeed Mahameed wrote:
>> >On a closer read I don't like what this patch is doing at all.
>> >I'm not sure we have precedent for "management connection" functions.
>> >This requires a larger discussion. And after looking up the patch set
>>
>> But this management connection function has the same architecture as other
>> "Normal" mlx5 functions, from the driver pov. The same way mlx5
>> doesn't care if the underlying function is CX4/5/6 we don't care if it was
>> a "management function".
>
>Yes, and that's why every single IPU implementation thinks that it's
>a great idea. Because it's easy to implement. But what is it for
>architecturally? Running what is effectively FW commands over TCP?

Where did you get this idea from? maybe we got the name wrong, 
"management PF" is simply a minimalistic netdev PF to have eth connection
with the on board BMC .. 

I agree that the name "management PF" sounds scary, but it is not a control
function as you think, not at all. As the original commit message states:
"loopback PF designed for communication with BMC".

>
>> We are currently working on enabling a subset of netdev functionality using
>> the same mlx5 constructs and current mlx5e code to load up a mlx5e netdev
>> on it..
>>
>> >it went in, it seems to have been one of the hastily merged ones.
>> >I'm sending a revert.
>>
>> But let's discuss what's wrong with it, and what are your thoughts ?
>> the fact that it breaks a 6 years OLD FW, doesn't make it so horrible.
>
>Right, the breakage is a separate topic.
>
>You say 6 years old but the part is EOL, right? The part is old and
>stable, AFAIU the breakage stems from development work for parts which
>are 3 or so generations newer.
>

Officially we test only 3 GA FWs back. The fact that mlx5 is a generic CX
driver makes it really hard to test all the possible combinations, so we
need to be strict with how far back we want to officially support and test old
generations.

>The question is who's supposed to be paying the price of mlx5 being
>used for old and new parts? What is fair to expect from the user
>when the FW Paul has presumably works just fine for him?
>
Upgrade FW when possible, it is always easier than upgrading the kernel.
Anyways this was a very rare FW/Arch bug, We should've exposed an
explicit cap for this new type of PF when we had the chance, now it's too
late since a proper fix will require FW and Driver upgrades and breaking
the current solution we have over other OSes as well.

Yes I can craft an if condition to explicitly check for chip id and FW
version for this corner case, which has no precedent in mlx5, but I prefer
to ask to upgrade FW first, and if that's an acceptable solution, I would
like to keep the mlx5 clean and device agnostic as much as possible.

>> The patchset is a bug fix where previous mlx5 load on such function failed
>> with some nasty kernel log messages, so the patchset only provides a fix to
>> make mlx5 load on such function go smooth and avoid loading any interface
>> on that function until we provide the patches for that which is a WIP right
>> now.
>
>Ah, that's probably why I wasn't screaming at it when it was
>posted. I must have understood it then. The commit title is quite
>confusing by itself - "_Enable_ management PF initialization".
>

Yes, the naming is misleading; this is not what the name suggests, just a minimal
PF ethernet channel to the BMC. Nobody is planning to run "raw FW commands
over TCP"; you don't need a "special PF" to do this :) ..
In fact any vendor could already be doing this on any normal
PF, so I think you are basing your argument on an irrelevant claim.

>Why is it hard to exclude anything older than CX6 from this condition?
>That part I'm still not understanding.. can you add more color?

CX arch and mlx5 are forward compatible, we try to keep mlx5 device
agnostic and use the CX well-defined feature discovery protocols to boot
the correct set of features.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-14  3:03                       ` Saeed Mahameed
@ 2023-04-14  3:26                         ` Jakub Kicinski
  2023-04-14 14:37                           ` Paul Moore
  2023-04-14 22:20                           ` Saeed Mahameed
  0 siblings, 2 replies; 24+ messages in thread
From: Jakub Kicinski @ 2023-04-14  3:26 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Paul Moore, Leon Romanovsky, Linux regressions mailing list,
	Saeed Mahameed, Shay Drory, netdev, selinux, Tariq Toukan

On Thu, 13 Apr 2023 20:03:18 -0700 Saeed Mahameed wrote:
> On 13 Apr 15:51, Jakub Kicinski wrote:
> >On Thu, 13 Apr 2023 15:34:21 -0700 Saeed Mahameed wrote:  
> >> But this management connection function has the same architecture as other
> >> "Normal" mlx5 functions, from the driver pov. The same way mlx5
> >> doesn't care if the underlying function is CX4/5/6 we don't care if it was
> >> a "management function".  
> >
> >Yes, and that's why every single IPU implementation thinks that it's
> >a great idea. Because it's easy to implement. But what is it for
> >architecturally? Running what is effectively FW commands over TCP?  
> 
> Where did you get this idea from? maybe we got the name wrong, 
> "management PF" is simply a minimalistic netdev PF to have eth connection
> with the on board BMC .. 
> 
> I agree that the name "management PF" sounds scary, but it is not a control
> function as you think, not at all. As the original commit message states:
> "loopback PF designed for communication with BMC".

Can you draw a small diagram with the bare metal guest, IPU, and BMC?
What's talking to what? And what packets are exchanged?

> >> But let's discuss what's wrong with it, and what are your thoughts ?
> >> the fact that it breaks a 6 years OLD FW, doesn't make it so horrible.  
> >
> >Right, the breakage is a separate topic.
> >
> >You say 6 years old but the part is EOL, right? The part is old and
> >stable, AFAIU the breakage stems from development work for parts which
> >are 3 or so generations newer.
> 
> Officially we test only 3 GA FWs back. The fact that mlx5 is a generic CX
> driver makes it really hard to test all the possible combinations, so we
> need to be strict with how far back we want to officially support and test old
> generations.

Would you be able to pull the datapoints for what 3 GA FWs means 
in case of CX4? Release number and date when it was released?

I understand the challenge of backward compat with a multi-gen
driver. It's a trade off.

> >The question is who's supposed to be paying the price of mlx5 being
> >used for old and new parts? What is fair to expect from the user
> >when the FW Paul has presumably works just fine for him?
> >  
> Upgrade FW when possible, it is always easier than upgrading the kernel.
> Anyways this was a very rare FW/Arch bug, We should've exposed an
> explicit cap for this new type of PF when we had the chance, now it's too
> late since a proper fix will require FW and Driver upgrades and breaking
> the current solution we have over other OSes as well.
>
> Yes I can craft an if condition to explicitly check for chip id and FW
> version for this corner case, which has no precedent in mlx5, but I prefer
> to ask to upgrade FW first, and if that's an acceptable solution, I would
> like to keep the mlx5 clean and device agnostic as much as possible.

IMO you either need a fully fleshed out FW update story, with advanced
warnings for a few releases, distributing the FW via linux-firmware or
fwupdmgr or such.  Or deal with the corner cases in the driver :(

We can get Paul to update, sure, but if he noticed so quickly the
question remains how many people out in the wild will get affected 
and not know what the cause is?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-14  3:26                         ` Jakub Kicinski
@ 2023-04-14 14:37                           ` Paul Moore
  2023-04-14 22:20                           ` Saeed Mahameed
  1 sibling, 0 replies; 24+ messages in thread
From: Paul Moore @ 2023-04-14 14:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Saeed Mahameed, Leon Romanovsky, Linux regressions mailing list,
	Saeed Mahameed, Shay Drory, netdev, selinux, Tariq Toukan

On Thu, Apr 13, 2023 at 11:26 PM Jakub Kicinski <kuba@kernel.org> wrote:
> On Thu, 13 Apr 2023 20:03:18 -0700 Saeed Mahameed wrote:
> > On 13 Apr 15:51, Jakub Kicinski wrote:
> > >On Thu, 13 Apr 2023 15:34:21 -0700 Saeed Mahameed wrote:

...

> > >The question is who's supposed to be paying the price of mlx5 being
> > >used for old and new parts? What is fair to expect from the user
> > >when the FW Paul has presumably works just fine for him?
> > >
> > Upgrade FW when possible, it is always easier than upgrading the kernel.
> > Anyways this was a very rare FW/Arch bug, We should've exposed an
> > explicit cap for this new type of PF when we had the chance, now it's too
> > late since a proper fix will require FW and Driver upgrades and breaking
> > the current solution we have over other OSes as well.
> >
> > Yes I can craft an if condition to explicitly check for chip id and FW
> > version for this corner case, which has no precedent in mlx5, but I prefer
> > to ask to upgrade FW first, and if that's an acceptable solution, I would
> > like to keep the mlx5 clean and device agnostic as much as possible.
>
> IMO you either need a fully fleshed out FW update story, with advanced
> warnings for a few releases, distributing the FW via linux-firmware or
> fwupdmgr or such.  Or deal with the corner cases in the driver :(
>
> We can get Paul to update, sure, but if he noticed so quickly the
> question remains how many people out in the wild will get affected
> and not know what the cause is?

I think it is that last bit which is the real issue, at least from a
regression standpoint.  I didn't see anything on the console or in the
logs to indicate that ancient/buggy FW was the issue, even once I
bisected the kernel (which your average user isn't going to do) it
wasn't clear that it was a FW problem.  Perhaps the mlx5 driver should
perform a simple FW version check on initialization and
pr_warn()/pr_err() if the loaded FW is below a support threshold?
Seeing a "mlx5: hey idiot, your FW is ancient, you need to upgrade!"
line on my console/dmesg would have sent me in the right direction and
likely avoided all of this ...
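
A minimal userspace sketch of the kind of check suggested above, assuming the
firmware version is available as a "major.minor.sub" string. All `hyp_*` names
are hypothetical and nothing here is taken from the mlx5 driver; in the kernel
the `fprintf()` would be a `pr_warn()` or `dev_warn()` at probe time.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical firmware version triple, e.g. 12.24.1000. */
struct hyp_fw_ver {
	unsigned int major, minor, sub;
};

/* Parse "major.minor.sub". Returns 0 on success, -1 on malformed input. */
int hyp_fw_parse(const char *s, struct hyp_fw_ver *out)
{
	char tail;

	assert(s && out);
	/* Exactly three fields must convert; a fourth means trailing junk. */
	if (sscanf(s, "%u.%u.%u%c", &out->major, &out->minor, &out->sub,
		   &tail) != 3)
		return -1;
	return 0;
}

/* Returns <0, 0 or >0 if a is older than, equal to or newer than b. */
int hyp_fw_cmp(const struct hyp_fw_ver *a, const struct hyp_fw_ver *b)
{
	if (a->major != b->major)
		return a->major < b->major ? -1 : 1;
	if (a->minor != b->minor)
		return a->minor < b->minor ? -1 : 1;
	if (a->sub != b->sub)
		return a->sub < b->sub ? -1 : 1;
	return 0;
}

/* Print a warning and return 1 if cur is older than min_tested, else 0. */
int hyp_fw_warn_if_old(const char *name, const struct hyp_fw_ver *cur,
		       const struct hyp_fw_ver *min_tested)
{
	if (hyp_fw_cmp(cur, min_tested) >= 0)
		return 0;

	fprintf(stderr,
		"%s: FW %u.%u.%u is older than the oldest tested FW %u.%u.%u, please upgrade\n",
		name, cur->major, cur->minor, cur->sub,
		min_tested->major, min_tested->minor, min_tested->sub);
	return 1;
}
```

For example, with a threshold of 12.24.1000 (the oldest of the three CX4
releases listed elsewhere in this thread) and a made-up installed version of
12.18.1000, `hyp_fw_warn_if_old()` prints the warning and returns 1; with
12.28.2006 installed it stays silent and returns 0.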

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-14  3:26                         ` Jakub Kicinski
  2023-04-14 14:37                           ` Paul Moore
@ 2023-04-14 22:20                           ` Saeed Mahameed
  2023-04-15  0:34                             ` Jakub Kicinski
  1 sibling, 1 reply; 24+ messages in thread
From: Saeed Mahameed @ 2023-04-14 22:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Paul Moore, Leon Romanovsky, Linux regressions mailing list,
	Saeed Mahameed, Shay Drory, netdev, selinux, Tariq Toukan

On 13 Apr 20:26, Jakub Kicinski wrote:
>On Thu, 13 Apr 2023 20:03:18 -0700 Saeed Mahameed wrote:
>> On 13 Apr 15:51, Jakub Kicinski wrote:
>> >On Thu, 13 Apr 2023 15:34:21 -0700 Saeed Mahameed wrote:
>> >> But this management connection function has the same architecture as other
>> >> "Normal" mlx5 functions, from the driver pov. The same way mlx5
>> >> doesn't care if the underlying function is CX4/5/6 we don't care if it was
>> >> a "management function".
>> >
>> >Yes, and that's why every single IPU implementation thinks that it's
>> >a great idea. Because it's easy to implement. But what is it for
>> >architecturally? Running what is effectively FW commands over TCP?
>>
>> Where did you get this idea from? maybe we got the name wrong,
>> "management PF" is simply a minimalistic netdev PF to have eth connection
>> with the on board BMC ..
>>
>> I agree that the name "management PF" sounds scary, but it is not a control
>> function as you think, not at all. As the original commit message states:
>> "loopback PF designed for communication with BMC".
>
>Can you draw a small diagram with the bare metal guest, IPU, and BMC?
>What's talking to what? And what packets are exchanged?
>

Yes, working on that...

>> >> But let's discuss what's wrong with it, and what are your thoughts ?
>> >> the fact that it breaks a 6 years OLD FW, doesn't make it so horrible.
>> >
>> >Right, the breakage is a separate topic.
>> >
>> >You say 6 years old but the part is EOL, right? The part is old and
>> >stable, AFAIU the breakage stems from development work for parts which
>> >are 3 or so generations newer.
>>
>> Officially we test only 3 GA FWs back. The fact that mlx5 is a generic CX
>> driver makes it really hard to test all the possible combinations, so we
>> need to be strict with how far back we want to officially support and test old
>> generations.
>
>Would you be able to pull the datapoints for what 3 GA FWs means
>in case of CX4? Release number and date when it was released?
>

https://network.nvidia.com/files/related-docs/eol/LCR-000821.pdf

Since CX4 was EOL last year, it is going to be hard to find this info but
let me check my email archive.. 

12.28.2006   27-Sep-20 - recommended version
12.26.xxxx   12-Dec-2019
12.24.1000   2-Dec-18


>I understand the challenge of backward compat with a multi-gen
>driver. It's a trade off.
>
>> >The question is who's supposed to be paying the price of mlx5 being
>> >used for old and new parts? What is fair to expect from the user
>> >when the FW Paul has presumably works just fine for him?
>> >
>> Upgrade FW when possible, it is always easier than upgrading the kernel.
>> Anyways this was a very rare FW/Arch bug, We should've exposed an
>> explicit cap for this new type of PF when we had the chance, now it's too
>> late since a proper fix will require FW and Driver upgrades and breaking
>> the current solution we have over other OSes as well.
>>
>> Yes I can craft an if condition to explicitly check for chip id and FW
>> version for this corner case, which has no precedent in mlx5, but I prefer
>> to ask to upgrade FW first, and if that's an acceptable solution, I would
>> like to keep the mlx5 clean and device agnostic as much as possible.
>
>IMO you either need a fully fleshed out FW update story, with advanced
>warnings for a few releases, distributing the FW via linux-firmware or
>fwupdmgr or such.  Or deal with the corner cases in the driver :(
>

Completely agree, I will start an internal discussion .. 

>We can get Paul to update, sure, but if he noticed so quickly the
>question remains how many people out in the wild will get affected
>and not know what the cause is?

Right, I will make sure this will be addressed, will let you know how we
will handle this, will try to post a patch early next cycle, but I will
need to work with Arch and release managers for this, so it will take a
couple of weeks to formalize a proper solution.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-14 22:20                           ` Saeed Mahameed
@ 2023-04-15  0:34                             ` Jakub Kicinski
  2023-04-15  4:40                               ` Saeed Mahameed
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2023-04-15  0:34 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Paul Moore, Leon Romanovsky, Linux regressions mailing list,
	Saeed Mahameed, Shay Drory, netdev, selinux, Tariq Toukan

On Fri, 14 Apr 2023 15:20:01 -0700 Saeed Mahameed wrote:
> >> Officially we test only 3 GA FWs back. The fact that mlx5 is a generic CX
> >> driver makes it really hard to test all the possible combinations, so we
> >> need to be strict with how far back we want to officially support and test old
> >> generations.  
> >
> >Would you be able to pull the datapoints for what 3 GA FWs means
> >in case of CX4? Release number and date when it was released?
>
> https://network.nvidia.com/files/related-docs/eol/LCR-000821.pdf
> 
> Since CX4 was EOL last year, it is going to be hard to find this info but
> let me check my email archive.. 
> 
> 12.28.2006   27-Sep-20 - recommended version
> 12.26.xxxx   12-Dec-2019
> 12.24.1000   2-Dec-18

That's basically 3 years of support. Seems fairly reasonable.
 
> >> Upgrade FW when possible, it is always easier than upgrading the kernel.
> >> Anyways this was a very rare FW/Arch bug, We should've exposed an
> >> explicit cap for this new type of PF when we had the chance, now it's too
> >> late since a proper fix will require FW and Driver upgrades and breaking
> >> the current solution we have over other OSes as well.
> >>
> >> Yes I can craft an if condition to explicitly check for chip id and FW
> >> version for this corner case, which has no precedent in mlx5, but I prefer
> >> to ask to upgrade FW first, and if that's an acceptable solution, I would
> >> like to keep the mlx5 clean and device agnostic as much as possible.  
> >
> >IMO you either need a fully fleshed out FW update story, with advanced
> >warnings for a few releases, distributing the FW via linux-firmware or
> >fwupdmgr or such.  Or deal with the corner cases in the driver :(
> 
> Completely agree, I will start an internal discussion .. 
> 
> >We can get Paul to update, sure, but if he noticed so quickly the
> >question remains how many people out in the wild will get affected
> >and not know what the cause is?  
> 
> Right, I will make sure this will be addressed, will let you know how we
> will handle this, will try to post a patch early next cycle, but I will
> need to work with Arch and release managers for this, so it will take a
> couple of weeks to formalize a proper solution.

What do we do now, tho? If the main side effect of a revert is that
users of a newfangled device with an order of magnitude lower
deployment continue to see a warning/error in the logs - I'm leaning
towards applying it :(

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-15  0:34                             ` Jakub Kicinski
@ 2023-04-15  4:40                               ` Saeed Mahameed
  2023-04-17 15:38                                 ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: Saeed Mahameed @ 2023-04-15  4:40 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Paul Moore, Leon Romanovsky, Linux regressions mailing list,
	Saeed Mahameed, Shay Drory, netdev, selinux, Tariq Toukan

On 14 Apr 17:34, Jakub Kicinski wrote:
>On Fri, 14 Apr 2023 15:20:01 -0700 Saeed Mahameed wrote:
>> >> Officially we test only 3 GA FWs back. The fact that mlx5 is a generic CX
>> >> driver makes it really hard to test all the possible combinations, so we
>> >> need to be strict with how back we want to officially support and test old
>> >> generations.
>> >
>> >Would you be able to pull the datapoints for what 3 GA FWs means
>> >in case of CX4? Release number and date when it was released?
>>
>> https://network.nvidia.com/files/related-docs/eol/LCR-000821.pdf
>>
>> Since CX4 was EOL last year, it is going to be hard to find this info but
>> let me check my email archive..
>>
>> 12.28.2006   27-Sep-20 - recommended version
>> 12.26.xxxx   12-Dec-2019
>> 12.24.1000   2-Dec-18
>
>That's basically 3 years of support. Seems fairly reasonable.
>
>> >> Upgrade FW when possible, it is always easier than upgrading the kernel.
>> >> Anyways this was a very rare FW/Arch bug, We should've exposed an
>> >> explicit cap for this new type of PF when we had the chance, now it's too
>> >> late since a proper fix will require FW and Driver upgrades and breaking
>> >> the current solution we have over other OSes as well.
>> >>
>> >> Yes I can craft an if condition to explicitly check for chip id and FW
>> >> version for this corner case, which has no precedence in mlx5, but I prefer
>> >> to ask to upgrade FW first, and if that's an acceptable solution, I would
>> >> like to keep the mlx5 clean and device agnostic as much as possible.
>> >
>> >IMO you either need a fully fleshed out FW update story, with advanced
>> >warnings for a few releases, distributing the FW via linux-firmware or
>> >fwupdmgr or such.  Or deal with the corner cases in the driver :(
>>
>> Completely agree, I will start an internal discussion ..
>>
>> >We can get Paul to update, sure, but if he noticed so quickly the
>> >question remains how many people out in the wild will get affected
>> >and not know what the cause is?
>>
>> Right, I will make sure this will be addressed, will let you know how we
>> will handle this, will try to post a patch early next cycle, but i will
>> need to work with Arch and release managers for this, so it will take a
>> couple of weeks to formalize a proper solution.
>
>What do we do now, tho? If the main side effect of a revert is that
>users of a newfangled device with an order of magnitude lower
>deployment continue to see a warning/error in the logs - I'm leaning
>towards applying it :(

I tend to agree with you, but let me check with the FW architect what he has
to offer: either we provide a FW version check or another, more accurate
FW cap test that could solve the issue for everyone. If I don't come up with
a solution by next Wednesday, I will repost your revert in my next net PR
that day. You can mark it awaiting-upstream for now, if that works for
you.
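
For illustration only, a FW version check of the kind mentioned above could
look roughly like the user-space C sketch below. The struct and all function
names are hypothetical and are not mlx5 code; the version strings in the
usage note are simply the CX4 releases quoted earlier in this thread, not a
known boundary for the affected FW.

```c
#include <stdbool.h>
#include <stdio.h>

/* Firmware version in the "major.minor.sub" form used in this thread. */
struct fw_version {
	unsigned int major;
	unsigned int minor;
	unsigned int sub;
};

/* Parse e.g. "12.28.2006"; reject partial or malformed strings. */
static bool fw_version_parse(const char *s, struct fw_version *v)
{
	char trailing;

	/* Exactly three numeric fields and nothing after them. */
	return sscanf(s, "%u.%u.%u%c", &v->major, &v->minor, &v->sub,
		      &trailing) == 3;
}

/* Return <0, 0 or >0 when a is older than, equal to or newer than b. */
static int fw_version_cmp(const struct fw_version *a,
			  const struct fw_version *b)
{
	if (a->major != b->major)
		return a->major < b->major ? -1 : 1;
	if (a->minor != b->minor)
		return a->minor < b->minor ? -1 : 1;
	if (a->sub != b->sub)
		return a->sub < b->sub ? -1 : 1;
	return 0;
}

/* Hypothetical gate: true only if `running` parses and is >= `minimum`. */
static bool fw_is_at_least(const char *running, const char *minimum)
{
	struct fw_version r, m;

	if (!fw_version_parse(running, &r) || !fw_version_parse(minimum, &m))
		return false;
	return fw_version_cmp(&r, &m) >= 0;
}
```

With this sketch, fw_is_at_least("12.28.2006", "12.24.1000") is true, while
an unparseable string such as "12.26.xxxx" is treated as not meeting the
minimum, so the stricter path is taken when the version is unknown.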


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-15  4:40                               ` Saeed Mahameed
@ 2023-04-17 15:38                                 ` Jakub Kicinski
  2023-04-20  0:43                                   ` Saeed Mahameed
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2023-04-17 15:38 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Paul Moore, Leon Romanovsky, Linux regressions mailing list,
	Saeed Mahameed, Shay Drory, netdev, selinux, Tariq Toukan

On Fri, 14 Apr 2023 21:40:35 -0700 Saeed Mahameed wrote:
> >What do we do now, tho? If the main side effect of a revert is that
> >users of a newfangled device with an order of magnitude lower
> >deployment continue to see a warning/error in the logs - I'm leaning
> >towards applying it :(  
> 
> I tend to agree with you but let me check with the FW architect what he has
> to offer, either we provide a FW version check or another more accurate
> FW cap test that could solve the issue for everyone. If I don't come up with
> a solution by next Wednesday I will repost your revert in my next net PR
> on Wednesday. You can mark it awaiting-upstream for now, if that works for
> you.

OK, sounds good.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-17 15:38                                 ` Jakub Kicinski
@ 2023-04-20  0:43                                   ` Saeed Mahameed
  2023-04-20  0:46                                     ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: Saeed Mahameed @ 2023-04-20  0:43 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Saeed Mahameed, Paul Moore, Leon Romanovsky,
	Linux regressions mailing list, Shay Drory, netdev, selinux,
	Tariq Toukan

On 17 Apr 08:38, Jakub Kicinski wrote:
>On Fri, 14 Apr 2023 21:40:35 -0700 Saeed Mahameed wrote:
>> >What do we do now, tho? If the main side effect of a revert is that
>> >users of a newfangled device with an order of magnitude lower
>> >deployment continue to see a warning/error in the logs - I'm leaning
>> >towards applying it :(
>>
>> I tend to agree with you but let me check with the FW architect what he has
>> to offer, either we provide a FW version check or another more accurate
>> FW cap test that could solve the issue for everyone. If I don't come up with
>> a solution by next Wednesday I will repost your revert in my next net PR
>> on Wednesday. You can mark it awaiting-upstream for now, if that works for
>> you.
>
>OK, sounds good.


So I checked with Arch and we agreed that the only devices that need to
expose this management PF are BlueField chips, which have dedicated device
IDs and are newer than the affected FW, so we can fix this by making the
check stricter and testing device IDs as well.

I will provide a patch by tomorrow and will let Paul test it first.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-20  0:43                                   ` Saeed Mahameed
@ 2023-04-20  0:46                                     ` Jakub Kicinski
  2023-04-20  4:02                                       ` Saeed Mahameed
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2023-04-20  0:46 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Saeed Mahameed, Paul Moore, Leon Romanovsky,
	Linux regressions mailing list, Shay Drory, netdev, selinux,
	Tariq Toukan

On Wed, 19 Apr 2023 17:43:11 -0700 Saeed Mahameed wrote:
> So I checked with Arch and we agreed that the only devices that need to
> expose this management PF are Bluefield chips, which have dedicated device
> IDs, and newer than the affected FW, so we can fix this by making the check
> more strict by testing device IDs as well.
> 
> I will provide a patch by tomorrow, will let Paul test it first.

What's "by tomorrow"? Today COB or some time tomorrow? 
Paolo is sending the PR tomorrow, fix needs to be on the list *now*.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Potential regression/bug in net/mlx5 driver
  2023-04-20  0:46                                     ` Jakub Kicinski
@ 2023-04-20  4:02                                       ` Saeed Mahameed
  0 siblings, 0 replies; 24+ messages in thread
From: Saeed Mahameed @ 2023-04-20  4:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Saeed Mahameed, Paul Moore, Leon Romanovsky,
	Linux regressions mailing list, Shay Drory, netdev, selinux,
	Tariq Toukan

On 19 Apr 17:46, Jakub Kicinski wrote:
>On Wed, 19 Apr 2023 17:43:11 -0700 Saeed Mahameed wrote:
>> So I checked with Arch and we agreed that the only devices that need to
>> expose this management PF are Bluefield chips, which have dedicated device
>> IDs, and newer than the affected FW, so we can fix this by making the check
>> more strict by testing device IDs as well.
>>
>> I will provide a patch by tomorrow, will let Paul test it first.
>
>What's "by tomorrow"? Today COB or some time tomorrow?
>Paolo is sending the PR tomorrow, fix needs to be on the list *now*.

I just saw you applied the revert; anyway, here's our proposal:
https://patchwork.kernel.org/project/netdevbpf/patch/20230420035652.295680-1-saeed@kernel.org/

We just tie Management PF to specific device IDs where it's actually
supported.

I guess I can bring back a combination of the original patch and my fix
in the next cycle.
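
As a rough, self-contained illustration of the idea (not the patch itself,
see the patchwork link above for that), tying the Management PF check to
device IDs might be sketched as below. The struct, the function names and
the cap_looks_like_mgmt_pf field are hypothetical stand-ins for the
existing capability-based test, and the device IDs in the table are
illustrative assumptions rather than values taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a device as seen by the driver. */
struct dev_info {
	uint16_t pci_device_id;
	/* Result of the existing FW capability test (stand-in field). */
	bool cap_looks_like_mgmt_pf;
};

/* Illustrative allow-list; the real IDs are an assumption here. */
static const uint16_t mgmt_pf_device_ids[] = {
	0xa2d2,
	0xa2d6,
	0xa2dc,
};

/* True if the PCI device ID is on the allow-list. */
static bool device_id_supports_mgmt_pf(uint16_t id)
{
	size_t n = sizeof(mgmt_pf_device_ids) / sizeof(mgmt_pf_device_ids[0]);

	for (size_t i = 0; i < n; i++)
		if (mgmt_pf_device_ids[i] == id)
			return true;
	return false;
}

/*
 * Stricter check: the capability test alone is not trusted, the device
 * must also be one where a Management PF can actually exist.
 */
static bool is_management_pf(const struct dev_info *d)
{
	return device_id_supports_mgmt_pf(d->pci_device_id) &&
	       d->cap_looks_like_mgmt_pf;
}
```

The point of the design is that an older device whose FW happens to
satisfy the capability test is no longer misclassified, because its device
ID is not on the list.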

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-04-20  4:02 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-28 23:08 Potential regression/bug in net/mlx5 driver Paul Moore
2023-03-29 22:20 ` Saeed Mahameed
2023-03-30  1:27   ` Paul Moore
2023-04-09  8:48     ` Linux regression tracking (Thorsten Leemhuis)
2023-04-09 23:50       ` Paul Moore
2023-04-10  5:46         ` Leon Romanovsky
2023-04-13 13:49           ` Linux regression tracking (Thorsten Leemhuis)
2023-04-13 14:54           ` Jakub Kicinski
2023-04-13 15:19             ` Paul Moore
2023-04-13 21:12               ` Saeed Mahameed
2023-04-13 22:21                 ` Jakub Kicinski
2023-04-13 22:34                   ` Saeed Mahameed
2023-04-13 22:51                     ` Jakub Kicinski
2023-04-14  3:03                       ` Saeed Mahameed
2023-04-14  3:26                         ` Jakub Kicinski
2023-04-14 14:37                           ` Paul Moore
2023-04-14 22:20                           ` Saeed Mahameed
2023-04-15  0:34                             ` Jakub Kicinski
2023-04-15  4:40                               ` Saeed Mahameed
2023-04-17 15:38                                 ` Jakub Kicinski
2023-04-20  0:43                                   ` Saeed Mahameed
2023-04-20  0:46                                     ` Jakub Kicinski
2023-04-20  4:02                                       ` Saeed Mahameed
2023-03-31 13:10 ` Linux regression tracking #adding (Thorsten Leemhuis)
