regressions.lists.linux.dev archive mirror
From: Paul Moore <paul@paul-moore.com>
To: Saeed Mahameed <saeed@kernel.org>
Cc: Shay Drory <shayd@nvidia.com>, Saeed Mahameed <saeedm@nvidia.com>,
	netdev@vger.kernel.org,  regressions@lists.linux.dev,
	selinux@vger.kernel.org
Subject: Re: Potential regression/bug in net/mlx5 driver
Date: Wed, 29 Mar 2023 21:27:45 -0400
Message-ID: <CAHC9VhTvQLa=+Ykwmr_Uhgjrc6dfi24ou=NBsACkhwZN7X4EtQ@mail.gmail.com>
In-Reply-To: <ZCS5oxM/m9LuidL/@x130>

On Wed, Mar 29, 2023 at 6:20 PM Saeed Mahameed <saeed@kernel.org> wrote:
> On 28 Mar 19:08, Paul Moore wrote:
> >Hello all,
> >
> >Starting with the v6.3-rcX kernel releases I noticed that my
> >InfiniBand devices were no longer present under /sys/class/infiniband,
> >causing some of my automated testing to fail.  It took me a while to
> >find the time to bisect the issue, but I eventually identified the
> >problematic commit:
> >
> >  commit fe998a3c77b9f989a30a2a01fb00d3729a6d53a4
> >  Author: Shay Drory <shayd@nvidia.com>
> >  Date:   Wed Jun 29 11:38:21 2022 +0300
> >
> >   net/mlx5: Enable management PF initialization
> >
> >   Enable initialization of DPU Management PF, which is a new loopback PF
> >   designed for communication with BMC.
> >   For now Management PF doesn't support nor require most upper layer
> >   protocols so avoid them.
> >
> >   Signed-off-by: Shay Drory <shayd@nvidia.com>
> >   Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
> >   Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
> >   Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> >
> >I'm not a mlx5 driver expert so I can't really offer much in the way
> >of a fix, but as a quick test I did remove the
> >'mlx5_core_is_management_pf(...)' calls in mlx5/core/dev.c and
> >everything seemed to work okay on my test system (or rather the tests
> >ran without problem).
> >
> >If you need any additional information, or would like me to test a
> >patch, please let me know.
>
> Hi Paul,
>
> Our team is looking into this, the current theory is that you have an old
> FW that doesn't have the correct capabilities set.

That's very possible; I installed this card many years ago and haven't
updated the FW once.  I'm happy to update the FW (do you have a
pointer/how-to?), but it might be good to identify a fix first as I'm
guessing there will be others like me ...

> Can you please provide the FW version and the ConnectX device you are
> testing?
>
> $ devlink dev info

% devlink dev info; echo $?
0

No output, and the exit status is 0.  However, I do see the following in dmesg:

[  255.251124] mlx5_core 0000:00:08.0: mlx5_fw_version_query:823:(pid 959): fw query isn't supported by the FW

... which appears to support your theory about ancient firmware.

> $ lspci -s <pci_dev> -vv

While there is only one physical card, there are two PCI devices (it's
a dual-port card).  I'm only copying the first device since I'm
guessing that's really all you need:

% lspci -s 00:07.0 -vv
00:07.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
       Subsystem: Mellanox Technologies Device 0010
       Physical Slot: 7
       Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
                Stepping- SERR+ FastB2B- DisINTx+
       Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
               <TAbort- <MAbort- >SERR- <PERR- INTx-
       Latency: 0, Cache Line Size: 64 bytes
       Interrupt: pin A routed to IRQ 11
       Region 0: Memory at fa000000 (64-bit, prefetchable) [size=32M]
       Expansion ROM at fe900000 [disabled] [size=1M]
       Capabilities: [60] Express (v2) Endpoint, MSI 00
               DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
                       unlimited, L1 unlimited
                       ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                       SlotPowerLimit 25W
               DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                       RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                       MaxPayload 256 bytes, MaxReadReq 512 bytes
               DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr-
                       TransPend-
               LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
                       ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
               LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                       ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
               LnkSta: Speed 8GT/s, Width x8
                       TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
               DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP-
                        LTR- 10BitTagComp+ 10BitTagReq- OBFF Not Supported,
                        ExtFmt- EETLPPrefix- EmergencyPowerReduction
                        Not Supported, EmergencyPowerReductionInit-
                        FRS- TPHComp- ExtTPHComp-
               AtomicOpsCap: 32bit- 64bit- 128bitCAS-
               DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR-
                        10BitTagReq- OBFF Disabled,
               AtomicOpsCtl: ReqEn-
               LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+
                        EqualizationPhase1+ EqualizationPhase2+
                        EqualizationPhase3+ LinkEqualizationRequest-
                        Retimer- 2Retimers- CrosslinkRes: unsupported
       Capabilities: [48] Vital Product Data
               Product Name: CX454A - ConnectX-4 QSFP28
               Read-only fields:
                       [PN] Part number: MCX454A-FCAT
                       [EC] Engineering changes: AB
                       [SN] Serial number: MT1730X05081
                       [V0] Vendor specific: PCIeGen3 x8
                       [RV] Reserved: checksum good, 0 byte(s) reserved
               End
       Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
               Vector table: BAR=0 offset=00002000
               PBA: BAR=0 offset=00003000
       Capabilities: [c0] Vendor Specific Information: Len=18 <?>
       Capabilities: [40] Power Management version 3
               Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
                      PME(D0-,D1-,D2-,D3hot-,D3cold+)
               Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
       Kernel driver in use: mlx5_core
       Kernel modules: mlx5_core

> since boot:
> $ dmesg

% devlink dev info
% dmesg | grep mlx5
[    4.739691] mlx5_core 0000:00:07.0: firmware version: 12.18.1000
[    4.740134] mlx5_core 0000:00:07.0: 63.008 Gb/s available PCIe bandwidth (8.0GT/s PCIe x8 link)
[    7.048567] mlx5_core 0000:00:07.0: Port module event: module 0, Cable plugged
[    7.211879] mlx5_core 0000:00:08.0: firmware version: 12.18.1000
[    7.212309] mlx5_core 0000:00:08.0: 63.008 Gb/s available PCIe bandwidth (8.0GT/s PCIe x8 link)
[    7.897218] mlx5_core 0000:00:08.0: Port module event: module 1, Cable plugged
[   10.875388] mlx5_core 0000:00:07.0 ibs7: renamed from ib0
[   10.995115] mlx5_core 0000:00:08.0 ibs8: renamed from ib0
[  181.471663] mlx5_core 0000:00:07.0: mlx5_fw_version_query:823:(pid 918): fw query isn't supported by the FW
[  181.472286] mlx5_core 0000:00:08.0: mlx5_fw_version_query:823:(pid 918): fw query isn't supported by the FW
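
Since 'devlink dev info' prints nothing on this firmware, the version
can still be pulled from the kernel log.  The filter below is only a
sketch: the function name mlx5_fw_versions is made up for illustration,
and it assumes the "firmware version:" message format shown in the
dmesg output above.

```shell
# Sketch only: print "<pci-address> <firmware-version>" for each mlx5
# device, reading kernel log text on stdin.
# The function name is hypothetical; the pattern assumes the
# "mlx5_core <addr>: firmware version: <ver>" format seen in this log.
mlx5_fw_versions() {
    sed -n 's/^.*mlx5_core \([0-9a-fA-F:.]*\): firmware version: \([0-9.]*\).*$/\1 \2/p'
}
```

Usage would be something like 'dmesg | mlx5_fw_versions', which should
print one line per PCI device.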

-- 
paul-moore.com
