From: Saeed Mahameed <saeedm@mellanox.com>
To: "schnelle@linux.ibm.com" <schnelle@linux.ibm.com>,
	Parav Pandit <parav@mellanox.com>
Cc: "netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken
Date: Fri, 12 Jun 2020 22:01:56 +0000
Message-ID: <7660d8e0d2cb1fbd40cf89ea4c9a0eff4807157c.camel@mellanox.com>
In-Reply-To: <f942d546-ee7e-60f6-612a-ae093a9459a5@linux.ibm.com>

On Fri, 2020-06-12 at 15:09 +0200, Niklas Schnelle wrote:
> Hello Parav, Hello Saeed,
> 
> Our CI system for IBM Z Linux found a hang[0] when hot unplugging a
> ConnectX-4 Lx VF from a z/VM guest. It occurs in Linus' current tree
> and was introduced during the merge window. Sadly, it didn't happen
> every time, which sent me down the wrong path for two full git bisects.
> 
> Anyway, I've now tracked this down to the following commit, which
> fixes the issue when reverted:
> 
> 41798df9bfca ("net/mlx5: Drain wq first during PCI device removal")
> 
> Looking at the diff I'd say the likely culprit is that before
> the commit the order of calls was:
> 
> mlx5_unregister_device(dev)
> mlx5_drain_health_wq(dev)
> 
> But with the commit it becomes:
> 
> mlx5_drain_health_wq(dev)
> mlx5_unregister_device(dev)
> 
> So, without really knowing anything about these functions, I would
> guess that while the device is still registered, the drained queue
> does not stay empty because new entries keep being added.
> Does that sound plausible to you?
> 

I don't think it is related. This may be similar to some issues
recently addressed by Shay's patches:

https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-2-saeedm@mellanox.com/
https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-3-saeedm@mellanox.com/

net/mlx5: drain health workqueue in case of driver load error
net/mlx5: Fix fatal error handling during device load

> Best regards,
> Niklas Schnelle
> 
> [0] dmesg output:
> [   36.447442] mlx5_core 0000:00:00.0: poll_health:694:(pid 0): Fatal error 1 detected
> [   36.447450] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[0] 0xffffffff
> [   36.447453] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[1] 0xffffffff
> [   36.447455] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[2] 0xffffffff
> [   36.447458] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[3] 0xffffffff
> [   36.447461] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[4] 0xffffffff
> [   36.447463] mlx5_core 0000:00:00.0: print_health_info:375:(pid 0): assert_exit_ptr 0xffffffff
> [   36.447467] mlx5_core 0000:00:00.0: print_health_info:377:(pid 0): assert_callra 0xffffffff
> [   36.447471] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): fw_ver 65535.65535.65535
> [   36.447475] mlx5_core 0000:00:00.0: print_health_info:381:(pid 0): hw_id 0xffffffff
> [   36.447478] mlx5_core 0000:00:00.0: print_health_info:382:(pid 0): irisc_index 255
> [   36.447492] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0): synd 0xff: unrecognized error
> [   36.447621] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0): ext_synd 0xffff
> [   36.447624] mlx5_core 0000:00:00.0: print_health_info:387:(pid 0): raw fw_ver 0xffffffff
> [   36.447885] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=0
> [   36.447897] zpci: 0000:00:00.0: Event 0x303 reconfigured PCI function 0x514
> [   47.099220] mlx5_core 0000:00:00.0: poll_health:709:(pid 0): device's health compromised - reached miss count
> [   47.099228] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[0] 0xffffffff
> [   47.099231] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[1] 0xffffffff
> [   47.099234] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[2] 0xffffffff
> [   47.099236] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[3] 0xffffffff
> [   47.099239] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[4] 0xffffffff
> [   47.099241] mlx5_core 0000:00:00.0: print_health_info:375:(pid 0): assert_exit_ptr 0xffffffff
> [   47.099245] mlx5_core 0000:00:00.0: print_health_info:377:(pid 0): assert_callra 0xffffffff
> [   47.099249] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): fw_ver 65535.65535.65535
> [   47.099253] mlx5_core 0000:00:00.0: print_health_info:381:(pid 0): hw_id 0xffffffff
> [   47.099256] mlx5_core 0000:00:00.0: print_health_info:382:(pid 0): irisc_index 255
> [   47.099327] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0): synd 0xff: unrecognized error
> [   47.099329] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0): ext_synd 0xffff
> [   47.099330] mlx5_core 0000:00:00.0: print_health_info:387:(pid 0): raw fw_ver 0xffffffff
> [  100.539106] mlx5_core 0000:00:00.0: wait_func:991:(pid 121): 2RST_QP(0x50a) timeout. Will cause a leak of a command resource
> [  100.539118] infiniband mlx5_0: destroy_qp_common:2525:(pid 121): mlx5_ib: modify QP 0x00072c to RESET failed
> [  141.499325] mlx5_core 0000:00:00.0: wait_func:991:(pid 32): QUERY_VPORT_COUNTER(0x770) timeout. Will cause a leak of a command resource
> [  161.978957] mlx5_core 0000:00:00.0: wait_func:991:(pid 121): DESTROY_QP(0x501) timeout. Will cause a leak of a command resource

Shay's patches are also meant to avoid such command timeouts.
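
For reference, the ordering hypothesis in the quoted report can be
sketched as a small userspace toy model. This is not mlx5 or kernel
code: toy_dev, toy_queue_work, toy_drain, toy_unregister and toy_remove
are hypothetical names, and the single "late submission" is an
assumption made for illustration. It only shows what the guess would
mean if it held; it does not establish that this is the cause of the
hang.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names are mlx5 or kernel APIs. */
struct toy_dev {
	bool registered;	/* true while users may still submit work */
	int pending;		/* number of queued work items */
};

/* A user queues work; accepted only while the device is registered. */
static bool toy_queue_work(struct toy_dev *dev)
{
	if (!dev->registered)
		return false;
	dev->pending++;
	return true;
}

/* Flush everything that is queued at the time of the call. */
static void toy_drain(struct toy_dev *dev)
{
	dev->pending = 0;
}

/* After this, no user can submit new work. */
static void toy_unregister(struct toy_dev *dev)
{
	dev->registered = false;
}

/*
 * Run a teardown in one of the two orders from the report, with one
 * late submission (an assumption) arriving between the two steps.
 * Returns the number of work items still pending afterwards.
 */
static int toy_remove(bool drain_first)
{
	struct toy_dev dev = { .registered = true, .pending = 1 };

	if (drain_first) {
		toy_drain(&dev);
		toy_queue_work(&dev);	/* still registered: accepted */
		toy_unregister(&dev);
	} else {
		toy_unregister(&dev);
		toy_queue_work(&dev);	/* unregistered: rejected */
		toy_drain(&dev);
	}

	/* Both orders end with the device unregistered. */
	assert(!dev.registered);
	return dev.pending;
}
```

In this model, toy_remove(true) leaves one item pending after the
drain, while toy_remove(false) leaves none. Whether the real driver
can queue health work after the drain is a separate question that the
model does not answer.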



Thread overview: 9+ messages
2020-06-12 13:09 [REGRESSION] mlx5: Driver remove during hot unplug is broken Niklas Schnelle
2020-06-12 22:01 ` Saeed Mahameed [this message]
2020-06-15 10:01   ` Niklas Schnelle
2020-07-08 10:43     ` Parav Pandit
2020-07-08 11:44       ` Niklas Schnelle
2020-07-08 15:44         ` Parav Pandit
2020-07-09 10:06           ` Niklas Schnelle
2020-07-09 18:34             ` Parav Pandit
2020-07-10  8:34               ` Niklas Schnelle
