netdev.vger.kernel.org archive mirror
* [PATCH net-next v2] net/mlx5: stop waiting for PCI link if reset is required
@ 2023-04-11 10:51 Niklas Schnelle
  2023-04-12 23:33 ` Jacob Keller
  2023-04-13 23:01 ` Saeed Mahameed
  0 siblings, 2 replies; 3+ messages in thread
From: Niklas Schnelle @ 2023-04-11 10:51 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Gerd Bayer, Alexander Schmidt, Leon Romanovsky, netdev,
	linux-rdma, linux-kernel

After an error on the PCI link, the driver does not need to wait
for the link to become functional again as a reset is required. Stop
the wait loop in this case to accelerate the recovery flow.

Co-developed-by: Alexander Schmidt <alexs@linux.ibm.com>
Signed-off-by: Alexander Schmidt <alexs@linux.ibm.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230403075657.168294-1-schnelle@linux.ibm.com
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/health.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c
index f9438d4e43ca..81ca44e0705a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
@@ -325,6 +325,8 @@ int mlx5_health_wait_pci_up(struct mlx5_core_dev *dev)
 	while (sensor_pci_not_working(dev)) {
 		if (time_after(jiffies, end))
 			return -ETIMEDOUT;
+		if (pci_channel_offline(dev->pdev))
+			return -EIO;
 		msleep(100);
 	}
 	return 0;
@@ -332,10 +334,16 @@ int mlx5_health_wait_pci_up(struct mlx5_core_dev *dev)
 
 static int mlx5_health_try_recover(struct mlx5_core_dev *dev)
 {
+	int rc;
+
 	mlx5_core_warn(dev, "handling bad device here\n");
 	mlx5_handle_bad_state(dev);
-	if (mlx5_health_wait_pci_up(dev)) {
-		mlx5_core_err(dev, "health recovery flow aborted, PCI reads still not working\n");
+	rc = mlx5_health_wait_pci_up(dev);
+	if (rc) {
+		if (rc == -ETIMEDOUT)
+			mlx5_core_err(dev, "health recovery flow aborted, PCI reads still not working\n");
+		else
+			mlx5_core_err(dev, "health recovery flow aborted, PCI channel offline\n");
 		return -EIO;
 	}
 	mlx5_core_err(dev, "starting health recovery flow\n");

base-commit: 09a9639e56c01c7a00d6c0ca63f4c7c41abe075d
-- 
2.37.2



* Re: [PATCH net-next v2] net/mlx5: stop waiting for PCI link if reset is required
  2023-04-11 10:51 [PATCH net-next v2] net/mlx5: stop waiting for PCI link if reset is required Niklas Schnelle
@ 2023-04-12 23:33 ` Jacob Keller
  2023-04-13 23:01 ` Saeed Mahameed
  1 sibling, 0 replies; 3+ messages in thread
From: Jacob Keller @ 2023-04-12 23:33 UTC (permalink / raw)
  To: Niklas Schnelle, Saeed Mahameed, Leon Romanovsky,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Gerd Bayer, Alexander Schmidt, Leon Romanovsky, netdev,
	linux-rdma, linux-kernel



On 4/11/2023 3:51 AM, Niklas Schnelle wrote:
> After an error on the PCI link, the driver does not need to wait
> for the link to become functional again as a reset is required. Stop
> the wait loop in this case to accelerate the recovery flow.
> 

Ok, so if the PCI link is completely offline (pci_channel_offline) then
we just bail out immediately and fail to recover, reporting this to the
user. Then a system administrator can step in and perform the
appropriate reset, rather than getting no report until the timeout
completes? Essentially, we know that this will never recover at this
point, so stop wasting time.

Makes sense.
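
To make the control flow concrete, here is a minimal userspace sketch of
the wait loop with the early bail-out. It is a model under stated
assumptions, not the driver code: struct fake_dev, its fields, and the
poll counter standing in for jiffies and msleep(100) are all
hypothetical; only the order of the checks and the -ETIMEDOUT / -EIO
return values follow the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the device; not the real struct mlx5_core_dev. */
struct fake_dev {
	int polls_until_up;  /* polls until PCI reads work again; < 0 means never */
	int offline_at_poll; /* poll at which the channel reports offline; < 0 means never */
	int polls_done;      /* polls performed so far (models elapsed time) */
};

/* Models sensor_pci_not_working(): true while PCI reads still fail. */
static bool reads_not_working(const struct fake_dev *d)
{
	return d->polls_until_up < 0 || d->polls_done < d->polls_until_up;
}

/* Models pci_channel_offline(): true once the channel is marked offline. */
static bool channel_offline(const struct fake_dev *d)
{
	return d->offline_at_poll >= 0 && d->polls_done >= d->offline_at_poll;
}

/* Same check order as the patched loop: timeout first, then offline. */
static int wait_pci_up_model(struct fake_dev *d, int max_polls)
{
	assert(d != NULL);
	while (reads_not_working(d)) {
		if (d->polls_done >= max_polls)
			return -ETIMEDOUT; /* gave up waiting */
		if (channel_offline(d))
			return -EIO;       /* reset required, waiting is pointless */
		d->polls_done++;           /* stands in for msleep(100) */
	}
	return 0;
}
```

In this model, a device whose channel goes offline at poll 2 returns
-EIO after two polls instead of running into the poll limit, which is
the saving the commit message describes.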

> Co-developed-by: Alexander Schmidt <alexs@linux.ibm.com>
> Signed-off-by: Alexander Schmidt <alexs@linux.ibm.com>
> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
> Link: https://lore.kernel.org/r/20230403075657.168294-1-schnelle@linux.ibm.com
> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
> ---

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>



* Re: [PATCH net-next v2] net/mlx5: stop waiting for PCI link if reset is required
  2023-04-11 10:51 [PATCH net-next v2] net/mlx5: stop waiting for PCI link if reset is required Niklas Schnelle
  2023-04-12 23:33 ` Jacob Keller
@ 2023-04-13 23:01 ` Saeed Mahameed
  1 sibling, 0 replies; 3+ messages in thread
From: Saeed Mahameed @ 2023-04-13 23:01 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Gerd Bayer, Alexander Schmidt,
	Leon Romanovsky, netdev, linux-rdma, linux-kernel

On 11 Apr 12:51, Niklas Schnelle wrote:
>After an error on the PCI link, the driver does not need to wait
>for the link to become functional again as a reset is required. Stop
>the wait loop in this case to accelerate the recovery flow.
>
>Co-developed-by: Alexander Schmidt <alexs@linux.ibm.com>
>Signed-off-by: Alexander Schmidt <alexs@linux.ibm.com>
>Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
>Link: https://lore.kernel.org/r/20230403075657.168294-1-schnelle@linux.ibm.com
>Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
>---
> drivers/net/ethernet/mellanox/mlx5/core/health.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c
>index f9438d4e43ca..81ca44e0705a 100644
>--- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
>+++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
>@@ -325,6 +325,8 @@ int mlx5_health_wait_pci_up(struct mlx5_core_dev *dev)
> 	while (sensor_pci_not_working(dev)) {
> 		if (time_after(jiffies, end))
> 			return -ETIMEDOUT;
>+		if (pci_channel_offline(dev->pdev))
>+			return -EIO;

We already sent a patch to net not too long ago to break this while loop
when a reset has been triggered:
  
net/mlx5: Stop waiting for PCI up if teardown was triggered
https://lore.kernel.org/netdev/20230314054234.267365-3-saeed@kernel.org/

Usually when the PCI link goes offline, either the PCI subsystem detects
it and triggers the mlx5 teardown, or the mlx5 health check detects it
and initiates the teardown. Either way, the MLX5_BREAK_FW_WAIT flag is
raised and the loop quits. Please let me know if you think the extra
check of pci_channel_offline(dev->pdev) is still required here for your
system.
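
The flag-based exit described above can be sketched in userspace as
follows. This is an illustration under assumptions, not the code of the
referenced patch: struct teardown_dev, the break_wait field (modelling
MLX5_BREAK_FW_WAIT), the teardown_at_poll parameter, and the -ENODEV
return value are all hypothetical choices made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model; break_wait plays the role of MLX5_BREAK_FW_WAIT. */
struct teardown_dev {
	atomic_bool break_wait;
	int polls_done;
};

static void teardown_dev_init(struct teardown_dev *d)
{
	atomic_init(&d->break_wait, false);
	d->polls_done = 0;
}

/* Called by whichever path starts the teardown (PCI core or health check). */
static void trigger_teardown(struct teardown_dev *d)
{
	atomic_store(&d->break_wait, true);
}

/*
 * Wait loop for a device whose PCI reads never recover. teardown_at_poll
 * simulates another context raising the flag at that poll; pass a negative
 * value for "teardown never triggered".
 */
static int wait_with_break_flag(struct teardown_dev *d, int max_polls,
				int teardown_at_poll)
{
	assert(d != NULL);
	for (;;) {
		if (d->polls_done >= max_polls)
			return -ETIMEDOUT;
		if (d->polls_done == teardown_at_poll)
			trigger_teardown(d);  /* simulated external event */
		if (atomic_load(&d->break_wait))
			return -ENODEV;       /* assumed error code */
		d->polls_done++;              /* stands in for msleep(100) */
	}
}
```

The sketch also shows the case the question is about: if nothing ever
raises the flag while the channel is offline, the loop only ends at the
timeout, which is where a direct pci_channel_offline() check would make
a difference.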



