* [PATCHv2 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device
@ 2018-04-18 13:31 Zhu Yanjun
  2018-05-10 14:24 ` Tariq Toukan
  0 siblings, 1 reply; 2+ messages in thread
From: Zhu Yanjun @ 2018-04-18 13:31 UTC (permalink / raw)
  To: tariqt, netdev, linux-rdma

When a faulty cable is used or the HCA firmware hits an error, the HCA
device goes offline. If the driver then accesses this offline device,
the following call trace appears.

"
...
  [<ffffffff816e4842>] dump_stack+0x63/0x81
  [<ffffffff816e459e>] panic+0xcc/0x21b
  [<ffffffffa03e5f8a>] mlx4_enter_error_state+0xba/0xf0 [mlx4_core]
  [<ffffffffa03e7298>] mlx4_cmd_reset_flow+0x38/0x60 [mlx4_core]
  [<ffffffffa03e7381>] mlx4_cmd_poll+0xc1/0x2e0 [mlx4_core]
  [<ffffffffa03e9f00>] __mlx4_cmd+0xb0/0x160 [mlx4_core]
  [<ffffffffa0406934>] mlx4_SENSE_PORT+0x54/0xd0 [mlx4_core]
  [<ffffffffa03f5f54>] mlx4_dev_cap+0x4a4/0xb50 [mlx4_core]
...
"
In the above call trace, mlx4_cmd_poll calls mlx4_cmd_post to access
the HCA while the HCA is offline, and mlx4_cmd_post returns -EIO. On
-EIO, mlx4_cmd_poll calls mlx4_cmd_reset_flow to reset the HCA, which
produces the call trace above.

This is not reasonable. Since the HCA is already offline when it is
accessed, it should not be reset again.

With this patch, mlx4_cmd_post returns -EINVAL when the HCA is offline.
On -EINVAL, mlx4_cmd_poll and mlx4_cmd_wait return directly instead of
resetting the HCA.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Suggested-by: Håkon Bugge <haakon.bugge@oracle.com>
Suggested-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
V1->V2: Follow Tariq's advice and avoid interference from other returned
errors. mlx4_cmd_post returns either -EIO or -EINVAL. On -EIO, the HCA
should be reset. -EINVAL means that mlx4_cmd_post is accessing an offline
device, so resetting the HCA is not necessary and the caller goes to
label out directly.
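
The dispatch described above can be sketched as a small userspace model
in plain C. This is only an illustration under stated assumptions, not
driver code: struct sim_dev, sim_cmd_post and sim_cmd_poll are
hypothetical names, the completion polling is left out, and the reset
flow is reduced to a counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device state; not the real struct mlx4_dev. */
struct sim_dev {
	bool in_error_recovery;	/* device is offline and cannot accept commands */
	bool post_fails;	/* posting the command fails for another reason */
	int resets;		/* number of times the modeled reset flow ran */
};

/* Models the return values of mlx4_cmd_post() with this patch applied. */
static int sim_cmd_post(const struct sim_dev *dev)
{
	if (dev->in_error_recovery)
		return -EINVAL;	/* offline device: caller must not reset */
	if (dev->post_fails)
		return -EIO;	/* any other failure: caller resets */
	return 0;
}

/* Models the error dispatch added to mlx4_cmd_poll() and mlx4_cmd_wait(). */
static int sim_cmd_poll(struct sim_dev *dev)
{
	int err = sim_cmd_post(dev);

	if (err) {
		if (err == -EINVAL)
			goto out;
		goto out_reset;
	}

	/* Waiting for command completion is omitted in this sketch. */
	goto out;

out_reset:
	dev->resets++;	/* stands in for mlx4_cmd_reset_flow() */
out:
	return err;
}
```

With in_error_recovery set, sim_cmd_poll returns -EINVAL and the reset
counter stays at 0. With only post_fails set, it returns -EIO and the
counter becomes 1.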
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 6a9086d..df735b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -451,6 +451,8 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, u64 in_param, u64 out_param,
 		 * Device is going through error recovery
 		 * and cannot accept commands.
 		 */
+		mlx4_err(dev, "%s : Device is in error recovery.\n", __func__);
+		ret = -EINVAL;
 		goto out;
 	}
 
@@ -610,8 +612,11 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
 
 	err = mlx4_cmd_post(dev, in_param, out_param ? *out_param : 0,
 			    in_modifier, op_modifier, op, CMD_POLL_TOKEN, 0);
-	if (err)
+	if (err) {
+		if (err == -EINVAL)
+			goto out;
 		goto out_reset;
+	}
 
 	end = msecs_to_jiffies(timeout) + jiffies;
 	while (cmd_pending(dev) && time_before(jiffies, end)) {
@@ -710,8 +715,11 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
 
 	err = mlx4_cmd_post(dev, in_param, out_param ? *out_param : 0,
 			    in_modifier, op_modifier, op, context->token, 1);
-	if (err)
+	if (err) {
+		if (err == -EINVAL)
+			goto out;
 		goto out_reset;
+	}
 
 	if (op == MLX4_CMD_SENSE_PORT) {
 		ret_wait =
-- 
2.7.4


* Re: [PATCHv2 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device
  2018-04-18 13:31 [PATCHv2 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device Zhu Yanjun
@ 2018-05-10 14:24 ` Tariq Toukan
  0 siblings, 0 replies; 2+ messages in thread
From: Tariq Toukan @ 2018-05-10 14:24 UTC (permalink / raw)
  To: Zhu Yanjun, tariqt, netdev, linux-rdma



On 18/04/2018 4:31 PM, Zhu Yanjun wrote:
> When a faulty cable is used or the HCA firmware hits an error, the HCA
> device goes offline. If the driver then accesses this offline device,
> the following call trace appears.
> 
> "
> ...
>    [<ffffffff816e4842>] dump_stack+0x63/0x81
>    [<ffffffff816e459e>] panic+0xcc/0x21b
>    [<ffffffffa03e5f8a>] mlx4_enter_error_state+0xba/0xf0 [mlx4_core]
>    [<ffffffffa03e7298>] mlx4_cmd_reset_flow+0x38/0x60 [mlx4_core]
>    [<ffffffffa03e7381>] mlx4_cmd_poll+0xc1/0x2e0 [mlx4_core]
>    [<ffffffffa03e9f00>] __mlx4_cmd+0xb0/0x160 [mlx4_core]
>    [<ffffffffa0406934>] mlx4_SENSE_PORT+0x54/0xd0 [mlx4_core]
>    [<ffffffffa03f5f54>] mlx4_dev_cap+0x4a4/0xb50 [mlx4_core]
> ...
> "
> In the above call trace, mlx4_cmd_poll calls mlx4_cmd_post to access
> the HCA while the HCA is offline, and mlx4_cmd_post returns -EIO. On
> -EIO, mlx4_cmd_poll calls mlx4_cmd_reset_flow to reset the HCA, which
> produces the call trace above.
> 
> This is not reasonable. Since the HCA is already offline when it is
> accessed, it should not be reset again.
> 
> With this patch, mlx4_cmd_post returns -EINVAL when the HCA is offline.
> On -EINVAL, mlx4_cmd_poll and mlx4_cmd_wait return directly instead of
> resetting the HCA.
> 
> CC: Srinivas Eeda <srinivas.eeda@oracle.com>
> CC: Junxiao Bi <junxiao.bi@oracle.com>
> Suggested-by: Håkon Bugge <haakon.bugge@oracle.com>
> Suggested-by: Tariq Toukan <tariqt@mellanox.com>
> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
> ---
> V1->V2: Follow Tariq's advice and avoid interference from other returned
> errors. mlx4_cmd_post returns either -EIO or -EINVAL. On -EIO, the HCA
> should be reset. -EINVAL means that mlx4_cmd_post is accessing an offline
> device, so resetting the HCA is not necessary and the caller goes to
> label out directly.
> ---
>   drivers/net/ethernet/mellanox/mlx4/cmd.c | 12 ++++++++++--
>   1 file changed, 10 insertions(+), 2 deletions(-)
> 

Reviewed-by: Tariq Toukan <tariqt@mellanox.com>

Thanks Zhu.

