All of lore.kernel.org
 help / color / mirror / Atom feed
From: Shannon Nelson <snelson@pensando.io>
To: davem@davemloft.net, netdev@vger.kernel.org, kuba@kernel.org
Cc: drivers@pensando.io, Shannon Nelson <snelson@pensando.io>,
	Brett Creeley <brett@pensando.io>
Subject: [PATCH net 15/16] ionic: stretch heartbeat detection
Date: Mon, 24 Jan 2022 10:53:11 -0800	[thread overview]
Message-ID: <20220124185312.72646-16-snelson@pensando.io> (raw)
In-Reply-To: <20220124185312.72646-1-snelson@pensando.io>

The driver can be premature in detecting stalled firmware
when the heartbeat is not updated because the firmware can
occasionally take a long time (more than 2 seconds) to service
a request, and doesn't update the heartbeat during that time.

The firmware heartbeat is not necessarily a steady 1 second
periodic beat, but better described as something that should
progress at least once in every DECVMD_TIMEOUT period.
The single-threaded design in the FW means that if a devcmd
or adminq request launches a large internal job, it is stuck
waiting for that job to finish before it can get back to
updating the heartbeat.  Since all requests are "guaranteed"
to finish within the DEVCMD_TIMEOUT period, the driver needs
to less aggressive in checking the heartbeat progress.

We change our current 2 second window to something bigger than
DEVCMD_TIMEOUT which should take care of most of the issue.
We stop checking for the heartbeat while waiting for a request,
as long as we're still watching for the FW status.  Lastly,
we make sure our FW status is up to date before running a
devcmd request.

Once we do this, we need to not check the heartbeat on DEV
commands because it may be stalled while we're on the fw_down
path.  Instead, we can rely on the is_fw_running check.

Fixes: b2b9a8d7ed13 ("ionic: avoid races in ionic_heartbeat_check")
Signed-off-by: Brett Creeley <brett@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
---
 drivers/net/ethernet/pensando/ionic/ionic.h   |  2 +-
 .../net/ethernet/pensando/ionic/ionic_dev.c   |  6 ++--
 .../net/ethernet/pensando/ionic/ionic_lif.c   |  2 +-
 .../net/ethernet/pensando/ionic/ionic_main.c  | 34 ++++++++-----------
 4 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/pensando/ionic/ionic.h b/drivers/net/ethernet/pensando/ionic/ionic.h
index 04fdaf5c1a02..602f4d45d529 100644
--- a/drivers/net/ethernet/pensando/ionic/ionic.h
+++ b/drivers/net/ethernet/pensando/ionic/ionic.h
@@ -18,7 +18,7 @@ struct ionic_lif;
 #define PCI_DEVICE_ID_PENSANDO_IONIC_ETH_PF	0x1002
 #define PCI_DEVICE_ID_PENSANDO_IONIC_ETH_VF	0x1003
 
-#define DEVCMD_TIMEOUT  10
+#define DEVCMD_TIMEOUT			5
 #define IONIC_ADMINQ_TIME_SLICE		msecs_to_jiffies(100)
 
 #define IONIC_PHC_UPDATE_NS	10000000000	    /* 10s in nanoseconds */
diff --git a/drivers/net/ethernet/pensando/ionic/ionic_dev.c b/drivers/net/ethernet/pensando/ionic/ionic_dev.c
index 1535a40a5fab..51d36a549ef7 100644
--- a/drivers/net/ethernet/pensando/ionic/ionic_dev.c
+++ b/drivers/net/ethernet/pensando/ionic/ionic_dev.c
@@ -236,9 +236,11 @@ int ionic_heartbeat_check(struct ionic *ionic)
 	if (!idev->fw_status_ready)
 		return -ENXIO;
 
-	/* wait at least one watchdog period since the last heartbeat */
+	/* Because of some variability in the actual FW heartbeat, we
+	 * wait longer than the DEVCMD_TIMEOUT before checking again.
+	 */
 	last_check_time = idev->last_hb_time;
-	if (time_before(check_time, last_check_time + ionic->watchdog_period))
+	if (time_before(check_time, last_check_time + DEVCMD_TIMEOUT * 2 * HZ))
 		return 0;
 
 	fw_hb = ioread32(&idev->dev_info_regs->fw_heartbeat);
diff --git a/drivers/net/ethernet/pensando/ionic/ionic_lif.c b/drivers/net/ethernet/pensando/ionic/ionic_lif.c
index e84a01edc4e4..05dd8c4f5466 100644
--- a/drivers/net/ethernet/pensando/ionic/ionic_lif.c
+++ b/drivers/net/ethernet/pensando/ionic/ionic_lif.c
@@ -1787,7 +1787,7 @@ static void ionic_lif_quiesce(struct ionic_lif *lif)
 
 	err = ionic_adminq_post_wait(lif, &ctx);
 	if (err)
-		netdev_err(lif->netdev, "lif quiesce failed %d\n", err);
+		netdev_dbg(lif->netdev, "lif quiesce failed %d\n", err);
 }
 
 static void ionic_txrx_disable(struct ionic_lif *lif)
diff --git a/drivers/net/ethernet/pensando/ionic/ionic_main.c b/drivers/net/ethernet/pensando/ionic/ionic_main.c
index 78771663808a..4029b4e021f8 100644
--- a/drivers/net/ethernet/pensando/ionic/ionic_main.c
+++ b/drivers/net/ethernet/pensando/ionic/ionic_main.c
@@ -358,13 +358,14 @@ int ionic_adminq_wait(struct ionic_lif *lif, struct ionic_admin_ctx *ctx,
 		if (remaining)
 			break;
 
-		/* interrupt the wait if FW stopped */
+		/* force a check of FW status and break out if FW reset */
+		(void)ionic_heartbeat_check(lif->ionic);
 		if ((test_bit(IONIC_LIF_F_FW_RESET, lif->state) &&
 		     !lif->ionic->idev.fw_status_ready) ||
 		    test_bit(IONIC_LIF_F_FW_STOPPING, lif->state)) {
 			if (do_msg)
-				netdev_err(netdev, "%s (%d) interrupted, FW in reset\n",
-					   name, ctx->cmd.cmd.opcode);
+				netdev_warn(netdev, "%s (%d) interrupted, FW in reset\n",
+					    name, ctx->cmd.cmd.opcode);
 			ctx->comp.comp.status = IONIC_RC_ERROR;
 			return -ENXIO;
 		}
@@ -425,9 +426,9 @@ static int __ionic_dev_cmd_wait(struct ionic *ionic, unsigned long max_seconds,
 	unsigned long start_time;
 	unsigned long max_wait;
 	unsigned long duration;
+	int done = 0;
+	bool fw_up;
 	int opcode;
-	int hb = 0;
-	int done;
 	int err;
 
 	/* Wait for dev cmd to complete, retrying if we get EAGAIN,
@@ -437,31 +438,24 @@ static int __ionic_dev_cmd_wait(struct ionic *ionic, unsigned long max_seconds,
 try_again:
 	opcode = readb(&idev->dev_cmd_regs->cmd.cmd.opcode);
 	start_time = jiffies;
-	do {
+	for (fw_up = ionic_is_fw_running(idev);
+	     !done && fw_up && time_before(jiffies, max_wait);
+	     fw_up = ionic_is_fw_running(idev)) {
 		done = ionic_dev_cmd_done(idev);
 		if (done)
 			break;
 		usleep_range(100, 200);
-
-		/* Don't check the heartbeat on FW_CONTROL commands as they are
-		 * notorious for interrupting the firmware's heartbeat update.
-		 */
-		if (opcode != IONIC_CMD_FW_CONTROL)
-			hb = ionic_heartbeat_check(ionic);
-	} while (!done && !hb && time_before(jiffies, max_wait));
+	}
 	duration = jiffies - start_time;
 
 	dev_dbg(ionic->dev, "DEVCMD %s (%d) done=%d took %ld secs (%ld jiffies)\n",
 		ionic_opcode_to_str(opcode), opcode,
 		done, duration / HZ, duration);
 
-	if (!done && hb) {
-		/* It is possible (but unlikely) that FW was busy and missed a
-		 * heartbeat check but is still alive and will process this
-		 * request, so don't clean the dev_cmd in this case.
-		 */
-		dev_dbg(ionic->dev, "DEVCMD %s (%d) failed - FW halted\n",
-			ionic_opcode_to_str(opcode), opcode);
+	if (!done && !fw_up) {
+		ionic_dev_cmd_clean(ionic);
+		dev_warn(ionic->dev, "DEVCMD %s (%d) interrupted - FW is down\n",
+			 ionic_opcode_to_str(opcode), opcode);
 		return -ENXIO;
 	}
 
-- 
2.17.1


  parent reply	other threads:[~2022-01-24 18:57 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-24 18:52 [PATCH net 00/16] ionic: updates for stable FW recovery Shannon Nelson
2022-01-24 18:52 ` [PATCH net 01/16] ionic: fix type complaint in ionic_dev_cmd_clean() Shannon Nelson
2022-01-24 18:52 ` [PATCH net 02/16] ionic: start watchdog after all is setup Shannon Nelson
2022-01-24 18:52 ` [PATCH net 03/16] ionic: separate function for watchdog init Shannon Nelson
2022-01-24 18:53 ` [PATCH net 04/16] ionic: Don't send reset commands if FW isn't running Shannon Nelson
2022-01-24 18:53 ` [PATCH net 05/16] ionic: add FW_STOPPING state Shannon Nelson
2022-01-24 18:53 ` [PATCH net 06/16] ionic: better handling of RESET event Shannon Nelson
2022-01-24 18:53 ` [PATCH net 07/16] ionic: fix up printing of timeout error Shannon Nelson
2022-01-24 18:53 ` [PATCH net 08/16] ionic: Correctly print AQ errors if completions aren't received Shannon Nelson
2022-01-24 18:53 ` [PATCH net 09/16] ionic: Allow flexibility for error reporting on dev commands Shannon Nelson
2022-01-24 18:53 ` [PATCH net 10/16] ionic: Query FW when getting VF info via ndo_get_vf_config Shannon Nelson
2022-01-24 18:53 ` [PATCH net 11/16] ionic: Prevent filter add/del err msgs when the device is not available Shannon Nelson
2022-01-24 18:53 ` [PATCH net 12/16] ionic: Cleanups in the Tx hotpath code Shannon Nelson
2022-01-24 18:53 ` [PATCH net 13/16] ionic: disable napi when ionic_lif_init() fails Shannon Nelson
2022-01-24 18:53 ` [PATCH net 14/16] ionic: remove the dbid_inuse bitmap Shannon Nelson
2022-01-24 18:53 ` Shannon Nelson [this message]
2022-01-24 18:53 ` [PATCH net 16/16] ionic: replace set_vf data with union Shannon Nelson
2022-01-25 18:03 ` [PATCH net 00/16] ionic: updates for stable FW recovery Jakub Kicinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220124185312.72646-16-snelson@pensando.io \
    --to=snelson@pensando.io \
    --cc=brett@pensando.io \
    --cc=davem@davemloft.net \
    --cc=drivers@pensando.io \
    --cc=kuba@kernel.org \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.