netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next 0/8] pds_core: Various improvements and AQ race condition cleanup
@ 2024-01-04 17:12 Brett Creeley
  2024-01-04 17:12 ` [PATCH net-next 1/8] pds_core: Prevent health thread from running during reset/remove Brett Creeley
                   ` (8 more replies)
  0 siblings, 9 replies; 14+ messages in thread
From: Brett Creeley @ 2024-01-04 17:12 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: shannon.nelson, brett.creeley

There can be many users of the pds_core's adminq. This includes
pds_core's uses and any clients that depend on it. When the pds_core
device goes through a reset for any reason the adminq is freed
and reconfigured. There are some gaps in the current implementation
that will cause crashes during reset if any of the previously mentioned
users of the adminq attempt to use it after it's been freed. This series
addresses these issues.

This series also includes some general cleanups.

Brett Creeley (8):
  pds_core: Prevent health thread from running during reset/remove
  pds_core: Cancel AQ work on teardown
  pds_core: Use struct pdsc for the pdsc_adminq_isr private data
  pds_core: Prevent race issues involving the adminq
  pds_core: Clear BARs on reset
  pds_core: Don't assign interrupt index/bound_intr to notifyq
  pds_core: Unmask adminq interrupt in work thread
  pds_core: Fix up some RCT issues

 drivers/net/ethernet/amd/pds_core/adminq.c  | 74 ++++++++++++++-------
 drivers/net/ethernet/amd/pds_core/core.c    | 38 +++++++++--
 drivers/net/ethernet/amd/pds_core/core.h    |  1 +
 drivers/net/ethernet/amd/pds_core/debugfs.c |  8 +--
 drivers/net/ethernet/amd/pds_core/dev.c     |  9 ++-
 drivers/net/ethernet/amd/pds_core/devlink.c |  3 +-
 drivers/net/ethernet/amd/pds_core/fw.c      |  3 +
 drivers/net/ethernet/amd/pds_core/main.c    | 24 ++++++-
 8 files changed, 123 insertions(+), 37 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH net-next 1/8] pds_core: Prevent health thread from running during reset/remove
  2024-01-04 17:12 [PATCH net-next 0/8] pds_core: Various improvements and AQ race condition cleanup Brett Creeley
@ 2024-01-04 17:12 ` Brett Creeley
  2024-01-04 17:12 ` [PATCH net-next 2/8] pds_core: Cancel AQ work on teardown Brett Creeley
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Brett Creeley @ 2024-01-04 17:12 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: shannon.nelson, brett.creeley

The PCIe reset handlers can run at the same time as the
health thread. This can cause the health thread to
stomp on the PCIe reset. Fix this by preventing the
health thread from running while a PCIe reset is happening.

As part of this use timer_shutdown_sync() during reset and
remove to make sure the timer doesn't ever get rearmed.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
---
 drivers/net/ethernet/amd/pds_core/main.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/main.c b/drivers/net/ethernet/amd/pds_core/main.c
index 3080898d7b95..5172a5ad8ec6 100644
--- a/drivers/net/ethernet/amd/pds_core/main.c
+++ b/drivers/net/ethernet/amd/pds_core/main.c
@@ -293,7 +293,7 @@ static int pdsc_init_pf(struct pdsc *pdsc)
 err_out_teardown:
 	pdsc_teardown(pdsc, PDSC_TEARDOWN_REMOVING);
 err_out_unmap_bars:
-	del_timer_sync(&pdsc->wdtimer);
+	timer_shutdown_sync(&pdsc->wdtimer);
 	if (pdsc->wq)
 		destroy_workqueue(pdsc->wq);
 	mutex_destroy(&pdsc->config_lock);
@@ -420,7 +420,7 @@ static void pdsc_remove(struct pci_dev *pdev)
 		 */
 		pdsc_sriov_configure(pdev, 0);
 
-		del_timer_sync(&pdsc->wdtimer);
+		timer_shutdown_sync(&pdsc->wdtimer);
 		if (pdsc->wq)
 			destroy_workqueue(pdsc->wq);
 
@@ -445,10 +445,24 @@ static void pdsc_remove(struct pci_dev *pdev)
 	devlink_free(dl);
 }
 
+static void pdsc_stop_health_thread(struct pdsc *pdsc)
+{
+	timer_shutdown_sync(&pdsc->wdtimer);
+	if (pdsc->health_work.func)
+		cancel_work_sync(&pdsc->health_work);
+}
+
+static void pdsc_restart_health_thread(struct pdsc *pdsc)
+{
+	timer_setup(&pdsc->wdtimer, pdsc_wdtimer_cb, 0);
+	mod_timer(&pdsc->wdtimer, jiffies + 1);
+}
+
 void pdsc_reset_prepare(struct pci_dev *pdev)
 {
 	struct pdsc *pdsc = pci_get_drvdata(pdev);
 
+	pdsc_stop_health_thread(pdsc);
 	pdsc_fw_down(pdsc);
 
 	pci_free_irq_vectors(pdev);
@@ -486,6 +500,7 @@ void pdsc_reset_done(struct pci_dev *pdev)
 	}
 
 	pdsc_fw_up(pdsc);
+	pdsc_restart_health_thread(pdsc);
 }
 
 static const struct pci_error_handlers pdsc_err_handler = {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 2/8] pds_core: Cancel AQ work on teardown
  2024-01-04 17:12 [PATCH net-next 0/8] pds_core: Various improvements and AQ race condition cleanup Brett Creeley
  2024-01-04 17:12 ` [PATCH net-next 1/8] pds_core: Prevent health thread from running during reset/remove Brett Creeley
@ 2024-01-04 17:12 ` Brett Creeley
  2024-01-04 17:12 ` [PATCH net-next 3/8] pds_core: Use struct pdsc for the pdsc_adminq_isr private data Brett Creeley
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Brett Creeley @ 2024-01-04 17:12 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: shannon.nelson, brett.creeley

There is a small window where pdsc_work_thread()
calls pdsc_process_adminq() and pdsc_process_adminq()
passes the PDSC_S_STOPPING_DRIVER check and starts
to process adminq/notifyq work and then the driver
starts a fw_down cycle. This could cause some
undefined behavior if the notifyqcq/adminqcq are
free'd while pdsc_process_adminq() is running. Use
cancel_work_sync() on the adminqcq's work struct
to make sure any pending work items are cancelled
and any in progress work items are completed.

Also, make sure to not call cancel_work_sync() if
the work item has not be initialized. Without this,
traces will happen in cases where a reset fails and
teardown is called again or if reset fails and the
driver is removed.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
---
 drivers/net/ethernet/amd/pds_core/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
index 0d2091e9eb28..b582729331eb 100644
--- a/drivers/net/ethernet/amd/pds_core/core.c
+++ b/drivers/net/ethernet/amd/pds_core/core.c
@@ -464,6 +464,8 @@ void pdsc_teardown(struct pdsc *pdsc, bool removing)
 
 	if (!pdsc->pdev->is_virtfn)
 		pdsc_devcmd_reset(pdsc);
+	if (pdsc->adminqcq.work.func)
+		cancel_work_sync(&pdsc->adminqcq.work);
 	pdsc_qcq_free(pdsc, &pdsc->notifyqcq);
 	pdsc_qcq_free(pdsc, &pdsc->adminqcq);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 3/8] pds_core: Use struct pdsc for the pdsc_adminq_isr private data
  2024-01-04 17:12 [PATCH net-next 0/8] pds_core: Various improvements and AQ race condition cleanup Brett Creeley
  2024-01-04 17:12 ` [PATCH net-next 1/8] pds_core: Prevent health thread from running during reset/remove Brett Creeley
  2024-01-04 17:12 ` [PATCH net-next 2/8] pds_core: Cancel AQ work on teardown Brett Creeley
@ 2024-01-04 17:12 ` Brett Creeley
  2024-01-04 17:12 ` [PATCH net-next 4/8] pds_core: Prevent race issues involving the adminq Brett Creeley
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Brett Creeley @ 2024-01-04 17:12 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: shannon.nelson, brett.creeley

The initial design for the adminq interrupt was done based
on client drivers having their own adminq and adminq
interrupt. So, each client driver's adminq isr would use
their specific adminqcq for the private data struct. For the
time being the design has changed to only use a single
adminq for all clients. So, instead use the struct pdsc for
the private data to simplify things a bit.

This also has the benefit of not dereferencing the adminqcq
to access the pdsc struct when the PDSC_S_STOPPING_DRIVER bit
is set and the adminqcq has actually been cleared/freed.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
---
 drivers/net/ethernet/amd/pds_core/adminq.c | 5 +++--
 drivers/net/ethernet/amd/pds_core/core.c   | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/adminq.c b/drivers/net/ethernet/amd/pds_core/adminq.c
index 5beadabc2136..68be5ea251fc 100644
--- a/drivers/net/ethernet/amd/pds_core/adminq.c
+++ b/drivers/net/ethernet/amd/pds_core/adminq.c
@@ -135,8 +135,8 @@ void pdsc_work_thread(struct work_struct *work)
 
 irqreturn_t pdsc_adminq_isr(int irq, void *data)
 {
-	struct pdsc_qcq *qcq = data;
-	struct pdsc *pdsc = qcq->pdsc;
+	struct pdsc *pdsc = data;
+	struct pdsc_qcq *qcq;
 
 	/* Don't process AdminQ when shutting down */
 	if (pdsc->state & BIT_ULL(PDSC_S_STOPPING_DRIVER)) {
@@ -145,6 +145,7 @@ irqreturn_t pdsc_adminq_isr(int irq, void *data)
 		return IRQ_HANDLED;
 	}
 
+	qcq = &pdsc->adminqcq;
 	queue_work(pdsc->wq, &qcq->work);
 	pds_core_intr_mask(&pdsc->intr_ctrl[qcq->intx], PDS_CORE_INTR_MASK_CLEAR);
 
diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
index b582729331eb..0356e56a6e99 100644
--- a/drivers/net/ethernet/amd/pds_core/core.c
+++ b/drivers/net/ethernet/amd/pds_core/core.c
@@ -125,7 +125,7 @@ static int pdsc_qcq_intr_alloc(struct pdsc *pdsc, struct pdsc_qcq *qcq)
 
 	snprintf(name, sizeof(name), "%s-%d-%s",
 		 PDS_CORE_DRV_NAME, pdsc->pdev->bus->number, qcq->q.name);
-	index = pdsc_intr_alloc(pdsc, name, pdsc_adminq_isr, qcq);
+	index = pdsc_intr_alloc(pdsc, name, pdsc_adminq_isr, pdsc);
 	if (index < 0)
 		return index;
 	qcq->intx = index;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 4/8] pds_core: Prevent race issues involving the adminq
  2024-01-04 17:12 [PATCH net-next 0/8] pds_core: Various improvements and AQ race condition cleanup Brett Creeley
                   ` (2 preceding siblings ...)
  2024-01-04 17:12 ` [PATCH net-next 3/8] pds_core: Use struct pdsc for the pdsc_adminq_isr private data Brett Creeley
@ 2024-01-04 17:12 ` Brett Creeley
  2024-01-04 19:16   ` Simon Horman
                     ` (2 more replies)
  2024-01-04 17:12 ` [PATCH net-next 5/8] pds_core: Clear BARs on reset Brett Creeley
                   ` (4 subsequent siblings)
  8 siblings, 3 replies; 14+ messages in thread
From: Brett Creeley @ 2024-01-04 17:12 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: shannon.nelson, brett.creeley

There are multiple paths that can result in using the pdsc's
adminq.

[1] pdsc_adminq_isr and the resulting work from queue_work(),
    i.e. pdsc_work_thread()->pdsc_process_adminq()

[2] pdsc_adminq_post()

When the device goes through reset via PCIe reset and/or
a fw_down/fw_up cycle due to bad PCIe state or bad device
state the adminq is destroyed and recreated.

A NULL pointer dereference can happen if [1] or [2] happens
after the adminq is already destroyed.

In order to fix this, add some further state checks and
implement reference counting for adminq uses. Reference
counting was used because multiple threads can attempt to
access the adminq at the same time via [1] or [2]. Additionally,
multiple clients (i.e. pds-vfio-pci) can be using [2]
at the same time.

The adminq_refcnt is initialized to 1 when the adminq has been
allocated and is ready to use. Users/clients of the adminq
(i.e. [1] and [2]) will increment the refcnt when they are using
the adminq. When the driver goes into a fw_down cycle it will
set the PDSC_S_FW_DEAD bit and then wait for the adminq_refcnt
to hit 1. Setting the PDSC_S_FW_DEAD before waiting will prevent
any further adminq_refcnt increments. Waiting for the
adminq_refcnt to hit 1 allows for any current users of the adminq
to finish before the driver frees the adminq. Once the
adminq_refcnt hits 1 the driver clears the refcnt to signify that
the adminq is deleted and cannot be used. On the fw_up cycle the
driver will once again initialize the adminq_refcnt to 1 allowing
the adminq to be used again.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
---
 drivers/net/ethernet/amd/pds_core/adminq.c | 31 +++++++++++++++++-----
 drivers/net/ethernet/amd/pds_core/core.c   | 21 +++++++++++++++
 drivers/net/ethernet/amd/pds_core/core.h   |  1 +
 3 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/adminq.c b/drivers/net/ethernet/amd/pds_core/adminq.c
index 68be5ea251fc..5edff33d56f3 100644
--- a/drivers/net/ethernet/amd/pds_core/adminq.c
+++ b/drivers/net/ethernet/amd/pds_core/adminq.c
@@ -63,6 +63,15 @@ static int pdsc_process_notifyq(struct pdsc_qcq *qcq)
 	return nq_work;
 }
 
+static bool pdsc_adminq_inc_if_up(struct pdsc *pdsc)
+{
+	if (pdsc->state & BIT_ULL(PDSC_S_STOPPING_DRIVER) ||
+	    pdsc->state & BIT_ULL(PDSC_S_FW_DEAD))
+		return false;
+
+	return refcount_inc_not_zero(&pdsc->adminq_refcnt);
+}
+
 void pdsc_process_adminq(struct pdsc_qcq *qcq)
 {
 	union pds_core_adminq_comp *comp;
@@ -75,9 +84,9 @@ void pdsc_process_adminq(struct pdsc_qcq *qcq)
 	int aq_work = 0;
 	int credits;
 
-	/* Don't process AdminQ when shutting down */
-	if (pdsc->state & BIT_ULL(PDSC_S_STOPPING_DRIVER)) {
-		dev_err(pdsc->dev, "%s: called while PDSC_S_STOPPING_DRIVER\n",
+	/* Don't process AdminQ when it's not up */
+	if (!pdsc_adminq_inc_if_up(pdsc)) {
+		dev_err(pdsc->dev, "%s: called while adminq is unavailable\n",
 			__func__);
 		return;
 	}
@@ -124,6 +133,7 @@ void pdsc_process_adminq(struct pdsc_qcq *qcq)
 		pds_core_intr_credits(&pdsc->intr_ctrl[qcq->intx],
 				      credits,
 				      PDS_CORE_INTR_CRED_REARM);
+	refcount_dec(&pdsc->adminq_refcnt);
 }
 
 void pdsc_work_thread(struct work_struct *work)
@@ -138,9 +148,9 @@ irqreturn_t pdsc_adminq_isr(int irq, void *data)
 	struct pdsc *pdsc = data;
 	struct pdsc_qcq *qcq;
 
-	/* Don't process AdminQ when shutting down */
-	if (pdsc->state & BIT_ULL(PDSC_S_STOPPING_DRIVER)) {
-		dev_err(pdsc->dev, "%s: called while PDSC_S_STOPPING_DRIVER\n",
+	/* Don't process AdminQ when it's not up */
+	if (!pdsc_adminq_inc_if_up(pdsc)) {
+		dev_err(pdsc->dev, "%s: called while adminq is unavailable\n",
 			__func__);
 		return IRQ_HANDLED;
 	}
@@ -148,6 +158,7 @@ irqreturn_t pdsc_adminq_isr(int irq, void *data)
 	qcq = &pdsc->adminqcq;
 	queue_work(pdsc->wq, &qcq->work);
 	pds_core_intr_mask(&pdsc->intr_ctrl[qcq->intx], PDS_CORE_INTR_MASK_CLEAR);
+	refcount_dec(&pdsc->adminq_refcnt);
 
 	return IRQ_HANDLED;
 }
@@ -231,6 +242,12 @@ int pdsc_adminq_post(struct pdsc *pdsc,
 	int err = 0;
 	int index;
 
+	if (!pdsc_adminq_inc_if_up(pdsc)) {
+		dev_dbg(pdsc->dev, "%s: preventing adminq cmd %u\n",
+			__func__, cmd->opcode);
+		return -ENXIO;
+	}
+
 	wc.qcq = &pdsc->adminqcq;
 	index = __pdsc_adminq_post(pdsc, &pdsc->adminqcq, cmd, comp, &wc);
 	if (index < 0) {
@@ -286,6 +303,8 @@ int pdsc_adminq_post(struct pdsc *pdsc,
 			queue_work(pdsc->wq, &pdsc->health_work);
 	}
 
+	refcount_dec(&pdsc->adminq_refcnt);
+
 	return err;
 }
 EXPORT_SYMBOL_GPL(pdsc_adminq_post);
diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
index 0356e56a6e99..3b3e1541dd1c 100644
--- a/drivers/net/ethernet/amd/pds_core/core.c
+++ b/drivers/net/ethernet/amd/pds_core/core.c
@@ -450,6 +450,7 @@ int pdsc_setup(struct pdsc *pdsc, bool init)
 		pdsc_debugfs_add_viftype(pdsc);
 	}
 
+	refcount_set(&pdsc->adminq_refcnt, 1);
 	clear_bit(PDSC_S_FW_DEAD, &pdsc->state);
 	return 0;
 
@@ -514,6 +515,24 @@ void pdsc_stop(struct pdsc *pdsc)
 					   PDS_CORE_INTR_MASK_SET);
 }
 
+void pdsc_adminq_wait_and_dec_once_unused(struct pdsc *pdsc)
+{
+	/* The driver initializes the adminq_refcnt to 1 when the adminq is
+	 * allocated and ready for use. Other users/requesters will increment
+	 * the refcnt while in use. If the refcnt is down to 1 then the adminq
+	 * is not in use and the refcnt can be cleared and adminq freed. Before
+	 * calling this function the driver will set PDSC_S_FW_DEAD, which
+	 * prevent subsequent attempts to use the adminq and increment the
+	 * refcnt to fail. This guarantees that this function will eventually
+	 * exit.
+	 */
+	while (!refcount_dec_if_one(&pdsc->adminq_refcnt)) {
+		dev_dbg_ratelimited(pdsc->dev, "%s: adminq in use\n",
+				    __func__);
+		cpu_relax();
+	}
+}
+
 void pdsc_fw_down(struct pdsc *pdsc)
 {
 	union pds_core_notifyq_comp reset_event = {
@@ -529,6 +548,8 @@ void pdsc_fw_down(struct pdsc *pdsc)
 	if (pdsc->pdev->is_virtfn)
 		return;
 
+	pdsc_adminq_wait_and_dec_once_unused(pdsc);
+
 	/* Notify clients of fw_down */
 	if (pdsc->fw_reporter)
 		devlink_health_report(pdsc->fw_reporter, "FW down reported", pdsc);
diff --git a/drivers/net/ethernet/amd/pds_core/core.h b/drivers/net/ethernet/amd/pds_core/core.h
index e35d3e7006bf..cbd5716f46e6 100644
--- a/drivers/net/ethernet/amd/pds_core/core.h
+++ b/drivers/net/ethernet/amd/pds_core/core.h
@@ -184,6 +184,7 @@ struct pdsc {
 	struct mutex devcmd_lock;	/* lock for dev_cmd operations */
 	struct mutex config_lock;	/* lock for configuration operations */
 	spinlock_t adminq_lock;		/* lock for adminq operations */
+	refcount_t adminq_refcnt;
 	struct pds_core_dev_info_regs __iomem *info_regs;
 	struct pds_core_dev_cmd_regs __iomem *cmd_regs;
 	struct pds_core_intr __iomem *intr_ctrl;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 5/8] pds_core: Clear BARs on reset
  2024-01-04 17:12 [PATCH net-next 0/8] pds_core: Various improvements and AQ race condition cleanup Brett Creeley
                   ` (3 preceding siblings ...)
  2024-01-04 17:12 ` [PATCH net-next 4/8] pds_core: Prevent race issues involving the adminq Brett Creeley
@ 2024-01-04 17:12 ` Brett Creeley
  2024-01-04 17:12 ` [PATCH net-next 6/8] pds_core: Don't assign interrupt index/bound_intr to notifyq Brett Creeley
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Brett Creeley @ 2024-01-04 17:12 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: shannon.nelson, brett.creeley

During reset the BARs might be accessed when they are
unmapped. This can cause unexpected issues, so fix it by
clearing the cached BAR values so they are not accessed
until they are re-mapped.

Also, make sure any places that can access the BARs
when they are NULL are prevented.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
---
 drivers/net/ethernet/amd/pds_core/adminq.c  | 28 +++++++++++++++------
 drivers/net/ethernet/amd/pds_core/core.c    |  8 +++++-
 drivers/net/ethernet/amd/pds_core/dev.c     |  9 ++++++-
 drivers/net/ethernet/amd/pds_core/devlink.c |  3 ++-
 drivers/net/ethernet/amd/pds_core/fw.c      |  3 +++
 drivers/net/ethernet/amd/pds_core/main.c    |  5 ++++
 6 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/adminq.c b/drivers/net/ethernet/amd/pds_core/adminq.c
index 5edff33d56f3..ea773cfa0af6 100644
--- a/drivers/net/ethernet/amd/pds_core/adminq.c
+++ b/drivers/net/ethernet/amd/pds_core/adminq.c
@@ -191,10 +191,16 @@ static int __pdsc_adminq_post(struct pdsc *pdsc,
 
 	/* Check that the FW is running */
 	if (!pdsc_is_fw_running(pdsc)) {
-		u8 fw_status = ioread8(&pdsc->info_regs->fw_status);
-
-		dev_info(pdsc->dev, "%s: post failed - fw not running %#02x:\n",
-			 __func__, fw_status);
+		if (pdsc->info_regs) {
+			u8 fw_status =
+				ioread8(&pdsc->info_regs->fw_status);
+
+			dev_info(pdsc->dev, "%s: post failed - fw not running %#02x:\n",
+				 __func__, fw_status);
+		} else {
+			dev_info(pdsc->dev, "%s: post failed - BARs not setup\n",
+				 __func__);
+		}
 		ret = -ENXIO;
 
 		goto err_out_unlock;
@@ -266,10 +272,16 @@ int pdsc_adminq_post(struct pdsc *pdsc,
 			break;
 
 		if (!pdsc_is_fw_running(pdsc)) {
-			u8 fw_status = ioread8(&pdsc->info_regs->fw_status);
-
-			dev_dbg(pdsc->dev, "%s: post wait failed - fw not running %#02x:\n",
-				__func__, fw_status);
+			if (pdsc->info_regs) {
+				u8 fw_status =
+					ioread8(&pdsc->info_regs->fw_status);
+
+				dev_dbg(pdsc->dev, "%s: post wait failed - fw not running %#02x:\n",
+					__func__, fw_status);
+			} else {
+				dev_dbg(pdsc->dev, "%s: post wait failed - BARs not setup\n",
+					__func__);
+			}
 			err = -ENXIO;
 			break;
 		}
diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
index 3b3e1541dd1c..023a59c8e425 100644
--- a/drivers/net/ethernet/amd/pds_core/core.c
+++ b/drivers/net/ethernet/amd/pds_core/core.c
@@ -600,7 +600,13 @@ void pdsc_fw_up(struct pdsc *pdsc)
 
 static void pdsc_check_pci_health(struct pdsc *pdsc)
 {
-	u8 fw_status = ioread8(&pdsc->info_regs->fw_status);
+	u8 fw_status;
+
+	/* some sort of teardown already in progress */
+	if (!pdsc->info_regs)
+		return;
+
+	fw_status = ioread8(&pdsc->info_regs->fw_status);
 
 	/* is PCI broken? */
 	if (fw_status != PDS_RC_BAD_PCI)
diff --git a/drivers/net/ethernet/amd/pds_core/dev.c b/drivers/net/ethernet/amd/pds_core/dev.c
index 31940b857e0e..62a38e0a8454 100644
--- a/drivers/net/ethernet/amd/pds_core/dev.c
+++ b/drivers/net/ethernet/amd/pds_core/dev.c
@@ -57,6 +57,9 @@ int pdsc_err_to_errno(enum pds_core_status_code code)
 
 bool pdsc_is_fw_running(struct pdsc *pdsc)
 {
+	if (!pdsc->info_regs)
+		return false;
+
 	pdsc->fw_status = ioread8(&pdsc->info_regs->fw_status);
 	pdsc->last_fw_time = jiffies;
 	pdsc->last_hb = ioread32(&pdsc->info_regs->fw_heartbeat);
@@ -182,13 +185,17 @@ int pdsc_devcmd_locked(struct pdsc *pdsc, union pds_core_dev_cmd *cmd,
 {
 	int err;
 
+	if (!pdsc->cmd_regs)
+		return -ENXIO;
+
 	memcpy_toio(&pdsc->cmd_regs->cmd, cmd, sizeof(*cmd));
 	pdsc_devcmd_dbell(pdsc);
 	err = pdsc_devcmd_wait(pdsc, cmd->opcode, max_seconds);
-	memcpy_fromio(comp, &pdsc->cmd_regs->comp, sizeof(*comp));
 
 	if ((err == -ENXIO || err == -ETIMEDOUT) && pdsc->wq)
 		queue_work(pdsc->wq, &pdsc->health_work);
+	else
+		memcpy_fromio(comp, &pdsc->cmd_regs->comp, sizeof(*comp));
 
 	return err;
 }
diff --git a/drivers/net/ethernet/amd/pds_core/devlink.c b/drivers/net/ethernet/amd/pds_core/devlink.c
index e9948ea5bbcd..54864f27c87a 100644
--- a/drivers/net/ethernet/amd/pds_core/devlink.c
+++ b/drivers/net/ethernet/amd/pds_core/devlink.c
@@ -111,7 +111,8 @@ int pdsc_dl_info_get(struct devlink *dl, struct devlink_info_req *req,
 
 	mutex_lock(&pdsc->devcmd_lock);
 	err = pdsc_devcmd_locked(pdsc, &cmd, &comp, pdsc->devcmd_timeout * 2);
-	memcpy_fromio(&fw_list, pdsc->cmd_regs->data, sizeof(fw_list));
+	if (!err)
+		memcpy_fromio(&fw_list, pdsc->cmd_regs->data, sizeof(fw_list));
 	mutex_unlock(&pdsc->devcmd_lock);
 	if (err && err != -EIO)
 		return err;
diff --git a/drivers/net/ethernet/amd/pds_core/fw.c b/drivers/net/ethernet/amd/pds_core/fw.c
index 90a811f3878a..fa626719e68d 100644
--- a/drivers/net/ethernet/amd/pds_core/fw.c
+++ b/drivers/net/ethernet/amd/pds_core/fw.c
@@ -107,6 +107,9 @@ int pdsc_firmware_update(struct pdsc *pdsc, const struct firmware *fw,
 
 	dev_info(pdsc->dev, "Installing firmware\n");
 
+	if (!pdsc->cmd_regs)
+		return -ENXIO;
+
 	dl = priv_to_devlink(pdsc);
 	devlink_flash_update_status_notify(dl, "Preparing to flash",
 					   NULL, 0, 0);
diff --git a/drivers/net/ethernet/amd/pds_core/main.c b/drivers/net/ethernet/amd/pds_core/main.c
index 5172a5ad8ec6..05fdeb235e5f 100644
--- a/drivers/net/ethernet/amd/pds_core/main.c
+++ b/drivers/net/ethernet/amd/pds_core/main.c
@@ -37,6 +37,11 @@ static void pdsc_unmap_bars(struct pdsc *pdsc)
 	struct pdsc_dev_bar *bars = pdsc->bars;
 	unsigned int i;
 
+	pdsc->info_regs = NULL;
+	pdsc->cmd_regs = NULL;
+	pdsc->intr_status = NULL;
+	pdsc->intr_ctrl = NULL;
+
 	for (i = 0; i < PDS_CORE_BARS_MAX; i++) {
 		if (bars[i].vaddr)
 			pci_iounmap(pdsc->pdev, bars[i].vaddr);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 6/8] pds_core: Don't assign interrupt index/bound_intr to notifyq
  2024-01-04 17:12 [PATCH net-next 0/8] pds_core: Various improvements and AQ race condition cleanup Brett Creeley
                   ` (4 preceding siblings ...)
  2024-01-04 17:12 ` [PATCH net-next 5/8] pds_core: Clear BARs on reset Brett Creeley
@ 2024-01-04 17:12 ` Brett Creeley
  2024-01-04 17:12 ` [PATCH net-next 7/8] pds_core: Unmask adminq interrupt in work thread Brett Creeley
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Brett Creeley @ 2024-01-04 17:12 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: shannon.nelson, brett.creeley

The notifyq rides on the adminq's interrupt, so there's
no need to setup and/or access the notifyq's interrupt
index or bound_intr. The driver sets the bound_intr
using  qcq->intx = -1 for the notifyq, but luckily
nothing accesses that field for notifyq. Instead of
expecting that remains the case, just clean up
the notifyq's interrupt index and bound_intr fields.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
---
 drivers/net/ethernet/amd/pds_core/core.c    | 5 +----
 drivers/net/ethernet/amd/pds_core/debugfs.c | 3 ++-
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
index 023a59c8e425..5426ab5de66b 100644
--- a/drivers/net/ethernet/amd/pds_core/core.c
+++ b/drivers/net/ethernet/amd/pds_core/core.c
@@ -129,6 +129,7 @@ static int pdsc_qcq_intr_alloc(struct pdsc *pdsc, struct pdsc_qcq *qcq)
 	if (index < 0)
 		return index;
 	qcq->intx = index;
+	qcq->cq.bound_intr = &pdsc->intr_info[index];
 
 	return 0;
 }
@@ -222,7 +223,6 @@ int pdsc_qcq_alloc(struct pdsc *pdsc, unsigned int type, unsigned int index,
 		goto err_out_free_irq;
 	}
 
-	qcq->cq.bound_intr = &pdsc->intr_info[qcq->intx];
 	qcq->cq.num_descs = num_descs;
 	qcq->cq.desc_size = cq_desc_size;
 	qcq->cq.tail_idx = 0;
@@ -433,9 +433,6 @@ int pdsc_setup(struct pdsc *pdsc, bool init)
 	if (err)
 		goto err_out_teardown;
 
-	/* NotifyQ rides on the AdminQ interrupt */
-	pdsc->notifyqcq.intx = pdsc->adminqcq.intx;
-
 	/* Set up the Core with the AdminQ and NotifyQ info */
 	err = pdsc_core_init(pdsc);
 	if (err)
diff --git a/drivers/net/ethernet/amd/pds_core/debugfs.c b/drivers/net/ethernet/amd/pds_core/debugfs.c
index 8ec392299b7d..dd6aaeecfe0a 100644
--- a/drivers/net/ethernet/amd/pds_core/debugfs.c
+++ b/drivers/net/ethernet/amd/pds_core/debugfs.c
@@ -105,7 +105,6 @@ void pdsc_debugfs_add_qcq(struct pdsc *pdsc, struct pdsc_qcq *qcq)
 	struct dentry *qcq_dentry, *q_dentry, *cq_dentry;
 	struct dentry *intr_dentry;
 	struct debugfs_regset32 *intr_ctrl_regset;
-	struct pdsc_intr_info *intr = &pdsc->intr_info[qcq->intx];
 	struct pdsc_queue *q = &qcq->q;
 	struct pdsc_cq *cq = &qcq->cq;
 
@@ -143,6 +142,8 @@ void pdsc_debugfs_add_qcq(struct pdsc *pdsc, struct pdsc_qcq *qcq)
 	debugfs_create_u16("tail", 0400, cq_dentry, &cq->tail_idx);
 
 	if (qcq->flags & PDS_CORE_QCQ_F_INTR) {
+		struct pdsc_intr_info *intr = &pdsc->intr_info[qcq->intx];
+
 		intr_dentry = debugfs_create_dir("intr", qcq->dentry);
 		if (IS_ERR_OR_NULL(intr_dentry))
 			return;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 7/8] pds_core: Unmask adminq interrupt in work thread
  2024-01-04 17:12 [PATCH net-next 0/8] pds_core: Various improvements and AQ race condition cleanup Brett Creeley
                   ` (5 preceding siblings ...)
  2024-01-04 17:12 ` [PATCH net-next 6/8] pds_core: Don't assign interrupt index/bound_intr to notifyq Brett Creeley
@ 2024-01-04 17:12 ` Brett Creeley
  2024-01-04 17:12 ` [PATCH net-next 8/8] pds_core: Fix up some RCT issues Brett Creeley
  2024-01-05  7:59 ` [PATCH net-next 0/8] pds_core: Various improvements and AQ race condition cleanup Przemek Kitszel
  8 siblings, 0 replies; 14+ messages in thread
From: Brett Creeley @ 2024-01-04 17:12 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: shannon.nelson, brett.creeley

Unmasking the interrupt during the pdsc_adminq_isr
is a bit early and could cause unnecessary interrupts.
Instead always unmask after processing the adminq
and notifyq in pdsc_work_thread()->pdsc_process_adminq().
Also, since we are always unmasking, there's no need
for the local credits variable in pdsc_process_adminq().

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
---
 drivers/net/ethernet/amd/pds_core/adminq.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/adminq.c b/drivers/net/ethernet/amd/pds_core/adminq.c
index ea773cfa0af6..c83a0a80d533 100644
--- a/drivers/net/ethernet/amd/pds_core/adminq.c
+++ b/drivers/net/ethernet/amd/pds_core/adminq.c
@@ -82,7 +82,6 @@ void pdsc_process_adminq(struct pdsc_qcq *qcq)
 	unsigned long irqflags;
 	int nq_work = 0;
 	int aq_work = 0;
-	int credits;
 
 	/* Don't process AdminQ when it's not up */
 	if (!pdsc_adminq_inc_if_up(pdsc)) {
@@ -128,11 +127,9 @@ void pdsc_process_adminq(struct pdsc_qcq *qcq)
 
 credits:
 	/* Return the interrupt credits, one for each completion */
-	credits = nq_work + aq_work;
-	if (credits)
-		pds_core_intr_credits(&pdsc->intr_ctrl[qcq->intx],
-				      credits,
-				      PDS_CORE_INTR_CRED_REARM);
+	pds_core_intr_credits(&pdsc->intr_ctrl[qcq->intx],
+			      nq_work + aq_work,
+			      PDS_CORE_INTR_CRED_REARM);
 	refcount_dec(&pdsc->adminq_refcnt);
 }
 
@@ -157,7 +154,6 @@ irqreturn_t pdsc_adminq_isr(int irq, void *data)
 
 	qcq = &pdsc->adminqcq;
 	queue_work(pdsc->wq, &qcq->work);
-	pds_core_intr_mask(&pdsc->intr_ctrl[qcq->intx], PDS_CORE_INTR_MASK_CLEAR);
 	refcount_dec(&pdsc->adminq_refcnt);
 
 	return IRQ_HANDLED;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 8/8] pds_core: Fix up some RCT issues
  2024-01-04 17:12 [PATCH net-next 0/8] pds_core: Various improvements and AQ race condition cleanup Brett Creeley
                   ` (6 preceding siblings ...)
  2024-01-04 17:12 ` [PATCH net-next 7/8] pds_core: Unmask adminq interrupt in work thread Brett Creeley
@ 2024-01-04 17:12 ` Brett Creeley
  2024-01-05  7:59 ` [PATCH net-next 0/8] pds_core: Various improvements and AQ race condition cleanup Przemek Kitszel
  8 siblings, 0 replies; 14+ messages in thread
From: Brett Creeley @ 2024-01-04 17:12 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: shannon.nelson, brett.creeley

Running xmastree.py against the driver found some
RCT issues, so fix them.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
---
 drivers/net/ethernet/amd/pds_core/debugfs.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/debugfs.c b/drivers/net/ethernet/amd/pds_core/debugfs.c
index dd6aaeecfe0a..d56fdbb4cdb9 100644
--- a/drivers/net/ethernet/amd/pds_core/debugfs.c
+++ b/drivers/net/ethernet/amd/pds_core/debugfs.c
@@ -32,8 +32,8 @@ void pdsc_debugfs_del_dev(struct pdsc *pdsc)
 
 static int identity_show(struct seq_file *seq, void *v)
 {
-	struct pdsc *pdsc = seq->private;
 	struct pds_core_dev_identity *ident;
+	struct pdsc *pdsc = seq->private;
 	int vt;
 
 	ident = &pdsc->dev_ident;
@@ -102,8 +102,7 @@ static const struct debugfs_reg32 intr_ctrl_regs[] = {
 
 void pdsc_debugfs_add_qcq(struct pdsc *pdsc, struct pdsc_qcq *qcq)
 {
-	struct dentry *qcq_dentry, *q_dentry, *cq_dentry;
-	struct dentry *intr_dentry;
+	struct dentry *qcq_dentry, *q_dentry, *cq_dentry, *intr_dentry;
 	struct debugfs_regset32 *intr_ctrl_regset;
 	struct pdsc_queue *q = &qcq->q;
 	struct pdsc_cq *cq = &qcq->cq;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 4/8] pds_core: Prevent race issues involving the adminq
  2024-01-04 17:12 ` [PATCH net-next 4/8] pds_core: Prevent race issues involving the adminq Brett Creeley
@ 2024-01-04 19:16   ` Simon Horman
  2024-01-04 19:24     ` Brett Creeley
  2024-01-06  1:50   ` kernel test robot
  2024-01-07  1:28   ` kernel test robot
  2 siblings, 1 reply; 14+ messages in thread
From: Simon Horman @ 2024-01-04 19:16 UTC (permalink / raw)
  To: Brett Creeley
  Cc: davem, edumazet, kuba, pabeni, netdev, linux-kernel, shannon.nelson

On Thu, Jan 04, 2024 at 09:12:17AM -0800, Brett Creeley wrote:
> There are multiple paths that can result in using the pdsc's
> adminq.
> 
> [1] pdsc_adminq_isr and the resulting work from queue_work(),
>     i.e. pdsc_work_thread()->pdsc_process_adminq()
> 
> [2] pdsc_adminq_post()
> 
> When the device goes through reset via PCIe reset and/or
> a fw_down/fw_up cycle due to bad PCIe state or bad device
> state the adminq is destroyed and recreated.
> 
> A NULL pointer dereference can happen if [1] or [2] happens
> after the adminq is already destroyed.
> 
> In order to fix this, add some further state checks and
> implement reference counting for adminq uses. Reference
> counting was used because multiple threads can attempt to
> access the adminq at the same time via [1] or [2]. Additionally,
> multiple clients (i.e. pds-vfio-pci) can be using [2]
> at the same time.
> 
> The adminq_refcnt is initialized to 1 when the adminq has been
> allocated and is ready to use. Users/clients of the adminq
> (i.e. [1] and [2]) will increment the refcnt when they are using
> the adminq. When the driver goes into a fw_down cycle it will
> set the PDSC_S_FW_DEAD bit and then wait for the adminq_refcnt
> to hit 1. Setting the PDSC_S_FW_DEAD before waiting will prevent
> any further adminq_refcnt increments. Waiting for the
> adminq_refcnt to hit 1 allows for any current users of the adminq
> to finish before the driver frees the adminq. Once the
> adminq_refcnt hits 1 the driver clears the refcnt to signify that
> the adminq is deleted and cannot be used. On the fw_up cycle the
> driver will once again initialize the adminq_refcnt to 1 allowing
> the adminq to be used again.
> 
> Signed-off-by: Brett Creeley <brett.creeley@amd.com>
> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>

...

> diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
> index 0356e56a6e99..3b3e1541dd1c 100644
> --- a/drivers/net/ethernet/amd/pds_core/core.c
> +++ b/drivers/net/ethernet/amd/pds_core/core.c
> @@ -450,6 +450,7 @@ int pdsc_setup(struct pdsc *pdsc, bool init)
>  		pdsc_debugfs_add_viftype(pdsc);
>  	}
>  
> +	refcount_set(&pdsc->adminq_refcnt, 1);
>  	clear_bit(PDSC_S_FW_DEAD, &pdsc->state);
>  	return 0;
>  
> @@ -514,6 +515,24 @@ void pdsc_stop(struct pdsc *pdsc)
>  					   PDS_CORE_INTR_MASK_SET);
>  }
>  
> +void pdsc_adminq_wait_and_dec_once_unused(struct pdsc *pdsc)

Hi Brett,

a minor nit from my side: pdsc_adminq_wait_and_dec_once_unused is only used
in this file so perhaps it should be static?

> +{
> +	/* The driver initializes the adminq_refcnt to 1 when the adminq is
> +	 * allocated and ready for use. Other users/requesters will increment
> +	 * the refcnt while in use. If the refcnt is down to 1 then the adminq
> +	 * is not in use and the refcnt can be cleared and adminq freed. Before
> +	 * calling this function the driver will set PDSC_S_FW_DEAD, which
> +	 * prevent subsequent attempts to use the adminq and increment the
> +	 * refcnt to fail. This guarantees that this function will eventually
> +	 * exit.
> +	 */
> +	while (!refcount_dec_if_one(&pdsc->adminq_refcnt)) {
> +		dev_dbg_ratelimited(pdsc->dev, "%s: adminq in use\n",
> +				    __func__);
> +		cpu_relax();
> +	}
> +}
> +
>  void pdsc_fw_down(struct pdsc *pdsc)
>  {
>  	union pds_core_notifyq_comp reset_event = {
> @@ -529,6 +548,8 @@ void pdsc_fw_down(struct pdsc *pdsc)
>  	if (pdsc->pdev->is_virtfn)
>  		return;
>  
> +	pdsc_adminq_wait_and_dec_once_unused(pdsc);
> +
>  	/* Notify clients of fw_down */
>  	if (pdsc->fw_reporter)
>  		devlink_health_report(pdsc->fw_reporter, "FW down reported", pdsc);

...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 4/8] pds_core: Prevent race issues involving the adminq
  2024-01-04 19:16   ` Simon Horman
@ 2024-01-04 19:24     ` Brett Creeley
  0 siblings, 0 replies; 14+ messages in thread
From: Brett Creeley @ 2024-01-04 19:24 UTC (permalink / raw)
  To: Simon Horman, Brett Creeley
  Cc: davem, edumazet, kuba, pabeni, netdev, linux-kernel, shannon.nelson



On 1/4/2024 11:16 AM, Simon Horman wrote:
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
> 
> 
> On Thu, Jan 04, 2024 at 09:12:17AM -0800, Brett Creeley wrote:
>> There are multiple paths that can result in using the pdsc's
>> adminq.
>>
>> [1] pdsc_adminq_isr and the resulting work from queue_work(),
>>      i.e. pdsc_work_thread()->pdsc_process_adminq()
>>
>> [2] pdsc_adminq_post()
>>
>> When the device goes through reset via PCIe reset and/or
>> a fw_down/fw_up cycle due to bad PCIe state or bad device
>> state the adminq is destroyed and recreated.
>>
>> A NULL pointer dereference can happen if [1] or [2] happens
>> after the adminq is already destroyed.
>>
>> In order to fix this, add some further state checks and
>> implement reference counting for adminq uses. Reference
>> counting was used because multiple threads can attempt to
>> access the adminq at the same time via [1] or [2]. Additionally,
>> multiple clients (i.e. pds-vfio-pci) can be using [2]
>> at the same time.
>>
>> The adminq_refcnt is initialized to 1 when the adminq has been
>> allocated and is ready to use. Users/clients of the adminq
>> (i.e. [1] and [2]) will increment the refcnt when they are using
>> the adminq. When the driver goes into a fw_down cycle it will
>> set the PDSC_S_FW_DEAD bit and then wait for the adminq_refcnt
>> to hit 1. Setting the PDSC_S_FW_DEAD before waiting will prevent
>> any further adminq_refcnt increments. Waiting for the
>> adminq_refcnt to hit 1 allows for any current users of the adminq
>> to finish before the driver frees the adminq. Once the
>> adminq_refcnt hits 1 the driver clears the refcnt to signify that
>> the adminq is deleted and cannot be used. On the fw_up cycle the
>> driver will once again initialize the adminq_refcnt to 1 allowing
>> the adminq to be used again.
>>
>> Signed-off-by: Brett Creeley <brett.creeley@amd.com>
>> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
> 
> ...
> 
>> diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
>> index 0356e56a6e99..3b3e1541dd1c 100644
>> --- a/drivers/net/ethernet/amd/pds_core/core.c
>> +++ b/drivers/net/ethernet/amd/pds_core/core.c
>> @@ -450,6 +450,7 @@ int pdsc_setup(struct pdsc *pdsc, bool init)
>>                pdsc_debugfs_add_viftype(pdsc);
>>        }
>>
>> +     refcount_set(&pdsc->adminq_refcnt, 1);
>>        clear_bit(PDSC_S_FW_DEAD, &pdsc->state);
>>        return 0;
>>
>> @@ -514,6 +515,24 @@ void pdsc_stop(struct pdsc *pdsc)
>>                                           PDS_CORE_INTR_MASK_SET);
>>   }
>>
>> +void pdsc_adminq_wait_and_dec_once_unused(struct pdsc *pdsc)
> 
> Hi Brett,
> 
> a minor nit from my side: pdsc_adminq_wait_and_dec_once_unused is only used
> in this file so perhaps it should be static?

Simon,

Yep, looks like I missed that. Good catch.

Thanks,

Brett

> 
>> +{
>> +     /* The driver initializes the adminq_refcnt to 1 when the adminq is
>> +      * allocated and ready for use. Other users/requesters will increment
>> +      * the refcnt while in use. If the refcnt is down to 1 then the adminq
>> +      * is not in use and the refcnt can be cleared and adminq freed. Before
>> +      * calling this function the driver will set PDSC_S_FW_DEAD, which
>> +      * prevent subsequent attempts to use the adminq and increment the
>> +      * refcnt to fail. This guarantees that this function will eventually
>> +      * exit.
>> +      */
>> +     while (!refcount_dec_if_one(&pdsc->adminq_refcnt)) {
>> +             dev_dbg_ratelimited(pdsc->dev, "%s: adminq in use\n",
>> +                                 __func__);
>> +             cpu_relax();
>> +     }
>> +}
>> +
>>   void pdsc_fw_down(struct pdsc *pdsc)
>>   {
>>        union pds_core_notifyq_comp reset_event = {
>> @@ -529,6 +548,8 @@ void pdsc_fw_down(struct pdsc *pdsc)
>>        if (pdsc->pdev->is_virtfn)
>>                return;
>>
>> +     pdsc_adminq_wait_and_dec_once_unused(pdsc);
>> +
>>        /* Notify clients of fw_down */
>>        if (pdsc->fw_reporter)
>>                devlink_health_report(pdsc->fw_reporter, "FW down reported", pdsc);
> 
> ...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 0/8] pds_core: Various improvements and AQ race condition cleanup
  2024-01-04 17:12 [PATCH net-next 0/8] pds_core: Various improvements and AQ race condition cleanup Brett Creeley
                   ` (7 preceding siblings ...)
  2024-01-04 17:12 ` [PATCH net-next 8/8] pds_core: Fix up some RCT issues Brett Creeley
@ 2024-01-05  7:59 ` Przemek Kitszel
  8 siblings, 0 replies; 14+ messages in thread
From: Przemek Kitszel @ 2024-01-05  7:59 UTC (permalink / raw)
  To: Brett Creeley, davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: shannon.nelson

On 1/4/24 18:12, Brett Creeley wrote:
> There can be many users of the pds_core's adminq. This includes
> pds_core's uses and any clients that depend on it. When the pds_core
> device goes through a reset for any reason the adminq is freed
> and reconfigured. There are some gaps in the current implementation
> that will cause crashes during reset if any of the previously mentioned
> users of the adminq attempt to use it after it's been freed. This series
> addresses these issues.
> 
> This series also includes some general cleanups.
> 
> Brett Creeley (8):
>    pds_core: Prevent health thread from running during reset/remove
>    pds_core: Cancel AQ work on teardown
>    pds_core: Use struct pdsc for the pdsc_adminq_isr private data
>    pds_core: Prevent race issues involving the adminq
>    pds_core: Clear BARs on reset
>    pds_core: Don't assign interrupt index/bound_intr to notifyq
>    pds_core: Unmask adminq interrupt in work thread
>    pds_core: Fix up some RCT issues
> 
>   drivers/net/ethernet/amd/pds_core/adminq.c  | 74 ++++++++++++++-------
>   drivers/net/ethernet/amd/pds_core/core.c    | 38 +++++++++--
>   drivers/net/ethernet/amd/pds_core/core.h    |  1 +
>   drivers/net/ethernet/amd/pds_core/debugfs.c |  8 +--
>   drivers/net/ethernet/amd/pds_core/dev.c     |  9 ++-
>   drivers/net/ethernet/amd/pds_core/devlink.c |  3 +-
>   drivers/net/ethernet/amd/pds_core/fw.c      |  3 +
>   drivers/net/ethernet/amd/pds_core/main.c    | 24 ++++++-
>   8 files changed, 123 insertions(+), 37 deletions(-)
> 

very nice series, I didn't spot any problems with the code,
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 4/8] pds_core: Prevent race issues involving the adminq
  2024-01-04 17:12 ` [PATCH net-next 4/8] pds_core: Prevent race issues involving the adminq Brett Creeley
  2024-01-04 19:16   ` Simon Horman
@ 2024-01-06  1:50   ` kernel test robot
  2024-01-07  1:28   ` kernel test robot
  2 siblings, 0 replies; 14+ messages in thread
From: kernel test robot @ 2024-01-06  1:50 UTC (permalink / raw)
  To: Brett Creeley, davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: oe-kbuild-all, shannon.nelson, brett.creeley

Hi Brett,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Brett-Creeley/pds_core-Prevent-health-thread-from-running-during-reset-remove/20240105-011706
base:   net-next/main
patch link:    https://lore.kernel.org/r/20240104171221.31399-5-brett.creeley%40amd.com
patch subject: [PATCH net-next 4/8] pds_core: Prevent race issues involving the adminq
config: s390-allmodconfig (https://download.01.org/0day-ci/archive/20240106/202401060952.4J8S3LbP-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240106/202401060952.4J8S3LbP-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401060952.4J8S3LbP-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/net/ethernet/amd/pds_core/core.c:518:6: warning: no previous prototype for 'pdsc_adminq_wait_and_dec_once_unused' [-Wmissing-prototypes]
     518 | void pdsc_adminq_wait_and_dec_once_unused(struct pdsc *pdsc)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/pdsc_adminq_wait_and_dec_once_unused +518 drivers/net/ethernet/amd/pds_core/core.c

   517	
 > 518	void pdsc_adminq_wait_and_dec_once_unused(struct pdsc *pdsc)
   519	{
   520		/* The driver initializes the adminq_refcnt to 1 when the adminq is
   521		 * allocated and ready for use. Other users/requesters will increment
   522		 * the refcnt while in use. If the refcnt is down to 1 then the adminq
   523		 * is not in use and the refcnt can be cleared and adminq freed. Before
   524		 * calling this function the driver will set PDSC_S_FW_DEAD, which
   525		 * prevent subsequent attempts to use the adminq and increment the
   526		 * refcnt to fail. This guarantees that this function will eventually
   527		 * exit.
   528		 */
   529		while (!refcount_dec_if_one(&pdsc->adminq_refcnt)) {
   530			dev_dbg_ratelimited(pdsc->dev, "%s: adminq in use\n",
   531					    __func__);
   532			cpu_relax();
   533		}
   534	}
   535	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 4/8] pds_core: Prevent race issues involving the adminq
  2024-01-04 17:12 ` [PATCH net-next 4/8] pds_core: Prevent race issues involving the adminq Brett Creeley
  2024-01-04 19:16   ` Simon Horman
  2024-01-06  1:50   ` kernel test robot
@ 2024-01-07  1:28   ` kernel test robot
  2 siblings, 0 replies; 14+ messages in thread
From: kernel test robot @ 2024-01-07  1:28 UTC (permalink / raw)
  To: Brett Creeley, davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: llvm, oe-kbuild-all, shannon.nelson, brett.creeley

Hi Brett,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Brett-Creeley/pds_core-Prevent-health-thread-from-running-during-reset-remove/20240105-011706
base:   net-next/main
patch link:    https://lore.kernel.org/r/20240104171221.31399-5-brett.creeley%40amd.com
patch subject: [PATCH net-next 4/8] pds_core: Prevent race issues involving the adminq
config: x86_64-allmodconfig (https://download.01.org/0day-ci/archive/20240107/202401070931.tvTUY2US-lkp@intel.com/config)
compiler: ClangBuiltLinux clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240107/202401070931.tvTUY2US-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401070931.tvTUY2US-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/net/ethernet/amd/pds_core/core.c:518:6: warning: no previous prototype for function 'pdsc_adminq_wait_and_dec_once_unused' [-Wmissing-prototypes]
     518 | void pdsc_adminq_wait_and_dec_once_unused(struct pdsc *pdsc)
         |      ^
   drivers/net/ethernet/amd/pds_core/core.c:518:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
     518 | void pdsc_adminq_wait_and_dec_once_unused(struct pdsc *pdsc)
         | ^
         | static 
   1 warning generated.


vim +/pdsc_adminq_wait_and_dec_once_unused +518 drivers/net/ethernet/amd/pds_core/core.c

   517	
 > 518	void pdsc_adminq_wait_and_dec_once_unused(struct pdsc *pdsc)
   519	{
   520		/* The driver initializes the adminq_refcnt to 1 when the adminq is
   521		 * allocated and ready for use. Other users/requesters will increment
   522		 * the refcnt while in use. If the refcnt is down to 1 then the adminq
   523		 * is not in use and the refcnt can be cleared and adminq freed. Before
   524		 * calling this function the driver will set PDSC_S_FW_DEAD, which
   525		 * prevent subsequent attempts to use the adminq and increment the
   526		 * refcnt to fail. This guarantees that this function will eventually
   527		 * exit.
   528		 */
   529		while (!refcount_dec_if_one(&pdsc->adminq_refcnt)) {
   530			dev_dbg_ratelimited(pdsc->dev, "%s: adminq in use\n",
   531					    __func__);
   532			cpu_relax();
   533		}
   534	}
   535	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-01-07  1:28 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-04 17:12 [PATCH net-next 0/8] pds_core: Various improvements and AQ race condition cleanup Brett Creeley
2024-01-04 17:12 ` [PATCH net-next 1/8] pds_core: Prevent health thread from running during reset/remove Brett Creeley
2024-01-04 17:12 ` [PATCH net-next 2/8] pds_core: Cancel AQ work on teardown Brett Creeley
2024-01-04 17:12 ` [PATCH net-next 3/8] pds_core: Use struct pdsc for the pdsc_adminq_isr private data Brett Creeley
2024-01-04 17:12 ` [PATCH net-next 4/8] pds_core: Prevent race issues involving the adminq Brett Creeley
2024-01-04 19:16   ` Simon Horman
2024-01-04 19:24     ` Brett Creeley
2024-01-06  1:50   ` kernel test robot
2024-01-07  1:28   ` kernel test robot
2024-01-04 17:12 ` [PATCH net-next 5/8] pds_core: Clear BARs on reset Brett Creeley
2024-01-04 17:12 ` [PATCH net-next 6/8] pds_core: Don't assign interrupt index/bound_intr to notifyq Brett Creeley
2024-01-04 17:12 ` [PATCH net-next 7/8] pds_core: Unmask adminq interrupt in work thread Brett Creeley
2024-01-04 17:12 ` [PATCH net-next 8/8] pds_core: Fix up some RCT issues Brett Creeley
2024-01-05  7:59 ` [PATCH net-next 0/8] pds_core: Various improvements and AQ race condition cleanup Przemek Kitszel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).