linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next 0/9] net: ipa: don't disable NAPI in suspend
@ 2021-01-29 20:20 Alex Elder
  2021-01-29 20:20 ` [PATCH net-next 1/9] net: ipa: don't thaw channel if error starting Alex Elder
                   ` (8 more replies)
  0 siblings, 9 replies; 21+ messages in thread
From: Alex Elder @ 2021-01-29 20:20 UTC (permalink / raw)
  To: davem, kuba
  Cc: elder, evgreen, bjorn.andersson, cpratapa, subashab, netdev,
	linux-kernel

A few weeks ago I suggested a change that added a flag to determine
whether NAPI should be re-enabled on a channel when we're done
polling.  That change was questioned, and upon further investigation
I realized the IPA suspend path was "doing it wrong."

Currently (for newer hardware) the IPA driver suspends channels by
issuing a STOP command.  Part of the stop processing includes a
"freeze" operation, which quiesces activity, disables the I/O
completion interrupt, and disables NAPI.  But disabling NAPI is
only meant to be done when shutting down the channel; there is
no need to disable it when a channel is being stopped for suspend.

This series reworks the way channels are stopped, with the end
result being that neither NAPI nor the I/O completion interrupt is
disabled when a channel is suspended.

The first patch fixes an error handling bug in the channel starting
path.  The second patch creates a helper function to encpasulate
retrying channel stop commands.  The third also creates helper
functions, but in doing so it makes channel stop and start handling
be consistent for both "regular" stop and suspend.

The fourth patch open-codes the freeze and thaw functions as a first
step toward reworking what they do (reordering and eliminating steps).

The fifth patch makes the I/O completion interrupt get disabled
*after* a channel is stopped.  This eliminates a small race in which
the interrupt condition could occur between disabling the interrupt
and stopping the channel.  Once stopped, the channel will generate
no more I/O completion interrupts.

The sixth and seventh patches arrange for the completion interrupt
to be disabled only stopping a channel "for good", not when
suspending.  (The sixth patch just makes a small step to facilitate
review; these two could be squashed together.)

The 8th patch ensures a TX request--if initiated just before
stopping the TX queue--is included when determining whether a
a channel is quiesced for stop or suspend.

And finally the last patch implements the ultimate objective,
disabling NAPI *only* when "really" stopping a channel (not for
suspend).  Instead of disabling NAPI, a call to napi_synchronize()
ensures everything's done before we suspend.

					-Alex

Alex Elder (9):
  net: ipa: don't thaw channel if error starting
  net: ipa: introduce gsi_channel_stop_retry()
  net: ipa: introduce __gsi_channel_start()
  net: ipa: kill gsi_channel_freeze() and gsi_channel_thaw()
  net: ipa: disable IEOB interrupt after channel stop
  net: ipa: move completion interrupt enable/disable
  net: ipa: don't disable IEOB interrupt during suspend
  net: ipa: expand last transaction check
  net: ipa: don't disable NAPI in suspend

 drivers/net/ipa/gsi.c | 137 ++++++++++++++++++++++++++----------------
 1 file changed, 85 insertions(+), 52 deletions(-)

-- 
2.27.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH net-next 1/9] net: ipa: don't thaw channel if error starting
  2021-01-29 20:20 [PATCH net-next 0/9] net: ipa: don't disable NAPI in suspend Alex Elder
@ 2021-01-29 20:20 ` Alex Elder
  2021-01-29 20:20 ` [PATCH net-next 2/9] net: ipa: introduce gsi_channel_stop_retry() Alex Elder
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Alex Elder @ 2021-01-29 20:20 UTC (permalink / raw)
  To: davem, kuba
  Cc: elder, evgreen, bjorn.andersson, cpratapa, subashab, netdev,
	linux-kernel

If an error occurs starting a channel, don't "thaw" it.
We should assume the channel remains in a non-started state.

Update the comment in gsi_channel_stop(); calls to this function are
no longer retried.

Signed-off-by: Alex Elder <elder@linaro.org>
---
 drivers/net/ipa/gsi.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ipa/gsi.c b/drivers/net/ipa/gsi.c
index f79cf3c327c1c..4a3e125e898f6 100644
--- a/drivers/net/ipa/gsi.c
+++ b/drivers/net/ipa/gsi.c
@@ -885,7 +885,9 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)
 
 	mutex_unlock(&gsi->mutex);
 
-	gsi_channel_thaw(channel);
+	/* Thaw the channel if successful */
+	if (!ret)
+		gsi_channel_thaw(channel);
 
 	return ret;
 }
@@ -910,7 +912,7 @@ int gsi_channel_stop(struct gsi *gsi, u32 channel_id)
 
 	mutex_unlock(&gsi->mutex);
 
-	/* Thaw the channel if we need to retry (or on error) */
+	/* Re-thaw the channel if an error occurred while stopping */
 	if (ret)
 		gsi_channel_thaw(channel);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next 2/9] net: ipa: introduce gsi_channel_stop_retry()
  2021-01-29 20:20 [PATCH net-next 0/9] net: ipa: don't disable NAPI in suspend Alex Elder
  2021-01-29 20:20 ` [PATCH net-next 1/9] net: ipa: don't thaw channel if error starting Alex Elder
@ 2021-01-29 20:20 ` Alex Elder
  2021-01-29 20:20 ` [PATCH net-next 3/9] net: ipa: introduce __gsi_channel_start() Alex Elder
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Alex Elder @ 2021-01-29 20:20 UTC (permalink / raw)
  To: davem, kuba
  Cc: elder, evgreen, bjorn.andersson, cpratapa, subashab, netdev,
	linux-kernel

Create a new helper function that encapsulates issuing a set of
channel stop commands, retrying if appropriate, with a short delay
between attempts.

Signed-off-by: Alex Elder <elder@linaro.org>
---
 drivers/net/ipa/gsi.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ipa/gsi.c b/drivers/net/ipa/gsi.c
index 4a3e125e898f6..bd1bf388d9892 100644
--- a/drivers/net/ipa/gsi.c
+++ b/drivers/net/ipa/gsi.c
@@ -892,15 +892,12 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)
 	return ret;
 }
 
-/* Stop a started channel */
-int gsi_channel_stop(struct gsi *gsi, u32 channel_id)
+static int gsi_channel_stop_retry(struct gsi_channel *channel)
 {
-	struct gsi_channel *channel = &gsi->channel[channel_id];
 	u32 retries = GSI_CHANNEL_STOP_RETRIES;
+	struct gsi *gsi = channel->gsi;
 	int ret;
 
-	gsi_channel_freeze(channel);
-
 	mutex_lock(&gsi->mutex);
 
 	do {
@@ -912,6 +909,19 @@ int gsi_channel_stop(struct gsi *gsi, u32 channel_id)
 
 	mutex_unlock(&gsi->mutex);
 
+	return ret;
+}
+
+/* Stop a started channel */
+int gsi_channel_stop(struct gsi *gsi, u32 channel_id)
+{
+	struct gsi_channel *channel = &gsi->channel[channel_id];
+	int ret;
+
+	gsi_channel_freeze(channel);
+
+	ret = gsi_channel_stop_retry(channel);
+
 	/* Re-thaw the channel if an error occurred while stopping */
 	if (ret)
 		gsi_channel_thaw(channel);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next 3/9] net: ipa: introduce __gsi_channel_start()
  2021-01-29 20:20 [PATCH net-next 0/9] net: ipa: don't disable NAPI in suspend Alex Elder
  2021-01-29 20:20 ` [PATCH net-next 1/9] net: ipa: don't thaw channel if error starting Alex Elder
  2021-01-29 20:20 ` [PATCH net-next 2/9] net: ipa: introduce gsi_channel_stop_retry() Alex Elder
@ 2021-01-29 20:20 ` Alex Elder
  2021-01-29 20:20 ` [PATCH net-next 4/9] net: ipa: kill gsi_channel_freeze() and gsi_channel_thaw() Alex Elder
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Alex Elder @ 2021-01-29 20:20 UTC (permalink / raw)
  To: davem, kuba
  Cc: elder, evgreen, bjorn.andersson, cpratapa, subashab, netdev,
	linux-kernel

Create a new function that does most of the work of starting a
channel.  What's different is that it takes a flag indicating
whether the channel should really be stopped or not.  When doing a
"normal" channel start, the flag is true.  Create another new
function __gsi_channel_stop() that behaves similarly.

IPA v3.5.1 implements suspend using a special SUSPEND endpoint
setting.  If the endpoint is suspended when an I/O completes on the
underlying GSI channel, a SUSPEND interrupt is generated.

Newer versions of IPA do not implement the SUSPEND endpoint mode.
Instead, endpoint suspend is implemented by simply stopping the
underlying GSI channel.  In this case, an I/O completion on a
*stopped* channel causes the SUSPEND interrupt condition.

These new functions put all activity related to starting or stopping
a channel (including "thawing/freezing" the channel) in one place,
whether or not the channel is actually started or stopped.

Signed-off-by: Alex Elder <elder@linaro.org>
---
 drivers/net/ipa/gsi.c | 71 ++++++++++++++++++++++---------------------
 1 file changed, 37 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ipa/gsi.c b/drivers/net/ipa/gsi.c
index bd1bf388d9892..bba64887fe969 100644
--- a/drivers/net/ipa/gsi.c
+++ b/drivers/net/ipa/gsi.c
@@ -873,23 +873,30 @@ static void gsi_channel_deprogram(struct gsi_channel *channel)
 	/* Nothing to do */
 }
 
+static int __gsi_channel_start(struct gsi_channel *channel, bool start)
+{
+	struct gsi *gsi = channel->gsi;
+	int ret;
+
+	mutex_lock(&gsi->mutex);
+
+	ret = start ? gsi_channel_start_command(channel) : 0;
+
+	mutex_unlock(&gsi->mutex);
+
+	/* Thaw the channel if successful */
+	if (!ret)
+		gsi_channel_thaw(channel);
+
+	return ret;
+}
+
 /* Start an allocated GSI channel */
 int gsi_channel_start(struct gsi *gsi, u32 channel_id)
 {
 	struct gsi_channel *channel = &gsi->channel[channel_id];
-	int ret;
 
-	mutex_lock(&gsi->mutex);
-
-	ret = gsi_channel_start_command(channel);
-
-	mutex_unlock(&gsi->mutex);
-
-	/* Thaw the channel if successful */
-	if (!ret)
-		gsi_channel_thaw(channel);
-
-	return ret;
+	return __gsi_channel_start(channel, true);
 }
 
 static int gsi_channel_stop_retry(struct gsi_channel *channel)
@@ -912,21 +919,27 @@ static int gsi_channel_stop_retry(struct gsi_channel *channel)
 	return ret;
 }
 
+static int __gsi_channel_stop(struct gsi_channel *channel, bool stop)
+{
+	int ret;
+
+	gsi_channel_freeze(channel);
+
+	ret = stop ? gsi_channel_stop_retry(channel) : 0;
+
+	/* Re-thaw the channel if an error occurred while stopping */
+	if (ret)
+		gsi_channel_thaw(channel);
+
+	return ret;
+}
+
 /* Stop a started channel */
 int gsi_channel_stop(struct gsi *gsi, u32 channel_id)
 {
 	struct gsi_channel *channel = &gsi->channel[channel_id];
-	int ret;
 
-	gsi_channel_freeze(channel);
-
-	ret = gsi_channel_stop_retry(channel);
-
-	/* Re-thaw the channel if an error occurred while stopping */
-	if (ret)
-		gsi_channel_thaw(channel);
-
-	return ret;
+	return __gsi_channel_stop(channel, true);
 }
 
 /* Reset and reconfigure a channel, (possibly) enabling the doorbell engine */
@@ -952,12 +965,7 @@ int gsi_channel_suspend(struct gsi *gsi, u32 channel_id, bool stop)
 {
 	struct gsi_channel *channel = &gsi->channel[channel_id];
 
-	if (stop)
-		return gsi_channel_stop(gsi, channel_id);
-
-	gsi_channel_freeze(channel);
-
-	return 0;
+	return __gsi_channel_stop(channel, stop);
 }
 
 /* Resume a suspended channel (starting will be requested if STOPPED) */
@@ -965,12 +973,7 @@ int gsi_channel_resume(struct gsi *gsi, u32 channel_id, bool start)
 {
 	struct gsi_channel *channel = &gsi->channel[channel_id];
 
-	if (start)
-		return gsi_channel_start(gsi, channel_id);
-
-	gsi_channel_thaw(channel);
-
-	return 0;
+	return __gsi_channel_start(channel, start);
 }
 
 /**
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next 4/9] net: ipa: kill gsi_channel_freeze() and gsi_channel_thaw()
  2021-01-29 20:20 [PATCH net-next 0/9] net: ipa: don't disable NAPI in suspend Alex Elder
                   ` (2 preceding siblings ...)
  2021-01-29 20:20 ` [PATCH net-next 3/9] net: ipa: introduce __gsi_channel_start() Alex Elder
@ 2021-01-29 20:20 ` Alex Elder
  2021-01-29 20:20 ` [PATCH net-next 5/9] net: ipa: disable IEOB interrupt after channel stop Alex Elder
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Alex Elder @ 2021-01-29 20:20 UTC (permalink / raw)
  To: davem, kuba
  Cc: elder, evgreen, bjorn.andersson, cpratapa, subashab, netdev,
	linux-kernel

Open-code gsi_channel_freeze() and gsi_channel_thaw() in all callers
and get rid of these two functions.  This is part of reworking the
sequence of things done during channel suspend/resume and start/stop.

Signed-off-by: Alex Elder <elder@linaro.org>
---
 drivers/net/ipa/gsi.c | 37 ++++++++++++-------------------------
 1 file changed, 12 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ipa/gsi.c b/drivers/net/ipa/gsi.c
index bba64887fe969..565c785e33a25 100644
--- a/drivers/net/ipa/gsi.c
+++ b/drivers/net/ipa/gsi.c
@@ -764,24 +764,6 @@ static void gsi_channel_trans_quiesce(struct gsi_channel *channel)
 	}
 }
 
-/* Stop channel activity.  Transactions may not be allocated until thawed. */
-static void gsi_channel_freeze(struct gsi_channel *channel)
-{
-	gsi_channel_trans_quiesce(channel);
-
-	napi_disable(&channel->napi);
-
-	gsi_irq_ieob_disable_one(channel->gsi, channel->evt_ring_id);
-}
-
-/* Allow transactions to be used on the channel again. */
-static void gsi_channel_thaw(struct gsi_channel *channel)
-{
-	gsi_irq_ieob_enable_one(channel->gsi, channel->evt_ring_id);
-
-	napi_enable(&channel->napi);
-}
-
 /* Program a channel for use */
 static void gsi_channel_program(struct gsi_channel *channel, bool doorbell)
 {
@@ -884,9 +866,10 @@ static int __gsi_channel_start(struct gsi_channel *channel, bool start)
 
 	mutex_unlock(&gsi->mutex);
 
-	/* Thaw the channel if successful */
-	if (!ret)
-		gsi_channel_thaw(channel);
+	if (!ret) {
+		gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
+		napi_enable(&channel->napi);
+	}
 
 	return ret;
 }
@@ -921,15 +904,19 @@ static int gsi_channel_stop_retry(struct gsi_channel *channel)
 
 static int __gsi_channel_stop(struct gsi_channel *channel, bool stop)
 {
+	struct gsi *gsi = channel->gsi;
 	int ret;
 
-	gsi_channel_freeze(channel);
+	gsi_channel_trans_quiesce(channel);
+	napi_disable(&channel->napi);
+	gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
 
 	ret = stop ? gsi_channel_stop_retry(channel) : 0;
 
-	/* Re-thaw the channel if an error occurred while stopping */
-	if (ret)
-		gsi_channel_thaw(channel);
+	if (ret) {
+		gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
+		napi_enable(&channel->napi);
+	}
 
 	return ret;
 }
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next 5/9] net: ipa: disable IEOB interrupt after channel stop
  2021-01-29 20:20 [PATCH net-next 0/9] net: ipa: don't disable NAPI in suspend Alex Elder
                   ` (3 preceding siblings ...)
  2021-01-29 20:20 ` [PATCH net-next 4/9] net: ipa: kill gsi_channel_freeze() and gsi_channel_thaw() Alex Elder
@ 2021-01-29 20:20 ` Alex Elder
  2021-01-29 20:20 ` [PATCH net-next 6/9] net: ipa: move completion interrupt enable/disable Alex Elder
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Alex Elder @ 2021-01-29 20:20 UTC (permalink / raw)
  To: davem, kuba
  Cc: elder, evgreen, bjorn.andersson, cpratapa, subashab, netdev,
	linux-kernel

Disable the I/O completion interrupt on a channel *after* we
successfully stop it rather than before.  This ensures a completion
occurring just before the channel is stopped triggers an interrupt.

Enable the interrupt *before* starting a channel rather than after,
to be symmetric.  A stopped channel won't generate any completion
interrupts anyway.

Signed-off-by: Alex Elder <elder@linaro.org>
---
 drivers/net/ipa/gsi.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ipa/gsi.c b/drivers/net/ipa/gsi.c
index 565c785e33a25..1a02936b4db06 100644
--- a/drivers/net/ipa/gsi.c
+++ b/drivers/net/ipa/gsi.c
@@ -860,16 +860,18 @@ static int __gsi_channel_start(struct gsi_channel *channel, bool start)
 	struct gsi *gsi = channel->gsi;
 	int ret;
 
+	gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
+
 	mutex_lock(&gsi->mutex);
 
 	ret = start ? gsi_channel_start_command(channel) : 0;
 
 	mutex_unlock(&gsi->mutex);
 
-	if (!ret) {
-		gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
+	if (ret)
+		gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
+	else
 		napi_enable(&channel->napi);
-	}
 
 	return ret;
 }
@@ -909,14 +911,13 @@ static int __gsi_channel_stop(struct gsi_channel *channel, bool stop)
 
 	gsi_channel_trans_quiesce(channel);
 	napi_disable(&channel->napi);
-	gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
 
 	ret = stop ? gsi_channel_stop_retry(channel) : 0;
 
-	if (ret) {
-		gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
+	if (ret)
 		napi_enable(&channel->napi);
-	}
+	else
+		gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
 
 	return ret;
 }
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next 6/9] net: ipa: move completion interrupt enable/disable
  2021-01-29 20:20 [PATCH net-next 0/9] net: ipa: don't disable NAPI in suspend Alex Elder
                   ` (4 preceding siblings ...)
  2021-01-29 20:20 ` [PATCH net-next 5/9] net: ipa: disable IEOB interrupt after channel stop Alex Elder
@ 2021-01-29 20:20 ` Alex Elder
  2021-01-29 20:20 ` [PATCH net-next 7/9] net: ipa: don't disable IEOB interrupt during suspend Alex Elder
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Alex Elder @ 2021-01-29 20:20 UTC (permalink / raw)
  To: davem, kuba
  Cc: elder, evgreen, bjorn.andersson, cpratapa, subashab, netdev,
	linux-kernel

Move the calls to enable or disable IEOB interrupts out of
__gsi_channel_start() and __gsi_channel_stop() and into their
callers.  This is a small step to make the next patch easier
to understand.

Signed-off-by: Alex Elder <elder@linaro.org>
---
 drivers/net/ipa/gsi.c | 45 +++++++++++++++++++++++++++++++------------
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ipa/gsi.c b/drivers/net/ipa/gsi.c
index 1a02936b4db06..70647e8450845 100644
--- a/drivers/net/ipa/gsi.c
+++ b/drivers/net/ipa/gsi.c
@@ -860,17 +860,13 @@ static int __gsi_channel_start(struct gsi_channel *channel, bool start)
 	struct gsi *gsi = channel->gsi;
 	int ret;
 
-	gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
-
 	mutex_lock(&gsi->mutex);
 
 	ret = start ? gsi_channel_start_command(channel) : 0;
 
 	mutex_unlock(&gsi->mutex);
 
-	if (ret)
-		gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
-	else
+	if (!ret)
 		napi_enable(&channel->napi);
 
 	return ret;
@@ -880,8 +876,16 @@ static int __gsi_channel_start(struct gsi_channel *channel, bool start)
 int gsi_channel_start(struct gsi *gsi, u32 channel_id)
 {
 	struct gsi_channel *channel = &gsi->channel[channel_id];
+	int ret;
 
-	return __gsi_channel_start(channel, true);
+	/* Enable the completion interrupt */
+	gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
+
+	ret = __gsi_channel_start(channel, true);
+	if (ret)
+		gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
+
+	return ret;
 }
 
 static int gsi_channel_stop_retry(struct gsi_channel *channel)
@@ -906,7 +910,6 @@ static int gsi_channel_stop_retry(struct gsi_channel *channel)
 
 static int __gsi_channel_stop(struct gsi_channel *channel, bool stop)
 {
-	struct gsi *gsi = channel->gsi;
 	int ret;
 
 	gsi_channel_trans_quiesce(channel);
@@ -916,8 +919,6 @@ static int __gsi_channel_stop(struct gsi_channel *channel, bool stop)
 
 	if (ret)
 		napi_enable(&channel->napi);
-	else
-		gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
 
 	return ret;
 }
@@ -926,8 +927,14 @@ static int __gsi_channel_stop(struct gsi_channel *channel, bool stop)
 int gsi_channel_stop(struct gsi *gsi, u32 channel_id)
 {
 	struct gsi_channel *channel = &gsi->channel[channel_id];
+	int ret;
 
-	return __gsi_channel_stop(channel, true);
+	/* Only disable the completion interrupt if stop is successful */
+	ret = __gsi_channel_stop(channel, true);
+	if (!ret)
+		gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
+
+	return ret;
 }
 
 /* Reset and reconfigure a channel, (possibly) enabling the doorbell engine */
@@ -952,16 +959,30 @@ void gsi_channel_reset(struct gsi *gsi, u32 channel_id, bool doorbell)
 int gsi_channel_suspend(struct gsi *gsi, u32 channel_id, bool stop)
 {
 	struct gsi_channel *channel = &gsi->channel[channel_id];
+	int ret;
 
-	return __gsi_channel_stop(channel, stop);
+	/* No completions when suspended; disable interrupt if successful */
+	ret = __gsi_channel_stop(channel, stop);
+	if (!ret)
+		gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
+
+	return ret;
 }
 
 /* Resume a suspended channel (starting will be requested if STOPPED) */
 int gsi_channel_resume(struct gsi *gsi, u32 channel_id, bool start)
 {
 	struct gsi_channel *channel = &gsi->channel[channel_id];
+	int ret;
 
-	return __gsi_channel_start(channel, start);
+	/* Re-enable the completion interrupt */
+	gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
+
+	ret = __gsi_channel_start(channel, start);
+	if (ret)
+		gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
+
+	return ret;
 }
 
 /**
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next 7/9] net: ipa: don't disable IEOB interrupt during suspend
  2021-01-29 20:20 [PATCH net-next 0/9] net: ipa: don't disable NAPI in suspend Alex Elder
                   ` (5 preceding siblings ...)
  2021-01-29 20:20 ` [PATCH net-next 6/9] net: ipa: move completion interrupt enable/disable Alex Elder
@ 2021-01-29 20:20 ` Alex Elder
  2021-01-29 20:20 ` [PATCH net-next 8/9] net: ipa: expand last transaction check Alex Elder
  2021-01-29 20:20 ` [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend Alex Elder
  8 siblings, 0 replies; 21+ messages in thread
From: Alex Elder @ 2021-01-29 20:20 UTC (permalink / raw)
  To: davem, kuba
  Cc: elder, evgreen, bjorn.andersson, cpratapa, subashab, netdev,
	linux-kernel

No completion interrupts will occur while an endpoint is suspended,
or when a channel has been stopped for suspend.  So there's no need
to disable the interrupt during suspend and re-enable it when
resuming.

We'll enable the interrupt when we first start the channel, and
disable it again only when it's "really" stopped.

Signed-off-by: Alex Elder <elder@linaro.org>
---
 drivers/net/ipa/gsi.c | 18 ++----------------
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ipa/gsi.c b/drivers/net/ipa/gsi.c
index 70647e8450845..74d1dd04ad6e9 100644
--- a/drivers/net/ipa/gsi.c
+++ b/drivers/net/ipa/gsi.c
@@ -959,30 +959,16 @@ void gsi_channel_reset(struct gsi *gsi, u32 channel_id, bool doorbell)
 int gsi_channel_suspend(struct gsi *gsi, u32 channel_id, bool stop)
 {
 	struct gsi_channel *channel = &gsi->channel[channel_id];
-	int ret;
 
-	/* No completions when suspended; disable interrupt if successful */
-	ret = __gsi_channel_stop(channel, stop);
-	if (!ret)
-		gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
-
-	return ret;
+	return __gsi_channel_stop(channel, stop);
 }
 
 /* Resume a suspended channel (starting will be requested if STOPPED) */
 int gsi_channel_resume(struct gsi *gsi, u32 channel_id, bool start)
 {
 	struct gsi_channel *channel = &gsi->channel[channel_id];
-	int ret;
 
-	/* Re-enable the completion interrupt */
-	gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
-
-	ret = __gsi_channel_start(channel, start);
-	if (ret)
-		gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
-
-	return ret;
+	return __gsi_channel_start(channel, start);
 }
 
 /**
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next 8/9] net: ipa: expand last transaction check
  2021-01-29 20:20 [PATCH net-next 0/9] net: ipa: don't disable NAPI in suspend Alex Elder
                   ` (6 preceding siblings ...)
  2021-01-29 20:20 ` [PATCH net-next 7/9] net: ipa: don't disable IEOB interrupt during suspend Alex Elder
@ 2021-01-29 20:20 ` Alex Elder
  2021-01-29 20:20 ` [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend Alex Elder
  8 siblings, 0 replies; 21+ messages in thread
From: Alex Elder @ 2021-01-29 20:20 UTC (permalink / raw)
  To: davem, kuba
  Cc: elder, evgreen, bjorn.andersson, cpratapa, subashab, netdev,
	linux-kernel

Transactions to send data for a network device can be allocated at
any time up until the point the TX queue is stopped.  It is possible
for ipa_start_xmit() to be called in one context just before a
the transmit queue is stopped in another.

Update gsi_channel_trans_last() so that for TX channels the
allocated and pending transaction lists are checked--in addition
to the completed and polled lists--to determine the "last"
transaction.  This means any transaction that has been allocated
before the TX queue is stopped will be allowed to complete before
we conclude the channel is quiesced.

Rework the function a bit to use a list pointer and gotos.

Signed-off-by: Alex Elder <elder@linaro.org>
---
 drivers/net/ipa/gsi.c | 34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ipa/gsi.c b/drivers/net/ipa/gsi.c
index 74d1dd04ad6e9..217ca21bfe043 100644
--- a/drivers/net/ipa/gsi.c
+++ b/drivers/net/ipa/gsi.c
@@ -725,22 +725,38 @@ static void gsi_evt_ring_program(struct gsi *gsi, u32 evt_ring_id)
 	gsi_evt_ring_doorbell(gsi, evt_ring_id, 0);
 }
 
-/* Return the last (most recent) transaction completed on a channel. */
+/* Find the transaction whose completion indicates a channel is quiesced */
 static struct gsi_trans *gsi_channel_trans_last(struct gsi_channel *channel)
 {
 	struct gsi_trans_info *trans_info = &channel->trans_info;
+	const struct list_head *list;
 	struct gsi_trans *trans;
 
 	spin_lock_bh(&trans_info->spinlock);
 
-	if (!list_empty(&trans_info->complete))
-		trans = list_last_entry(&trans_info->complete,
-					struct gsi_trans, links);
-	else if (!list_empty(&trans_info->polled))
-		trans = list_last_entry(&trans_info->polled,
-					struct gsi_trans, links);
-	else
-		trans = NULL;
+	/* There is a small chance a TX transaction got allocated just
+	 * before we disabled transmits, so check for that.
+	 */
+	if (channel->toward_ipa) {
+		list = &trans_info->alloc;
+		if (!list_empty(list))
+			goto done;
+		list = &trans_info->pending;
+		if (!list_empty(list))
+			goto done;
+	}
+
+	/* Otherwise (TX or RX) we want to wait for anything that
+	 * has completed, or has been polled but not released yet.
+	 */
+	list = &trans_info->complete;
+	if (!list_empty(list))
+		goto done;
+	list = &trans_info->polled;
+	if (list_empty(list))
+		list = NULL;
+done:
+	trans = list ? list_last_entry(list, struct gsi_trans, links) : NULL;
 
 	/* Caller will wait for this, so take a reference */
 	if (trans)
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
  2021-01-29 20:20 [PATCH net-next 0/9] net: ipa: don't disable NAPI in suspend Alex Elder
                   ` (7 preceding siblings ...)
  2021-01-29 20:20 ` [PATCH net-next 8/9] net: ipa: expand last transaction check Alex Elder
@ 2021-01-29 20:20 ` Alex Elder
  2021-01-30 15:25   ` Willem de Bruijn
  8 siblings, 1 reply; 21+ messages in thread
From: Alex Elder @ 2021-01-29 20:20 UTC (permalink / raw)
  To: davem, kuba
  Cc: elder, evgreen, bjorn.andersson, cpratapa, subashab, netdev,
	linux-kernel

The channel stop and suspend paths both call __gsi_channel_stop(),
which quiesces channel activity, disables NAPI, and (on other than
SDM845) stops the channel.  Similarly, the start and resume paths
share __gsi_channel_start(), which starts the channel and re-enables
NAPI again.

Disabling NAPI should be done when stopping a channel, but this
should *not* be done when suspending.  It's not necessary in the
suspend path anyway, because the stopped channel (or suspended
endpoint on SDM845) will not cause interrupts to schedule NAPI,
and gsi_channel_trans_quiesce() won't return until there are no
more transactions to process in the NAPI polling loop.

Instead, enable NAPI in gsi_channel_start(), when the completion
interrupt is first enabled.  Disable it again in gsi_channel_stop(),
when finally disabling the interrupt.

Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure
NAPI polling is done before moving on.

Signed-off-by: Alex Elder <elder@linaro.org>
---
 drivers/net/ipa/gsi.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ipa/gsi.c b/drivers/net/ipa/gsi.c
index 217ca21bfe043..afc5c9ede01af 100644
--- a/drivers/net/ipa/gsi.c
+++ b/drivers/net/ipa/gsi.c
@@ -876,15 +876,15 @@ static int __gsi_channel_start(struct gsi_channel *channel, bool start)
 	struct gsi *gsi = channel->gsi;
 	int ret;
 
+	if (!start)
+		return 0;
+
 	mutex_lock(&gsi->mutex);
 
-	ret = start ? gsi_channel_start_command(channel) : 0;
+	ret = gsi_channel_start_command(channel);
 
 	mutex_unlock(&gsi->mutex);
 
-	if (!ret)
-		napi_enable(&channel->napi);
-
 	return ret;
 }
 
@@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)
 	struct gsi_channel *channel = &gsi->channel[channel_id];
 	int ret;
 
-	/* Enable the completion interrupt */
+	/* Enable NAPI and the completion interrupt */
+	napi_enable(&channel->napi);
 	gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
 
 	ret = __gsi_channel_start(channel, true);
-	if (ret)
-		gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
+	if (!ret)
+		return 0;
+
+	gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
+	napi_disable(&channel->napi);
 
 	return ret;
 }
@@ -928,13 +932,13 @@ static int __gsi_channel_stop(struct gsi_channel *channel, bool stop)
 {
 	int ret;
 
+	/* Wait for any underway transactions to complete before stopping. */
 	gsi_channel_trans_quiesce(channel);
-	napi_disable(&channel->napi);
 
 	ret = stop ? gsi_channel_stop_retry(channel) : 0;
-
-	if (ret)
-		napi_enable(&channel->napi);
+	/* Finally, ensure NAPI polling has finished. */
+	if (!ret)
+		napi_synchronize(&channel->napi);
 
 	return ret;
 }
@@ -947,10 +951,13 @@ int gsi_channel_stop(struct gsi *gsi, u32 channel_id)
 
 	/* Only disable the completion interrupt if stop is successful */
 	ret = __gsi_channel_stop(channel, true);
-	if (!ret)
-		gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
+	if (ret)
+		return ret;
 
-	return ret;
+	gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
+	napi_disable(&channel->napi);
+
+	return 0;
 }
 
 /* Reset and reconfigure a channel, (possibly) enabling the doorbell engine */
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
  2021-01-29 20:20 ` [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend Alex Elder
@ 2021-01-30 15:25   ` Willem de Bruijn
  2021-01-30 19:22     ` Jakub Kicinski
  2021-01-31  4:29     ` Alex Elder
  0 siblings, 2 replies; 21+ messages in thread
From: Willem de Bruijn @ 2021-01-30 15:25 UTC (permalink / raw)
  To: Alex Elder
  Cc: David Miller, Jakub Kicinski, elder, evgreen, bjorn.andersson,
	cpratapa, Subash Abhinov Kasiviswanathan, Network Development,
	LKML

On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:
>
> The channel stop and suspend paths both call __gsi_channel_stop(),
> which quiesces channel activity, disables NAPI, and (on other than
> SDM845) stops the channel.  Similarly, the start and resume paths
> share __gsi_channel_start(), which starts the channel and re-enables
> NAPI again.
>
> Disabling NAPI should be done when stopping a channel, but this
> should *not* be done when suspending.  It's not necessary in the
> suspend path anyway, because the stopped channel (or suspended
> endpoint on SDM845) will not cause interrupts to schedule NAPI,
> and gsi_channel_trans_quiesce() won't return until there are no
> more transactions to process in the NAPI polling loop.

But why is it incorrect to do so?

From a quick look, virtio-net disables on both remove and freeze, for instance.

> Instead, enable NAPI in gsi_channel_start(), when the completion
> interrupt is first enabled.  Disable it again in gsi_channel_stop(),
> when finally disabling the interrupt.
>
> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure
> NAPI polling is done before moving on.
>
> Signed-off-by: Alex Elder <elder@linaro.org>
> ---
=
> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)
>         struct gsi_channel *channel = &gsi->channel[channel_id];
>         int ret;
>
> -       /* Enable the completion interrupt */
> +       /* Enable NAPI and the completion interrupt */
> +       napi_enable(&channel->napi);
>         gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
>
>         ret = __gsi_channel_start(channel, true);
> -       if (ret)
> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
> +       if (!ret)
> +               return 0;
> +
> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
> +       napi_disable(&channel->napi);
>
>         return ret;
>  }

subjective, but easier to parse when the normal control flow is linear
and the error path takes a branch (or goto, if reused).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
  2021-01-30 15:25   ` Willem de Bruijn
@ 2021-01-30 19:22     ` Jakub Kicinski
  2021-01-31  4:30       ` Alex Elder
  2021-01-31  4:29     ` Alex Elder
  1 sibling, 1 reply; 21+ messages in thread
From: Jakub Kicinski @ 2021-01-30 19:22 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Alex Elder, David Miller, elder, evgreen, bjorn.andersson,
	cpratapa, Subash Abhinov Kasiviswanathan, Network Development,
	LKML

On Sat, 30 Jan 2021 10:25:16 -0500 Willem de Bruijn wrote:
> > @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)
> >         struct gsi_channel *channel = &gsi->channel[channel_id];
> >         int ret;
> >
> > -       /* Enable the completion interrupt */
> > +       /* Enable NAPI and the completion interrupt */
> > +       napi_enable(&channel->napi);
> >         gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
> >
> >         ret = __gsi_channel_start(channel, true);
> > -       if (ret)
> > -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
> > +       if (!ret)
> > +               return 0;
> > +
> > +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
> > +       napi_disable(&channel->napi);
> >
> >         return ret;
> >  }  
> 
> subjective, but easier to parse when the normal control flow is linear
> and the error path takes a branch (or goto, if reused).

FWIW - fully agreed, I always thought this was part of the kernel
coding style.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
  2021-01-30 15:25   ` Willem de Bruijn
  2021-01-30 19:22     ` Jakub Kicinski
@ 2021-01-31  4:29     ` Alex Elder
  2021-01-31 14:52       ` Willem de Bruijn
  1 sibling, 1 reply; 21+ messages in thread
From: Alex Elder @ 2021-01-31  4:29 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Miller, Jakub Kicinski, elder, evgreen, bjorn.andersson,
	cpratapa, Subash Abhinov Kasiviswanathan, Network Development,
	LKML

On 1/30/21 9:25 AM, Willem de Bruijn wrote:
> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:
>>
>> The channel stop and suspend paths both call __gsi_channel_stop(),
>> which quiesces channel activity, disables NAPI, and (on other than
>> SDM845) stops the channel.  Similarly, the start and resume paths
>> share __gsi_channel_start(), which starts the channel and re-enables
>> NAPI again.
>>
>> Disabling NAPI should be done when stopping a channel, but this
>> should *not* be done when suspending.  It's not necessary in the
>> suspend path anyway, because the stopped channel (or suspended
>> endpoint on SDM845) will not cause interrupts to schedule NAPI,
>> and gsi_channel_trans_quiesce() won't return until there are no
>> more transactions to process in the NAPI polling loop.
> 
> But why is it incorrect to do so?

Maybe it's not; I also thought it was fine before, but...

Someone at Qualcomm asked me why I thought NAPI needed
to be disabled on suspend.  My response was basically
that it was a lightweight operation, and it shouldn't
really be a problem to do so.

Then, when I posted two patches last month, Jakub's
response told me he didn't understand why I was doing
what I was doing, and I stepped back to reconsider
the details of what was happening at suspend time.
 
https://lore.kernel.org/netdev/20210107183803.47308e23@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

Four things were happening to suspend a channel:
quiesce activity; disable interrupt; disable NAPI;
and stop the channel.  It occurred to me that a
stopped channel would not generate interrupts, so if
the channel was stopped earlier there would be no need
to disable the interrupt.  Similarly there would be
(essentially) no need to disable NAPI once a channel
was stopped.

Underlying all of this is that I started chasing a
hang that was occurring on suspend over a month ago.
It was hard to reproduce (hundreds or thousands of
suspend/resume cycles without hitting it), and one
of the few times I actually hit the problem it was
stuck in napi_disable(), apparently waiting for
NAPI_STATE_SCHED to get cleared by napi_complete().

My best guess about how this could occur was if there
were a race of some kind between the interrupt handler
(scheduling NAPI) and the poll function (completing
it).  I found a number of problems while looking
at this, and in the past few weeks I've posted some
fixes to improve things.  Still, even with some of
these fixes in place we have seen a hang (but now
even more rarely).

So this grand rework of suspending/stopping channels
is an attempt to resolve this hang on suspend.

The channel is now stopped early, and once stopped,
everything that completed prior to the channel being
stopped is polled before considering the suspend
function done.  A stopped channel won't interrupt,
so we don't bother disabling the completion interrupt,
with no interrupts, NAPI won't be scheduled, so there's
no need to disable NAPI either.

The net result is simpler, and seems logical, and
should preclude any possible race between the interrupt
handler and poll function.  I'm trying to solve the
hang problem analytically, because it takes *so* long
to reproduce.

I'm open to other suggestions.

					-Alex

>  From a quick look, virtio-net disables on both remove and freeze, for instance.
> 
>> Instead, enable NAPI in gsi_channel_start(), when the completion
>> interrupt is first enabled.  Disable it again in gsi_channel_stop(),
>> when finally disabling the interrupt.
>>
>> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure
>> NAPI polling is done before moving on.
>>
>> Signed-off-by: Alex Elder <elder@linaro.org>
>> ---
> =
>> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)
>>          struct gsi_channel *channel = &gsi->channel[channel_id];
>>          int ret;
>>
>> -       /* Enable the completion interrupt */
>> +       /* Enable NAPI and the completion interrupt */
>> +       napi_enable(&channel->napi);
>>          gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
>>
>>          ret = __gsi_channel_start(channel, true);
>> -       if (ret)
>> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
>> +       if (!ret)
>> +               return 0;
>> +
>> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
>> +       napi_disable(&channel->napi);
>>
>>          return ret;
>>   }
> 
> subjective, but easier to parse when the normal control flow is linear
> and the error path takes a branch (or goto, if reused).
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
  2021-01-30 19:22     ` Jakub Kicinski
@ 2021-01-31  4:30       ` Alex Elder
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Elder @ 2021-01-31  4:30 UTC (permalink / raw)
  To: Jakub Kicinski, Willem de Bruijn
  Cc: David Miller, elder, evgreen, bjorn.andersson, cpratapa,
	Subash Abhinov Kasiviswanathan, Network Development, LKML

On 1/30/21 1:22 PM, Jakub Kicinski wrote:
> On Sat, 30 Jan 2021 10:25:16 -0500 Willem de Bruijn wrote:
>>> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)
>>>          struct gsi_channel *channel = &gsi->channel[channel_id];
>>>          int ret;
>>>
>>> -       /* Enable the completion interrupt */
>>> +       /* Enable NAPI and the completion interrupt */
>>> +       napi_enable(&channel->napi);
>>>          gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
>>>
>>>          ret = __gsi_channel_start(channel, true);
>>> -       if (ret)
>>> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
>>> +       if (!ret)
>>> +               return 0;
>>> +
>>> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
>>> +       napi_disable(&channel->napi);
>>>
>>>          return ret;
>>>   }
>>
>> subjective, but easier to parse when the normal control flow is linear
>> and the error path takes a branch (or goto, if reused).
> 
> FWIW - fully agreed, I always thought this was part of the kernel
> coding style.

That's fine.  I will post a v2 of the series to fix this up.
I'll wait a bit (maybe until Monday morning), in case there's
any other input on the series to address.

Thanks both of you for your comments.

					-Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
  2021-01-31  4:29     ` Alex Elder
@ 2021-01-31 14:52       ` Willem de Bruijn
  2021-01-31 15:32         ` Alex Elder
  0 siblings, 1 reply; 21+ messages in thread
From: Willem de Bruijn @ 2021-01-31 14:52 UTC (permalink / raw)
  To: Alex Elder
  Cc: David Miller, Jakub Kicinski, elder, evgreen, bjorn.andersson,
	cpratapa, Subash Abhinov Kasiviswanathan, Network Development,
	LKML

On Sat, Jan 30, 2021 at 11:29 PM Alex Elder <elder@linaro.org> wrote:
>
> On 1/30/21 9:25 AM, Willem de Bruijn wrote:
> > On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:
> >>
> >> The channel stop and suspend paths both call __gsi_channel_stop(),
> >> which quiesces channel activity, disables NAPI, and (on other than
> >> SDM845) stops the channel.  Similarly, the start and resume paths
> >> share __gsi_channel_start(), which starts the channel and re-enables
> >> NAPI again.
> >>
> >> Disabling NAPI should be done when stopping a channel, but this
> >> should *not* be done when suspending.  It's not necessary in the
> >> suspend path anyway, because the stopped channel (or suspended
> >> endpoint on SDM845) will not cause interrupts to schedule NAPI,
> >> and gsi_channel_trans_quiesce() won't return until there are no
> >> more transactions to process in the NAPI polling loop.
> >
> > But why is it incorrect to do so?
>
> Maybe it's not; I also thought it was fine before, but...
>
> Someone at Qualcomm asked me why I thought NAPI needed
> to be disabled on suspend.  My response was basically
> that it was a lightweight operation, and it shouldn't
> really be a problem to do so.
>
> Then, when I posted two patches last month, Jakub's
> response told me he didn't understand why I was doing
> what I was doing, and I stepped back to reconsider
> the details of what was happening at suspend time.
>
> https://lore.kernel.org/netdev/20210107183803.47308e23@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
>
> Four things were happening to suspend a channel:
> quiesce activity; disable interrupt; disable NAPI;
> and stop the channel.  It occurred to me that a
> stopped channel would not generate interrupts, so if
> the channel was stopped earlier there would be no need
> to disable the interrupt.  Similarly there would be
> (essentially) no need to disable NAPI once a channel
> was stopped.
>
> Underlying all of this is that I started chasing a
> hang that was occurring on suspend over a month ago.
> It was hard to reproduce (hundreds or thousands of
> suspend/resume cycles without hitting it), and one
> of the few times I actually hit the problem it was
> stuck in napi_disable(), apparently waiting for
> NAPI_STATE_SCHED to get cleared by napi_complete().

This is important information.

What exactly do you mean by hang?

>
> My best guess about how this could occur was if there
> were a race of some kind between the interrupt handler
> (scheduling NAPI) and the poll function (completing
> it).  I found a number of problems while looking
> at this, and in the past few weeks I've posted some
> fixes to improve things.  Still, even with some of
> these fixes in place we have seen a hang (but now
> even more rarely).
>
> So this grand rework of suspending/stopping channels
> is an attempt to resolve this hang on suspend.

Do you have any data that this patchset resolves the issue, or is it
too hard to reproduce to say anything?

> The channel is now stopped early, and once stopped,
> everything that completed prior to the channel being
> stopped is polled before considering the suspend
> function done.

Does the call to gsi_channel_trans_quiesce before
gsi_channel_stop_retry leave a race where new transactions may occur
until state GSI_CHANNEL_STATE_STOPPED is reached? Asking without truly
knowing the details of this device.

> A stopped channel won't interrupt,
> so we don't bother disabling the completion interrupt,
> with no interrupts, NAPI won't be scheduled, so there's
> no need to disable NAPI either.

That sounds plausible. But it doesn't explain why napi_disable "should
*not* be done when suspending" as the commit states.

Arguably, leaving that won't have much effect either way, and is in
line with other drivers.

Your previous patchset mentions "When stopping a channel, the IPA
driver currently disables NAPI before disabling the interrupt." That
would no longer be the case.

> The net result is simpler, and seems logical, and
> should preclude any possible race between the interrupt
> handler and poll function.  I'm trying to solve the
> hang problem analytically, because it takes *so* long
> to reproduce.
>
> I'm open to other suggestions.
>
>                                         -Alex
>
> >  From a quick look, virtio-net disables on both remove and freeze, for instance.
> >
> >> Instead, enable NAPI in gsi_channel_start(), when the completion
> >> interrupt is first enabled.  Disable it again in gsi_channel_stop(),
> >> when finally disabling the interrupt.
> >>
> >> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure
> >> NAPI polling is done before moving on.
> >>
> >> Signed-off-by: Alex Elder <elder@linaro.org>
> >> ---
> > =
> >> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)
> >>          struct gsi_channel *channel = &gsi->channel[channel_id];
> >>          int ret;
> >>
> >> -       /* Enable the completion interrupt */
> >> +       /* Enable NAPI and the completion interrupt */
> >> +       napi_enable(&channel->napi);
> >>          gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
> >>
> >>          ret = __gsi_channel_start(channel, true);
> >> -       if (ret)
> >> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
> >> +       if (!ret)
> >> +               return 0;
> >> +
> >> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
> >> +       napi_disable(&channel->napi);
> >>
> >>          return ret;
> >>   }
> >
> > subjective, but easier to parse when the normal control flow is linear
> > and the error path takes a branch (or goto, if reused).
> >
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
  2021-01-31 14:52       ` Willem de Bruijn
@ 2021-01-31 15:32         ` Alex Elder
  2021-02-01  1:36           ` Willem de Bruijn
  0 siblings, 1 reply; 21+ messages in thread
From: Alex Elder @ 2021-01-31 15:32 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Miller, Jakub Kicinski, elder, evgreen, bjorn.andersson,
	cpratapa, Subash Abhinov Kasiviswanathan, Network Development,
	LKML

On 1/31/21 8:52 AM, Willem de Bruijn wrote:
> On Sat, Jan 30, 2021 at 11:29 PM Alex Elder <elder@linaro.org> wrote:
>>
>> On 1/30/21 9:25 AM, Willem de Bruijn wrote:
>>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:
>>>>
>>>> The channel stop and suspend paths both call __gsi_channel_stop(),
>>>> which quiesces channel activity, disables NAPI, and (on other than
>>>> SDM845) stops the channel.  Similarly, the start and resume paths
>>>> share __gsi_channel_start(), which starts the channel and re-enables
>>>> NAPI again.
>>>>
>>>> Disabling NAPI should be done when stopping a channel, but this
>>>> should *not* be done when suspending.  It's not necessary in the
>>>> suspend path anyway, because the stopped channel (or suspended
>>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,
>>>> and gsi_channel_trans_quiesce() won't return until there are no
>>>> more transactions to process in the NAPI polling loop.
>>>
>>> But why is it incorrect to do so?
>>
>> Maybe it's not; I also thought it was fine before, but...
>>
>> Someone at Qualcomm asked me why I thought NAPI needed
>> to be disabled on suspend.  My response was basically
>> that it was a lightweight operation, and it shouldn't
>> really be a problem to do so.
>>
>> Then, when I posted two patches last month, Jakub's
>> response told me he didn't understand why I was doing
>> what I was doing, and I stepped back to reconsider
>> the details of what was happening at suspend time.
>>
>> https://lore.kernel.org/netdev/20210107183803.47308e23@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
>>
>> Four things were happening to suspend a channel:
>> quiesce activity; disable interrupt; disable NAPI;
>> and stop the channel.  It occurred to me that a
>> stopped channel would not generate interrupts, so if
>> the channel was stopped earlier there would be no need
>> to disable the interrupt.  Similarly there would be
>> (essentially) no need to disable NAPI once a channel
>> was stopped.
>>
>> Underlying all of this is that I started chasing a
>> hang that was occurring on suspend over a month ago.
>> It was hard to reproduce (hundreds or thousands of
>> suspend/resume cycles without hitting it), and one
>> of the few times I actually hit the problem it was
>> stuck in napi_disable(), apparently waiting for
>> NAPI_STATE_SCHED to get cleared by napi_complete().
> 
> This is important information.
> 
> What exactly do you mean by hang?

Yes it's important!  Unfortunately I was not able to
gather details about the problem in all the cases where
it occurred.  But in at least one case I *did* confirm
it was in the situation described above.

What I mean by "hang" is that the system simply stopped
on its way down, and the IPA ->suspend callback never
completed (stuck in napi_disable).  So I expect that
the SCHED flag was never going to get cleared (because
of a race, presumably).

>> My best guess about how this could occur was if there
>> were a race of some kind between the interrupt handler
>> (scheduling NAPI) and the poll function (completing
>> it).  I found a number of problems while looking
>> at this, and in the past few weeks I've posted some
>> fixes to improve things.  Still, even with some of
>> these fixes in place we have seen a hang (but now
>> even more rarely).
>>
>> So this grand rework of suspending/stopping channels
>> is an attempt to resolve this hang on suspend.
> 
> Do you have any data that this patchset resolves the issue, or is it
> too hard to reproduce to say anything?

The data I have is that I have been running for weeks
with tens of thousands of iterations with this patch
(and the rest of them) without any hang.  Unfortunately
that doesn't guarantee anything.  I contemplated trying
to "catch" the problem and report that it *would have*
occurred had the fix not been in place, but I haven't
tried that (in part because it might not be easy).

So...  Too hard to reproduce, but I have evidence that
my testing so far has never reproduced the hang.

>> The channel is now stopped early, and once stopped,
>> everything that completed prior to the channel being
>> stopped is polled before considering the suspend
>> function done.
> 
> Does the call to gsi_channel_trans_quiesce before
> gsi_channel_stop_retry leave a race where new transactions may occur
> until state GSI_CHANNEL_STATE_STOPPED is reached? Asking without truly
> knowing the details of this device.

It should not.  For TX endpoints that have a net device, new
requests will have been stopped earlier by netif_stop_queue()
(in ipa_modem_suspend()).  For RX endpoints, receive buffers
are replenished to the hardware, but we stop that earlier
as well, in ipa_endpoint_suspend_one().  So the quiesce call
is meant to figure out what the last submitted request was
for an endpoint (channel), and then wait for that to complete.

The "hang" occurs on an RX endpoint, and in particular it
occurs on an endpoint that we *know* will be receiving a
packet as part of the suspend process (when clearing the
hardware pipeline).  I can go into that further but won't'
unless asked.

>> A stopped channel won't interrupt,
>> so we don't bother disabling the completion interrupt,
>> with no interrupts, NAPI won't be scheduled, so there's
>> no need to disable NAPI either.
> 
> That sounds plausible. But it doesn't explain why napi_disable "should
> *not* be done when suspending" as the commit states.
> 
> Arguably, leaving that won't have much effect either way, and is in
> line with other drivers.

Understood and agreed.  In fact, if the hang occurrs in
napi_disable() when waiting for NAPI_STATE_SCHED to clear,
it would occur in napi_synchronize() as well.  At this point
it's more about the whole set of rework here, and keeping
NAPI enabled during suspend seems a little cleaner.

See my followup message, about Jakub's assertion that NAPI
assumes the device will be *reset* when NAPI is disabled.
(I'm not convinced NAPI assumes that, but that doesn't matter.)
In any case, the IPA hardware does *not* reset channels when
suspended.

> Your previous patchset mentions "When stopping a channel, the IPA
> driver currently disables NAPI before disabling the interrupt." That
> would no longer be the case.

I'm not sure which patch you're referring to (and I'm in
a hurry at the moment).  But yes, with this patch we would
only disable NAPI when "really" stopping the channel, not
when suspending it.  And we'd similarly be no longer
disabling the completion interrupt on suspend either.

Thanks a lot, I appreciate the help and input on this.

					-Alex

>> The net result is simpler, and seems logical, and
>> should preclude any possible race between the interrupt
>> handler and poll function.  I'm trying to solve the
>> hang problem analytically, because it takes *so* long
>> to reproduce.
>>
>> I'm open to other suggestions.
>>
>>                                         -Alex
>>
>>>  From a quick look, virtio-net disables on both remove and freeze, for instance.
>>>
>>>> Instead, enable NAPI in gsi_channel_start(), when the completion
>>>> interrupt is first enabled.  Disable it again in gsi_channel_stop(),
>>>> when finally disabling the interrupt.
>>>>
>>>> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure
>>>> NAPI polling is done before moving on.
>>>>
>>>> Signed-off-by: Alex Elder <elder@linaro.org>
>>>> ---
>>> =
>>>> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)
>>>>          struct gsi_channel *channel = &gsi->channel[channel_id];
>>>>          int ret;
>>>>
>>>> -       /* Enable the completion interrupt */
>>>> +       /* Enable NAPI and the completion interrupt */
>>>> +       napi_enable(&channel->napi);
>>>>          gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
>>>>
>>>>          ret = __gsi_channel_start(channel, true);
>>>> -       if (ret)
>>>> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
>>>> +       if (!ret)
>>>> +               return 0;
>>>> +
>>>> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
>>>> +       napi_disable(&channel->napi);
>>>>
>>>>          return ret;
>>>>   }
>>>
>>> subjective, but easier to parse when the normal control flow is linear
>>> and the error path takes a branch (or goto, if reused).
>>>
>>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
  2021-01-31 15:32         ` Alex Elder
@ 2021-02-01  1:36           ` Willem de Bruijn
  2021-02-01 14:35             ` Alex Elder
  0 siblings, 1 reply; 21+ messages in thread
From: Willem de Bruijn @ 2021-02-01  1:36 UTC (permalink / raw)
  To: Alex Elder
  Cc: David Miller, Jakub Kicinski, elder, evgreen, bjorn.andersson,
	cpratapa, Subash Abhinov Kasiviswanathan, Network Development,
	LKML

On Sun, Jan 31, 2021 at 10:32 AM Alex Elder <elder@linaro.org> wrote:
>
> On 1/31/21 8:52 AM, Willem de Bruijn wrote:
> > On Sat, Jan 30, 2021 at 11:29 PM Alex Elder <elder@linaro.org> wrote:
> >>
> >> On 1/30/21 9:25 AM, Willem de Bruijn wrote:
> >>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:
> >>>>
> >>>> The channel stop and suspend paths both call __gsi_channel_stop(),
> >>>> which quiesces channel activity, disables NAPI, and (on other than
> >>>> SDM845) stops the channel.  Similarly, the start and resume paths
> >>>> share __gsi_channel_start(), which starts the channel and re-enables
> >>>> NAPI again.
> >>>>
> >>>> Disabling NAPI should be done when stopping a channel, but this
> >>>> should *not* be done when suspending.  It's not necessary in the
> >>>> suspend path anyway, because the stopped channel (or suspended
> >>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,
> >>>> and gsi_channel_trans_quiesce() won't return until there are no
> >>>> more transactions to process in the NAPI polling loop.
> >>>
> >>> But why is it incorrect to do so?
> >>
> >> Maybe it's not; I also thought it was fine before, but...
> >>
> >> Someone at Qualcomm asked me why I thought NAPI needed
> >> to be disabled on suspend.  My response was basically
> >> that it was a lightweight operation, and it shouldn't
> >> really be a problem to do so.
> >>
> >> Then, when I posted two patches last month, Jakub's
> >> response told me he didn't understand why I was doing
> >> what I was doing, and I stepped back to reconsider
> >> the details of what was happening at suspend time.
> >>
> >> https://lore.kernel.org/netdev/20210107183803.47308e23@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
> >>
> >> Four things were happening to suspend a channel:
> >> quiesce activity; disable interrupt; disable NAPI;
> >> and stop the channel.  It occurred to me that a
> >> stopped channel would not generate interrupts, so if
> >> the channel was stopped earlier there would be no need
> >> to disable the interrupt.  Similarly there would be
> >> (essentially) no need to disable NAPI once a channel
> >> was stopped.
> >>
> >> Underlying all of this is that I started chasing a
> >> hang that was occurring on suspend over a month ago.
> >> It was hard to reproduce (hundreds or thousands of
> >> suspend/resume cycles without hitting it), and one
> >> of the few times I actually hit the problem it was
> >> stuck in napi_disable(), apparently waiting for
> >> NAPI_STATE_SCHED to get cleared by napi_complete().
> >
> > This is important information.
> >
> > What exactly do you mean by hang?
>
> Yes it's important!  Unfortunately I was not able to
> gather details about the problem in all the cases where
> it occurred.  But in at least one case I *did* confirm
> it was in the situation described above.
>
> What I mean by "hang" is that the system simply stopped
> on its way down, and the IPA ->suspend callback never
> completed (stuck in napi_disable).  So I expect that
> the SCHED flag was never going to get cleared (because
> of a race, presumably).
>
> >> My best guess about how this could occur was if there
> >> were a race of some kind between the interrupt handler
> >> (scheduling NAPI) and the poll function (completing
> >> it).  I found a number of problems while looking
> >> at this, and in the past few weeks I've posted some
> >> fixes to improve things.  Still, even with some of
> >> these fixes in place we have seen a hang (but now
> >> even more rarely).
> >>
> >> So this grand rework of suspending/stopping channels
> >> is an attempt to resolve this hang on suspend.
> >
> > Do you have any data that this patchset resolves the issue, or is it
> > too hard to reproduce to say anything?
>
> The data I have is that I have been running for weeks
> with tens of thousands of iterations with this patch
> (and the rest of them) without any hang.  Unfortunately
> that doesn't guarantee anything.  I contemplated trying
> to "catch" the problem and report that it *would have*
> occurred had the fix not been in place, but I haven't
> tried that (in part because it might not be easy).
>
> So...  Too hard to reproduce, but I have evidence that
> my testing so far has never reproduced the hang.
>
> >> The channel is now stopped early, and once stopped,
> >> everything that completed prior to the channel being
> >> stopped is polled before considering the suspend
> >> function done.
> >
> > Does the call to gsi_channel_trans_quiesce before
> > gsi_channel_stop_retry leave a race where new transactions may occur
> > until state GSI_CHANNEL_STATE_STOPPED is reached? Asking without truly
> > knowing the details of this device.
>
> It should not.  For TX endpoints that have a net device, new
> requests will have been stopped earlier by netif_stop_queue()
> (in ipa_modem_suspend()).  For RX endpoints, receive buffers
> are replenished to the hardware, but we stop that earlier
> as well, in ipa_endpoint_suspend_one().  So the quiesce call
> is meant to figure out what the last submitted request was
> for an endpoint (channel), and then wait for that to complete.
>
> The "hang" occurs on an RX endpoint, and in particular it
> occurs on an endpoint that we *know* will be receiving a
> packet as part of the suspend process (when clearing the
> hardware pipeline).  I can go into that further but won't'
> unless asked.
>
> >> A stopped channel won't interrupt,
> >> so we don't bother disabling the completion interrupt,
> >> with no interrupts, NAPI won't be scheduled, so there's
> >> no need to disable NAPI either.
> >
> > That sounds plausible. But it doesn't explain why napi_disable "should
> > *not* be done when suspending" as the commit states.
> >
> > Arguably, leaving that won't have much effect either way, and is in
> > line with other drivers.
>
> Understood and agreed.  In fact, if the hang occurrs in
> napi_disable() when waiting for NAPI_STATE_SCHED to clear,
> it would occur in napi_synchronize() as well.

Agreed.

So you have an environment to test a patch in, it might be worthwhile
to test essentially the same logic reordering as in this patch set,
but while still disabling napi.

The disappearing race may be due to another change rather than
napi_disable vs napi_synchronize. A smaller, more targeted patch could
also be a net (instead of net-next) candidate.

> At this point
> it's more about the whole set of rework here, and keeping
> NAPI enabled during suspend seems a little cleaner.

I'm not sure. I haven't looked if there is a common behavior across
devices. That might be informative. igb, for one, releases all
resources.

> See my followup message, about Jakub's assertion that NAPI
> assumes the device will be *reset* when NAPI is disabled.
> (I'm not convinced NAPI assumes that, but that doesn't matter.)
> In any case, the IPA hardware does *not* reset channels when
> suspended.
>
> > Your previous patchset mentions "When stopping a channel, the IPA
> > driver currently disables NAPI before disabling the interrupt." That
> > would no longer be the case.
>
> I'm not sure which patch you're referring to (and I'm in
> a hurry at the moment).  But yes, with this patch we would
> only disable NAPI when "really" stopping the channel, not
> when suspending it.  And we'd similarly be no longer
> disabling the completion interrupt on suspend either.
>
> Thanks a lot, I appreciate the help and input on this.
>
>                                         -Alex
>
> >> The net result is simpler, and seems logical, and
> >> should preclude any possible race between the interrupt
> >> handler and poll function.  I'm trying to solve the
> >> hang problem analytically, because it takes *so* long
> >> to reproduce.
> >>
> >> I'm open to other suggestions.
> >>
> >>                                         -Alex
> >>
> >>>  From a quick look, virtio-net disables on both remove and freeze, for instance.
> >>>
> >>>> Instead, enable NAPI in gsi_channel_start(), when the completion
> >>>> interrupt is first enabled.  Disable it again in gsi_channel_stop(),
> >>>> when finally disabling the interrupt.
> >>>>
> >>>> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure
> >>>> NAPI polling is done before moving on.
> >>>>
> >>>> Signed-off-by: Alex Elder <elder@linaro.org>
> >>>> ---
> >>> =
> >>>> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)
> >>>>          struct gsi_channel *channel = &gsi->channel[channel_id];
> >>>>          int ret;
> >>>>
> >>>> -       /* Enable the completion interrupt */
> >>>> +       /* Enable NAPI and the completion interrupt */
> >>>> +       napi_enable(&channel->napi);
> >>>>          gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
> >>>>
> >>>>          ret = __gsi_channel_start(channel, true);
> >>>> -       if (ret)
> >>>> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
> >>>> +       if (!ret)
> >>>> +               return 0;
> >>>> +
> >>>> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
> >>>> +       napi_disable(&channel->napi);
> >>>>
> >>>>          return ret;
> >>>>   }
> >>>
> >>> subjective, but easier to parse when the normal control flow is linear
> >>> and the error path takes a branch (or goto, if reused).
> >>>
> >>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
  2021-02-01  1:36           ` Willem de Bruijn
@ 2021-02-01 14:35             ` Alex Elder
  2021-02-01 14:47               ` Willem de Bruijn
  0 siblings, 1 reply; 21+ messages in thread
From: Alex Elder @ 2021-02-01 14:35 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Miller, Jakub Kicinski, elder, evgreen, bjorn.andersson,
	cpratapa, Subash Abhinov Kasiviswanathan, Network Development,
	LKML

On 1/31/21 7:36 PM, Willem de Bruijn wrote:
> On Sun, Jan 31, 2021 at 10:32 AM Alex Elder <elder@linaro.org> wrote:
>>
>> On 1/31/21 8:52 AM, Willem de Bruijn wrote:
>>> On Sat, Jan 30, 2021 at 11:29 PM Alex Elder <elder@linaro.org> wrote:
>>>>
>>>> On 1/30/21 9:25 AM, Willem de Bruijn wrote:
>>>>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:
>>>>>>
>>>>>> The channel stop and suspend paths both call __gsi_channel_stop(),
>>>>>> which quiesces channel activity, disables NAPI, and (on other than
>>>>>> SDM845) stops the channel.  Similarly, the start and resume paths
>>>>>> share __gsi_channel_start(), which starts the channel and re-enables
>>>>>> NAPI again.
>>>>>>
>>>>>> Disabling NAPI should be done when stopping a channel, but this
>>>>>> should *not* be done when suspending.  It's not necessary in the
>>>>>> suspend path anyway, because the stopped channel (or suspended
>>>>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,
>>>>>> and gsi_channel_trans_quiesce() won't return until there are no
>>>>>> more transactions to process in the NAPI polling loop.
>>>>>
>>>>> But why is it incorrect to do so?
>>>>
>>>> Maybe it's not; I also thought it was fine before, but...

. . .

>> The "hang" occurs on an RX endpoint, and in particular it
>> occurs on an endpoint that we *know* will be receiving a
>> packet as part of the suspend process (when clearing the
>> hardware pipeline).  I can go into that further but won't'
>> unless asked.
>>
>>>> A stopped channel won't interrupt,
>>>> so we don't bother disabling the completion interrupt,
>>>> with no interrupts, NAPI won't be scheduled, so there's
>>>> no need to disable NAPI either.
>>>
>>> That sounds plausible. But it doesn't explain why napi_disable "should
>>> *not* be done when suspending" as the commit states.
>>>
>>> Arguably, leaving that won't have much effect either way, and is in
>>> line with other drivers.
>>
>> Understood and agreed.  In fact, if the hang occurrs in
>> napi_disable() when waiting for NAPI_STATE_SCHED to clear,
>> it would occur in napi_synchronize() as well.
> 
> Agreed.
> 
> So you have an environment to test a patch in, it might be worthwhile
> to test essentially the same logic reordering as in this patch set,
> but while still disabling napi.

What is the purpose of this test?  Just to guarantee
that the NAPI hang goes away?  Because you agree that
the napi_schedule() call would *also* hang if that
problem exists, right?

Anyway, what you're suggesting is to simply test with
this last patch removed.  I can do that but I really
don't expect it to change anything.  I will start that
test later today when I'm turning my attention to
something else for a while.

> The disappearing race may be due to another change rather than
> napi_disable vs napi_synchronize. A smaller, more targeted patch could
> also be a net (instead of net-next) candidate.

I am certain it is.

I can tell you that we have seen a hang (after I think 2500+
suspend/resume cycles) with the IPA code that is currently
upstream.

But with this latest series of 9, there is no hang after
10,000+ cycles.  That gives me a bisect window, but I really
don't want to go through a full bisect of even those 9,
because it's 4 tests, each of which takes days to complete.

Looking at the 9 patches, I think this one is the most
likely culprit:
   net: ipa: disable IEOB interrupt after channel stop

I think the race involves the I/O completion handler
interacting with NAPI in an unwanted way, but I have
not come up with the exact sequence that would lead
to getting stuck in napi_disable().

Here are some possible events that could occur on an
RX channel in *some* order, prior to that patch.  And
in the order I show there's at least a problem of a
receive not being processed immediately.

		. . . (suspend initiated)

	replenish_stop()
	quiesce()
			IRQ fires (receive ready)
	napi_disable()
			napi_schedule() (ignored)
	irq_disable()
			IRQ condition; pending
	channel_stop()

		. . . (resume triggered)

	channel_start()
	irq_enable()
			pending IRQ fires
			napi_schedule() (ignored)
	napi_enable()

		. . . (suspend initiated)

>> At this point
>> it's more about the whole set of rework here, and keeping
>> NAPI enabled during suspend seems a little cleaner.
> 
> I'm not sure. I haven't looked if there is a common behavior across
> devices. That might be informative. igb, for one, releases all
> resources.

I tried to do a survey of that too and did not see a
consistent pattern.  I didn't look *that* hard because
doing so would be more involved than I wanted to get.

So in summary:
- I'm putting together version 2 of this series now
- Testing this past week seems to show that this series
   makes the hang in napi_disable() (or synchronize)
   go away.
- I think the most likely patch in this series that
   fixes the problem is the IRQ ordering one I mention
   above, but right now I can't cite a specific sequence
   of events that would prove it.
- I will begin some long testing later today without
   this last patch applied
     --> But I think testing without the IRQ ordering
	patch would be more promising, and I'd like
	to hear your opinion on that

Thanks again for your input and help on this.

					-Alex

. . .

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
  2021-02-01 14:35             ` Alex Elder
@ 2021-02-01 14:47               ` Willem de Bruijn
  2021-02-01 15:48                 ` Alex Elder
  2021-02-01 18:38                 ` Alex Elder
  0 siblings, 2 replies; 21+ messages in thread
From: Willem de Bruijn @ 2021-02-01 14:47 UTC (permalink / raw)
  To: Alex Elder
  Cc: David Miller, Jakub Kicinski, elder, evgreen, bjorn.andersson,
	cpratapa, Subash Abhinov Kasiviswanathan, Network Development,
	LKML

On Mon, Feb 1, 2021 at 9:35 AM Alex Elder <elder@linaro.org> wrote:
>
> On 1/31/21 7:36 PM, Willem de Bruijn wrote:
> > On Sun, Jan 31, 2021 at 10:32 AM Alex Elder <elder@linaro.org> wrote:
> >>
> >> On 1/31/21 8:52 AM, Willem de Bruijn wrote:
> >>> On Sat, Jan 30, 2021 at 11:29 PM Alex Elder <elder@linaro.org> wrote:
> >>>>
> >>>> On 1/30/21 9:25 AM, Willem de Bruijn wrote:
> >>>>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:
> >>>>>>
> >>>>>> The channel stop and suspend paths both call __gsi_channel_stop(),
> >>>>>> which quiesces channel activity, disables NAPI, and (on other than
> >>>>>> SDM845) stops the channel.  Similarly, the start and resume paths
> >>>>>> share __gsi_channel_start(), which starts the channel and re-enables
> >>>>>> NAPI again.
> >>>>>>
> >>>>>> Disabling NAPI should be done when stopping a channel, but this
> >>>>>> should *not* be done when suspending.  It's not necessary in the
> >>>>>> suspend path anyway, because the stopped channel (or suspended
> >>>>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,
> >>>>>> and gsi_channel_trans_quiesce() won't return until there are no
> >>>>>> more transactions to process in the NAPI polling loop.
> >>>>>
> >>>>> But why is it incorrect to do so?
> >>>>
> >>>> Maybe it's not; I also thought it was fine before, but...
>
> . . .
>
> >> The "hang" occurs on an RX endpoint, and in particular it
> >> occurs on an endpoint that we *know* will be receiving a
> >> packet as part of the suspend process (when clearing the
> >> hardware pipeline).  I can go into that further but won't'
> >> unless asked.
> >>
> >>>> A stopped channel won't interrupt,
> >>>> so we don't bother disabling the completion interrupt,
> >>>> with no interrupts, NAPI won't be scheduled, so there's
> >>>> no need to disable NAPI either.
> >>>
> >>> That sounds plausible. But it doesn't explain why napi_disable "should
> >>> *not* be done when suspending" as the commit states.
> >>>
> >>> Arguably, leaving that won't have much effect either way, and is in
> >>> line with other drivers.
> >>
> >> Understood and agreed.  In fact, if the hang occurrs in
> >> napi_disable() when waiting for NAPI_STATE_SCHED to clear,
> >> it would occur in napi_synchronize() as well.
> >
> > Agreed.
> >
> > So you have an environment to test a patch in, it might be worthwhile
> > to test essentially the same logic reordering as in this patch set,
> > but while still disabling napi.
>
> What is the purpose of this test?  Just to guarantee
> that the NAPI hang goes away?  Because you agree that
> the napi_schedule() call would *also* hang if that
> problem exists, right?
>
> Anyway, what you're suggesting is to simply test with
> this last patch removed.  I can do that but I really
> don't expect it to change anything.  I will start that
> test later today when I'm turning my attention to
> something else for a while.
>
> > The disappearing race may be due to another change rather than
> > napi_disable vs napi_synchronize. A smaller, more targeted patch could
> > also be a net (instead of net-next) candidate.
>
> I am certain it is.
>
> I can tell you that we have seen a hang (after I think 2500+
> suspend/resume cycles) with the IPA code that is currently
> upstream.
>
> But with this latest series of 9, there is no hang after
> 10,000+ cycles.  That gives me a bisect window, but I really
> don't want to go through a full bisect of even those 9,
> because it's 4 tests, each of which takes days to complete.
>
> Looking at the 9 patches, I think this one is the most
> likely culprit:
>    net: ipa: disable IEOB interrupt after channel stop
>
> I think the race involves the I/O completion handler
> interacting with NAPI in an unwanted way, but I have
> not come up with the exact sequence that would lead
> to getting stuck in napi_disable().
>
> Here are some possible events that could occur on an
> RX channel in *some* order, prior to that patch.  And
> in the order I show there's at least a problem of a
> receive not being processed immediately.
>
>                 . . . (suspend initiated)
>
>         replenish_stop()
>         quiesce()
>                         IRQ fires (receive ready)
>         napi_disable()
>                         napi_schedule() (ignored)
>         irq_disable()
>                         IRQ condition; pending
>         channel_stop()
>
>                 . . . (resume triggered)
>
>         channel_start()
>         irq_enable()
>                         pending IRQ fires
>                         napi_schedule() (ignored)
>         napi_enable()
>
>                 . . . (suspend initiated)
>
> >> At this point
> >> it's more about the whole set of rework here, and keeping
> >> NAPI enabled during suspend seems a little cleaner.
> >
> > I'm not sure. I haven't looked if there is a common behavior across
> > devices. That might be informative. igb, for one, releases all
> > resources.
>
> I tried to do a survey of that too and did not see a
> consistent pattern.  I didn't look *that* hard because
> doing so would be more involved than I wanted to get.

Okay. If there is no consistent pattern, either approach works.

I'm fine with this patchset as is.

> So in summary:
> - I'm putting together version 2 of this series now
> - Testing this past week seems to show that this series
>    makes the hang in napi_disable() (or synchronize)
>    go away.
> - I think the most likely patch in this series that
>    fixes the problem is the IRQ ordering one I mention
>    above, but right now I can't cite a specific sequence
>    of events that would prove it.
> - I will begin some long testing later today without
>    this last patch applied
>      --> But I think testing without the IRQ ordering
>         patch would be more promising, and I'd like
>         to hear your opinion on that

Either test depends on whether you find it worthwhile to more
specifically identify the root cause. If not, no need to run the tests
on my behalf. I understand that these are time consuming.

>
> Thanks again for your input and help on this.
>
>                                         -Alex
>
> . . .

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
  2021-02-01 14:47               ` Willem de Bruijn
@ 2021-02-01 15:48                 ` Alex Elder
  2021-02-01 18:38                 ` Alex Elder
  1 sibling, 0 replies; 21+ messages in thread
From: Alex Elder @ 2021-02-01 15:48 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Miller, Jakub Kicinski, elder, evgreen, bjorn.andersson,
	cpratapa, Subash Abhinov Kasiviswanathan, Network Development,
	LKML

On 2/1/21 8:47 AM, Willem de Bruijn wrote:
>> I tried to do a survey of that too and did not see a
>> consistent pattern.  I didn't look*that*  hard because
>> doing so would be more involved than I wanted to get.
> Okay. If there is no consistent pattern, either approach works.
> 
> I'm fine with this patchset as is.

OK.  I'm sending version 2 of the series to address the
comment about returning success early--before the end
of the function.  I'm also tweaking one or two patch
descriptions.

As I looked at the series I saw some other things I'd
like to do a little differently.  But to keep things
simple I'm going to do that in a follow-on series
instead.

Thank you very much for your time on this.

					-Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
  2021-02-01 14:47               ` Willem de Bruijn
  2021-02-01 15:48                 ` Alex Elder
@ 2021-02-01 18:38                 ` Alex Elder
  1 sibling, 0 replies; 21+ messages in thread
From: Alex Elder @ 2021-02-01 18:38 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Miller, Jakub Kicinski, elder, evgreen, bjorn.andersson,
	cpratapa, Subash Abhinov Kasiviswanathan, Network Development,
	LKML

On 2/1/21 8:47 AM, Willem de Bruijn wrote:
>> - I will begin some long testing later today without
>>     this last patch applied
>>       --> But I think testing without the IRQ ordering
>>          patch would be more promising, and I'd like
>>          to hear your opinion on that
> Either test depends on whether you find it worthwhile to more
> specifically identify the root cause. If not, no need to run the tests
> on my behalf. I understand that these are time consuming.

I *would* like to really understand the root cause.
And a month ago I would have absolutely gone through
a bisect process so I could narrow it down.

But at this point I'm content to have it fixed (fingers
crossed) and am more interested in putting all this
behind me.  It's taken a lot of time and attention.

And even though this represents a real bug, at least
for now I am content to keep all this work in net-next
rather than isolate the specific problem and try to
back-port it (without all the other 20+ commits that
preceded these) to the net branch.

So although I see the value in identifying the specific
root cause, I'm not going to do that unless/until it
becomes clear it *must* be done (to back-port to net).

					-Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-02-01 18:41 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-29 20:20 [PATCH net-next 0/9] net: ipa: don't disable NAPI in suspend Alex Elder
2021-01-29 20:20 ` [PATCH net-next 1/9] net: ipa: don't thaw channel if error starting Alex Elder
2021-01-29 20:20 ` [PATCH net-next 2/9] net: ipa: introduce gsi_channel_stop_retry() Alex Elder
2021-01-29 20:20 ` [PATCH net-next 3/9] net: ipa: introduce __gsi_channel_start() Alex Elder
2021-01-29 20:20 ` [PATCH net-next 4/9] net: ipa: kill gsi_channel_freeze() and gsi_channel_thaw() Alex Elder
2021-01-29 20:20 ` [PATCH net-next 5/9] net: ipa: disable IEOB interrupt after channel stop Alex Elder
2021-01-29 20:20 ` [PATCH net-next 6/9] net: ipa: move completion interrupt enable/disable Alex Elder
2021-01-29 20:20 ` [PATCH net-next 7/9] net: ipa: don't disable IEOB interrupt during suspend Alex Elder
2021-01-29 20:20 ` [PATCH net-next 8/9] net: ipa: expand last transaction check Alex Elder
2021-01-29 20:20 ` [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend Alex Elder
2021-01-30 15:25   ` Willem de Bruijn
2021-01-30 19:22     ` Jakub Kicinski
2021-01-31  4:30       ` Alex Elder
2021-01-31  4:29     ` Alex Elder
2021-01-31 14:52       ` Willem de Bruijn
2021-01-31 15:32         ` Alex Elder
2021-02-01  1:36           ` Willem de Bruijn
2021-02-01 14:35             ` Alex Elder
2021-02-01 14:47               ` Willem de Bruijn
2021-02-01 15:48                 ` Alex Elder
2021-02-01 18:38                 ` Alex Elder

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).