[PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
@ 2020-03-26 10:43 Felix Riemann
  2020-03-26 13:40 ` Kurt Van Dijck
  0 siblings, 1 reply; 17+ messages in thread
From: Felix Riemann @ 2020-03-26 10:43 UTC (permalink / raw)
  To: linux-can
  Cc: Marc Kleine-Budde, Wolfgang Grandegger, Felix Riemann, Andre Kalb

There are appears to be a race condition where interrupts caused
by bus offs get lost if it occurs while interrupts are disabled
or being re-enabled.

This tries to avoid the deadlock by rescheduling the NAPI worker
to handle the bus-off condition.

Signed-off-by: Felix Riemann <felix.riemann@sma.de>
Reviewed-by: Andre Kalb <andre.kalb@sma.de>
---
 drivers/net/can/c_can/c_can.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/net/can/c_can/c_can.c b/drivers/net/can/c_can/c_can.c
index 8e9f5620c9a2..0952ac0b9123 100644
--- a/drivers/net/can/c_can/c_can.c
+++ b/drivers/net/can/c_can/c_can.c
@@ -1123,8 +1123,25 @@ static int c_can_poll(struct napi_struct *napi, int quota)
 	if (work_done < quota) {
 		napi_complete_done(napi, work_done);
 		/* enable all IRQs if we are not in bus off state */
-		if (priv->can.state != CAN_STATE_BUS_OFF)
+		if (priv->can.state != CAN_STATE_BUS_OFF) {
+			u32 ctrl;
 			c_can_irq_control(priv, true);
+
+			/* There appears to be a race condition when the device
+			 * enters bus off while interrupts are off or being
+			 * re-enabled causing the bus off to get lost.
+			 * This tries to avoid this condition.
+			 */
+			ctrl = priv->read_reg(priv, C_CAN_CTRL_REG);
+
+			if (ctrl & CONTROL_INIT) {
+				netdev_warn(dev, "lost bus off\n");
+				c_can_irq_control(priv, false);
+				/* Reschedule worker to handle bus off */
+				atomic_set(&priv->sie_pending, 1);
+				napi_reschedule(napi);
+			}
+		}
 	}
 
 	return work_done;
-- 
2.26.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-03-26 10:43 [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled Felix Riemann
@ 2020-03-26 13:40 ` Kurt Van Dijck
  2020-03-26 14:49   ` Felix Riemann
  0 siblings, 1 reply; 17+ messages in thread
From: Kurt Van Dijck @ 2020-03-26 13:40 UTC (permalink / raw)
  To: Felix Riemann
  Cc: linux-can, Marc Kleine-Budde, Wolfgang Grandegger, Andre Kalb

This patch looks very familiar.
I tried this 6 months ago, when I got a setup where I could reproduce
this within minutes.

This patch didn't solve my problem, and I fixed my problem with reading the
ISR register only once in the IRQ, and _not_ 2nd time in napi handler.
Reading the ISR in C_CAN is consuming the info, so unless the there is
more state to clear (such as received message buffers), the interrupt
cause is assumed to be known, and the C_CAN proceeds to the next
interrupt cause.
I've not experienced a single problem since, where before the fix, I'd
experience a few occurances every month.
We have an am3358 (beaglebone-like hardware).

I believe I submitted a patch for that.

Re-scheduling the napi is a workaround, IMHO.

Kurt

On do, 26 mrt 2020 11:43:18 +0100, Felix Riemann wrote:
> There are appears to be a race condition where interrupts caused
> by bus offs get lost if it occurs while interrupts are disabled
> or being re-enabled.
> 
> This tries to avoid the deadlock by rescheduling the NAPI worker
> to handle the bus-off condition.
> 
> Signed-off-by: Felix Riemann <felix.riemann@sma.de>
> Reviewed-by: Andre Kalb <andre.kalb@sma.de>
> ---
>  drivers/net/can/c_can/c_can.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/can/c_can/c_can.c b/drivers/net/can/c_can/c_can.c
> index 8e9f5620c9a2..0952ac0b9123 100644
> --- a/drivers/net/can/c_can/c_can.c
> +++ b/drivers/net/can/c_can/c_can.c
> @@ -1123,8 +1123,25 @@ static int c_can_poll(struct napi_struct *napi, int quota)
>  	if (work_done < quota) {
>  		napi_complete_done(napi, work_done);
>  		/* enable all IRQs if we are not in bus off state */
> -		if (priv->can.state != CAN_STATE_BUS_OFF)
> +		if (priv->can.state != CAN_STATE_BUS_OFF) {
> +			u32 ctrl;
>  			c_can_irq_control(priv, true);
> +
> +			/* There appears to be a race condition when the device
> +			 * enters bus off while interrupts are off or being
> +			 * re-enabled causing the bus off to get lost.
> +			 * This tries to avoid this condition.
> +			 */
> +			ctrl = priv->read_reg(priv, C_CAN_CTRL_REG);
> +
> +			if (ctrl & CONTROL_INIT) {
> +				netdev_warn(dev, "lost bus off\n");
> +				c_can_irq_control(priv, false);
> +				/* Reschedule worker to handle bus off */
> +				atomic_set(&priv->sie_pending, 1);
> +				napi_reschedule(napi);
> +			}
> +		}
>  	}
>  
>  	return work_done;
> -- 
> 2.26.0
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-03-26 13:40 ` Kurt Van Dijck
@ 2020-03-26 14:49   ` Felix Riemann
  2020-03-26 19:38     ` Kurt Van Dijck
  0 siblings, 1 reply; 17+ messages in thread
From: Felix Riemann @ 2020-03-26 14:49 UTC (permalink / raw)
  To: Kurt Van Dijck
  Cc: linux-can, Marc Kleine-Budde, Wolfgang Grandegger, Andre Kalb

Hi Kurt,

yes that would be 3cb3eaac52c0f145d895f4b6c22834d5f02b8569.

I tried your patch as well, since the description matched the problem pretty well. However, I could still produce the hang on our board (based on AM3352) especially if the bus off occurs while the bus load is very high.

What I could see was the worker being triggered by the interrupt with the bus being in passive mode. When the worker then re-enabled the interrupt the hardware had already entered init mode due to the bus-off but nor would it trigger again nor would the interrupt register indicate a pending IRQ. The status register also indicates an error (LEC != 0x7).

A first attempt to fix/workaround this placed the check just before the worker re-enables the interrupts. But even this setup could reproduce the problem, although it took longer to do so.  So the timing around enabling the interrupts seems to matter here. Actually, adding a printout at the top of the worker where your patch checks the value of the sie_pending atomic seems to increase the chances of making it happen on my board, but that's just a feeling I got while cleaning up my patch.

Regards,

Felix

-----Original Message-----
From: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be>
Sent: Thursday, March 26, 2020 2:40 PM

This patch looks very familiar.
I tried this 6 months ago, when I got a setup where I could reproduce this within minutes.

This patch didn't solve my problem, and I fixed my problem with reading the ISR register only once in the IRQ, and _not_ 2nd time in napi handler.
Reading the ISR in C_CAN is consuming the info, so unless the there is more state to clear (such as received message buffers), the interrupt cause is assumed to be known, and the C_CAN proceeds to the next interrupt cause.
I've not experienced a single problem since, where before the fix, I'd experience a few occurances every month.
We have an am3358 (beaglebone-like hardware).

I believe I submitted a patch for that.

Re-scheduling the napi is a workaround, IMHO.

Kurt

On do, 26 mrt 2020 11:43:18 +0100, Felix Riemann wrote:
> There are appears to be a race condition where interrupts caused by
> bus offs get lost if it occurs while interrupts are disabled or being
> re-enabled.
>
> This tries to avoid the deadlock by rescheduling the NAPI worker to
> handle the bus-off condition.
>
> Signed-off-by: Felix Riemann <felix.riemann@sma.de>
> Reviewed-by: Andre Kalb <andre.kalb@sma.de>
> ---
>  drivers/net/can/c_can/c_can.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/can/c_can/c_can.c
> b/drivers/net/can/c_can/c_can.c index 8e9f5620c9a2..0952ac0b9123
> 100644
> --- a/drivers/net/can/c_can/c_can.c
> +++ b/drivers/net/can/c_can/c_can.c
> @@ -1123,8 +1123,25 @@ static int c_can_poll(struct napi_struct *napi, int quota)
>  if (work_done < quota) {
>  napi_complete_done(napi, work_done);
>  /* enable all IRQs if we are not in bus off state */
> -if (priv->can.state != CAN_STATE_BUS_OFF)
> +if (priv->can.state != CAN_STATE_BUS_OFF) {
> +u32 ctrl;
>  c_can_irq_control(priv, true);
> +
> +/* There appears to be a race condition when the device
> + * enters bus off while interrupts are off or being
> + * re-enabled causing the bus off to get lost.
> + * This tries to avoid this condition.
> + */
> +ctrl = priv->read_reg(priv, C_CAN_CTRL_REG);
> +
> +if (ctrl & CONTROL_INIT) {
> +netdev_warn(dev, "lost bus off\n");
> +c_can_irq_control(priv, false);
> +/* Reschedule worker to handle bus off */
> +atomic_set(&priv->sie_pending, 1);
> +napi_reschedule(napi);
> +}
> +}
>  }
>
>  return work_done;
> --
> 2.26.0
>
___________________________________________________

SMA Solar Technology AG
Aufsichtsrat: Dr. Erik Ehrentraut (Vorsitzender)
Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
Handelsregister: Amtsgericht Kassel HRB 3972
Sitz der Gesellschaft: 34266 Niestetal
USt-ID-Nr. DE 113 08 59 54
WEEE-Reg.-Nr. DE 95881150
___________________________________________________

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-03-26 14:49   ` Felix Riemann
@ 2020-03-26 19:38     ` Kurt Van Dijck
  2020-03-27 10:09       ` Felix Riemann
  0 siblings, 1 reply; 17+ messages in thread
From: Kurt Van Dijck @ 2020-03-26 19:38 UTC (permalink / raw)
  To: Felix Riemann
  Cc: linux-can, Marc Kleine-Budde, Wolfgang Grandegger, Andre Kalb

Interesting topic.

After the bus-off problem, I also attacked the overflow problems. No
idea why this patch series got stalled. I activated the complete 64
message buffers of the D_CAN hardware (only 64 of the 128 being present
in the IP for some reason I don't know) and modified the IRQ to use
rx-offload. That may affect the circumstances under load, I admit.

Whould you mind trying that series, just for reference, in your setup?
If your hypothesis is right, my observations do not survive your
busload.

Kind regards,
Kurt

On do, 26 mrt 2020 14:49:23 +0000, Felix Riemann wrote:
> Hi Kurt,
> 
> yes that would be 3cb3eaac52c0f145d895f4b6c22834d5f02b8569.
> 
> I tried your patch as well, since the description matched the problem pretty well. However, I could still produce the hang on our board (based on AM3352) especially if the bus off occurs while the bus load is very high.
> 
> What I could see was the worker being triggered by the interrupt with the bus being in passive mode. When the worker then re-enabled the interrupt the hardware had already entered init mode due to the bus-off but nor would it trigger again nor would the interrupt register indicate a pending IRQ. The status register also indicates an error (LEC != 0x7).
> 
> A first attempt to fix/workaround this placed the check just before the worker re-enables the interrupts. But even this setup could reproduce the problem, although it took longer to do so.  So the timing around enabling the interrupts seems to matter here. Actually, adding a printout at the top of the worker where your patch checks the value of the sie_pending atomic seems to increase the chances of making it happen on my board, but that's just a feeling I got while cleaning up my patch.
> 
> Regards,
> 
> Felix
> 
> -----Original Message-----
> From: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be>
> Sent: Thursday, March 26, 2020 2:40 PM
> 
> This patch looks very familiar.
> I tried this 6 months ago, when I got a setup where I could reproduce this within minutes.
> 
> This patch didn't solve my problem, and I fixed my problem with reading the ISR register only once in the IRQ, and _not_ 2nd time in napi handler.
> Reading the ISR in C_CAN is consuming the info, so unless the there is more state to clear (such as received message buffers), the interrupt cause is assumed to be known, and the C_CAN proceeds to the next interrupt cause.
> I've not experienced a single problem since, where before the fix, I'd experience a few occurances every month.
> We have an am3358 (beaglebone-like hardware).
> 
> I believe I submitted a patch for that.
> 
> Re-scheduling the napi is a workaround, IMHO.
> 
> Kurt
> 
> On do, 26 mrt 2020 11:43:18 +0100, Felix Riemann wrote:
> > There are appears to be a race condition where interrupts caused by
> > bus offs get lost if it occurs while interrupts are disabled or being
> > re-enabled.
> >
> > This tries to avoid the deadlock by rescheduling the NAPI worker to
> > handle the bus-off condition.
> >
> > Signed-off-by: Felix Riemann <felix.riemann@sma.de>
> > Reviewed-by: Andre Kalb <andre.kalb@sma.de>
> > ---
> >  drivers/net/can/c_can/c_can.c | 19 ++++++++++++++++++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/can/c_can/c_can.c
> > b/drivers/net/can/c_can/c_can.c index 8e9f5620c9a2..0952ac0b9123
> > 100644
> > --- a/drivers/net/can/c_can/c_can.c
> > +++ b/drivers/net/can/c_can/c_can.c
> > @@ -1123,8 +1123,25 @@ static int c_can_poll(struct napi_struct *napi, int quota)
> >  if (work_done < quota) {
> >  napi_complete_done(napi, work_done);
> >  /* enable all IRQs if we are not in bus off state */
> > -if (priv->can.state != CAN_STATE_BUS_OFF)
> > +if (priv->can.state != CAN_STATE_BUS_OFF) {
> > +u32 ctrl;
> >  c_can_irq_control(priv, true);
> > +
> > +/* There appears to be a race condition when the device
> > + * enters bus off while interrupts are off or being
> > + * re-enabled causing the bus off to get lost.
> > + * This tries to avoid this condition.
> > + */
> > +ctrl = priv->read_reg(priv, C_CAN_CTRL_REG);
> > +
> > +if (ctrl & CONTROL_INIT) {
> > +netdev_warn(dev, "lost bus off\n");
> > +c_can_irq_control(priv, false);
> > +/* Reschedule worker to handle bus off */
> > +atomic_set(&priv->sie_pending, 1);
> > +napi_reschedule(napi);
> > +}
> > +}
> >  }
> >
> >  return work_done;
> > --
> > 2.26.0
> >
> ___________________________________________________
> 
> SMA Solar Technology AG
> Aufsichtsrat: Dr. Erik Ehrentraut (Vorsitzender)
> Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
> Handelsregister: Amtsgericht Kassel HRB 3972
> Sitz der Gesellschaft: 34266 Niestetal
> USt-ID-Nr. DE 113 08 59 54
> WEEE-Reg.-Nr. DE 95881150
> ___________________________________________________

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-03-26 19:38     ` Kurt Van Dijck
@ 2020-03-27 10:09       ` Felix Riemann
  2020-03-30  7:02         ` Kurt Van Dijck
  0 siblings, 1 reply; 17+ messages in thread
From: Felix Riemann @ 2020-03-27 10:09 UTC (permalink / raw)
  To: Kurt Van Dijck
  Cc: linux-can, Marc Kleine-Budde, Wolfgang Grandegger, Andre Kalb

Is it this one?
https://marc.info/?l=linux-can&m=157683579518282

It seems to be missing patch 6 in the series.

Regards,
Felix

-----Original Message-----
From: Kurt Van Dijck
Sent: Thursday, March 26, 2020 8:39 PM

Interesting topic.

After the bus-off problem, I also attacked the overflow problems. No
idea why this patch series got stalled. I activated the complete 64
message buffers of the D_CAN hardware (only 64 of the 128 being present
in the IP for some reason I don't know) and modified the IRQ to use
rx-offload. That may affect the circumstances under load, I admit.

Whould you mind trying that series, just for reference, in your setup?
If your hypothesis is right, my observations do not survive your
busload.

Kind regards,
Kurt

On do, 26 mrt 2020 14:49:23 +0000, Felix Riemann wrote:
> Hi Kurt,
>
> yes that would be 3cb3eaac52c0f145d895f4b6c22834d5f02b8569.
>
> I tried your patch as well, since the description matched the problem pretty well. However, I could still produce the hang on our board (based on AM3352) especially if the bus off occurs while the bus load is very high.
>
> What I could see was the worker being triggered by the interrupt with the bus being in passive mode. When the worker then re-enabled the interrupt the hardware had already entered init mode due to the bus-off but nor would it trigger again nor would the interrupt register indicate a pending IRQ. The status register also indicates an error (LEC != 0x7).
>
> A first attempt to fix/workaround this placed the check just before the worker re-enables the interrupts. But even this setup could reproduce the problem, although it took longer to do so.  So the timing around enabling the interrupts seems to matter here. Actually, adding a printout at the top of the worker where your patch checks the value of the sie_pending atomic seems to increase the chances of making it happen on my board, but that's just a feeling I got while cleaning up my patch.
>
> Regards,
>
> Felix
>
> -----Original Message-----
> From: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be>
> Sent: Thursday, March 26, 2020 2:40 PM
>
> This patch looks very familiar.
> I tried this 6 months ago, when I got a setup where I could reproduce this within minutes.
>
> This patch didn't solve my problem, and I fixed my problem with reading the ISR register only once in the IRQ, and _not_ 2nd time in napi handler.
> Reading the ISR in C_CAN is consuming the info, so unless the there is more state to clear (such as received message buffers), the interrupt cause is assumed to be known, and the C_CAN proceeds to the next interrupt cause.
> I've not experienced a single problem since, where before the fix, I'd experience a few occurances every month.
> We have an am3358 (beaglebone-like hardware).
>
> I believe I submitted a patch for that.
>
> Re-scheduling the napi is a workaround, IMHO.
>
> Kurt
>
> On do, 26 mrt 2020 11:43:18 +0100, Felix Riemann wrote:
> > There are appears to be a race condition where interrupts caused by
> > bus offs get lost if it occurs while interrupts are disabled or being
> > re-enabled.
> >
> > This tries to avoid the deadlock by rescheduling the NAPI worker to
> > handle the bus-off condition.
> >
> > Signed-off-by: Felix Riemann <felix.riemann@sma.de>
> > Reviewed-by: Andre Kalb <andre.kalb@sma.de>
> > ---
> >  drivers/net/can/c_can/c_can.c | 19 ++++++++++++++++++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/can/c_can/c_can.c
> > b/drivers/net/can/c_can/c_can.c index 8e9f5620c9a2..0952ac0b9123
> > 100644
> > --- a/drivers/net/can/c_can/c_can.c
> > +++ b/drivers/net/can/c_can/c_can.c
> > @@ -1123,8 +1123,25 @@ static int c_can_poll(struct napi_struct *napi, int quota)
> >  if (work_done < quota) {
> >  napi_complete_done(napi, work_done);
> >  /* enable all IRQs if we are not in bus off state */
> > -if (priv->can.state != CAN_STATE_BUS_OFF)
> > +if (priv->can.state != CAN_STATE_BUS_OFF) {
> > +u32 ctrl;
> >  c_can_irq_control(priv, true);
> > +
> > +/* There appears to be a race condition when the device
> > + * enters bus off while interrupts are off or being
> > + * re-enabled causing the bus off to get lost.
> > + * This tries to avoid this condition.
> > + */
> > +ctrl = priv->read_reg(priv, C_CAN_CTRL_REG);
> > +
> > +if (ctrl & CONTROL_INIT) {
> > +netdev_warn(dev, "lost bus off\n");
> > +c_can_irq_control(priv, false);
> > +/* Reschedule worker to handle bus off */
> > +atomic_set(&priv->sie_pending, 1);
> > +napi_reschedule(napi);
> > +}
> > +}
> >  }
> >
> >  return work_done;
> > --
> > 2.26.0
> >
> ___________________________________________________
>
> SMA Solar Technology AG
> Aufsichtsrat: Dr. Erik Ehrentraut (Vorsitzender)
> Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
> Handelsregister: Amtsgericht Kassel HRB 3972
> Sitz der Gesellschaft: 34266 Niestetal
> USt-ID-Nr. DE 113 08 59 54
> WEEE-Reg.-Nr. DE 95881150
> ___________________________________________________
___________________________________________________

SMA Solar Technology AG
Aufsichtsrat: Dr. Erik Ehrentraut (Vorsitzender)
Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
Handelsregister: Amtsgericht Kassel HRB 3972
Sitz der Gesellschaft: 34266 Niestetal
USt-ID-Nr. DE 113 08 59 54
WEEE-Reg.-Nr. DE 95881150
___________________________________________________

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-03-27 10:09       ` Felix Riemann
@ 2020-03-30  7:02         ` Kurt Van Dijck
  2020-03-30 17:30           ` Felix Riemann
  0 siblings, 1 reply; 17+ messages in thread
From: Kurt Van Dijck @ 2020-03-30  7:02 UTC (permalink / raw)
  To: Felix Riemann
  Cc: linux-can, Marc Kleine-Budde, Wolfgang Grandegger, Andre Kalb

On vr, 27 mrt 2020 10:09:03 +0000, Felix Riemann wrote:
> Is it this one?
> https://marc.info/?l=linux-can&m=157683579518282

yes

> 
> It seems to be missing patch 6 in the series.

you can get there via https://github.com/kurt-vd/linux, branch c_can-next

Kurt

> 
> Regards,
> Felix
> 
> -----Original Message-----
> From: Kurt Van Dijck
> Sent: Thursday, March 26, 2020 8:39 PM
> 
> Interesting topic.
> 
> After the bus-off problem, I also attacked the overflow problems. No
> idea why this patch series got stalled. I activated the complete 64
> message buffers of the D_CAN hardware (only 64 of the 128 being present
> in the IP for some reason I don't know) and modified the IRQ to use
> rx-offload. That may affect the circumstances under load, I admit.
> 
> Whould you mind trying that series, just for reference, in your setup?
> If your hypothesis is right, my observations do not survive your
> busload.
> 
> Kind regards,
> Kurt
> 
> On do, 26 mrt 2020 14:49:23 +0000, Felix Riemann wrote:
> > Hi Kurt,
> >
> > yes that would be 3cb3eaac52c0f145d895f4b6c22834d5f02b8569.
> >
> > I tried your patch as well, since the description matched the problem pretty well. However, I could still produce the hang on our board (based on AM3352) especially if the bus off occurs while the bus load is very high.
> >
> > What I could see was the worker being triggered by the interrupt with the bus being in passive mode. When the worker then re-enabled the interrupt the hardware had already entered init mode due to the bus-off but nor would it trigger again nor would the interrupt register indicate a pending IRQ. The status register also indicates an error (LEC != 0x7).
> >
> > A first attempt to fix/workaround this placed the check just before the worker re-enables the interrupts. But even this setup could reproduce the problem, although it took longer to do so.  So the timing around enabling the interrupts seems to matter here. Actually, adding a printout at the top of the worker where your patch checks the value of the sie_pending atomic seems to increase the chances of making it happen on my board, but that's just a feeling I got while cleaning up my patch.
> >
> > Regards,
> >
> > Felix
> >
> > -----Original Message-----
> > From: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be>
> > Sent: Thursday, March 26, 2020 2:40 PM
> >
> > This patch looks very familiar.
> > I tried this 6 months ago, when I got a setup where I could reproduce this within minutes.
> >
> > This patch didn't solve my problem, and I fixed my problem with reading the ISR register only once in the IRQ, and _not_ 2nd time in napi handler.
> > Reading the ISR in C_CAN is consuming the info, so unless the there is more state to clear (such as received message buffers), the interrupt cause is assumed to be known, and the C_CAN proceeds to the next interrupt cause.
> > I've not experienced a single problem since, where before the fix, I'd experience a few occurances every month.
> > We have an am3358 (beaglebone-like hardware).
> >
> > I believe I submitted a patch for that.
> >
> > Re-scheduling the napi is a workaround, IMHO.
> >
> > Kurt
> >
> > On do, 26 mrt 2020 11:43:18 +0100, Felix Riemann wrote:
> > > There are appears to be a race condition where interrupts caused by
> > > bus offs get lost if it occurs while interrupts are disabled or being
> > > re-enabled.
> > >
> > > This tries to avoid the deadlock by rescheduling the NAPI worker to
> > > handle the bus-off condition.
> > >
> > > Signed-off-by: Felix Riemann <felix.riemann@sma.de>
> > > Reviewed-by: Andre Kalb <andre.kalb@sma.de>
> > > ---
> > >  drivers/net/can/c_can/c_can.c | 19 ++++++++++++++++++-
> > >  1 file changed, 18 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/net/can/c_can/c_can.c
> > > b/drivers/net/can/c_can/c_can.c index 8e9f5620c9a2..0952ac0b9123
> > > 100644
> > > --- a/drivers/net/can/c_can/c_can.c
> > > +++ b/drivers/net/can/c_can/c_can.c
> > > @@ -1123,8 +1123,25 @@ static int c_can_poll(struct napi_struct *napi, int quota)
> > >  if (work_done < quota) {
> > >  napi_complete_done(napi, work_done);
> > >  /* enable all IRQs if we are not in bus off state */
> > > -if (priv->can.state != CAN_STATE_BUS_OFF)
> > > +if (priv->can.state != CAN_STATE_BUS_OFF) {
> > > +u32 ctrl;
> > >  c_can_irq_control(priv, true);
> > > +
> > > +/* There appears to be a race condition when the device
> > > + * enters bus off while interrupts are off or being
> > > + * re-enabled causing the bus off to get lost.
> > > + * This tries to avoid this condition.
> > > + */
> > > +ctrl = priv->read_reg(priv, C_CAN_CTRL_REG);
> > > +
> > > +if (ctrl & CONTROL_INIT) {
> > > +netdev_warn(dev, "lost bus off\n");
> > > +c_can_irq_control(priv, false);
> > > +/* Reschedule worker to handle bus off */
> > > +atomic_set(&priv->sie_pending, 1);
> > > +napi_reschedule(napi);
> > > +}
> > > +}
> > >  }
> > >
> > >  return work_done;
> > > --
> > > 2.26.0
> > >
> > ___________________________________________________
> >
> > SMA Solar Technology AG
> > Aufsichtsrat: Dr. Erik Ehrentraut (Vorsitzender)
> > Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
> > Handelsregister: Amtsgericht Kassel HRB 3972
> > Sitz der Gesellschaft: 34266 Niestetal
> > USt-ID-Nr. DE 113 08 59 54
> > WEEE-Reg.-Nr. DE 95881150
> > ___________________________________________________
> ___________________________________________________
> 
> SMA Solar Technology AG
> Aufsichtsrat: Dr. Erik Ehrentraut (Vorsitzender)
> Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
> Handelsregister: Amtsgericht Kassel HRB 3972
> Sitz der Gesellschaft: 34266 Niestetal
> USt-ID-Nr. DE 113 08 59 54
> WEEE-Reg.-Nr. DE 95881150
> ___________________________________________________

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-03-30  7:02         ` Kurt Van Dijck
@ 2020-03-30 17:30           ` Felix Riemann
  2020-03-31  9:44             ` Kurt Van Dijck
  2020-04-29  8:47             ` Kurt Van Dijck
  0 siblings, 2 replies; 17+ messages in thread
From: Felix Riemann @ 2020-03-30 17:30 UTC (permalink / raw)
  To: Kurt Van Dijck
  Cc: linux-can, Marc Kleine-Budde, Wolfgang Grandegger, Andre Kalb

Hi Kurt,

tried to give it a try. However, after starting a few (4 - 5) cangen  instances to increase bus load it panics:

[  207.259295] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[  207.267433] pgd = d2bd4866
[  207.270153] [00000000] *pgd=8ca25831, *pte=00000000, *ppte=00000000
[  207.276469] Internal error: Oops: 817 [#1] PREEMPT ARM
[  207.281633] Modules linked in: cfg80211 rfkill lm75 musb_dsps musb_hdrc usbcore phy_am335x phy_generic usb_common phy_am335x_control gpio_keys xt_owner iptable_filter ip_tables can_raw
[  207.298189] CPU: 0 PID: 782 Comm: cangen Not tainted 5.5.11-00013-gcdf175eb0ffc #7
[  207.305793] Hardware name: Generic AM33XX (Flattened Device Tree)
[  207.311942] PC is at can_rx_offload_irq_receive_skb+0x54/0x78
[  207.317722] LR is at c_can_isr+0x52c/0x5b0
[  207.321838] pc : [<c04856f4>]    lr : [<c04875f8>]    psr: 80070193
[  207.328133] sp : cde6fd30  ip : 00000008  fp : cf4ec5c8
[  207.333383] r10: 00000000  r9 : 00000000  r8 : cf4ec4e0
[  207.338633] r7 : 00000001  r6 : 00000000  r5 : 00000039  r4 : cf4ec000
[  207.345192] r3 : cf4ec5c8  r2 : cf4ec5dc  r1 : 00000000  r0 : cf4ec5dc
[  207.351752] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
[  207.359010] Control: 10c5387d  Table: 8cb50019  DAC: 00000051
[  207.364785] Process cangen (pid: 782, stack limit = 0x7ef64b0d)
[  207.370734] Stack: (0xcde6fd30 to 0xcde70000)
[  207.375116] fd20:                                     00000000 cf4ec000 00000000 00000000
[  207.383341] fd40: cde6fd7f 8d12c45f cd82ed80 cc864c80 cf3a9310 00000030 00000001 ffffe000
[  207.391566] fd60: cf3a9300 cde6fda0 00000000 c0165450 cf3a9300 cde6fda0 00010002 c0a04248
[  207.399790] fd80: cf3a9310 cf3a9300 00000001 00000000 cde6e000 8ec8018f 000000e8 c01655f4
[  207.408014] fda0: 00000000 8d12c45f cf3a9300 cf3a9310 cf01f400 c01656a4 cf3a9300 cf3a9310
[  207.416239] fdc0: cf01f400 c0169520 00000000 00000000 cf01f400 c0164720 00000000 c0164dac
[  207.424463] fde0: c01128c4 20070113 ffffffff cde6fe2c 00000000 c0101a8c 00020837 cfecd200
[  207.432687] fe00: cfcb9000 00214200 8ec8018f 0000ec80 00000000 cfecd200 00000000 00000000
[  207.440911] fe20: 8ec8018f 000000e8 20070113 cde6fe48 00020837 c01128c4 20070113 ffffffff
[  207.449136] fe40: 00000051 bf000000 8ec8018f cca261ac 00000000 c0203830 cde6fee4 cf79e0b0
[  207.457359] fe60: 00000000 c0206358 cfecd200 00000001 cde6fee4 cdcba300 ce38e188 000000ec
[  207.465584] fe80: c0a04248 c01de0d0 00023b20 ce38e18c 000000e8 00280000 ce37e720 00000000
[  207.473807] fea0: 00000000 8d12c45f 00000011 00000000 000000ec c0a04248 b6e61000 00000040
[  207.482032] fec0: 000000dd cc8b7000 cc8b703c c0206b80 c0a04248 00000000 cdcba840 00000010
[  207.490256] fee0: 00000000 cf79e0b0 00000054 00100cca 000000de b6e6b000 ccb52db8 ccb52db8
[  207.498479] ff00: 00000000 00000000 00000000 00000000 cca261ac cc8b703c 00000000 8d12c45f
[  207.506704] ff20: ce694570 cde6ffb0 b6e61de4 cf79e0b0 00000054 cd944000 cc8b7000 80000007
[  207.514928] ff40: cc8b703c c0110c44 cdcba840 00000010 00000010 c022c9f4 00000000 00000010
[  207.523152] ff60: cdcba840 00000007 c0a08e38 b6e61de4 cde6ffb0 10c53c7d 00000000 00023b20
[  207.531375] ff80: 00023b20 c0110f94 00000000 00000010 00000000 00000004 c0101204 b6e61de4
[  207.539599] ffa0: 60070010 ffffffff 10c5387d c01020c8 bea8fbc8 00000000 00000010 00000000
[  207.547822] ffc0: 00000000 00000010 00000000 00000003 00000000 00000000 00023b20 00023b20
[  207.556047] ffe0: b6e61de4 bea8fb78 00011454 b6e61de4 60070010 ffffffff 00000000 00000000
[  207.564287] [<c04856f4>] (can_rx_offload_irq_receive_skb) from [<cf4ec000>] (0xcf4ec000)
[  207.572428] Code: e58430e8 e8bd8010 e5932018 e2830014 (e5810000)
[  207.578557] ---[ end trace 67a8f60eda872c72 ]---
[  207.583199] Kernel panic - not syncing: Fatal exception in interrupt

I also noticed two things while building the patchset.
The GPL export of can_rx_offload_irq_receive_skb uses the wrong name in the second patch and patch 5 prevents the c_can driver from being built as a module as it references the unexported __can_get_echo_skb function.

Regards,

Felix


> -----Original Message-----
> From: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be>
> Sent: Monday, March 30, 2020 9:02 AM
>
> On vr, 27 mrt 2020 10:09:03 +0000, Felix Riemann wrote:
> > Is it this one?
> > https://marc.info/?l=linux-can&m=157683579518282
>
> yes
>
> >
> > It seems to be missing patch 6 in the series.
>
> you can get there via
> https://github.com/kurt-vd/linux, branch c_can-next
>
> Kurt
>
> >
> > Regards,
> > Felix
> >
> > -----Original Message-----
> > From: Kurt Van Dijck
> > Sent: Thursday, March 26, 2020 8:39 PM
> >
> > Interesting topic.
> >
> > After the bus-off problem, I also attacked the overflow problems. No
> > idea why this patch series got stalled. I activated the complete 64
> > message buffers of the D_CAN hardware (only 64 of the 128 being
> > present in the IP for some reason I don't know) and modified the IRQ
> > to use rx-offload. That may affect the circumstances under load, I admit.
> >
> > Whould you mind trying that series, just for reference, in your setup?
> > If your hypothesis is right, my observations do not survive your
> > busload.
> >
> > Kind regards,
> > Kurt
> >
> > On do, 26 mrt 2020 14:49:23 +0000, Felix Riemann wrote:
> > > Hi Kurt,
> > >
> > > yes that would be 3cb3eaac52c0f145d895f4b6c22834d5f02b8569.
> > >
> > > I tried your patch as well, since the description matched the problem pretty
> well. However, I could still produce the hang on our board (based on AM3352)
> especially if the bus off occurs while the bus load is very high.
> > >
> > > What I could see was the worker being triggered by the interrupt with the
> bus being in passive mode. When the worker then re-enabled the interrupt the
> hardware had already entered init mode due to the bus-off but nor would it
> trigger again nor would the interrupt register indicate a pending IRQ. The status
> register also indicates an error (LEC != 0x7).
> > >
> > > A first attempt to fix/workaround this placed the check just before the
> worker re-enables the interrupts. But even this setup could reproduce the
> problem, although it took longer to do so.  So the timing around enabling the
> interrupts seems to matter here. Actually, adding a printout at the top of the
> worker where your patch checks the value of the sie_pending atomic seems to
> increase the chances of making it happen on my board, but that's just a feeling I
> got while cleaning up my patch.
> > >
> > > Regards,
> > >
> > > Felix
> > >
> > > -----Original Message-----
> > > From: Kurt Van Dijck <mailto:dev.kurt@vandijck-laurijssen.be>
> > > Sent: Thursday, March 26, 2020 2:40 PM
> > >
> > > This patch looks very familiar.
> > > I tried this 6 months ago, when I got a setup where I could reproduce this
> within minutes.
> > >
> > > This patch didn't solve my problem, and I fixed my problem with reading the
> ISR register only once in the IRQ, and _not_ 2nd time in napi handler.
> > > Reading the ISR in C_CAN is consuming the info, so unless the there is more
> state to clear (such as received message buffers), the interrupt cause is assumed
> to be known, and the C_CAN proceeds to the next interrupt cause.
> > > I've not experienced a single problem since, where before the fix, I'd
> experience a few occurances every month.
> > > We have an am3358 (beaglebone-like hardware).
> > >
> > > I believe I submitted a patch for that.
> > >
> > > Re-scheduling the napi is a workaround, IMHO.
> > >
> > > Kurt
> > >
> > > On do, 26 mrt 2020 11:43:18 +0100, Felix Riemann wrote:
> > > > There are appears to be a race condition where interrupts caused
> > > > by bus offs get lost if it occurs while interrupts are disabled or
> > > > being re-enabled.
> > > >
> > > > This tries to avoid the deadlock by rescheduling the NAPI worker
> > > > to handle the bus-off condition.
> > > >
> > > > Signed-off-by: Felix Riemann <mailto:felix.riemann@sma.de>
> > > > Reviewed-by: Andre Kalb <mailto:andre.kalb@sma.de>
> > > > ---
> > > >  drivers/net/can/c_can/c_can.c | 19 ++++++++++++++++++-
> > > >  1 file changed, 18 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/net/can/c_can/c_can.c
> > > > b/drivers/net/can/c_can/c_can.c index 8e9f5620c9a2..0952ac0b9123
> > > > 100644
> > > > --- a/drivers/net/can/c_can/c_can.c
> > > > +++ b/drivers/net/can/c_can/c_can.c
> > > > @@ -1123,8 +1123,25 @@ static int c_can_poll(struct napi_struct
> > > > *napi, int quota)  if (work_done < quota) {
> > > > napi_complete_done(napi, work_done);
> > > >  /* enable all IRQs if we are not in bus off state */ -if
> > > > (priv->can.state != CAN_STATE_BUS_OFF)
> > > > +if (priv->can.state != CAN_STATE_BUS_OFF) {
> > > > +u32 ctrl;
> > > >  c_can_irq_control(priv, true);
> > > > +
> > > > +/* There appears to be a race condition when the device
> > > > + * enters bus off while interrupts are off or being
> > > > + * re-enabled causing the bus off to get lost.
> > > > + * This tries to avoid this condition.
> > > > + */
> > > > +ctrl = priv->read_reg(priv, C_CAN_CTRL_REG);
> > > > +
> > > > +if (ctrl & CONTROL_INIT) {
> > > > +netdev_warn(dev, "lost bus off\n"); c_can_irq_control(priv,
> > > > +false);
> > > > +/* Reschedule worker to handle bus off */
> > > > +atomic_set(&priv->sie_pending, 1); napi_reschedule(napi); } }
> > > >  }
> > > >
> > > >  return work_done;
> > > > --
> > > > 2.26.0
___________________________________________________

SMA Solar Technology AG
Aufsichtsrat: Dr. Erik Ehrentraut (Vorsitzender)
Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
Handelsregister: Amtsgericht Kassel HRB 3972
Sitz der Gesellschaft: 34266 Niestetal
USt-ID-Nr. DE 113 08 59 54
WEEE-Reg.-Nr. DE 95881150
___________________________________________________

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-03-30 17:30           ` Felix Riemann
@ 2020-03-31  9:44             ` Kurt Van Dijck
  2020-04-29  8:47             ` Kurt Van Dijck
  1 sibling, 0 replies; 17+ messages in thread
From: Kurt Van Dijck @ 2020-03-31  9:44 UTC (permalink / raw)
  To: Felix Riemann
  Cc: linux-can, Marc Kleine-Budde, Wolfgang Grandegger, Andre Kalb

On ma, 30 mrt 2020 17:30:08 +0000, Felix Riemann wrote:
> Hi Kurt,
> 
> tried to give it a try. However, after starting a few (4 - 5) cangen  instances to increase bus load it panics:

Thanks for testing!
I'll take a look later this week (have some bytes to burn today).
> 
> [  207.259295] Unable to handle kernel NULL pointer dereference at virtual address 00000000
> [  207.267433] pgd = d2bd4866
> [  207.270153] [00000000] *pgd=8ca25831, *pte=00000000, *ppte=00000000
> [  207.276469] Internal error: Oops: 817 [#1] PREEMPT ARM
> [  207.281633] Modules linked in: cfg80211 rfkill lm75 musb_dsps musb_hdrc usbcore phy_am335x phy_generic usb_common phy_am335x_control gpio_keys xt_owner iptable_filter ip_tables can_raw
> [  207.298189] CPU: 0 PID: 782 Comm: cangen Not tainted 5.5.11-00013-gcdf175eb0ffc #7
> [  207.305793] Hardware name: Generic AM33XX (Flattened Device Tree)
> [  207.311942] PC is at can_rx_offload_irq_receive_skb+0x54/0x78
> [  207.317722] LR is at c_can_isr+0x52c/0x5b0
> [  207.321838] pc : [<c04856f4>]    lr : [<c04875f8>]    psr: 80070193
> [  207.328133] sp : cde6fd30  ip : 00000008  fp : cf4ec5c8
> [  207.333383] r10: 00000000  r9 : 00000000  r8 : cf4ec4e0
> [  207.338633] r7 : 00000001  r6 : 00000000  r5 : 00000039  r4 : cf4ec000
> [  207.345192] r3 : cf4ec5c8  r2 : cf4ec5dc  r1 : 00000000  r0 : cf4ec5dc
> [  207.351752] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [  207.359010] Control: 10c5387d  Table: 8cb50019  DAC: 00000051
> [  207.364785] Process cangen (pid: 782, stack limit = 0x7ef64b0d)
> [  207.370734] Stack: (0xcde6fd30 to 0xcde70000)
> [  207.375116] fd20:                                     00000000 cf4ec000 00000000 00000000
> [  207.383341] fd40: cde6fd7f 8d12c45f cd82ed80 cc864c80 cf3a9310 00000030 00000001 ffffe000
> [  207.391566] fd60: cf3a9300 cde6fda0 00000000 c0165450 cf3a9300 cde6fda0 00010002 c0a04248
> [  207.399790] fd80: cf3a9310 cf3a9300 00000001 00000000 cde6e000 8ec8018f 000000e8 c01655f4
> [  207.408014] fda0: 00000000 8d12c45f cf3a9300 cf3a9310 cf01f400 c01656a4 cf3a9300 cf3a9310
> [  207.416239] fdc0: cf01f400 c0169520 00000000 00000000 cf01f400 c0164720 00000000 c0164dac
> [  207.424463] fde0: c01128c4 20070113 ffffffff cde6fe2c 00000000 c0101a8c 00020837 cfecd200
> [  207.432687] fe00: cfcb9000 00214200 8ec8018f 0000ec80 00000000 cfecd200 00000000 00000000
> [  207.440911] fe20: 8ec8018f 000000e8 20070113 cde6fe48 00020837 c01128c4 20070113 ffffffff
> [  207.449136] fe40: 00000051 bf000000 8ec8018f cca261ac 00000000 c0203830 cde6fee4 cf79e0b0
> [  207.457359] fe60: 00000000 c0206358 cfecd200 00000001 cde6fee4 cdcba300 ce38e188 000000ec
> [  207.465584] fe80: c0a04248 c01de0d0 00023b20 ce38e18c 000000e8 00280000 ce37e720 00000000
> [  207.473807] fea0: 00000000 8d12c45f 00000011 00000000 000000ec c0a04248 b6e61000 00000040
> [  207.482032] fec0: 000000dd cc8b7000 cc8b703c c0206b80 c0a04248 00000000 cdcba840 00000010
> [  207.490256] fee0: 00000000 cf79e0b0 00000054 00100cca 000000de b6e6b000 ccb52db8 ccb52db8
> [  207.498479] ff00: 00000000 00000000 00000000 00000000 cca261ac cc8b703c 00000000 8d12c45f
> [  207.506704] ff20: ce694570 cde6ffb0 b6e61de4 cf79e0b0 00000054 cd944000 cc8b7000 80000007
> [  207.514928] ff40: cc8b703c c0110c44 cdcba840 00000010 00000010 c022c9f4 00000000 00000010
> [  207.523152] ff60: cdcba840 00000007 c0a08e38 b6e61de4 cde6ffb0 10c53c7d 00000000 00023b20
> [  207.531375] ff80: 00023b20 c0110f94 00000000 00000010 00000000 00000004 c0101204 b6e61de4
> [  207.539599] ffa0: 60070010 ffffffff 10c5387d c01020c8 bea8fbc8 00000000 00000010 00000000
> [  207.547822] ffc0: 00000000 00000010 00000000 00000003 00000000 00000000 00023b20 00023b20
> [  207.556047] ffe0: b6e61de4 bea8fb78 00011454 b6e61de4 60070010 ffffffff 00000000 00000000
> [  207.564287] [<c04856f4>] (can_rx_offload_irq_receive_skb) from [<cf4ec000>] (0xcf4ec000)
> [  207.572428] Code: e58430e8 e8bd8010 e5932018 e2830014 (e5810000)
> [  207.578557] ---[ end trace 67a8f60eda872c72 ]---
> [  207.583199] Kernel panic - not syncing: Fatal exception in interrupt
> 
> I also noticed two things while building the patchset.
> The GPL export of can_rx_offload_irq_receive_skb uses the wrong name in the second patch and patch 5 prevents the c_can driver from being built as a module as it references the unexported __can_get_echo_skb function.
> 
> Regards,
> 
> Felix
> 
> 
> > -----Original Message-----
> > From: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be>
> > Sent: Monday, March 30, 2020 9:02 AM
> >
> > On vr, 27 mrt 2020 10:09:03 +0000, Felix Riemann wrote:
> > > Is it this one?
> > > https://marc.info/?l=linux-can&m=157683579518282
> >
> > yes
> >
> > >
> > > It seems to be missing patch 6 in the series.
> >
> > you can get there via
> > https://github.com/kurt-vd/linux, branch c_can-next
> >
> > Kurt
> >
> > >
> > > Regards,
> > > Felix
> > >
> > > -----Original Message-----
> > > From: Kurt Van Dijck
> > > Sent: Thursday, March 26, 2020 8:39 PM
> > >
> > > Interesting topic.
> > >
> > > After the bus-off problem, I also attacked the overflow problems. No
> > > idea why this patch series got stalled. I activated the complete 64
> > > message buffers of the D_CAN hardware (only 64 of the 128 being
> > > present in the IP for some reason I don't know) and modified the IRQ
> > > to use rx-offload. That may affect the circumstances under load, I admit.
> > >
> > > Whould you mind trying that series, just for reference, in your setup?
> > > If your hypothesis is right, my observations do not survive your
> > > busload.
> > >
> > > Kind regards,
> > > Kurt
> > >
> > > On do, 26 mrt 2020 14:49:23 +0000, Felix Riemann wrote:
> > > > Hi Kurt,
> > > >
> > > > yes that would be 3cb3eaac52c0f145d895f4b6c22834d5f02b8569.
> > > >
> > > > I tried your patch as well, since the description matched the problem pretty
> > well. However, I could still produce the hang on our board (based on AM3352)
> > especially if the bus off occurs while the bus load is very high.
> > > >
> > > > What I could see was the worker being triggered by the interrupt with the
> > bus being in passive mode. When the worker then re-enabled the interrupt the
> > hardware had already entered init mode due to the bus-off but nor would it
> > trigger again nor would the interrupt register indicate a pending IRQ. The status
> > register also indicates an error (LEC != 0x7).
> > > >
> > > > A first attempt to fix/workaround this placed the check just before the
> > worker re-enables the interrupts. But even this setup could reproduce the
> > problem, although it took longer to do so.  So the timing around enabling the
> > interrupts seems to matter here. Actually, adding a printout at the top of the
> > worker where your patch checks the value of the sie_pending atomic seems to
> > increase the chances of making it happen on my board, but that's just a feeling I
> > got while cleaning up my patch.
> > > >
> > > > Regards,
> > > >
> > > > Felix
> > > >
> > > > -----Original Message-----
> > > > From: Kurt Van Dijck <mailto:dev.kurt@vandijck-laurijssen.be>
> > > > Sent: Thursday, March 26, 2020 2:40 PM
> > > >
> > > > This patch looks very familiar.
> > > > I tried this 6 months ago, when I got a setup where I could reproduce this
> > within minutes.
> > > >
> > > > This patch didn't solve my problem, and I fixed my problem with reading the
> > ISR register only once in the IRQ, and _not_ 2nd time in napi handler.
> > > > Reading the ISR in C_CAN is consuming the info, so unless the there is more
> > state to clear (such as received message buffers), the interrupt cause is assumed
> > to be known, and the C_CAN proceeds to the next interrupt cause.
> > > > I've not experienced a single problem since, where before the fix, I'd
> > experience a few occurances every month.
> > > > We have an am3358 (beaglebone-like hardware).
> > > >
> > > > I believe I submitted a patch for that.
> > > >
> > > > Re-scheduling the napi is a workaround, IMHO.
> > > >
> > > > Kurt
> > > >
> > > > On do, 26 mrt 2020 11:43:18 +0100, Felix Riemann wrote:
> > > > > There are appears to be a race condition where interrupts caused
> > > > > by bus offs get lost if it occurs while interrupts are disabled or
> > > > > being re-enabled.
> > > > >
> > > > > This tries to avoid the deadlock by rescheduling the NAPI worker
> > > > > to handle the bus-off condition.
> > > > >
> > > > > Signed-off-by: Felix Riemann <mailto:felix.riemann@sma.de>
> > > > > Reviewed-by: Andre Kalb <mailto:andre.kalb@sma.de>
> > > > > ---
> > > > >  drivers/net/can/c_can/c_can.c | 19 ++++++++++++++++++-
> > > > >  1 file changed, 18 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/net/can/c_can/c_can.c
> > > > > b/drivers/net/can/c_can/c_can.c index 8e9f5620c9a2..0952ac0b9123
> > > > > 100644
> > > > > --- a/drivers/net/can/c_can/c_can.c
> > > > > +++ b/drivers/net/can/c_can/c_can.c
> > > > > @@ -1123,8 +1123,25 @@ static int c_can_poll(struct napi_struct
> > > > > *napi, int quota)  if (work_done < quota) {
> > > > > napi_complete_done(napi, work_done);
> > > > >  /* enable all IRQs if we are not in bus off state */ -if
> > > > > (priv->can.state != CAN_STATE_BUS_OFF)
> > > > > +if (priv->can.state != CAN_STATE_BUS_OFF) {
> > > > > +u32 ctrl;
> > > > >  c_can_irq_control(priv, true);
> > > > > +
> > > > > +/* There appears to be a race condition when the device
> > > > > + * enters bus off while interrupts are off or being
> > > > > + * re-enabled causing the bus off to get lost.
> > > > > + * This tries to avoid this condition.
> > > > > + */
> > > > > +ctrl = priv->read_reg(priv, C_CAN_CTRL_REG);
> > > > > +
> > > > > +if (ctrl & CONTROL_INIT) {
> > > > > +netdev_warn(dev, "lost bus off\n"); c_can_irq_control(priv,
> > > > > +false);
> > > > > +/* Reschedule worker to handle bus off */
> > > > > +atomic_set(&priv->sie_pending, 1); napi_reschedule(napi); } }
> > > > >  }
> > > > >
> > > > >  return work_done;
> > > > > --
> > > > > 2.26.0
> ___________________________________________________
> 
> SMA Solar Technology AG
> Aufsichtsrat: Dr. Erik Ehrentraut (Vorsitzender)
> Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
> Handelsregister: Amtsgericht Kassel HRB 3972
> Sitz der Gesellschaft: 34266 Niestetal
> USt-ID-Nr. DE 113 08 59 54
> WEEE-Reg.-Nr. DE 95881150
> ___________________________________________________

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-03-30 17:30           ` Felix Riemann
  2020-03-31  9:44             ` Kurt Van Dijck
@ 2020-04-29  8:47             ` Kurt Van Dijck
  2020-04-29  9:06               ` Kurt Van Dijck
  1 sibling, 1 reply; 17+ messages in thread
From: Kurt Van Dijck @ 2020-04-29  8:47 UTC (permalink / raw)
  To: Felix Riemann
  Cc: linux-can, Marc Kleine-Budde, Wolfgang Grandegger, Andre Kalb

Hi,

On ma, 30 mrt 2020 17:30:08 +0000, Felix Riemann wrote:
> Hi Kurt,
> 
> tried to give it a try. However, after starting a few (4 - 5) cangen  instances to increase bus load it panics:

It would be more helpful if you had a verbose stacktrace.

Kurt

> 
> [  207.259295] Unable to handle kernel NULL pointer dereference at virtual address 00000000
> [  207.267433] pgd = d2bd4866
> [  207.270153] [00000000] *pgd=8ca25831, *pte=00000000, *ppte=00000000
> [  207.276469] Internal error: Oops: 817 [#1] PREEMPT ARM
> [  207.281633] Modules linked in: cfg80211 rfkill lm75 musb_dsps musb_hdrc usbcore phy_am335x phy_generic usb_common phy_am335x_control gpio_keys xt_owner iptable_filter ip_tables can_raw
> [  207.298189] CPU: 0 PID: 782 Comm: cangen Not tainted 5.5.11-00013-gcdf175eb0ffc #7
> [  207.305793] Hardware name: Generic AM33XX (Flattened Device Tree)
> [  207.311942] PC is at can_rx_offload_irq_receive_skb+0x54/0x78
> [  207.317722] LR is at c_can_isr+0x52c/0x5b0
> [  207.321838] pc : [<c04856f4>]    lr : [<c04875f8>]    psr: 80070193
> [  207.328133] sp : cde6fd30  ip : 00000008  fp : cf4ec5c8
> [  207.333383] r10: 00000000  r9 : 00000000  r8 : cf4ec4e0
> [  207.338633] r7 : 00000001  r6 : 00000000  r5 : 00000039  r4 : cf4ec000
> [  207.345192] r3 : cf4ec5c8  r2 : cf4ec5dc  r1 : 00000000  r0 : cf4ec5dc
> [  207.351752] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [  207.359010] Control: 10c5387d  Table: 8cb50019  DAC: 00000051
> [  207.364785] Process cangen (pid: 782, stack limit = 0x7ef64b0d)
> [  207.370734] Stack: (0xcde6fd30 to 0xcde70000)
> [  207.375116] fd20:                                     00000000 cf4ec000 00000000 00000000
> [  207.383341] fd40: cde6fd7f 8d12c45f cd82ed80 cc864c80 cf3a9310 00000030 00000001 ffffe000
> [  207.391566] fd60: cf3a9300 cde6fda0 00000000 c0165450 cf3a9300 cde6fda0 00010002 c0a04248
> [  207.399790] fd80: cf3a9310 cf3a9300 00000001 00000000 cde6e000 8ec8018f 000000e8 c01655f4
> [  207.408014] fda0: 00000000 8d12c45f cf3a9300 cf3a9310 cf01f400 c01656a4 cf3a9300 cf3a9310
> [  207.416239] fdc0: cf01f400 c0169520 00000000 00000000 cf01f400 c0164720 00000000 c0164dac
> [  207.424463] fde0: c01128c4 20070113 ffffffff cde6fe2c 00000000 c0101a8c 00020837 cfecd200
> [  207.432687] fe00: cfcb9000 00214200 8ec8018f 0000ec80 00000000 cfecd200 00000000 00000000
> [  207.440911] fe20: 8ec8018f 000000e8 20070113 cde6fe48 00020837 c01128c4 20070113 ffffffff
> [  207.449136] fe40: 00000051 bf000000 8ec8018f cca261ac 00000000 c0203830 cde6fee4 cf79e0b0
> [  207.457359] fe60: 00000000 c0206358 cfecd200 00000001 cde6fee4 cdcba300 ce38e188 000000ec
> [  207.465584] fe80: c0a04248 c01de0d0 00023b20 ce38e18c 000000e8 00280000 ce37e720 00000000
> [  207.473807] fea0: 00000000 8d12c45f 00000011 00000000 000000ec c0a04248 b6e61000 00000040
> [  207.482032] fec0: 000000dd cc8b7000 cc8b703c c0206b80 c0a04248 00000000 cdcba840 00000010
> [  207.490256] fee0: 00000000 cf79e0b0 00000054 00100cca 000000de b6e6b000 ccb52db8 ccb52db8
> [  207.498479] ff00: 00000000 00000000 00000000 00000000 cca261ac cc8b703c 00000000 8d12c45f
> [  207.506704] ff20: ce694570 cde6ffb0 b6e61de4 cf79e0b0 00000054 cd944000 cc8b7000 80000007
> [  207.514928] ff40: cc8b703c c0110c44 cdcba840 00000010 00000010 c022c9f4 00000000 00000010
> [  207.523152] ff60: cdcba840 00000007 c0a08e38 b6e61de4 cde6ffb0 10c53c7d 00000000 00023b20
> [  207.531375] ff80: 00023b20 c0110f94 00000000 00000010 00000000 00000004 c0101204 b6e61de4
> [  207.539599] ffa0: 60070010 ffffffff 10c5387d c01020c8 bea8fbc8 00000000 00000010 00000000
> [  207.547822] ffc0: 00000000 00000010 00000000 00000003 00000000 00000000 00023b20 00023b20
> [  207.556047] ffe0: b6e61de4 bea8fb78 00011454 b6e61de4 60070010 ffffffff 00000000 00000000
> [  207.564287] [<c04856f4>] (can_rx_offload_irq_receive_skb) from [<cf4ec000>] (0xcf4ec000)
> [  207.572428] Code: e58430e8 e8bd8010 e5932018 e2830014 (e5810000)
> [  207.578557] ---[ end trace 67a8f60eda872c72 ]---
> [  207.583199] Kernel panic - not syncing: Fatal exception in interrupt
> 
> I also noticed two things while building the patchset.
> The GPL export of can_rx_offload_irq_receive_skb uses the wrong name in the second patch and patch 5 prevents the c_can driver from being built as a module as it references the unexported __can_get_echo_skb function.
> 
> Regards,
> 
> Felix
> 
> 
> > -----Original Message-----
> > From: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be>
> > Sent: Monday, March 30, 2020 9:02 AM
> >
> > On vr, 27 mrt 2020 10:09:03 +0000, Felix Riemann wrote:
> > > Is it this one?
> > > https://marc.info/?l=linux-can&m=157683579518282
> >
> > yes
> >
> > >
> > > It seems to be missing patch 6 in the series.
> >
> > you can get there via
> > https://github.com/kurt-vd/linux, branch c_can-next
> >
> > Kurt
> >
> > >
> > > Regards,
> > > Felix
> > >
> > > -----Original Message-----
> > > From: Kurt Van Dijck
> > > Sent: Thursday, March 26, 2020 8:39 PM
> > >
> > > Interesting topic.
> > >
> > > After the bus-off problem, I also attacked the overflow problems. No
> > > idea why this patch series got stalled. I activated the complete 64
> > > message buffers of the D_CAN hardware (only 64 of the 128 being
> > > present in the IP for some reason I don't know) and modified the IRQ
> > > to use rx-offload. That may affect the circumstances under load, I admit.
> > >
> > > Whould you mind trying that series, just for reference, in your setup?
> > > If your hypothesis is right, my observations do not survive your
> > > busload.
> > >
> > > Kind regards,
> > > Kurt
> > >
> > > On do, 26 mrt 2020 14:49:23 +0000, Felix Riemann wrote:
> > > > Hi Kurt,
> > > >
> > > > yes that would be 3cb3eaac52c0f145d895f4b6c22834d5f02b8569.
> > > >
> > > > I tried your patch as well, since the description matched the problem pretty
> > well. However, I could still produce the hang on our board (based on AM3352)
> > especially if the bus off occurs while the bus load is very high.
> > > >
> > > > What I could see was the worker being triggered by the interrupt with the
> > bus being in passive mode. When the worker then re-enabled the interrupt the
> > hardware had already entered init mode due to the bus-off but nor would it
> > trigger again nor would the interrupt register indicate a pending IRQ. The status
> > register also indicates an error (LEC != 0x7).
> > > >
> > > > A first attempt to fix/workaround this placed the check just before the
> > worker re-enables the interrupts. But even this setup could reproduce the
> > problem, although it took longer to do so.  So the timing around enabling the
> > interrupts seems to matter here. Actually, adding a printout at the top of the
> > worker where your patch checks the value of the sie_pending atomic seems to
> > increase the chances of making it happen on my board, but that's just a feeling I
> > got while cleaning up my patch.
> > > >
> > > > Regards,
> > > >
> > > > Felix
> > > >
> > > > -----Original Message-----
> > > > From: Kurt Van Dijck <mailto:dev.kurt@vandijck-laurijssen.be>
> > > > Sent: Thursday, March 26, 2020 2:40 PM
> > > >
> > > > This patch looks very familiar.
> > > > I tried this 6 months ago, when I got a setup where I could reproduce this
> > within minutes.
> > > >
> > > > This patch didn't solve my problem, and I fixed my problem with reading the
> > ISR register only once in the IRQ, and _not_ 2nd time in napi handler.
> > > > Reading the ISR in C_CAN is consuming the info, so unless the there is more
> > state to clear (such as received message buffers), the interrupt cause is assumed
> > to be known, and the C_CAN proceeds to the next interrupt cause.
> > > > I've not experienced a single problem since, where before the fix, I'd
> > experience a few occurances every month.
> > > > We have an am3358 (beaglebone-like hardware).
> > > >
> > > > I believe I submitted a patch for that.
> > > >
> > > > Re-scheduling the napi is a workaround, IMHO.
> > > >
> > > > Kurt
> > > >
> > > > On do, 26 mrt 2020 11:43:18 +0100, Felix Riemann wrote:
> > > > > There are appears to be a race condition where interrupts caused
> > > > > by bus offs get lost if it occurs while interrupts are disabled or
> > > > > being re-enabled.
> > > > >
> > > > > This tries to avoid the deadlock by rescheduling the NAPI worker
> > > > > to handle the bus-off condition.
> > > > >
> > > > > Signed-off-by: Felix Riemann <mailto:felix.riemann@sma.de>
> > > > > Reviewed-by: Andre Kalb <mailto:andre.kalb@sma.de>
> > > > > ---
> > > > >  drivers/net/can/c_can/c_can.c | 19 ++++++++++++++++++-
> > > > >  1 file changed, 18 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/net/can/c_can/c_can.c
> > > > > b/drivers/net/can/c_can/c_can.c index 8e9f5620c9a2..0952ac0b9123
> > > > > 100644
> > > > > --- a/drivers/net/can/c_can/c_can.c
> > > > > +++ b/drivers/net/can/c_can/c_can.c
> > > > > @@ -1123,8 +1123,25 @@ static int c_can_poll(struct napi_struct
> > > > > *napi, int quota)  if (work_done < quota) {
> > > > > napi_complete_done(napi, work_done);
> > > > >  /* enable all IRQs if we are not in bus off state */ -if
> > > > > (priv->can.state != CAN_STATE_BUS_OFF)
> > > > > +if (priv->can.state != CAN_STATE_BUS_OFF) {
> > > > > +u32 ctrl;
> > > > >  c_can_irq_control(priv, true);
> > > > > +
> > > > > +/* There appears to be a race condition when the device
> > > > > + * enters bus off while interrupts are off or being
> > > > > + * re-enabled causing the bus off to get lost.
> > > > > + * This tries to avoid this condition.
> > > > > + */
> > > > > +ctrl = priv->read_reg(priv, C_CAN_CTRL_REG);
> > > > > +
> > > > > +if (ctrl & CONTROL_INIT) {
> > > > > +netdev_warn(dev, "lost bus off\n"); c_can_irq_control(priv,
> > > > > +false);
> > > > > +/* Reschedule worker to handle bus off */
> > > > > +atomic_set(&priv->sie_pending, 1); napi_reschedule(napi); } }
> > > > >  }
> > > > >
> > > > >  return work_done;
> > > > > --
> > > > > 2.26.0
> ___________________________________________________
> 
> SMA Solar Technology AG
> Aufsichtsrat: Dr. Erik Ehrentraut (Vorsitzender)
> Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
> Handelsregister: Amtsgericht Kassel HRB 3972
> Sitz der Gesellschaft: 34266 Niestetal
> USt-ID-Nr. DE 113 08 59 54
> WEEE-Reg.-Nr. DE 95881150
> ___________________________________________________

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-04-29  8:47             ` Kurt Van Dijck
@ 2020-04-29  9:06               ` Kurt Van Dijck
  2020-06-10  8:08                 ` Felix Riemann
  0 siblings, 1 reply; 17+ messages in thread
From: Kurt Van Dijck @ 2020-04-29  9:06 UTC (permalink / raw)
  To: Felix Riemann, linux-can, Marc Kleine-Budde, Wolfgang Grandegger,
	Andre Kalb

I have the impression that the recv path is covered, at least the null
ptr would have caused issued earlier already.
Would the problem arise in c_can_do_tx fetching a NULL echo_skb?

Kurt

On wo, 29 apr 2020 10:47:33 +0200, Kurt Van Dijck wrote:
> 
> Hi,
> 
> On ma, 30 mrt 2020 17:30:08 +0000, Felix Riemann wrote:
> > Hi Kurt,
> > 
> > tried to give it a try. However, after starting a few (4 - 5) cangen  instances to increase bus load it panics:
> 
> It would be more helpful if you had a verbose stacktrace.
> 
> Kurt
> 
> > 
> > [  207.259295] Unable to handle kernel NULL pointer dereference at virtual address 00000000
> > [  207.267433] pgd = d2bd4866
> > [  207.270153] [00000000] *pgd=8ca25831, *pte=00000000, *ppte=00000000
> > [  207.276469] Internal error: Oops: 817 [#1] PREEMPT ARM
> > [  207.281633] Modules linked in: cfg80211 rfkill lm75 musb_dsps musb_hdrc usbcore phy_am335x phy_generic usb_common phy_am335x_control gpio_keys xt_owner iptable_filter ip_tables can_raw
> > [  207.298189] CPU: 0 PID: 782 Comm: cangen Not tainted 5.5.11-00013-gcdf175eb0ffc #7
> > [  207.305793] Hardware name: Generic AM33XX (Flattened Device Tree)
> > [  207.311942] PC is at can_rx_offload_irq_receive_skb+0x54/0x78
> > [  207.317722] LR is at c_can_isr+0x52c/0x5b0
> > [  207.321838] pc : [<c04856f4>]    lr : [<c04875f8>]    psr: 80070193
> > [  207.328133] sp : cde6fd30  ip : 00000008  fp : cf4ec5c8
> > [  207.333383] r10: 00000000  r9 : 00000000  r8 : cf4ec4e0
> > [  207.338633] r7 : 00000001  r6 : 00000000  r5 : 00000039  r4 : cf4ec000
> > [  207.345192] r3 : cf4ec5c8  r2 : cf4ec5dc  r1 : 00000000  r0 : cf4ec5dc
> > [  207.351752] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
> > [  207.359010] Control: 10c5387d  Table: 8cb50019  DAC: 00000051
> > [  207.364785] Process cangen (pid: 782, stack limit = 0x7ef64b0d)
> > [  207.370734] Stack: (0xcde6fd30 to 0xcde70000)
> > [  207.375116] fd20:                                     00000000 cf4ec000 00000000 00000000
> > [  207.383341] fd40: cde6fd7f 8d12c45f cd82ed80 cc864c80 cf3a9310 00000030 00000001 ffffe000
> > [  207.391566] fd60: cf3a9300 cde6fda0 00000000 c0165450 cf3a9300 cde6fda0 00010002 c0a04248
> > [  207.399790] fd80: cf3a9310 cf3a9300 00000001 00000000 cde6e000 8ec8018f 000000e8 c01655f4
> > [  207.408014] fda0: 00000000 8d12c45f cf3a9300 cf3a9310 cf01f400 c01656a4 cf3a9300 cf3a9310
> > [  207.416239] fdc0: cf01f400 c0169520 00000000 00000000 cf01f400 c0164720 00000000 c0164dac
> > [  207.424463] fde0: c01128c4 20070113 ffffffff cde6fe2c 00000000 c0101a8c 00020837 cfecd200
> > [  207.432687] fe00: cfcb9000 00214200 8ec8018f 0000ec80 00000000 cfecd200 00000000 00000000
> > [  207.440911] fe20: 8ec8018f 000000e8 20070113 cde6fe48 00020837 c01128c4 20070113 ffffffff
> > [  207.449136] fe40: 00000051 bf000000 8ec8018f cca261ac 00000000 c0203830 cde6fee4 cf79e0b0
> > [  207.457359] fe60: 00000000 c0206358 cfecd200 00000001 cde6fee4 cdcba300 ce38e188 000000ec
> > [  207.465584] fe80: c0a04248 c01de0d0 00023b20 ce38e18c 000000e8 00280000 ce37e720 00000000
> > [  207.473807] fea0: 00000000 8d12c45f 00000011 00000000 000000ec c0a04248 b6e61000 00000040
> > [  207.482032] fec0: 000000dd cc8b7000 cc8b703c c0206b80 c0a04248 00000000 cdcba840 00000010
> > [  207.490256] fee0: 00000000 cf79e0b0 00000054 00100cca 000000de b6e6b000 ccb52db8 ccb52db8
> > [  207.498479] ff00: 00000000 00000000 00000000 00000000 cca261ac cc8b703c 00000000 8d12c45f
> > [  207.506704] ff20: ce694570 cde6ffb0 b6e61de4 cf79e0b0 00000054 cd944000 cc8b7000 80000007
> > [  207.514928] ff40: cc8b703c c0110c44 cdcba840 00000010 00000010 c022c9f4 00000000 00000010
> > [  207.523152] ff60: cdcba840 00000007 c0a08e38 b6e61de4 cde6ffb0 10c53c7d 00000000 00023b20
> > [  207.531375] ff80: 00023b20 c0110f94 00000000 00000010 00000000 00000004 c0101204 b6e61de4
> > [  207.539599] ffa0: 60070010 ffffffff 10c5387d c01020c8 bea8fbc8 00000000 00000010 00000000
> > [  207.547822] ffc0: 00000000 00000010 00000000 00000003 00000000 00000000 00023b20 00023b20
> > [  207.556047] ffe0: b6e61de4 bea8fb78 00011454 b6e61de4 60070010 ffffffff 00000000 00000000
> > [  207.564287] [<c04856f4>] (can_rx_offload_irq_receive_skb) from [<cf4ec000>] (0xcf4ec000)
> > [  207.572428] Code: e58430e8 e8bd8010 e5932018 e2830014 (e5810000)
> > [  207.578557] ---[ end trace 67a8f60eda872c72 ]---
> > [  207.583199] Kernel panic - not syncing: Fatal exception in interrupt
> > 
> > I also noticed two things while building the patchset.
> > The GPL export of can_rx_offload_irq_receive_skb uses the wrong name in the second patch and patch 5 prevents the c_can driver from being built as a module as it references the unexported __can_get_echo_skb function.
> > 
> > Regards,
> > 
> > Felix
> > 
> > 
> > > -----Original Message-----
> > > From: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be>
> > > Sent: Monday, March 30, 2020 9:02 AM
> > >
> > > On vr, 27 mrt 2020 10:09:03 +0000, Felix Riemann wrote:
> > > > Is it this one?
> > > > https://marc.info/?l=linux-can&m=157683579518282
> > >
> > > yes
> > >
> > > >
> > > > It seems to be missing patch 6 in the series.
> > >
> > > you can get there via
> > > https://github.com/kurt-vd/linux, branch c_can-next
> > >
> > > Kurt
> > >
> > > >
> > > > Regards,
> > > > Felix
> > > >
> > > > -----Original Message-----
> > > > From: Kurt Van Dijck
> > > > Sent: Thursday, March 26, 2020 8:39 PM
> > > >
> > > > Interesting topic.
> > > >
> > > > After the bus-off problem, I also attacked the overflow problems. No
> > > > idea why this patch series got stalled. I activated the complete 64
> > > > message buffers of the D_CAN hardware (only 64 of the 128 being
> > > > present in the IP for some reason I don't know) and modified the IRQ
> > > > to use rx-offload. That may affect the circumstances under load, I admit.
> > > >
> > > > Whould you mind trying that series, just for reference, in your setup?
> > > > If your hypothesis is right, my observations do not survive your
> > > > busload.
> > > >
> > > > Kind regards,
> > > > Kurt
> > > >
> > > > On do, 26 mrt 2020 14:49:23 +0000, Felix Riemann wrote:
> > > > > Hi Kurt,
> > > > >
> > > > > yes that would be 3cb3eaac52c0f145d895f4b6c22834d5f02b8569.
> > > > >
> > > > > I tried your patch as well, since the description matched the problem pretty
> > > well. However, I could still produce the hang on our board (based on AM3352)
> > > especially if the bus off occurs while the bus load is very high.
> > > > >
> > > > > What I could see was the worker being triggered by the interrupt with the
> > > bus being in passive mode. When the worker then re-enabled the interrupt the
> > > hardware had already entered init mode due to the bus-off but nor would it
> > > trigger again nor would the interrupt register indicate a pending IRQ. The status
> > > register also indicates an error (LEC != 0x7).
> > > > >
> > > > > A first attempt to fix/workaround this placed the check just before the
> > > worker re-enables the interrupts. But even this setup could reproduce the
> > > problem, although it took longer to do so.  So the timing around enabling the
> > > interrupts seems to matter here. Actually, adding a printout at the top of the
> > > worker where your patch checks the value of the sie_pending atomic seems to
> > > increase the chances of making it happen on my board, but that's just a feeling I
> > > got while cleaning up my patch.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Felix
> > > > >
> > > > > -----Original Message-----
> > > > > From: Kurt Van Dijck <mailto:dev.kurt@vandijck-laurijssen.be>
> > > > > Sent: Thursday, March 26, 2020 2:40 PM
> > > > >
> > > > > This patch looks very familiar.
> > > > > I tried this 6 months ago, when I got a setup where I could reproduce this
> > > within minutes.
> > > > >
> > > > > This patch didn't solve my problem, and I fixed my problem with reading the
> > > ISR register only once in the IRQ, and _not_ 2nd time in napi handler.
> > > > > Reading the ISR in C_CAN is consuming the info, so unless the there is more
> > > state to clear (such as received message buffers), the interrupt cause is assumed
> > > to be known, and the C_CAN proceeds to the next interrupt cause.
> > > > > I've not experienced a single problem since, where before the fix, I'd
> > > experience a few occurances every month.
> > > > > We have an am3358 (beaglebone-like hardware).
> > > > >
> > > > > I believe I submitted a patch for that.
> > > > >
> > > > > Re-scheduling the napi is a workaround, IMHO.
> > > > >
> > > > > Kurt
> > > > >
> > > > > On do, 26 mrt 2020 11:43:18 +0100, Felix Riemann wrote:
> > > > > > There are appears to be a race condition where interrupts caused
> > > > > > by bus offs get lost if it occurs while interrupts are disabled or
> > > > > > being re-enabled.
> > > > > >
> > > > > > This tries to avoid the deadlock by rescheduling the NAPI worker
> > > > > > to handle the bus-off condition.
> > > > > >
> > > > > > Signed-off-by: Felix Riemann <mailto:felix.riemann@sma.de>
> > > > > > Reviewed-by: Andre Kalb <mailto:andre.kalb@sma.de>
> > > > > > ---
> > > > > >  drivers/net/can/c_can/c_can.c | 19 ++++++++++++++++++-
> > > > > >  1 file changed, 18 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/drivers/net/can/c_can/c_can.c
> > > > > > b/drivers/net/can/c_can/c_can.c index 8e9f5620c9a2..0952ac0b9123
> > > > > > 100644
> > > > > > --- a/drivers/net/can/c_can/c_can.c
> > > > > > +++ b/drivers/net/can/c_can/c_can.c
> > > > > > @@ -1123,8 +1123,25 @@ static int c_can_poll(struct napi_struct
> > > > > > *napi, int quota)  if (work_done < quota) {
> > > > > > napi_complete_done(napi, work_done);
> > > > > >  /* enable all IRQs if we are not in bus off state */ -if
> > > > > > (priv->can.state != CAN_STATE_BUS_OFF)
> > > > > > +if (priv->can.state != CAN_STATE_BUS_OFF) {
> > > > > > +u32 ctrl;
> > > > > >  c_can_irq_control(priv, true);
> > > > > > +
> > > > > > +/* There appears to be a race condition when the device
> > > > > > + * enters bus off while interrupts are off or being
> > > > > > + * re-enabled causing the bus off to get lost.
> > > > > > + * This tries to avoid this condition.
> > > > > > + */
> > > > > > +ctrl = priv->read_reg(priv, C_CAN_CTRL_REG);
> > > > > > +
> > > > > > +if (ctrl & CONTROL_INIT) {
> > > > > > +netdev_warn(dev, "lost bus off\n"); c_can_irq_control(priv,
> > > > > > +false);
> > > > > > +/* Reschedule worker to handle bus off */
> > > > > > +atomic_set(&priv->sie_pending, 1); napi_reschedule(napi); } }
> > > > > >  }
> > > > > >
> > > > > >  return work_done;
> > > > > > --
> > > > > > 2.26.0
> > ___________________________________________________
> > 
> > SMA Solar Technology AG
> > Aufsichtsrat: Dr. Erik Ehrentraut (Vorsitzender)
> > Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
> > Handelsregister: Amtsgericht Kassel HRB 3972
> > Sitz der Gesellschaft: 34266 Niestetal
> > USt-ID-Nr. DE 113 08 59 54
> > WEEE-Reg.-Nr. DE 95881150
> > ___________________________________________________

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-04-29  9:06               ` Kurt Van Dijck
@ 2020-06-10  8:08                 ` Felix Riemann
  2020-06-12  7:10                   ` Kurt Van Dijck
  0 siblings, 1 reply; 17+ messages in thread
From: Felix Riemann @ 2020-06-10  8:08 UTC (permalink / raw)
  To: Kurt Van Dijck
  Cc: linux-can, Marc Kleine-Budde, Wolfgang Grandegger, Andre Kalb

Hi Kurt,

> -----Original Message-----
> From: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be>
> Sent: Wednesday, April 29, 2020 11:06 AM
> Subject: Re: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are
> disabled
>
> I have the impression that the recv path is covered, at least the null ptr would
> have caused issued earlier already.
> Would the problem arise in c_can_do_tx fetching a NULL echo_skb?

It really looks like it.

Adding a BUG_ON to the loop in c_can_do_tx triggers with a NULL skb returned by __can_get_echo_skb().

For testing I replaced the BUG_ON with a check that skips the skb if that happens (not sure if that would be a valid fix in that place):

while ((obj = c_can_ffs64(pend))) {
pend &= ~((u64)1 << (obj - 1));
c_can_inval_tx_object(dev, IF_RX, obj);
idx = obj - priv->obj.send_frst;
skb = __can_get_echo_skb(dev, idx, &len);
if(!skb)
continue;
can_rx_offload_irq_receive_skb(&priv->offload, skb);
bytes += len;
pkts++;
}

So far it got through the night without crashing or deadlocking, but I can't really tell if we are missing pakets though.

Regards,

Felix
___________________________________________________

SMA Solar Technology AG
Aufsichtsrat: Uwe Kleinkauf (Vorsitzender)
Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
Handelsregister: Amtsgericht Kassel HRB 3972
Sitz der Gesellschaft: 34266 Niestetal
USt-ID-Nr. DE 113 08 59 54
WEEE-Reg.-Nr. DE 95881150
___________________________________________________

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-06-10  8:08                 ` Felix Riemann
@ 2020-06-12  7:10                   ` Kurt Van Dijck
  2020-06-12  7:42                     ` Kurt Van Dijck
  0 siblings, 1 reply; 17+ messages in thread
From: Kurt Van Dijck @ 2020-06-12  7:10 UTC (permalink / raw)
  To: Felix Riemann
  Cc: linux-can, Marc Kleine-Budde, Wolfgang Grandegger, Andre Kalb

Thanks for verifying,

I'll take a look how that can happen.

I think packet loss is better (less packets lost) than before.

Kurt

On wo, 10 jun 2020 08:08:28 +0000, Felix Riemann wrote:
> Hi Kurt,
> 
> > -----Original Message-----
> >
> > I have the impression that the recv path is covered, at least the null ptr would
> > have caused issued earlier already.
> > Would the problem arise in c_can_do_tx fetching a NULL echo_skb?
> 
> It really looks like it.
> 
> Adding a BUG_ON to the loop in c_can_do_tx triggers with a NULL skb returned by __can_get_echo_skb().
> 
> For testing I replaced the BUG_ON with a check that skips the skb if that happens (not sure if that would be a valid fix in that place):
> 
> while ((obj = c_can_ffs64(pend))) {
> pend &= ~((u64)1 << (obj - 1));
> c_can_inval_tx_object(dev, IF_RX, obj);
> idx = obj - priv->obj.send_frst;
> skb = __can_get_echo_skb(dev, idx, &len);
> if(!skb)
> continue;
> can_rx_offload_irq_receive_skb(&priv->offload, skb);
> bytes += len;
> pkts++;
> }
> 
> So far it got through the night without crashing or deadlocking, but I can't really tell if we are missing pakets though.
> 
> Regards,
> 
> Felix
> ___________________________________________________
> 
> SMA Solar Technology AG
> Aufsichtsrat: Uwe Kleinkauf (Vorsitzender)
> Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
> Handelsregister: Amtsgericht Kassel HRB 3972
> Sitz der Gesellschaft: 34266 Niestetal
> USt-ID-Nr. DE 113 08 59 54
> WEEE-Reg.-Nr. DE 95881150
> ___________________________________________________

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-06-12  7:10                   ` Kurt Van Dijck
@ 2020-06-12  7:42                     ` Kurt Van Dijck
  2020-06-15 11:11                       ` Felix Riemann
  0 siblings, 1 reply; 17+ messages in thread
From: Kurt Van Dijck @ 2020-06-12  7:42 UTC (permalink / raw)
  To: Felix Riemann, linux-can, Marc Kleine-Budde, Wolfgang Grandegger,
	Andre Kalb

Hi Felix,

Next question comes up.
See inline.

On vr, 12 jun 2020 09:10:44 +0200, Kurt Van Dijck wrote:
> Thanks for verifying,
> 
> I'll take a look how that can happen.
> 
> I think packet loss is better (less packets lost) than before.
> 
> Kurt
> 
> On wo, 10 jun 2020 08:08:28 +0000, Felix Riemann wrote:
> > Hi Kurt,
> > 
> > > -----Original Message-----
> > >
> > > I have the impression that the recv path is covered, at least the null ptr would
> > > have caused issued earlier already.
> > > Would the problem arise in c_can_do_tx fetching a NULL echo_skb?
> > 
> > It really looks like it.
> > 
> > Adding a BUG_ON to the loop in c_can_do_tx triggers with a NULL skb returned by __can_get_echo_skb().
> > 
> > For testing I replaced the BUG_ON with a check that skips the skb if that happens (not sure if that would be a valid fix in that place):
> > 
> > while ((obj = c_can_ffs64(pend))) {
> > pend &= ~((u64)1 << (obj - 1));
> > c_can_inval_tx_object(dev, IF_RX, obj);
> > idx = obj - priv->obj.send_frst;
> > skb = __can_get_echo_skb(dev, idx, &len);
> > if(!skb)

It would be interesting to know idx and priv->tx_active here.

> > continue;
> > can_rx_offload_irq_receive_skb(&priv->offload, skb);
> > bytes += len;
> > pkts++;
> > }
> > 
> > So far it got through the night without crashing or deadlocking, but I can't really tell if we are missing pakets though.
> > 
> > Regards,
> > 
> > Felix
> > ___________________________________________________
> > 
> > SMA Solar Technology AG
> > Aufsichtsrat: Uwe Kleinkauf (Vorsitzender)
> > Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
> > Handelsregister: Amtsgericht Kassel HRB 3972
> > Sitz der Gesellschaft: 34266 Niestetal
> > USt-ID-Nr. DE 113 08 59 54
> > WEEE-Reg.-Nr. DE 95881150
> > ___________________________________________________

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-06-12  7:42                     ` Kurt Van Dijck
@ 2020-06-15 11:11                       ` Felix Riemann
  2020-06-15 15:31                         ` Kurt Van Dijck
  0 siblings, 1 reply; 17+ messages in thread
From: Felix Riemann @ 2020-06-15 11:11 UTC (permalink / raw)
  To: Kurt Van Dijck
  Cc: linux-can, Marc Kleine-Budde, Wolfgang Grandegger, Andre Kalb

Hi Kurt,

see below.

> -----Original Message-----
> From: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be>
> Sent: Friday, June 12, 2020 9:43 AM
> Subject: Re: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are
> disabled
>
> Hi Felix,
>
> Next question comes up.
> See inline.
>
> On vr, 12 jun 2020 09:10:44 +0200, Kurt Van Dijck wrote:
> > Thanks for verifying,
> >
> > I'll take a look how that can happen.
> >
> > I think packet loss is better (less packets lost) than before.
> >
> > Kurt
> >
> > On wo, 10 jun 2020 08:08:28 +0000, Felix Riemann wrote:
> > > Hi Kurt,
> > >
> > > > -----Original Message-----
> > > >
> > > > I have the impression that the recv path is covered, at least the
> > > > null ptr would have caused issued earlier already.
> > > > Would the problem arise in c_can_do_tx fetching a NULL echo_skb?
> > >
> > > It really looks like it.
> > >
> > > Adding a BUG_ON to the loop in c_can_do_tx triggers with a NULL skb
> returned by __can_get_echo_skb().
> > >
> > > For testing I replaced the BUG_ON with a check that skips the skb if that
> happens (not sure if that would be a valid fix in that place):
> > >
> > > while ((obj = c_can_ffs64(pend))) {
> > > pend &= ~((u64)1 << (obj - 1));
> > > c_can_inval_tx_object(dev, IF_RX, obj); idx = obj -
> > > priv->obj.send_frst; skb = __can_get_echo_skb(dev, idx, &len);
> > > if(!skb)
>
> It would be interesting to know idx and priv->tx_active here.

That's the statistic over about 1600 lines I captured. First number is number of occurrences:

    992 can0: Skip @ idx 0, tx_active: 0x1
    255 can0: Skip @ idx 0, tx_active: 0x3
    250 can0: Skip @ idx 1, tx_active: 0x2
     11 can0: Skip @ idx 0, tx_active: 0x7
     10 can0: Skip @ idx 1, tx_active: 0x6
      8 can0: Skip @ idx 7, tx_active: 0x80
      8 can0: Skip @ idx 6, tx_active: 0xc0
      8 can0: Skip @ idx 2, tx_active: 0x6
      6 can0: Skip @ idx 1, tx_active: 0xe
      5 can0: Skip @ idx 2, tx_active: 0x4
      5 can0: Skip @ idx 0, tx_active: 0xf
      4 can0: Skip @ idx 2, tx_active: 0xc
      4 can0: Skip @ idx 0, tx_active: 0x3f
      3 can0: Skip @ idx 6, tx_active: 0x40
      3 can0: Skip @ idx 3, tx_active: 0xc
      3 can0: Skip @ idx 2, tx_active: 0xfc
      3 can0: Skip @ idx 2, tx_active: 0x3c
      3 can0: Skip @ idx 0, tx_active: 0x1f
      2 can0: Skip @ idx 5, tx_active: 0x3e
      2 can0: Skip @ idx 4, tx_active: 0x3e
      2 can0: Skip @ idx 3, tx_active: 0xf8
      2 can0: Skip @ idx 3, tx_active: 0x3e
      2 can0: Skip @ idx 2, tx_active: 0x3e
      2 can0: Skip @ idx 1, tx_active: 0x3e
      2 can0: Skip @ idx 1, tx_active: 0x1e
      1 can0: Skip @ idx 7, tx_active: 0xf0
      1 can0: Skip @ idx 6, tx_active: 0xf0
      1 can0: Skip @ idx 6, tx_active: 0x78
      1 can0: Skip @ idx 6, tx_active: 0x70
      1 can0: Skip @ idx 5, tx_active: 0xf0
      1 can0: Skip @ idx 5, tx_active: 0x78
      1 can0: Skip @ idx 5, tx_active: 0x70
      1 can0: Skip @ idx 5, tx_active: 0x38
      1 can0: Skip @ idx 5, tx_active: 0x20
      1 can0: Skip @ idx 4, tx_active: 0xf0
      1 can0: Skip @ idx 4, tx_active: 0x38
      1 can0: Skip @ idx 4, tx_active: 0x1c
      1 can0: Skip @ idx 3, tx_active: 0xfc
      1 can0: Skip @ idx 3, tx_active: 0xe
      1 can0: Skip @ idx 3, tx_active: 0x8
      1 can0: Skip @ idx 3, tx_active: 0x78
      1 can0: Skip @ idx 3, tx_active: 0x38
      1 can0: Skip @ idx 3, tx_active: 0x1c
      1 can0: Skip @ idx 2, tx_active: 0xe
      1 can0: Skip @ idx 2, tx_active: 0x7c
      1 can0: Skip @ idx 2, tx_active: 0x1c
      1 can0: Skip @ idx 1, tx_active: 0xfe
      1 can0: Skip @ idx 0, tx_active: 0xff
      1 can0: Skip @ idx 0, tx_active: 0x7f

I used atomic_read to get the value of tx_active.

Regards,

Felix
___________________________________________________

SMA Solar Technology AG
Aufsichtsrat: Uwe Kleinkauf (Vorsitzender)
Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
Handelsregister: Amtsgericht Kassel HRB 3972
Sitz der Gesellschaft: 34266 Niestetal
USt-ID-Nr. DE 113 08 59 54
WEEE-Reg.-Nr. DE 95881150
___________________________________________________

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-06-15 11:11                       ` Felix Riemann
@ 2020-06-15 15:31                         ` Kurt Van Dijck
  2020-06-16  9:51                           ` Felix Riemann
  0 siblings, 1 reply; 17+ messages in thread
From: Kurt Van Dijck @ 2020-06-15 15:31 UTC (permalink / raw)
  To: Felix Riemann
  Cc: linux-can, Marc Kleine-Budde, Wolfgang Grandegger, Andre Kalb

On ma, 15 jun 2020 11:11:17 +0000, Felix Riemann wrote:
> 
> Hi Kurt,
> 
> see below.
> 
> > -----Original Message-----
> >
> > Hi Felix,
> >
> > Next question comes up.
> > See inline.
> >
> > On vr, 12 jun 2020 09:10:44 +0200, Kurt Van Dijck wrote:
> > > Thanks for verifying,
> > >
> > > I'll take a look how that can happen.
> > >
> > > I think packet loss is better (less packets lost) than before.
> > >
> > > Kurt
> > >
> > > On wo, 10 jun 2020 08:08:28 +0000, Felix Riemann wrote:
> > > > Hi Kurt,
> > > >
> > > > > -----Original Message-----
> > > > >
> > > > > I have the impression that the recv path is covered, at least the
> > > > > null ptr would have caused issued earlier already.
> > > > > Would the problem arise in c_can_do_tx fetching a NULL echo_skb?
> > > >
> > > > It really looks like it.
> > > >
> > > > Adding a BUG_ON to the loop in c_can_do_tx triggers with a NULL skb
> > returned by __can_get_echo_skb().
> > > >
> > > > For testing I replaced the BUG_ON with a check that skips the skb if that
> > happens (not sure if that would be a valid fix in that place):
> > > >
> > > > while ((obj = c_can_ffs64(pend))) {
> > > > pend &= ~((u64)1 << (obj - 1));
> > > > c_can_inval_tx_object(dev, IF_RX, obj); idx = obj -
> > > > priv->obj.send_frst; skb = __can_get_echo_skb(dev, idx, &len);
> > > > if(!skb)
> >
> > It would be interesting to know idx and priv->tx_active here.
> 
> That's the statistic over about 1600 lines I captured. First number is number of occurrences:

just double-check:
These are all when __can_get_echo_skb returned NULL?

> 
>     992 can0: Skip @ idx 0, tx_active: 0x1
>     255 can0: Skip @ idx 0, tx_active: 0x3
>     250 can0: Skip @ idx 1, tx_active: 0x2
>      11 can0: Skip @ idx 0, tx_active: 0x7
>      10 can0: Skip @ idx 1, tx_active: 0x6
>       8 can0: Skip @ idx 7, tx_active: 0x80
>       8 can0: Skip @ idx 6, tx_active: 0xc0
>       8 can0: Skip @ idx 2, tx_active: 0x6
>       6 can0: Skip @ idx 1, tx_active: 0xe
>       5 can0: Skip @ idx 2, tx_active: 0x4
>       5 can0: Skip @ idx 0, tx_active: 0xf
>       4 can0: Skip @ idx 2, tx_active: 0xc
>       4 can0: Skip @ idx 0, tx_active: 0x3f
>       3 can0: Skip @ idx 6, tx_active: 0x40
>       3 can0: Skip @ idx 3, tx_active: 0xc
>       3 can0: Skip @ idx 2, tx_active: 0xfc
>       3 can0: Skip @ idx 2, tx_active: 0x3c
>       3 can0: Skip @ idx 0, tx_active: 0x1f
>       2 can0: Skip @ idx 5, tx_active: 0x3e
>       2 can0: Skip @ idx 4, tx_active: 0x3e
>       2 can0: Skip @ idx 3, tx_active: 0xf8
>       2 can0: Skip @ idx 3, tx_active: 0x3e
>       2 can0: Skip @ idx 2, tx_active: 0x3e
>       2 can0: Skip @ idx 1, tx_active: 0x3e
>       2 can0: Skip @ idx 1, tx_active: 0x1e
>       1 can0: Skip @ idx 7, tx_active: 0xf0
>       1 can0: Skip @ idx 6, tx_active: 0xf0
>       1 can0: Skip @ idx 6, tx_active: 0x78
>       1 can0: Skip @ idx 6, tx_active: 0x70
>       1 can0: Skip @ idx 5, tx_active: 0xf0
>       1 can0: Skip @ idx 5, tx_active: 0x78
>       1 can0: Skip @ idx 5, tx_active: 0x70
>       1 can0: Skip @ idx 5, tx_active: 0x38
>       1 can0: Skip @ idx 5, tx_active: 0x20
>       1 can0: Skip @ idx 4, tx_active: 0xf0
>       1 can0: Skip @ idx 4, tx_active: 0x38
>       1 can0: Skip @ idx 4, tx_active: 0x1c
>       1 can0: Skip @ idx 3, tx_active: 0xfc
>       1 can0: Skip @ idx 3, tx_active: 0xe
>       1 can0: Skip @ idx 3, tx_active: 0x8
>       1 can0: Skip @ idx 3, tx_active: 0x78
>       1 can0: Skip @ idx 3, tx_active: 0x38
>       1 can0: Skip @ idx 3, tx_active: 0x1c
>       1 can0: Skip @ idx 2, tx_active: 0xe
>       1 can0: Skip @ idx 2, tx_active: 0x7c
>       1 can0: Skip @ idx 2, tx_active: 0x1c
>       1 can0: Skip @ idx 1, tx_active: 0xfe
>       1 can0: Skip @ idx 0, tx_active: 0xff
>       1 can0: Skip @ idx 0, tx_active: 0x7f
> 
> I used atomic_read to get the value of tx_active.
> 
> Regards,
> 
> Felix
> ___________________________________________________
> 
> SMA Solar Technology AG
> Aufsichtsrat: Uwe Kleinkauf (Vorsitzender)
> Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
> Handelsregister: Amtsgericht Kassel HRB 3972
> Sitz der Gesellschaft: 34266 Niestetal
> USt-ID-Nr. DE 113 08 59 54
> WEEE-Reg.-Nr. DE 95881150
> ___________________________________________________

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-06-15 15:31                         ` Kurt Van Dijck
@ 2020-06-16  9:51                           ` Felix Riemann
  2020-06-16 12:42                             ` Kurt Van Dijck
  0 siblings, 1 reply; 17+ messages in thread
From: Felix Riemann @ 2020-06-16  9:51 UTC (permalink / raw)
  To: Kurt Van Dijck
  Cc: linux-can, Marc Kleine-Budde, Wolfgang Grandegger, Andre Kalb

Hi Kurt,

> -----Original Message-----
> From: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be>
> Sent: Monday, June 15, 2020 5:31 PM
> Subject: Re: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are
> disabled
>
> On ma, 15 jun 2020 11:11:17 +0000, Felix Riemann wrote:
> >
> > Hi Kurt,
> >
> > see below.
> >
> > > -----Original Message-----
> > >
> > > Hi Felix,
> > >
> > > Next question comes up.
> > > See inline.
> > >
> > > On vr, 12 jun 2020 09:10:44 +0200, Kurt Van Dijck wrote:
> > > > Thanks for verifying,
> > > >
> > > > I'll take a look how that can happen.
> > > >
> > > > I think packet loss is better (less packets lost) than before.
> > > >
> > > > Kurt
> > > >
> > > > On wo, 10 jun 2020 08:08:28 +0000, Felix Riemann wrote:
> > > > > Hi Kurt,
> > > > >
> > > > > > -----Original Message-----
> > > > > >
> > > > > > I have the impression that the recv path is covered, at least
> > > > > > the null ptr would have caused issued earlier already.
> > > > > > Would the problem arise in c_can_do_tx fetching a NULL echo_skb?
> > > > >
> > > > > It really looks like it.
> > > > >
> > > > > Adding a BUG_ON to the loop in c_can_do_tx triggers with a NULL
> > > > > skb
> > > returned by __can_get_echo_skb().
> > > > >
> > > > > For testing I replaced the BUG_ON with a check that skips the
> > > > > skb if that
> > > happens (not sure if that would be a valid fix in that place):
> > > > >
> > > > > while ((obj = c_can_ffs64(pend))) { pend &= ~((u64)1 << (obj -
> > > > > 1)); c_can_inval_tx_object(dev, IF_RX, obj); idx = obj -
> > > > > priv->obj.send_frst; skb = __can_get_echo_skb(dev, idx, &len);
> > > > > if(!skb)
> > >
> > > It would be interesting to know idx and priv->tx_active here.
> >
> > That's the statistic over about 1600 lines I captured. First number is number
> of occurrences:
>
> just double-check:
> These are all when __can_get_echo_skb returned NULL?

Yes, that's correct. The printout occurs directly before it skips to the next item in the loop.

Felix
___________________________________________________

SMA Solar Technology AG
Aufsichtsrat: Uwe Kleinkauf (Vorsitzender)
Vorstand: Ulrich Hadding, Dr.-Ing. Juergen Reinert
Handelsregister: Amtsgericht Kassel HRB 3972
Sitz der Gesellschaft: 34266 Niestetal
USt-ID-Nr. DE 113 08 59 54
WEEE-Reg.-Nr. DE 95881150
___________________________________________________

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled
  2020-06-16  9:51                           ` Felix Riemann
@ 2020-06-16 12:42                             ` Kurt Van Dijck
  0 siblings, 0 replies; 17+ messages in thread
From: Kurt Van Dijck @ 2020-06-16 12:42 UTC (permalink / raw)
  To: Felix Riemann
  Cc: linux-can, Marc Kleine-Budde, Wolfgang Grandegger, Andre Kalb

On di, 16 jun 2020 09:51:00 +0000, Felix Riemann wrote:
> Hi Kurt,
> 
> > -----Original Message-----
> > From: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be>
> > Sent: Monday, June 15, 2020 5:31 PM
> > Subject: Re: [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are
> > disabled
> >
> > On ma, 15 jun 2020 11:11:17 +0000, Felix Riemann wrote:
> > >
> > > Hi Kurt,
> > >
> > > see below.
> > >
> > > > -----Original Message-----
> > > >
> > > > Hi Felix,
> > > >
> > > > Next question comes up.
> > > > See inline.
> > > >
> > > > On vr, 12 jun 2020 09:10:44 +0200, Kurt Van Dijck wrote:
> > > > > Thanks for verifying,
> > > > >
> > > > > I'll take a look how that can happen.
> > > > >
> > > > > I think packet loss is better (less packets lost) than before.
> > > > >
> > > > > Kurt
> > > > >
> > > > > On wo, 10 jun 2020 08:08:28 +0000, Felix Riemann wrote:
> > > > > > Hi Kurt,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > >
> > > > > > > I have the impression that the recv path is covered, at least
> > > > > > > the null ptr would have caused issued earlier already.
> > > > > > > Would the problem arise in c_can_do_tx fetching a NULL echo_skb?
> > > > > >
> > > > > > It really looks like it.
> > > > > >
> > > > > > Adding a BUG_ON to the loop in c_can_do_tx triggers with a NULL
> > > > > > skb
> > > > returned by __can_get_echo_skb().
> > > > > >
> > > > > > For testing I replaced the BUG_ON with a check that skips the
> > > > > > skb if that
> > > > happens (not sure if that would be a valid fix in that place):
> > > > > >
> > > > > > while ((obj = c_can_ffs64(pend))) { pend &= ~((u64)1 << (obj -
> > > > > > 1)); c_can_inval_tx_object(dev, IF_RX, obj); idx = obj -
> > > > > > priv->obj.send_frst; skb = __can_get_echo_skb(dev, idx, &len);
> > > > > > if(!skb)
> > > >
> > > > It would be interesting to know idx and priv->tx_active here.
> > >
> > > That's the statistic over about 1600 lines I captured. First number is number
> > of occurrences:
> >
> > just double-check:
> > These are all when __can_get_echo_skb returned NULL?
> 
> Yes, that's correct. The printout occurs directly before it skips to the next item in the loop.

thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-06-16 12:42 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-26 10:43 [PATCH] can: c_can: Handle lost bus-off interrupt while IRQs are disabled Felix Riemann
2020-03-26 13:40 ` Kurt Van Dijck
2020-03-26 14:49   ` Felix Riemann
2020-03-26 19:38     ` Kurt Van Dijck
2020-03-27 10:09       ` Felix Riemann
2020-03-30  7:02         ` Kurt Van Dijck
2020-03-30 17:30           ` Felix Riemann
2020-03-31  9:44             ` Kurt Van Dijck
2020-04-29  8:47             ` Kurt Van Dijck
2020-04-29  9:06               ` Kurt Van Dijck
2020-06-10  8:08                 ` Felix Riemann
2020-06-12  7:10                   ` Kurt Van Dijck
2020-06-12  7:42                     ` Kurt Van Dijck
2020-06-15 11:11                       ` Felix Riemann
2020-06-15 15:31                         ` Kurt Van Dijck
2020-06-16  9:51                           ` Felix Riemann
2020-06-16 12:42                             ` Kurt Van Dijck

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.