All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] can: ti_hecc: fix rx wrong sequence issue
@ 2012-10-08  6:40 AnilKumar Ch
  2012-10-08  9:12 ` Jan Lübbe
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: AnilKumar Ch @ 2012-10-08  6:40 UTC (permalink / raw)
  To: wg, mkl; +Cc: swarren, linux-can, anantgole, nsekhar, AnilKumar Ch

Fix "received wrong sequence count. expected: x, got: y" errors
reported by cansequence in a stress test (--verbose).

While processing the receive packets in mailboxes, upper mailboxes
need to enable while processing 12th (RX_BUFFER mailbox) mailbox.
This requires a status check to identify whether the receiving of
packets in progress or not. If the mailboxes are enabled before the
receive packet status is cleared then there is a possibility of
receiving out of order packets.

With this patch mailboxes are enabled once the receive status is
cleared.

Signed-off-by: AnilKumar Ch <anilkumar@ti.com>
---
 drivers/net/can/ti_hecc.c |   19 ++++++++++++++++++-
 1 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/drivers/net/can/ti_hecc.c b/drivers/net/can/ti_hecc.c
index 39293e9..2417a2d 100644
--- a/drivers/net/can/ti_hecc.c
+++ b/drivers/net/can/ti_hecc.c
@@ -169,6 +169,7 @@ MODULE_VERSION(HECC_MODULE_VERSION);
 #define HECC_CANES_SMA		BIT(5)	/* suspend mode ack */
 #define HECC_CANES_CCE		BIT(4)	/* Change config enabled */
 #define HECC_CANES_PDA		BIT(3)	/* Power down mode ack */
+#define HECC_CANES_RM		BIT(1)	/* Receive Mode bit */
 
 #define HECC_CANBTC_SAM		BIT(7)	/* sample points */
 
@@ -194,6 +195,13 @@ MODULE_VERSION(HECC_MODULE_VERSION);
 #define HECC_CANGIM_DEF_MASK	0x700	/* only busoff/warning/passive */
 #define HECC_CANGIM_SIL		BIT(2)	/* system interrupts to int line 1 */
 
+/*
+ * Receive Mode bit reflects what the CAN protocol kernel (CPK) is
+ * actually doing regardless of mailbox configuration. CPK receive
+ * mode timeout. Tried from 1 - 5us and kept 10 as a safety value.
+ */
+#define RM_TIMEOUT_US		10
+
 /* CAN Bittiming constants as per HECC specs */
 static struct can_bittiming_const ti_hecc_bittiming_const = {
 	.name = DRV_NAME,
@@ -615,7 +623,7 @@ static int ti_hecc_rx_poll(struct napi_struct *napi, int quota)
 	struct ti_hecc_priv *priv = netdev_priv(ndev);
 	u32 num_pkts = 0;
 	u32 mbx_mask;
-	unsigned long pending_pkts, flags;
+	unsigned long pending_pkts, flags, timeout;
 
 	if (!netif_running(ndev))
 		return 0;
@@ -633,6 +641,15 @@ static int ti_hecc_rx_poll(struct napi_struct *napi, int quota)
 		--priv->rx_next;
 		if (priv->rx_next == HECC_RX_BUFFER_MBOX) {
 			/* enable high bank mailboxes */
+			timeout = jiffies + usecs_to_jiffies(RM_TIMEOUT_US);
+			while (hecc_get_bit(priv, HECC_CANES, HECC_CANES_RM)) {
+				cpu_relax();
+				if (time_after(jiffies, timeout)) {
+					netdev_warn(ndev, "receiving pkt\n");
+					break;
+				}
+			}
+
 			spin_lock_irqsave(&priv->mbx_lock, flags);
 			mbx_mask = hecc_read(priv, HECC_CANME);
 			mbx_mask |= HECC_RX_HIGH_MBOX_MASK;
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] can: ti_hecc: fix rx wrong sequence issue
  2012-10-08  6:40 [PATCH] can: ti_hecc: fix rx wrong sequence issue AnilKumar Ch
@ 2012-10-08  9:12 ` Jan Lübbe
  2012-10-11 12:01   ` Jan Lübbe
  2012-10-08  9:28 ` Wolfgang Grandegger
  2012-11-05 11:34 ` Marc Kleine-Budde
  2 siblings, 1 reply; 10+ messages in thread
From: Jan Lübbe @ 2012-10-08  9:12 UTC (permalink / raw)
  To: AnilKumar Ch; +Cc: wg, mkl, swarren, linux-can, anantgole, nsekhar

On Mon, 2012-10-08 at 12:10 +0530, AnilKumar Ch wrote:
> Fix "received wrong sequence count. expected: x, got: y" errors
> reported by cansequence in a stress test (--verbose).

I tested this on a AM3505 with cansequence 4.0.6. For 500kbit, I see no
reordering anymore.

When testing at 1Mbit, it seems to be worse than before:
~ # ifconfig can0 down; canconfig can0 bitrate 1000000; ifconfig can0 up
can0 bitrate: 1000000, sample-point: 0.692
[ 1627.995574] ti_hecc ti_hecc: can0: setting CANBTC=0x3b
~ # cansequence -r -vv can0
interface = can0, family = 29, type = 3, proto = 1
received frame. sequence number: 0
received frame. sequence number: 1
received frame. sequence number: 2
received frame. sequence number: 3
received frame. sequence number: 4
received frame. sequence number: 5
received frame. sequence number: 6
received frame. sequence number: 7
received frame. sequence number: 8
received frame. sequence number: 9
received frame. sequence number: 10
received frame. sequence number: 11
received frame. sequence number: 12
received frame. sequence number: 13
received frame. sequence number: 14
received frame. sequence number: 15
received frame. sequence number: 16
received frame. sequence number: 17
received frame. sequence number: 18
received frame. sequence number: 19
received frame. sequence number: 20
received frame. sequence number: 21
received frame. sequence number: 21
received wrong sequence count. expected: 22, got: 21
received frame. sequence number: 21
received wrong sequence count. expected: 22, got: 21
received frame. sequence number: 22
received frame. sequence number: 22
received wrong sequence count. expected: 23, got: 22
received frame. sequence number: 23
received frame. sequence number: 23
received wrong sequence count. expected: 24, got: 23
received frame. sequence number: 24
received frame. sequence number: 24
received wrong sequence count. expected: 25, got: 24
received frame. sequence number: 24
received wrong sequence count. expected: 25, got: 24
received frame. sequence number: 25
received frame. sequence number: 25
received wrong sequence count. expected: 26, got: 25
received frame. sequence number: 26
received frame. sequence number: 27
received frame. sequence number: 27
received wrong sequence count. expected: 28, got: 27
received frame. sequence number: 28
received frame. sequence number: 28
received wrong sequence count. expected: 29, got: 28
received frame. sequence number: 28
received wrong sequence count. expected: 29, got: 28
received frame. sequence number: 29
...

Here we can see that after the initial frames, most frames are received
twice (some even thrice).

Regards,
Jan
-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] can: ti_hecc: fix rx wrong sequence issue
  2012-10-08  6:40 [PATCH] can: ti_hecc: fix rx wrong sequence issue AnilKumar Ch
  2012-10-08  9:12 ` Jan Lübbe
@ 2012-10-08  9:28 ` Wolfgang Grandegger
  2012-11-05 11:34 ` Marc Kleine-Budde
  2 siblings, 0 replies; 10+ messages in thread
From: Wolfgang Grandegger @ 2012-10-08  9:28 UTC (permalink / raw)
  To: AnilKumar Ch; +Cc: mkl, swarren, linux-can, anantgole, nsekhar

On 10/08/2012 08:40 AM, AnilKumar Ch wrote:
> Fix "received wrong sequence count. expected: x, got: y" errors
> reported by cansequence in a stress test (--verbose).
> 
> While processing the receive packets in mailboxes, upper mailboxes
> need to enable while processing 12th (RX_BUFFER mailbox) mailbox.
> This requires a status check to identify whether the receiving of
> packets in progress or not. If the mailboxes are enabled before the
> receive packet status is cleared then there is a possibility of
> receiving out of order packets.
> 
> With this patch mailboxes are enabled once the receive status is
> cleared.
> 
> Signed-off-by: AnilKumar Ch <anilkumar@ti.com>

> ---
>  drivers/net/can/ti_hecc.c |   19 ++++++++++++++++++-
>  1 files changed, 18 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/can/ti_hecc.c b/drivers/net/can/ti_hecc.c
> index 39293e9..2417a2d 100644
> --- a/drivers/net/can/ti_hecc.c
> +++ b/drivers/net/can/ti_hecc.c
> @@ -169,6 +169,7 @@ MODULE_VERSION(HECC_MODULE_VERSION);
>  #define HECC_CANES_SMA		BIT(5)	/* suspend mode ack */
>  #define HECC_CANES_CCE		BIT(4)	/* Change config enabled */
>  #define HECC_CANES_PDA		BIT(3)	/* Power down mode ack */
> +#define HECC_CANES_RM		BIT(1)	/* Receive Mode bit */
>  
>  #define HECC_CANBTC_SAM		BIT(7)	/* sample points */
>  
> @@ -194,6 +195,13 @@ MODULE_VERSION(HECC_MODULE_VERSION);
>  #define HECC_CANGIM_DEF_MASK	0x700	/* only busoff/warning/passive */
>  #define HECC_CANGIM_SIL		BIT(2)	/* system interrupts to int line 1 */
>  
> +/*
> + * Receive Mode bit reflects what the CAN protocol kernel (CPK) is
> + * actually doing regardless of mailbox configuration. CPK receive
> + * mode timeout. Tried from 1 - 5us and kept 10 as a safety value.
> + */
> +#define RM_TIMEOUT_US		10
> +
>  /* CAN Bittiming constants as per HECC specs */
>  static struct can_bittiming_const ti_hecc_bittiming_const = {
>  	.name = DRV_NAME,
> @@ -615,7 +623,7 @@ static int ti_hecc_rx_poll(struct napi_struct *napi, int quota)
>  	struct ti_hecc_priv *priv = netdev_priv(ndev);
>  	u32 num_pkts = 0;
>  	u32 mbx_mask;
> -	unsigned long pending_pkts, flags;
> +	unsigned long pending_pkts, flags, timeout;
>  
>  	if (!netif_running(ndev))
>  		return 0;
> @@ -633,6 +641,15 @@ static int ti_hecc_rx_poll(struct napi_struct *napi, int quota)
>  		--priv->rx_next;
>  		if (priv->rx_next == HECC_RX_BUFFER_MBOX) {
>  			/* enable high bank mailboxes */
> +			timeout = jiffies + usecs_to_jiffies(RM_TIMEOUT_US);
> +			while (hecc_get_bit(priv, HECC_CANES, HECC_CANES_RM)) {
> +				cpu_relax();
> +				if (time_after(jiffies, timeout)) {
> +					netdev_warn(ndev, "receiving pkt\n");
> +					break;
> +				}
> +			}
> +
>  			spin_lock_irqsave(&priv->mbx_lock, flags);
>  			mbx_mask = hecc_read(priv, HECC_CANME);
>  			mbx_mask |= HECC_RX_HIGH_MBOX_MASK;
> 

Hm, jiffies do have a 1/HZ resolution, normally 1/100 or 1/250 s. Then
time_after() would not trigger before the next jiffy. Or have I missed
something.

FYI: the program canfdtest [1] from the offical can-utils has been added
some time ago to allow detection of out-of-order messages.

Wolfgang.

[1] https://gitorious.org/linux-can/can-utils/blobs/master/canfdtest.c


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] can: ti_hecc: fix rx wrong sequence issue
  2012-10-08  9:12 ` Jan Lübbe
@ 2012-10-11 12:01   ` Jan Lübbe
  0 siblings, 0 replies; 10+ messages in thread
From: Jan Lübbe @ 2012-10-11 12:01 UTC (permalink / raw)
  To: AnilKumar Ch; +Cc: wg, mkl, swarren, linux-can, anantgole, nsekhar

Hi,

On Mon, 2012-10-08 at 11:12 +0200, Jan Lübbe wrote:
> On Mon, 2012-10-08 at 12:10 +0530, AnilKumar Ch wrote:
> > Fix "received wrong sequence count. expected: x, got: y" errors
> > reported by cansequence in a stress test (--verbose).
> 
> I tested this on a AM3505 with cansequence 4.0.6. For 500kbit, I see no
> reordering anymore.
> 
> When testing at 1Mbit, it seems to be worse than before:
*snip*
> Here we can see that after the initial frames, most frames are received
> twice (some even thrice).

Do you have some idea on the cause of this problem, so we could do some
more focused debugging?

Regards,
Jan
-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] can: ti_hecc: fix rx wrong sequence issue
  2012-10-08  6:40 [PATCH] can: ti_hecc: fix rx wrong sequence issue AnilKumar Ch
  2012-10-08  9:12 ` Jan Lübbe
  2012-10-08  9:28 ` Wolfgang Grandegger
@ 2012-11-05 11:34 ` Marc Kleine-Budde
  2012-11-05 13:47   ` AnilKumar, Chimata
  2012-11-05 18:15   ` Marc Kleine-Budde
  2 siblings, 2 replies; 10+ messages in thread
From: Marc Kleine-Budde @ 2012-11-05 11:34 UTC (permalink / raw)
  To: AnilKumar Ch; +Cc: wg, swarren, linux-can, anantgole, nsekhar

[-- Attachment #1: Type: text/plain, Size: 5521 bytes --]

Hello,

On 10/08/2012 08:40 AM, AnilKumar Ch wrote:
> Fix "received wrong sequence count. expected: x, got: y" errors
> reported by cansequence in a stress test (--verbose).
> 
> While processing the receive packets in mailboxes, upper mailboxes
> need to enable while processing 12th (RX_BUFFER mailbox) mailbox.
> This requires a status check to identify whether the receiving of
> packets in progress or not. If the mailboxes are enabled before the
> receive packet status is cleared then there is a possibility of
> receiving out of order packets.
> 
> With this patch mailboxes are enabled once the receive status is
> cleared.
> 
> Signed-off-by: AnilKumar Ch <anilkumar@ti.com>

AnilKumar Ch, can you add some TI people with access to the vhdl CAN
core design to Cc?

I'm debugging this issue for a week, fixed several other problems with
the driver and now and it seems there is a race condition in the hardware.

The algorithm in the driver relies on the ability to re-enable a bunch
of mailboxes with a single write command. In the ti_hecc hardware the
driver uses the CANME (mailbox enable) register to do so. Under certain
circumstances I've noticed that the CAN core receives the next message
into an undefined mailbox. According to the data sheet CAN messages are
always received into the highest free mailbox (with a matching filter).
But after enabling the mailboxes the next CAN frame doesn't go into the
highest mailbox.

This is a log from my test application. The ti_hecc receives 1 byte long
frames at 500 kbit/s. In the first data byte a number is increased with
every CAN frame (the "d=" column). The enabled mailboxes are in the "me"
column and the pending messages in "rmp".

> [   57.390747] ti_hecc ti_hecc: can0: i=39502 m=14 n=14 rmp=00004000 me=00007ff0 d= 78
> [   57.390777] ti_hecc ti_hecc: can0: i=39503 m=13 n=13 rmp=00002000 me=00003ff0 d= 79
> [   57.390777] ti_hecc ti_hecc: can0: i=39504 m=12 n=12 rmp=00001000 me=00001ff0 d= 80
> [   57.390808] ti_hecc ti_hecc: can0: i=39505 m=11 n=11 rmp=00000800 me=00000ff0 d= 81
> [   57.390838] ti_hecc ti_hecc: can0: i=39506 m=10 n=10 rmp=00000400 me=000007f0 d= 82
> [   57.390838] ti_hecc ti_hecc: can0: i=39507 m= 9 n= 9 rmp=00000200 me=000003f0 d= 83
> [   57.390869] ti_hecc ti_hecc: can0: i=39508 m= 8 n= 8 rmp=00000100 me=000001f0 d= 84

Here the driver enables the high priority mailboxes (0xfffff000) with a
single write. Actually the driver writes 0xfffff0f0, as there a still
four free low priority mailboxes.

> [   57.390899] ti_hecc ti_hecc: can0: i=39509 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
                                                                 ^
The next CAN frame should be received, either into 0x00000800 (if the
reception happens before the driver enables the high prio mailboxes) or
into 0x80000000 (if the reception takes place after the driver has
written the canme register).

> [   57.390930] ti_hecc ti_hecc: can0: i=39510 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
> [   57.390930] ti_hecc ti_hecc: can0: i=39511 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
> [   57.390960] ti_hecc ti_hecc: can0: i=39512 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
> [   57.390991] ti_hecc ti_hecc: can0: i=39513 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
> [   57.391021] ti_hecc ti_hecc: can0: i=39514 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
> [   57.391021] ti_hecc ti_hecc: can0: i=39515 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
> [   57.391052] ti_hecc ti_hecc: can0: i=39516 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
> [   57.391082] ti_hecc ti_hecc: can0: i=39517 m=31 n=31 rmp=80010000 me=fffff0f0 d= 86
                                                              ^
We see the next frame goes into 0x80000000 as expected. (There are so
many d=0 lines, because the driver is confused, as the next frame is not
expected in the 0x00010000 mailbox.)

There isn't any tx going on at the moment, the rx-path is the only
modifier of the canme register.

I thought maybe there is a problem in the hardware with the canme
register, so I've reworked the algorithm not to change the canme
register at all. canme stays 0xfffffff0 all the time (all rx mailboxes
active), but received frames are not acknowledged (I don't write 1s into
canrmp after message reception). The pending rx interrupts are masked
with canmim. To "enable" the mailboxes again, the driver clears bits in
the canrmp register with a single write and then unmasks the interrupt
again. This however results in _the_same_ buggy behaviour.

Then I had a closer look at AnilKumar Ch's patch, the canme register is
not changed if a the CAN core currently receives a CAN frame. I added
AnilKumar Ch's busy loop before modifying the canme register, and it
works now. So I conclude that the "pick next free mailbox" algorithm in
the CAN core is racy. You are not allowed to do something that changes a
mailbox's status from disabled/full to enabled/free when a CAN frame is
received. As you cannot control CAN frame reception, this is a nice
hardware race condition. Please add this to the manual and errata sheets.

I'll post a series which fixes the problems I've found soon.

regards,
Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCH] can: ti_hecc: fix rx wrong sequence issue
  2012-11-05 11:34 ` Marc Kleine-Budde
@ 2012-11-05 13:47   ` AnilKumar, Chimata
  2012-11-05 13:56     ` Marc Kleine-Budde
  2012-11-05 18:15   ` Marc Kleine-Budde
  1 sibling, 1 reply; 10+ messages in thread
From: AnilKumar, Chimata @ 2012-11-05 13:47 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: wg, swarren, linux-can, Gole, Anant, Nori, Sekhar

On Mon, Nov 05, 2012 at 17:04:55, Marc Kleine-Budde wrote:
> Hello,

Hi Marc,

Thanks for looking into this issue, I had spent some considerable
time for debugging this issue, I have seen better sequence with
below changes at 500KBPS but issue still there. Because of some
other higher priority task I left there.

Some of my observations in the driver:-
1. One is buffer mailbox (12th mailbox we are missing)

@@ -627,18 +617,34 @@ static int ti_hecc_rx_poll(struct napi_struct *napi, int quota)
 			if (ti_hecc_rx_pkt(priv, priv->rx_next) < 0)
 				return num_pkts;
 			++num_pkts;
-		} else if (priv->rx_next > HECC_RX_BUFFER_MBOX) {
+		} else if (priv->rx_next >= HECC_RX_BUFFER_MBOX) {
 			break; /* pkt not received yet */
 		}
 		--priv->rx_next;
-		if (priv->rx_next == HECC_RX_BUFFER_MBOX) {
-			/* enable high bank mailboxes */
-			spin_lock_irqsave(&priv->mbx_lock, flags);
+		if (priv->rx_next == (HECC_RX_BUFFER_MBOX - 1)) {

2. Lower mail box configuration is not happening properly. I have
changed this.
3. Removed spinlocks (Just for testing)

> 
> On 10/08/2012 08:40 AM, AnilKumar Ch wrote:
> > Fix "received wrong sequence count. expected: x, got: y" errors
> > reported by cansequence in a stress test (--verbose).
> > 
> > While processing the receive packets in mailboxes, upper mailboxes
> > need to enable while processing 12th (RX_BUFFER mailbox) mailbox.
> > This requires a status check to identify whether the receiving of
> > packets in progress or not. If the mailboxes are enabled before the
> > receive packet status is cleared then there is a possibility of
> > receiving out of order packets.
> > 
> > With this patch mailboxes are enabled once the receive status is
> > cleared.
> > 
> > Signed-off-by: AnilKumar Ch <anilkumar@ti.com>
> 
> AnilKumar Ch, can you add some TI people with access to the vhdl CAN
> core design to Cc?

I am contacting the person internally, I will add immediately once I
got the right person. 

> 
> I'm debugging this issue for a week, fixed several other problems with
> the driver and now and it seems there is a race condition in the hardware.
> 
> The algorithm in the driver relies on the ability to re-enable a bunch
> of mailboxes with a single write command. In the ti_hecc hardware the
> driver uses the CANME (mailbox enable) register to do so. Under certain
> circumstances I've noticed that the CAN core receives the next message
> into an undefined mailbox. According to the data sheet CAN messages are
> always received into the highest free mailbox (with a matching filter).
> But after enabling the mailboxes the next CAN frame doesn't go into the
> highest mailbox.
> 
> This is a log from my test application. The ti_hecc receives 1 byte long
> frames at 500 kbit/s. In the first data byte a number is increased with
> every CAN frame (the "d=" column). The enabled mailboxes are in the "me"
> column and the pending messages in "rmp".
> 
> > [   57.390747] ti_hecc ti_hecc: can0: i=39502 m=14 n=14 rmp=00004000 me=00007ff0 d= 78
> > [   57.390777] ti_hecc ti_hecc: can0: i=39503 m=13 n=13 rmp=00002000 me=00003ff0 d= 79
> > [   57.390777] ti_hecc ti_hecc: can0: i=39504 m=12 n=12 rmp=00001000 me=00001ff0 d= 80
> > [   57.390808] ti_hecc ti_hecc: can0: i=39505 m=11 n=11 rmp=00000800 me=00000ff0 d= 81
> > [   57.390838] ti_hecc ti_hecc: can0: i=39506 m=10 n=10 rmp=00000400 me=000007f0 d= 82
> > [   57.390838] ti_hecc ti_hecc: can0: i=39507 m= 9 n= 9 rmp=00000200 me=000003f0 d= 83
> > [   57.390869] ti_hecc ti_hecc: can0: i=39508 m= 8 n= 8 rmp=00000100 me=000001f0 d= 84
> 
> Here the driver enables the high priority mailboxes (0xfffff000) with a
> single write. Actually the driver writes 0xfffff0f0, as there a still
> four free low priority mailboxes.
> 
> > [   57.390899] ti_hecc ti_hecc: can0: i=39509 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
>                                                                  ^
> The next CAN frame should be received, either into 0x00000800 (if the
> reception happens before the driver enables the high prio mailboxes) or
> into 0x80000000 (if the reception takes place after the driver has
> written the canme register).
> 
> > [   57.390930] ti_hecc ti_hecc: can0: i=39510 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
> > [   57.390930] ti_hecc ti_hecc: can0: i=39511 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
> > [   57.390960] ti_hecc ti_hecc: can0: i=39512 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
> > [   57.390991] ti_hecc ti_hecc: can0: i=39513 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
> > [   57.391021] ti_hecc ti_hecc: can0: i=39514 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
> > [   57.391021] ti_hecc ti_hecc: can0: i=39515 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
> > [   57.391052] ti_hecc ti_hecc: can0: i=39516 m= 0 n= 7 rmp=00010000 me=fffff0f0 d=  0
> > [   57.391082] ti_hecc ti_hecc: can0: i=39517 m=31 n=31 rmp=80010000 me=fffff0f0 d= 86
>                                                               ^
> We see the next frame goes into 0x80000000 as expected. (There are so
> many d=0 lines, because the driver is confused, as the next frame is not
> expected in the 0x00010000 mailbox.)
> 
> There isn't any tx going on at the moment, the rx-path is the only
> modifier of the canme register.
> 
> I thought maybe there is a problem in the hardware with the canme
> register, so I've reworked the algorithm not to change the canme
> register at all. canme stays 0xfffffff0 all the time (all rx mailboxes
> active), but received frames are not acknowledged (I don't write 1s into
> canrmp after message reception). The pending rx interrupts are masked
> with canmim. To "enable" the mailboxes again, the driver clears bits in
> the canrmp register with a single write and then unmasks the interrupt
> again. This however results in _the_same_ buggy behaviour.
> 
> Then I had a closer look at AnilKumar Ch's patch, the canme register is
> not changed if a the CAN core currently receives a CAN frame. I added
> AnilKumar Ch's busy loop before modifying the canme register, and it
> works now. So I conclude that the "pick next free mailbox" algorithm in
> the CAN core is racy. You are not allowed to do something that changes a
> mailbox's status from disabled/full to enabled/free when a CAN frame is
> received. As you cannot control CAN frame reception, this is a nice
> hardware race condition. Please add this to the manual and errata sheets.
> 
> I'll post a series which fixes the problems I've found soon.

I can test the series if your patches are ready.

Thanks
AnilKumar

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] can: ti_hecc: fix rx wrong sequence issue
  2012-11-05 13:47   ` AnilKumar, Chimata
@ 2012-11-05 13:56     ` Marc Kleine-Budde
  0 siblings, 0 replies; 10+ messages in thread
From: Marc Kleine-Budde @ 2012-11-05 13:56 UTC (permalink / raw)
  To: AnilKumar, Chimata; +Cc: wg, swarren, linux-can, Gole, Anant, Nori, Sekhar

[-- Attachment #1: Type: text/plain, Size: 2022 bytes --]

On 11/05/2012 02:47 PM, AnilKumar, Chimata wrote:
> On Mon, Nov 05, 2012 at 17:04:55, Marc Kleine-Budde wrote:
>> Hello,
> 
> Hi Marc,
> 
> Thanks for looking into this issue, I had spent some considerable
> time for debugging this issue, I have seen better sequence with
> below changes at 500KBPS but issue still there. Because of some
> other higher priority task I left there.

I'm testing with cansequence at 500 kbit/s at the moment. With my
current patch, I'm not seeing any out-of-order problem. With "ping -f"
and serial load, the driver goes into a rx-overflow. But this is
probably due to too high load.

The hardware I'm using has an issue with 1Mbit/s, thus I cannot test this.

> Some of my observations in the driver:-
> 1. One is buffer mailbox (12th mailbox we are missing)
> 
> @@ -627,18 +617,34 @@ static int ti_hecc_rx_poll(struct napi_struct *napi, int quota)
>  			if (ti_hecc_rx_pkt(priv, priv->rx_next) < 0)
>  				return num_pkts;
>  			++num_pkts;
> -		} else if (priv->rx_next > HECC_RX_BUFFER_MBOX) {
> +		} else if (priv->rx_next >= HECC_RX_BUFFER_MBOX) {
>  			break; /* pkt not received yet */
>  		}
>  		--priv->rx_next;
> -		if (priv->rx_next == HECC_RX_BUFFER_MBOX) {
> -			/* enable high bank mailboxes */
> -			spin_lock_irqsave(&priv->mbx_lock, flags);
> +		if (priv->rx_next == (HECC_RX_BUFFER_MBOX - 1)) {
> 
> 2. Lower mail box configuration is not happening properly. I have
> changed this.
> 3. Removed spinlocks (Just for testing)

I've reworked the poll-loop completely.

[...]

> I can test the series if your patches are ready.

I'll send you a preliminary version to test soon, it wouldn't be finial,
but you can test, also include 1Mbit/s.

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] can: ti_hecc: fix rx wrong sequence issue
  2012-11-05 11:34 ` Marc Kleine-Budde
  2012-11-05 13:47   ` AnilKumar, Chimata
@ 2012-11-05 18:15   ` Marc Kleine-Budde
  2012-11-05 18:40     ` Marc Kleine-Budde
  1 sibling, 1 reply; 10+ messages in thread
From: Marc Kleine-Budde @ 2012-11-05 18:15 UTC (permalink / raw)
  To: AnilKumar Ch; +Cc: wg, swarren, linux-can, anantgole, nsekhar

[-- Attachment #1: Type: text/plain, Size: 2645 bytes --]

On 11/05/2012 12:34 PM, Marc Kleine-Budde wrote:
> Hello,
> 
> On 10/08/2012 08:40 AM, AnilKumar Ch wrote:
>> Fix "received wrong sequence count. expected: x, got: y" errors
>> reported by cansequence in a stress test (--verbose).
>>
>> While processing the receive packets in mailboxes, upper mailboxes
>> need to enable while processing 12th (RX_BUFFER mailbox) mailbox.
>> This requires a status check to identify whether the receiving of
>> packets in progress or not. If the mailboxes are enabled before the
>> receive packet status is cleared then there is a possibility of
>> receiving out of order packets.
>>
>> With this patch mailboxes are enabled once the receive status is
>> cleared.
>>
[..]

> Then I had a closer look at AnilKumar Ch's patch, the canme register is
> not changed if a the CAN core currently receives a CAN frame. I added
> AnilKumar Ch's busy loop before modifying the canme register, and it
> works now. So I conclude that the "pick next free mailbox" algorithm in
> the CAN core is racy. You are not allowed to do something that changes a
> mailbox's status from disabled/full to enabled/free when a CAN frame is
> received. As you cannot control CAN frame reception, this is a nice
> hardware race condition. Please add this to the manual and errata sheets.

I really want to know where the race window starts and ends, in order to
specify proper timeouts.

This is a hunk from AnilKumar Ch's patch:

>  			/* enable high bank mailboxes */
> +			timeout = jiffies + usecs_to_jiffies(RM_TIMEOUT_US);
> +			while (hecc_get_bit(priv, HECC_CANES, HECC_CANES_RM)) {
> +				cpu_relax();
> +				if (time_after(jiffies, timeout)) {
> +					netdev_warn(ndev, "receiving pkt\n");
> +					break;
> +				}
> +			}
> +

You first wait for reception to finish, but then might block on this
spin_lock. This introduces a software race condition.

>  			spin_lock_irqsave(&priv->mbx_lock, flags);
>  			mbx_mask = hecc_read(priv, HECC_CANME);
>  			mbx_mask |= HECC_RX_HIGH_MBOX_MASK;

We have to use the spin_lock because the CANME register is shared
between the rx and tx path. I think it's better to make use of
CANRPM+CANMIM, because I think the critical path is the modification of
the CANRPM register, which is not shared, so it can be written without
the need for the spin_lock.

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] can: ti_hecc: fix rx wrong sequence issue
  2012-11-05 18:15   ` Marc Kleine-Budde
@ 2012-11-05 18:40     ` Marc Kleine-Budde
  2014-09-08 10:15       ` Jan Sondhauss
  0 siblings, 1 reply; 10+ messages in thread
From: Marc Kleine-Budde @ 2012-11-05 18:40 UTC (permalink / raw)
  To: AnilKumar Ch; +Cc: wg, swarren, linux-can, anantgole, nsekhar

[-- Attachment #1: Type: text/plain, Size: 3396 bytes --]

On 11/05/2012 07:15 PM, Marc Kleine-Budde wrote:
> On 11/05/2012 12:34 PM, Marc Kleine-Budde wrote:
>> Hello,
>>
>> On 10/08/2012 08:40 AM, AnilKumar Ch wrote:
>>> Fix "received wrong sequence count. expected: x, got: y" errors
>>> reported by cansequence in a stress test (--verbose).
>>>
>>> While processing the receive packets in mailboxes, upper mailboxes
>>> need to enable while processing 12th (RX_BUFFER mailbox) mailbox.
>>> This requires a status check to identify whether the receiving of
>>> packets in progress or not. If the mailboxes are enabled before the
>>> receive packet status is cleared then there is a possibility of
>>> receiving out of order packets.
>>>
>>> With this patch mailboxes are enabled once the receive status is
>>> cleared.
>>>
> [..]
> 
>> Then I had a closer look at AnilKumar Ch's patch, the canme register is
>> not changed if a the CAN core currently receives a CAN frame. I added
>> AnilKumar Ch's busy loop before modifying the canme register, and it
>> works now. So I conclude that the "pick next free mailbox" algorithm in
>> the CAN core is racy. You are not allowed to do something that changes a
>> mailbox's status from disabled/full to enabled/free when a CAN frame is
>> received. As you cannot control CAN frame reception, this is a nice
>> hardware race condition. Please add this to the manual and errata sheets.
> 
> I really want to know where the race window starts and ends, in order to
> specify proper timeouts.
> 
> This is a hunk from AnilKumar Ch's patch:
> 
>>  			/* enable high bank mailboxes */
>> +			timeout = jiffies + usecs_to_jiffies(RM_TIMEOUT_US);
>> +			while (hecc_get_bit(priv, HECC_CANES, HECC_CANES_RM)) {
>> +				cpu_relax();
>> +				if (time_after(jiffies, timeout)) {
>> +					netdev_warn(ndev, "receiving pkt\n");
>> +					break;
>> +				}
>> +			}
>> +
> 
> You first wait for reception to finish, but then might block on this
> spin_lock. This introduces a software race condition.
> 
>>  			spin_lock_irqsave(&priv->mbx_lock, flags);
>>  			mbx_mask = hecc_read(priv, HECC_CANME);
>>  			mbx_mask |= HECC_RX_HIGH_MBOX_MASK;
> 
> We have to use the spin_lock because the CANME register is shared
> between the rx and tx path. I think it's better to make use of
> CANRPM+CANMIM, because I think the critical path is the modification of
> the CANRPM register, which is not shared, so it can be written without
> the need for the spin_lock.

I've implemented both solutions and both are working. I've instrumented
how long the CAN driver does stay in the while loop. I've loaded the CAN
bus with:

# 1 byte frame
cansequence -p

# 8 byte frame, all 1s
cansend can0 -i 0xffffffff 0xff 0xff 0xff 0xff 0xff 0xaff 0xff 0xff
--loop=-1 -e

Both solution show about the same number of loops, depending on the bit
rate of the bus:

500 kbit/s
hecc_set_bit_canme: looped 867 times

50 kbit/s
hecc_set_bit_canme: looped 4807 times

AnilKumar, can you get in contact with the hardware people to figure out
a better solution than to cycle 4k times in a loop?

Marc
-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] can: ti_hecc: fix rx wrong sequence issue
  2012-11-05 18:40     ` Marc Kleine-Budde
@ 2014-09-08 10:15       ` Jan Sondhauss
  0 siblings, 0 replies; 10+ messages in thread
From: Jan Sondhauss @ 2014-09-08 10:15 UTC (permalink / raw)
  To: linux-can

Marc Kleine-Budde <mkl <at> pengutronix.de> writes:

> 
> > [..]
> > 
> >> Then I had a closer look at AnilKumar Ch's patch, the canme register is
> >> not changed if a the CAN core currently receives a CAN frame. I added
> >> AnilKumar Ch's busy loop before modifying the canme register, and it
> >> works now. So I conclude that the "pick next free mailbox" algorithm in
> >> the CAN core is racy. You are not allowed to do something that changes a
> >> mailbox's status from disabled/full to enabled/free when a CAN frame is
> >> received. As you cannot control CAN frame reception, this is a nice
> >> hardware race condition. Please add this to the manual and errata sheets.
> > 
> > I really want to know where the race window starts and ends, in order to
> > specify proper timeouts.
> > [..]
> > 
> > We have to use the spin_lock because the CANME register is shared
> > between the rx and tx path. I think it's better to make use of
> > CANRPM+CANMIM, because I think the critical path is the modification of
> > the CANRPM register, which is not shared, so it can be written without
> > the need for the spin_lock.
> 
> I've implemented both solutions and both are working. I've instrumented
> how long the CAN driver does stay in the while loop. I've loaded the CAN
> bus with:
> [..]
> 
> AnilKumar, can you get in contact with the hardware people to figure out
> a better solution than to cycle 4k times in a loop?
> 
> Marc

Hello Marc, hello AnilKumar,

is there a proper solution to this problem available yet? I haven't found one
on linux-can repository, but maybe I've been looking at the wrong place.
We are running a "custom" 3.6.11 kernel with PREEMPT_RT_FULL on a Ti AM3517.
Also we applied AnilKumars busy-wait-for-rx-idle-loop.

We have a problem in one test-case with two can-devices (including our device
under test) where we are exchanging messages in a request-response matter:
The am3517 sends 16 requests in a row and then expects 16 response messages.

Under some circumstances the driver stays in the napi's-poll list forever, 
because there is a message available in a lower "main"-mailbox but 
rx_next points to the first mailbox: 

[24817.008123] ti_hecc: priv->rx_next 0000001f
[24817.012769] ti_hecc: HECC_CANME fffffff0
[24817.017015] ti_hecc: HECC_CANMD fffffff0
[24817.021261] ti_hecc: HECC_CANRMP 40000000

It looks to me that the access to the CANME register in the ti_hecc_rx_pkt()
function is also racy:

	spin_lock_irqsave(&priv->mbx_lock, flags);
	hecc_clear_bit(priv, HECC_CANME, mbx_mask);
	hecc_write(priv, HECC_CANRMP, mbx_mask);
	/* enable mailbox only if it is part of rx buffer mailboxes */
	if (mbxno < HECC_RX_BUFFER_MBOX)
		hecc_set_bit(priv, HECC_CANME, mbx_mask);
	spin_unlock_irqrestore(&priv->mbx_lock, flags);

That makes me wonder what was the purpose of the else-branch in the 
ti_hecc_rx_poll() function (as it prevents us from receiving messages)?:

		} else if (priv->rx_next >= HECC_RX_BUFFER_MBOX) {            
			break; /* pkt not received yet */
		}

Our problem is somewhat off-topic, but relates to the racy CANME access so I
hope I posted at the right place. :)

Thanks and best regards,
Jan

 












^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-09-08 10:25 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-08  6:40 [PATCH] can: ti_hecc: fix rx wrong sequence issue AnilKumar Ch
2012-10-08  9:12 ` Jan Lübbe
2012-10-11 12:01   ` Jan Lübbe
2012-10-08  9:28 ` Wolfgang Grandegger
2012-11-05 11:34 ` Marc Kleine-Budde
2012-11-05 13:47   ` AnilKumar, Chimata
2012-11-05 13:56     ` Marc Kleine-Budde
2012-11-05 18:15   ` Marc Kleine-Budde
2012-11-05 18:40     ` Marc Kleine-Budde
2014-09-08 10:15       ` Jan Sondhauss

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.