All of lore.kernel.org
 help / color / mirror / Atom feed
* can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
@ 2015-07-08 14:38 Torsten Lang
  2015-07-09  6:33 ` Holger Schurig
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Torsten Lang @ 2015-07-08 14:38 UTC (permalink / raw)
  To: linux-can; +Cc: Marc Kleine-Budde

[-- Attachment #1: Type: text/plain, Size: 3864 bytes --]

This patch is for Kernel 4.1-rc8 (and should AFAIK still be valid also for the current 4.1/4.2 versions). It is based on the rework done by David Jander which disables the only six messages deep hardware FIFO of the FlexCAN core and instead uses all available mailboxes for reception.

During my stress tests with the original driver I detected that on the i.MX6 based boards we use in one of our projects CAN messages were lost because of RX FIFO overruns. During my research I got in contact with Oliver Hartkopp who passed me the link to David Jander's reworked FlexCAN driver, but in my tests with this one I had to find out that David's code sometimes feeds messages into the network stack in wrong order (especially under high CPU load). Due to the concurrent nature of the receive process (the matching process of the FlexCAN hardware on one side and the accesses of the kernel driver on the other side) I believe that actually there is no way to guarantee the order in which mailboxes are used by the matching process, so we cannot simply service them in ascending order.

So for my rework I removed the complex inactivation/reactivation code of David Jander's driver and implemented a simple queue where messages are sorted in by their timestamps. The mailboxes now are serviced as recommended by Freescale's i.MX6 user's manual, and the servicing as such has been moved completely over to the NAPI poll function. You will find some more comments in the patch itself.

I have made stress tests under full CPU load on the i.MX6 Solo and DualLite (KaRo boards TX6S and TX6U) with a backport of this driver to Kernel 3.16.0. The main reason for testing with kernel 3.16.0 is that the BSP for these modules still is based on this kernel and I couldn't get the current kernel to run flawlessly on this hardware together with my application (see below).

I made my tests using Oliver Hartkopp's isotp kernel module together with the isotp to ethernet tunnel transporting files of ~16MB in size between a desktop PC with a PCAN USB Pro adapter and the i.MX6 target and used an extended version of Oliver's isotpdump where I added some checks for transport protocol errors. With my reworked driver these tests passed flawlessly.

I think it's time now to publish my code for review and discussion.

I have two items left that could be checked by Marc Kleine-Budde and/or David Jander:

My reworked interrupt handler returns IRQ_HANDLED *only* if actually an interrupt was signalled (and not masked) by the corresponding registers which I think is the correct way for drivers using shared interrupts. With this code I receive a "spurious" interrupt once after closing and re-opening a socket again (see disabled code fragment at the end of the interrupt handler - this will output a message once after opening the CAN socket with no interrupt flag set).

Furthermore, I think there is still some potential to simplify the receive/error interrupt checks in the interrupt handler.

With best regards,
Torsten

P. S. As mentioned I have some trouble getting the kernel to work on the hardware I use. The main remaining problem is that my application no longer receives the CAN messages correctly with kernel 4.1 - neither with the original FlexCAN driver nor with mine. As long as my Qt application uses the CAN bus alone it only receives single messages about every 30..60s (actually the test messages are transmitted every 250ms). As soon as I start a candump in parallel my application also receives the test messages. I currently have no explanation for this behaviour.

-- 
Uwe Schneider GmbH
Heegwaldring 12
63694 Limeshain
Tel.: +49 / 6047 / 98 65 21
FAX.: +49 / 6047 / 98 73 96

Uwe Schneider Gesellschaft für innovative
Produkte der Computertechnik mbH
Sitz der Gesellschaft: Limeshain
Registergericht: Friedberg HRB 5895
Geschäftsführer: Uwe Schneider 


[-- Attachment #2: flexcan_kernel_4.1-rc8.patch --]
[-- Type: text/plain, Size: 30878 bytes --]

--- linux-4.1-rc8/drivers/net/can/flexcan.c.orig	2015-07-06 10:35:18.173014623 +0200
+++ linux-4.1-rc8/drivers/net/can/flexcan.c	2015-07-08 14:29:05.811213827 +0200
@@ -4,6 +4,8 @@
  * Copyright (c) 2005-2006 Varma Electronics Oy
  * Copyright (c) 2009 Sascha Hauer, Pengutronix
  * Copyright (c) 2010 Marc Kleine-Budde, Pengutronix
+ * Copyright (c) 2014 David Jander, Protonic Holland
+ * Copyright (c) 2015 Torsten Lang, Uwe Schneider GmbH
  *
  * Based on code originally by Andrey Volkov <avolkov@varma-el.com>
  *
@@ -17,6 +19,37 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  * GNU General Public License for more details.
  *
+ * Some remarks about the new queue implementation:
+ *
+ * The FlexCAN features a hardware queue only 6 entries deep. As the
+ * Kernel produces severe latencies until the CAN interrupt is handled
+ * the original code produced queue overruns e. g. on the Freescale i.MX6.
+ *
+ * The next generation code by David Jander tries to avoid these
+ * overruns by using all free MB objects as receive buffer but during
+ * tests CAN frames occasionally were fed into the network stack in the
+ * wrong order, especially under high CPU load.
+ *
+ * This implementation services the receive MBs as described in the
+ * Freescale i.MX6 manual so that serviced MBs may be immediately
+ * allocated again by the matching process. Due to the concurrent work of
+ * the matching process and the servicing code we cannot guarantee the order
+ * in which MBs are filled by the matching process. So reading out the MBs
+ * where the corresponding interrupt flag is set is done in one loop and the
+ * MBs will be *sorted* into the queue by their timestamps with a search
+ * barrier set to the initial tail of the queue.
+ *
+ * At high CAN bitrates the timer may wrap every ~65ms, so we have a
+ * window of ~32ms until timestamp comparisons become ambiguous. This is why
+ * the new code also sets the search barrier to the last old entry of the
+ * queue before feeding the new MBs. The insertion loop will never search
+ * beyond this marker. If a continuity break is found at the current insert
+ * position a linear search forward or backward is done until the correct
+ * insert position is found.
+ *
+ * During tests the timestamp continuity didn't break too often (~50
+ * per 16MB of transferred ISOTP data), mostly needing a search over 1..3
+ * positions, so the search algorithm doesn't produce too much overhead...
  */
 
 #include <linux/netdevice.h>
@@ -31,7 +64,6 @@
 #include <linux/interrupt.h>
 #include <linux/io.h>
 #include <linux/kernel.h>
-#include <linux/list.h>
 #include <linux/module.h>
 #include <linux/of.h>
 #include <linux/of_device.h>
@@ -40,8 +72,8 @@
 
 #define DRV_NAME			"flexcan"
 
-/* 8 for RX fifo and 2 error handling */
-#define FLEXCAN_NAPI_WEIGHT		(8 + 2)
+/* 63 MB's plus status plus some possibly pending MBs */
+#define FLEXCAN_NAPI_WEIGHT		(70)
 
 /* FLEXCAN module configuration register (CANMCR) bits */
 #define FLEXCAN_MCR_MDIS		BIT(31)
@@ -93,13 +125,13 @@
 	(FLEXCAN_CTRL_ERR_BUS | FLEXCAN_CTRL_ERR_STATE)
 
 /* FLEXCAN control register 2 (CTRL2) bits */
-#define FLEXCAN_CRL2_ECRWRE		BIT(29)
-#define FLEXCAN_CRL2_WRMFRZ		BIT(28)
-#define FLEXCAN_CRL2_RFFN(x)		(((x) & 0x0f) << 24)
-#define FLEXCAN_CRL2_TASD(x)		(((x) & 0x1f) << 19)
-#define FLEXCAN_CRL2_MRP		BIT(18)
-#define FLEXCAN_CRL2_RRS		BIT(17)
-#define FLEXCAN_CRL2_EACEN		BIT(16)
+#define FLEXCAN_CTRL2_ECRWRE		BIT(29)
+#define FLEXCAN_CTRL2_WRMFRZ		BIT(28)
+#define FLEXCAN_CTRL2_RFFN(x)		(((x) & 0x0f) << 24)
+#define FLEXCAN_CTRL2_TASD(x)		(((x) & 0x1f) << 19)
+#define FLEXCAN_CTRL2_MRP		BIT(18)
+#define FLEXCAN_CTRL2_RRS		BIT(17)
+#define FLEXCAN_CTRL2_EACEN		BIT(16)
 
 /* FLEXCAN memory error control register (MECR) bits */
 #define FLEXCAN_MECR_ECRWRDIS		BIT(31)
@@ -113,6 +145,9 @@
 #define FLEXCAN_MECR_ECCDIS		BIT(8)
 #define FLEXCAN_MECR_NCEFAFRZ		BIT(7)
 
+/* FLEXCAN control register 2 (CTRL2) bits */
+#define FLEXCAN_CTRL2_EACEN		BIT(16)
+
 /* FLEXCAN error and status register (ESR) bits */
 #define FLEXCAN_ESR_TWRN_INT		BIT(17)
 #define FLEXCAN_ESR_RWRN_INT		BIT(16)
@@ -147,23 +182,21 @@
 
 /* FLEXCAN interrupt flag register (IFLAG) bits */
 /* Errata ERR005829 step7: Reserve first valid MB */
-#define FLEXCAN_TX_BUF_RESERVED		8
-#define FLEXCAN_TX_BUF_ID		9
+#define FLEXCAN_TX_BUF_RESERVED		0
+#define FLEXCAN_TX_BUF_ID		1
 #define FLEXCAN_IFLAG_BUF(x)		BIT(x)
-#define FLEXCAN_IFLAG_RX_FIFO_OVERFLOW	BIT(7)
-#define FLEXCAN_IFLAG_RX_FIFO_WARN	BIT(6)
-#define FLEXCAN_IFLAG_RX_FIFO_AVAILABLE	BIT(5)
-#define FLEXCAN_IFLAG_DEFAULT \
-	(FLEXCAN_IFLAG_RX_FIFO_OVERFLOW | FLEXCAN_IFLAG_RX_FIFO_AVAILABLE | \
-	 FLEXCAN_IFLAG_BUF(FLEXCAN_TX_BUF_ID))
+#define FLEXCAN_IFLAG1_DEFAULT 		(0xfffffffe)
+#define FLEXCAN_IFLAG2_DEFAULT 		(0xffffffff)
 
 /* FLEXCAN message buffers */
-#define FLEXCAN_MB_CNT_CODE(x)		(((x) & 0xf) << 24)
+#define FLEXCAN_MB_CODE_MASK		(0xf << 24)
+
 #define FLEXCAN_MB_CODE_RX_INACTIVE	(0x0 << 24)
 #define FLEXCAN_MB_CODE_RX_EMPTY	(0x4 << 24)
 #define FLEXCAN_MB_CODE_RX_FULL		(0x2 << 24)
-#define FLEXCAN_MB_CODE_RX_OVERRRUN	(0x6 << 24)
+#define FLEXCAN_MB_CODE_RX_OVERRUN	(0x6 << 24)
 #define FLEXCAN_MB_CODE_RX_RANSWER	(0xa << 24)
+#define FLEXCAN_MB_CODE_RX_BUSY		(0x1 << 24)
 
 #define FLEXCAN_MB_CODE_TX_INACTIVE	(0x8 << 24)
 #define FLEXCAN_MB_CODE_TX_ABORT	(0x9 << 24)
@@ -176,9 +209,7 @@
 #define FLEXCAN_MB_CNT_LENGTH(x)	(((x) & 0xf) << 16)
 #define FLEXCAN_MB_CNT_TIMESTAMP(x)	((x) & 0xffff)
 
-#define FLEXCAN_MB_CODE_MASK		(0xf0ffffff)
-
-#define FLEXCAN_TIMEOUT_US             (50)
+#define FLEXCAN_TIMEOUT_US		(50)
 
 /*
  * FLEXCAN hardware feature flags
@@ -206,6 +237,27 @@
 	u32 data[2];
 };
 
+/* End marker for head/tail/links */
+#define FLEXCAN_MB_QUEUE_SIZE		62
+#define FLEXCAN_MB_QUEUE_ROOT		(FLEXCAN_MB_QUEUE_SIZE)
+
+/* Structure for sorted MB shadow
+ *
+ * Note: The last entries in the pre/next members contains the used
+ *       MB list root. This way several extra checks can be avoided.
+ */
+struct flexcan_mb_queue {
+	u8 free_head;                     /* head of free entries list      */
+	u8 free_num;                      /* number of entries in free list */
+	u8 search_back_barrier;           /* no backward search beyond this */
+	u8 start_index;                   /* start index for next insertion */
+	u8 prev[FLEXCAN_MB_QUEUE_SIZE+1]; /* previous links                 */
+	                                  /* last entry contains list tail  */
+	u8 next[FLEXCAN_MB_QUEUE_SIZE+1]; /* next links                     */
+	                                  /* last entry contains list head  */
+	struct flexcan_mb mb[FLEXCAN_MB_QUEUE_SIZE]; /* list contents       */
+};
+
 /* Structure of the hardware registers */
 struct flexcan_regs {
 	u32 mcr;		/* 0x00 */
@@ -221,7 +273,7 @@
 	u32 imask1;		/* 0x28 */
 	u32 iflag2;		/* 0x2c */
 	u32 iflag1;		/* 0x30 */
-	u32 crl2;		/* 0x34 */
+	u32 ctrl2;		/* 0x34 */
 	u32 esr2;		/* 0x38 */
 	u32 imeur;		/* 0x3c */
 	u32 lrfr;		/* 0x40 */
@@ -230,7 +282,19 @@
 	u32 rxfir;		/* 0x4c */
 	u32 _reserved3[12];	/* 0x50 */
 	struct flexcan_mb cantxfg[64];	/* 0x80 */
-	u32 _reserved4[408];
+	/* FIFO-mode:
+	 *			MB
+	 * 0x080...0x08f	0	RX message buffer
+	 * 0x090...0x0df	1-5	reserverd
+	 * 0x0e0...0x0ff	6-7	8 entry ID table
+	 *				(mx25, mx28, mx35, mx53)
+	 * 0x0e0...0x2df	6-7..37	8..128 entry ID table
+	 *			  	size conf'ed via ctrl2::RFFN
+	 *				(mx6, vf610)
+	 */
+	u32 _reserved4[256];	/* 0x480 */
+	u32 rximr[64];		/* 0x880 */
+	u32 _reserved5[88];	/* 0x980 */
 	u32 mecr;		/* 0xae0 */
 	u32 erriar;		/* 0xae4 */
 	u32 erridpr;		/* 0xae8 */
@@ -247,6 +311,7 @@
 
 struct flexcan_priv {
 	struct can_priv can;
+	struct net_device *dev;
 	struct napi_struct napi;
 
 	void __iomem *base;
@@ -258,6 +323,7 @@
 	struct flexcan_platform_data *pdata;
 	const struct flexcan_devtype_data *devtype_data;
 	struct regulator *reg_xceiver;
+	struct flexcan_mb_queue queue;
 };
 
 static struct flexcan_devtype_data fsl_p1010_devtype_data = {
@@ -283,6 +349,179 @@
 	.brp_inc = 1,
 };
 
+/* Queue helpers */
+
+/* Check index for start/end of list */
+static inline int __flexcan_queue_is_end(const struct flexcan_mb_queue *queue,
+                                         u8 index)
+{
+	return index >= FLEXCAN_MB_QUEUE_ROOT;
+}
+
+/* Check index for barrier */
+static inline int __flexcan_queue_is_barrier(const struct flexcan_mb_queue *queue,
+                                             u8 index)
+{
+	return index == queue->search_back_barrier;
+}
+
+/* Return used MB queue head */
+static inline u8 __flexcan_queue_head(const struct flexcan_mb_queue *queue)
+{
+	return queue->next[FLEXCAN_MB_QUEUE_ROOT];
+}
+
+/* Return used MB queue tail */
+static inline u8 __flexcan_queue_tail(const struct flexcan_mb_queue *queue)
+{
+	return queue->prev[FLEXCAN_MB_QUEUE_ROOT];
+}
+
+/* Check if queue is empty */
+static inline int __flexcan_queue_empty(const struct flexcan_mb_queue *queue)
+{
+	return __flexcan_queue_is_end(queue, __flexcan_queue_head(queue));
+}
+
+/* Check if queue is full */
+static inline int __flexcan_queue_full(const struct flexcan_mb_queue *queue)
+{
+	return __flexcan_queue_is_end(queue, queue->free_head);
+}
+
+/* Get free MBs */
+static inline u8 __flexcan_queue_free_avail(const struct flexcan_mb_queue *queue)
+{
+	return queue->free_num;
+}
+
+/* Compare two indexes by their timestamps.
+ *
+ * return value:
+ * < 0: timestamp of idx1 is older than timestamp of idx2
+ * = 0: timestamps of idx1 and idx2 are the same
+ * > 0: timestamp of idx1 is newer than timestamp of idx2
+ */
+static inline s16 __flexcan_queue_cmp(const struct flexcan_mb_queue *queue,
+                                      u8 idx1, u8 idx2)
+{
+	return FLEXCAN_MB_CNT_TIMESTAMP(queue->mb[idx1].can_ctrl)
+	     - FLEXCAN_MB_CNT_TIMESTAMP(queue->mb[idx2].can_ctrl);
+}
+
+/* Queue functions */
+
+/* Initialize queue */
+static void flexcan_queue_init(struct flexcan_mb_queue *queue)
+{
+	size_t i;
+
+	/* used mb list */
+	queue->next[FLEXCAN_MB_QUEUE_ROOT] = FLEXCAN_MB_QUEUE_ROOT;
+	queue->prev[FLEXCAN_MB_QUEUE_ROOT] = FLEXCAN_MB_QUEUE_ROOT;
+
+	/* free mb list */
+	queue->free_head = 0;
+	queue->free_num  = FLEXCAN_MB_QUEUE_SIZE;
+	for (i = 0; i < FLEXCAN_MB_QUEUE_SIZE; i++) {
+		queue->next[i] = i+1;
+	};
+}
+
+/* Pull first entry from queue, return NULL when queue is empty.
+ * This function also removes the returned entry from the queue
+ * and pushes it back into the free list.
+ */
+static struct flexcan_mb *flexcan_queue_out(struct flexcan_mb_queue *queue)
+{
+	u8 old_free;
+	u8 old_head;
+	u8 old_next;
+
+	if (__flexcan_queue_empty(queue))
+		return NULL;
+
+	/* remove head entry and insert it into the free list */
+	old_free = queue->free_head;
+	old_head = __flexcan_queue_head(queue);
+	old_next = queue->next[old_head];
+
+	queue->next[FLEXCAN_MB_QUEUE_ROOT] = old_next;
+	queue->prev[old_next]              = FLEXCAN_MB_QUEUE_ROOT;
+
+	queue->free_head      = old_head;
+	queue->next[old_head] = old_free;
+	queue->free_num++;
+
+	return &queue->mb[old_head];		
+}
+
+/* Prepare queue for next insertion loop */
+static void flexcan_queue_in_prepare(struct flexcan_mb_queue *queue)
+{
+	queue->start_index         = __flexcan_queue_tail(queue);
+	queue->search_back_barrier = __flexcan_queue_tail(queue);
+}
+ 
+/* Pull first free empty MB from queue and return it. If no free empty MBs exist
+ * the function returns NULL.
+ */
+static struct flexcan_mb *flexcan_queue_get_free_mb(struct flexcan_mb_queue *queue)
+{
+	u8 old_free;
+
+	if (unlikely(__flexcan_queue_full(queue)))
+		return NULL;
+
+	old_free         = queue->free_head;
+	queue->free_head = queue->next[old_free];
+	queue->free_num--;
+
+	return &queue->mb[old_free];
+}
+
+/* Push new entry into queue - mb must be a value previously returned by
+ * flexcan_queue_get_free_mb(). The new entry is inserted into the queue
+ * according to its timestamp. flexcan_queue_in_prepare() must be called
+ * between the last call to flexcan_queue_out() and the next call to
+ * flexcan_queue_in(). Then, all full hardware MBs must be fed in one loop.
+ */
+static void flexcan_queue_in(struct flexcan_mb_queue *queue, const struct flexcan_mb *mb)
+{
+	u8 mb_index = mb - &queue->mb[0];
+	
+	/* check for broken timestamp continuity */
+	u8 start = queue->start_index;
+	u8 next  = queue->next[start];
+
+	if ( !__flexcan_queue_is_end(queue, start)
+	   &&!__flexcan_queue_is_end(queue, next)
+	   &&(__flexcan_queue_cmp(queue, mb_index, next) >= 0) ) {
+		/* search forward until we find the correct insertion point */
+		do {
+			start = next;
+			next  = queue->next[start];
+		} while ( !__flexcan_queue_is_end(queue, next)
+		        &&(__flexcan_queue_cmp(queue, mb_index, next) >= 0) );
+	} else {
+		/* search backward until we find the correct insertion point */
+		while ( !__flexcan_queue_is_barrier(queue, start)
+		      &&(__flexcan_queue_cmp(queue, mb_index, start) < 0) ) {
+			start = queue->prev[start];
+		}
+		next = queue->next[start];
+	}
+
+	/* insert MB after start index */
+	queue->next[mb_index] = next;
+	queue->prev[mb_index] = start;
+	queue->next[start]    = mb_index;
+	queue->prev[next]     = mb_index; 
+
+	/* update start index */
+	queue->start_index = mb_index;
+}
+
 /*
  * Abstract off the read/write for arm versus ppc. This
  * assumes that PPC uses big-endian registers and everything
@@ -468,7 +707,7 @@
 	struct flexcan_regs __iomem *regs = priv->base;
 	struct can_frame *cf = (struct can_frame *)skb->data;
 	u32 can_id;
-	u32 ctrl = FLEXCAN_MB_CNT_CODE(0xc) | (cf->can_dlc << 16);
+	u32 ctrl = FLEXCAN_MB_CODE_TX_DATA | (cf->can_dlc << 16);
 
 	if (can_dropped_invalid_skb(dev, skb))
 		return NETDEV_TX_OK;
@@ -592,12 +831,13 @@
 		rx_state = unlikely(reg_esr & FLEXCAN_ESR_RX_WRN) ?
 			   CAN_STATE_ERROR_WARNING : CAN_STATE_ERROR_ACTIVE;
 		new_state = max(tx_state, rx_state);
-	} else {
+	} else if (unlikely(flt == FLEXCAN_ESR_FLT_CONF_PASSIVE)) {
 		__flexcan_get_berr_counter(dev, &bec);
-		new_state = flt == FLEXCAN_ESR_FLT_CONF_PASSIVE ?
-			    CAN_STATE_ERROR_PASSIVE : CAN_STATE_BUS_OFF;
+		new_state = CAN_STATE_ERROR_PASSIVE;
 		rx_state = bec.rxerr >= bec.txerr ? new_state : 0;
 		tx_state = bec.rxerr <= bec.txerr ? new_state : 0;
+	} else {
+		new_state = CAN_STATE_BUS_OFF;
 	}
 
 	/* state hasn't changed */
@@ -621,16 +861,23 @@
 	return 1;
 }
 
-static void flexcan_read_fifo(const struct net_device *dev,
-			      struct can_frame *cf)
+static int flexcan_read_frame(struct net_device *dev, struct flexcan_mb *mb)
 {
-	const struct flexcan_priv *priv = netdev_priv(dev);
-	struct flexcan_regs __iomem *regs = priv->base;
-	struct flexcan_mb __iomem *mb = &regs->cantxfg[0];
-	u32 reg_ctrl, reg_id;
+	struct net_device_stats *stats = &dev->stats;
+	struct can_frame *cf;
+	struct sk_buff *skb;
+	u32 reg_ctrl;
+	u32 reg_id;
 
-	reg_ctrl = flexcan_read(&mb->can_ctrl);
-	reg_id = flexcan_read(&mb->can_id);
+	reg_ctrl = mb->can_ctrl;
+
+	skb = alloc_can_skb(dev, &cf);
+	if (unlikely(!skb)) {
+		stats->rx_dropped++;
+		return 0;
+	}
+
+	reg_id = mb->can_id;
 	if (reg_ctrl & FLEXCAN_MB_CNT_IDE)
 		cf->can_id = ((reg_id >> 0) & CAN_EFF_MASK) | CAN_EFF_FLAG;
 	else
@@ -640,27 +887,9 @@
 		cf->can_id |= CAN_RTR_FLAG;
 	cf->can_dlc = get_can_dlc((reg_ctrl >> 16) & 0xf);
 
-	*(__be32 *)(cf->data + 0) = cpu_to_be32(flexcan_read(&mb->data[0]));
-	*(__be32 *)(cf->data + 4) = cpu_to_be32(flexcan_read(&mb->data[1]));
-
-	/* mark as read */
-	flexcan_write(FLEXCAN_IFLAG_RX_FIFO_AVAILABLE, &regs->iflag1);
-	flexcan_read(&regs->timer);
-}
-
-static int flexcan_read_frame(struct net_device *dev)
-{
-	struct net_device_stats *stats = &dev->stats;
-	struct can_frame *cf;
-	struct sk_buff *skb;
-
-	skb = alloc_can_skb(dev, &cf);
-	if (unlikely(!skb)) {
-		stats->rx_dropped++;
-		return 0;
-	}
+	*(__be32 *)(cf->data + 0) = cpu_to_be32(mb->data[0]);
+	*(__be32 *)(cf->data + 4) = cpu_to_be32(mb->data[1]);
 
-	flexcan_read_fifo(dev, cf);
 	netif_receive_skb(skb);
 
 	stats->rx_packets++;
@@ -671,58 +900,222 @@
 	return 1;
 }
 
+static void flexcan_copy_one_rxmb(struct flexcan_priv *priv, int i)
+{
+	struct flexcan_regs __iomem *regs = priv->base;
+	struct flexcan_mb __iomem *mb;
+	struct flexcan_mb *mb_queue;
+	u32 reg_ctrl;
+	u32 code;
+
+	mb = &regs->cantxfg[i];
+	reg_ctrl = flexcan_read(&mb->can_ctrl);
+	
+	code = reg_ctrl & FLEXCAN_MB_CODE_MASK;
+	while (code == FLEXCAN_MB_CODE_RX_BUSY) {
+		/* MB busy, shouldn't take long */
+		reg_ctrl = flexcan_read(&mb->can_ctrl);
+		code = reg_ctrl & FLEXCAN_MB_CODE_MASK;
+	}
+
+	if ((code == FLEXCAN_MB_CODE_RX_FULL) ||
+	    (code == FLEXCAN_MB_CODE_RX_OVERRUN)) {
+		/* get an unused MB from the queue */
+		mb_queue = flexcan_queue_get_free_mb(&priv->queue);
+		
+		if (unlikely(!mb_queue)) {
+			/* this should never ever happen as the higher functions
+			 * guarantee that we have enough space left in the queue
+			 */
+			priv->dev->stats.rx_over_errors++;
+			priv->dev->stats.rx_errors++;
+		} else {
+			/* copy contents */
+			mb_queue->can_ctrl = reg_ctrl;
+			mb_queue->can_id   = flexcan_read(&mb->can_id);
+			mb_queue->data[0]  = flexcan_read(&mb->data[0]);
+			mb_queue->data[1]  = flexcan_read(&mb->data[1]);
+			
+			/* insert MB in the queue */
+			flexcan_queue_in(&priv->queue, mb_queue);
+		}
+		
+		if (code == FLEXCAN_MB_CODE_RX_OVERRUN) {
+			/* This MB was overrun, we lost data */
+			priv->dev->stats.rx_over_errors++;
+			priv->dev->stats.rx_errors++;
+		}
+	}
+}
+
+static void flexcan_copy_rxmbs(struct flexcan_priv *priv, u32 reg_iflag1, u32 reg_iflag2)
+{
+	struct flexcan_regs __iomem *regs = priv->base;
+	u32 bit;
+	int i;
+
+	/* prepare queue */
+	flexcan_queue_in_prepare(&priv->queue);
+	
+	/* check lower MBs */
+	for (i = 0, bit = 1; reg_iflag1 != 0; bit <<= 1, ++i) {
+		if (reg_iflag1 & bit) {
+			/* handle pending MB */
+			flexcan_copy_one_rxmb(priv, i);
+			/* clear bit in shadow */
+			reg_iflag1 &= ~bit;
+			/* acknowledge interrupt */
+			flexcan_write(bit, &regs->iflag1);
+		}
+	}
+	
+	/* check upper MBs */
+	for (i = 32, bit = 1; reg_iflag2 != 0; bit <<= 1, ++i) {
+		if (reg_iflag2 & bit) {
+			/* handle pending MB */
+			flexcan_copy_one_rxmb(priv, i);
+			/* clear bit in shadow */
+			reg_iflag2 &= ~bit;
+			/* acknowledge interrupt */
+			flexcan_write(bit, &regs->iflag2);
+
+		}
+	}
+	
+	/* Unlock the last locked MB if any */
+	flexcan_read(&regs->timer);
+}
+
 static int flexcan_poll(struct napi_struct *napi, int quota)
 {
 	struct net_device *dev = napi->dev;
-	const struct flexcan_priv *priv = netdev_priv(dev);
+	struct flexcan_priv *priv = netdev_priv(dev);
 	struct flexcan_regs __iomem *regs = priv->base;
-	u32 reg_iflag1, reg_esr;
+	struct flexcan_mb *mb;
+	u32 reg_esr;
+	u32 reg_iflag1;
+	u32 reg_iflag2;
 	int work_done = 0;
+	int pending_work;
 
 	/*
 	 * The error bits are cleared on read,
 	 * use saved value from irq handler.
 	 */
-	reg_esr = flexcan_read(&regs->esr) | priv->reg_esr;
+	reg_esr    = flexcan_read(&regs->esr) | priv->reg_esr;
+	reg_iflag1 = flexcan_read(&regs->iflag1) & ~(1 << FLEXCAN_TX_BUF_ID);
+	reg_iflag2 = flexcan_read(&regs->iflag2);
+	priv->reg_esr = 0;
 
 	/* handle state changes */
 	work_done += flexcan_poll_state(dev, reg_esr);
 
-	/* handle RX-FIFO */
-	reg_iflag1 = flexcan_read(&regs->iflag1);
-	while (reg_iflag1 & FLEXCAN_IFLAG_RX_FIFO_AVAILABLE &&
-	       work_done < quota) {
-		work_done += flexcan_read_frame(dev);
-		reg_iflag1 = flexcan_read(&regs->iflag1);
-	}
-
 	/* report bus errors */
-	if (flexcan_has_and_handle_berr(priv, reg_esr) && work_done < quota)
+	if (flexcan_has_and_handle_berr(priv, reg_esr))
 		work_done += flexcan_poll_bus_err(dev, reg_esr);
 
-	if (work_done < quota) {
+	/* acknowledge error interrupts if any */
+	if (reg_esr & FLEXCAN_ESR_ALL_INT) {
+		/* ACK all bus error and state change IRQ sources */
+		flexcan_write(reg_esr & FLEXCAN_ESR_ALL_INT, &regs->esr);
+	}
+
+	pending_work = hweight32(reg_iflag1)+hweight32(reg_iflag2);
+
+	/* first of all, make sure that we have enough space in the queue to read the pending MBs */
+	while ( (__flexcan_queue_free_avail(&priv->queue) < pending_work)
+	      &&(work_done < quota)
+	      &&((mb = flexcan_queue_out(&priv->queue)) != NULL) ) {
+		work_done += flexcan_read_frame(dev, mb);
+	}
+	
+	/* handle rx messages from hardware */
+	/* note: the copy function reads all pending MBs, so we check the quota in advance */
+	if ((pending_work > 0) && (work_done+pending_work <= quota)) {
+		/* copy pending MBs */
+		flexcan_copy_rxmbs(priv, reg_iflag1, reg_iflag2);
+		
+		/* update work */
+		work_done   += pending_work;
+		pending_work = 0;
+	}
+
+	/* now, feed the MBs from the queue as long as our quota allows */
+	while ( (work_done < quota)
+	      &&((mb = flexcan_queue_out(&priv->queue)) != NULL) ) {
+		work_done += flexcan_read_frame(dev, mb);
+	}
+	
+	if (work_done+pending_work < quota) {
 		napi_complete(napi);
+
+		/* 
+		 * As re-enabling the IRQs isn't atomic we *must* disable
+		 * the flexcan interrupt here, otherwise if a message is
+		 * received while enabling the interrupts the service
+		 * function would either not handle the interrupt or it would
+		 * handle it and would disable all interrupts again and after
+		 * it returns we will end up with partially enabled interrupts!
+		 * Anyway, there is no way to get a consistent result without
+		 * a short lock while writing to the three registers.
+		 */
+		disable_irq(dev->irq);
+		
 		/* enable IRQs */
-		flexcan_write(FLEXCAN_IFLAG_DEFAULT, &regs->imask1);
+		flexcan_write(FLEXCAN_IFLAG1_DEFAULT, &regs->imask1);
+		flexcan_write(FLEXCAN_IFLAG2_DEFAULT, &regs->imask2);
 		flexcan_write(priv->reg_ctrl_default, &regs->ctrl);
+		
+		/* now we can safely re-enable the hardware interrupt */
+		enable_irq(dev->irq);
 	}
 
 	return work_done;
 }
 
+static void flexcan_trigger_napi(struct flexcan_priv *priv)
+{
+	struct flexcan_regs __iomem *regs = priv->base;
+
+	/* disable MB and error IRQs as described in the NAPI description */
+	flexcan_write((1 << FLEXCAN_TX_BUF_ID), &regs->imask1);
+	flexcan_write(0, &regs->imask2);
+	flexcan_write(priv->reg_ctrl_default & ~FLEXCAN_CTRL_ERR_ALL,
+		      &regs->ctrl);
+
+	/* trigger polling */
+	napi_schedule(&priv->napi);
+}
+
 static irqreturn_t flexcan_irq(int irq, void *dev_id)
 {
+	irqreturn_t result = IRQ_NONE;
 	struct net_device *dev = dev_id;
-	struct net_device_stats *stats = &dev->stats;
 	struct flexcan_priv *priv = netdev_priv(dev);
 	struct flexcan_regs __iomem *regs = priv->base;
-	u32 reg_iflag1, reg_esr;
+	u32 reg_iflag1;
+	u32 reg_iflag2;
+	u32 reg_esr = 0;
+	u32 reg_ctrl;
+	int reg_esr_val = 0;
 
 	reg_iflag1 = flexcan_read(&regs->iflag1);
-	reg_esr = flexcan_read(&regs->esr);
-	/* ACK all bus error and state change IRQ sources */
-	if (reg_esr & FLEXCAN_ESR_ALL_INT)
-		flexcan_write(reg_esr & FLEXCAN_ESR_ALL_INT, &regs->esr);
+	reg_iflag2 = flexcan_read(&regs->iflag2);
+	reg_ctrl   = flexcan_read(&regs->ctrl);
+
+	/* handle transmission complete interrupt */
+	if (reg_iflag1 & (1 << FLEXCAN_TX_BUF_ID)) {	
+		struct net_device_stats *stats = &dev->stats;
+		stats->tx_bytes += can_get_echo_skb(dev, 0);
+		stats->tx_packets++;
+		can_led_event(dev, CAN_LED_EVENT_TX);
+		/* after sending a RTR frame MB is in RX mode */
+		flexcan_write(FLEXCAN_MB_CODE_TX_INACTIVE,
+			      &regs->cantxfg[FLEXCAN_TX_BUF_ID].can_ctrl);
+		flexcan_write((1 << FLEXCAN_TX_BUF_ID), &regs->iflag1);
+		netif_wake_queue(dev);
+		result = IRQ_HANDLED;
+	}
 
 	/*
 	 * schedule NAPI in case of:
@@ -730,41 +1123,53 @@
 	 * - state change IRQ
 	 * - bus error IRQ and bus error reporting is activated
 	 */
-	if ((reg_iflag1 & FLEXCAN_IFLAG_RX_FIFO_AVAILABLE) ||
-	    (reg_esr & FLEXCAN_ESR_ERR_STATE) ||
-	    flexcan_has_and_handle_berr(priv, reg_esr)) {
+	if (reg_ctrl & FLEXCAN_CTRL_ERR_ALL)
+	{
+		/* any further handling only when interrupts are still
+		 * enabled (NAPI not active) to avoid interfering with
+		 * the NAPI handler
+		 */
+		 
 		/*
+		 * Read and store esr
 		 * The error bits are cleared on read,
 		 * save them for later use.
 		 */
+		reg_esr     = flexcan_read(&regs->esr);
+		reg_esr_val = 1;
 		priv->reg_esr = reg_esr & FLEXCAN_ESR_ERR_BUS;
-		flexcan_write(FLEXCAN_IFLAG_DEFAULT &
-			~FLEXCAN_IFLAG_RX_FIFO_AVAILABLE, &regs->imask1);
-		flexcan_write(priv->reg_ctrl_default & ~FLEXCAN_CTRL_ERR_ALL,
-		       &regs->ctrl);
-		napi_schedule(&priv->napi);
-	}
 
-	/* FIFO overflow */
-	if (reg_iflag1 & FLEXCAN_IFLAG_RX_FIFO_OVERFLOW) {
-		flexcan_write(FLEXCAN_IFLAG_RX_FIFO_OVERFLOW, &regs->iflag1);
-		dev->stats.rx_over_errors++;
-		dev->stats.rx_errors++;
+	        if (reg_esr & FLEXCAN_ESR_ALL_INT) {
+			/* ACK all bus error and state change IRQ sources */
+			flexcan_write(reg_esr & FLEXCAN_ESR_ALL_INT,
+				      &regs->esr);
+			flexcan_trigger_napi(priv);
+			result = IRQ_HANDLED;
+		} else if (flexcan_has_and_handle_berr(priv, reg_esr)) {
+			flexcan_trigger_napi(priv);
+			result = IRQ_HANDLED;
+		} else if ( (reg_iflag1
+		            &flexcan_read(&regs->imask1)
+		            &~(1 << FLEXCAN_TX_BUF_ID))
+		          ||(reg_iflag2 & flexcan_read(&regs->imask2)) ) {
+			/* rx or tx interrupt */
+			flexcan_trigger_napi(priv);
+			result = IRQ_HANDLED;
+		}
 	}
 
-	/* transmission complete interrupt */
-	if (reg_iflag1 & (1 << FLEXCAN_TX_BUF_ID)) {
-		stats->tx_bytes += can_get_echo_skb(dev, 0);
-		stats->tx_packets++;
-		can_led_event(dev, CAN_LED_EVENT_TX);
-		/* after sending a RTR frame mailbox is in RX mode */
-		flexcan_write(FLEXCAN_MB_CODE_TX_INACTIVE,
-			      &regs->cantxfg[FLEXCAN_TX_BUF_ID].can_ctrl);
-		flexcan_write((1 << FLEXCAN_TX_BUF_ID), &regs->iflag1);
-		netif_wake_queue(dev);
+	#if 0
+	if (result != IRQ_HANDLED) {
+		if (reg_esr_val)
+			netdev_err(dev, "BUG! Unhandled IRQ: iflag1=0x%08X, iflag2=0x%08X, esr=0x%08X, ctrl1=%08X, imask1=0x%08X, imask2=0x%08X\n",
+			           reg_iflag1, reg_iflag2, reg_esr, reg_ctrl, flexcan_read(&regs->imask1), flexcan_read(&regs->imask2));
+		else
+			netdev_err(dev, "BUG! Unhandled IRQ: iflag1=0x%08X, iflag2=0x%08X, ctrl1=%08X, imask1=0x%08X, imask2=0x%08X\n",
+			           reg_iflag1, reg_iflag2, reg_ctrl, flexcan_read(&regs->imask1), flexcan_read(&regs->imask2));
 	}
+	#endif
 
-	return IRQ_HANDLED;
+	return result;
 }
 
 static void flexcan_set_bittiming(struct net_device *dev)
@@ -815,9 +1220,12 @@
 {
 	struct flexcan_priv *priv = netdev_priv(dev);
 	struct flexcan_regs __iomem *regs = priv->base;
-	u32 reg_mcr, reg_ctrl, reg_crl2, reg_mecr;
+	u32 reg_mcr, reg_ctrl, reg_ctrl2, reg_mecr;
 	int err, i;
 
+	/* initialize queue */
+	flexcan_queue_init(&priv->queue);
+	
 	/* enable module */
 	err = flexcan_chip_enable(priv);
 	if (err)
@@ -834,7 +1242,7 @@
 	 * MCR
 	 *
 	 * enable freeze
-	 * enable fifo
+	 * disable fifo
 	 * halt now
 	 * only supervisor access
 	 * enable warning int
@@ -843,11 +1251,11 @@
 	 *
 	 */
 	reg_mcr = flexcan_read(&regs->mcr);
-	reg_mcr &= ~FLEXCAN_MCR_MAXMB(0xff);
-	reg_mcr |= FLEXCAN_MCR_FRZ | FLEXCAN_MCR_FEN | FLEXCAN_MCR_HALT |
+	reg_mcr &= ~(FLEXCAN_MCR_MAXMB(0xff) | FLEXCAN_MCR_FEN);
+	reg_mcr |= FLEXCAN_MCR_FRZ | FLEXCAN_MCR_HALT |
 		FLEXCAN_MCR_SUPV | FLEXCAN_MCR_WRN_EN |
 		FLEXCAN_MCR_IDAM_C | FLEXCAN_MCR_SRX_DIS |
-		FLEXCAN_MCR_MAXMB(FLEXCAN_TX_BUF_ID);
+		FLEXCAN_MCR_BCC | FLEXCAN_MCR_MAXMB(0x3f);
 	netdev_dbg(dev, "%s: writing mcr=0x%08x", __func__, reg_mcr);
 	flexcan_write(reg_mcr, &regs->mcr);
 
@@ -880,31 +1288,38 @@
 
 	/* save for later use */
 	priv->reg_ctrl_default = reg_ctrl;
+	/* leave interrupts disabled for now */
+	reg_ctrl &= ~FLEXCAN_CTRL_ERR_ALL;
 	netdev_dbg(dev, "%s: writing ctrl=0x%08x", __func__, reg_ctrl);
 	flexcan_write(reg_ctrl, &regs->ctrl);
 
-	/* clear and invalidate all mailboxes first */
-	for (i = FLEXCAN_TX_BUF_ID; i < ARRAY_SIZE(regs->cantxfg); i++) {
-		flexcan_write(FLEXCAN_MB_CODE_RX_INACTIVE,
+	if (priv->devtype_data->features & FLEXCAN_HAS_V10_FEATURES) {
+		/* CTRL2: Enable EACEN */
+		reg_ctrl2 = flexcan_read(&regs->ctrl2);
+		reg_ctrl2 |= FLEXCAN_CTRL2_EACEN;
+		flexcan_write(reg_ctrl2, &regs->ctrl2);
+	}
+
+	/* Prepare MBs. Skip the first, use one for TX the rest for RX */
+	for (i = 1; i < ARRAY_SIZE(regs->cantxfg); i++) {
+		if (i == FLEXCAN_TX_BUF_ID)
+			flexcan_write(FLEXCAN_MB_CODE_TX_INACTIVE,
+				&regs->cantxfg[i].can_ctrl);
+		else
+			flexcan_write(FLEXCAN_MB_CODE_RX_EMPTY,
 			      &regs->cantxfg[i].can_ctrl);
+		flexcan_write(0, &regs->rximr[i]); /* Clear filter */
 	}
 
-	/* Errata ERR005829: mark first TX mailbox as INACTIVE */
+	/* Errata ERR005829: mark first TX MB as INACTIVE */
 	flexcan_write(FLEXCAN_MB_CODE_TX_INACTIVE,
 		      &regs->cantxfg[FLEXCAN_TX_BUF_RESERVED].can_ctrl);
 
-	/* mark TX mailbox as INACTIVE */
-	flexcan_write(FLEXCAN_MB_CODE_TX_INACTIVE,
-		      &regs->cantxfg[FLEXCAN_TX_BUF_ID].can_ctrl);
-
 	/* acceptance mask/acceptance code (accept everything) */
 	flexcan_write(0x0, &regs->rxgmask);
 	flexcan_write(0x0, &regs->rx14mask);
 	flexcan_write(0x0, &regs->rx15mask);
 
-	if (priv->devtype_data->features & FLEXCAN_HAS_V10_FEATURES)
-		flexcan_write(0x0, &regs->rxfgmask);
-
 	/*
 	 * On Vybrid, disable memory error detection interrupts
 	 * and freeze mode.
@@ -918,9 +1333,9 @@
 		 * and Correction of Memory Errors" to write to
 		 * MECR register
 		 */
-		reg_crl2 = flexcan_read(&regs->crl2);
-		reg_crl2 |= FLEXCAN_CRL2_ECRWRE;
-		flexcan_write(reg_crl2, &regs->crl2);
+		reg_ctrl2 = flexcan_read(&regs->ctrl2);
+		reg_ctrl2 |= FLEXCAN_CTRL2_ECRWRE;
+		flexcan_write(reg_ctrl2, &regs->ctrl2);
 
 		reg_mecr = flexcan_read(&regs->mecr);
 		reg_mecr &= ~FLEXCAN_MECR_ECRWRDIS;
@@ -941,8 +1356,12 @@
 
 	priv->can.state = CAN_STATE_ERROR_ACTIVE;
 
-	/* enable FIFO interrupts */
-	flexcan_write(FLEXCAN_IFLAG_DEFAULT, &regs->imask1);
+	/* enable all interrupts atomically */
+	disable_irq(dev->irq);
+	flexcan_write(FLEXCAN_IFLAG1_DEFAULT, &regs->imask1);
+	flexcan_write(FLEXCAN_IFLAG2_DEFAULT, &regs->imask2);
+	flexcan_write(priv->reg_ctrl_default, &regs->ctrl);
+	enable_irq(dev->irq);
 
 	/* print chip status */
 	netdev_dbg(dev, "%s: reading mcr=0x%08x ctrl=0x%08x\n", __func__,
@@ -972,10 +1391,13 @@
 	flexcan_chip_freeze(priv);
 	flexcan_chip_disable(priv);
 
-	/* Disable all interrupts */
+	/* disable all interrupts atomically */
+	disable_irq(dev->irq);
 	flexcan_write(0, &regs->imask1);
+	flexcan_write(0, &regs->imask2);
 	flexcan_write(priv->reg_ctrl_default & ~FLEXCAN_CTRL_ERR_ALL,
 		      &regs->ctrl);
+	enable_irq(dev->irq);
 
 	flexcan_transceiver_disable(priv);
 	priv->can.state = CAN_STATE_STOPPED;
@@ -1225,6 +1647,7 @@
 		CAN_CTRLMODE_LISTENONLY	| CAN_CTRLMODE_3_SAMPLES |
 		CAN_CTRLMODE_BERR_REPORTING;
 	priv->base = base;
+	priv->dev = dev;
 	priv->clk_ipg = clk_ipg;
 	priv->clk_per = clk_per;
 	priv->pdata = dev_get_platdata(&pdev->dev);

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-08 14:38 can: flexcan: implement workaround for FIFO overruns (based on code by David Jander) Torsten Lang
@ 2015-07-09  6:33 ` Holger Schurig
  2015-07-09  6:38   ` Holger Schurig
  2015-07-09  9:26   ` Torsten Lang
  2015-07-09  6:58 ` Alexander Stein
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 24+ messages in thread
From: Holger Schurig @ 2015-07-09  6:33 UTC (permalink / raw)
  To: Torsten Lang; +Cc: linux-can, Marc Kleine-Budde

Wow, I'm very interested into your work because I too had RxFIFO overruns.

So, have here some first "review" remarks:

* But for a review it's much better to send patches in-line (i.E. not
as an attachment)
* And because your "Signed-off:" is missing, we don't know what the
release status is. As is, the patch cannot added ever to the kernel!
* try to fix the things that scripts/checkpatch.pl
patches/flexcan_kernel_4.1-rc8.patch say, e.g. it found
** CHECK: Alignment should match open parenthesis
** CHECK: Blank lines aren't necessary before a close brace '}'
** CHECK: if this code is redundant consider removing it
** CHECK: Logical continuations should be on the previous line
** CHECK: spaces preferred around that '&' (ctx:ExO)
** CHECK: spaces preferred around that '+' (ctx:VxV)
** WARNING: line over 80 characters
** WARNING: Missing a blank line after declarations
** WARNING: networking block comments don't use an empty /* line, use
/* Comment...
** WARNING: please, no space before tabs
** WARNING: please, no spaces at the start of a line
** ERROR: code indent should use tabs where possible
** ERROR: Missing Signed-off-by: line(s)
** ERROR: space prohibited after that open parenthesis '('
** ERROR: space prohibited before that close parenthesis ')'
** ERROR: spaces required around that '&&' (ctx:ExO)
** ERROR: spaces required around that '||' (ctx:ExV)
** ERROR: spaces required around that '&&' (ctx:ExV)
** ERROR: that open brace { should be on the previous line
** ERROR: trailing whitespace
* All of this is written in Documentation/SubmittingPatches and
Documentation/CodingStyle

Other remarks:
* after your patch, FLEXCAN_CTRL2_EACEN is defined twice.
* You change crl2 to ctrl2, but in Linux git this is already done
(commit 6f75fce1ea81b1d91f6f1a8e3dd00b0ce8e83982). So your patch
doesn't apply to Linux git. I have such problem also from time to
time, because I tend to run stable Linux kernel (and not linux-git) on
my customers hardware. But then I look at what happened to this driver
in linux-git and backport the patches into my local
quilt-stack-of-patches. This essentially brings this one driver to
linux-git level, and any patch I do would also apply to linux-git.
* I like your descriptive comments!  I just don't know why you
describe the FIFO-Modes after cantxfg and before _reserved4?
* you add a "struct net_device *dev". I doubt that this is really
needed, the old flexcan could do without ...
* please remove debugging code, e.g. "#if 0"  sections. I personally
do this by having them in a second quilt patch, e.g.
"flexcan-debug.patch". So I can roll them in fast when I need them,
but keep the to-be-submitted patch clean.



BTW, you can remove a good amount of latency in an i.MX6 kernel if you
remove CONFIG_MMC_CLKGATE. That introduces a latency of 1 ms every
second ... urgh.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-09  6:33 ` Holger Schurig
@ 2015-07-09  6:38   ` Holger Schurig
  2015-07-09  9:26   ` Torsten Lang
  1 sibling, 0 replies; 24+ messages in thread
From: Holger Schurig @ 2015-07-09  6:38 UTC (permalink / raw)
  To: Torsten Lang; +Cc: linux-can, Marc Kleine-Budde

One more idea, and this is gonna make more work: is it possible to
split the patch into two?  The first one would receive via the
mailboxes.

And the second patch adds the sorting on this.

My vague idea is that the sorting can be done differently, e.g. here
in the mailing people talk currently about timestamping skbd, or about
RPS. Maybe it's possible to use an existing solution/framework and not
invent an own solution here.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-08 14:38 can: flexcan: implement workaround for FIFO overruns (based on code by David Jander) Torsten Lang
  2015-07-09  6:33 ` Holger Schurig
@ 2015-07-09  6:58 ` Alexander Stein
  2015-07-09  7:27   ` Holger Schurig
  2015-07-09  7:42 ` Tom Evans
  2015-07-09  8:06 ` Holger Schurig
  3 siblings, 1 reply; 24+ messages in thread
From: Alexander Stein @ 2015-07-09  6:58 UTC (permalink / raw)
  To: Torsten Lang; +Cc: linux-can, Marc Kleine-Budde

Hello,

On Wednesday 08 July 2015 16:38:02, Torsten Lang wrote:
> This patch is for Kernel 4.1-rc8 (and should AFAIK still be valid also for the current 4.1/4.2 versions). It is based on the rework done by David Jander which disables the only six messages deep hardware FIFO of the FlexCAN core and instead uses all available mailboxes for reception.

Please keep in mind that disabling FIFO mode in FlexCAN disables RTR-reception on i.MX3. Please refer to http://marc.info/?l=linux-can&m=141199785820075&w=2 about the mail I've written to David last year). This needs to be handled.

Best regards,
Alexander
-- 
Dipl.-Inf. Alexander Stein

SYS TEC electronic GmbH
Am Windrad 2
08468 Heinsdorfergrund
Tel.: 03765 38600-1156
Fax: 03765 38600-4100
Email: alexander.stein@systec-electronic.com
Website: www.systec-electronic.com
 
Managing Director: Dipl.-Phys. Siegmar Schmidt
Commercial registry: Amtsgericht Chemnitz, HRB 28082


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-09  6:58 ` Alexander Stein
@ 2015-07-09  7:27   ` Holger Schurig
  2015-07-09  7:48     ` Alexander Stein
  0 siblings, 1 reply; 24+ messages in thread
From: Holger Schurig @ 2015-07-09  7:27 UTC (permalink / raw)
  To: Alexander Stein; +Cc: Torsten Lang, linux-can, Marc Kleine-Budde

2015-07-09 8:58 GMT+02:00 Alexander Stein
<alexander.stein@systec-electronic.com>:
> This needs to be handled.

Do you have any suggestion how to handle this?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-08 14:38 can: flexcan: implement workaround for FIFO overruns (based on code by David Jander) Torsten Lang
  2015-07-09  6:33 ` Holger Schurig
  2015-07-09  6:58 ` Alexander Stein
@ 2015-07-09  7:42 ` Tom Evans
  2015-07-09  9:48   ` Torsten Lang
  2015-07-09  8:06 ` Holger Schurig
  3 siblings, 1 reply; 24+ messages in thread
From: Tom Evans @ 2015-07-09  7:42 UTC (permalink / raw)
  To: Torsten Lang, linux-can; +Cc: Marc Kleine-Budde

On 09/07/15 00:38, Torsten Lang wrote:
> It is based on the rework done by David Jander which disables
 > the only six messages deep hardware FIFO of the FlexCAN core
 > and instead uses all available mailboxes for reception.

 > #define FLEXCAN_MB_QUEUE_SIZE		62

The FlexCAN Driver is not specific to the i.MX. It is used in other FreeScale 
parts. The early parts (ColdFire) have 16 buffers, didn't have "Message 
Queueing" or the FIFO, so aren't supported by Linux at all. The one in the
MCF5441x had the FIFO, Message Queueing, but only 16 Messages. I don't know 
which ones are in the PPC chips. You may need to make the queue size settable 
in the Device Tree.

Two years back I had FlexCAN overrunning. I found the problem to be that the 
driver reads the messages during NAPI, while the matching Ethernet driver read 
them during interrupts, and there was unnecessary kernel debugging on.

I rewrote it to receive all messages during interrupts and haven't had any 
problems since. Is it "normal" to have interrupts locked out for more than 
300us (six 50us CAN messages at 1MHz)? Shouldn't that be something that should 
be fixed? Or is having interrupts locked out for 3200us (64 message buffers) 
the new "normal"?

I'd be interested in reasons why the above isn't a good solution to this problem.

 > The mailboxes now are serviced as recommended by Freescale's i.MX6
 > user's manual,

Which recommends sorting the messages by 16-bit hardware timestamps.

 > and the servicing as such has been moved completely
 > over to the NAPI poll function.

Are you sure NAPI won't get delayed by more than 3.2ms? It should have a worse 
latency than the raw interrupts. Which is the "worst" and more/less likely, a 
300us Interrupt latency or a 3200us NAPI latency (800us on 16-buffer models)?

> P. S.The main remaining problem is that my application no longer
 > receives the CAN messages correctly with kernel 4.1 ... it only receives
 > single messages about every 30..60s ... As soon as I start a candump in
 > parallel my application also receives the test messages. I currently
 > have no explanation for this behaviour.

I think I remember a recent post noting that problem, but searching the list 
on gmane doesn't find it for me. I hope someone else remembers and posts the 
patch that caused this problem and the one that fixed it. It might have had 
something to do with candump having filter-setting added.

Tom


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-09  7:27   ` Holger Schurig
@ 2015-07-09  7:48     ` Alexander Stein
  2015-07-09  7:59       ` Holger Schurig
  0 siblings, 1 reply; 24+ messages in thread
From: Alexander Stein @ 2015-07-09  7:48 UTC (permalink / raw)
  To: Holger Schurig; +Cc: Torsten Lang, linux-can, Marc Kleine-Budde

Hello,

On Thursday 09 July 2015 09:27:57, Holger Schurig wrote:
> 2015-07-09 8:58 GMT+02:00 Alexander Stein
> <alexander.stein@systec-electronic.com>:
> > This needs to be handled.
> 
> Do you have any suggestion how to handle this?

Well, up to now I haven't seen any CAN frame loss on several i.MX35 boards while using FIFO mode. So I suggest to extend the compatible to keep the FIFO mode on i.MX3 hardware.

Best regards,
Alexander
-- 
Dipl.-Inf. Alexander Stein

SYS TEC electronic GmbH
Am Windrad 2
08468 Heinsdorfergrund
Tel.: 03765 38600-1156
Fax: 03765 38600-4100
Email: alexander.stein@systec-electronic.com
Website: www.systec-electronic.com
 
Managing Director: Dipl.-Phys. Siegmar Schmidt
Commercial registry: Amtsgericht Chemnitz, HRB 28082


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-09  7:48     ` Alexander Stein
@ 2015-07-09  7:59       ` Holger Schurig
  2015-07-09 10:03         ` Torsten Lang
  2015-07-22  8:00         ` Torsten Lang
  0 siblings, 2 replies; 24+ messages in thread
From: Holger Schurig @ 2015-07-09  7:59 UTC (permalink / raw)
  To: Alexander Stein; +Cc: Torsten Lang, linux-can, Marc Kleine-Budde

> The early parts (ColdFire) have 16 buffers, didn't have "Message Queueing" or the FIFO, so aren't supported by Linux at all.

Fine, so we can ignore them :-)

> Is it "normal" to have interrupts locked out for more than 300us (six 50us CAN messages at 1MHz)?

Unfortunately yes.  My $CUSTOMER had overruns with 500 kB/s, 80% bus
load, and CAN messages with 3 bytes of data. My guess this was mostly
due to the sucky SDHCI (eMMC) driver code in Linux. I fixed that, but
occassionally ftrace still shows large times with irqsoff, I need to
dig into them as well. Still /me thinks that an RxFIFO of just 6 CAN
messages isn't swell for an OS that is known to not guarantee response
times, like Linux. Especially not for CAN, people use it after all
because of it's reliability guarantees.

BTW, with the current in-tree FlexCAN drivers we have two things were
IRQ or scheduling latency can cause lost frames:

* the time from the hardware IRQ until flexcan_isr() is actually
called, e.g. because for spin_lock_irqsave
* the time when the ISR does a napi_schedule() until NAPI get's
scheduled and calls flexcan_poll()

For the first latency, only FTRACE and fixing the other kernel parts
helps. Unfortunately, some kernel parts are so complex that it is
over-the-head of many people (ok, I confess: over my head).

I killed the second latency with my kfifo patch that I posted the
other day. Getting rid of NAPI completely would also be a method, I'm
not sure NAPI wins us anything, compared to Ethernet CAN is slow.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-08 14:38 can: flexcan: implement workaround for FIFO overruns (based on code by David Jander) Torsten Lang
                   ` (2 preceding siblings ...)
  2015-07-09  7:42 ` Tom Evans
@ 2015-07-09  8:06 ` Holger Schurig
  2015-07-09  8:43   ` Oliver Hartkopp
  3 siblings, 1 reply; 24+ messages in thread
From: Holger Schurig @ 2015-07-09  8:06 UTC (permalink / raw)
  To: Torsten Lang; +Cc: linux-can, Marc Kleine-Budde

2015-07-08 16:38 GMT+02:00 Torsten Lang <torsten.lang@uweschneider.de>:
> P. S. As mentioned I have some trouble getting the kernel to work on the hardware I use.
> The main remaining problem is that my application no longer receives the CAN messages
> correctly with kernel 4.1 - neither with the original FlexCAN driver nor with mine. As long
> as y Qt application uses the CAN bus alone it only receives single messages about
> every 30..60s (actually the test messages are transmitted every 250ms). As soon
> as I start a candump in parallel my application also receives the test messages. I
> currently have no explanation for this behaviour.

I'm on stable Linux 3.18.17 and do not observe this behavior.

4.1.x migth be whacky, because I once had the serial port hang in some
DMA related code. I "fixed" this with a

+#if 0
+       if (is_imx6q_uart(sport) && !uart_console(port) &&
+           !sport->dma_is_inited)
+               imx_uart_dma_init(sport);
+#endif

... but that was certainly a hack. My $CUSTOMER promply reported my
other things like a stopping touchscreen, GUI app not accepting input
anymore. I was flooded with other work, I want back to savety of
3.18.x :-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-09  8:06 ` Holger Schurig
@ 2015-07-09  8:43   ` Oliver Hartkopp
  0 siblings, 0 replies; 24+ messages in thread
From: Oliver Hartkopp @ 2015-07-09  8:43 UTC (permalink / raw)
  To: Holger Schurig, Torsten Lang; +Cc: linux-can, Marc Kleine-Budde

On 09.07.2015 10:06, Holger Schurig wrote:
> 2015-07-08 16:38 GMT+02:00 Torsten Lang <torsten.lang@uweschneider.de>:
>> P. S. As mentioned I have some trouble getting the kernel to work on the hardware I use.
>> The main remaining problem is that my application no longer receives the CAN messages
>> correctly with kernel 4.1 - neither with the original FlexCAN driver nor with mine. As long
>> as y Qt application uses the CAN bus alone it only receives single messages about
>> every 30..60s (actually the test messages are transmitted every 250ms). As soon
>> as I start a candump in parallel my application also receives the test messages. I
>> currently have no explanation for this behaviour.

Kernel versions 4.1-rc to 4.1.1 (included) have a problem when timestamping is 
not enabled. candump enables the timestamping in your case -> makes it work 
properly.

Please apply this stable patch to the affected 4.1 kernels:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=36c01245eb8046c16eee6431e7dbfbb302635fa8

This fix turned out to be a bit to brutal :-)

So here's the 'nice' rework of that fix:

http://marc.info/?l=linux-can&m=143531273819705&w=2

But the first fix will fix your problem.

Regards,
Oliver



>
> I'm on stable Linux 3.18.17 and do not observe this behavior.
>
> 4.1.x migth be whacky, because I once had the serial port hang in some
> DMA related code. I "fixed" this with a
>
> +#if 0
> +       if (is_imx6q_uart(sport) && !uart_console(port) &&
> +           !sport->dma_is_inited)
> +               imx_uart_dma_init(sport);
> +#endif
>
> ... but that was certainly a hack. My $CUSTOMER promply reported my
> other things like a stopping touchscreen, GUI app not accepting input
> anymore. I was flooded with other work, I want back to savety of
> 3.18.x :-)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-can" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-09  6:33 ` Holger Schurig
  2015-07-09  6:38   ` Holger Schurig
@ 2015-07-09  9:26   ` Torsten Lang
  2015-07-09  9:32     ` Holger Schurig
  1 sibling, 1 reply; 24+ messages in thread
From: Torsten Lang @ 2015-07-09  9:26 UTC (permalink / raw)
  To: Holger Schurig; +Cc: linux-can, Marc Kleine-Budde

Hi Holger,
regarding the formal things - I already reformatted my work according to the CodingStyle guidelines in the kernel documentation folder, but actually I didn't see a hint regarding the check script, otherwise I would've let run it over my code.

As the kernel has heavily grown since the last time I wrote drivers (which was for the Atari - at that time we were around kernel 0.6..0.8 for the M68K architecture while the mainline kernel just hitted 1.0 AFAIR) it's quite easy to miss something.

My patch is based in commit 0f57d86787d8b1076ea8f9cbdddda2a46d534a27 (tag v4.1-rc8) but wasn't created with git as I had a bit trouble with my tools...

Regards,
Torsten

Am 09.07.2015 um 08:33 schrieb Holger Schurig:
> Wow, I'm very interested into your work because I too had RxFIFO overruns.
>
> So, have here some first "review" remarks:
>
> * But for a review it's much better to send patches in-line (i.E. not
> as an attachment)
> * And because your "Signed-off:" is missing, we don't know what the
> release status is. As is, the patch cannot added ever to the kernel!
> * try to fix the things that scripts/checkpatch.pl
> patches/flexcan_kernel_4.1-rc8.patch say, e.g. it found
> ** CHECK: Alignment should match open parenthesis
> ** CHECK: Blank lines aren't necessary before a close brace '}'
> ** CHECK: if this code is redundant consider removing it
> ** CHECK: Logical continuations should be on the previous line
> ** CHECK: spaces preferred around that '&' (ctx:ExO)
> ** CHECK: spaces preferred around that '+' (ctx:VxV)
> ** WARNING: line over 80 characters
> ** WARNING: Missing a blank line after declarations
> ** WARNING: networking block comments don't use an empty /* line, use
> /* Comment...
> ** WARNING: please, no space before tabs
> ** WARNING: please, no spaces at the start of a line
> ** ERROR: code indent should use tabs where possible
> ** ERROR: Missing Signed-off-by: line(s)
> ** ERROR: space prohibited after that open parenthesis '('
> ** ERROR: space prohibited before that close parenthesis ')'
> ** ERROR: spaces required around that '&&' (ctx:ExO)
> ** ERROR: spaces required around that '||' (ctx:ExV)
> ** ERROR: spaces required around that '&&' (ctx:ExV)
> ** ERROR: that open brace { should be on the previous line
> ** ERROR: trailing whitespace
> * All of this is written in Documentation/SubmittingPatches and
> Documentation/CodingStyle
>
> Other remarks:
> * after your patch, FLEXCAN_CTRL2_EACEN is defined twice.
> * You change crl2 to ctrl2, but in Linux git this is already done
> (commit 6f75fce1ea81b1d91f6f1a8e3dd00b0ce8e83982). So your patch
> doesn't apply to Linux git. I have such problem also from time to
> time, because I tend to run stable Linux kernel (and not linux-git) on
> my customers hardware. But then I look at what happened to this driver
> in linux-git and backport the patches into my local
> quilt-stack-of-patches. This essentially brings this one driver to
> linux-git level, and any patch I do would also apply to linux-git.
> * I like your descriptive comments!  I just don't know why you
> describe the FIFO-Modes after cantxfg and before _reserved4?
> * you add a "struct net_device *dev". I doubt that this is really
> needed, the old flexcan could do without ...
> * please remove debugging code, e.g. "#if 0"  sections. I personally
> do this by having them in a second quilt patch, e.g.
> "flexcan-debug.patch". So I can roll them in fast when I need them,
> but keep the to-be-submitted patch clean.
>
>
>
> BTW, you can remove a good amount of latency in an i.MX6 kernel if you
> remove CONFIG_MMC_CLKGATE. That introduces a latency of 1 ms every
> second ... urgh.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-09  9:26   ` Torsten Lang
@ 2015-07-09  9:32     ` Holger Schurig
  2015-07-09  9:36       ` Marc Kleine-Budde
  0 siblings, 1 reply; 24+ messages in thread
From: Holger Schurig @ 2015-07-09  9:32 UTC (permalink / raw)
  To: Torsten Lang; +Cc: linux-can, Marc Kleine-Budde

Thorsten,

after pondering a bit on this I'd just say give up.

The FlexCAN driver is in a state of limbo. For many months
occassionally people submit some path, but nothing really changes,
only minor things get accepted and applied to the kernel. I see two
reasons:

* some people have the picture of a "one big patch series that saves
the world" view. But this one is complex, and so it never
materialized. For example my KFIFO patch was not dismissed because of
technical reasons, but because this big patch will eventually come.
Marc said "I'll look into it at the end of the week", but that was
optimistic
* the FlexCAN driver is too complex, or, better phrased: the
environment where it runs is too complex. When you make a substantial
change, you'd need to test it an all CPUs that implement various
versions of the FlexCAN IP. But who has access to an i.MX6, to an
i.MX5, and to the other chips that run it?  This hinders changes that
do bigger architectural changes. I'm not sure how this can be fixed,
maybe in the way the wifi people do it with Atheros chipsets.

But my guess, at the current point of view, is that your patch (even
when reworked) has little chance to get accepted. It will just rest
here in the mailing list.

Do I sound frustrated?  Yes. Because the FlexCAN RxFIFO-packet-lost
problem is known for a very long time and nothing substantial happens.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-09  9:32     ` Holger Schurig
@ 2015-07-09  9:36       ` Marc Kleine-Budde
  2015-07-09  9:42         ` Holger Schurig
  0 siblings, 1 reply; 24+ messages in thread
From: Marc Kleine-Budde @ 2015-07-09  9:36 UTC (permalink / raw)
  To: Holger Schurig, Torsten Lang; +Cc: linux-can

[-- Attachment #1: Type: text/plain, Size: 1111 bytes --]

On 07/09/2015 11:32 AM, Holger Schurig wrote:
> Thorsten,
> 
> after pondering a bit on this I'd just say give up.
> 
> The FlexCAN driver is in a state of limbo. For many months
> occassionally people submit some path, but nothing really changes,
> only minor things get accepted and applied to the kernel. I see two
> reasons:
> 
> * some people have the picture of a "one big patch series that saves
> the world" view. But this one is complex, and so it never
> materialized. For example my KFIFO patch was not dismissed because of
> technical reasons, but because this big patch will eventually come.
> Marc said "I'll look into it at the end of the week", but that was
> optimistic

Sorry - I'm not full time paid to work on linux-can. But there soon will
be another end-of-the-week to look at the code.

Marc
-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-09  9:36       ` Marc Kleine-Budde
@ 2015-07-09  9:42         ` Holger Schurig
  0 siblings, 0 replies; 24+ messages in thread
From: Holger Schurig @ 2015-07-09  9:42 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: Torsten Lang, linux-can

> Sorry - I'm not full time paid to work on linux-can.

Marc, the more awesome is what you do and what you've done!

But still, if things like David Jaunder's work takes so long to
materialize then something in the process is wrong.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-09  7:42 ` Tom Evans
@ 2015-07-09  9:48   ` Torsten Lang
  2015-07-09 10:05     ` Holger Schurig
  2015-07-09 15:36     ` Tom Evans
  0 siblings, 2 replies; 24+ messages in thread
From: Torsten Lang @ 2015-07-09  9:48 UTC (permalink / raw)
  To: tom_usenet, linux-can; +Cc: Marc Kleine-Budde

Am 09.07.2015 um 09:42 schrieb Tom Evans:
> On 09/07/15 00:38, Torsten Lang wrote:
>> It is based on the rework done by David Jander which disables
> > the only six messages deep hardware FIFO of the FlexCAN core
> > and instead uses all available mailboxes for reception.
>
> > #define FLEXCAN_MB_QUEUE_SIZE        62
>
> The FlexCAN Driver is not specific to the i.MX. It is used in other FreeScale parts. The early parts (ColdFire) have 16 buffers, didn't have "Message Queueing" or the FIFO, so aren't supported by Linux at all. The one in the
> MCF5441x had the FIFO, Message Queueing, but only 16 Messages. I don't know which ones are in the PPC chips. You may need to make the queue size settable in the Device Tree.
>
> Two years back I had FlexCAN overrunning. I found the problem to be that the driver reads the messages during NAPI, while the matching Ethernet driver read them during interrupts, and there was unnecessary kernel debugging on.
>
> I rewrote it to receive all messages during interrupts and haven't had any problems since. Is it "normal" to have interrupts locked out for more than 300us (six 50us CAN messages at 1MHz)? Shouldn't that be something that should be fixed? Or is having interrupts locked out for 3200us (64 message buffers) the new "normal"?
>
> I'd be interested in reasons why the above isn't a good solution to this problem.
I did tests with reading out the mailboxes directly in the interrupt handler but still had problems. From what I found during my search in the net the interrupt handling implementation in Linux for the Freescale range of SoCs seems to suck because it does not configure any interrupt priorization and the interrupt handler "prefers" to handle interrupts just by the bit order in the interrupt controller could lead to very high latencies in case of FlexCAN interrupts. On which i.MX did you test your change with success?
>
> > The mailboxes now are serviced as recommended by Freescale's i.MX6
> > user's manual,
>
> Which recommends sorting the messages by 16-bit hardware timestamps.
Which recommends to service mailboxes according to the corresponding interrupt flags while David's code reads out the control code and marks full mailboxes as inactive (and active again later).
>
> > and the servicing as such has been moved completely
> > over to the NAPI poll function.
>
> Are you sure NAPI won't get delayed by more than 3.2ms? It should have a worse latency than the raw interrupts. Which is the "worst" and more/less likely, a 300us Interrupt latency or a 3200us NAPI latency (800us on 16-buffer models)?
>
>> P. S.The main remaining problem is that my application no longer
> > receives the CAN messages correctly with kernel 4.1 ... it only receives
> > single messages about every 30..60s ... As soon as I start a candump in
> > parallel my application also receives the test messages. I currently
> > have no explanation for this behaviour.
>
> I think I remember a recent post noting that problem, but searching the list on gmane doesn't find it for me. I hope someone else remembers and posts the patch that caused this problem and the one that fixed it. It might have had something to do with candump having filter-setting added.
>
> Tom
>
Yes, I lately got some info about these problems in 4.1.

Torsten

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-09  7:59       ` Holger Schurig
@ 2015-07-09 10:03         ` Torsten Lang
  2015-07-22  8:00         ` Torsten Lang
  1 sibling, 0 replies; 24+ messages in thread
From: Torsten Lang @ 2015-07-09 10:03 UTC (permalink / raw)
  To: Holger Schurig, Alexander Stein; +Cc: linux-can, Marc Kleine-Budde

Am 09.07.2015 um 09:59 schrieb Holger Schurig:
>> The early parts (ColdFire) have 16 buffers, didn't have "Message Queueing" or the FIFO, so aren't supported by Linux at all.
> Fine, so we can ignore them :-)
>
>> Is it "normal" to have interrupts locked out for more than 300us (six 50us CAN messages at 1MHz)?
> Unfortunately yes.  My $CUSTOMER had overruns with 500 kB/s, 80% bus
> load, and CAN messages with 3 bytes of data. My guess this was mostly
> due to the sucky SDHCI (eMMC) driver code in Linux. I fixed that, but
> occassionally ftrace still shows large times with irqsoff, I need to
> dig into them as well. Still /me thinks that an RxFIFO of just 6 CAN
> messages isn't swell for an OS that is known to not guarantee response
> times, like Linux. Especially not for CAN, people use it after all
> because of it's reliability guarantees.
Same for our customer's project. I first noticed when data was missing in the logs our application writes.
> BTW, with the current in-tree FlexCAN drivers we have two things were
> IRQ or scheduling latency can cause lost frames:
>
> * the time from the hardware IRQ until flexcan_isr() is actually
> called, e.g. because for spin_lock_irqsave
> * the time when the ISR does a napi_schedule() until NAPI get's
> scheduled and calls flexcan_poll()
I'm not sure yet which latencies allocating and feeding the SKBs introduces, but I think NAPI could help when splitting up the mailbox handling where in the interrupt just the mailboxes are pushed into an internal queue of the driver and then fed into the network stack in the NAPI.
> For the first latency, only FTRACE and fixing the other kernel parts
> helps. Unfortunately, some kernel parts are so complex that it is
> over-the-head of many people (ok, I confess: over my head).
Same for me. I'm not the one to fix the basics of the TZIC code.
> I killed the second latency with my kfifo patch that I posted the
> other day. Getting rid of NAPI completely would also be a method, I'm
> not sure NAPI wins us anything, compared to Ethernet CAN is slow.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-09  9:48   ` Torsten Lang
@ 2015-07-09 10:05     ` Holger Schurig
  2015-07-09 15:36     ` Tom Evans
  1 sibling, 0 replies; 24+ messages in thread
From: Holger Schurig @ 2015-07-09 10:05 UTC (permalink / raw)
  To: Torsten Lang; +Cc: Tom Evans, linux-can, Marc Kleine-Budde

2015-07-09 11:48 GMT+02:00 Torsten Lang <torsten.lang@uweschneider.de>:

> From what I found during my search in the net the interrupt handling
> implementation in Linux for the Freescale range of SoCs seems to
> suck because it does not configure any interrupt priorization and
> the interrupt handler "prefers" to handle interrupts just by the bit
> order in the interrupt controller could lead to very high latencies
> in case of FlexCAN interrupts.

The thing about the prioritization is true .. but it's not the reason.
Because even when you give the IRQs for the FlexCAN the highest
priority (I have a patch for this), then this will only trigger if two
interrupts arrive at the same time. This is almost never.

The main reason is that on ARM, IRQs are not preemptible. When an ISR
is running, no other ISR (higher priority or not) can interrupt the
first one. On the other side, this pretty much guarantees that the
stack won't overflow :-)

I wonder how SJA CAN drivers do it, their RxFIFO equivalent is much smaller.


PS: Patch for IRQ prioritization in the GIC:
https://drive.google.com/file/d/0B-OEhM3f0eCtZjhueTdKdXYzMjg/view?usp=sharing
 Deliberately no Signed-off-there, because a) it doesn't help and b)
it still contains debug output :-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-09  9:48   ` Torsten Lang
  2015-07-09 10:05     ` Holger Schurig
@ 2015-07-09 15:36     ` Tom Evans
  2015-07-10  9:17       ` Torsten Lang
  1 sibling, 1 reply; 24+ messages in thread
From: Tom Evans @ 2015-07-09 15:36 UTC (permalink / raw)
  To: Torsten Lang, linux-can; +Cc: Marc Kleine-Budde

On 9/07/2015 7:48 PM, Torsten Lang wrote:
> Am 09.07.2015 um 09:42 schrieb Tom Evans:
>> On 09/07/15 00:38, Torsten Lang wrote:
>>> It is based on the rework done by David Jander which disables
>>> the only six messages deep hardware FIFO of the FlexCAN core
>>> and instead uses all available mailboxes for reception.

That's such a big change to the driver (and given Holger's comments) I 
would suggest submitting it as a separate driver - "flexcan2.c", 
"flexcan-ng.c" or some such. Leave the old one alone, or fix it with 
Holger's unload-during-interrupts version or equivalent.

>> I'd be interested in reasons why the above isn't a
 >> good solution to this problem.
>
> I did tests with reading out the mailboxes directly in the interrupt
 > handler but still had problems.

Time to run FTRACE and see what's broken or set wrong. Holger seems to 
have been hit with a broken SD driver. I found our kernel supplier had 
left all the semaphore/mutex/slub debugging on and that was making the 
kernel about 5 times slower than it should have been. Easily fixed once 
found.

 > From what I found during my search in the net the interrupt
 > handling implementation in Linux for the Freescale range of
 > SoCs seems to suck because it does not configure any interrupt
 > priorization and the interrupt handler "prefers" to handle
 > interrupts just by the bit order in the interrupt controller
 > could lead to very high latencies in case of FlexCAN interrupts.

Yes, I fixed that too. A simple mod that created a TZIC version of 
"avic_irq_set_priority()", and calls to that from the platform setup. 
But I really miss having SIX different levels (that can happily 
interrupt each other) in M68k/ColdFire.

> On which i.MX did you test your change with success?

i.MX53.

Holger has said:
 > The thing about the prioritization is true .. but it's not the
 > reason. Because even when you give the IRQs for the FlexCAN
 > the highest priority (I have a patch for this), then this
 > will only trigger if two interrupts arrive at the same
 > time. This is almost never.

I don't think so. It only requires one interrupt to arrive when the 
previous one is still running. If the previous one is the FEC (Ethernet) 
AND I'm flood-pinging the thing hard AND the 3.4 FEC driver doesn't use 
NAPI then the CPU is spending a huge amount of time in the FEC ISR, 
followed by another run in the FEC ISR, and again; not letting CAN run. 
Elevating CAN't priority does help in this case.

When you're playing whack-a-mole you have to whack all the moles...

Tom


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-09 15:36     ` Tom Evans
@ 2015-07-10  9:17       ` Torsten Lang
  2015-07-11  6:42         ` Tom Evans
  0 siblings, 1 reply; 24+ messages in thread
From: Torsten Lang @ 2015-07-10  9:17 UTC (permalink / raw)
  To: Tom Evans, linux-can; +Cc: Marc Kleine-Budde

Hi Tom,
it would be nice if you could mail me the patches you applied to the flexcan driver and kernel interrupt handling by mail so that I could try out if this solves the problems I still had with the "NO-NAPI" version I made of the 3.16 flexcan driver. I already integrated several patches I found in the net or received by mail.

Torsten

Am 09.07.2015 um 09:42 schrieb Tom Evans:
> On 09/07/15 00:38, Torsten Lang wrote:
>> It is based on the rework done by David Jander which disables
> > the only six messages deep hardware FIFO of the FlexCAN core
> > and instead uses all available mailboxes for reception.
>
> > #define FLEXCAN_MB_QUEUE_SIZE        62
>
> The FlexCAN Driver is not specific to the i.MX. It is used in other FreeScale parts. The early parts (ColdFire) have 16 buffers, didn't have "Message Queueing" or the FIFO, so aren't supported by Linux at all. The one in the
> MCF5441x had the FIFO, Message Queueing, but only 16 Messages. I don't know which ones are in the PPC chips. You may need to make the queue size settable in the Device Tree.
>
> Two years back I had FlexCAN overrunning. I found the problem to be that the driver reads the messages during NAPI, while the matching Ethernet driver read them during interrupts, and there was unnecessary kernel debugging on.
>
> I rewrote it to receive all messages during interrupts and haven't had any problems since. Is it "normal" to have interrupts locked out for more than 300us (six 50us CAN messages at 1MHz)? Shouldn't that be something that should be fixed? Or is having interrupts locked out for 3200us (64 message buffers) the new "normal"? 
Yes, I've also done such a rewrite for the kernel 3.16 supported by our board manufacturer. But still had problems with overruns. So I will check for any leftover debug settings etc. when I have some time again.

I've done some more tests with the patch I mailed under very heavy load (with small DLC) - it seems that we really have massive delays until NAPI polling starts, so even with my patch I can provoke message loss. What seems to work flawlessly is the sorting of the messages, so I think for the final solution I need to unload the mailboxes directly in the interrupt into an internal FIFO where the messages are sorted in by their timestamps.

But if leaving the RX interrupts active synchronization must be done very carefully as to my understanding NAPI was not meant to be done with active interrupts (that can trigger the NAPI polling again). You can easily create race conditions when the poll function is interrupted with a quota below the limit and between the unloading/polling code and the decision to deactivate the polling - the result would be that the interrupt handler would unload the mailboxes and tries to trigger NAPI (wich would be a no-op as it is still active) while after return the NAPI poll would turn off the polling without reading the messages left in the internal FIFO. This is where IMHO the code I've seen that unloads the hardware mailboxes into a buffer from the interrupt and clearing the buffer during NAPI pol
 l is wrong.

Am 09.07.2015 um 17:36 schrieb Tom Evans:
> On 9/07/2015 7:48 PM, Torsten Lang wrote:
>> Am 09.07.2015 um 09:42 schrieb Tom Evans:
>>> On 09/07/15 00:38, Torsten Lang wrote:
>>>> It is based on the rework done by David Jander which disables
>>>> the only six messages deep hardware FIFO of the FlexCAN core
>>>> and instead uses all available mailboxes for reception.
>
> That's such a big change to the driver (and given Holger's comments) I would suggest submitting it as a separate driver - "flexcan2.c", "flexcan-ng.c" or some such. Leave the old one alone, or fix it with Holger's unload-during-interrupts version or equivalent.
Either this or the driver would have to support FIFO mode (with the risk of data loss) and mailbox based queue depending on the available features.
>
>>> I'd be interested in reasons why the above isn't a
> >> good solution to this problem.
>>
>> I did tests with reading out the mailboxes directly in the interrupt
> > handler but still had problems.
>
> Time to run FTRACE and see what's broken or set wrong. Holger seems to have been hit with a broken SD driver. I found our kernel supplier had left all the semaphore/mutex/slub debugging on and that was making the kernel about 5 times slower than it should have been. Easily fixed once found.
>
> > From what I found during my search in the net the interrupt
> > handling implementation in Linux for the Freescale range of
> > SoCs seems to suck because it does not configure any interrupt
> > priorization and the interrupt handler "prefers" to handle
> > interrupts just by the bit order in the interrupt controller
> > could lead to very high latencies in case of FlexCAN interrupts.
>
> Yes, I fixed that too. A simple mod that created a TZIC version of "avic_irq_set_priority()", and calls to that from the platform setup. But I really miss having SIX different levels (that can happily interrupt each other) in M68k/ColdFire.
>
>> On which i.MX did you test your change with success?
>
> i.MX53.
>
> Holger has said:
> > The thing about the prioritization is true .. but it's not the
> > reason. Because even when you give the IRQs for the FlexCAN
> > the highest priority (I have a patch for this), then this
> > will only trigger if two interrupts arrive at the same
> > time. This is almost never.
>
> I don't think so. It only requires one interrupt to arrive when the previous one is still running. If the previous one is the FEC (Ethernet) AND I'm flood-pinging the thing hard AND the 3.4 FEC driver doesn't use NAPI then the CPU is spending a huge amount of time in the FEC ISR, followed by another run in the FEC ISR, and again; not letting CAN run. Elevating CAN't priority does help in this case.
>
> When you're playing whack-a-mole you have to whack all the moles...
>
> Tom
>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-10  9:17       ` Torsten Lang
@ 2015-07-11  6:42         ` Tom Evans
  0 siblings, 0 replies; 24+ messages in thread
From: Tom Evans @ 2015-07-11  6:42 UTC (permalink / raw)
  To: Torsten Lang, linux-can; +Cc: Marc Kleine-Budde

On 10/07/2015 7:17 PM, Torsten Lang wrote:
> Hi Tom,
> it would be nice if you could mail me the patches you applied
 > to the flexcan driver and kernel interrupt handling by mail

Done.

> so that I could try out if this solves the problems I still
 > had with the "NO-NAPI" version I made of the 3.16 flexcan driver.

Can you give us a summary of your problems?

Tom


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-09  7:59       ` Holger Schurig
  2015-07-09 10:03         ` Torsten Lang
@ 2015-07-22  8:00         ` Torsten Lang
  2015-07-22  8:57           ` Marc Kleine-Budde
  1 sibling, 1 reply; 24+ messages in thread
From: Torsten Lang @ 2015-07-22  8:00 UTC (permalink / raw)
  Cc: Holger Schurig, Alexander Stein, linux-can, Marc Kleine-Budde

Am 09.07.2015 um 09:59 schrieb Holger Schurig:
>> The early parts (ColdFire) have 16 buffers, didn't have "Message Queueing" or the FIFO, so aren't supported by Linux at all.
> Fine, so we can ignore them :-)
That would be one more reason to at least have an option for working without the hardware FIFO.
>
>> Is it "normal" to have interrupts locked out for more than 300us (six 50us CAN messages at 1MHz)?
> Unfortunately yes.  My $CUSTOMER had overruns with 500 kB/s, 80% bus
> load, and CAN messages with 3 bytes of data. My guess this was mostly
> due to the sucky SDHCI (eMMC) driver code in Linux. I fixed that, but
> occassionally ftrace still shows large times with irqsoff, I need to
> dig into them as well. Still /me thinks that an RxFIFO of just 6 CAN
> messages isn't swell for an OS that is known to not guarantee response
> times, like Linux. Especially not for CAN, people use it after all
> because of it's reliability guarantees.
I've done some tests with FTRACE, same result. The trace results show that the SDHCI driver executes long sequences of code under spinlock_irqsave. As far as I can see from the trace, sdhci_do_set_ios first locks the interrupts, then activates the clock, does the operation and deactivates the clock again. The actual busy looping appears in the IMX SD/MMC driver which is waiting after every clock change. Turning off CONFIG_MMC_CLKGATE doesn't help here. Even when these busy waits would be avoided there still would be ~100us of operations under spinlock_irqsave.
>
> BTW, with the current in-tree FlexCAN drivers we have two things were
> IRQ or scheduling latency can cause lost frames:
>
> * the time from the hardware IRQ until flexcan_isr() is actually
> called, e.g. because for spin_lock_irqsave
> * the time when the ISR does a napi_schedule() until NAPI get's
> scheduled and calls flexcan_poll()
>
> For the first latency, only FTRACE and fixing the other kernel parts
> helps. Unfortunately, some kernel parts are so complex that it is
> over-the-head of many people (ok, I confess: over my head).
>
> I killed the second latency with my kfifo patch that I posted the
> other day. Getting rid of NAPI completely would also be a method, I'm
> not sure NAPI wins us anything, compared to Ethernet CAN is slow.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-22  8:00         ` Torsten Lang
@ 2015-07-22  8:57           ` Marc Kleine-Budde
  2015-07-24  3:53             ` Tom Evans
  0 siblings, 1 reply; 24+ messages in thread
From: Marc Kleine-Budde @ 2015-07-22  8:57 UTC (permalink / raw)
  To: Torsten Lang; +Cc: Holger Schurig, Alexander Stein, linux-can

[-- Attachment #1: Type: text/plain, Size: 2035 bytes --]

On 07/22/2015 10:00 AM, Torsten Lang wrote:
> Am 09.07.2015 um 09:59 schrieb Holger Schurig:
>>> The early parts (ColdFire) have 16 buffers, didn't have "Message Queueing" or the FIFO, so aren't supported by Linux at all.
>> Fine, so we can ignore them :-)
> That would be one more reason to at least have an option for working without the hardware FIFO.
>>
>>> Is it "normal" to have interrupts locked out for more than 300us (six 50us CAN messages at 1MHz)?
>> Unfortunately yes.  My $CUSTOMER had overruns with 500 kB/s, 80% bus
>> load, and CAN messages with 3 bytes of data. My guess this was mostly
>> due to the sucky SDHCI (eMMC) driver code in Linux. I fixed that, but
>> occassionally ftrace still shows large times with irqsoff, I need to
>> dig into them as well. Still /me thinks that an RxFIFO of just 6 CAN
>> messages isn't swell for an OS that is known to not guarantee response
>> times, like Linux. Especially not for CAN, people use it after all
>> because of it's reliability guarantees.
> I've done some tests with FTRACE, same result. The trace results show that the SDHCI driver executes long sequences of code under spinlock_irqsave. As far as I can see from the trace, sdhci_do_set_ios first locks the interrupts, then activates the clock, does the operation and deactivates the clock again. The actual busy looping appears in the IMX SD/MMC driver which is waiting after every clock change. Turning off CONFIG_MMC_CLKGATE doesn't help here. Even when these busy waits would be avoided there still would be ~100us of operations under spinlock_irqsave.

IIRC you can insert a SD/MMC/eMMC and mark it as non removable in the DT
to work around the problem - connecting the Card Detection pin might
also work.

Marc
-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-22  8:57           ` Marc Kleine-Budde
@ 2015-07-24  3:53             ` Tom Evans
  2015-07-24  8:45               ` Torsten Lang
  0 siblings, 1 reply; 24+ messages in thread
From: Tom Evans @ 2015-07-24  3:53 UTC (permalink / raw)
  To: Marc Kleine-Budde, Torsten Lang
  Cc: Holger Schurig, Alexander Stein, linux-can

On 22/07/15 18:57, Marc Kleine-Budde wrote:
> On 07/22/2015 10:00 AM, Torsten Lang wrote:
>> Am 09.07.2015 um 09:59 schrieb Holger Schurig:
>>>> Is it "normal" to have interrupts locked out for more than 300us
 >>>> (six 50us CAN messages at 1MHz)?
>>> Unfortunately yes.  My $CUSTOMER had overruns with 500 kB/s, 80% bus
>>> load, and CAN messages with 3 bytes of data. My guess this was mostly
>>> due to the sucky SDHCI (eMMC) driver code in Linux.

There's been some off-list email on this. It looks like the MMC driver isn't 
well written.

I think I've found the "Official Freescale" workaround for the CAN problems 
under Linux:

http://cache.freescale.com/files/32bit/doc/app_note/AN4815.pdf

The above App Note documents migration to the i.MX 6SoloX chip. It contains 
one each of Cortex-A9 and Cortex-M4 CPU cores.

The purpose of the M4 core is listed to be:

               Secondary processor for fast boot, low power
               processing, and robust CAN message handling.

So that's your only fix. Use a totally separate CPU for CAN.

Or don't design hardware that uses chips that the software doesn't support 
(well enough to work with your other hardware). So don't design with eMMC on 
this platform. Use something simpler and faster, like NAND or NOR Flash.

On a slightly better note, who's responsible for the MMC driver?

Which one are you using? This one (from Freescale 2.6.35):

/*
  *  linux/drivers/mmc/host/mxc_mmc.c - Freescale MXC/i.MX MMC driver
  *
  *  based on imxmmc.c
  *  Copyright (C) 2004 Sascha Hauer, Pengutronix <sascha@saschahauer.de>
  *
  *  derived from pxamci.c by Russell King

This one from the Mainline:

http://lxr.free-electrons.com/source/drivers/mmc/host/mxcmmc.c

   9  *  Copyright (C) 2008 Sascha Hauer, Pengutronix <s.hauer@pengutronix.de>
  10  *  Copyright (C) 2006 Pavel Pisa, PiKRON <ppisa@pikron.com>
  11  *
  12  *  derived from pxamci.c by Russell King

Is there a mailing list where these apparent MMC driver problems can be 
reported? Probably "gmane.linux.kernel.mmc".

It might be worth posting some FTRACE results to there.

> IIRC you can insert a SD/MMC/eMMC and mark it as non removable
 > in the DT to work around the problem - connecting the Card
 > Detection pin might also work.

Has anyone tested that to see if it helps or helps enough?

Tom


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: can: flexcan: implement workaround for FIFO overruns (based on code by David Jander)
  2015-07-24  3:53             ` Tom Evans
@ 2015-07-24  8:45               ` Torsten Lang
  0 siblings, 0 replies; 24+ messages in thread
From: Torsten Lang @ 2015-07-24  8:45 UTC (permalink / raw)
  To: tom_usenet, Marc Kleine-Budde; +Cc: Holger Schurig, Alexander Stein, linux-can

Am 24.07.2015 um 05:53 schrieb Tom Evans:
> On 22/07/15 18:57, Marc Kleine-Budde wrote:
>> On 07/22/2015 10:00 AM, Torsten Lang wrote:
>>> Am 09.07.2015 um 09:59 schrieb Holger Schurig:
>>>>> Is it "normal" to have interrupts locked out for more than 300us
> >>>> (six 50us CAN messages at 1MHz)?
>>>> Unfortunately yes.  My $CUSTOMER had overruns with 500 kB/s, 80% bus
>>>> load, and CAN messages with 3 bytes of data. My guess this was mostly
>>>> due to the sucky SDHCI (eMMC) driver code in Linux.
>
> There's been some off-list email on this. It looks like the MMC driver isn't well written.
>
> I think I've found the "Official Freescale" workaround for the CAN problems under Linux:
>
> http://cache.freescale.com/files/32bit/doc/app_note/AN4815.pdf
>
> The above App Note documents migration to the i.MX 6SoloX chip. It contains one each of Cortex-A9 and Cortex-M4 CPU cores.
>
> The purpose of the M4 core is listed to be:
>
>               Secondary processor for fast boot, low power
>               processing, and robust CAN message handling.
>
> So that's your only fix. Use a totally separate CPU for CAN.
>
> Or don't design hardware that uses chips that the software doesn't support (well enough to work with your other hardware). So don't design with eMMC on this platform. Use something simpler and faster, like NAND or NOR Flash.
>
> On a slightly better note, who's responsible for the MMC driver?
>
> Which one are you using? This one (from Freescale 2.6.35):
>
No, the i.MX6 boards use sdhci-esdhc-imx.c which introduces these insane mdelay(1)s when clock switching. To reduce the irqsoff latencies I ported a SDHC low latency patch originally made for the Raspberry Pi kernel which also seems to be used in the OpenWRT project - see https://gist.github.com/ddv2005/3454406 or https://dev.openwrt.org/browser/trunk/target/linux/brcm2708/patches-3.10/0026-Add-low-latency-mode-to-sdcard-driver.-Disable-with-.patch?rev=39770. This reduces the latencies severely (which actually isn't surprising when looking into sdhci.c and sdhci-esdhc-imx.c). But doing the tuning without hard lock possibly can lead to problems with certain cards as I found in other places, so this low latency change could possibly do harm.

I further developed a patch for the GIC driver to allow setting interrupt priorities of the global interrupts (IDs >=32) via the device tree. My device tree then looks like this:
        intc: interrupt-controller@00a01000 {
                compatible = "arm,cortex-a9-gic";
                #interrupt-cells = <3>;
                interrupt-controller;
                reg = <0x00a01000 0x1000>,
                      <0x00a00100 0x100>;
                interrupt-priorities = /bits/ 16 <    0xb0>, // default priority
                                       /bits/ 16 <110 0x90>, // FLEXCAN1
                                       /bits/ 16 <111 0x90>; // FLEXCAN2
        };
Having the priorities at a central place allows a change with minumum impact on the rest of the system - the patch is local to the irqchip driver, no additional interfaces needed, no changes to the (more or less generic) drivers needed.

With both changes and Tom's driver I already was able to receive without any losses under full CPU load (ssvb_cpuburn running, my application running), but as soon as I access any mass storage (USB, SDHC) I again crack the barrier of the 300us delay and observe occasional losses.

This is why I switched back to my own software queue driver. I did rewrite it to also unload mailboxes in interrupt like Tom's driver does which needs a more complex queue implementation where I split lists into a in, out and a staging area. So the in (irq) and out (polling) path can work completely lockless and I only need locks when moving lists to or from the staging area.

With this combination of patches I was able to transfer data (scp over an isotp tunnel with blocksize 0 resulting in long sequences of back-to-back messages) under full 100% CPU load with parallel mass storage transfer (multiple transfers on SD and USB devices).
> /*
>  *  linux/drivers/mmc/host/mxc_mmc.c - Freescale MXC/i.MX MMC driver
>  *
>  *  based on imxmmc.c
>  *  Copyright (C) 2004 Sascha Hauer, Pengutronix <sascha@saschahauer.de>
>  *
>  *  derived from pxamci.c by Russell King
>
> This one from the Mainline:
>
> http://lxr.free-electrons.com/source/drivers/mmc/host/mxcmmc.c
>
>   9  *  Copyright (C) 2008 Sascha Hauer, Pengutronix <s.hauer@pengutronix.de>
>  10  *  Copyright (C) 2006 Pavel Pisa, PiKRON <ppisa@pikron.com>
>  11  *
>  12  *  derived from pxamci.c by Russell King
>
> Is there a mailing list where these apparent MMC driver problems can be reported? Probably "gmane.linux.kernel.mmc".
>
> It might be worth posting some FTRACE results to there.
>
Just in case anyone is interested (I just shortened the lengthy __timer_delay output):

# tracer: irqsoff
#
# irqsoff latency trace v1.1.5 on 3.16.0
# --------------------------------------------------------------------
# latency: 2119 us, #858/858, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
#    -----------------
#    | task: kworker/u4:2-93 (uid:0 nice:0 policy:0 rt_prio:0)
#    -----------------
#  => started at: _raw_spin_lock_irqsave
#  => ended at:   _raw_spin_unlock_irqrestore
#
#
#                  _------=> CPU#           
#                 / _-----=> irqs-off       
#                | / _----=> need-resched   
#                || / _---=> hardirq/softirq
#                ||| / _--=> preempt-depth  
#                |||| /     delay            
#  cmd     pid   ||||| time  |   caller     
#     \   /      |||||  \    |   /          
kworker/-93      0d...    1us+: _raw_spin_lock_irqsave
kworker/-93      0d..1    4us+: preempt_count_add <-_raw_spin_lock_irqsave
kworker/-93      0d..2    8us+: esdhc_readl_le <-esdhc_pltfm_set_clock
kworker/-93      0d..2   11us+: esdhc_writel_le <-esdhc_pltfm_set_clock
kworker/-93      0d..2   13us+: l2c210_sync <-esdhc_writel_le
kworker/-93      0d..2   18us+: esdhc_readl_le <-esdhc_pltfm_set_clock
kworker/-93      0d..2   21us+: esdhc_writel_le <-esdhc_pltfm_set_clock
kworker/-93      0d..2   26us+: l2c210_sync <-esdhc_writel_le
kworker/-93      0d..2   29us+: l2c210_sync <-esdhc_pltfm_set_clock
kworker/-93      0d..2   33us+: __timer_const_udelay <-esdhc_pltfm_set_clock
kworker/-93      0d..2   36us+: __timer_delay <-__timer_const_udelay
kworker/-93      0d..2   38us+: imx_read_current_timer <-__timer_delay
kworker/-93      0d..2   41us+: imx_read_current_timer <-__timer_delay
...
kworker/-93      0d..2 1038us+: imx_read_current_timer <-__timer_delay
kworker/-93      0d..2 1040us+: imx_read_current_timer <-__timer_delay
kworker/-93      0d..2 1044us+: esdhc_writeb_le <-sdhci_do_set_ios
kworker/-93      0d..2 1046us+: esdhc_writeb_le <-sdhci_do_set_ios
kworker/-93      0d..2 1049us+: esdhc_pltfm_set_bus_width <-sdhci_do_set_ios
kworker/-93      0d..2 1051us+: l2c210_sync <-esdhc_pltfm_set_bus_width
kworker/-93      0d..2 1054us+: esdhc_writeb_le <-sdhci_do_set_ios
kworker/-93      0d..2 1057us+: l2c210_sync <-esdhc_writeb_le
kworker/-93      0d..2 1061us+: esdhc_readw_le <-sdhci_do_set_ios
kworker/-93      0d..2 1064us+: esdhc_writew_le <-sdhci_do_set_ios
kworker/-93      0d..2 1067us+: l2c210_sync <-esdhc_writew_le
kworker/-93      0d..2 1070us+: l2c210_sync <-esdhc_writew_le
kworker/-93      0d..2 1072us+: esdhc_readw_le <-sdhci_do_set_ios
kworker/-93      0d..2 1075us+: esdhc_writew_le <-sdhci_do_set_ios
kworker/-93      0d..2 1077us+: l2c210_sync <-esdhc_writew_le
kworker/-93      0d..2 1080us+: esdhc_set_uhs_signaling <-sdhci_do_set_ios
kworker/-93      0d..2 1082us+: l2c210_sync <-esdhc_set_uhs_signaling
kworker/-93      0d..2 1085us+: pinctrl_select_state <-esdhc_set_uhs_signaling
kworker/-93      0d..2 1088us+: esdhc_readl_le <-esdhc_pltfm_set_clock
kworker/-93      0d..2 1090us+: esdhc_writel_le <-esdhc_pltfm_set_clock
kworker/-93      0d..2 1093us+: l2c210_sync <-esdhc_writel_le
kworker/-93      0d..2 1095us+: esdhc_readl_le <-esdhc_pltfm_set_clock
kworker/-93      0d..2 1098us+: esdhc_writel_le <-esdhc_pltfm_set_clock
kworker/-93      0d..2 1100us+: l2c210_sync <-esdhc_writel_le
kworker/-93      0d..2 1103us+: l2c210_sync <-esdhc_pltfm_set_clock
kworker/-93      0d..2 1105us+: __timer_const_udelay <-esdhc_pltfm_set_clock
kworker/-93      0d..2 1108us+: __timer_delay <-__timer_const_udelay
kworker/-93      0d..2 1110us+: imx_read_current_timer <-__timer_delay
kworker/-93      0d..2 1112us+: imx_read_current_timer <-__timer_delay
...
kworker/-93      0d..2 2108us+: imx_read_current_timer <-__timer_delay
kworker/-93      0d..2 2111us+: imx_read_current_timer <-__timer_delay
kworker/-93      0d..2 2113us+: _raw_spin_unlock_irqrestore <-sdhci_do_set_ios
kworker/-93      0d..1 2117us+: _raw_spin_unlock_irqrestore
kworker/-93      0d..1 2122us!: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
kworker/-93      0d..1 2242us : <stack trace>
 => sdhci_do_set_ios
 => sdhci_runtime_resume_host
 => sdhci_esdhc_runtime_resume
 => pm_generic_runtime_resume
 => __rpm_callback
 => rpm_callback
 => rpm_resume
 => __pm_runtime_resume
 => sdhci_runtime_pm_get.clone.18
 => sdhci_request
 => mmc_start_request
 => mmc_wait_for_req
 => mmc_wait_for_cmd
 => mmc_send_status
 => mmc_alive
 => _mmc_detect_card_removed
 => mmc_detect
 => mmc_rescan
 => process_one_work
 => worker_thread
 => kthread
 => ret_from_fork

>> IIRC you can insert a SD/MMC/eMMC and mark it as non removable
> > in the DT to work around the problem - connecting the Card
> > Detection pin might also work.
>
> Has anyone tested that to see if it helps or helps enough?
>
> Tom
>


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-07-24  8:46 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-08 14:38 can: flexcan: implement workaround for FIFO overruns (based on code by David Jander) Torsten Lang
2015-07-09  6:33 ` Holger Schurig
2015-07-09  6:38   ` Holger Schurig
2015-07-09  9:26   ` Torsten Lang
2015-07-09  9:32     ` Holger Schurig
2015-07-09  9:36       ` Marc Kleine-Budde
2015-07-09  9:42         ` Holger Schurig
2015-07-09  6:58 ` Alexander Stein
2015-07-09  7:27   ` Holger Schurig
2015-07-09  7:48     ` Alexander Stein
2015-07-09  7:59       ` Holger Schurig
2015-07-09 10:03         ` Torsten Lang
2015-07-22  8:00         ` Torsten Lang
2015-07-22  8:57           ` Marc Kleine-Budde
2015-07-24  3:53             ` Tom Evans
2015-07-24  8:45               ` Torsten Lang
2015-07-09  7:42 ` Tom Evans
2015-07-09  9:48   ` Torsten Lang
2015-07-09 10:05     ` Holger Schurig
2015-07-09 15:36     ` Tom Evans
2015-07-10  9:17       ` Torsten Lang
2015-07-11  6:42         ` Tom Evans
2015-07-09  8:06 ` Holger Schurig
2015-07-09  8:43   ` Oliver Hartkopp

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.