From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tom Evans Subject: Re: CAN question: how to trace frame errors? Date: Fri, 12 Jun 2015 22:33:17 +1000 Message-ID: <557AD18D.8010807@optusnet.com.au> References: <557AA7BE.6040004@pengutronix.de> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Return-path: Received: from mail110.syd.optusnet.com.au ([211.29.132.97]:50469 "EHLO mail110.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754999AbbFLMdY (ORCPT ); Fri, 12 Jun 2015 08:33:24 -0400 In-Reply-To: Sender: linux-can-owner@vger.kernel.org List-ID: To: Holger Schurig , Marc Kleine-Budde Cc: linux-can@vger.kernel.org On 12/06/2015 8:01 PM, Holger Schurig wrote: >> These are probably RX-FIFO overrun errors: > > You're right. > > 13:03:16 kernel: ##HS flexcan_irq FIFO overrun > > That really puzzles me. The errors occured originally when someuse > used a Kvazer to generate 80% load at 500 kB/s. In my tests, I used 1 > MB/s and cangen (no Kvazer here). The i.MX6 was otherwise idle: Idle has nothing to do with it. The problem is the latency. Let's do the maths. The worst case CAN message arrival rate is with standard ID CAN messages with zero bytes of data. These are about 50 bits long. So at 500kHz they arrive at 10kHz or 100us interval. FlexCAN has a six message FIFO. So it can receive six without overflowing, but the seventh will blow it. So it has to (start to) be unloaded within 700us. Since the CPU is running at 800MHz, that's only 560,000 instructions to respond to an interrupt. And it isn't managing it. That's what "Not Real Time" means - half a million instructions isn't enough time. Except it isn't "respond to an interrupt" as the FlexCAN driver doesn't receive the 8-byte messages during the interrupt. It schedules a NAPI service routine to read the data, and they can easily be delayed that long waiting for a slot. Do you have a Frame Buffer? Is it write-through or cache-flushed? I've read that a flush of a frame-buffer-sized chunk of memory can take the L2 cache a very long time to complete. Think MILLIseconds. That locks up ALL CORES unless the other ones are lucky enough to stay inside both their L1 caches. We couldn't handle CAN losing data (and run at 1 MBit) so I rewrote the 3.4 FlexCAN driver to unload the FIFO into a ring buffer during interrupts and to have the NAPI routine unload that. No problems since then. > kernel is 3.18.14 Then you SHOULD be better off than we were, running 3.4. In that version the FlexCAN controller uses NAPI (and always has) while the Ethernet controller didn't, but would happily try and unload 100 Ethernet packets all the way to the network layer in the interrupt routine, blocking the FlexCAN interrupts and NAPI run. So check your Ethernet driver and see if it uses NAPI and if there's any work-limiting in the interrupt routine. Flood ping (ping -f -l 20) one of them and see if that makes the overruns worse during your CAN test. Run cangen with an 8-byte CAN packet and see if the lower arrival rate fixes it. Change the message length and see if you can use that to "measure the latency". Then see what other interrupts you're getting. Try: while true; do cat /proc/interrupts; sleep 1; done See which interrupt counts (apart from FlexCAN) are increasing quickly. Then you'll want to FTRACE (CONFIG_PERF_EVENTS) the kernel. The Kernel Tracing is very good and very powerful. Learning how to run this is a skill very worth having. You want to find where it is spending its time between the interrupt and the NAPI run. In our case we found our kernel was spending the majority of its time in mutex, spinlock and slub debugging. It got six times faster when I got rid of those. Tom