Re: [Rfi] Cyclone V CAN errors when application pinned to CPU1

All of lore.kernel.org
 help / color / mirror / Atom feed

* Re: [Rfi] Cyclone V CAN errors when application pinned to CPU1
       [not found] <562155B7.7020504@vsis.cz>
@ 2015-10-20  7:18 ` Robert Schwebel
  2015-10-20  7:37   ` Marc Kleine-Budde
  0 siblings, 1 reply; 7+ messages in thread
From: Robert Schwebel @ 2015-10-20  7:18 UTC (permalink / raw)
  To: Vlastimil Setka; +Cc: rfi, Marc Kleine-Budde

Hi,


Adding Marc to Cc:.

On Fri, Oct 16, 2015 at 09:53:27PM +0200, Vlastimil Setka wrote:
> We discovered very weird behaviour of CAN controller in Cyclone V SoC
> with Linux socketcan stack. The problem was first seen on 3.10-ltsi a
> few months ago, and now again on 3.18 from altera github (with rt
> preempt patch applied).

Could you try if the issue happens with a recent mainline kernel as
well? RT is available for 4.1, so that would be a good choice.
 
> We have a linux application which sends data periodically (1 to 20 ms
> period) out over the can0 socketcan interface. Sometimes the first
> data byte in the CAN frame is zero on the wire, but non-zero in the
> data sent! When running with this period, this happens at random
> times, but during a few minutes it can be allways replicated.
> 
> The problem only appears when the application is pinned to CPU1 by
> linux process afinity mechanism. When pinned to default CPU0, there is
> no problem.
> 
> Anyone seen this issue? Any idea how to debug it and what can be a
> reason? What version (git repo / tag) of Linux should I use?
> 
> We plan to do some in-deep evaluation and testing, but I want to share
> the experience now.

Is your test program available somewhere?

Regards,
Robert Schwebel
-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Rfi] Cyclone V CAN errors when application pinned to CPU1
  2015-10-20  7:18 ` [Rfi] Cyclone V CAN errors when application pinned to CPU1 Robert Schwebel
@ 2015-10-20  7:37   ` Marc Kleine-Budde
  2016-02-06 17:59     ` Vlastimil Setka
  0 siblings, 1 reply; 7+ messages in thread
From: Marc Kleine-Budde @ 2015-10-20  7:37 UTC (permalink / raw)
  To: Robert Schwebel, Vlastimil Setka; +Cc: rfi, linux-can

[-- Attachment #1: Type: text/plain, Size: 2018 bytes --]

CC += linux-can

On 10/20/2015 09:18 AM, Robert Schwebel wrote:

> Adding Marc to Cc:.

tnx Robert.

> On Fri, Oct 16, 2015 at 09:53:27PM +0200, Vlastimil Setka wrote:
>> We discovered very weird behaviour of CAN controller in Cyclone V SoC
>> with Linux socketcan stack. The problem was first seen on 3.10-ltsi a
>> few months ago, and now again on 3.18 from altera github (with rt
>> preempt patch applied).
> 
> Could you try if the issue happens with a recent mainline kernel as
> well? RT is available for 4.1, so that would be a good choice.

Which CAN driver are you using?

>> We have a linux application which sends data periodically (1 to 20 ms
>> period) out over the can0 socketcan interface. Sometimes the first
>> data byte in the CAN frame is zero on the wire, but non-zero in the
>> data sent! When running with this period, this happens at random
>> times, but during a few minutes it can be allways replicated.

How do you measure the CAN frames on the wire?

>> The problem only appears when the application is pinned to CPU1 by
>> linux process afinity mechanism. When pinned to default CPU0, there is
>> no problem.
>>
>> Anyone seen this issue? Any idea how to debug it and what can be a
>> reason? What version (git repo / tag) of Linux should I use?
>>
>> We plan to do some in-deep evaluation and testing, but I want to share
>> the experience now.

The TX functions is usually pretty straight forward. Copy all data bytes
into the hardware, write ID and DLC, then hit the send bit (or whatever
triggers the hardware to send the frame). Maybe there's some barrier
missing in this sequence? Where can I find your driver?

> Is your test program available somewhere?

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Rfi] Cyclone V CAN errors when application pinned to CPU1
  2015-10-20  7:37   ` Marc Kleine-Budde
@ 2016-02-06 17:59     ` Vlastimil Setka
  2016-02-06 22:34       ` Tom Evans
  0 siblings, 1 reply; 7+ messages in thread
From: Vlastimil Setka @ 2016-02-06 17:59 UTC (permalink / raw)
  To: Marc Kleine-Budde, Robert Schwebel; +Cc: rfi, linux-can

Hi, I was just able to get back to this issue after some time.

20.10.2015 9:37 Marc Kleine-Budde:
>> On Fri, Oct 16, 2015 at 09:53:27PM +0200, Vlastimil Setka wrote:
>>> We discovered very weird behaviour of CAN controller in Cyclone V SoC
>>> with Linux socketcan stack. The problem was first seen on 3.10-ltsi a
>>> few months ago, and now again on 3.18 from altera github (with rt
>>> preempt patch applied).
>> Could you try if the issue happens with a recent mainline kernel as
>> well? RT is available for 4.1, so that would be a good choice.
Now the tests were done on 4.3 kernel with a few Altera patches, without RT patch - build from: https://github.com/altera-opensource/linux-socfpga/tree/rel_socfpga-4.3_16.02.01_pr
> Which CAN driver are you using?
CAN controller of the affected system is embedded in the Altera Cyclone V SoC chip, it's implementation of a standard D_CAN core. It use mainline c_can driver: https://github.com/altera-opensource/linux-socfpga/tree/rel_socfpga-4.3_16.02.01_pr/drivers/net/can/c_can
>>> We have a linux application which sends data periodically (1 to 20 ms
>>> period) out over the can0 socketcan interface. Sometimes the first
>>> data byte in the CAN frame is zero on the wire, but non-zero in the
>>> data sent! When running with this period, this happens at random
>>> times, but during a few minutes it can be allways replicated.
> How do you measure the CAN frames on the wire?
Using another linux system by:
    candump can0,0:0,#FFFFFFFF | grep -v "11 22 33 44 55 66 77 88"
We also verified the results concurrently using USB P-CAN adapter on another system with some proprietary monitoring application, and also using oscilloscope with logic analyzer.
>>> The problem only appears when the application is pinned to CPU1 by
>>> linux process afinity mechanism. When pinned to default CPU0, there is
>>> no problem.
>>>
>>> Anyone seen this issue? Any idea how to debug it and what can be a
>>> reason? What version (git repo / tag) of Linux should I use?
>>>
>>> We plan to do some in-deep evaluation and testing, but I want to share
>>> the experience now.
> The TX functions is usually pretty straight forward. Copy all data bytes
> into the hardware, write ID and DLC, then hit the send bit (or whatever
> triggers the hardware to send the frame). Maybe there's some barrier
> missing in this sequence? Where can I find your driver?
The driver is c_can from mainline (see above).
>> Is your test program available somewhere?
Test program is very simple C code which cyclically writes to CAN socket, available here: https://gist.github.com/vstk/f9e4b7c5646bedfedd42

We now discovered that the problem is not triggered by CPU pinning which was our initial idea. It can be reproducibly triggered by a high network load on ethernet generated by iperf for example. Other load on CPU or data flash does not affect this. Everything is OK only in case when the test program is executed with a high (80) RT priority on RT-patched kernel, and the whole system is limited to a single CPU by isolcpus=1 kernel option (this case initially pointed us to some SMP-related problem). Errors are most common when the test program is pinned to CPU1 by taskset and rest of the system is limited to CPU0.

Steps to reproduce:

- monitoring linux system connected to CAN bus: configure can0 interface, run:

root@monitor:~# ip link set can0 type can bitrate 1000000 restart-ms 100
root@monitor:~# ip link set can0 up
root@monitor:~# candump can0,0:0,#FFFFFFFF | grep -v "11 22 33 44 55 66 77 88"

- system under test: configure can0 interface, build test program (https://gist.github.com/vstk/f9e4b7c5646bedfedd42), run it:

root@cyclone-soc:~# ./canbench
# canbench [ <rt-prio> [ <interval-us> ] ]
iface can0 at index 2 OK
interval: 1.000000 ms

- do some network load at system under test:

root@cyclone-soc:~# iperf -c 192.168.1.1

- data errors can be seen in a few seconds at monitoring system (the "11 22 33 44 55 66 77 88" is a test pattern generated by the test program):

root@monitor:~# candump can0,0:0,#FFFFFFFF | grep -v "11 22 33 44 55 66 77 88"
  can0  266   [8]  11 22 33 44 00 00 77 88
  can0  266   [8]  11 22 33 44 01 00 77 88
  can0  266   [8]  11 22 33 44 01 00 77 88
  can0  266   [8]  0E 00 33 44 55 66 77 88
  can0  266   [8]  11 22 33 44 01 00 77 88

As a next step, I plan to check data inside the driver just before it writes into the hardware to verify if the error is not in network stack above the driver. Any other idea?

Vlastimil Setka


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Rfi] Cyclone V CAN errors when application pinned to CPU1
  2016-02-06 17:59     ` Vlastimil Setka
@ 2016-02-06 22:34       ` Tom Evans
  2016-02-06 23:56         ` Vlastimil Setka
  0 siblings, 1 reply; 7+ messages in thread
From: Tom Evans @ 2016-02-06 22:34 UTC (permalink / raw)
  To: Vlastimil Setka, Marc Kleine-Budde, Robert Schwebel; +Cc: rfi, linux-can

On 7/02/2016 4:59 AM, Vlastimil Setka wrote:
 >>> We have a linux application which sends data
 >>> periodically (1 to 20 ms period) out over the
 >>> can0 socketcan interface. Sometimes the first
 >>> data byte in the CAN frame is zero on the wire,
 >>> but non-zero in the data sent!

>> The TX functions is usually pretty straight forward.
 >> Copy all data bytes into the hardware, write ID and DLC,
 >> then hit the send bit (or whatever triggers the hardware
 >> to send the frame). Maybe there's some barrier
>> missing in this sequence?

I'd suggest you "objdump -S" the CAN driver object file and check to see 
the optimizer hasn't re-ordered the above sequence too much.

 > It can be reproducibly triggered by a high network load on
 > ethernet generated by iperf for example.

Which generates a lot of interrupts. Which are probably interrupting the 
above transmit sequence and delaying its completion. During which time 
something else can get in. The most likely disturbing interrupt would be 
a CAN Receive or Transmit interrupt. Is the transmitter "one message at 
a time" in that hardware, is there a FIFO or are there multiple transmit 
message buffers?

Do you have any other CAN traffic on the network that might be 
generating CAN Receive interrupts?

I'd suggest you add a "reentry counter" to the driver and test it on 
entry to various routines (transmit, receive, interrupt), Increment it 
on entry to the transmit routine and decrement on exit. "printk" a 
warning when you see "reentry" and correlate with the data corruption. 
Reduce where you increment and decrement to just around the transmit 
code that loads the hardware and see if you can zero in on the part of 
the code that can't handle the reentry.

It is also possible your "periodic transmit task" is being delayed 
sufficiently that it sends two or more messages back-to-back. Transmit 
flow control might not be working properly. I'd suggest putting a 
sequence counter in the CAN message to see if any are getting dropped or 
duplicated. You could also try a partial microsecond transmit timestamp 
in there to detect two messages being sent close together or back-to-back.

> As a next step, I plan to check data inside the driver
 > just before it writes into the hardware to verify if
 > the error is not in network stack above the driver.
 > Any other idea?

Can you read the data back from the hardware and verify it got written 
properly? Do this before initiating the transmit and after as well.

Since you seem to be always sending the same data (so writing the same 
data into the registers) I'd suggest sending different data in alternate 
messages to see if there's any "stale data" being sent as well.

Tom

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Rfi] Cyclone V CAN errors when application pinned to CPU1
  2016-02-06 22:34       ` Tom Evans
@ 2016-02-06 23:56         ` Vlastimil Setka
  2016-02-07  0:54           ` Tom Evans
  2016-02-07 22:19           ` Tom Evans
  0 siblings, 2 replies; 7+ messages in thread
From: Vlastimil Setka @ 2016-02-06 23:56 UTC (permalink / raw)
  To: Tom Evans, Marc Kleine-Budde, Robert Schwebel; +Cc: rfi, linux-can

6.2.2016 23:34 Tom Evans:
> On 7/02/2016 4:59 AM, Vlastimil Setka wrote:
> >>> We have a linux application which sends data
> >>> periodically (1 to 20 ms period) out over the
> >>> can0 socketcan interface. Sometimes the first
> >>> data byte in the CAN frame is zero on the wire,
> >>> but non-zero in the data sent!
>>> The TX functions is usually pretty straight forward. Copy all data bytes into the hardware, write ID and DLC, then hit the send bit (or whatever triggers the hardware to send the frame). Maybe there's some barrier missing in this sequence? 
> I'd suggest you "objdump -S" the CAN driver object file and check to see the optimizer hasn't re-ordered the above sequence too much.

I'm not so familiar with reading assembly, and the driver is a bit complicated by splitting this into many functions.

Relevant source code start probably in the c_can_start_xmit function: https://github.com/altera-opensource/linux-socfpga/blob/rel_socfpga-4.3_16.02.01_pr/drivers/net/can/c_can/c_can.c#L434

I uploaded objdump -S of my c_can.o here: https://gist.github.com/vstk/9c4307bb9ae0a6ae0208

> > It can be reproducibly triggered by a high network load on
> > ethernet generated by iperf for example.
>
> Which generates a lot of interrupts. Which are probably interrupting the above transmit sequence and delaying its completion. During which time something else can get in. The most likely disturbing interrupt would be a CAN Receive or Transmit interrupt. Is the transmitter "one message at a time" in that hardware, is there a FIFO or are there multiple transmit message buffers?

As I know, the CAN controller has pool of buffers - called message objects in CAN terminology - and these are somehow allocated to rx/tx messages.

> Do you have any other CAN traffic on the network that might be generating CAN Receive interrupts?

No, the only traffic on the network is generated by the test program.

> I'd suggest you add a "reentry counter" to the driver and test it on entry to various routines (transmit, receive, interrupt), Increment it on entry to the transmit routine and decrement on exit. "printk" a warning when you see "reentry" and correlate with the data corruption. Reduce where you increment and decrement to just around the transmit code that loads the hardware and see if you can zero in on the part of the code that can't handle the reentry.
>
> It is also possible your "periodic transmit task" is being delayed sufficiently that it sends two or more messages back-to-back. Transmit flow control might not be working properly. I'd suggest putting a sequence counter in the CAN message to see if any are getting dropped or duplicated. You could also try a partial microsecond transmit timestamp in there to detect two messages being sent close together or back-to-back.

I'm also thinking about interrupts related problem. It also explains why the problem is not present on RT-patched kernel with test program executed with high priority -- higher than kernel interrupt threads. Interrupts are handled using NAPI, so there should not be a re-entry. Maybe the NAPI is important part of the problem, because the CAN NAPI handler could be significantly delayed based on ethernet traffic which driver also use NAPI infrastructure.

The problem was initially seen in application where CAN message timing was monitored by receiving system, and the jitter was under 150us of 1000us period, so it seems to be not triggered by delaying tx task or back-to-back xmit. There were no missing or duplicated messages, only malformed data in a one message between many surrounding correct messages.

>> As a next step, I plan to check data inside the driver just before it writes into the hardware to verify if the error is not in network stack above the driver. Any other idea? 
> Can you read the data back from the hardware and verify it got written properly? Do this before initiating the transmit and after as well.
>
> Since you seem to be always sending the same data (so writing the same data into the registers) I'd suggest sending different data in alternate messages to see if there's any "stale data" being sent as well.

My initial suspicion was that some stale data from previous communication is sent, that's why my test program send always the same data - and the error can be easily detected by candump+grep.

Thanks for the idea. I will do another tests with variable data containing counters and I will monitor data on wires using logic analyser to also see precise timing.

Vlastimil Setka


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Rfi] Cyclone V CAN errors when application pinned to CPU1
  2016-02-06 23:56         ` Vlastimil Setka
@ 2016-02-07  0:54           ` Tom Evans
  2016-02-07 22:19           ` Tom Evans
  1 sibling, 0 replies; 7+ messages in thread
From: Tom Evans @ 2016-02-07  0:54 UTC (permalink / raw)
  To: Vlastimil Setka, Marc Kleine-Budde, Robert Schwebel; +Cc: rfi, linux-can

On 7/02/2016 10:56 AM, Vlastimil Setka wrote:
> 6.2.2016 23:34 Tom Evans:
>> On 7/02/2016 4:59 AM, Vlastimil Setka wrote:
>>>>> We have a linux application which sends data
>>>>> periodically (1 to 20 ms period) out over the
>>>>> can0 socketcan interface. Sometimes the first
>>>>> data byte in the CAN frame is zero on the wire,
>>>>> but non-zero in the data sent!
>>>> The TX functions is usually pretty straight forward. Copy all data bytes into the hardware, write ID and DLC, then hit the send bit (or whatever triggers the hardware to send the frame). Maybe there's some barrier missing in this sequence?
>> I'd suggest you "objdump -S" the CAN driver object file and check to see the optimizer hasn't re-ordered the above sequence too much.
>
> I'm not so familiar with reading assembly,

It is a skill worth working on.

 > and the driver is a bit
 > complicated by splitting this into many functions.

Having lots of simple functions makes it easier to understand and follow 
than the alternative.

I've looked at the source and the assembly and I'm pleasantly surprised. 
It looks like there's almost no optimization as the assembly exactly 
matches the code. Nothing unexpected.

If there are any barriers then they have to be in priv->write_reg()

Everything is done by NAPI in this driver [1]. The interrupt () does 
nothing but trigger NAPI to run "c_can_poll()". It receives messages, 
finds completed transmits, frees the buffers and restarts the NAPI 
transmit queue. NAPI transmits come through c_can_start_xmit(), which 
calls c_can_setup_tx_object() to load the data, THEN weirdly calls 
can_put_echo_skb() before starting the send with c_can_object_put().

NAPI should be guaranteeing that things are done in the right order. I'd 
still add the "recursive checks" though. Your problem might be due to 
the "rt preempt patch" messing up NAPI somehow so it isn't obeying the 
rules. Does it fail without that patch?

Note 1: I don't like NAPI. My experience with Freescale's FlexCAN (which 
did the same thing - reading all data from the chip in NAPI) was that 
the six-entry FIFO would easily overflow and lose messages at 1MHz CAN 
bit rate with high Ethernet loading.

> I uploaded objdump -S of my c_can.o here:
 > https://gist.github.com/vstk/9c4307bb9ae0a6ae0208

Read through from line 4092. The assembly follows the source exactly.

Tom

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Rfi] Cyclone V CAN errors when application pinned to CPU1
  2016-02-06 23:56         ` Vlastimil Setka
  2016-02-07  0:54           ` Tom Evans
@ 2016-02-07 22:19           ` Tom Evans
  1 sibling, 0 replies; 7+ messages in thread
From: Tom Evans @ 2016-02-07 22:19 UTC (permalink / raw)
  To: Vlastimil Setka, Marc Kleine-Budde, Robert Schwebel; +Cc: rfi, linux-can

On 07/02/16 10:56, Vlastimil Setka wrote:
> 6.2.2016 23:34 Tom Evans:
>> On 7/02/2016 4:59 AM, Vlastimil Setka wrote:
>>>>> We have a linux application which sends data
>>>>> periodically (1 to 20 ms period) out over the
>>>>> can0 socketcan interface. Sometimes the first
>>>>> data byte in the CAN frame is zero on the wire,
>>>>> but non-zero in the data sent!
>>>> The TX functions is usually pretty straight forward. Copy all data bytes into the hardware, write ID and DLC, then hit the send bit (or whatever triggers the hardware to send the frame). Maybe there's some barrier missing in this sequence?
>> I'd suggest you "objdump -S" the CAN driver object file and check to see the optimizer hasn't re-ordered the above sequence too much.
...
>>> It can be reproducibly triggered by a high network load on
>>> ethernet generated by iperf for example.
>>
>> Which generates a lot of interrupts. Which are probably
 >> interrupting the above transmit sequence and delaying its
 >> completion. During which time something else can get in.
...
>> I'd suggest you add a "reentry counter" to the driver

An easier way to investigate this (initially, you'll need things like reentry 
counters later) is to add "spin_lock_irqsave()" and "spin_unlock_irqrestore()"
pairs around sections of the CAN driver code you want to "protect". Have a 
look at the other CAN drivers that do this and some of the large number of 
Ethernet drivers for examples of how to declare the variables and call the 
functions.

I'm not recommending the above as a fix, just as a tool for narrowing down the 
cause of your problem.

If this is bad advice (the wrong functions etc) I ask others on the list that 
know better to correct me.

Tom


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-02-07 22:19 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <562155B7.7020504@vsis.cz>
2015-10-20  7:18 ` [Rfi] Cyclone V CAN errors when application pinned to CPU1 Robert Schwebel
2015-10-20  7:37   ` Marc Kleine-Budde
2016-02-06 17:59     ` Vlastimil Setka
2016-02-06 22:34       ` Tom Evans
2016-02-06 23:56         ` Vlastimil Setka
2016-02-07  0:54           ` Tom Evans
2016-02-07 22:19           ` Tom Evans

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.