* sja1000 interrupt problem
@ 2013-10-08  0:47 Austin Schuh
  2013-10-08  6:32 ` Wolfgang Grandegger
  0 siblings, 1 reply; 66+ messages in thread
From: Austin Schuh @ 2013-10-08  0:47 UTC (permalink / raw)
  To: linux-can

I am seeing the following problem with my Peak miniPCIe card.  After
being up for a long time and not having any programs listening to the
CAN traffic, I get the following message from dmesg, and then I start
not receiving packets (Some get through, but not all of them.)

Is this a known issue?  Are there any tests that I can run to help
isolate and fix the problem?

Thanks,
    Austin Schuh

$ lspci -v -s 05:00.0
05:00.0 Network controller: PEAK-System Technik GmbH Device 0008 (rev 02)
        Subsystem: PEAK-System Technik GmbH Device 0005
        Flags: bus master, fast devsel, latency 0, IRQ 18
        Memory at e0510000 (32-bit, non-prefetchable) [size=64K]
        Memory at e0500000 (32-bit, non-prefetchable) [size=64K]
        Capabilities: [50] Power Management version 0
        Capabilities: [70] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [90] Express Endpoint, MSI 00
        Capabilities: [100] Vendor Specific Information: ID=0000 Rev=0
Len=00c <?>
        Kernel driver in use: peak_pci

$ uname -a
Linux vpc3 3.10-3-rt-amd64 #1 SMP PREEMPT RT Debian 3.10.11-1
(2013-09-10) x86_64 GNU/Linux

$ dmesg
--- snip ---
[105918.136633] irq 18: nobody cared (try booting with the "irqpoll" option)
[105918.136763] CPU: 1 PID: 202 Comm: irq/18-ata_gene Not tainted
3.10-3-rt-amd64 #1 Debian 3.10.11-1
[105918.136768] Hardware name: CompuLab Intense-PC/Intense-PC, BIOS
CR_2.2.0.377 X64 04/10/2013
[105918.136777]  ffffffff8139f8fc 000000000000003c ffffffff810a2d43
ffff88042a4b5600
[105918.136781]  0000000000000000 ffff88042750ece0 ffffffff810a3182
ffffffff814089d0
[105918.136785]  ffff880426527c80 ffff88042a4b5600 ffff88042750ece0
ffffffff810a13d7
[105918.136791] Call Trace:
[105918.136805]  [<ffffffff8139f8fc>] ? dump_stack+0xd/0x17
[105918.136814]  [<ffffffff810a2d43>] ? __report_bad_irq+0x21/0xc1
[105918.136819]  [<ffffffff810a3182>] ? note_interrupt+0x16d/0x1ff
[105918.136824]  [<ffffffff810a13d7>] ? irq_thread_fn+0x30/0x30
[105918.136828]  [<ffffffff810a1d9c>] ? irq_thread+0xaf/0x190
[105918.136833]  [<ffffffff810a1e7d>] ? irq_thread+0x190/0x190
[105918.136837]  [<ffffffff810a1ced>] ? wake_threads_waitq+0x3a/0x3a
[105918.136843]  [<ffffffff8105a909>] ? kthread+0x81/0x89
[105918.136848]  [<ffffffff8105a888>] ? __kthread_parkme+0x5c/0x5c
[105918.136854]  [<ffffffff813a75fc>] ? ret_from_fork+0x7c/0xb0
[105918.136858]  [<ffffffff8105a888>] ? __kthread_parkme+0x5c/0x5c
[105918.136861] handlers:
[105918.136962] [<ffffffff810a126f>] irq_default_primary_handler
threaded [<ffffffffa00db857>] ata_bmdma_interrupt [libata]
[105918.136972] [<ffffffff810a126f>] irq_default_primary_handler
threaded [<ffffffffa023f75e>] sja1000_interrupt [sja1000]
[105918.136980] [<ffffffff810a126f>] irq_default_primary_handler
threaded [<ffffffffa023f75e>] sja1000_interrupt [sja1000]
[105918.136981] Disabling IRQ #18
[202007.935715] irq 18: nobody cared (try booting with the "irqpoll" option)
[202007.935860] CPU: 0 PID: 202 Comm: irq/18-ata_gene Not tainted
3.10-3-rt-amd64 #1 Debian 3.10.11-1
[202007.935865] Hardware name: CompuLab Intense-PC/Intense-PC, BIOS
CR_2.2.0.377 X64 04/10/2013
[202007.935874]  ffffffff8139f8fc 000000000000003c ffffffff810a2d43
ffff88042a4b5600
[202007.935877]  0000000000000000 ffff88042750ece0 ffffffff810a3182
ffffffff814089d0
[202007.935881]  ffff880426527c80 ffff88042a4b5600 ffff88042750ece0
ffffffff810a13d7
[202007.935888] Call Trace:
[202007.935901]  [<ffffffff8139f8fc>] ? dump_stack+0xd/0x17
[202007.935910]  [<ffffffff810a2d43>] ? __report_bad_irq+0x21/0xc1
[202007.935915]  [<ffffffff810a3182>] ? note_interrupt+0x16d/0x1ff
[202007.935920]  [<ffffffff810a13d7>] ? irq_thread_fn+0x30/0x30
[202007.935925]  [<ffffffff810a1d9c>] ? irq_thread+0xaf/0x190
[202007.935930]  [<ffffffff810a1e7d>] ? irq_thread+0x190/0x190
[202007.935936]  [<ffffffff810a1ced>] ? wake_threads_waitq+0x3a/0x3a
[202007.935944]  [<ffffffff8105a909>] ? kthread+0x81/0x89
[202007.935950]  [<ffffffff8105a888>] ? __kthread_parkme+0x5c/0x5c
[202007.935957]  [<ffffffff813a75fc>] ? ret_from_fork+0x7c/0xb0
[202007.935961]  [<ffffffff8105a888>] ? __kthread_parkme+0x5c/0x5c
[202007.935964] handlers:
[202007.936067] [<ffffffff810a126f>] irq_default_primary_handler
threaded [<ffffffffa00db857>] ata_bmdma_interrupt [libata]
[202007.936076] [<ffffffff810a126f>] irq_default_primary_handler
threaded [<ffffffffa023f75e>] sja1000_interrupt [sja1000]
[202007.936084] [<ffffffff810a126f>] irq_default_primary_handler
threaded [<ffffffffa023f75e>] sja1000_interrupt [sja1000]
[202007.936086] Disabling IRQ #18


* Re: sja1000 interrupt problem
  2013-10-08  0:47 sja1000 interrupt problem Austin Schuh
@ 2013-10-08  6:32 ` Wolfgang Grandegger
  2013-10-08  6:58   ` Oliver Hartkopp
  0 siblings, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-10-08  6:32 UTC (permalink / raw)
  To: Austin Schuh; +Cc: linux-can

On Mon, 7 Oct 2013 17:47:48 -0700, Austin Schuh <austin@peloton-tech.com> wrote:
> I am seeing the following problem with my Peak miniPCIe card.  After
> being up for a long time and not having any programs listening to the
> CAN traffic, I get the following message from dmesg, and then I start
> not receiving packets (Some get through, but not all of them.)
>
> Is this a known issue?  Are there any tests that I can run to help
> isolate and fix the problem?
>
> Thanks,
>     Austin Schuh
>
> $ lspci -v -s 05:00.0
> 05:00.0 Network controller: PEAK-System Technik GmbH Device 0008 (rev 02)
>         Subsystem: PEAK-System Technik GmbH Device 0005
>         Flags: bus master, fast devsel, latency 0, IRQ 18
>         Memory at e0510000 (32-bit, non-prefetchable) [size=64K]
>         Memory at e0500000 (32-bit, non-prefetchable) [size=64K]
>         Capabilities: [50] Power Management version 0
>         Capabilities: [70] MSI: Enable- Count=1/1 Maskable- 64bit+
>         Capabilities: [90] Express Endpoint, MSI 00
>         Capabilities: [100] Vendor Specific Information: ID=0000 Rev=0
> Len=00c <?>
>         Kernel driver in use: peak_pci
>
> $ uname -a
> Linux vpc3 3.10-3-rt-amd64 #1 SMP PREEMPT RT Debian 3.10.11-1
> (2013-09-10) x86_64 GNU/Linux
>
> $ dmesg
> --- snip ---
> [105918.136633] irq 18: nobody cared (try booting with the "irqpoll"
> option)
> [105918.136763] CPU: 1 PID: 202 Comm: irq/18-ata_gene Not tainted
> 3.10-3-rt-amd64 #1 Debian 3.10.11-1
> [105918.136768] Hardware name: CompuLab Intense-PC/Intense-PC, BIOS
> CR_2.2.0.377 X64 04/10/2013
> [105918.136777]  ffffffff8139f8fc 000000000000003c ffffffff810a2d43
> ffff88042a4b5600
> [105918.136781]  0000000000000000 ffff88042750ece0 ffffffff810a3182
> ffffffff814089d0
> [105918.136785]  ffff880426527c80 ffff88042a4b5600 ffff88042750ece0
> ffffffff810a13d7
> [105918.136791] Call Trace:
> [105918.136805]  [<ffffffff8139f8fc>] ? dump_stack+0xd/0x17
> [105918.136814]  [<ffffffff810a2d43>] ? __report_bad_irq+0x21/0xc1
> [105918.136819]  [<ffffffff810a3182>] ? note_interrupt+0x16d/0x1ff
> [105918.136824]  [<ffffffff810a13d7>] ? irq_thread_fn+0x30/0x30
> [105918.136828]  [<ffffffff810a1d9c>] ? irq_thread+0xaf/0x190
> [105918.136833]  [<ffffffff810a1e7d>] ? irq_thread+0x190/0x190
> [105918.136837]  [<ffffffff810a1ced>] ? wake_threads_waitq+0x3a/0x3a
> [105918.136843]  [<ffffffff8105a909>] ? kthread+0x81/0x89
> [105918.136848]  [<ffffffff8105a888>] ? __kthread_parkme+0x5c/0x5c
> [105918.136854]  [<ffffffff813a75fc>] ? ret_from_fork+0x7c/0xb0
> [105918.136858]  [<ffffffff8105a888>] ? __kthread_parkme+0x5c/0x5c
> [105918.136861] handlers:
> [105918.136962] [<ffffffff810a126f>] irq_default_primary_handler
> threaded [<ffffffffa00db857>] ata_bmdma_interrupt [libata]
> [105918.136972] [<ffffffff810a126f>] irq_default_primary_handler
> threaded [<ffffffffa023f75e>] sja1000_interrupt [sja1000]
> [105918.136980] [<ffffffff810a126f>] irq_default_primary_handler
> threaded [<ffffffffa023f75e>] sja1000_interrupt [sja1000]
> [105918.136981] Disabling IRQ #18
> [202007.935715] irq 18: nobody cared (try booting with the "irqpoll"
> option)
> [202007.935860] CPU: 0 PID: 202 Comm: irq/18-ata_gene Not tainted
> 3.10-3-rt-amd64 #1 Debian 3.10.11-1
> [202007.935865] Hardware name: CompuLab Intense-PC/Intense-PC, BIOS
> CR_2.2.0.377 X64 04/10/2013
> [202007.935874]  ffffffff8139f8fc 000000000000003c ffffffff810a2d43
> ffff88042a4b5600
> [202007.935877]  0000000000000000 ffff88042750ece0 ffffffff810a3182
> ffffffff814089d0
> [202007.935881]  ffff880426527c80 ffff88042a4b5600 ffff88042750ece0
> ffffffff810a13d7
> [202007.935888] Call Trace:
> [202007.935901]  [<ffffffff8139f8fc>] ? dump_stack+0xd/0x17
> [202007.935910]  [<ffffffff810a2d43>] ? __report_bad_irq+0x21/0xc1
> [202007.935915]  [<ffffffff810a3182>] ? note_interrupt+0x16d/0x1ff
> [202007.935920]  [<ffffffff810a13d7>] ? irq_thread_fn+0x30/0x30
> [202007.935925]  [<ffffffff810a1d9c>] ? irq_thread+0xaf/0x190
> [202007.935930]  [<ffffffff810a1e7d>] ? irq_thread+0x190/0x190
> [202007.935936]  [<ffffffff810a1ced>] ? wake_threads_waitq+0x3a/0x3a
> [202007.935944]  [<ffffffff8105a909>] ? kthread+0x81/0x89
> [202007.935950]  [<ffffffff8105a888>] ? __kthread_parkme+0x5c/0x5c
> [202007.935957]  [<ffffffff813a75fc>] ? ret_from_fork+0x7c/0xb0
> [202007.935961]  [<ffffffff8105a888>] ? __kthread_parkme+0x5c/0x5c
> [202007.935964] handlers:
> [202007.936067] [<ffffffff810a126f>] irq_default_primary_handler
> threaded [<ffffffffa00db857>] ata_bmdma_interrupt [libata]
> [202007.936076] [<ffffffff810a126f>] irq_default_primary_handler
> threaded [<ffffffffa023f75e>] sja1000_interrupt [sja1000]
> [202007.936084] [<ffffffff810a126f>] irq_default_primary_handler
> threaded [<ffffffffa023f75e>] sja1000_interrupt [sja1000]
> [202007.936086] Disabling IRQ #18

Is IRQ 18 the one used by the SJA1000 (check /proc/interrupts)? Is it
shared with another device (ata?)? Anyway, the IRQ handler for IRQ 18
returned IRQ_NONE. You are using an "-rt" kernel, so maybe there is a
race condition somewhere.
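
To make the "nobody cared" message more concrete, here is a rough Python
model of how the kernel's spurious-interrupt accounting reacts when every
handler on a shared line returns IRQ_NONE. It is only a sketch: the names
IrqDesc and note_interrupt mirror the kernel's, but the window of 100000
interrupts and the threshold of 99900 unhandled ones are assumptions that
should be checked against kernel/irq/spurious.c for the kernel in use, and
the real code also takes timing into account.

```python
from dataclasses import dataclass

IRQ_NONE = 0      # handler says "this interrupt was not mine"
IRQ_HANDLED = 1   # handler serviced its device


@dataclass
class IrqDesc:
    """Hypothetical, simplified stand-in for the kernel's per-IRQ descriptor."""
    irq_count: int = 0
    irqs_unhandled: int = 0
    disabled: bool = False


def note_interrupt(desc, results, window=100_000, threshold=99_900):
    """Account for one interrupt on a shared line.

    results holds the return value of every handler registered on the line
    (here that would be ata_bmdma_interrupt and two sja1000_interrupt).
    window and threshold are assumed values, see the text above.
    """
    if desc.disabled:
        return
    # The interrupt counts as unhandled only if nobody claimed it.
    if all(r == IRQ_NONE for r in results):
        desc.irqs_unhandled += 1
    desc.irq_count += 1
    if desc.irq_count < window:
        return
    # End of the window: decide whether the line looks stuck.
    desc.irq_count = 0
    if desc.irqs_unhandled > threshold:
        # This is where the kernel would print "nobody cared" and
        # "Disabling IRQ #n".
        desc.disabled = True
    desc.irqs_unhandled = 0
```

In this model a line is only disabled when almost every interrupt in a
window goes unclaimed, which would fit a device that keeps its interrupt
asserted while its handler does not recognise it.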



Wolfgang.


* Re: sja1000 interrupt problem
  2013-10-08  6:32 ` Wolfgang Grandegger
@ 2013-10-08  6:58   ` Oliver Hartkopp
  2013-10-08 18:48     ` Austin Schuh
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-10-08  6:58 UTC (permalink / raw)
  To: Wolfgang Grandegger, Austin Schuh; +Cc: linux-can

Hi,

Sometimes I see this 'nobody cared' problem on my Dell 6510 too, but there it evidently has something to do with my USB controller. So this is probably IRQ or hardware related. By the way, if you can reproduce the problem reliably, that would help to track down the cause.

Regards,
Oliver
-- 
This message was sent from my Android mobile phone with K-9 Mail.


* Re: sja1000 interrupt problem
  2013-10-08  6:58   ` Oliver Hartkopp
@ 2013-10-08 18:48     ` Austin Schuh
  2013-10-08 19:44       ` Wolfgang Grandegger
  0 siblings, 1 reply; 66+ messages in thread
From: Austin Schuh @ 2013-10-08 18:48 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Wolfgang Grandegger, linux-can

$ cat /proc/interrupts  | grep " 18:"
 18:   23361632   96767425   20421047   23392586   IO-APIC-fasteoi
ata_generic, can1, can0

Looks like the IRQ is shared with ata_generic.  I don't see any
hard-drive related problems though after the message, only CAN
problems.  If I bring the interface down and then back up with
ifconfig, it works again.
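
For anyone who wants to check IRQ sharing from a script, the line above
can be picked apart roughly like this. This is a sketch only: the helper
name parse_interrupt_line is made up, the number of CPU columns (4) is
taken from the output above, and the column layout of /proc/interrupts
differs between kernel versions, so the field positions are an
assumption.

```python
def parse_interrupt_line(line, ncpus):
    """Split one /proc/interrupts line into its parts.

    Assumed layout: "<irq>: <count per CPU>... <chip> <dev>, <dev>, ...".
    """
    fields = line.split()
    irq = fields[0].rstrip(":")
    counts = [int(x) for x in fields[1:1 + ncpus]]
    chip = fields[1 + ncpus]
    # Device names are comma separated; rejoin and split on the commas.
    rest = " ".join(fields[2 + ncpus:])
    devices = [d.strip() for d in rest.split(",") if d.strip()]
    return irq, counts, chip, devices


# The IRQ 18 line from above, with the mail line wrap undone.
sample = (" 18:   23361632   96767425   20421047   23392586"
          "   IO-APIC-fasteoi ata_generic, can1, can0")
irq, counts, chip, devices = parse_interrupt_line(sample, ncpus=4)
shared = len(devices) > 1   # more than one device name means a shared line
```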

I see this failure about once per day.

Austin

On Mon, Oct 7, 2013 at 11:58 PM, Oliver Hartkopp <socketcan@hartkopp.net> wrote:
> Hi,
>
> Sometimes I see this 'nobody cared' problem on my Dell 6510 too, but there it evidently has something to do with my USB controller. So this is probably IRQ or hardware related. By the way, if you can reproduce the problem reliably, that would help to track down the cause.
>
> Regards,
> Oliver
> --
> This message was sent from my Android mobile phone with K-9 Mail.


* Re: sja1000 interrupt problem
  2013-10-08 18:48     ` Austin Schuh
@ 2013-10-08 19:44       ` Wolfgang Grandegger
  2013-10-08 20:47         ` Austin Schuh
  0 siblings, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-10-08 19:44 UTC (permalink / raw)
  To: Austin Schuh, Oliver Hartkopp; +Cc: linux-can

On 10/08/2013 08:48 PM, Austin Schuh wrote:
> $ cat /proc/interrupts  | grep " 18:"
>  18:   23361632   96767425   20421047   23392586   IO-APIC-fasteoi
> ata_generic, can1, can0

Are both CAN interfaces active when the problem occurs? At least two
handlers for the SJA1000 are called.

[105918.136861] handlers:
[105918.136962] [<ffffffff810a126f>] irq_default_primary_handler
  threaded [<ffffffffa00db857>] ata_bmdma_interrupt [libata]
[105918.136972] [<ffffffff810a126f>] irq_default_primary_handler
  threaded [<ffffffffa023f75e>] sja1000_interrupt [sja1000]
[105918.136980] [<ffffffff810a126f>] irq_default_primary_handler
  threaded [<ffffffffa023f75e>] sja1000_interrupt [sja1000]
[105918.136981] Disabling IRQ #18

> Looks like the IRQ is shared with ata_generic.  I don't see any
> hard-drive related problems though after the message, only CAN
> problems.  If I bring the interface down and then back up with
> ifconfig, it works again.
> 
> I see this failure about once per day.

Can you use another PCI slot to have an IRQ solely for the CAN devices?

What does "ip -d -s link" list?

Does the problem show up with a vanilla (non-rt) kernel as well?

Does the problem show up if an app is reading the CAN messages?

Are you able to rebuild your kernel with some additional debug code?
(Not sure what to trigger yet, though).
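
For the test with an application reading the CAN messages, a minimal
reader could look roughly like the sketch below. It uses Python's raw
SocketCAN support; the function names unpack_frame and read_frames and
the default interface name "can0" are only illustrative, and candump
from can-utils would serve the same purpose.

```python
import socket
import struct

# Layout of the classic struct can_frame: 32-bit CAN ID, 8-bit length,
# 3 padding bytes, 8 data bytes (16 bytes in total).
CAN_FRAME_FMT = "=IB3x8s"
CAN_FRAME_SIZE = struct.calcsize(CAN_FRAME_FMT)
CAN_EFF_MASK = 0x1FFFFFFF  # strips the EFF/RTR/ERR flag bits from the ID


def unpack_frame(raw):
    """Return (can_id, payload) for one raw 16-byte CAN frame."""
    can_id, dlc, data = struct.unpack(CAN_FRAME_FMT, raw)
    return can_id & CAN_EFF_MASK, data[:dlc]


def read_frames(ifname="can0", count=10):
    """Yield `count` frames from a CAN interface (Linux only)."""
    with socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW) as s:
        s.bind((ifname,))
        for _ in range(count):
            yield unpack_frame(s.recv(CAN_FRAME_SIZE))
```

Something like `for can_id, payload in read_frames("can1", 100): ...`
would then keep a listener attached to the interface while waiting for
the failure.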

Wolfgang.



* Re: sja1000 interrupt problem
  2013-10-08 19:44       ` Wolfgang Grandegger
@ 2013-10-08 20:47         ` Austin Schuh
  2013-10-09  6:21           ` Wolfgang Grandegger
                             ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Austin Schuh @ 2013-10-08 20:47 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Oliver Hartkopp, linux-can

Hi Wolfgang,

Thanks for taking the time to help!

Both CAN channels are active and at least one of them has external
traffic on the bus when the problem occurs.

I have installed the PEAK card in an "IntensePC" (
http://www.fit-pc.com/web/products/intense-pc/ ).  It is in the only
full sized miniPCIe slot in the PC.

I'll start some tests to determine the mean failure time so I can try
a non-rt kernel and give you a definitive answer on whether or not
that fixes it.  That could take me a day or two given how long it
takes to reproduce.
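
As a starting point for the failure-time estimate, the two "Disabling
IRQ #18" lines in the dmesg output posted earlier already give one
interval. The sketch below pulls such timestamps out of dmesg text; the
helper name failure_intervals is made up, and the regular expression
assumes the "[seconds.micros] Disabling IRQ #n" format seen in that log.

```python
import re

# Matches e.g. "[105918.136981] Disabling IRQ #18"
DISABLE_RE = re.compile(r"^\[\s*(\d+\.\d+)\] Disabling IRQ #(\d+)")


def failure_intervals(dmesg_text, irq=18):
    """Return the seconds between consecutive 'Disabling IRQ' events."""
    stamps = []
    for line in dmesg_text.splitlines():
        m = DISABLE_RE.match(line)
        if m and int(m.group(2)) == irq:
            stamps.append(float(m.group(1)))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


log = """[105918.136981] Disabling IRQ #18
[202007.936086] Disabling IRQ #18"""
intervals = failure_intervals(log)
hours = [seconds / 3600 for seconds in intervals]
```

For the two events in that log the interval comes out at roughly 26.7
hours, which matches the "about once per day" observation. A single
interval is of course far too little data for a real mean.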

I built the kernel once already, and am comfortable patching it.  I'm
currently running a Debian Sid kernel (3.10).  I saw the same problem
with the Debian 3.2 kernel where I backported the sja1000 driver.

Austin

$ ip -d -s link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    RX: bytes  packets  errors  dropped overrun mcast
    7964       57       0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    7964       57       0       0       0       0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
state UP mode DEFAULT qlen 1000
    link/ether 00:01:c0:12:29:80 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast
    160966647  505906   0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    1565668810 1457778  0       0       0       0
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
state UP mode DEFAULT qlen 1000
    link/ether 00:01:c0:12:29:7f brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast
    92348233   293914   0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    2152586    27438    0       0       0       0
4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state
UNKNOWN mode DEFAULT qlen 10
    link/can
    can <TRIPLE-SAMPLING> state ERROR-ACTIVE (berr-counter tx 0 rx 0)
restart-ms 0
    bitrate 250000 sample-point 0.875
    tq 250 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
    sja1000: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
    clock 8000000
    re-started bus-errors arbit-lost error-warn error-pass bus-off
    0          0          0          0          0          0
    RX: bytes  packets  errors  dropped overrun mcast
    15428069   1929618  0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0
5: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state
UNKNOWN mode DEFAULT qlen 30
    link/can
    can <TRIPLE-SAMPLING> state ERROR-ACTIVE (berr-counter tx 0 rx 0)
restart-ms 0
    bitrate 500000 sample-point 0.875
    tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
    sja1000: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
    clock 8000000
    re-started bus-errors arbit-lost error-warn error-pass bus-off
    0          0          0          0          0          0
    RX: bytes  packets  errors  dropped overrun mcast
    89752816   11505773 0       468029  0       0
    TX: bytes  packets  errors  dropped carrier collsns
    3283508    1281127  0       0       0       0
6: vcan0: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode DEFAULT
    link/can
    vcan
    RX: bytes  packets  errors  dropped overrun mcast
    0          0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0
7: vcan1: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode DEFAULT
    link/can
    vcan
    RX: bytes  packets  errors  dropped overrun mcast
    3620640    452580   0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    3620640    452580   0       0       0       0
8: vcan2: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode DEFAULT
    link/can
    vcan
    RX: bytes  packets  errors  dropped overrun mcast
    9999880    1249985  0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    9999880    1249985  0       0       0       0
9: vcan3: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode DEFAULT
    link/can
    vcan
    RX: bytes  packets  errors  dropped overrun mcast
    0          0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0
10: vcan10: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode DEFAULT
    link/can
    vcan
    RX: bytes  packets  errors  dropped overrun mcast
    0          0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0
11: vcan11: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode DEFAULT
    link/can
    vcan
    RX: bytes  packets  errors  dropped overrun mcast
    3552992    444124   0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    3552992    444124   0       0       0       0
12: vcan12: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode DEFAULT
    link/can
    vcan
    RX: bytes  packets  errors  dropped overrun mcast
    0          0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0
13: vcan13: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode DEFAULT
    link/can
    vcan
    RX: bytes  packets  errors  dropped overrun mcast
    0          0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0
14: vcan20: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode DEFAULT
    link/can
    vcan
    RX: bytes  packets  errors  dropped overrun mcast
    0          0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0
15: vcan21: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode DEFAULT
    link/can
    vcan
    RX: bytes  packets  errors  dropped overrun mcast
    0          0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0
16: vcan22: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode DEFAULT
    link/can
    vcan
    RX: bytes  packets  errors  dropped overrun mcast
    0          0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0
17: vcan23: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode DEFAULT
    link/can
    vcan
    RX: bytes  packets  errors  dropped overrun mcast
    0          0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0
18: vcan50: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode DEFAULT
    link/can
    vcan
    RX: bytes  packets  errors  dropped overrun mcast
    0          0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0
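
As a cross-check of the bit-timing values reported for can0 and can1
above, the bitrate and sample point follow from the time quantum and the
segment lengths. The helper name bit_timing is made up; the formulas are
the usual CAN ones, with one time quantum for the sync segment.

```python
def bit_timing(clock_hz, tq_ns, prop_seg, phase_seg1, phase_seg2):
    """Derive bitrate, sample point and prescaler from `ip -d link` values."""
    sync_seg = 1  # the sync segment is always one time quantum
    tq_per_bit = sync_seg + prop_seg + phase_seg1 + phase_seg2
    bitrate = 1e9 / (tq_ns * tq_per_bit)
    sample_point = (sync_seg + prop_seg + phase_seg1) / tq_per_bit
    brp = round(tq_ns * clock_hz / 1e9)  # clock cycles per time quantum
    return bitrate, sample_point, brp


# Values copied from the can0 and can1 entries above.
can0 = bit_timing(8_000_000, 250, prop_seg=6, phase_seg1=7, phase_seg2=2)
can1 = bit_timing(8_000_000, 125, prop_seg=6, phase_seg1=7, phase_seg2=2)
```

Both interfaces use 16 time quanta per bit, so the reported bitrates of
250000 and 500000 and the sample point of 0.875 are consistent with the
8 MHz clock.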

On Tue, Oct 8, 2013 at 12:44 PM, Wolfgang Grandegger <wg@grandegger.com> wrote:
> On 10/08/2013 08:48 PM, Austin Schuh wrote:
>> $ cat /proc/interrupts  | grep " 18:"
>>  18:   23361632   96767425   20421047   23392586   IO-APIC-fasteoi
>> ata_generic, can1, can0
>
> Are both CAN interfaces active when the problem occurs? At least two
> handlers for the SJA1000 are called.
>
> [105918.136861] handlers:
> [105918.136962] [<ffffffff810a126f>] irq_default_primary_handler
>   threaded [<ffffffffa00db857>] ata_bmdma_interrupt [libata]
> [105918.136972] [<ffffffff810a126f>] irq_default_primary_handler
>   threaded [<ffffffffa023f75e>] sja1000_interrupt [sja1000]
> [105918.136980] [<ffffffff810a126f>] irq_default_primary_handler
>   threaded [<ffffffffa023f75e>] sja1000_interrupt [sja1000]
> [105918.136981] Disabling IRQ #18
>
>> Looks like the IRQ is shared with ata_generic.  I don't see any
>> hard-drive related problems though after the message, only CAN
>> problems.  If I bring the interface down and then back up with
>> ifconfig, it works again.
>>
>> I see this failure about once per day.
>
> Can you use another PCI slot to have an IRQ solely for the CAN devices?
>
> What does "ip -d -s link" list?
>
> Does the problem show up with a vanilla (non-rt) kernel as well?
>
> Does the problem show up if an app is reading the CAN messages?
>
> Are you able to rebuild your kernel with some additional debug code?
> (Not sure what to trigger yet, though).
>
> Wolfgang.
>


* Re: sja1000 interrupt problem
  2013-10-08 20:47         ` Austin Schuh
@ 2013-10-09  6:21           ` Wolfgang Grandegger
  2013-10-09  6:31           ` Wolfgang Grandegger
  2013-10-09  6:47           ` Wolfgang Grandegger
  2 siblings, 0 replies; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-10-09  6:21 UTC (permalink / raw)
  To: Austin Schuh; +Cc: Oliver Hartkopp, linux-can


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-10-08 20:47         ` Austin Schuh
  2013-10-09  6:21           ` Wolfgang Grandegger
@ 2013-10-09  6:31           ` Wolfgang Grandegger
  2013-10-09  6:47           ` Wolfgang Grandegger
  2 siblings, 0 replies; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-10-09  6:31 UTC (permalink / raw)
  To: Austin Schuh; +Cc: Oliver Hartkopp, linux-can

Hi Austin,



On Tue, 8 Oct 2013 13:47:06 -0700, Austin Schuh <austin@peloton-tech.com>

wrote:

> Hi Wolfgang,

> 

> Thanks for taking the time to help!

> 

> Both CAN channels are active and at least one of them has external

> traffic on the bus when the problem occurs.

> 

> I have installed the PEAK card in an "IntensePC" (

> http://www.fit-pc.com/web/products/intense-pc/ ).  It is in the only

> full sized miniPCIe slot in the PC.

> 

> I'll start some tests to determine the mean failure time so I can try

> a non-rt kernel and give you a definitive answer on whether or not

> that fixes it.  That could take me a day or two given how long it

> takes to reproduce.



Not sure if it's worth the effort... at least for the moment.



> I built the kernel once already, and am comfortable patching it.  I'm

> currently running a Debian Sid kernel (3.10).  I saw the same problem

> with the Debian 3.2 kernel where I backported the sja1000 driver.



OK, my idea is to do an ftrace freeze when IRQ 18 is not handled. Could

you please add 





"pr_info("Stop tracing...\n");tracing_off();" when

IRQ 18 is  



http://lxr.free-electrons.com/source/kernel/irq/spurious.c#L291



> Austin

> 

> $ ip -d -s link

> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode

> DEFAULT

>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

>     RX: bytes  packets  errors  dropped overrun mcast

>     7964       57       0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     7964       57       0       0       0       0

> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast

> state UP mode DEFAULT qlen 1000

>     link/ether 00:01:c0:12:29:80 brd ff:ff:ff:ff:ff:ff

>     RX: bytes  packets  errors  dropped overrun mcast

>     160966647  505906   0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     1565668810 1457778  0       0       0       0

> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast

> state UP mode DEFAULT qlen 1000

>     link/ether 00:01:c0:12:29:7f brd ff:ff:ff:ff:ff:ff

>     RX: bytes  packets  errors  dropped overrun mcast

>     92348233   293914   0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     2152586    27438    0       0       0       0

> 4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state

> UNKNOWN mode DEFAULT qlen 10

>     link/can

>     can <TRIPLE-SAMPLING> state ERROR-ACTIVE (berr-counter tx 0 rx 0)

> restart-ms 0

>     bitrate 250000 sample-point 0.875

>     tq 250 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1

>     sja1000: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1

>     clock 8000000

>     re-started bus-errors arbit-lost error-warn error-pass bus-off

>     0          0          0          0          0          0

>     RX: bytes  packets  errors  dropped overrun mcast

>     15428069   1929618  0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     0          0        0       0       0       0

> 5: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state

> UNKNOWN mode DEFAULT qlen 30

>     link/can

>     can <TRIPLE-SAMPLING> state ERROR-ACTIVE (berr-counter tx 0 rx 0)

> restart-ms 0

>     bitrate 500000 sample-point 0.875

>     tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1

>     sja1000: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1

>     clock 8000000

>     re-started bus-errors arbit-lost error-warn error-pass bus-off

>     0          0          0          0          0          0

>     RX: bytes  packets  errors  dropped overrun mcast

>     89752816   11505773 0       468029  0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     3283508    1281127  0       0       0       0

> 6: vcan0: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode

> DEFAULT

>     link/can

>     vcan

>     RX: bytes  packets  errors  dropped overrun mcast

>     0          0        0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     0          0        0       0       0       0

> 7: vcan1: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode

> DEFAULT

>     link/can

>     vcan

>     RX: bytes  packets  errors  dropped overrun mcast

>     3620640    452580   0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     3620640    452580   0       0       0       0

> 8: vcan2: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode

> DEFAULT

>     link/can

>     vcan

>     RX: bytes  packets  errors  dropped overrun mcast

>     9999880    1249985  0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     9999880    1249985  0       0       0       0

> 9: vcan3: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode

> DEFAULT

>     link/can

>     vcan

>     RX: bytes  packets  errors  dropped overrun mcast

>     0          0        0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     0          0        0       0       0       0

> 10: vcan10: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode

> DEFAULT

>     link/can

>     vcan

>     RX: bytes  packets  errors  dropped overrun mcast

>     0          0        0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     0          0        0       0       0       0

> 11: vcan11: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode

> DEFAULT

>     link/can

>     vcan

>     RX: bytes  packets  errors  dropped overrun mcast

>     3552992    444124   0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     3552992    444124   0       0       0       0

> 12: vcan12: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode

> DEFAULT

>     link/can

>     vcan

>     RX: bytes  packets  errors  dropped overrun mcast

>     0          0        0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     0          0        0       0       0       0

> 13: vcan13: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode

> DEFAULT

>     link/can

>     vcan

>     RX: bytes  packets  errors  dropped overrun mcast

>     0          0        0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     0          0        0       0       0       0

> 14: vcan20: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode

> DEFAULT

>     link/can

>     vcan

>     RX: bytes  packets  errors  dropped overrun mcast

>     0          0        0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     0          0        0       0       0       0

> 15: vcan21: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode

> DEFAULT

>     link/can

>     vcan

>     RX: bytes  packets  errors  dropped overrun mcast

>     0          0        0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     0          0        0       0       0       0

> 16: vcan22: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode

> DEFAULT

>     link/can

>     vcan

>     RX: bytes  packets  errors  dropped overrun mcast

>     0          0        0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     0          0        0       0       0       0

> 17: vcan23: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode

> DEFAULT

>     link/can

>     vcan

>     RX: bytes  packets  errors  dropped overrun mcast

>     0          0        0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     0          0        0       0       0       0

> 18: vcan50: <NOARP,UP,LOWER_UP> mtu 16 qdisc noqueue state UNKNOWN mode

> DEFAULT

>     link/can

>     vcan

>     RX: bytes  packets  errors  dropped overrun mcast

>     0          0        0       0       0       0

>     TX: bytes  packets  errors  dropped carrier collsns

>     0          0        0       0       0       0

> 

> On Tue, Oct 8, 2013 at 12:44 PM, Wolfgang Grandegger <wg@grandegger.com>

> wrote:

>> On 10/08/2013 08:48 PM, Austin Schuh wrote:

>>> $ cat /proc/interrupts  | grep " 18:"

>>>  18:   23361632   96767425   20421047   23392586   IO-APIC-fasteoi

>>> ata_generic, can1, can0

>>

>> Are both CAN interfaces active when the problem occurs? At least two handlers for

>> the SJA1000 are called.

>>

>> [105918.136861] handlers:

>> [105918.136962] [<ffffffff810a126f>] irq_default_primary_handler

>>   threaded [<ffffffffa00db857>] ata_bmdma_interrupt [libata]

>> [105918.136972] [<ffffffff810a126f>] irq_default_primary_handler

>>   threaded [<ffffffffa023f75e>] sja1000_interrupt [sja1000]

>> [105918.136980] [<ffffffff810a126f>] irq_default_primary_handler

>>   threaded [<ffffffffa023f75e>] sja1000_interrupt [sja1000]

>> [105918.136981] Disabling IRQ #18

>>

>>> Looks like the IRQ is shared with ata_generic.  I don't see any

>>> hard-drive related problems though after the message, only CAN

>>> problems.  If I bring the interface down and then back up with

>>> ifconfig, it works again.

>>>

>>> I see this failure about once per day.

>>

>> Can you use another PCI slot to have an IRQ solely for the CAN devices?

>>

>> What does "ip -d -s link" list?

>>

>> Does the problem show up with a vanilla (non-rt) kernel as well?

>>

>> Does the problem show up if an app is reading the CAN messages?

>>

>> Are you able to rebuild your kernel with some additional debug code?

>> (Not sure what to trigger yet, though).

>>

>> Wolfgang.

>>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-10-08 20:47         ` Austin Schuh
  2013-10-09  6:21           ` Wolfgang Grandegger
  2013-10-09  6:31           ` Wolfgang Grandegger
@ 2013-10-09  6:47           ` Wolfgang Grandegger
       [not found]             ` <CANGgnMZpPGctUWGcg7Lp-QFPc7d6A5GeL9KQYnpeYMR8WukgdA@mail.gmail.com>
  2 siblings, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-10-09  6:47 UTC (permalink / raw)
  To: Austin Schuh; +Cc: Oliver Hartkopp, linux-can

Hi Austin,



I had some trouble with my webmail interface, resulting in two incomplete
messages being sent out...





On Tue, 8 Oct 2013 13:47:06 -0700, Austin Schuh <austin@peloton-tech.com>

wrote:

> Hi Wolfgang,

> 

> Thanks for taking the time to help!

> 

> Both CAN channels are active and at least one of them has external

> traffic on the bus when the problem occurs.

> 

> I have installed the PEAK card in an "IntensePC" (

> http://www.fit-pc.com/web/products/intense-pc/ ).  It is in the only

> full sized miniPCIe slot in the PC.

> 

> I'll start some tests to determine the mean failure time so I can try

> a non-rt kernel and give you a definitive answer on whether or not

> that fixes it.  That could take me a day or two given how long it

> takes to reproduce.



Not sure if it's worth the effort... at least for the moment.



> I built the kernel once already, and am comfortable patching it.  I'm

> currently running a Debian Sid kernel (3.10).  I saw the same problem

> with the Debian 3.2 kernel where I backported the sja1000 driver.



OK, I think the SJA1000 gets stuck for some reason. My idea is to do an

ftrace freeze when IRQ 18 is not handled. Could you please add:



    if (irq == 18) {
        pr_info("Unhandled IRQ 18... stop tracing...\n");
        tracing_off();
    }



here:



    http://lxr.free-electrons.com/source/kernel/irq/spurious.c#L291



... actually when the first unhandled IRQ 18 shows up. You can

activate function tracing as shown below:



  # cd /sys/kernel/debug/tracing/

  # echo function > current_tracer



When tracing is stopped by the unhandled IRQ 18, just save the trace:



   # cp trace /tmp/trace.log



You may need to increase the tracing buffer:



   # echo 20000 > buffer_size_kb



This will introduce some overhead. Let's hope that the problem does

still show up.
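The same setup can be scripted. A minimal sketch in C, assuming the tracing
files live in /sys/kernel/debug/tracing and the program runs as root.
tracefs_write() and trace_arm() are hypothetical helper names, not an existing
API, and the directory is a parameter so the code can be tried on a scratch
directory first. It writes buffer_size_kb, which sets the per-CPU buffer size,
and clears old trace content before switching tracing on.

```c
#include <stdio.h>

/* Write one value to a control file below the tracing directory.
 * Returns 0 on success, -1 on failure. */
int tracefs_write(const char *dir, const char *file, const char *value)
{
	char path[512];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/%s", dir, file);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(value, f) == EOF) {
		fclose(f);
		return -1;
	}
	/* Write errors may only show up when the stream is flushed. */
	return fclose(f) == 0 ? 0 : -1;
}

/* Arm the function tracer: select the tracer, size the per-CPU buffer,
 * clear old trace content and switch tracing on.
 * Returns 0 on success, -1 on the first failure. */
int trace_arm(const char *dir, const char *buffer_kb)
{
	if (tracefs_write(dir, "current_tracer", "function\n"))
		return -1;
	if (tracefs_write(dir, "buffer_size_kb", buffer_kb))
		return -1;
	if (tracefs_write(dir, "trace", "0\n"))
		return -1;
	return tracefs_write(dir, "tracing_on", "1\n");
}
```

Calling trace_arm("/sys/kernel/debug/tracing", "20000\n") corresponds to the
commands above. Once tracing_on reads 0, the trace file can be copied away.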



> $ ip -d -s link

...

> 4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state

> UNKNOWN mode DEFAULT qlen 10

>     link/can

>     can <TRIPLE-SAMPLING> state ERROR-ACTIVE (berr-counter tx 0 rx 0)



BTW, why are you using TRIPLE-SAMPLING? IIRC, it's mainly useful for low
CAN bitrates. At 800 kBits/s it's usually not set.
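As a cross-check of the bit-timing values in Austin's "ip -d -s link" output:
the bitrate and sample point follow directly from the time quantum and the
segment lengths, since one bit is 1 tq of sync segment plus prop-seg,
phase-seg1 and phase-seg2. A small sketch in C; the struct and function names
are made up for illustration and are not the kernel's bit-timing API.

```c
/* Bit-timing parameters as printed by "ip -d -s link". */
struct bittiming {
	unsigned int tq_ns;      /* time quantum in nanoseconds */
	unsigned int prop_seg;   /* propagation segment, in tq */
	unsigned int phase_seg1; /* phase buffer segment 1, in tq */
	unsigned int phase_seg2; /* phase buffer segment 2, in tq */
};

/* Time quanta per bit, including the fixed 1 tq sync segment. */
unsigned int tq_per_bit(const struct bittiming *bt)
{
	return 1 + bt->prop_seg + bt->phase_seg1 + bt->phase_seg2;
}

/* Bitrate in bit/s: one second divided by the bit time. */
unsigned long bitrate_bps(const struct bittiming *bt)
{
	return 1000000000UL / ((unsigned long)bt->tq_ns * tq_per_bit(bt));
}

/* Sample point in tenths of a percent (875 means 0.875). The sample
 * point lies at the end of phase-seg1. */
unsigned int sample_point_permille(const struct bittiming *bt)
{
	return 1000 * (1 + bt->prop_seg + bt->phase_seg1) / tq_per_bit(bt);
}

/* Bitrate prescaler: number of CAN clock cycles per time quantum. */
unsigned int brp(const struct bittiming *bt, unsigned long clock_hz)
{
	return (unsigned int)((unsigned long long)bt->tq_ns * clock_hz /
			      1000000000ULL);
}
```

With the can0 values (tq 250, prop-seg 6, phase-seg1 7, phase-seg2 2, clock
8000000) this gives 16 tq per bit, 250000 bit/s, a sample point of 0.875 and a
prescaler of 2. The can1 values (tq 125) give 500000 bit/s.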



Wolfgang.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
       [not found]             ` <CANGgnMZpPGctUWGcg7Lp-QFPc7d6A5GeL9KQYnpeYMR8WukgdA@mail.gmail.com>
@ 2013-11-07  8:15               ` Wolfgang Grandegger
  2013-11-07 23:43                 ` Austin Schuh
  0 siblings, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-11-07  8:15 UTC (permalink / raw)
  To: Austin Schuh; +Cc: Oliver Hartkopp, linux-can

Hi Austin,



On Wed, 6 Nov 2013 16:33:26 -0800, Austin Schuh <austin@peloton-tech.com>

wrote:

> Hi Wolfgang,

> 

> Thanks for the help.  It took me longer than I would like to admit to

> get a nice clean trace, and to fight some fires so I could get back to

> this.

> 

> I disabled tracing where you suggested, and rebuilt the kernel.

> 

> 283   if (unlikely(action_ret == IRQ_NONE)) {

> 284     /*

> 285      * If we are seeing only the odd spurious IRQ caused by

> 286      * bus asynchronicity then don't eventually trigger an error,

> 287      * otherwise the counter becomes a doomsday timer for otherwise

> 288      * working systems

> 289      */

> 290     if (irq == 18) {

> 291       pr_info("Unhandled IRQ 18... stop tracing...\n");

> 292       tracing_off();

> 293     }

> --- snip ---

> 

> I'm getting that there is an unhandled IRQ every time I send/receive a

> packet.  My test setup is that I send from one of the two CAN ports to

> the other using the CLI tools.

> 

> echo 1 > tracing_on; cansend can0 500#1E.10.10.00.00.00.00.01;

> 

> The dump is attached.



I do not see any sja1000_rx() calls. Either they never happen or more

likely the trace is not long enough. Could you try with a larger buffer

using "echo 20000 > buffer_size_kb"? I also do not see any pr_info()
related functions at the end of the trace. Are you sure it has stopped
(cat tracing_on or message in dmesg)?



Also please do an "echo 0 > trace" to clear the trace content.



Wolfgang.



> On Tue, Oct 8, 2013 at 11:47 PM, Wolfgang Grandegger <wg@grandegger.com>

> wrote:

>> Hi Austin,

>>

>> I have some trouble with my WEB mail interface resulting in two incomplete

>> messages sent out...

>>

>>

>> On Tue, 8 Oct 2013 13:47:06 -0700, Austin Schuh <austin@peloton-tech.com>

>> wrote:

>>> Hi Wolfgang,

>>>

>>> Thanks for taking the time to help!

>>>

>>> Both CAN channels are active and at least one of them has external

>>> traffic on the bus when the problem occurs.

>>>

>>> I have installed the PEAK card in an "IntensePC" (

>>> http://www.fit-pc.com/web/products/intense-pc/ ).  It is in the only

>>> full sized miniPCIe slot in the PC.

>>>

>>> I'll start some tests to determine the mean failure time so I can try

>>> a non-rt kernel and give you a definitive answer on whether or not

>>> that fixes it.  That could take me a day or two given how long it

>>> takes to reproduce.

>>

>> Not sure if it's worth the effort... at least for the moment.

>>

>>> I built the kernel once already, and am comfortable patching it.  I'm

>>> currently running a Debian Sid kernel (3.10).  I saw the same problem

>>> with the Debian 3.2 kernel where I backported the sja1000 driver.

>>

>> OK, I think the SJA1000 gets stuck for some reason. My idea is to do an

>> ftrace freeze when IRQ 18 is not handled. Could you please add:

>>

>>     if (irq == 18) {
>>         pr_info("Unhandled IRQ 18... stop tracing...\n");
>>         tracing_off();
>>     }

>>

>> here:

>>

>>     http://lxr.free-electrons.com/source/kernel/irq/spurious.c#L291

>>

>> ... actually when the first unhandled IRQ 18 shows up. You can

>> activate function tracing as shown below:

>>

>>   # cd /sys/kernel/debug/tracing/

>>   # echo function > current_tracer

>>

>> When tracing is stopped by the unhandled IRQ 18, just save the trace:

>>

>>    # cp trace /tmp/trace.log

>>

>> You may need to increase the tracing buffer:

>>

>>    # echo 20000 > buffer_size_kb

>>

>> This will introduce some overhead. Let's hope that the problem does

>> still show up.

>>

>>> $ ip -d -s link

>> ...

>>  4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state

>>> UNKNOWN mode DEFAULT qlen 10

>>>     link/can

>>>     can <TRIPLE-SAMPLING> state ERROR-ACTIVE (berr-counter tx 0 rx 0)

>>

>> BTW. why are you using TRIPLE-SAMPLING. IIRC, it's mainly useful for low

>> CAN bitrates. At 800 kBits/s it's usually not set.

>>

>> Wolfgang.

>>



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-07  8:15               ` Wolfgang Grandegger
@ 2013-11-07 23:43                 ` Austin Schuh
  2013-11-09 14:21                   ` Oliver Hartkopp
  2013-11-09 19:42                   ` Wolfgang Grandegger
  0 siblings, 2 replies; 66+ messages in thread
From: Austin Schuh @ 2013-11-07 23:43 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Oliver Hartkopp, linux-can

[-- Attachment #1: Type: text/plain, Size: 3091 bytes --]

>>
>> The dump is attached.
>
> I do not see any sja1000_rx() calls. Either they never happen or more
> likely the trace is not long enough. Could you try with a larger buffer
> using "echo 20000 > buffer_size_kb"? I also do not see some pr_info()
> related functions at the end of the trace. Are you sure it has stopped
> (cat tracing_on or message in dmesg)?
>
> Also please do an "echo 0 > trace" to clear the trace content.
>
> Wolfgang.

Hi Wolfgang,

I'm pretty certain that the trace is long enough.  I tried again with
echo -e -n 100000 > buffer_size_kb and I still don't see any calls to
sja1000_rx.

I added some pr_info prints at the front of sja1000_rx and
sja1000_interrupt.  For each packet sent and then received, I see the
following.  The following lines are from me sending 4 packets.

Nov  7 15:35:52 vpc5 kernel: [   75.136107] Got an sja1000 interrupt.
Nov  7 15:35:52 vpc5 kernel: [   75.136123] Unhandled IRQ 18... stop tracing...
Nov  7 15:35:52 vpc5 kernel: [   75.136130] Got an sja1000 interrupt.
Nov  7 15:35:52 vpc5 kernel: [   75.136139] Received packet.
Nov  7 15:35:52 vpc5 kernel: [   75.136146] sja1000_rx
Nov  7 15:35:52 vpc5 kernel: [   75.136155] TX complete.
Nov  7 15:35:52 vpc5 kernel: [   75.136174] Returning IRQ_HANDLED
Nov  7 15:35:52 vpc5 kernel: [   75.136207] Returning IRQ_HANDLED


Nov  7 15:35:54 vpc5 kernel: [   77.215328] Got an sja1000 interrupt.
Nov  7 15:35:54 vpc5 kernel: [   77.215345] Got an sja1000 interrupt.
Nov  7 15:35:54 vpc5 kernel: [   77.215348] Unhandled IRQ 18... stop tracing...
Nov  7 15:35:54 vpc5 kernel: [   77.215355] Received packet.
Nov  7 15:35:54 vpc5 kernel: [   77.215360] sja1000_rx
Nov  7 15:35:54 vpc5 kernel: [   77.215394] TX complete.
Nov  7 15:35:54 vpc5 kernel: [   77.215410] Returning IRQ_HANDLED
Nov  7 15:35:54 vpc5 kernel: [   77.215418] Returning IRQ_HANDLED


Nov  7 15:35:57 vpc5 kernel: [   80.614289] Unhandled IRQ 18... stop tracing...
Nov  7 15:35:57 vpc5 kernel: [   80.614297] Got an sja1000 interrupt.
Nov  7 15:35:57 vpc5 kernel: [   80.614308] Got an sja1000 interrupt.
Nov  7 15:35:57 vpc5 kernel: [   80.614323] Received packet.
Nov  7 15:35:57 vpc5 kernel: [   80.614327] TX complete.
Nov  7 15:35:57 vpc5 kernel: [   80.614335] sja1000_rx
Nov  7 15:35:57 vpc5 kernel: [   80.614343] Returning IRQ_HANDLED
Nov  7 15:35:57 vpc5 kernel: [   80.614394] Returning IRQ_HANDLED


Nov  7 15:36:02 vpc5 kernel: [   84.991239] Got an sja1000 interrupt.
Nov  7 15:36:02 vpc5 kernel: [   84.991255] Got an sja1000 interrupt.
Nov  7 15:36:02 vpc5 kernel: [   84.991258] Unhandled IRQ 18... stop tracing...
Nov  7 15:36:02 vpc5 kernel: [   84.991266] Received packet.
Nov  7 15:36:02 vpc5 kernel: [   84.991269] sja1000_rx
Nov  7 15:36:02 vpc5 kernel: [   84.991303] TX complete.
Nov  7 15:36:02 vpc5 kernel: [   84.991320] Returning IRQ_HANDLED
Nov  7 15:36:02 vpc5 kernel: [   84.991326] Returning IRQ_HANDLED

I didn't rerun any traces, since my analysis of the syslog is that it
won't give you the information you are looking for without moving the
tracing_off call somewhere else.

Austin

[-- Attachment #2: sja1000.c.diff --]
[-- Type: application/octet-stream, Size: 3946 bytes --]

*** /tmp/sja1000.c_orig	2013-11-07 15:37:57.238311456 -0800
--- sja1000.c	2013-11-07 15:28:29.586335509 -0800
***************
*** 330,335 ****
--- 330,336 ----
  	uint8_t dreg;
  	canid_t id;
  	int i;
+   pr_info("sja1000_rx\n");
  
  	/* create zero'ed CAN frame buffer */
  	skb = alloc_can_skb(dev, &cf);
***************
*** 493,502 ****
  	struct net_device_stats *stats = &dev->stats;
  	uint8_t isrc, status;
  	int n = 0;
  
  	/* Shared interrupts and IRQ off? */
! 	if (priv->read_reg(priv, SJA1000_IER) == IRQ_OFF)
  		return IRQ_NONE;
  
  	if (priv->pre_irq)
  		priv->pre_irq(priv);
--- 494,506 ----
  	struct net_device_stats *stats = &dev->stats;
  	uint8_t isrc, status;
  	int n = 0;
+   pr_info("Got an sja1000 interrupt.\n");
  
  	/* Shared interrupts and IRQ off? */
! 	if (priv->read_reg(priv, SJA1000_IER) == IRQ_OFF) {
!     pr_info("IRQ none at start\n");
  		return IRQ_NONE;
+   }
  
  	if (priv->pre_irq)
  		priv->pre_irq(priv);
***************
*** 506,516 ****
  		n++;
  		status = priv->read_reg(priv, SJA1000_SR);
  		/* check for absent controller due to hw unplug */
! 		if (status == 0xFF && sja1000_is_absent(priv))
  			return IRQ_NONE;
  
! 		if (isrc & IRQ_WUI)
  			netdev_warn(dev, "wakeup interrupt\n");
  
  		if (isrc & IRQ_TI) {
  			/* transmission buffer released */
--- 510,524 ----
  		n++;
  		status = priv->read_reg(priv, SJA1000_SR);
  		/* check for absent controller due to hw unplug */
! 		if (status == 0xFF && sja1000_is_absent(priv)) {
!       pr_info("SJA1000 IRQ None\n");
  			return IRQ_NONE;
+     }
  
! 		if (isrc & IRQ_WUI) {
!       pr_info("SJA1000 Wakeup interrupt.\n");
  			netdev_warn(dev, "wakeup interrupt\n");
+     }
  
  		if (isrc & IRQ_TI) {
  			/* transmission buffer released */
***************
*** 518,545 ****
  			    !(status & SR_TCS)) {
  				stats->tx_errors++;
  				can_free_echo_skb(dev, 0);
  			} else {
  				/* transmission complete */
  				stats->tx_bytes +=
  					priv->read_reg(priv, SJA1000_FI) & 0xf;
  				stats->tx_packets++;
  				can_get_echo_skb(dev, 0);
  			}
  			netif_wake_queue(dev);
  			can_led_event(dev, CAN_LED_EVENT_TX);
  		}
  		if (isrc & IRQ_RI) {
  			/* receive interrupt */
  			while (status & SR_RBS) {
  				sja1000_rx(dev);
  				status = priv->read_reg(priv, SJA1000_SR);
  				/* check for absent controller */
! 				if (status == 0xFF && sja1000_is_absent(priv))
  					return IRQ_NONE;
  			}
  		}
  		if (isrc & (IRQ_DOI | IRQ_EI | IRQ_BEI | IRQ_EPI | IRQ_ALI)) {
  			/* error interrupt */
  			if (sja1000_err(dev, isrc, status))
  				break;
  		}
--- 526,559 ----
  			    !(status & SR_TCS)) {
  				stats->tx_errors++;
  				can_free_echo_skb(dev, 0);
+         pr_info("TX buffer released.\n");
  			} else {
  				/* transmission complete */
  				stats->tx_bytes +=
  					priv->read_reg(priv, SJA1000_FI) & 0xf;
  				stats->tx_packets++;
  				can_get_echo_skb(dev, 0);
+         pr_info("TX complete.\n");
  			}
  			netif_wake_queue(dev);
  			can_led_event(dev, CAN_LED_EVENT_TX);
  		}
  		if (isrc & IRQ_RI) {
  			/* receive interrupt */
+       pr_info("Received packet.\n");
  			while (status & SR_RBS) {
  				sja1000_rx(dev);
  				status = priv->read_reg(priv, SJA1000_SR);
  				/* check for absent controller */
! 				if (status == 0xFF && sja1000_is_absent(priv)) {
!           pr_info("IRQ None rx\n");
  					return IRQ_NONE;
+         }
  			}
  		}
  		if (isrc & (IRQ_DOI | IRQ_EI | IRQ_BEI | IRQ_EPI | IRQ_ALI)) {
  			/* error interrupt */
+       pr_info("Error interrupt.\n");
  			if (sja1000_err(dev, isrc, status))
  				break;
  		}
***************
*** 551,556 ****
--- 565,575 ----
  	if (n >= SJA1000_MAX_IRQ)
  		netdev_dbg(dev, "%d messages handled in ISR", n);
  
+   if (n) {
+     pr_info("Returning IRQ_HANDLED\n");
+   } else {
+     pr_info("Returning IRQ_NONE\n");
+   }
  	return (n) ? IRQ_HANDLED : IRQ_NONE;
  }
  EXPORT_SYMBOL_GPL(sja1000_interrupt);

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-07 23:43                 ` Austin Schuh
@ 2013-11-09 14:21                   ` Oliver Hartkopp
  2013-11-12  2:59                     ` Austin Schuh
  2013-11-09 19:42                   ` Wolfgang Grandegger
  1 sibling, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-11-09 14:21 UTC (permalink / raw)
  To: Austin Schuh; +Cc: Wolfgang Grandegger, linux-can

Hello Austin,

from the trace I assume that you connected the two CAN interfaces with each
other and you get the tx-complete interrupt very close to the rx interrupt.

(1) Is this assumed setup correct?

From the trace it is pretty hard to know which CAN interface is in charge.
(2) Can you please add the output of dev->ifindex in the pr_info() calls?

From your first post I was able to read your current kernel as:
> $ uname -a
> Linux vpc3 3.10-3-rt-amd64 #1 SMP PREEMPT RT Debian 3.10.11-1
> (2013-09-10) x86_64 GNU/Linux

(3) Is it possible to use a standard (non-RT) kernel on your system to confirm
this issue on an unmodified system?

On 08.11.2013 00:43, Austin Schuh wrote:

> I added some pr_info prints at the front of sja1000_rx and
> sja1000_interrupt.  For each packet sent and then received, I see the
> following.  The following lines are from me sending 4 packets.
> 
> Nov  7 15:35:52 vpc5 kernel: [   75.136107] Got an sja1000 interrupt.
(which interface?)

> Nov  7 15:35:52 vpc5 kernel: [   75.136123] Unhandled IRQ 18... stop tracing...
> Nov  7 15:35:52 vpc5 kernel: [   75.136130] Got an sja1000 interrupt.
(which interface?)

> Nov  7 15:35:52 vpc5 kernel: [   75.136139] Received packet.
(which interface?)

> Nov  7 15:35:52 vpc5 kernel: [   75.136146] sja1000_rx
(which interface?)

> Nov  7 15:35:52 vpc5 kernel: [   75.136155] TX complete.
(which interface?)

> Nov  7 15:35:52 vpc5 kernel: [   75.136174] Returning IRQ_HANDLED
> Nov  7 15:35:52 vpc5 kernel: [   75.136207] Returning IRQ_HANDLED
(which interface?)

Thanks,
Oliver


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-07 23:43                 ` Austin Schuh
  2013-11-09 14:21                   ` Oliver Hartkopp
@ 2013-11-09 19:42                   ` Wolfgang Grandegger
       [not found]                     ` <CANGgnMbb+VResUC6h+cK6Hfe5PLJx9R9ao6bMdJM2e5BPaDamw@mail.gmail.com>
  1 sibling, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-11-09 19:42 UTC (permalink / raw)
  To: Austin Schuh; +Cc: Oliver Hartkopp, linux-can

Hi Austin,

On 11/08/2013 12:43 AM, Austin Schuh wrote:
>>>
>>> The dump is attached.
>>
>> I do not see any sja1000_rx() calls. Either they never happen or more
>> likely the trace is not long enough. Could you try with a larger buffer
>> using "echo 20000 > buffer_size_kb"? I also do not see some pr_info()
>> related functions at the end of the trace. Are you sure it has stopped
>> (cat tracing_on or message in dmesg)?
>>
>> Also please do an "echo 0 > trace" to clear the trace content.
>>
>> Wolfgang.
> 
> Hi Wolfgang,
> 
> I'm pretty certain that the trace is long enough.  I tried again with
> echo -e -n 100000 > buffer_size_kb and I still don't see any calls to
> sja1000_rx.
> 
> I added some pr_info prints at the front of sja1000_rx and
> sja1000_interrupt.  For each packet sent and then received, I see the
> following.  The following lines are from me sending 4 packets.

sja1000_interrupt() is normally called for each SJA1000 device (shared
interrupt).

> Nov  7 15:35:52 vpc5 kernel: [   75.136107] Got an sja1000 interrupt.
> Nov  7 15:35:52 vpc5 kernel: [   75.136123] Unhandled IRQ 18... stop tracing...

I'm confused. Why is the IRQ "unhandled" without calling
sja1000_interrupt() twice? Ah, this is due to threaded interrupt
handling, where the spurious interrupt check is called for each
handler/device. Therefore the trigger is simply bad.
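
To make that concrete, here is a simplified user-space model in C. It is not
the kernel's code, and the function names are hypothetical. It compares a
check that runs once per handler with a check that looks at the shared line
as a whole.

```c
#include <stddef.h>

/* Mirrors the two handler return values that matter here. */
enum irqreturn { IRQ_NONE = 0, IRQ_HANDLED = 1 };

/* Per-handler check: every handler that returns IRQ_NONE counts as an
 * "unhandled" hit, which is how the debug trigger behaves when it runs
 * once per handler thread. Returns the number of hits. */
size_t unhandled_per_handler(const enum irqreturn *ret, size_t n)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++)
		if (ret[i] == IRQ_NONE)
			hits++;
	return hits;
}

/* Per-line check: the interrupt is unhandled only if no handler on the
 * shared line claimed it. Returns 1 if unhandled, 0 otherwise. */
int unhandled_per_line(const enum irqreturn *ret, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (ret[i] == IRQ_HANDLED)
			return 0;
	return 1;
}
```

For IRQ 18 with three handlers, where for example the ATA handler returns
IRQ_NONE and both SJA1000 handlers return IRQ_HANDLED, the per-handler check
reports one unhandled result although the interrupt was serviced, while the
per-line check reports nothing. That would match the single "Unhandled IRQ 18"
line per packet in the log, assuming it comes from the ATA handler.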

> Nov  7 15:35:52 vpc5 kernel: [   75.136130] Got an sja1000 interrupt.
> Nov  7 15:35:52 vpc5 kernel: [   75.136139] Received packet.
> Nov  7 15:35:52 vpc5 kernel: [   75.136146] sja1000_rx
> Nov  7 15:35:52 vpc5 kernel: [   75.136155] TX complete.
> Nov  7 15:35:52 vpc5 kernel: [   75.136174] Returning IRQ_HANDLED
> Nov  7 15:35:52 vpc5 kernel: [   75.136207] Returning IRQ_HANDLED
...

This is the working case. What does it look like when the device stops
receiving messages? You should also label the device.

> I didn't rerun any traces, since my analysis of the syslog is that it
> won't give you the information you are looking for without moving the
> tracing_off call somewhere else.

Well, we know that messages are received properly for some time. We need
to trigger on the malfunction. As I understood it, the interrupts get
stuck only about once a day, right?

Wolfgang.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-09 14:21                   ` Oliver Hartkopp
@ 2013-11-12  2:59                     ` Austin Schuh
  2013-11-12 21:26                       ` Oliver Hartkopp
  0 siblings, 1 reply; 66+ messages in thread
From: Austin Schuh @ 2013-11-12  2:59 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Wolfgang Grandegger, linux-can

Hi Oliver,

On Sat, Nov 9, 2013 at 6:21 AM, Oliver Hartkopp <socketcan@hartkopp.net> wrote:
> Hello Austin,
>
> from the trace I assume that you connected the two CAN interfaces with each
> other and you get the tx-complete interrupt very close to the rx interrupt.
>
> (1) Is this assumed setup correct?

Correct, it is.

> From the trace it is pretty hard to know which CAN interface is in charge.
> (2) Can you please add the output of dev->ifindex in the pr_info() calls?

Gladly.  See the updated logs.

[  556.019246] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
[  556.019268] Unhandled IRQ 18... stop tracing...
[  556.019280] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
[  556.019289] peak_pci 0000:05:00.0 can1: Received packet.
[  556.019299] peak_pci 0000:05:00.0 can1: sja1000_rx
[  556.019307] peak_pci 0000:05:00.0 can0: TX complete.
[  556.019318] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
[  556.019362] peak_pci 0000:05:00.0 can1: Returning IRQ_HANDLED

[  785.760244] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
[  785.760249] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
[  785.760266] Unhandled IRQ 18... stop tracing...
[  785.760271] peak_pci 0000:05:00.0 can1: Received packet.
[  785.760286] peak_pci 0000:05:00.0 can1: sja1000_rx
[  785.760303] peak_pci 0000:05:00.0 can0: TX complete.
[  785.760317] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
[  785.760354] peak_pci 0000:05:00.0 can1: Returning IRQ_HANDLED

[  805.392135] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
[  805.392176] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
[  805.392181] peak_pci 0000:05:00.0 can1: Received packet.
[  805.392195] peak_pci 0000:05:00.0 can1: sja1000_rx
[  805.392210] Unhandled IRQ 18... stop tracing...
[  805.392248] peak_pci 0000:05:00.0 can0: TX complete.
[  805.392268] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
[  805.392271] peak_pci 0000:05:00.0 can1: Returning IRQ_HANDLED

> From your first post I was able to read your current kernel as:
>> $ uname -a
>> Linux vpc3 3.10-3-rt-amd64 #1 SMP PREEMPT RT Debian 3.10.11-1
>> (2013-09-10) x86_64 GNU/Linux
>
> (3) Is it possible to use a standard (non-RT) kernel on your system to confirm
> this issue on a unmodified system?

It will be a bit of time before I can schedule enough time on the PC
in question with a non-realtime kernel.  I'll see what I can do.

Thanks!
    Austin

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-12  2:59                     ` Austin Schuh
@ 2013-11-12 21:26                       ` Oliver Hartkopp
  2013-11-12 23:22                         ` Austin Schuh
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-11-12 21:26 UTC (permalink / raw)
  To: Austin Schuh; +Cc: Wolfgang Grandegger, linux-can

On 12.11.2013 03:59, Austin Schuh wrote:

>> From the trace it is pretty hard to know which CAN interface is in charge.
>> (2) Can you please add the output of dev->ifindex in the pr_info() calls?
> 
> Gladly.  See the updated logs.
> 
> [  556.019246] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
> [  556.019268] Unhandled IRQ 18... stop tracing...
> [  556.019280] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
> [  556.019289] peak_pci 0000:05:00.0 can1: Received packet.
> [  556.019299] peak_pci 0000:05:00.0 can1: sja1000_rx
> [  556.019307] peak_pci 0000:05:00.0 can0: TX complete.
> [  556.019318] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
> [  556.019362] peak_pci 0000:05:00.0 can1: Returning IRQ_HANDLED
> 

This looks pretty broken regarding the IRQ handling.
Maybe the IRQ thread handling has a real problem in the -rt kernel ?!?

>> (3) Is it possible to use a standard (non-RT) kernel on your system to confirm
>> this issue on a unmodified system?
> 
> It will be a bit of time before I can schedule enough time on the PC
> in question with a non-realtime kernel.  I'll see what I can do.

That would be valuable information for digging deeper into this problem.

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
       [not found]                     ` <CANGgnMbb+VResUC6h+cK6Hfe5PLJx9R9ao6bMdJM2e5BPaDamw@mail.gmail.com>
@ 2013-11-12 22:15                       ` Wolfgang Grandegger
  0 siblings, 0 replies; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-11-12 22:15 UTC (permalink / raw)
  To: Austin Schuh; +Cc: Oliver Hartkopp, linux-can

Hi Austin,

On 11/12/2013 03:59 AM, Austin Schuh wrote:
> Hi Wolfgang,
> 
> On Sat, Nov 9, 2013 at 11:42 AM, Wolfgang Grandegger <wg@grandegger.com> wrote:
>> Hi Austin,
>>
>> On 11/08/2013 12:43 AM, Austin Schuh wrote:
>>>>>
>>>>> The dump is attached.
>>>>
>>>> I do not see any sja1000_rx() calls. Either they never happen or more
>>>> likely the trace is not long enough. Could you try with a larger buffer
>>>> using "echo 20000 > buffer_size_kb"? I also do not see some pr_info()
>>>> related functions at the end of the trace. Are you sure is has stopped
>>>> (cat tracing_on or message in dmesg)?
>>>>
>>>> Also please do an "echo 0 > trace" to clear the trace content.
>>>>
>>>> Wolfgang.
>>>
>>> Hi Wolfgang,
>>>
>>> I'm pretty certain that the trace is long enough.  I tried again with
>>> echo -e -n 100000 > buffer_size_kb and I still don't see any calls to
>>> sja1000_rx.
>>>
>>> I added some pr_info prints at the front of sja1000_rx and
>>> sja1000_interrupt.  For each packet sent and then received, I see the
>>> following.  The following lines are from me sending 4 packets.
>>
>> sja1000_interrupt() is normally called for each SJA1000 device (shared
>> interrupt).
>>
>>> Nov  7 15:35:52 vpc5 kernel: [   75.136107] Got an sja1000 interrupt.
>>> Nov  7 15:35:52 vpc5 kernel: [   75.136123] Unhandled IRQ 18... stop tracing...
>>
>> I'm confused. Why is the IRQ "unhandled" without calling
>> sja1000_interrupt() twice. Ah, this is due to threaded interrupt
>> handling, where the spurious interrupt check is called for each
>> handler/device. Therefore the trigger is simply bad.
>>
>>> Nov  7 15:35:52 vpc5 kernel: [   75.136130] Got an sja1000 interrupt.
>>> Nov  7 15:35:52 vpc5 kernel: [   75.136139] Received packet.
>>> Nov  7 15:35:52 vpc5 kernel: [   75.136146] sja1000_rx
>>> Nov  7 15:35:52 vpc5 kernel: [   75.136155] TX complete.
>>> Nov  7 15:35:52 vpc5 kernel: [   75.136174] Returning IRQ_HANDLED
>>> Nov  7 15:35:52 vpc5 kernel: [   75.136207] Returning IRQ_HANDLED
>> ...
>>
>> This is the working case. How does it look like when the device stops
>> receiving messages. You should also label the device.
> 
> I haven't been able to run the code long enough with all the debug
> turned on because my syslog grows way too big.  This only happens on
> the PC connected to the highly loaded CAN bus.  I have another PC
> connected to a lightly loaded bus, and I don't see nearly as many
> problems with that PC.  I just figured out how to not fill my syslog
> up, so I'll get back to you when I have something interesting to show.
> 
>>> I didn't rerun any traces, since my analysis of the syslog is that it
>>> won't give you the information you are looking for without moving the
>>> tracing_off call somewhere else.
>>
>> Well, we know that messages are received properly for some time. We need
>> to trigger the malfunctioning. I understood that it happens just once a
>> day that the interrupts get stuck, right?
>>
>> Wolfgang.
> 
> I now have a number of datapoints.  These are the timestamps from when
> it disabled the IRQ.  After each of these, I rebooted the machine to
> be safe, though it works to instead take the interface down and bring
> it back up again.  I'm seeing anywhere from 13 hours - 2 hours.
> 
> [47430.043798] Disabling IRQ #18
> [21439.136781] Disabling IRQ #18
> [5641.564900] Disabling IRQ #18
> [45129.357272] Disabling IRQ #18

Be aware that these messages are printed only after more than 99900 of
the last 100000 interrupts have gone unhandled. See:

  http://lxr.linux.no/#linux+v3.12/kernel/irq/spurious.c#L308

Either the SJA1000 is stuck or the interrupt does not go away. Maybe we
have a race with interrupt acknowledgement.

> I moved the trace disable to the end of sja1000_interrupt when I get
> an interrupt from can1, since that is the last event in the trace.
> Attached is the new trace.  It should have more interesting
> information in it.

This disables the trace after the first interrupt served for CAN1.
Or am I missing something? What do you want to trigger on?

> From the syslog:
> Nov 11 17:56:14 vpc5 kernel: [   79.069060] Unhandled IRQ 18...
> irqs_unhandled: 1, irq_count: 27
> Nov 11 17:56:14 vpc5 kernel: [   79.069097] peak_pci 0000:05:00.0
> can0: Got an sja1000 interrupt.
> Nov 11 17:56:14 vpc5 kernel: [   79.069113] peak_pci 0000:05:00.0
> can1: Got an sja1000 interrupt.
> Nov 11 17:56:14 vpc5 kernel: [   79.069132] peak_pci 0000:05:00.0
> can1: Received packet.
> Nov 11 17:56:14 vpc5 kernel: [   79.069157] peak_pci 0000:05:00.0
> can0: TX complete.
> Nov 11 17:56:14 vpc5 kernel: [   79.069161] peak_pci 0000:05:00.0
> can1: sja1000_rx
> Nov 11 17:56:14 vpc5 kernel: [   79.069199] peak_pci 0000:05:00.0
> can0: Returning IRQ_HANDLED
> Nov 11 17:56:14 vpc5 kernel: [   79.069242] peak_pci 0000:05:00.0
> can1: Returning IRQ_HANDLED
> Nov 11 17:56:14 vpc5 kernel: [   79.069269] peak_pci 0000:05:00.0
> can1: Found can1, disabling tracing.

If you use ftrace, please remove these printk()s.

> Instead of sending packets between the CAN interfaces, I grabbed
> another CAN device and am using it to send packets on the bus.  This
> is the configuration that I have been seeing the real failures in.  I
> now have 2 machines with the CAN device attached to see if I can catch
> the kprintfs when it dies.  Hopefully one of them will have died by
> tomorrow morning and I'll be able to find something interesting.

It's good to make the test case as simple as possible. If I understand
correctly, only CAN1 receives messages in that setup.

Concerning the trigger: maybe it is already sufficient to stop the trace
after 10 (or even fewer) unhandled IRQ 18 events by adding:

	if (unlikely(desc->irqs_unhandled == 10)) {
		tracing_off();
		printk("%s: tracing stopped\n", __func__);
	}


in note_interrupt here:

  http://lxr.linux.no/#linux+v3.12/kernel/irq/spurious.c#L306.

Other things to check:

- do the following checks in sja1000_interrupt ever return IRQ_NONE?

  /* check for absent controller due to hw unplug */
  if (status == 0xFF && sja1000_is_absent(priv))
          return IRQ_NONE;

- What is the value of the PITA_ICR register when the device is stuck
  (IRQ 18 disabled)? See:

  http://lxr.linux.no/#linux/drivers/net/can/sja1000/peak_pci.c#L539

Hope it helps to find the problem.

Wolfgang.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-12 21:26                       ` Oliver Hartkopp
@ 2013-11-12 23:22                         ` Austin Schuh
  2013-11-13  3:41                           ` Austin Schuh
  2013-11-13  6:44                           ` Oliver Hartkopp
  0 siblings, 2 replies; 66+ messages in thread
From: Austin Schuh @ 2013-11-12 23:22 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Wolfgang Grandegger, linux-can

On Tue, Nov 12, 2013 at 1:26 PM, Oliver Hartkopp <socketcan@hartkopp.net> wrote:
> On 12.11.2013 03:59, Austin Schuh wrote:
>
>>> From the trace it is pretty hard to know which CAN interface is in charge.
>>> (2) Can you please add the output of dev->ifindex in the pr_info() calls?
>>
>> Gladly.  See the updated logs.
>>
>> [  556.019246] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
>> [  556.019268] Unhandled IRQ 18... stop tracing...
>> [  556.019280] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
>> [  556.019289] peak_pci 0000:05:00.0 can1: Received packet.
>> [  556.019299] peak_pci 0000:05:00.0 can1: sja1000_rx
>> [  556.019307] peak_pci 0000:05:00.0 can0: TX complete.
>> [  556.019318] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
>> [  556.019362] peak_pci 0000:05:00.0 can1: Returning IRQ_HANDLED
>>
>
> This looks pretty broken regarding the IRQ handling.
> Maybe the IRQ thread handling has a real problem in the -rt kernel ?!?

Sounds pretty plausible right now.

>>> (3) Is it possible to use a standard (non-RT) kernel on your system to confirm
>>> this issue on a unmodified system?
>>
>> It will be a bit of time before I can schedule enough time on the PC
>> in question with a non-realtime kernel.  I'll see what I can do.
>
> That would be a valuable information to go deeper into this problem.

Here is what is in the syslog from the same machine sending to itself
with a non-realtime kernel.

# uname -a
Linux vpc5 3.10-3-amd64 #1 SMP Debian 3.10.11-2 (2013-09-10) x86_64 GNU/Linux

<6>[  169.993870] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
<6>[  169.993904] peak_pci 0000:05:00.0 can1: Received packet.
<6>[  169.993923] peak_pci 0000:05:00.0 can1: sja1000_rx
<6>[  169.993994] peak_pci 0000:05:00.0 can1: Returning IRQ_HANDLED
<6>[  169.994013] peak_pci 0000:05:00.0 can1: Found can1, disabling tracing.
<6>[  169.994029] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
<6>[  169.994048] peak_pci 0000:05:00.0 can0: TX complete.
<6>[  169.994061] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED

<6>[  528.362919] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
<6>[  528.362953] peak_pci 0000:05:00.0 can1: Received packet.
<6>[  528.362970] peak_pci 0000:05:00.0 can1: sja1000_rx
<6>[  528.363038] peak_pci 0000:05:00.0 can1: Returning IRQ_HANDLED
<6>[  528.363056] peak_pci 0000:05:00.0 can1: Found can1, disabling tracing.
<6>[  528.363071] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
<6>[  528.363088] peak_pci 0000:05:00.0 can0: TX complete.
<6>[  528.363100] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED

When I attach the CAN device which continually sends, I get the
following.  (The front might be snipped improperly, not sure.)

<6>[  640.823323] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
<6>[  640.823331] peak_pci 0000:05:00.0 can1: Returning IRQ_NONE
<6>[  640.823334] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
<6>[  640.823344] peak_pci 0000:05:00.0 can0: Received packet.
<6>[  640.823346] peak_pci 0000:05:00.0 can0: sja1000_rx
<6>[  640.823391] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
<6>[  640.823581] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
<6>[  640.823597] peak_pci 0000:05:00.0 can1: Returning IRQ_NONE
<6>[  640.823604] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
<6>[  640.823618] peak_pci 0000:05:00.0 can0: Received packet.
<6>[  640.823624] peak_pci 0000:05:00.0 can0: sja1000_rx
<6>[  640.823684] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
<6>[  640.823818] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
<6>[  640.823827] peak_pci 0000:05:00.0 can1: Returning IRQ_NONE
<6>[  640.823829] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
<6>[  640.823838] peak_pci 0000:05:00.0 can0: Received packet.
<6>[  640.823840] peak_pci 0000:05:00.0 can0: sja1000_rx
<6>[  640.823880] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED

Looks like this is related to running a realtime kernel.

> Thanks,
> Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-12 23:22                         ` Austin Schuh
@ 2013-11-13  3:41                           ` Austin Schuh
  2013-11-13  6:58                             ` Oliver Hartkopp
  2013-11-13  6:44                           ` Oliver Hartkopp
  1 sibling, 1 reply; 66+ messages in thread
From: Austin Schuh @ 2013-11-13  3:41 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Wolfgang Grandegger, linux-can

On Tue, Nov 12, 2013 at 3:22 PM, Austin Schuh <austin@peloton-tech.com> wrote:
> On Tue, Nov 12, 2013 at 1:26 PM, Oliver Hartkopp <socketcan@hartkopp.net> wrote:
>> On 12.11.2013 03:59, Austin Schuh wrote:
>>
>>>> From the trace it is pretty hard to know which CAN interface is in charge.
>>>> (2) Can you please add the output of dev->ifindex in the pr_info() calls?
>>>
>>> Gladly.  See the updated logs.
>>>
>>> [  556.019246] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
>>> [  556.019268] Unhandled IRQ 18... stop tracing...
>>> [  556.019280] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
>>> [  556.019289] peak_pci 0000:05:00.0 can1: Received packet.
>>> [  556.019299] peak_pci 0000:05:00.0 can1: sja1000_rx
>>> [  556.019307] peak_pci 0000:05:00.0 can0: TX complete.
>>> [  556.019318] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
>>> [  556.019362] peak_pci 0000:05:00.0 can1: Returning IRQ_HANDLED
>>>
>>
>> This looks pretty broken regarding the IRQ handling.
>> Maybe the IRQ thread handling has a real problem in the -rt kernel ?!?
>
> Sounds pretty plausible right now.

Ok, I spent a good chunk of today reading the IRQ handling code in the
kernel, and I think I get what is happening and have a plausible
explanation for why the interrupt is getting disabled.  Not sure how
to test it.

Here is what it looks like is happening.  The hardware triggers an
interrupt.  The handler is called, and then the registered action for
each of the devices is to notify their threads that an IRQ occurred,
and to have them handle it.  Each of the handling threads then calls
the sja1000_interrupt function, or the equivalent ata_generic
interrupt function.  2 of the 3 interrupt functions then return
IRQ_NONE, and one of them returns IRQ_HANDLED.  note_interrupt is then
called in each of the threads (instead of being called once in the
non-rt case), resulting in 2 unhandled calls, and 1 handled call.  So
far, so good.  The kernel operates as expected, since less than 99.9 %
of the interrupts are unhandled.  (There is a note_interrupt call in the
handler, but since the threaded handlers are notified, this doesn't
get counted.)

Since the IRQ handlers are now all in threads, if the thread that
actually receives data doesn't process the interrupts either because
something goes wrong, or because it doesn't get scheduled, there will
be a bunch of unhandled interrupts noted, and no handled interrupts
noted.  This will cause the IRQ to be disabled.
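
The accounting described above can be sketched in userspace.  This is a
minimal model, not kernel code: struct irq_model, note_interrupt_model()
and irq_gets_disabled() are hypothetical names, the 100000/99900
thresholds are the ones in kernel/irq/spurious.c, and any time-based
handling in the real note_interrupt() is left out.  It assumes three
handlers on the line, of which only one ever has work to do.

```c
#include <stdbool.h>

/* Userspace model only; the fields loosely mirror struct irq_desc. */
struct irq_model {
	unsigned int irq_count;
	unsigned int irqs_unhandled;
	bool disabled;
};

/* Simplified note_interrupt(): count, and judge every 100000 notes. */
static void note_interrupt_model(struct irq_model *d, bool handled)
{
	if (!handled)
		d->irqs_unhandled++;
	if (++d->irq_count < 100000)
		return;
	d->irq_count = 0;
	if (d->irqs_unhandled > 99900)
		d->disabled = true;	/* "Disabling IRQ #18" */
	d->irqs_unhandled = 0;
}

/*
 * Feed nr_irqs hardware interrupts to three handlers sharing one line.
 * Handler 0 is the one with work to do; the other two return IRQ_NONE.
 * threaded: each handler's result is noted separately (the -rt case).
 * otherwise: the combined result is noted once per hardware interrupt.
 * stalled: handler 0 never reports in (not scheduled, or stuck).
 * Returns true if the model would have disabled the IRQ.
 */
static bool irq_gets_disabled(unsigned int nr_irqs, bool threaded, bool stalled)
{
	struct irq_model d = { 0, 0, false };
	unsigned int i;

	for (i = 0; i < nr_irqs && !d.disabled; i++) {
		if (threaded) {
			if (!stalled)
				note_interrupt_model(&d, true);
			note_interrupt_model(&d, false);
			note_interrupt_model(&d, false);
		} else {
			note_interrupt_model(&d, !stalled);
		}
	}
	return d.disabled;
}
```

Under these assumptions, irq_gets_disabled(200000, true, false) stays
false because only about two thirds of the notes are unhandled, while
irq_gets_disabled(50000, true, true) turns true: with the one useful
thread stalled, all 100000 notes in the window are unhandled.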

I guess the next interesting thing to do is to trigger when it
disables the IRQ and take a look at what is happening.  I have a test
running on one machine with tracing enabled which will disable tracing
when the IRQ is disabled.  That should provide some interesting
results.  I think I also know how to bypass it for now by setting
"noirqdebug", but I'd like to fix it for real as well.

Austin

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-12 23:22                         ` Austin Schuh
  2013-11-13  3:41                           ` Austin Schuh
@ 2013-11-13  6:44                           ` Oliver Hartkopp
  2013-11-13  8:11                             ` Wolfgang Grandegger
  1 sibling, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-11-13  6:44 UTC (permalink / raw)
  To: Austin Schuh, Wolfgang Grandegger; +Cc: linux-can



On 13.11.2013 00:22, Austin Schuh wrote:

> Here is what is in the syslog from the same machine sending to it's
> self with a non-realtime kernel.
> 
> # uname -a
> Linux vpc5 3.10-3-amd64 #1 SMP Debian 3.10.11-2 (2013-09-10) x86_64 GNU/Linux
> 
> <6>[  169.993870] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
> <6>[  169.993904] peak_pci 0000:05:00.0 can1: Received packet.
> <6>[  169.993923] peak_pci 0000:05:00.0 can1: sja1000_rx
> <6>[  169.993994] peak_pci 0000:05:00.0 can1: Returning IRQ_HANDLED
> <6>[  169.994013] peak_pci 0000:05:00.0 can1: Found can1, disabling tracing.
> <6>[  169.994029] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
> <6>[  169.994048] peak_pci 0000:05:00.0 can0: TX complete.
> <6>[  169.994061] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED

This looks indeed much better :-)

> 
> When I attach the CAN device which continually sends, I get the
> following.  (The front might be snipped improperly, not sure.)
> 
> <6>[  640.823323] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
> <6>[  640.823331] peak_pci 0000:05:00.0 can1: Returning IRQ_NONE
> <6>[  640.823334] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
> <6>[  640.823344] peak_pci 0000:05:00.0 can0: Received packet.
> <6>[  640.823346] peak_pci 0000:05:00.0 can0: sja1000_rx
> <6>[  640.823391] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED

This is correct behaviour too:

The shared IRQ is handled by can1 (which had nothing to do) and then the chain
goes to can0 which handles the reception correctly.

@Wolfgang: Do we need an additional protection for the PITA handling in
peak_pci.c ?

@Austin:
I have another idea to test whether it only shows up in the mainline driver:

Can you please download the out-of-tree driver from PEAK (version 7.9):

http://www.peak-system.com/linux/index.htm

I wonder if this one actually compiles with a -rt kernel and if it shows
the same issue then.

Best regards,
Oliver


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-13  3:41                           ` Austin Schuh
@ 2013-11-13  6:58                             ` Oliver Hartkopp
  2013-11-13  9:48                               ` Kurt Van Dijck
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-11-13  6:58 UTC (permalink / raw)
  To: Austin Schuh; +Cc: Wolfgang Grandegger, linux-can

Hi Austin,

sorry for checking my mails in sequential order :-)
I would have been able to shorten the last mail.

Thanks for your interesting investigation.
I wonder why this problem did not show up before then. Having shared
interrupts should be a usual thing.

This kind of race condition should not be there at all. Do you have a second
peak_pci hardware? It could be an idea to try to split the IRQs in a way that
you have two IRQs for two cards - and then connect can0 to can2.
You would have a pretty fast following RX/TX interrupt but without interrupt
sharing ...

Best regards,
Oliver

On 13.11.2013 04:41, Austin Schuh wrote:
> On Tue, Nov 12, 2013 at 3:22 PM, Austin Schuh <austin@peloton-tech.com> wrote:
>> On Tue, Nov 12, 2013 at 1:26 PM, Oliver Hartkopp <socketcan@hartkopp.net> wrote:
>>> On 12.11.2013 03:59, Austin Schuh wrote:
>>>
>>>>> From the trace it is pretty hard to know which CAN interface is in charge.
>>>>> (2) Can you please add the output of dev->ifindex in the pr_info() calls?
>>>>
>>>> Gladly.  See the updated logs.
>>>>
>>>> [  556.019246] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
>>>> [  556.019268] Unhandled IRQ 18... stop tracing...
>>>> [  556.019280] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
>>>> [  556.019289] peak_pci 0000:05:00.0 can1: Received packet.
>>>> [  556.019299] peak_pci 0000:05:00.0 can1: sja1000_rx
>>>> [  556.019307] peak_pci 0000:05:00.0 can0: TX complete.
>>>> [  556.019318] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
>>>> [  556.019362] peak_pci 0000:05:00.0 can1: Returning IRQ_HANDLED
>>>>
>>>
>>> This looks pretty broken regarding the IRQ handling.
>>> Maybe the IRQ thread handling has a real problem in the -rt kernel ?!?
>>
>> Sounds pretty plausible right now.
> 
> Ok, I spent a good chunk of today reading the IRQ handling code in the
> kernel, and I think I get what is happening and have a plausible
> explanation for why the interrupt is getting disabled.  Not sure how
> to test it.
> 
> Here is what it looks like is happening.  The hardware triggers an
> interrupt.  The handler is called, and then the registered action for
> each of the devices is to notify their threads that an IRQ occurred,
> and to have them handle it.  Each of the handling threads then calls
> the sja1000_interrupt function, or the equivalent ata_generic
> interrupt function.  2 of the 3 interrupt functions then return
> IRQ_NONE, and one of them returns IRQ_HANDLED.  note_interrupt is then
> called in each of the threads (instead of being called once in the
> non-rt case), resulting in 2 unhanded calls, and 1 handled call.  So
> far, so good.  The kernel operates as expected, since less than 99.9 %
> of the interrupts are handled.  (There is a note_interrupt call in the
> handler, but since the threaded handlers are notified, this doesn't
> get counted.
> 
> Since the IRQ handlers are now all in threads, if the thread that
> actually receives data doesn't process the interrupts either because
> something goes wrong, or because it doesn't get scheduled, there will
> be a bunch of unhanded interrupts noted, and no handled interrupts
> noted.  This will cause the IRQ to be disabled.
> 
> I guess the next interesting thing to do is to trigger when it
> disables the IRQ and take a look at what is happening.  I have a test
> running on one machine with tracing enabled which will disable tracing
> when the IRQ is disabled.  That should provide some interesting
> results.  I think I also know how to bypass it for now by setting
> "noirqdebug", but I'd like to fix it for real as well.
> 
> Austin
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-13  6:44                           ` Oliver Hartkopp
@ 2013-11-13  8:11                             ` Wolfgang Grandegger
  2013-11-13  9:08                               ` Pavel Pisa
  0 siblings, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-11-13  8:11 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Austin Schuh, linux-can

On Wed, 13 Nov 2013 07:44:23 +0100, Oliver Hartkopp
<socketcan@hartkopp.net> wrote:
> On 13.11.2013 00:22, Austin Schuh wrote:
>
>> Here is what is in the syslog from the same machine sending to it's
>> self with a non-realtime kernel.
>>
>> # uname -a
>> Linux vpc5 3.10-3-amd64 #1 SMP Debian 3.10.11-2 (2013-09-10) x86_64 GNU/Linux
>>
>> <6>[  169.993870] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
>> <6>[  169.993904] peak_pci 0000:05:00.0 can1: Received packet.
>> <6>[  169.993923] peak_pci 0000:05:00.0 can1: sja1000_rx
>> <6>[  169.993994] peak_pci 0000:05:00.0 can1: Returning IRQ_HANDLED
>> <6>[  169.994013] peak_pci 0000:05:00.0 can1: Found can1, disabling tracing.
>> <6>[  169.994029] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
>> <6>[  169.994048] peak_pci 0000:05:00.0 can0: TX complete.
>> <6>[  169.994061] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
>
> This looks indeed much better :-)

Note that -rt uses threaded interrupt handlers, which can run on
separate CPUs (just have a look at the function traces Austin provided).
It looks like vanilla is just less sensitive to races.

>> When I attach the CAN device which continually sends, I get the
>> following.  (The front might be snipped improperly, not sure.)
>>
>> <6>[  640.823323] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
>> <6>[  640.823331] peak_pci 0000:05:00.0 can1: Returning IRQ_NONE
>> <6>[  640.823334] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
>> <6>[  640.823344] peak_pci 0000:05:00.0 can0: Received packet.
>> <6>[  640.823346] peak_pci 0000:05:00.0 can0: sja1000_rx
>> <6>[  640.823391] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
>
> This is a correct behaviour too:

It is not a requirement that the (shared) handlers are called sequentially.

> The shared IRQ is handled by can1 (which had nothing to do) and then the chain
> goes to can0 which handles the reception correctly.
>
> @Wolfgang: Do we need an additional protection for the PITA handling in
> peak_pci.c ?

Don't know. It depends on the hardware. The interrupt is cleared by
writing the relevant bit to the PITA register. It looks safe at first
glance, but maybe the hardware is not smart enough. Adding spinlock
protection would quickly reveal if that is the problem. Anyway, there
could be other races as well.

I can imagine another problem with the following returns:

      /* check for absent controller due to hw unplug */
      if (status == 0xFF && sja1000_is_absent(priv))
          return IRQ_NONE;

Austin, could you please check if they occur?

> @Austin:
> I have another idea to test, if it just shows up in the mainline driver:
>
> Can you please download the out-of-tree driver from PEAK (version 7.9):
>
> http://www.peak-system.com/linux/index.htm
>
> I wonder if this one actually compiles with a -rt kernel and if it shows up
> the same issue then.

Well, we had better fix the problem in the Linux CAN driver.

Wolfgang.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-13  8:11                             ` Wolfgang Grandegger
@ 2013-11-13  9:08                               ` Pavel Pisa
  2013-11-13  9:52                                 ` Wolfgang Grandegger
                                                   ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Pavel Pisa @ 2013-11-13  9:08 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Oliver Hartkopp, Austin Schuh, linux-can

Hello all,

I have noticed that you speak about PITA in the thread.

There can be a problem in that the local bus IRQs are shared
and sensed as edge triggered on the PITA, while PCI
interrupt processing is set up for level-triggered sharing
between PCI cards/sources. That means that when the IRQ
on the first SJA1000 chip is activated exactly after
it was checked in the ISR and before the other chip's IRQ
is cleared, you receive no further IRQs until a
forced/polled check of the first chip.
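
The race window described above can be illustrated with a small userspace
model. This is only a sketch: struct chip_model, shared_line_active() and
line_stuck_after_isr() are hypothetical names, and it assumes two chips
whose IRQ outputs are simply ORed onto one wire that the bridge senses
by edge.

```c
#include <stdbool.h>

/* Hypothetical model: two SJA1000 IRQ outputs wired together. */
struct chip_model {
	bool irq_pending;
};

/* The shared wire is active while any chip asserts its interrupt. */
static bool shared_line_active(const struct chip_model c[2])
{
	return c[0].irq_pending || c[1].irq_pending;
}

/*
 * One ISR pass that visits chip 0, then chip 1, started by chip 1's
 * interrupt. If late_irq_on_chip0 is set, chip 0 raises its interrupt
 * after it was checked but before chip 1 is cleared.
 * Returns true if the wire is still active when the ISR exits, i.e. an
 * edge-sensing bridge would never see a new edge.
 */
static bool line_stuck_after_isr(bool late_irq_on_chip0)
{
	struct chip_model c[2] = { { false }, { true } };

	if (c[0].irq_pending)			/* chip 0 checked: nothing to do */
		c[0].irq_pending = false;
	if (late_irq_on_chip0)
		c[0].irq_pending = true;	/* arrives inside the race window */
	if (c[1].irq_pending)
		c[1].irq_pending = false;	/* chip 1 serviced and cleared */

	return shared_line_active(c);
}
```

With the late interrupt the wire never returns to its idle state, so the
model reports it as stuck; without it, the wire goes idle and the next
interrupt produces a fresh edge.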

Please check what kind of PCI to local-bus bridge
is on the card to see whether my explanation applies. If it is an Infineon
PSB4610 PITA-2 or some similar chip, then it is possible
that the problem described above is present. I fought it
remotely for EMS-WUENSCHE CPC-PCI cards at Universita'
degli Studi di Parma years ago with LinCAN. They have donated
one such board to us, so I can retest SocketCAN.

LinCAN commits

commit b5ac7e6f7e0d1af9dbb4bdb56921ab04de4e1ade
Author: ppisa <ppisa>
Date:   Fri Jul 16 15:44:20 2004 +0000

    EMS CPC-PCI fix correcting poorly undocumented PITA2 IRQ behavior.
    This workaround compiles only for 2.6.x kernels now and correct
    fix compatible with 2.4 requires driver wide changes.
    That is why CPC-PCI is not enabled by default.

commit 3a2bb63f0bb8de2aafb346b53b945c59b3f87a41
Author: ppisa <ppisa>
Date:   Fri Jul 2 00:26:41 2004 +0000

    CPC-PCI second chip IRQ corrected. Message timestamp code added.
    The timestamp code has some time overhead. If it is problem,
    it can be disabled in the main.h file.

In fact, I am not sure whether the code in
  linux-devel/drivers/net/can/sja1000/ems_pci.c
supporting this card is fully correct either, even though LinCAN
is referenced as part of the code's source. I need to analyze
whether the sharing problem is resolved.

There seems to be code to handle IRQ flag clearing in peak_pci.c
  peak_pci_post_irq()

and different code for the two PITA versions in ems_pci.c
  ems_pci_v1_write_reg()
  ems_pci_v1_post_irq()

The LinCAN code strategy was to loop over the sibling CAN channels
until there was one full cycle without any pending IRQ
in all chips. This ensured that the shared wire from the SJA1000
chips was low and that the next IRQ would lead to an edge on the PITA input.
That worked quite well.

  http://sourceforge.net/p/ortcan/lincan/ci/master/tree/lincan/src/ems_cpcpci.c#l215
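
The loop strategy can be sketched as follows. This is a userspace model
written for illustration, not the LinCAN code: struct chan_model,
line_active() and service_until_quiet() are hypothetical names, and
race_pass exists only to inject a late interrupt on channel 0.

```c
#include <stdbool.h>

#define NCHIPS 2

/* Hypothetical stand-in for one SJA1000 channel; not a driver structure. */
struct chan_model {
	bool irq_pending;
};

/* The shared wire is active while any channel has an interrupt pending. */
static bool line_active(const struct chan_model *ch)
{
	int i;

	for (i = 0; i < NCHIPS; i++)
		if (ch[i].irq_pending)
			return true;
	return false;
}

/*
 * Poll all sibling channels until one complete pass finds nothing pending.
 * race_pass injects a late interrupt on channel 0 right after it has been
 * checked in that pass (use -1 for no injection).
 * Returns the number of passes made; on return the shared wire is idle.
 */
static int service_until_quiet(struct chan_model *ch, int race_pass)
{
	int passes = 0;
	bool handled;

	do {
		int i;

		handled = false;
		for (i = 0; i < NCHIPS; i++) {
			if (ch[i].irq_pending) {
				ch[i].irq_pending = false;	/* service and clear */
				handled = true;
			}
			if (i == 0 && passes == race_pass)
				ch[0].irq_pending = true;	/* late IRQ on channel 0 */
		}
		passes++;
	} while (handled);

	return passes;
}
```

The extra pass is the price of the scheme: even without a race the loop
runs once more than strictly needed, but it only exits after seeing every
channel idle within the same pass.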

So check the exact PITA version and report it. Check whether it is the same
as the one supported by the SocketCAN EMS driver and whether the handling is
the same. If this does not help, then try to handle both chips
from a single registered IRQ with a loop (or register that IRQ twice
but do cross-channel handling from it somehow, if it is not possible
to do that some other way). Then it is necessary to check that the same
chip is not handled from two IRQ threads concurrently in RT.

Another option is to somehow switch the given PCI line's processing to
shared edge-triggered mode in the kernel PCI core.

Generally, RT is more sensitive to this stuck-IRQ problem because the
race window is longer. The problem probably applies to the stock kernel
too, but with a probability many times smaller.

Best wishes,

            Pavel



On Wednesday 13 of November 2013 09:11:58 Wolfgang Grandegger wrote:
> On Wed, 13 Nov 2013 07:44:23 +0100, Oliver Hartkopp
>
> <socketcan@hartkopp.net> wrote:
> > On 13.11.2013 00:22, Austin Schuh wrote:
> >> Here is what is in the syslog from the same machine sending to
> >> itself with a non-realtime kernel.
> >>
> >> # uname -a
> >> Linux vpc5 3.10-3-amd64 #1 SMP Debian 3.10.11-2 (2013-09-10) x86_64
> >> GNU/Linux
> >>
> >> <6>[  169.993870] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
> >> <6>[  169.993904] peak_pci 0000:05:00.0 can1: Received packet.
> >> <6>[  169.993923] peak_pci 0000:05:00.0 can1: sja1000_rx
> >> <6>[  169.993994] peak_pci 0000:05:00.0 can1: Returning IRQ_HANDLED
> >> <6>[  169.994013] peak_pci 0000:05:00.0 can1: Found can1, disabling
> >> tracing.
> >> <6>[  169.994029] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
> >> <6>[  169.994048] peak_pci 0000:05:00.0 can0: TX complete.
> >> <6>[  169.994061] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
> >
> > This looks indeed much better :-)
>
> Note that -rt uses threaded interrupt handlers, which can run on separate
> CPUs (just have a look at the function traces Austin provided). Looks like
> vanilla is just less sensitive to races.
>
> >> When I attach the CAN device which continually sends, I get the
> >> following.  (The front might be snipped improperly, not sure.)
> >>
> >> <6>[  640.823323] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
> >> <6>[  640.823331] peak_pci 0000:05:00.0 can1: Returning IRQ_NONE
> >> <6>[  640.823334] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
> >> <6>[  640.823344] peak_pci 0000:05:00.0 can0: Received packet.
> >> <6>[  640.823346] peak_pci 0000:05:00.0 can0: sja1000_rx
> >> <6>[  640.823391] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
> >
> > This is a correct behaviour too:
>
> It is not a requirement that the (shared) handlers are called sequentially.
>
> > The shared IRQ is handled by can1 (which had nothing to do) and then the
> > chain goes to can0 which handles the reception correctly.
> >
> > @Wolfgang: Do we need an additional protection for the PITA handling in
> > peak_pci.c ?
>
> Don't know. It depends on the hardware. The interrupt is cleared by
> writing the relevant bit to the PITA register. Looks safe at first glance
> but maybe the hardware is not smart enough. Adding spinlock protection
> would quickly reveal if that is the problem. Anyway, there could be other
> races as well.
>
> I can imagine another problem with the following returns:
>
>       /* check for absent controller due to hw unplug */
>       if (status == 0xFF && sja1000_is_absent(priv))
>           return IRQ_NONE;
>
> Austin, could you please check if they occur.
>
> > @Austin:
> > I have another idea to test, if it just shows up in the mainline driver:
> >
> > Can you please download the out-of-tree driver from PEAK (version 7.9):
> >
> > http://www.peak-system.com/linux/index.htm
> >
> > I wonder if this one actually compiles with a -rt kernel and if it shows
> > up the same issue then.
>
> Well, we should better fix the problem in the Linux CAN driver.
>
> Wolfgang.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-can" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-13  6:58                             ` Oliver Hartkopp
@ 2013-11-13  9:48                               ` Kurt Van Dijck
  0 siblings, 0 replies; 66+ messages in thread
From: Kurt Van Dijck @ 2013-11-13  9:48 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Austin Schuh, Wolfgang Grandegger, linux-can

Hi,

I have been following this interesting thread.
I'm not sure this is a contribution, but I'll share my idea anyway
since it may help.

It appears that the -rt is affecting the peak_pci driver.
So I started looking into the SJA1000+peak_pci combination.

The peak_pci PITA does not look strange. The channel mask
is used, so I don't suspect a locking problem there.

Schematically, the current situation is:
1. deal with SJA1000 IRQ
2. handle PITA

I'm not very familiar with the precise function of the PITA.
What I do find strange is that the PITA is handled _after_
processing the SJA1000 IRQ. This means that between processing
the last SJA1000 IRQ and handling the PITA, a new event may arrive,
especially when scheduling takes place in between.

Ideally, I would try:
1. disable SJA1000 IRQ (make sure the hardware returns idle again)
2. handle PITA (the kernel knows it must do something)
3. handle SJA1000 IRQ (do something)
4. enable SJA1000 IRQ

This alternative sequence would effectively get an SJA1000's irq
line idle, and let the PITA know that.
If some event happens again after handling the IRQ, then the
irq line gets active again when enabling the SJA1000 IRQ.
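
The difference between the two orderings can be illustrated with a small
user-space model. Everything in it is an assumption made for illustration:
the bridge is modelled as latching only a rising edge of the chip's
interrupt line, and struct toy_card and the toy_* helpers are hypothetical
names, not driver code. Whether the real PITA behaves this way is exactly
the open question in this thread.

```c
#include <stdbool.h>

/* Toy model of one SJA1000 behind an edge-latching bridge (assumed). */
struct toy_card {
	int  sja_pending;	/* events waiting in the simulated chip */
	bool sja_irq_en;	/* simulated interrupt enable register */
	bool line;		/* current level of the chip's INT output */
	bool bridge_latch;	/* set on a rising edge, cleared by an ack */
};

/* Recompute the line level and latch a rising edge, if there is one. */
static void toy_update_line(struct toy_card *c)
{
	bool level = c->sja_irq_en && c->sja_pending > 0;

	if (level && !c->line)
		c->bridge_latch = true;
	c->line = level;
}

static void toy_event(struct toy_card *c) { c->sja_pending++; toy_update_line(c); }
static void toy_drain(struct toy_card *c) { c->sja_pending = 0; toy_update_line(c); }
static void toy_ack(struct toy_card *c)   { c->bridge_latch = false; }

static void toy_set_irq_en(struct toy_card *c, bool en)
{
	c->sja_irq_en = en;
	toy_update_line(c);
}

/* Stuck: work is pending but the bridge holds no latched interrupt. */
static bool toy_stuck(const struct toy_card *c)
{
	return c->sja_pending > 0 && !c->bridge_latch;
}

/* Current order: service the chip first, ack the bridge afterwards. */
bool toy_service_then_ack(bool late_event)
{
	struct toy_card c = { .sja_irq_en = true };

	toy_event(&c);		/* first event raises the line */
	toy_drain(&c);		/* chip serviced, line falls */
	if (late_event)
		toy_event(&c);	/* new event lands before the ack */
	toy_ack(&c);		/* the ack wipes out the new edge too */
	return toy_stuck(&c);
}

/* Proposed order: mask the chip, ack the bridge, service, unmask. */
bool toy_ack_then_service(bool late_event)
{
	struct toy_card c = { .sja_irq_en = true };

	toy_event(&c);
	toy_set_irq_en(&c, false);	/* line forced idle */
	toy_ack(&c);
	toy_drain(&c);
	if (late_event)
		toy_event(&c);		/* pending, but line still masked */
	toy_set_irq_en(&c, true);	/* unmask: fresh rising edge */
	return toy_stuck(&c);
}
```

In this model only toy_service_then_ack(true) ends up stuck; with the
reordered sequence the late event produces a new edge when the chip
interrupt is re-enabled.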

To make sure I expressed myself well, I attached a very elementary
patch that shows what I actually mean.

Kurt

diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
index 83ee11e..a235ea0 100644
--- a/drivers/net/can/sja1000/sja1000.c
+++ b/drivers/net/can/sja1000/sja1000.c
@@ -494,15 +494,21 @@ irqreturn_t sja1000_interrupt(int irq, void *dev_id)
 	if (priv->read_reg(priv, REG_IER) == IRQ_OFF)
 		return IRQ_NONE;
 
+	priv->write_reg(priv, REG_IER, 0);
+
 	if (priv->pre_irq)
 		priv->pre_irq(priv);
+	/* do the post irq as a workaround.
+	   better rework the driver to use pre_irq */
+	if (priv->post_irq)
+		priv->post_irq(priv);
 
 	while ((isrc = priv->read_reg(priv, REG_IR)) && (n < SJA1000_MAX_IRQ)) {
 		n++;
 		status = priv->read_reg(priv, REG_SR);
 		/* check for absent controller due to hw unplug */
 		if (status == 0xFF && sja1000_is_absent(priv))
-			return IRQ_NONE;
+			goto done;
 
 		if (isrc & IRQ_WUI)
 			netdev_warn(dev, "wakeup interrupt\n");
@@ -529,7 +535,7 @@ irqreturn_t sja1000_interrupt(int irq, void *dev_id)
 				status = priv->read_reg(priv, REG_SR);
 				/* check for absent controller */
 				if (status == 0xFF && sja1000_is_absent(priv))
-					return IRQ_NONE;
+					goto done;
 			}
 		}
 		if (isrc & (IRQ_DOI | IRQ_EI | IRQ_BEI | IRQ_EPI | IRQ_ALI)) {
@@ -539,12 +545,13 @@ irqreturn_t sja1000_interrupt(int irq, void *dev_id)
 		}
 	}
 
-	if (priv->post_irq)
-		priv->post_irq(priv);
-
 	if (n >= SJA1000_MAX_IRQ)
 		netdev_dbg(dev, "%d messages handled in ISR", n);
 
+done:
+	/* IRQ_ALL should become the same as used in set_normal_mode */
+	priv->write_reg(priv, REG_IER, IRQ_ALL);
+
 	return (n) ? IRQ_HANDLED : IRQ_NONE;
 }
 EXPORT_SYMBOL_GPL(sja1000_interrupt);


On Wed, Nov 13, 2013 at 07:58:59AM +0100, Oliver Hartkopp wrote:
> Hi Austin,
> 
> sorry for checking my mails in sequential order :-)
> I would have been able to shorten the last mail.
> 
> Thanks for your interesting investigation.
> I wonder why this problem did not show up before then. Having shared
> interrupts should be a usual thing.
> 
> This kind of race condition should not be there at all. Do you have a second
> peak_pci hardware? It could be an idea to try to split the IRQs in a way that
> you have two IRQs for two cards - and then connect can0 to can2.
> You would have a pretty fast following RX/TX interrupt but without interrupt
> sharing ...
> 
> Best regards,
> Oliver
> 
> On 13.11.2013 04:41, Austin Schuh wrote:
> > On Tue, Nov 12, 2013 at 3:22 PM, Austin Schuh <austin@peloton-tech.com> wrote:
> >> On Tue, Nov 12, 2013 at 1:26 PM, Oliver Hartkopp <socketcan@hartkopp.net> wrote:
> >>> On 12.11.2013 03:59, Austin Schuh wrote:
> >>>
> >>>>> From the trace it is pretty hard to know which CAN interface is in charge.
> >>>>> (2) Can you please add the output of dev->ifindex in the pr_info() calls?
> >>>>
> >>>> Gladly.  See the updated logs.
> >>>>
> >>>> [  556.019246] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
> >>>> [  556.019268] Unhandled IRQ 18... stop tracing...
> >>>> [  556.019280] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
> >>>> [  556.019289] peak_pci 0000:05:00.0 can1: Received packet.
> >>>> [  556.019299] peak_pci 0000:05:00.0 can1: sja1000_rx
> >>>> [  556.019307] peak_pci 0000:05:00.0 can0: TX complete.
> >>>> [  556.019318] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
> >>>> [  556.019362] peak_pci 0000:05:00.0 can1: Returning IRQ_HANDLED
> >>>>
> >>>
> >>> This looks pretty broken regarding the IRQ handling.
> >>> Maybe the IRQ thread handling has a real problem in the -rt kernel ?!?
> >>
> >> Sounds pretty plausible right now.
> > 
> > Ok, I spent a good chunk of today reading the IRQ handling code in the
> > kernel, and I think I get what is happening and have a plausible
> > explanation for why the interrupt is getting disabled.  Not sure how
> > to test it.
> > 
> > Here is what it looks like is happening.  The hardware triggers an
> > interrupt.  The handler is called, and then the registered action for
> > each of the devices is to notify their threads that an IRQ occurred,
> > and to have them handle it.  Each of the handling threads then calls
> > the sja1000_interrupt function, or the equivalent ata_generic
> > interrupt function.  2 of the 3 interrupt functions then return
> > IRQ_NONE, and one of them returns IRQ_HANDLED.  note_interrupt is then
> > called in each of the threads (instead of being called once in the
> > non-rt case), resulting in 2 unhandled calls, and 1 handled call.  So
> > far, so good.  The kernel operates as expected, since less than 99.9 %
> > of the interrupts are unhandled.  (There is a note_interrupt call in the
> > handler, but since the threaded handlers are notified, this doesn't
> > get counted.)
> > 
> > Since the IRQ handlers are now all in threads, if the thread that
> > actually receives data doesn't process the interrupts either because
> > something goes wrong, or because it doesn't get scheduled, there will
> > be a bunch of unhandled interrupts noted, and no handled interrupts
> > noted.  This will cause the IRQ to be disabled.
> > 
> > I guess the next interesting thing to do is to trigger when it
> > disables the IRQ and take a look at what is happening.  I have a test
> > running on one machine with tracing enabled which will disable tracing
> > when the IRQ is disabled.  That should provide some interesting
> > results.  I think I also know how to bypass it for now by setting
> > "noirqdebug", but I'd like to fix it for real as well.
> > 
> > Austin
> > 
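
Austin's account above of how the results are counted can be modelled in a
few lines of user-space C. This is a simplified sketch, not the kernel
code: the window of 100000 interrupts and the 99900 limit are assumed to
correspond to the "99.9 %" figure he mentions, any time-based logic in the
real accounting is left out, and struct toy_irq_stats, toy_note_interrupt()
and toy_line_gets_disabled() are hypothetical names.

```c
#include <stdbool.h>

#define TOY_WINDOW	100000u	/* results per accounting window (assumed) */
#define TOY_MAX_BAD	99900u	/* unhandled results that trip the disable (assumed) */

struct toy_irq_stats {
	unsigned int count;	/* results seen in the current window */
	unsigned int unhandled;	/* IRQ_NONE results in the current window */
	bool disabled;		/* line has been shut off */
};

/* Account for one handler result, as each IRQ thread would. */
void toy_note_interrupt(struct toy_irq_stats *s, bool handled)
{
	if (s->disabled)
		return;
	if (!handled)
		s->unhandled++;
	if (++s->count < TOY_WINDOW)
		return;

	/* End of window: decide, then start counting afresh. */
	s->count = 0;
	if (s->unhandled > TOY_MAX_BAD)
		s->disabled = true;
	s->unhandled = 0;
}

/*
 * Feed hw_irqs hardware interrupts; each one produces none_results
 * IRQ_NONE notes and handled_results IRQ_HANDLED notes.
 */
bool toy_line_gets_disabled(unsigned int hw_irqs,
			    unsigned int none_results,
			    unsigned int handled_results)
{
	struct toy_irq_stats s = { 0 };

	for (unsigned int i = 0; i < hw_irqs; i++) {
		for (unsigned int j = 0; j < none_results; j++)
			toy_note_interrupt(&s, false);
		for (unsigned int j = 0; j < handled_results; j++)
			toy_note_interrupt(&s, true);
	}
	return s.disabled;
}
```

With two IRQ_NONE results and one IRQ_HANDLED result per hardware interrupt
the line stays enabled in this model; if the thread that would return
IRQ_HANDLED never gets to run, only IRQ_NONE results are noted and the line
is disabled at the end of the first window.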

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-13  9:08                               ` Pavel Pisa
@ 2013-11-13  9:52                                 ` Wolfgang Grandegger
  2013-11-13 18:41                                   ` Oliver Hartkopp
  2013-11-13 11:02                                 ` Kurt Van Dijck
  2013-11-16 21:42                                 ` Oliver Hartkopp
  2 siblings, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-11-13  9:52 UTC (permalink / raw)
  To: Pavel Pisa; +Cc: Oliver Hartkopp, Austin Schuh, linux-can

Hi Pavel,

On Wed, 13 Nov 2013 10:08:54 +0100, Pavel Pisa <pisa@cmp.felk.cvut.cz>
wrote:
> Hello all,
> 
> I have noticed that you speak about PITA in the thread.
> 
> There can be a problem that local bus IRQs are shared
> and sensed as edge triggered on the PITA but PCI
> interrupt processing is selected for level sharing
> between PCI cards/sources. That means, when the IRQ
> on the first SJA1000 chip is activated exactly after
> it was checked in the ISR and before the other chip's IRQ
> is cleared, then you receive no more IRQs until
> a forced/polled check of the first chip.
> 
> Please check what kind of PCI to local-bus bridge
> is on the card, to see whether my explanation applies. If it is
> an Infineon PSB4610 PITA-2 or some similar chip, then it is possible
> that the problem described above is present. I fought it
> remotely for EMS-WUENSCHE CPC-PCI cards at Universita'
> degli Studi di Parma years ago with LinCAN. They have donated
> one such board to us so I can retest SocketCAN
> 
> LinCAN commits
> 
> commit b5ac7e6f7e0d1af9dbb4bdb56921ab04de4e1ade
> Author: ppisa <ppisa>
> Date:   Fri Jul 16 15:44:20 2004 +0000
> 
>     EMS CPC-PCI fix correcting poorly undocumented PITA2 IRQ behavior.
>     This workaround compiles only for 2.6.x kernels now and correct
>     fix compatible with 2.4 requires driver wide changes.
>     That is why CPC-PCI is not enabled by default.
> 
> commit 3a2bb63f0bb8de2aafb346b53b945c59b3f87a41
> Author: ppisa <ppisa>
> Date:   Fri Jul 2 00:26:41 2004 +0000
> 
>     CPC-PCI second chip IRQ corrected. Message timestamp code added.
>     The timestamp code has some time overhead. If it is problem,
>     it can be disabled in the main.h file.
> 
> In fact, I am not sure whether the code in
>   linux-devel/drivers/net/can/sja1000/ems_pci.c
> supporting this card is fully correct either, even though LinCAN
> is referenced as one source of that code. I need to analyze
> whether the sharing problem is resolved there.
> 
> There seems to be code to handle IRQ flag clearing in peak_pci.c
>   peak_pci_post_irq()
> 
> and different code for the two PITA versions in ems_pci.c
>   ems_pci_v1_write_reg()
>   ems_pci_v1_post_irq()
> 
> The LinCAN strategy was to loop over the sibling CAN channels
> until there was one full cycle without any pending IRQ
> in any of the chips. This ensured that the shared wire from the SJA1000
> chips is low, so the next IRQ would produce an edge on the PITA input.
> That worked quite well.
> 
>   http://sourceforge.net/p/ortcan/lincan/ci/master/tree/lincan/src/ems_cpcpci.c#l215

In Linux-CAN we have something similar:

http://lxr.linux.no/#linux/drivers/net/can/sja1000/ems_pcmcia.c#L90

> So check the exact PITA version and report it. Check whether it is the
> same as the one supported by the SocketCAN EMS driver and whether the
> handling is the same.

@Austin, could you check what is written on the PITA chip?

> If this does not help, then try to handle both chips
> from a single registered IRQ with a loop (or register that IRQ twice
> but do cross-channel handling from it somehow, if it is not possible
> to do that some other way). Then it is necessary to check that the same
> chip is not handled from two IRQ threads concurrently in RT.
> 
> Another option is to somehow switch the processing of the given PCI line
> to shared edge-triggered mode in the kernel PCI core.

Hm, how should that work? Anyway, I'm aware of the problem when using
edge triggered interrupts. I thought it could not happen with level
triggered interrupts but I'm also afraid the hardware is not smart/good
enough. We should ask Peak Systems.

> Generally, RT is more sensitive to this stuck-IRQ problem because
> the race window is longer. But the problem probably applies to the stock
> kernel as well, only with a probability many times smaller.

I agree.

Wolfgang.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-13  9:08                               ` Pavel Pisa
  2013-11-13  9:52                                 ` Wolfgang Grandegger
@ 2013-11-13 11:02                                 ` Kurt Van Dijck
  2013-11-16 21:42                                 ` Oliver Hartkopp
  2 siblings, 0 replies; 66+ messages in thread
From: Kurt Van Dijck @ 2013-11-13 11:02 UTC (permalink / raw)
  To: Pavel Pisa; +Cc: Wolfgang Grandegger, Oliver Hartkopp, Austin Schuh, linux-can

Hi Pavel,

Sorry for reading my email out of order.

I came to similar conclusions, though without enough knowledge
of PCI buses.
I just sent a very preliminary patch. Do you think
this would solve the race condition without altering
the PCI-related properties?

Kurt

On Wed, Nov 13, 2013 at 10:08:54AM +0100, Pavel Pisa wrote:
> Hello all,
> 
> I have noticed that you speak about PITA in the thread.
> 
> There can be a problem that local bus IRQs are shared
> and sensed as edge triggered on the PITA but PCI
> interrupt processing is selected for level sharing
> between PCI cards/sources. That means, when the IRQ
> on the first SJA1000 chip is activated exactly after
> it was checked in the ISR and before the other chip's IRQ
> is cleared, then you receive no more IRQs until
> a forced/polled check of the first chip.
> 
> Please check what kind of PCI to local-bus bridge
> is on the card, to see whether my explanation applies. If it is
> an Infineon PSB4610 PITA-2 or some similar chip, then it is possible
> that the problem described above is present. I fought it
> remotely for EMS-WUENSCHE CPC-PCI cards at Universita'
> degli Studi di Parma years ago with LinCAN. They have donated
> one such board to us so I can retest SocketCAN
> 
> LinCAN commits
> 
> commit b5ac7e6f7e0d1af9dbb4bdb56921ab04de4e1ade
> Author: ppisa <ppisa>
> Date:   Fri Jul 16 15:44:20 2004 +0000
> 
>     EMS CPC-PCI fix correcting poorly undocumented PITA2 IRQ behavior.
>     This workaround compiles only for 2.6.x kernels now and correct
>     fix compatible with 2.4 requires driver wide changes.
>     That is why CPC-PCI is not enabled by default.
> 
> commit 3a2bb63f0bb8de2aafb346b53b945c59b3f87a41
> Author: ppisa <ppisa>
> Date:   Fri Jul 2 00:26:41 2004 +0000
> 
>     CPC-PCI second chip IRQ corrected. Message timestamp code added.
>     The timestamp code has some time overhead. If it is problem,
>     it can be disabled in the main.h file.
> 
> In fact, I am not sure whether the code in
>   linux-devel/drivers/net/can/sja1000/ems_pci.c
> supporting this card is fully correct either, even though LinCAN
> is referenced as one source of that code. I need to analyze
> whether the sharing problem is resolved there.
> 
> There seems to be code to handle IRQ flag clearing in peak_pci.c
>   peak_pci_post_irq()
> 
> and different code for the two PITA versions in ems_pci.c
>   ems_pci_v1_write_reg()
>   ems_pci_v1_post_irq()
> 
> The LinCAN strategy was to loop over the sibling CAN channels
> until there was one full cycle without any pending IRQ
> in any of the chips. This ensured that the shared wire from the SJA1000
> chips is low, so the next IRQ would produce an edge on the PITA input.
> That worked quite well.
> 
>   http://sourceforge.net/p/ortcan/lincan/ci/master/tree/lincan/src/ems_cpcpci.c#l215
> 
> So check the exact PITA version and report it. Check whether it is the
> same as the one supported by the SocketCAN EMS driver and whether the
> handling is the same.
> If this does not help, then try to handle both chips
> from a single registered IRQ with a loop (or register that IRQ twice
> but do cross-channel handling from it somehow, if it is not possible
> to do that some other way). Then it is necessary to check that the same
> chip is not handled from two IRQ threads concurrently in RT.
> 
> Another option is to somehow switch the processing of the given PCI line
> to shared edge-triggered mode in the kernel PCI core.
> 
> Generally, RT is more sensitive to this stuck-IRQ problem because
> the race window is longer. But the problem probably applies to the stock
> kernel as well, only with a probability many times smaller.
> 
> Best wishes,
> 
>             Pavel
> 
> 
> 
> On Wednesday 13 of November 2013 09:11:58 Wolfgang Grandegger wrote:
> > On Wed, 13 Nov 2013 07:44:23 +0100, Oliver Hartkopp
> >
> > <socketcan@hartkopp.net> wrote:
> > > On 13.11.2013 00:22, Austin Schuh wrote:
> > >> Here is what is in the syslog from the same machine sending to
> > >> itself with a non-realtime kernel.
> > >>
> > >> # uname -a
> > >> Linux vpc5 3.10-3-amd64 #1 SMP Debian 3.10.11-2 (2013-09-10) x86_64
> > >> GNU/Linux
> > >>
> > >> <6>[  169.993870] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
> > >> <6>[  169.993904] peak_pci 0000:05:00.0 can1: Received packet.
> > >> <6>[  169.993923] peak_pci 0000:05:00.0 can1: sja1000_rx
> > >> <6>[  169.993994] peak_pci 0000:05:00.0 can1: Returning IRQ_HANDLED
> > >> <6>[  169.994013] peak_pci 0000:05:00.0 can1: Found can1, disabling
> > >> tracing.
> > >> <6>[  169.994029] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
> > >> <6>[  169.994048] peak_pci 0000:05:00.0 can0: TX complete.
> > >> <6>[  169.994061] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
> > >
> > > This looks indeed much better :-)
> >
> > Note that -rt uses threaded interrupt handlers, which can run on separate
> > CPUs (just have a look at the function traces Austin provided). Looks like
> > vanilla is just less sensitive to races.
> >
> > >> When I attach the CAN device which continually sends, I get the
> > >> following.  (The front might be snipped improperly, not sure.)
> > >>
> > >> <6>[  640.823323] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
> > >> <6>[  640.823331] peak_pci 0000:05:00.0 can1: Returning IRQ_NONE
> > >> <6>[  640.823334] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
> > >> <6>[  640.823344] peak_pci 0000:05:00.0 can0: Received packet.
> > >> <6>[  640.823346] peak_pci 0000:05:00.0 can0: sja1000_rx
> > >> <6>[  640.823391] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
> > >
> > > This is a correct behaviour too:
> >
> > It is not a requirement that the (shared) handlers are called sequentially.
> >
> > > The shared IRQ is handled by can1 (which had nothing to do) and then the
> > > chain goes to can0 which handles the reception correctly.
> > >
> > > @Wolfgang: Do we need an additional protection for the PITA handling in
> > > peak_pci.c ?
> >
> > Don't know. It depends on the hardware. The interrupt is cleared by
> > writing the relevant bit to the PITA register. Looks safe at first glance
> > but maybe the hardware is not smart enough. Adding spinlock protection
> > would quickly reveal if that is the problem. Anyway, there could be other
> > races as well.
> >
> > I can imagine another problem with the following returns:
> >
> >       /* check for absent controller due to hw unplug */
> >       if (status == 0xFF && sja1000_is_absent(priv))
> >           return IRQ_NONE;
> >
> > Austin, could you please check if they occur.
> >
> > > @Austin:
> > > I have another idea to test, if it just shows up in the mainline driver:
> > >
> > > Can you please download the out-of-tree driver from PEAK (version 7.9):
> > >
> > > http://www.peak-system.com/linux/index.htm
> > >
> > > I wonder if this one actually compiles with a -rt kernel and if it shows
> > > up the same issue then.
> >
> > Well, we should better fix the problem in the Linux CAN driver.
> >
> > Wolfgang.

-- 
Kurt Van Dijck
GRAMMER EiA ELECTRONICS
http://www.eia.be
kurt.van.dijck@eia.be
+32-38708534

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-13  9:52                                 ` Wolfgang Grandegger
@ 2013-11-13 18:41                                   ` Oliver Hartkopp
  2013-11-13 19:29                                     ` Wolfgang Grandegger
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-11-13 18:41 UTC (permalink / raw)
  To: Wolfgang Grandegger, Pavel Pisa, Kurt Van Dijck, Stephane Grosjean
  Cc: Austin Schuh, linux-can

On 13.11.2013 10:52, Wolfgang Grandegger wrote:

> 
> In Linux-CAN we have something similar:
> 
> http://lxr.linux.no/#linux/drivers/net/can/sja1000/ems_pcmcia.c#L90
> 

Indeed.

I think reworking the sja1000.c driver (as suggested by Kurt) won't be enough.
It only touches the generic irq handling.

The peak_pci driver creates (depending on the number of channels) e.g.
two/four AFAICS pretty *independent* sja1000 netdevices.

Currently the corresponding irq bit in the PITA is cleared at the end of the
generic sja1000 interrupt handling. This is not necessarily at the end of the
interrupt chain.

What we would need to set up a similar handling as we have in EMS PCMCIA or
the EMS PCI (referenced by Pavel) is a "group of sja1000 netdevices" which is
placed on a single PEAK PCI adapter.

Indeed there is already a chain of sja1000 netdevices:

http://lxr.linux.no/#linux+v3.12/drivers/net/can/sja1000/peak_pci.c#L645

BUT this is only used to clean up all channels when the PCI device is removed
or some errors occur at creation time.

IMO the existing chain of netdevices is not only needed for the device removal
but also for the interrupt handling.

Whenever the interrupt for the PCI adapter occurs, all channels have to be
handled (in a private peak_pci_interrupt() function) and finally the PITA has
to be cleared there too.

That change would no longer make use of the possibility to clear single IRQ
bits in the PITA. And the PITA has to be checked first (e.g. to check if we
have a new interrupt somewhere later in the interrupt chain) to skip the irq
handling when it's obsolete.
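
A rough user-space sketch of such a per-adapter handler, under stated
assumptions: struct toy_adapter, its bridge_status field and
toy_adapter_interrupt() are hypothetical stand-ins, the real PITA register
layout and its clearing semantics are not modelled, and the per-channel
loop only counts events instead of calling sja1000_interrupt().

```c
#include <stddef.h>
#include <stdint.h>

enum toy_irqreturn { TOY_IRQ_NONE, TOY_IRQ_HANDLED };

/* One simulated SJA1000 channel on the adapter. */
struct toy_channel {
	int pending;	/* events waiting */
	int serviced;	/* events handled so far */
};

/* All channels of one adapter plus the bridge's interrupt status. */
struct toy_adapter {
	uint16_t bridge_status;		/* stands in for the PITA IRQ bits */
	struct toy_channel chan[4];
	size_t nchan;
};

enum toy_irqreturn toy_adapter_interrupt(struct toy_adapter *a)
{
	/* Check the bridge first: nothing flagged means not our interrupt. */
	if (a->bridge_status == 0)
		return TOY_IRQ_NONE;

	/* Handle every channel of the adapter in one place. */
	for (size_t i = 0; i < a->nchan; i++) {
		struct toy_channel *ch = &a->chan[i];

		while (ch->pending > 0) {
			ch->pending--;
			ch->serviced++;
		}
	}

	/* Clear the bridge once, after all channels have been handled. */
	a->bridge_status = 0;
	return TOY_IRQ_HANDLED;
}
```

The design choice shown is a single acknowledge per adapter instead of one
per channel, which is what gives up the per-bit clearing mentioned above.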

Any thoughts?

Regards,
Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-13 18:41                                   ` Oliver Hartkopp
@ 2013-11-13 19:29                                     ` Wolfgang Grandegger
  2013-11-13 22:00                                       ` Oliver Hartkopp
  0 siblings, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-11-13 19:29 UTC (permalink / raw)
  To: Oliver Hartkopp, Pavel Pisa, Kurt Van Dijck, Stephane Grosjean
  Cc: Austin Schuh, linux-can

Hi Oliver,

On 11/13/2013 07:41 PM, Oliver Hartkopp wrote:
> On 13.11.2013 10:52, Wolfgang Grandegger wrote:
> 
>>
>> In Linux-CAN we have something similar:
>>
>> http://lxr.linux.no/#linux/drivers/net/can/sja1000/ems_pcmcia.c#L90
>>
> 
> Indeed.
> 
> I think reworking the sja1000.c driver (as suggested by Kurt) won't make it.
> It only touches the generic irq handling.
> 
> The peak_pci driver creates (depending on the number of channels) e.g.
> two/four AFAICS pretty *independent* sja1000 netdevices.
> 
> Currently at the end of the generic sja1000 interrupt handling the according
> irq bit in the PITA is cleared. This is not necessarily at the end of the
> interrupt chain.
> 
> What we would need to set up a similar handling as we have in EMS PCMCIA or
> the EMS PCI (referenced by Pavel) is a "group of sja1000 netdevices" which is
> placed on a single PEAK PCI adapter.

Why? Normally we do not have such problems with level sensitive
interrupts. Also so far it's pure speculation that this might be the
cause of the problem.

> Indeed there is already a chain of sja1000 netdevices:
> 
> http://lxr.linux.no/#linux+v3.12/drivers/net/can/sja1000/peak_pci.c#L645
> 
> BUT this is only used to clean up all channels when the PCI device is removed
> or some errors occur at creation time.
> 
> IMO the existing chain of netdevices is not only needed for the device removal
> but also for the interrupt handling.
> 
> When ever the interrupt for the PCI adapter occurs all channels have to be
> handled (in a private peak_pci_interrupt() function) and finally the PITA has
> to be cleared there too.
> 
> That change won't make use of the possibility to clear single IRQ bits in the
> PITA anymore. And the PITA has to be checked first (e.g. to check if we have a
> new interrupt somewhere later in the interrupt chain) to skip the irq handling
> when it's obsolete.
> 
> Any thoughts?

See my comments above. If the PEAK PCI hardware is setting the levels
correctly, there is no problem with the current interrupt handling.

Wolfgang.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-13 19:29                                     ` Wolfgang Grandegger
@ 2013-11-13 22:00                                       ` Oliver Hartkopp
  0 siblings, 0 replies; 66+ messages in thread
From: Oliver Hartkopp @ 2013-11-13 22:00 UTC (permalink / raw)
  To: Wolfgang Grandegger
  Cc: Pavel Pisa, Kurt Van Dijck, Stephane Grosjean, Austin Schuh, linux-can



On 13.11.2013 20:29, Wolfgang Grandegger wrote:
> Hi Oliver,
> 
> On 11/13/2013 07:41 PM, Oliver Hartkopp wrote:
>> On 13.11.2013 10:52, Wolfgang Grandegger wrote:
>>
>>>
>>> In Linux-CAN we have something similar:
>>>
>>> http://lxr.linux.no/#linux/drivers/net/can/sja1000/ems_pcmcia.c#L90
>>>
>>
>> Indeed.
>>
>> I think reworking the sja1000.c driver (as suggested by Kurt) won't make it.
>> It only touches the generic irq handling.
>>
>> The peak_pci driver creates (depending on the number of channels) e.g.
>> two/four AFAICS pretty *independent* sja1000 netdevices.
>>
>> Currently at the end of the generic sja1000 interrupt handling the according
>> irq bit in the PITA is cleared. This is not necessarily at the end of the
>> interrupt chain.
>>
>> What we would need to set up a similar handling as we have in EMS PCMCIA or
>> the EMS PCI (referenced by Pavel) is a "group of sja1000 netdevices" which is
>> placed on a single PEAK PCI adapter.
> 
> Why? Normally we do not have such problems with level sensitive
> interrupts. Also so far it's pure speculation that this might be the
> cause of the problem.

Yes it is speculation.

I just wanted to point out what the differences are between the EMS PCMCIA
implementation and the current peak_pci driver.

I'm using a system with many PEAK cPCI four channel cards without any issues
with a mainline kernel. No idea why it fails like this in the -rt system.

Regards,
Oliver

> 
>> Indeed there is already a chain of sja1000 netdevices:
>>
>> http://lxr.linux.no/#linux+v3.12/drivers/net/can/sja1000/peak_pci.c#L645
>>
>> BUT this is only used to clean up all channels when the PCI device is removed
>> or some errors occur at creation time.
>>
>> IMO the existing chain of netdevices is not only needed for the device removal
>> but also for the interrupt handling.
>>
>> When ever the interrupt for the PCI adapter occurs all channels have to be
>> handled (in a private peak_pci_interrupt() function) and finally the PITA has
>> to be cleared there too.
>>
>> That change won't make use of the possibility to clear single IRQ bits in the
>> PITA anymore. And the PITA has to be checked first (e.g. to check if we have a
>> new interrupt somewhere later in the interrupt chain) to skip the irq handling
>> when it's obsolete.
>>
>> Any thoughts?
> 
> See my comments above. If the PEAK PCI hardware is setting the levels
> correctly, there is no problem with the current interrupt handling.
> 
> Wolfgang.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-13  9:08                               ` Pavel Pisa
  2013-11-13  9:52                                 ` Wolfgang Grandegger
  2013-11-13 11:02                                 ` Kurt Van Dijck
@ 2013-11-16 21:42                                 ` Oliver Hartkopp
  2013-11-17  8:18                                   ` Wolfgang Grandegger
  2 siblings, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-11-16 21:42 UTC (permalink / raw)
  To: Pavel Pisa, Wolfgang Grandegger, Austin Schuh; +Cc: linux-can

Hi all,

I would like to continue the discussion about Austin's problem with the -rt kernel.

This problem addressed by Pavel should be handled correctly ...

> I have noticed that you speak about PITA in the thread.
> 
> There can be a problem that local bus IRQs are shared
> and sensed as edge triggered on the PITA but PCI
> interrupt processing is selected for level sharing
> between PCI cards/sources. That means, when the IRQ
> on the first SJA1000 chip is activated exactly after
> it was checked in the ISR and before the other chip's IRQ
> is cleared, then you receive no more IRQs until
> a forced/polled check of the first chip.

... by this while statement in the current sja1000_interrupt() function:

http://lxr.linux.no/#linux+v3.12/drivers/net/can/sja1000/sja1000.c#L503

AFAICS, when the interrupt is handled in this manner, the bit in the PITA
corresponding to the device is finally set in peak_pci_post_irq(), see:

http://lxr.linux.no/#linux+v3.12/drivers/net/can/sja1000/peak_pci.c#L538

I doubt that protecting peak_pci_post_irq() would help as it only handles the
bit corresponding to the device.
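
To illustrate why a re-check loop of this kind covers the race Pavel
describes, here is a minimal user-space sketch. It is not the driver code:
fake_chip, read_ir(), service() and handle_irq() are hypothetical names, and
the sketch assumes that reading the interrupt register returns and clears
the latched causes. A cause that latches while the handler is already
running is picked up by the next pass of the loop instead of waiting for a
new edge.

```c
#include <stdint.h>

#define MAX_IRQ 20 /* loop bound, in the spirit of SJA1000_MAX_IRQ */

/* Hypothetical stand-in for one CAN controller; not the driver's struct. */
struct fake_chip {
	uint8_t pending;      /* interrupt causes currently latched */
	uint8_t arrive_later; /* causes that latch while the handler runs */
	int handled;          /* number of causes serviced so far */
};

/* Reading the interrupt register returns the latched causes and clears them. */
static uint8_t read_ir(struct fake_chip *c)
{
	uint8_t isrc = c->pending;

	c->pending = 0;
	return isrc;
}

/* Service every cause in isrc; late events latch while this runs. */
static void service(struct fake_chip *c, uint8_t isrc)
{
	while (isrc) {
		isrc &= (uint8_t)(isrc - 1); /* clear the lowest set bit */
		c->handled++;
	}
	c->pending |= c->arrive_later;
	c->arrive_later = 0;
}

/* Returns the number of loop passes; 0 means "not my interrupt". */
static int handle_irq(struct fake_chip *c)
{
	uint8_t isrc;
	int n = 0;

	/* Check the bound first so a read never discards unserviced causes. */
	while (n < MAX_IRQ && (isrc = read_ir(c)) != 0) {
		n++;
		service(c, isrc);
	}
	return n;
}
```

With one cause pending and a second one arriving during servicing,
handle_irq() makes two passes and leaves nothing latched, so no wake-up is
lost even if the line only signals edges.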

I don't know if the -rt kernel with the irq threads is more sensitive to
IRQ_NONE return values than the mainline kernel.
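
For background on why IRQ_NONE matters here: the "nobody cared" message
comes from the generic IRQ layer's spurious-interrupt accounting. The
following is a simplified user-space model of that accounting, not kernel
code. The struct and function names are hypothetical, the two thresholds
(a window of 100000 interrupts, more than 99900 of them unhandled) are
taken from kernel/irq/spurious.c as far as known, and details such as the
time-based reset of the unhandled counter are deliberately left out.

```c
#include <stdbool.h>

#define WINDOW_IRQS     100000 /* interrupts per evaluation window */
#define UNHANDLED_LIMIT  99900 /* more unhandled than this disables the line */

/* Hypothetical per-line bookkeeping, loosely modelled on struct irq_desc. */
struct irq_line {
	unsigned int irq_count;
	unsigned int irqs_unhandled;
	bool disabled;
};

/*
 * Account for one interrupt on the line.
 * any_handled: true if at least one handler sharing the line claimed it.
 */
static void note_irq(struct irq_line *l, bool any_handled)
{
	if (l->disabled)
		return;

	if (!any_handled)
		l->irqs_unhandled++;

	if (++l->irq_count < WINDOW_IRQS)
		return;

	/* End of window: decide, then start a new window. */
	if (l->irqs_unhandled > UNHANDLED_LIMIT)
		l->disabled = true; /* the kernel would report "nobody cared" */

	l->irq_count = 0;
	l->irqs_unhandled = 0;
}
```

In this model a line where half of the interrupts go unclaimed stays
enabled, while a line where essentially every interrupt is answered with
IRQ_NONE gets disabled at the end of the window.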

Btw. I would try two more things with the sja1000.c driver:

1. Print the corresponding CAN device name and the point in the code before
returning IRQ_NONE to catch the problematic return site.

2. Replace all IRQ_NONE with IRQ_HANDLED for a test ...

Any other ideas?

Regards,
Oliver


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-16 21:42                                 ` Oliver Hartkopp
@ 2013-11-17  8:18                                   ` Wolfgang Grandegger
  2013-11-17 14:27                                     ` Oliver Hartkopp
  0 siblings, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-11-17  8:18 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Pavel Pisa, Austin Schuh, linux-can

Hi Oliver,

On Sat, 16 Nov 2013 22:42:10 +0100, Oliver Hartkopp
<socketcan@hartkopp.net> wrote:
> Hi all,
>
> I would like to continue the discussion about Austin's problem with the
> -rt kernel.
>
> This problem addressed by Pavel should be handled correctly ...
>
>> I have noticed that you speak about PITA in the thread.
>>
>> There can be a problem that local bus IRQs are shared
>> and sensed as edge triggered on the PITA but PCI
>> interrupt processing is selected for level sharing
>> between PCI cards/sources. That means, when the IRQ
>> on the first SJA1000 chip is activated exactly after
>> it was checked in the ISR and before the other chip's IRQ
>> is cleared, then you receive no more IRQs until
>> a forced/polled check of the first chip.
>
> ... by this while statement in the current sja1000_interrupt() function:
>
> http://lxr.linux.no/#linux+v3.12/drivers/net/can/sja1000/sja1000.c#L503
>
> AFAICS, when the interrupt is handled in this manner, the bit in the PITA
> corresponding to the device is finally set in peak_pci_post_irq(), see:
>
> http://lxr.linux.no/#linux+v3.12/drivers/net/can/sja1000/peak_pci.c#L538
>
> I doubt that protecting peak_pci_post_irq() would help as it only handles
> the bit corresponding to the device.

I agree if the hardware is working properly. But who knows...

> I don't know if the -rt kernel with the irq threads is more sensitive to
> IRQ_NONE return values than the mainline kernel.

It's usually more sensitive to race conditions. IRQ_NONE is only returned
for shared unhandled interrupts.

> Btw. I would try two more things with the sja1000.c driver:
>
> 1. Print the corresponding CAN device name and the point in the code
> before returning IRQ_NONE to catch the problematic return site.
>
> 2. Replace all IRQ_NONE with IRQ_HANDLED for a test ...
>
> Any other ideas?

A function trace is still the first thing to take. Be aware that the
tasks are running on more than one CPU. It's just to establish a good
trigger.

Anyway, I think the direct "return IRQ_NONE" statements are wrong
because they do not call the "post_irq" handler.

Wolfgang.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-17  8:18                                   ` Wolfgang Grandegger
@ 2013-11-17 14:27                                     ` Oliver Hartkopp
  2013-11-17 17:23                                       ` Wolfgang Grandegger
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-11-17 14:27 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Pavel Pisa, Austin Schuh, linux-can



On 17.11.2013 09:18, Wolfgang Grandegger wrote:
> On Sat, 16 Nov 2013 22:42:10 +0100, Oliver Hartkopp
>> Btw. I would try two more things with the sja1000.c driver:
>>
>> 1. Print the corresponding CAN device name and the point in the code
>> before returning IRQ_NONE to catch the problematic return site.
>>
>> 2. Replace all IRQ_NONE with IRQ_HANDLED for a test ...
>>
>> Any other ideas?
>
> A function trace is still the first thing to take. Be aware that the
> tasks are running on more than one CPU. It's just to establish a good
> trigger.

Yes. But so far the function trace only showed the fact of simultaneous irq
threads for the two CAN interfaces - what we were expecting :-)

IMO it's interesting to get the concrete return site from where the IRQ_NONE
is returned. Maybe one of the other interfaces / bits in PITA are enabled
which we did not see so far.

> Anyway, I think the direct "return IRQ_NONE" statements are wrong
> because they do not call the "post_irq" handler.

That's true. I just sent a patch to fix this issue.
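
The pairing problem discussed here can be sketched in user space as
follows. This is only one possible shape for such a fix and not the patch
mentioned above, which is not shown in this thread and may differ. All
names (fake_priv, count_pre(), count_post(), handle()) are hypothetical.
The point is that every exit taken after pre_irq() goes through a single
label that calls post_irq(), including the early "not mine" exits.

```c
#include <stddef.h>

/* Stand-ins for IRQ_NONE / IRQ_HANDLED. */
enum irq_result { RES_NONE = 0, RES_HANDLED = 1 };

/* Hypothetical private data; only the hook pointers mirror the driver idea. */
struct fake_priv {
	void (*pre_irq)(struct fake_priv *);
	void (*post_irq)(struct fake_priv *);
	int irqs_enabled;       /* 0 models "IER == IRQ_OFF" */
	int controller_present; /* 0 models a hot-unplugged controller */
	int pending;            /* events waiting to be serviced */
	int pre_calls;
	int post_calls;
};

static void count_pre(struct fake_priv *p)  { p->pre_calls++; }
static void count_post(struct fake_priv *p) { p->post_calls++; }

static enum irq_result handle(struct fake_priv *p)
{
	int n = 0;

	if (p->pre_irq)
		p->pre_irq(p);

	/* Shared interrupt and our interrupts are off: not ours. */
	if (!p->irqs_enabled)
		goto out;

	while (p->pending > 0) {
		if (!p->controller_present) {
			n = 0; /* report "not handled" for a vanished device */
			goto out;
		}
		p->pending--;
		n++;
	}

out:
	/* Single exit: post_irq() always balances pre_irq(). */
	if (p->post_irq)
		p->post_irq(p);

	return n ? RES_HANDLED : RES_NONE;
}
```

Whatever path is taken, pre_calls and post_calls stay equal, which is the
property the direct "return IRQ_NONE" statements break.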

Tnx,
Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-17 14:27                                     ` Oliver Hartkopp
@ 2013-11-17 17:23                                       ` Wolfgang Grandegger
  2013-11-17 20:46                                         ` Wolfgang Grandegger
  0 siblings, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-11-17 17:23 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Pavel Pisa, Austin Schuh, linux-can

On 11/17/2013 03:27 PM, Oliver Hartkopp wrote:
> On 17.11.2013 09:18, Wolfgang Grandegger wrote:
>> On Sat, 16 Nov 2013 22:42:10 +0100, Oliver Hartkopp
>>> Btw. I would try two more things with the sja1000.c driver:
>>>
>>> 1. Print the corresponding CAN device name and the point in the code
>>> before returning IRQ_NONE to catch the problematic return site.
>>>
>>> 2. Replace all IRQ_NONE with IRQ_HANDLED for a test ...
>>>
>>> Any other ideas?
>>
>> A function trace is still the first thing to take. Be aware that the
>> tasks are running on more than one CPU. It's just to establish a good
>> trigger.
>
> Yes. But so far the function trace only showed the fact of simultaneous irq
> threads for the two CAN interfaces - what we were expecting :-)

Yes, because the trigger was not good so far.

> IMO it's interesting to get the concrete return site from where the IRQ_NONE
> is returned. Maybe one of the other interfaces / bits in PITA are enabled
> which we did not see so far.

The problem is that the IRQs are shared, also with the ATA disk. There
it's normal to see unhandled interrupts. My idea was to trigger a
"tracing_off" after 10 IRQ_NONE in sequence.
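
The trigger idea can be modelled in user space like this. It is a sketch
under assumptions, not the debug patch: none_streak and
none_streak_update() are hypothetical names, and the counter is kept per
caller-supplied struct (for example one per device) rather than in one
file-scope variable. In the kernel, the place marked in the comment is
where tracing_off() would be called.

```c
#include <stdbool.h>

#define NONE_STREAK_MAX 10 /* "10 IRQ_NONE in sequence", as proposed above */

/* Hypothetical bookkeeping for one interrupt handler instance. */
struct none_streak {
	int count;    /* consecutive unhandled results seen so far */
	bool tripped; /* set once the threshold has been reached */
};

/*
 * Record one handler result.
 * Returns true exactly once: when the streak first reaches the threshold.
 */
static bool none_streak_update(struct none_streak *s, bool handled)
{
	if (handled) {
		s->count = 0; /* any handled interrupt ends the streak */
		return false;
	}

	s->count++;
	if (s->count >= NONE_STREAK_MAX && !s->tripped) {
		s->tripped = true; /* kernel code would call tracing_off() here */
		return true;
	}
	return false;
}
```

Because occasional unhandled interrupts are normal on a shared line, only
an unbroken run trips the trigger; a single handled interrupt resets the
count, so the trace is frozen close to the point where the line gets stuck.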

>> Anyway, I think the direct "return IRQ_NONE" statements are wrong
>> because they do not call the "post_irq" handler.
>
> That's true. I just sent a patch to fix this issue.

Thanks, will have a closer look tomorrow.

Wolfgang.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-17 17:23                                       ` Wolfgang Grandegger
@ 2013-11-17 20:46                                         ` Wolfgang Grandegger
  2013-11-18 17:08                                           ` Austin Schuh
  0 siblings, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-11-17 20:46 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Pavel Pisa, Austin Schuh, linux-can

On 11/17/2013 06:23 PM, Wolfgang Grandegger wrote:
> On 11/17/2013 03:27 PM, Oliver Hartkopp wrote:
>> On 17.11.2013 09:18, Wolfgang Grandegger wrote:
>>> On Sat, 16 Nov 2013 22:42:10 +0100, Oliver Hartkopp
>>>> Btw. I would try two more things with the sja1000.c driver:
>>>>
>>>> 1. Print the corresponding CAN device name and the point in the code
>>>> before returning IRQ_NONE to catch the problematic return site.
>>>>
>>>> 2. Replace all IRQ_NONE with IRQ_HANDLED for a test ...
>>>>
>>>> Any other ideas?
>>>
>>> A function trace is still the first thing to take. Be aware that the
>>> tasks are running on more than one CPU. It's just to establish a good
>>> trigger.
>>
>> Yes. But so far the function trace only showed the fact of simultaneous irq
>> threads for the two CAN interfaces - what we were expecting :-)
> 
> Yes, because the trigger was not good so far.
> 
>> IMO it's interesting to get the concrete return site from where the IRQ_NONE
>> is returned. Maybe one of the other interfaces / bits in PITA are enabled
>> which we did not see so far.
> 
> The problem is that the IRQs are shared, also with the ATA disk. There
> it's normal to see unhandled interrupts. My idea was to trigger a
> "tracing_off" after 10 IRQ_NONE in sequence.

See my patch below. @Austin, you may want to give it a try.

Wolfgang.

From eac2b63d114bc83544f806a549fe7339d54c031f Mon Sep 17 00:00:00 2001
From: Wolfgang Grandegger <wg@grandegger.com>
Date: Sun, 17 Nov 2013 21:39:05 +0100
Subject: [PATCH] sja1000: debugging code to inspect too many unhandled irqs

If "IRQ_NONE_COUNT_MAX" or more unhandled interrupts are detected
in sequence, ftrace tracing will be stopped.
---
 drivers/net/can/sja1000/sja1000.c |   34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
index 7164a99..80e8d84 100644
--- a/drivers/net/can/sja1000/sja1000.c
+++ b/drivers/net/can/sja1000/sja1000.c
@@ -486,6 +486,9 @@ static int sja1000_err(struct net_device *dev, uint8_t isrc, uint8_t status)
 	return 0;
 }
 
+#define IRQ_NONE_COUNT_MAX 5
+static int irq_none_count;
+
 irqreturn_t sja1000_interrupt(int irq, void *dev_id)
 {
 	struct net_device *dev = (struct net_device *)dev_id;
@@ -496,7 +499,7 @@ irqreturn_t sja1000_interrupt(int irq, void *dev_id)
 
 	/* Shared interrupts and IRQ off? */
 	if (priv->read_reg(priv, SJA1000_IER) == IRQ_OFF)
-		return IRQ_NONE;
+		goto out;
 
 	if (priv->pre_irq)
 		priv->pre_irq(priv);
@@ -506,8 +509,11 @@ irqreturn_t sja1000_interrupt(int irq, void *dev_id)
 		n++;
 		status = priv->read_reg(priv, SJA1000_SR);
 		/* check for absent controller due to hw unplug */
-		if (status == 0xFF && sja1000_is_absent(priv))
-			return IRQ_NONE;
+		if (status == 0xFF && sja1000_is_absent(priv)) {
+			netdev_warn(dev, "controller lost (n=%d)\n", n);
+			n = 0;
+			goto out;
+		}
 
 		if (isrc & IRQ_WUI)
 			netdev_warn(dev, "wakeup interrupt\n");
@@ -534,8 +540,11 @@ irqreturn_t sja1000_interrupt(int irq, void *dev_id)
 				sja1000_rx(dev);
 				status = priv->read_reg(priv, SJA1000_SR);
 				/* check for absent controller */
-				if (status == 0xFF && sja1000_is_absent(priv))
-					return IRQ_NONE;
+				if (status == 0xFF && sja1000_is_absent(priv)) {
+					netdev_warn(dev, "controller lost (n=%d)\n", n);
+					n = 0;
+					goto out;
+				}
 			}
 		}
 		if (isrc & (IRQ_DOI | IRQ_EI | IRQ_BEI | IRQ_EPI | IRQ_ALI)) {
@@ -547,11 +556,22 @@ irqreturn_t sja1000_interrupt(int irq, void *dev_id)
 
 	if (priv->post_irq)
 		priv->post_irq(priv);
-
 	if (n >= SJA1000_MAX_IRQ)
 		netdev_dbg(dev, "%d messages handled in ISR", n);
 
-	return (n) ? IRQ_HANDLED : IRQ_NONE;
+out:
+	if (n) {
+		irq_none_count = 0;
+		return IRQ_HANDLED;
+	} else {
+		irq_none_count++;
+		if (irq_none_count >= IRQ_NONE_COUNT_MAX) {
+			netdev_info(dev, "tracing stopped after %d unhandled irqs\n",
+				    irq_none_count);
+			tracing_off();
+		}
+		return IRQ_NONE;
+	}
 }
 EXPORT_SYMBOL_GPL(sja1000_interrupt);
 
-- 
1.7.9.5





^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-17 20:46                                         ` Wolfgang Grandegger
@ 2013-11-18 17:08                                           ` Austin Schuh
  2013-12-09 21:54                                             ` Austin Schuh
  0 siblings, 1 reply; 66+ messages in thread
From: Austin Schuh @ 2013-11-18 17:08 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Oliver Hartkopp, Pavel Pisa, linux-can

I'm off on a business trip right now, and when I get back either late
this week, or early next week, I'll be able to get back to working on
debugging this.  Thanks everyone for all your help so far!

Austin


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-11-18 17:08                                           ` Austin Schuh
@ 2013-12-09 21:54                                             ` Austin Schuh
  2013-12-09 21:54                                               ` Austin Schuh
  2013-12-10  7:49                                               ` Wolfgang Grandegger
  0 siblings, 2 replies; 66+ messages in thread
From: Austin Schuh @ 2013-12-09 21:54 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Oliver Hartkopp, Pavel Pisa, linux-can

Ok.  I succeeded in getting a trace from when the IRQ gets disabled.
It took ~2.5 days to trigger it with tracing enabled, and it takes
much less time (typically under a day) with tracing disabled.  The
trace is at http://98.207.84.11/sja1000_crash.trace.bz2  It is 42 MB
compressed and 1 GB uncompressed.

Austin


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-09 21:54                                             ` Austin Schuh
@ 2013-12-09 21:54                                               ` Austin Schuh
  2013-12-10  7:49                                               ` Wolfgang Grandegger
  1 sibling, 0 replies; 66+ messages in thread
From: Austin Schuh @ 2013-12-09 21:54 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Oliver Hartkopp, Pavel Pisa, linux-can

FYI, this is without the patch proposed in "Re: [PATCH v6] can:
sja1000: fix {pre,post}_irq() handling and IRQ handler return value"

Austin


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-09 21:54                                             ` Austin Schuh
  2013-12-09 21:54                                               ` Austin Schuh
@ 2013-12-10  7:49                                               ` Wolfgang Grandegger
  2013-12-10  8:05                                                 ` Austin Schuh
  1 sibling, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-12-10  7:49 UTC (permalink / raw)
  To: Austin Schuh; +Cc: Oliver Hartkopp, Pavel Pisa, linux-can

Hi Austin,

On Mon, 9 Dec 2013 13:54:24 -0800, Austin Schuh <austin@peloton-tech.com>
wrote:
> Ok.  I succeeded in getting a trace from when the IRQ gets disabled.
> It took ~2.5 days to trigger it with tracing enabled, and it takes
> much less time (typically under a day) with tracing disabled.  The
> trace is at http://98.207.84.11/sja1000_crash.trace.bz2  It is 42 MB
> compressed and 1 GB uncompressed.

Could you please tell us what trigger you used (how and when did you
stop the trace) and what the test scenario was (TX on CAN0, RX on CAN1,
etc.).

Wolfgang.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-10  7:49                                               ` Wolfgang Grandegger
@ 2013-12-10  8:05                                                 ` Austin Schuh
  2013-12-10  9:32                                                   ` Wolfgang Grandegger
  0 siblings, 1 reply; 66+ messages in thread
From: Austin Schuh @ 2013-12-10  8:05 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Oliver Hartkopp, Pavel Pisa, linux-can

Hi Wolfgang,

On Mon, Dec 9, 2013 at 11:49 PM, Wolfgang Grandegger <wg@grandegger.com> wrote:
> Hi Austin,
>
> On Mon, 9 Dec 2013 13:54:24 -0800, Austin Schuh <austin@peloton-tech.com>
> wrote:
>> Ok.  I succeeded in getting a trace from when the IRQ gets disabled.
>> It took ~2.5 days to trigger it with tracing enabled, and it takes
>> much less time (typically under a day) with tracing disabled.  The
>> trace is at http://98.207.84.11/sja1000_crash.trace.bz2  It is 42 MB
>> compressed and 1 GB uncompressed.
>
> Could you please tell us what trigger you used (how and when did you
> stop the trace) and what the test scenario was (TX on CAN0, RX on CAN1,
> etc.).
>
> Wolfgang.

Whoops, forgot that critical detail...

can0 is connected to a can bus at 250k with a bunch of other can
devices, but for this test, they are all off and nothing is happening
on that bus.

can1 is connected to a can bus at 500k with 2 other can devices.  They
are filling the bus up to about 50% bus load.  I am sending nothing on
can1 for this test.  There is very little load on the system at the
time besides running candump on can1 and streaming that back over ssh.

The trigger is in kernel/irq/spurious.c and triggers when IRQ 18 gets
disabled.  That is the only modification to the kernel.

Austin

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-10  8:05                                                 ` Austin Schuh
@ 2013-12-10  9:32                                                   ` Wolfgang Grandegger
  2013-12-10 13:47                                                     ` Oliver Hartkopp
  0 siblings, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-12-10  9:32 UTC (permalink / raw)
  To: Austin Schuh; +Cc: Oliver Hartkopp, Pavel Pisa, linux-can

On Tue, 10 Dec 2013 00:05:35 -0800, Austin Schuh <austin@peloton-tech.com>
wrote:
> Hi Wolfgang,
>
> On Mon, Dec 9, 2013 at 11:49 PM, Wolfgang Grandegger <wg@grandegger.com>
> wrote:
>> Hi Austin,
>>
>> On Mon, 9 Dec 2013 13:54:24 -0800, Austin Schuh <austin@peloton-tech.com>
>> wrote:
>>> Ok.  I succeeded in getting a trace from when the IRQ gets disabled.
>>> It took ~2.5 days to trigger it with tracing enabled, and it takes
>>> much less time (typically under a day) with tracing disabled.  The
>>> trace is at http://98.207.84.11/sja1000_crash.trace.bz2  It is 42 MB
>>> compressed and 1 GB uncompressed.
>>
>> Could you please tell us what trigger you used (how and when did you
>> stop the trace) and what the test scenario was (TX on CAN0, RX on CAN1,
>> etc.).
>>
>> Wolfgang.
>
> Whoops, forgot that critical detail...
>
> can0 is connected to a can bus at 250k with a bunch of other can
> devices, but for this test, they are all off and nothing is happening
> on that bus.

Is CAN0 "up"? I ask because I found in the trace:

     irq/18-can0-1863  [002] .....11 360026.316603: peak_pci_post_irq <-sja1000_interrupt

>
> can1 is connected to a can bus at 500k with 2 other can devices.  They
> are filling the bus up to about 50% bus load.  I am sending nothing on
> can1 for this test.  There is very little load on the system at the
> time besides running candump on can1 and streaming that back over ssh.

OK, only RX.

>
> The trigger is in kernel/irq/spurious.c and triggers when IRQ 18 gets
> disabled.  That is the only modification to the kernel.

You mean here:

  void note_interrupt(unsigned int irq, struct irq_desc *desc,
                      irqreturn_t action_ret)
  {
          ...
          if (unlikely(desc->irqs_unhandled > 99900)) {
                  /*
                   * The interrupt is stuck
                   */
                  __report_bad_irq(irq, desc, action_ret);
                  /*
                   * Now kill the IRQ
                   */
                  printk(KERN_EMERG "Disabling IRQ #%d\n", irq);
                  desc->istate |= IRQS_SPURIOUS_DISABLED;
                  desc->depth++;
                  irq_disable(desc);

                  mod_timer(&poll_spurious_irq_timer,
                            jiffies + POLL_SPURIOUS_IRQ_INTERVAL);
          }
          desc->irqs_unhandled = 0;
  }
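
The threshold logic in this excerpt can be modelled in user space roughly
as follows. This is a simplified sketch, not the kernel code: the
100000-interrupt evaluation window stands in for the part elided by "..."
and is an assumption, the kernel's time-based treatment of unhandled
interrupts is ignored, and all names (irq_model, model_note_interrupt,
model_run) are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; names are illustrative, not kernel API. */
enum {
	WINDOW = 100000,          /* assumed size of one evaluation window */
	UNHANDLED_LIMIT = 99900   /* threshold taken from the quoted check */
};

struct irq_model {
	unsigned int irq_count;       /* interrupts seen in the current window */
	unsigned int irqs_unhandled;  /* of those, how many nobody claimed */
	bool disabled;                /* set once the line has been shut off */
};

/* Account for one interrupt; returns true if this call disabled the line. */
bool model_note_interrupt(struct irq_model *m, bool handled)
{
	if (m->disabled)
		return false;
	if (!handled)
		m->irqs_unhandled++;
	if (++m->irq_count < WINDOW)
		return false;

	/* End of window: evaluate the counters, then start a new window. */
	m->irq_count = 0;
	if (m->irqs_unhandled > UNHANDLED_LIMIT) {
		m->disabled = true;
		return true;
	}
	m->irqs_unhandled = 0;
	return false;
}

/* Feed n interrupts; every period-th one is handled (period 0 = none). */
bool model_run(struct irq_model *m, unsigned int n, unsigned int period)
{
	assert(m);
	for (unsigned int i = 1; i <= n; i++) {
		bool handled = period && (i % period == 0);

		if (model_note_interrupt(m, handled))
			return true;
	}
	return m->disabled;
}
```

With a zero-initialised struct irq_model, model_run(&m, 100000, 0) reports
the line as disabled at the end of the first window, while handling just
one interrupt in a hundred (period 100) leaves it enabled. Under these
assumptions the event that starts the storm can lie up to a full window
before the "Disabling IRQ" message.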



I see that note_interrupt() is called at high rate before it gets
disabled:

     irq/18-can1-1890  [001] ....... 360026.315459: note_interrupt <-irq_thread
 irq/18-ata_gene-219   [003] ....... 360026.315626: note_interrupt <-irq_thread
     irq/18-can0-1863  [002] ....... 360026.315627: note_interrupt <-irq_thread
     irq/18-can1-1890  [001] ....... 360026.315683: note_interrupt <-irq_thread
     irq/18-can0-1863  [002] ....... 360026.315873: note_interrupt <-irq_thread
 irq/18-ata_gene-219   [003] ....... 360026.315878: note_interrupt <-irq_thread
     irq/18-can1-1890  [001] ....... 360026.315984: note_interrupt <-irq_thread
     irq/18-can0-1863  [002] ....... 360026.316122: note_interrupt <-irq_thread
 irq/18-ata_gene-219   [003] ....... 360026.316140: note_interrupt <-irq_thread
     irq/18-can1-1890  [001] ....... 360026.316184: note_interrupt <-irq_thread
     irq/18-can0-1863  [002] ....... 360026.316358: note_interrupt <-irq_thread
 irq/18-ata_gene-219   [003] ....... 360026.316361: note_interrupt <-irq_thread
 irq/18-ata_gene-219   [003] ....... 360026.316361: __report_bad_irq <-note_interrupt
     irq/18-can1-1890  [001] ....... 360026.316437: note_interrupt <-irq_thread
     irq/18-can0-1863  [002] ....... 360026.316608: note_interrupt <-irq_thread
     irq/18-can1-1890  [001] ....... 360026.316714: note_interrupt <-irq_thread
 irq/18-ata_gene-219   [003] d...1.. 360026.317050: _raw_spin_unlock_irqrestore <-note_interrupt
 irq/18-ata_gene-219   [003] ....... 360026.317051: printk <-note_interrupt
 irq/18-ata_gene-219   [003] ....... 360026.317541: irq_disable <-note_interrupt
 irq/18-ata_gene-219   [003] ....... 360026.317541: mod_timer <-note_interrupt

But it will be difficult to locate what happened 99900 IRQs before. Anyway,
I will have a closer look this evening.

Wolfgang.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-10  9:32                                                   ` Wolfgang Grandegger
@ 2013-12-10 13:47                                                     ` Oliver Hartkopp
  2013-12-10 14:23                                                       ` Oliver Hartkopp
  2013-12-10 14:41                                                       ` Wolfgang Grandegger
  0 siblings, 2 replies; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-10 13:47 UTC (permalink / raw)
  To: Wolfgang Grandegger, Austin Schuh, Pavel Pisa; +Cc: linux-can

Hey all,

as I have a similar setup here (Core i7, 5x PEAK cPCI = 20 CAN interfaces) I
downloaded the linux-image-3.10-0.bpo.3-rt-686-pae kernel including the
sources from

	http://packages.debian.org/de/wheezy-backports/kernel/

and was able to see Austin's problem with the -rt kernel.

My interrupt lines are mostly dedicated to the CAN interfaces, so I was able
to select interrupts (17 & 19) that _only_ deal with sja1000 irq handlers:

 16:          7          7         10          9   IO-APIC-fasteoi   ehci_hcd:usb1, ahci, can4, can5, can6, can7
 17:    6328236    6330659    6328557    6330266   IO-APIC-fasteoi   can8, can10, can9
 18:          0          0          0          0   IO-APIC-fasteoi   can12, can13, can14, can15
 19:    1446093    1443817    1445833    1444230   IO-APIC-fasteoi   can2, can16, can17, can18, can19, can3, can1, can0

can0/can2 are linked together (500 kbit/s)
can1/can3 are linked together (500 kbit/s)
can9 is linked to a 1Mbit/s CAN traffic source

All interfaces get a full bus load from the outside.
Additionally can0 and can1 get a 'cangen -g0 -i <if>' from the local host.

The funny thing was that one time IRQ #19 got disabled twice(?!?) :

Message from syslogd@xxxxx at Dec 10 11:25:37 ...
 kernel:[  967.213174] Disabling IRQ #19

Message from syslogd@xxxxx at Dec 10 12:06:13 ...
 kernel:[ 3401.523019] Disabling IRQ #17

Message from syslogd@xxxxx at Dec 10 12:49:08 ...
 kernel:[ 5975.113373] Disabling IRQ #19

Don't know where the last message could come from as the 8 CAN interfaces at
this interrupt line were already dead for more than an hour.

The disabling of the interrupt seems to be reproducible, though (as Austin
already mentioned) after varying amounts of time.

My assumption was that we run into a problem with the PITA chip when
consuming the interface-specific interrupt line in peak_pci_post_irq(), see:

static void peak_pci_post_irq(const struct sja1000_priv *priv)
{
        struct peak_pci_chan *chan = priv->priv;
        u16 icr;

        /* Select and clear in PITA stored interrupt */
        icr = readw(chan->cfg_base + PITA_ICR);
        if (icr & chan->icr_mask)
                writew(chan->icr_mask, chan->cfg_base + PITA_ICR);
}

With the writew() only the corresponding SJA1000 line is consumed.

My quick hack was to clear all bits in the PITA each time:

--- peak_pci.c~ 2013-09-08 07:10:14.000000000 +0200
+++ peak_pci.c  2013-12-10 13:26:48.315166478 +0100
@@ -542,9 +542,13 @@
        u16 icr;
 
        /* Select and clear in PITA stored interrupt */
+#if 0
        icr = readw(chan->cfg_base + PITA_ICR);
        if (icr & chan->icr_mask)
                writew(chan->icr_mask, chan->cfg_base + PITA_ICR);
+#else
+       writew(0x00C3, chan->cfg_base + PITA_ICR);
+#endif
 }
 
 static int peak_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)

The 0x00C3 comes from OR'ing the values from 
static const u16 peak_pci_icr_masks[PEAK_PCI_CHAN_MAX]
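
The OR can be checked with a few lines of plain C. This is only a sketch:
the four mask values below are not quoted anywhere in this thread and are
an assumption to be verified against the peak_pci.c of the kernel in use,
and pita_all_channels_mask() is a made-up helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PEAK_PCI_CHAN_MAX 4

/* Assumed per-channel ICR bits; verify against your kernel's peak_pci.c. */
static const uint16_t peak_pci_icr_masks[PEAK_PCI_CHAN_MAX] = {
	0x02, 0x01, 0x40, 0x80
};

/* OR all per-channel masks into a single "every channel" mask. */
uint16_t pita_all_channels_mask(void)
{
	uint16_t all = 0;

	for (size_t i = 0; i < PEAK_PCI_CHAN_MAX; i++)
		all |= peak_pci_icr_masks[i];

	assert(all != 0);   /* at least one channel bit must be defined */
	return all;
}
```

With the assumed values, pita_all_channels_mask() yields 0x00C3, i.e. the
constant written to PITA_ICR in the hack above.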

I have been running the setup for more than one hour now without any problems.

But I assume that this is a really bad hack, and I did not check whether any
CAN frames got lost. Btw. the performance increased from 90% busload to 95%
busload with that patch when creating only local traffic on the host.

Any idea how to proceed?

Regards,
Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-10 13:47                                                     ` Oliver Hartkopp
@ 2013-12-10 14:23                                                       ` Oliver Hartkopp
  2013-12-10 14:41                                                       ` Wolfgang Grandegger
  1 sibling, 0 replies; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-10 14:23 UTC (permalink / raw)
  To: Wolfgang Grandegger, Austin Schuh, Pavel Pisa; +Cc: linux-can

In addition to the setup of the mail below:

Now the can9 (with the 1Mbit/s) crashed with this message:

[ 5542.981022] irq 17: nobody cared (try booting with the "irqpoll" option)
[ 5542.983013] CPU: 3 PID: 5407 Comm: irq/17-can10 Not tainted 3.10.11-rt7-can #1
[ 5542.983016] Hardware name: xxxxxx
[ 5542.983019]  00000000 c108910d f4e44840 00000000 00000011 c1089466 ee219f00 f4e44840
[ 5542.983027]  ee219f00 ef2d7580 c1087cf3 c10884a9 ee219f20 ef2d7580 1647bf59 00000000
[ 5542.983035]  00000000 00000000 00000000 c108857f ef169a68 ee219f00 c1088416 ee87bf90
[ 5542.983042] Call Trace:
[ 5542.983052]  [<c108910d>] ? __report_bad_irq+0x11/0x94
[ 5542.983057]  [<c1089466>] ? note_interrupt+0x118/0x192
[ 5542.983061]  [<c1087cf3>] ? irq_thread_fn+0x21/0x21
[ 5542.983064]  [<c10884a9>] ? irq_thread+0x93/0x169
[ 5542.983069]  [<c108857f>] ? irq_thread+0x169/0x169
[ 5542.983072]  [<c1088416>] ? wake_threads_waitq+0x31/0x31
[ 5542.983080]  [<c104a79e>] ? kthread+0x68/0x6d
[ 5542.983090]  [<c13143b7>] ? ret_from_kernel_thread+0x1b/0x28
[ 5542.983096]  [<c104a736>] ? __kthread_parkme+0x50/0x50
[ 5542.983102] handlers:
[ 5542.985069] [<c1087bdb>] irq_default_primary_handler threaded [<f886769b>] sja1000_interrupt [sja1000]
[ 5542.985073] [<c1087bdb>] irq_default_primary_handler threaded [<f886769b>] sja1000_interrupt [sja1000]
[ 5542.985080] [<c1087bdb>] irq_default_primary_handler threaded [<f886769b>] sja1000_interrupt [sja1000]
[ 5542.985082] [<c1087bdb>] irq_default_primary_handler threaded [<f886769b>] sja1000_interrupt [sja1000]
[ 5542.985083] Disabling IRQ #17

The problem with can9 shows up with irq/17-can10.
This might be related to the PITA hack.

Looks like this machine turned into a zombie:

I still get about 60 CAN frames per second from can9 even without the interrupt #17
counters in /proc/interrupts being increased ...

Oliver

On 10.12.2013 14:47, Oliver Hartkopp wrote:
> Hey all,
> 
> as I have a similar setup here (Core i7, 5x PEAK cPCI = 20 CAN interfaces) I
> downloaded the linux-image-3.10-0.bpo.3-rt-686-pae kernel including the
> sources from
> 
> 	http://packages.debian.org/de/wheezy-backports/kernel/
> 
> and was able to see Austin's problem with the -rt kernel.
> 
> My interrupt lines are mostly dedicated to the CAN interfaces, so I was able
> to select interrupts (17 & 19) that _only_ deal with sja1000 irq handlers:
> 
>  16:          7          7         10          9   IO-APIC-fasteoi   ehci_hcd:usb1, ahci, can4, can5, can6, can7
>  17:    6328236    6330659    6328557    6330266   IO-APIC-fasteoi   can8, can10, can9
>  18:          0          0          0          0   IO-APIC-fasteoi   can12, can13, can14, can15
>  19:    1446093    1443817    1445833    1444230   IO-APIC-fasteoi   can2, can16, can17, can18, can19, can3, can1, can0
> 
> can0/can2 are linked together (500 kbit/s)
> can1/can3 are linked together (500 kbit/s)
> can9 is linked to a 1Mbit/s CAN traffic source
> 
> All interfaces get a full bus load from the outside.
> Additionally can0 and can1 get a 'cangen -g0 -i <if>' from the local host.
> 
> The funny thing was that one time IRQ #19 got disabled twice(?!?) :
> 
> Message from syslogd@xxxxx at Dec 10 11:25:37 ...
>  kernel:[  967.213174] Disabling IRQ #19
> 
> Message from syslogd@xxxxx at Dec 10 12:06:13 ...
>  kernel:[ 3401.523019] Disabling IRQ #17
> 
> Message from syslogd@xxxxx at Dec 10 12:49:08 ...
>  kernel:[ 5975.113373] Disabling IRQ #19
> 
> Don't know where the last message could come from as the 8 CAN interfaces at
> this interrupt line were already dead for more than an hour.
> 
> The disabling of the interrupt seems to be reproducible, though (as Austin
> already mentioned) after varying amounts of time.
> 
> My assumption was that we run into a problem with the PITA chip when
> consuming the interface-specific interrupt line in peak_pci_post_irq(), see:
> 
> static void peak_pci_post_irq(const struct sja1000_priv *priv)
> {
>         struct peak_pci_chan *chan = priv->priv;
>         u16 icr;
> 
>         /* Select and clear in PITA stored interrupt */
>         icr = readw(chan->cfg_base + PITA_ICR);
>         if (icr & chan->icr_mask)
>                 writew(chan->icr_mask, chan->cfg_base + PITA_ICR);
> }
> 
> With the writew() only the corresponding SJA1000 line is consumed.
> 
> My quick hack was to clear all bits in the PITA each time:
> 
> --- peak_pci.c~ 2013-09-08 07:10:14.000000000 +0200
> +++ peak_pci.c  2013-12-10 13:26:48.315166478 +0100
> @@ -542,9 +542,13 @@
>         u16 icr;
>  
>         /* Select and clear in PITA stored interrupt */
> +#if 0
>         icr = readw(chan->cfg_base + PITA_ICR);
>         if (icr & chan->icr_mask)
>                 writew(chan->icr_mask, chan->cfg_base + PITA_ICR);
> +#else
> +       writew(0x00C3, chan->cfg_base + PITA_ICR);
> +#endif
>  }
>  
>  static int peak_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> 
> The 0x00C3 comes from OR'ing the values from 
> static const u16 peak_pci_icr_masks[PEAK_PCI_CHAN_MAX]
> 
> I have been running the setup for more than one hour now without any problems.
> 
> But I assume that this is a really bad hack, and I did not check whether any
> CAN frames got lost. Btw. the performance increased from 90% busload to 95%
> busload with that patch when creating only local traffic on the host.
> 
> Any idea how to proceed?
> 
> Regards,
> Oliver
> --
> To unsubscribe from this list: send the line "unsubscribe linux-can" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 





^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-10 13:47                                                     ` Oliver Hartkopp
  2013-12-10 14:23                                                       ` Oliver Hartkopp
@ 2013-12-10 14:41                                                       ` Wolfgang Grandegger
  2013-12-10 16:05                                                         ` Oliver Hartkopp
  1 sibling, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-12-10 14:41 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Austin Schuh, Pavel Pisa, linux-can

On Tue, 10 Dec 2013 14:47:24 +0100, Oliver Hartkopp
<socketcan@hartkopp.net> wrote:
> Hey all,
> 
> as I have a similar setup here (Core i7, 5x PEAK cPCI = 20 CAN interfaces) I
> downloaded the linux-image-3.10-0.bpo.3-rt-686-pae kernel including the
> sources from
> 
> 	http://packages.debian.org/de/wheezy-backports/kernel/
> 
> and was able to see Austin's problem with the -rt kernel.
> 
> My interrupt lines are mostly dedicated to the CAN interfaces, so I was able
> to select interrupts (17 & 19) that _only_ deal with sja1000 irq handlers:
> 
>  16:          7          7         10          9   IO-APIC-fasteoi   ehci_hcd:usb1, ahci, can4, can5, can6, can7
>  17:    6328236    6330659    6328557    6330266   IO-APIC-fasteoi   can8, can10, can9
>  18:          0          0          0          0   IO-APIC-fasteoi   can12, can13, can14, can15
>  19:    1446093    1443817    1445833    1444230   IO-APIC-fasteoi   can2, can16, can17, can18, can19, can3, can1, can0
> 
> can0/can2 are linked together (500 kbit/s)
> can1/can3 are linked together (500 kbit/s)
> can9 is linked to a 1Mbit/s CAN traffic source
> 
> All interfaces get a full bus load from the outside.
> Additionally can0 and can1 get a 'cangen -g0 -i <if>' from the local host.
> 
> The funny thing was that one time IRQ #19 got disabled twice(?!?) :
> 
> Message from syslogd@xxxxx at Dec 10 11:25:37 ...
>  kernel:[  967.213174] Disabling IRQ #19
> 
> Message from syslogd@xxxxx at Dec 10 12:06:13 ...
>  kernel:[ 3401.523019] Disabling IRQ #17
> 
> Message from syslogd@xxxxx at Dec 10 12:49:08 ...
>  kernel:[ 5975.113373] Disabling IRQ #19
> 
> Don't know where the last message could come from as the 8 CAN interfaces at
> this interrupt line were already dead for more than an hour.
> 
> The disabling of the interrupt seems to be reproducible, though (as Austin
> already mentioned) after varying amounts of time.
> 
> My assumption was that we run into a problem with the PITA chip when
> consuming the interface-specific interrupt line in peak_pci_post_irq(), see:
> 
> static void peak_pci_post_irq(const struct sja1000_priv *priv)
> {
>         struct peak_pci_chan *chan = priv->priv;
>         u16 icr;
> 
>         /* Select and clear in PITA stored interrupt */
>         icr = readw(chan->cfg_base + PITA_ICR);
>         if (icr & chan->icr_mask)
>                 writew(chan->icr_mask, chan->cfg_base + PITA_ICR);
> }
> 
> With the writew() only the corresponding SJA1000 line is consumed.
> 
> My quick hack was to clear all bits in the PITA each time:
> 
> --- peak_pci.c~ 2013-09-08 07:10:14.000000000 +0200
> +++ peak_pci.c  2013-12-10 13:26:48.315166478 +0100
> @@ -542,9 +542,13 @@
>         u16 icr;
>  
>         /* Select and clear in PITA stored interrupt */
> +#if 0
>         icr = readw(chan->cfg_base + PITA_ICR);
>         if (icr & chan->icr_mask)
>                 writew(chan->icr_mask, chan->cfg_base + PITA_ICR);

Could you try adding a spin lock/unlock here.

> +#else
> +       writew(0x00C3, chan->cfg_base + PITA_ICR);

This may kill unhandled (shared) IRQs. You can do that only if all
handlers have been called in advance (with a global ISR).

> +#endif
>  }
>  
>  static int peak_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> 
> The 0x00C3 comes from OR'ing the values from 
> static const u16 peak_pci_icr_masks[PEAK_PCI_CHAN_MAX]
> 
> I have been running the setup for more than one hour now without any problems.
> 
> But I assume that this is a really bad hack, and I did not check whether any
> CAN frames got lost. Btw. the performance increased from 90% busload to 95%
> busload with that patch when creating only local traffic on the host.
> 
> Any idea how to proceed?

See above.

Wolfgang.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-10 14:41                                                       ` Wolfgang Grandegger
@ 2013-12-10 16:05                                                         ` Oliver Hartkopp
  2013-12-10 21:12                                                           ` Wolfgang Grandegger
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-10 16:05 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Austin Schuh, Pavel Pisa, linux-can

On 10.12.2013 15:41, Wolfgang Grandegger wrote:
> On Tue, 10 Dec 2013 14:47:24 +0100, Oliver Hartkopp

>> +       writew(0x00C3, chan->cfg_base + PITA_ICR);
> 
> This may kill unhandled (shared) IRQs. You can do that only if all
> handlers have been called in advance (with a global ISR).
> 

So here's my patch for PITA access protection in peak_pci_post_irq().

Note that we need a single lock for all (e.g. 4) channels of the PITA.
So my idea was to create the lock in the per-channel structure of the first
channel and then let the channels 1,2,3 refer to this lock.

Maybe there's a better solution for that. E.g. the driver cleanup at removal
time may get into problems with this approach?!?

So far the system is running. I'll take a look tomorrow morning if it's still
alive :-) Maybe Austin can do some tests with this patch too.

Regards,
Oliver

--- linux-source-3.10/drivers/net/can/sja1000/peak_pci.c	2013-09-08 07:10:14.000000000 +0200
+++ linux-source-3.10-rt/drivers/net/can/sja1000/peak_pci.c	2013-12-10 16:58:16.381938240 +0100
@@ -41,6 +41,8 @@
 struct peak_pciec_card;
 struct peak_pci_chan {
 	void __iomem *cfg_base;		/* Common for all channels */
+	spinlock_t pita_lock;
+	spinlock_t *pita_lock_ptr;
 	struct net_device *prev_dev;	/* Chain of network devices */
 	u16 icr_mask;			/* Interrupt mask for fast ack */
 	struct peak_pciec_card *pciec_card;	/* only for PCIeC LEDs */
@@ -539,12 +541,17 @@
 static void peak_pci_post_irq(const struct sja1000_priv *priv)
 {
 	struct peak_pci_chan *chan = priv->priv;
+	unsigned long flags;
 	u16 icr;
 
 	/* Select and clear in PITA stored interrupt */
+	spin_lock_irqsave(chan->pita_lock_ptr, flags);
+
 	icr = readw(chan->cfg_base + PITA_ICR);
 	if (icr & chan->icr_mask)
 		writew(chan->icr_mask, chan->cfg_base + PITA_ICR);
+
+	spin_unlock_irqrestore(chan->pita_lock_ptr, flags);
 }
 
 static int peak_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
@@ -553,6 +560,7 @@
 	struct peak_pci_chan *chan;
 	struct net_device *dev;
 	void __iomem *cfg_base, *reg_base;
+	spinlock_t *pita_lock_ptr0;
 	u16 sub_sys_id, icr;
 	int i, err, channels;
 
@@ -611,6 +619,7 @@
 	icr = readw(cfg_base + PITA_ICR + 2);
 
 	for (i = 0; i < channels; i++) {
+
 		dev = alloc_sja1000dev(sizeof(struct peak_pci_chan));
 		if (!dev) {
 			err = -ENOMEM;
@@ -623,6 +632,13 @@
 		chan->cfg_base = cfg_base;
 		priv->reg_base = reg_base + i * PEAK_PCI_CHAN_SIZE;
 
+		if (i == 0) {
+			spin_lock_init(&chan->pita_lock);
+			pita_lock_ptr0 = &chan->pita_lock;
+		}
+
+		chan->pita_lock_ptr = pita_lock_ptr0;
+
 		priv->read_reg = peak_pci_read_reg;
 		priv->write_reg = peak_pci_write_reg;
 		priv->post_irq = peak_pci_post_irq;



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-10 16:05                                                         ` Oliver Hartkopp
@ 2013-12-10 21:12                                                           ` Wolfgang Grandegger
  2013-12-11 16:59                                                             ` Oliver Hartkopp
  0 siblings, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-12-10 21:12 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Austin Schuh, Pavel Pisa, linux-can

Hi Oliver,

On 12/10/2013 05:05 PM, Oliver Hartkopp wrote:
> On 10.12.2013 15:41, Wolfgang Grandegger wrote:
>> On Tue, 10 Dec 2013 14:47:24 +0100, Oliver Hartkopp
> 
>>> +       writew(0x00C3, chan->cfg_base + PITA_ICR);
>>
>> This may kill unhandled (shared) IRQs. You can do that only if all
>> handlers have been called in advance (with a global ISR).
>>
> 
> So here's my patch for PITA access protection in peak_pci_post_irq().
> 
> Note that we need a single lock for all (e.g. 4) channels of the PITA.
> So my idea was to create the lock in the per-channel structure of the first
> channel and then let the channels 1,2,3 refer to this lock.

Cool, this means that the hardware is not working as expected.

> Maybe there's a better solution for that. E.g. the driver cleanup at removal
> time may get into problems with this approach?!?
> 
> So far the system is running. I'll take a look tomorrow morning if its still
> alive :-) Maybe Austin can do some tests with this patch too.

That would be nice... before I start digging through 1 GB of function
trace lines ;-).

One more comment inline...

Wolfgang.

> --- linux-source-3.10/drivers/net/can/sja1000/peak_pci.c	2013-09-08 07:10:14.000000000 +0200
> +++ linux-source-3.10-rt/drivers/net/can/sja1000/peak_pci.c	2013-12-10 16:58:16.381938240 +0100
> @@ -41,6 +41,8 @@
>  struct peak_pciec_card;
>  struct peak_pci_chan {
>  	void __iomem *cfg_base;		/* Common for all channels */
> +	spinlock_t pita_lock;
> +	spinlock_t *pita_lock_ptr;

Well, that's ugly and wastes space. cfg_base is already a per-card
property. Maybe it's time to introduce a "struct peak_pci_card" and
handle it a la ems_pci.

>  	struct net_device *prev_dev;	/* Chain of network devices */
>  	u16 icr_mask;			/* Interrupt mask for fast ack */
>  	struct peak_pciec_card *pciec_card;	/* only for PCIeC LEDs */
> @@ -539,12 +541,17 @@
>  static void peak_pci_post_irq(const struct sja1000_priv *priv)
>  {
>  	struct peak_pci_chan *chan = priv->priv;
> +	unsigned long flags;
>  	u16 icr;
>  
>  	/* Select and clear in PITA stored interrupt */
> +	spin_lock_irqsave(chan->pita_lock_ptr, flags);
> +
>  	icr = readw(chan->cfg_base + PITA_ICR);
>  	if (icr & chan->icr_mask)
>  		writew(chan->icr_mask, chan->cfg_base + PITA_ICR);
> +
> +	spin_unlock_irqrestore(chan->pita_lock_ptr, flags);
>  }
>  
>  static int peak_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> @@ -553,6 +560,7 @@
>  	struct peak_pci_chan *chan;
>  	struct net_device *dev;
>  	void __iomem *cfg_base, *reg_base;
> +	spinlock_t *pita_lock_ptr0;
>  	u16 sub_sys_id, icr;
>  	int i, err, channels;
>  
> @@ -611,6 +619,7 @@
>  	icr = readw(cfg_base + PITA_ICR + 2);
>  
>  	for (i = 0; i < channels; i++) {
> +

Please do not add unrelated changes.

>  		dev = alloc_sja1000dev(sizeof(struct peak_pci_chan));
>  		if (!dev) {
>  			err = -ENOMEM;
> @@ -623,6 +632,13 @@
>  		chan->cfg_base = cfg_base;
>  		priv->reg_base = reg_base + i * PEAK_PCI_CHAN_SIZE;
>  
> +		if (i == 0) {
> +			spin_lock_init(&chan->pita_lock);
> +			pita_lock_ptr0 = &chan->pita_lock;
> +		}
> +
> +		chan->pita_lock_ptr = pita_lock_ptr0;
> +
>  		priv->read_reg = peak_pci_read_reg;
>  		priv->write_reg = peak_pci_write_reg;
>  		priv->post_irq = peak_pci_post_irq;
> 

BTW: while browsing Austin's function trace I realized that we also access
the inactive device if interrupts are shared. I think we should use
a shadow register variable for SJA1000_IR. Hope to find some time over
the weekend to provide a patch.
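
The per-card structure suggested in the inline comment above could be modelled roughly as follows. This is a userspace sketch, not driver code: struct card_model, struct chan_model, card_init() and chan_post_irq() are hypothetical names, an atomic_flag stands in for a kernel spinlock_t, a plain variable stands in for the PITA_ICR register, and write-1-to-clear behaviour is assumed.

```c
#include <stdatomic.h>
#include <stdint.h>

#define CHAN_MAX 4

/* Hypothetical per-card state: one lock and one register for all channels. */
struct card_model {
	atomic_flag icr_lock;	/* stand-in for a kernel spinlock_t */
	uint16_t icr;		/* stand-in for the PITA_ICR register */
};

/* Per-channel state only points at the shared card. */
struct chan_model {
	struct card_model *card;
	uint16_t icr_mask;
};

static void card_init(struct card_model *card,
		      struct chan_model chans[CHAN_MAX],
		      const uint16_t masks[CHAN_MAX])
{
	atomic_flag_clear(&card->icr_lock);
	card->icr = 0;
	for (int i = 0; i < CHAN_MAX; i++) {
		chans[i].card = card;
		chans[i].icr_mask = masks[i];
	}
}

/* Read, check and acknowledge this channel's bit under the card-wide lock. */
static void chan_post_irq(struct chan_model *chan)
{
	struct card_model *card = chan->card;

	while (atomic_flag_test_and_set_explicit(&card->icr_lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */

	if (card->icr & chan->icr_mask)
		card->icr &= (uint16_t)~chan->icr_mask;	/* assumed write-1-to-clear */

	atomic_flag_clear_explicit(&card->icr_lock, memory_order_release);
}
```

The lock lives exactly once per card, so no channel needs a pointer into another channel's private data, and removal order of the channels no longer matters for the lock's lifetime.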

Wolfgang.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-10 21:12                                                           ` Wolfgang Grandegger
@ 2013-12-11 16:59                                                             ` Oliver Hartkopp
  2013-12-11 19:27                                                               ` Wolfgang Grandegger
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-11 16:59 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Austin Schuh, Pavel Pisa, linux-can

Bad news.

Unfortunately the patch did not have the desired effect, and interrupt #19
was disabled again late in the evening.

I'm currently running the setup overnight with the same kernel but
without -rt, to make sure it definitely is a -rt issue.

On 10.12.2013 22:12, Wolfgang Grandegger wrote:

>>  
>>  	for (i = 0; i < channels; i++) {
>> +
> 
> Please do not add unrelated changes.
> 

Was not intended. I added/removed some code here ...

> 
> BTW: while browsing Austin's function trace I realized that we do access
> also the inactive device if interrupts are shared. I think we should use
> a shadow register variable for SJA1000_IR. Hope to find some time over
> the weekend to provide a patch.

??

AFAIK the interrupt is registered in sja1000_open (=> ifconfig up).
So why do we access an inactive (ifconfig down) device then?

Regards,
Oliver


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-11 16:59                                                             ` Oliver Hartkopp
@ 2013-12-11 19:27                                                               ` Wolfgang Grandegger
  2013-12-12  6:13                                                                 ` Oliver Hartkopp
  0 siblings, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-12-11 19:27 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Austin Schuh, Pavel Pisa, linux-can

On 12/11/2013 05:59 PM, Oliver Hartkopp wrote:
> Bad news.
> 
> Unfortunately the patch did not have the wanted effect and the interrupt #19
> dropped again the late evening.

Then it was pure luck that it ran that much longer?

> I'm currently running the setup overnight with the same kernel but only
> without -rt to make sure it definitely is a -rt issue.
> 
> On 10.12.2013 22:12, Wolfgang Grandegger wrote:
> 
>>>  
>>>  	for (i = 0; i < channels; i++) {
>>> +
>>
>> Please do not add unrelated changes.
>>
> 
> Was not intended. I added/removed some code here ...
> 
>>
>> BTW: while browsing Austin's function trace I realized that we do access
>> also the inactive device if interrupts are shared. I think we should use
>> a shadow register variable for SJA1000_IR. Hope to find some time over
>> the weekend to provide a patch.
> 
> ??
> 
> AFAIK the interrupt is registered in sja1000_open (=> ifconfig up).
> So why do we access an inactive (ifconfig down) device then?

You are right. Sorry for the noise.

BTW, does the problem show up if only one CAN interface is active? Austin's
tests did just RX on can1, but can0 was obviously up (but no traffic).

Wolfgang.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-11 19:27                                                               ` Wolfgang Grandegger
@ 2013-12-12  6:13                                                                 ` Oliver Hartkopp
  2013-12-12 17:38                                                                   ` Oliver Hartkopp
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-12  6:13 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Austin Schuh, Pavel Pisa, linux-can

On 11.12.2013 20:27, Wolfgang Grandegger wrote:
> On 12/11/2013 05:59 PM, Oliver Hartkopp wrote:
>> Bad news.
>>
>> Unfortunately the patch did not have the wanted effect and the interrupt #19
>> dropped again the late evening.
> 
> Then it was pure luck that it ran that much longer?
> 

Not sure about that. The time to failure varies quite a lot.

>> I'm currently running the setup overnight with the same kernel but only
>> without -rt to make sure it definitely is a -rt issue.
>>

It is. The setup with the mainline kernel is still running this morning.

> 
> BTW, does the problem show up if only one can is active? Austin tests
> did just RX on can1, but can0 was obviously up (but no traffic).

There was a case where the IRQ line from can9, can10, can11, can12 failed with
an error message pointing to can10. All four interfaces were up but only
can9 had traffic at that time.

Regards,
Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-12  6:13                                                                 ` Oliver Hartkopp
@ 2013-12-12 17:38                                                                   ` Oliver Hartkopp
  2013-12-12 22:56                                                                     ` Wolfgang Grandegger
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-12 17:38 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Austin Schuh, Pavel Pisa, linux-can

On 12.12.2013 07:13, Oliver Hartkopp wrote:
> On 11.12.2013 20:27, Wolfgang Grandegger wrote:

>> BTW, does the problem show up if only one can is active? Austin tests
>> did just RX on can1, but can0 was obviously up (but no traffic).
> 
> There was a case where the IRQ line from can9, can10, can11, can12 failed with
> an error message pointing to can10. All four interfaces were up but only
> can9 had traffic at that time.

Hi Wolfgang,

here's my latest investigation result.

The setup still has (only) can9 with traffic and after a modification of
linux/kernel/irq/spurious.c (patch below) I got this:

[ 1117.957651] irq 17: nobody cared (try booting with the "irqpoll" option)
[ 1117.959910] CPU: 0 PID: 3498 Comm: irq/17-can9 Not tainted 3.10.11-rt7-can #6
[ 1117.959913] Hardware name: xxxxxx
[ 1117.959916]  00000000 c1089114 f4e44840 00000001 00000011 c1089490 ee84e780 f4e44840
[ 1117.959924]  ee84e780 ed876fa0 c1087cf3 c10884a9 ee84e7a0 ed876fa0 16edf7d9 00000000
[ 1117.959932]  00000000 00000000 00000000 c108857f eea53de4 ee84e780 c1088416 ef0a1f90
[ 1117.959939] Call Trace:
[ 1117.959949]  [<c1089114>] ? __report_bad_irq+0x18/0xbe
[ 1117.959953]  [<c1089490>] ? note_interrupt+0x118/0x194
[ 1117.959957]  [<c1087cf3>] ? irq_thread_fn+0x21/0x21
[ 1117.959960]  [<c10884a9>] ? irq_thread+0x93/0x169
[ 1117.959964]  [<c108857f>] ? irq_thread+0x169/0x169
[ 1117.959968]  [<c1088416>] ? wake_threads_waitq+0x31/0x31
[ 1117.959973]  [<c104a79e>] ? kthread+0x68/0x6d
[ 1117.959979]  [<c13143b7>] ? ret_from_kernel_thread+0x1b/0x28
[ 1117.959982]  [<c104a736>] ? __kthread_parkme+0x50/0x50
[ 1117.959986] handlers:
[ 1117.962184] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can8 PITA 0x0001
[ 1117.962190] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can10 PITA 0x0001
[ 1117.962196] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can11 PITA 0x0001
[ 1117.962201] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can9 PITA 0x0001
[ 1117.962202] Disabling IRQ #17

The value in the PITA register is 0x0001 which is from can9 according to
peak_pci_icr_masks[] (2nd element).

So obviously the flag was set and not consumed correctly ...
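
Mapping a raw PITA value back to channels can be sketched as below. The mask table is an assumption inferred from this thread (0x0001 as the second element, 0x00C3 as the OR of all four), and pita_pending_channels() is a hypothetical helper, not a function from the driver.

```c
#include <stdint.h>

#define PITA_CHAN_MAX 4

/* Assumed per-channel ICR bits, in channel order 0..3. */
static const uint16_t pita_icr_masks[PITA_CHAN_MAX] = {
	0x0002, 0x0001, 0x0040, 0x0080
};

/* Returns a bitmap in which bit i is set if channel i's ICR bit is pending. */
static unsigned int pita_pending_channels(uint16_t icr)
{
	unsigned int chans = 0;

	for (int i = 0; i < PITA_CHAN_MAX; i++)
		if (icr & pita_icr_masks[i])
			chans |= 1u << i;
	return chans;
}
```

Under these assumptions 0x0001 decodes to channel index 1 only, which is the second interface of the card, i.e. can9 in the dump above.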

Regards,
Oliver

ps. here's the hack to print the PITA register value in spurious.c

As I was sure that I only have SJA1000 devices, dereferencing the
pointers from dev_id was feasible.

--- spurious.c-orig	2013-12-12 15:46:03.508063829 +0100
+++ spurious.c	2013-12-12 16:52:32.171719932 +0100
@@ -12,6 +12,8 @@
 #include <linux/kallsyms.h>
 #include <linux/interrupt.h>
 #include <linux/moduleparam.h>
+#include <linux/netdevice.h>
+#include <linux/can/dev.h>
 #include <linux/timer.h>
 
 #include "internals.h"
@@ -24,6 +26,36 @@
 static int irq_poll_cpu;
 static atomic_t irq_poll_active;
 
+struct peak_pci_chanx {
+	void __iomem *cfg_base;		/* Common for all channels */
+	struct net_device *prev_dev;	/* Chain of network devices */
+	u16 icr_mask;			/* Interrupt mask for fast ack */
+	void *pciec_card;	/* only for PCIeC LEDs */
+};
+struct sja1000_privx {
+	struct can_priv can;	/* must be the first member */
+	void *echo_skb;
+
+	/* the lower-layer is responsible for appropriate locking */
+	u8 (*read_reg) (const struct sja1000_privx *priv, int reg);
+	void (*write_reg) (const struct sja1000_privx *priv, int reg, u8 val);
+	void (*pre_irq) (const struct sja1000_privx *priv);
+	void (*post_irq) (const struct sja1000_privx *priv);
+
+	void *priv;		/* for board-specific data */
+	struct net_device *dev;
+
+	void __iomem *reg_base;	 /* ioremap'ed address to registers */
+	unsigned long irq_flags; /* for request_irq() */
+	spinlock_t cmdreg_lock;  /* lock for concurrent cmd register writes */
+
+	u16 flags;		/* custom mode flags */
+	u8 ocr;			/* output control register */
+	u8 cdr;			/* clock divider register */
+};
+#define PITA_ICR                0x00    /* Interrupt control register */
+
+
 /*
  * We wait here for a poller to finish.
  *
@@ -189,6 +221,7 @@
 {
 	struct irqaction *action;
 	unsigned long flags;
+	struct net_device *netdev;
 
 	if (bad_action_ret(action_ret)) {
 		printk(KERN_ERR "irq event %d: bogus return value %x\n",
@@ -209,11 +242,23 @@
 	raw_spin_lock_irqsave(&desc->lock, flags);
 	action = desc->action;
 	while (action) {
+
+		struct sja1000_privx *priv;
+		struct peak_pci_chanx *chan;
+
 		printk(KERN_ERR "[<%p>] %pf", action->handler, action->handler);
 		if (action->thread_fn)
 			printk(KERN_CONT " threaded [<%p>] %pf",
 					action->thread_fn, action->thread_fn);
+
+		netdev = (struct net_device *) action->dev_id;
+		priv = netdev_priv(netdev);
+		chan = priv->priv;
+		printk(KERN_CONT "device %s PITA 0x%04X", netdev->name,
+		       readw(chan->cfg_base + PITA_ICR));
+
 		printk(KERN_CONT "\n");
+
 		action = action->next;
 	}
 	raw_spin_unlock_irqrestore(&desc->lock, flags);


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-12 17:38                                                                   ` Oliver Hartkopp
@ 2013-12-12 22:56                                                                     ` Wolfgang Grandegger
  2013-12-13  0:07                                                                       ` Austin Schuh
  2013-12-13  9:38                                                                       ` Oliver Hartkopp
  0 siblings, 2 replies; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-12-12 22:56 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Austin Schuh, Pavel Pisa, linux-can

Hi Oliver,

I started to analyse Austin's function trace...

On 12/12/2013 06:38 PM, Oliver Hartkopp wrote:
> On 12.12.2013 07:13, Oliver Hartkopp wrote:
>> On 11.12.2013 20:27, Wolfgang Grandegger wrote:
> 
>>> BTW, does the problem show up if only one can is active? Austin tests
>>> did just RX on can1, but can0 was obviously up (but no traffic).
>>
>> There was a case where the IRQ line from can9, can10, can11, can12 failed with
>> an error message pointing to can10. All four interfaces were up but only
>> can9 had traffic at that time.
> 
> Hi Wolfgang,
> 
> here's my latest investigation result.
> 
> The setup still has (only) can9 with traffic and after a modification of
> linux/kernel/irq/spurious.c (patch below) I got this:
> 
> [ 1117.957651] irq 17: nobody cared (try booting with the "irqpoll" option)
> [ 1117.959910] CPU: 0 PID: 3498 Comm: irq/17-can9 Not tainted 3.10.11-rt7-can #6
> [ 1117.959913] Hardware name: xxxxxx
> [ 1117.959916]  00000000 c1089114 f4e44840 00000001 00000011 c1089490 ee84e780 f4e44840
> [ 1117.959924]  ee84e780 ed876fa0 c1087cf3 c10884a9 ee84e7a0 ed876fa0 16edf7d9 00000000
> [ 1117.959932]  00000000 00000000 00000000 c108857f eea53de4 ee84e780 c1088416 ef0a1f90
> [ 1117.959939] Call Trace:
> [ 1117.959949]  [<c1089114>] ? __report_bad_irq+0x18/0xbe
> [ 1117.959953]  [<c1089490>] ? note_interrupt+0x118/0x194
> [ 1117.959957]  [<c1087cf3>] ? irq_thread_fn+0x21/0x21
> [ 1117.959960]  [<c10884a9>] ? irq_thread+0x93/0x169
> [ 1117.959964]  [<c108857f>] ? irq_thread+0x169/0x169
> [ 1117.959968]  [<c1088416>] ? wake_threads_waitq+0x31/0x31
> [ 1117.959973]  [<c104a79e>] ? kthread+0x68/0x6d
> [ 1117.959979]  [<c13143b7>] ? ret_from_kernel_thread+0x1b/0x28
> [ 1117.959982]  [<c104a736>] ? __kthread_parkme+0x50/0x50
> [ 1117.959986] handlers:
> [ 1117.962184] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can8 PITA 0x0001
> [ 1117.962190] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can10 PITA 0x0001
> [ 1117.962196] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can11 PITA 0x0001
> [ 1117.962201] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can9 PITA 0x0001
> [ 1117.962202] Disabling IRQ #17

...ending here as well.

> The value in the PITA register is 0x0001 which is from can9 according to
> peak_pci_icr_masks[] (2nd element).
> 
> So obviously the flag was set and not consumed correctly ...

Well, it could also have been set after the bug above triggered. If I look at
Austin's trace, I see at the end (with "grep note_interrupt"):

 irq/18-ata_gene-219   [003] ....... 360026.315878: note_interrupt <-irq_thread
     irq/18-can1-1890  [001] ....... 360026.315984: note_interrupt <-irq_thread
     irq/18-can0-1863  [002] ....... 360026.316122: note_interrupt <-irq_thread
 irq/18-ata_gene-219   [003] ....... 360026.316140: note_interrupt <-irq_thread
     irq/18-can1-1890  [001] ....... 360026.316184: note_interrupt <-irq_thread
     irq/18-can0-1863  [002] ....... 360026.316358: note_interrupt <-irq_thread
 irq/18-ata_gene-219   [003] ....... 360026.316361: note_interrupt <-irq_thread
 irq/18-ata_gene-219   [003] ....... 360026.316361: __report_bad_irq <-note_interrupt
     irq/18-can1-1890  [001] ....... 360026.316437: note_interrupt <-irq_thread
     irq/18-can0-1863  [002] ....... 360026.316608: note_interrupt <-irq_thread
     irq/18-can1-1890  [001] ....... 360026.316714: note_interrupt <-irq_thread
 irq/18-ata_gene-219   [003] d...1.. 360026.317050: _raw_spin_unlock_irqrestore <-note_interrupt
 irq/18-ata_gene-219   [003] ....... 360026.317051: printk <-note_interrupt
 irq/18-ata_gene-219   [003] ....... 360026.317541: irq_disable <-note_interrupt
 irq/18-ata_gene-219   [003] ....... 360026.317541: mod_timer <-note_interrupt

In Austin's case can1 is doing RX and can0 is up but inactive. The last activity
in "sja1000_interrupt" is:

     irq/18-can1-1890  [001] .....11 360026.316619: sja1000_interrupt <-irq_forced_thread_fn
     irq/18-can1-1890  [001] .....11 360026.316619: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316621: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316623: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316626: alloc_can_skb <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316631: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316633: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316637: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316640: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316643: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316645: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316648: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316651: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316654: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316656: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316659: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316662: sja1000_write_cmdreg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....12 360026.316666: migrate_enable <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316667: netif_rx <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316671: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316673: peak_pci_read_reg <-sja1000_interrupt
     irq/18-can1-1890  [001] .....11 360026.316675: peak_pci_post_irq <-sja1000_interrupt

Which ends with "peak_pci_post_irq" before the interrupt gets disabled.
Pretty normal! I'm puzzled why the irqs_unhandled counter of the
interrupt reaches 99900 here:

  http://lxr.free-electrons.com/source/kernel/irq/spurious.c#L30

My impression is that the problem is with counting "irqs_unhandled" and "irqs_count",
which might not be done atomically. Actually three threads call "note_interrupt".
Does that make sense? Hope to find some time tomorrow to use atomic_set and friends
to handle these counters.
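
What a non-atomic counter update can lose, and what an atomic one preserves, can be shown with a deterministic userspace replay. This is only a sketch of the hypothesis, using C11 atomics as a stand-in for the kernel's atomic_t helpers such as atomic_inc(). Both function names are hypothetical, and whether this kind of race explains the 99900 figure is not established here.

```c
#include <stdatomic.h>

/*
 * Replay of a lost update: two contexts load the same old value,
 * then each stores old + 1, so one increment disappears.
 */
static unsigned int racy_two_increments(unsigned int start)
{
	unsigned int counter = start;
	unsigned int seen_a = counter;	/* context A loads */
	unsigned int seen_b = counter;	/* context B loads before A stores */

	counter = seen_a + 1;		/* A stores */
	counter = seen_b + 1;		/* B overwrites A's update */
	return counter;
}

/* The same two increments as indivisible read-modify-write operations. */
static unsigned int atomic_two_increments(unsigned int start)
{
	atomic_uint counter;

	atomic_init(&counter, start);
	atomic_fetch_add(&counter, 1);
	atomic_fetch_add(&counter, 1);
	return atomic_load(&counter);
}
```

Starting from 10, the replayed race ends at 11 while the atomic variant ends at 12. With three threads updating irqs_unhandled and irq_count without a common lock, the two counters could in principle drift relative to each other in a similar way.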

Wolfgang.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-12 22:56                                                                     ` Wolfgang Grandegger
@ 2013-12-13  0:07                                                                       ` Austin Schuh
  2013-12-13 16:16                                                                         ` Oliver Hartkopp
  2013-12-13  9:38                                                                       ` Oliver Hartkopp
  1 sibling, 1 reply; 66+ messages in thread
From: Austin Schuh @ 2013-12-13  0:07 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Oliver Hartkopp, Pavel Pisa, linux-can

I have tried to look around to see if I can find documentation on what
PITA means in this situation so I can be more helpful.  I'm not
finding anything.

I don't see any locking in note_interrupt or around it from the
calling function (irq_thread, if I'm not mistaken).  Good catch, and
that looks wrong for sure.

Austin

On Thu, Dec 12, 2013 at 2:56 PM, Wolfgang Grandegger <wg@grandegger.com> wrote:
> Hi Oliver,
>
> I started to analyse Austin's function trace...
>
> On 12/12/2013 06:38 PM, Oliver Hartkopp wrote:
>> On 12.12.2013 07:13, Oliver Hartkopp wrote:
>>> On 11.12.2013 20:27, Wolfgang Grandegger wrote:
>>
>>>> BTW, does the problem show up if only one can is active? Austin tests
>>>> did just RX on can1, but can0 was obviously up (but no traffic).
>>>
>>> There was a case where the IRQ line from can9, can10, can11, can12 failed with
>>> an error message pointing to can10. All four interfaces were up but only
>>> can9 had traffic at that time.
>>
>> Hi Wolfgang,
>>
>> here's my latest investigation result.
>>
>> The setup still has (only) can9 with traffic and after a modification of
>> linux/kernel/irq/spurious.c (patch below) I got this:
>>
>> [ 1117.957651] irq 17: nobody cared (try booting with the "irqpoll" option)
>> [ 1117.959910] CPU: 0 PID: 3498 Comm: irq/17-can9 Not tainted 3.10.11-rt7-can #6
>> [ 1117.959913] Hardware name: xxxxxx
>> [ 1117.959916]  00000000 c1089114 f4e44840 00000001 00000011 c1089490 ee84e780 f4e44840
>> [ 1117.959924]  ee84e780 ed876fa0 c1087cf3 c10884a9 ee84e7a0 ed876fa0 16edf7d9 00000000
>> [ 1117.959932]  00000000 00000000 00000000 c108857f eea53de4 ee84e780 c1088416 ef0a1f90
>> [ 1117.959939] Call Trace:
>> [ 1117.959949]  [<c1089114>] ? __report_bad_irq+0x18/0xbe
>> [ 1117.959953]  [<c1089490>] ? note_interrupt+0x118/0x194
>> [ 1117.959957]  [<c1087cf3>] ? irq_thread_fn+0x21/0x21
>> [ 1117.959960]  [<c10884a9>] ? irq_thread+0x93/0x169
>> [ 1117.959964]  [<c108857f>] ? irq_thread+0x169/0x169
>> [ 1117.959968]  [<c1088416>] ? wake_threads_waitq+0x31/0x31
>> [ 1117.959973]  [<c104a79e>] ? kthread+0x68/0x6d
>> [ 1117.959979]  [<c13143b7>] ? ret_from_kernel_thread+0x1b/0x28
>> [ 1117.959982]  [<c104a736>] ? __kthread_parkme+0x50/0x50
>> [ 1117.959986] handlers:
>> [ 1117.962184] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can8 PITA 0x0001
>> [ 1117.962190] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can10 PITA 0x0001
>> [ 1117.962196] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can11 PITA 0x0001
>> [ 1117.962201] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can9 PITA 0x0001
>> [ 1117.962202] Disabling IRQ #17
>
> ...ending here as well.
>
>> The value in the PITA register is 0x0001 which is from can9 according to
>> peak_pci_icr_masks[] (2nd element).
>>
>> So obviously the flag was set and not consumed correctly ...
>
> Well, it could also be set after the bug above triggered. If I look to
> Austins trace, I see at the end (with "grep note_interrupt"):
>
>  irq/18-ata_gene-219   [003] ....... 360026.315878: note_interrupt <-irq_thread
>      irq/18-can1-1890  [001] ....... 360026.315984: note_interrupt <-irq_thread
>      irq/18-can0-1863  [002] ....... 360026.316122: note_interrupt <-irq_thread
>  irq/18-ata_gene-219   [003] ....... 360026.316140: note_interrupt <-irq_thread
>      irq/18-can1-1890  [001] ....... 360026.316184: note_interrupt <-irq_thread
>      irq/18-can0-1863  [002] ....... 360026.316358: note_interrupt <-irq_thread
>  irq/18-ata_gene-219   [003] ....... 360026.316361: note_interrupt <-irq_thread
>  irq/18-ata_gene-219   [003] ....... 360026.316361: __report_bad_irq <-note_interrupt
>      irq/18-can1-1890  [001] ....... 360026.316437: note_interrupt <-irq_thread
>      irq/18-can0-1863  [002] ....... 360026.316608: note_interrupt <-irq_thread
>      irq/18-can1-1890  [001] ....... 360026.316714: note_interrupt <-irq_thread
>  irq/18-ata_gene-219   [003] d...1.. 360026.317050: _raw_spin_unlock_irqrestore <-note_interrupt
>  irq/18-ata_gene-219   [003] ....... 360026.317051: printk <-note_interrupt
>  irq/18-ata_gene-219   [003] ....... 360026.317541: irq_disable <-note_interrupt
>  irq/18-ata_gene-219   [003] ....... 360026.317541: mod_timer <-note_interrupt
>
> In Austin's case can1 is doing RX and can0 is up but inactive. The last activity
> in "sja1000_interrupt" is:
>
>      irq/18-can1-1890  [001] .....11 360026.316619: sja1000_interrupt <-irq_forced_thread_fn
>      irq/18-can1-1890  [001] .....11 360026.316619: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316621: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316623: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316626: alloc_can_skb <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316631: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316633: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316637: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316640: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316643: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316645: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316648: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316651: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316654: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316656: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316659: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316662: sja1000_write_cmdreg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....12 360026.316666: migrate_enable <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316667: netif_rx <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316671: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316673: peak_pci_read_reg <-sja1000_interrupt
>      irq/18-can1-1890  [001] .....11 360026.316675: peak_pci_post_irq <-sja1000_interrupt
>
> Which ends with "peak_pci_post_irq" before the interrupt gets disabled.
> Pretty normal! I'm puzzled why the irqs_unhandled counter of the
> interrupt reaches 99900 here:
>
>   http://lxr.free-electrons.com/source/kernel/irq/spurious.c#L30
>
> My impression is that the problem is with counting "irqs_unhandled" and "irqs_count",
> which might not be done atomically. Actually three threads call "note_interrupt".
> Does that make sense? Hope to find some time tomorrow to use atomic_set and friends
> to handle these counters.
>
> Wolfgang.
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-12 22:56                                                                     ` Wolfgang Grandegger
  2013-12-13  0:07                                                                       ` Austin Schuh
@ 2013-12-13  9:38                                                                       ` Oliver Hartkopp
  2013-12-13 10:04                                                                         ` Wolfgang Grandegger
  2013-12-13 10:07                                                                         ` Marc Kleine-Budde
  1 sibling, 2 replies; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-13  9:38 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Austin Schuh, Pavel Pisa, linux-can

On 12.12.2013 23:56, Wolfgang Grandegger wrote:

> My impression is that the problem is with counting "irqs_unhandled" and "irqs_count",
> which might not be done atomically. Actually three threads call "note_interrupt".
> Does that make sense? Hope to find some time tomorrow to use atomic_set and friends
> to handle these counters.
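
For reference, the kind of lost update suspected above can be replayed by hand. This is only an illustration of the hypothesis, not a confirmed diagnosis: the function names are made up, and the "threads" are simulated by writing out one unlucky ordering of their loads and stores.

```c
#include <assert.h>

/* Hypothetical replay of one unlucky interleaving of two threads doing
 * "counter++" without atomics or a lock. Each ++ is a load, an add and
 * a store; here both loads happen before either store. */
static unsigned int replay_lost_update(unsigned int counter)
{
	unsigned int seen_by_a = counter;   /* thread A loads */
	unsigned int seen_by_b = counter;   /* thread B loads the same value */

	counter = seen_by_a + 1;            /* thread A stores */
	counter = seen_by_b + 1;            /* thread B overwrites A's update */
	return counter;
}

/* The same two increments when they cannot overlap. */
static unsigned int replay_serialized(unsigned int counter)
{
	counter++;   /* thread A, complete */
	counter++;   /* thread B, complete */
	return counter;
}
```

Starting from 5, the first function ends at 6 and the second at 7, so one of the two increments is lost in the overlapping case. Atomic operations or a lock around the update rule out the first ordering.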

To hopefully complete the picture, here are some more traces from yesterday evening:

[ 1117.959986] handlers:
[ 1117.962184] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can8 PITA 0x0001
[ 1117.962190] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can10 PITA 0x0001
[ 1117.962196] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can11 PITA 0x0001
[ 1117.962201] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can9 PITA 0x0001
[ 1117.962202] Disabling IRQ #17

(..)

[ 5995.979307] handlers:
[ 5995.979337] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can0 PITA 0x0042
[ 5995.979342] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can1 PITA 0x0042
[ 5995.979346] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can2 PITA 0x0042
[ 5995.979350] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can3 PITA 0x0042
[ 5995.979354] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can16 PITA 0x0000
[ 5995.979358] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can17 PITA 0x0000
[ 5995.979362] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can18 PITA 0x0000
[ 5995.979365] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can19 PITA 0x0000
[ 5995.979366] Disabling IRQ #19

(..)

[ 7527.712564] handlers:
[ 7527.712606] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can12 PITA 0x0000
[ 7527.712612] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can13 PITA 0x0000
[ 7527.712617] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can14 PITA 0x0000
[ 7527.712623] [<c1087bdb>] irq_default_primary_handler threaded [<f86b169b>] sja1000_interrupt [sja1000]device can15 PITA 0x0000
[ 7527.712624] Disabling IRQ #18

/proc/interrupts:

 16:          8          9          8          8   IO-APIC-fasteoi   ehci_hcd:usb1, ahci, can4, can5, can6, can7
 17:    1838572    1843233    1845868    1838175   IO-APIC-fasteoi   can8, can10, can11, can9
 18:   12665112   12624875   12641515   12637319   IO-APIC-fasteoi   can12, can13, can14, can15
 19:   10787522   10822954   10803457   10815440   IO-APIC-fasteoi   can0, can1, can2, can3, can16, can17, can18, can19

So after some time all CAN related interrupts have been disabled ...
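
The "Disabling IRQ" messages come from the accounting in kernel/irq/spurious.c that was linked earlier in the thread. A much simplified, single-threaded model of that accounting might look like the sketch below. It assumes the 100000 / 99900 thresholds from that file, leaves out the time-based reset of irqs_unhandled, and uses made-up names (irq_model, model_note_interrupt, model_run).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of the per-IRQ counters; not the kernel's struct. */
struct irq_model {
	unsigned int irq_count;       /* interrupts seen in this window */
	unsigned int irqs_unhandled;  /* of those, reported as not handled */
	bool disabled;                /* set once the IRQ is given up on */
};

/* handled == false stands for a handler result of IRQ_NONE. */
static void model_note_interrupt(struct irq_model *m, bool handled)
{
	assert(m != NULL);
	if (m->disabled)
		return;
	if (!handled)
		m->irqs_unhandled++;
	m->irq_count++;
	if (m->irq_count < 100000)
		return;
	/* Window complete: evaluate it and start a new one. */
	m->irq_count = 0;
	if (m->irqs_unhandled > 99900)
		m->disabled = true;   /* the kernel logs "Disabling IRQ #n" here */
	m->irqs_unhandled = 0;
}

/* Feed n interrupts; every period-th one is handled (0 means none). */
static bool model_run(unsigned int n, unsigned int period)
{
	struct irq_model m = { 0, 0, false };

	for (unsigned int i = 1; i <= n; i++)
		model_note_interrupt(&m, period != 0 && i % period == 0);
	return m.disabled;
}
```

In this model a window of 100000 interrupts with none handled disables the line, while handling only every 100th interrupt (99000 unhandled) keeps it enabled. That is why an irqs_unhandled value above 99900 looks so odd when the traces show the handlers doing real work.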

I wondered if the PITA access for consuming the bit is really working.
Therefore I made the if-statement a while statement here:

--- linux-source-3.10/drivers/net/can/sja1000/peak_pci.c	2013-09-08 07:10:14.000000000 +0200
+++ linux-source-3.10-rt/drivers/net/can/sja1000/peak_pci.c	2013-12-13 08:42:15.850192329 +0100
@@ -539,12 +539,17 @@
 static void peak_pci_post_irq(const struct sja1000_priv *priv)
 {
 	struct peak_pci_chan *chan = priv->priv;
+#if 0
 	u16 icr;
 
 	/* Select and clear in PITA stored interrupt */
 	icr = readw(chan->cfg_base + PITA_ICR);
 	if (icr & chan->icr_mask)
 		writew(chan->icr_mask, chan->cfg_base + PITA_ICR);
+#else
+	while (readw(chan->cfg_base + PITA_ICR) & chan->icr_mask)
+		writew(chan->icr_mask, chan->cfg_base + PITA_ICR);
+#endif
 }
 
 static int peak_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)


This should usually not have any effect, right?
But what happened was a big crash after a pretty short time:

[  760.718091] INFO: rcu_preempt self-detected stall on CPU { 1}  (t=84015 jiffies g=482 c=481 q=2688)
[  760.718092] sending NMI to all CPUs:
[  760.718094] NMI backtrace for cpu 1
[  760.718098] CPU: 1 PID: 3629 Comm: irq/17-can9 Not tainted 3.10.11-rt7-can #6
[  760.718099] Hardware name: xxxxxx
[  760.718100] task: edcdb4e0 ti: ee942000 task.ti: ee942000
[  760.718101] EIP: 0060:[<c118b3a3>] EFLAGS: 00000006 CPU: 1
[  760.718106] EIP is at __const_udelay+0x7/0x17
[  760.718107] EAX: 00418958 EBX: 00002710 ECX: c13ca099 EDX: 009aa184
[  760.718108] ESI: f51bb954 EDI: 00000a80 EBP: ee943ec0 ESP: ee943dbc
[  760.718109]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[  760.718111] CR0: 8005003b CR2: 0812a748 CR3: 0156b000 CR4: 000007f0
[  760.718112] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[  760.718112] DR6: ffff0ff0 DR7: 00000400
[  760.718113] Stack:
[  760.718117]  c10235ec c146a580 c108dd53 c13d58b6 0001482f 000001e2 000001e1 00000a80
[  760.718120]  00000007 c1037f67 c10382b5 00000001 c146a580 00000001 edcdb4e0 00000000
[  760.718123]  00000001 ee943ec0 c103d5a1 f51bb7f4 ee943ec0 000000b1 c106c73b f51bb7f4
[  760.718124] Call Trace:
[  760.718129]  [<c10235ec>] ? arch_trigger_all_cpu_backtrace+0x57/0x5f
[  760.718132]  [<c108dd53>] ? rcu_check_callbacks+0x17e/0x470
[  760.718135]  [<c1037f67>] ? raise_softirq_irqoff+0x5/0x2a
[  760.718137]  [<c10382b5>] ? raise_softirq+0x17/0x20
[  760.718140]  [<c103d5a1>] ? update_process_times+0x2f/0x39
[  760.718142]  [<c106c73b>] ? tick_sched_handle+0x37/0x43
[  760.718144]  [<c106c91e>] ? tick_sched_timer+0x28/0x4b
[  760.718145]  [<c106c8f6>] ? tick_sched_do_timer+0x2f/0x2f
[  760.718149]  [<c104d3a5>] ? __run_hrtimer+0x8e/0x12e
[  760.718151]  [<c104dc59>] ? hrtimer_interrupt+0x1a8/0x305
[  760.718164]  [<c1022b3a>] ? smp_apic_timer_interrupt+0x55/0x64
[  760.718167]  [<c1310b7c>] ? apic_timer_interrupt+0x34/0x3c
[  760.718171]  [<f8370001>] ? usb_otg_state_string+0x1/0x13 [usb_common]
[  760.718177]  [<c126007b>] ? skb_copy_datagram_const_iovec+0xf/0x196
[  760.718180]  [<f8709078>] ? peak_pci_post_irq+0x12/0x1b [peak_pci]
[  760.718183]  [<f88e7adf>] ? sja1000_interrupt+0x444/0x456 [sja1000]
[  760.718187]  [<c121f27b>] ? add_interrupt_randomness+0x34/0x131
[  760.718191]  [<c1087cf3>] ? irq_thread_fn+0x21/0x21
[  760.718193]  [<c1087d08>] ? irq_forced_thread_fn+0x15/0x38
[  760.718195]  [<c1088494>] ? irq_thread+0x7e/0x169
[  760.718197]  [<c108857f>] ? irq_thread+0x169/0x169
[  760.718198]  [<c1088416>] ? wake_threads_waitq+0x31/0x31
[  760.718200]  [<c104a79e>] ? kthread+0x68/0x6d
[  760.718203]  [<c13143b7>] ? ret_from_kernel_thread+0x1b/0x28
[  760.718205]  [<c104a736>] ? __kthread_parkme+0x50/0x50
[  760.718221] Code: 00 8d bc 27 00 00 00 00 eb 0e 8d b4 26 00 00 00 00 8d bc 27 00 00 00 00 48 75 fd 48 c3 ff 15 94 17 48 c1 c3 64 8b 15 dc 50 56 c1 <6b> d2 3e c1 e0 02 f7 e2 8d 42 01 e9 e2 ff ff ff 69 c0 c7 10 00
[  760.718223] NMI backtrace for cpu 0
[  760.718225] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 3.10.11-rt7-can #6
[  760.718226] Hardware name: xxxxxx
[  760.718227] task: f4c70bc0 ti: f4c7e000 task.ti: f4c7e000
[  760.718228] EIP: 0060:[<c131063c>] EFLAGS: 00000002 CPU: 0
[  760.718231] EIP is at _raw_spin_unlock_irq+0x3/0x43
[  760.718232] EAX: f51b2640 EBX: ee34de00 ECX: f515a000 EDX: f4c70bc0
[  760.718233] ESI: f51b2640 EDI: 00000000 EBP: 00000000 ESP: f4c7fec0
[  760.718234]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[  760.718235] CR0: 8005003b CR2: b7721484 CR3: 2ea77000 CR4: 000007f0
[  760.718237] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[  760.718237] DR6: ffff0ff0 DR7: 00000400
[  760.718238] Stack:
[  760.718241]  c1052acd 6b2cb6d6 ef2a3c00 f51b2640 f4c70bc0 c130fa2f ee34de00 000000b1
[  760.718244]  c1565640 1e17e089 000000b1 c1565640 c103461e f4c70bc0 c10546d3 f515d488
[  760.718248]  f4c70bc0 c151043c f4c7ff3c c1037c06 00000000 00000004 f4c70bc0 f4c70bc0
[  760.718248] Call Trace:
[  760.718252]  [<c1052acd>] ? finish_task_switch+0x38/0x9d
[  760.718254]  [<c130fa2f>] ? __schedule+0x385/0x41e
[  760.718257]  [<c103461e>] ? unpin_current_cpu+0xb/0x45
[  760.718259]  [<c10546d3>] ? migrate_enable+0x18f/0x19c
[  760.718261]  [<c1037c06>] ? do_current_softirqs+0x209/0x26e
[  760.718263]  [<c108e2f0>] ? rcu_note_context_switch+0x13b/0x14c
[  760.718265]  [<c130fb84>] ? schedule+0x5e/0x6e
[  760.718267]  [<c10505a6>] ? smpboot_thread_fn+0x233/0x2a5
[  760.718269]  [<c130fb84>] ? schedule+0x5e/0x6e
[  760.718271]  [<c1050373>] ? test_ti_thread_flag+0x7/0x7
[  760.718272]  [<c104a79e>] ? kthread+0x68/0x6d
[  760.718275]  [<c13143b7>] ? ret_from_kernel_thread+0x1b/0x28
[  760.718277]  [<c104a736>] ? __kthread_parkme+0x50/0x50
[  760.718293] Code: 85 c0 74 07 e8 ae f4 ff ff eb 15 89 e0 ba 09 00 00 00 25 00 e0 ff ff e8 6e 0c d6 ff 85 c0 75 e4 8b 04 24 e9 c1 76 d2 ff 80 00 01 <fb> 66 66 90 66 90 b8 01 00 00 00 e8 fe 27 00 00 89 e0 ba 03 00
[  760.718294] NMI backtrace for cpu 2
[  760.718296] CPU: 2 PID: 3195 Comm: irq/18-can12 Not tainted 3.10.11-rt7-can #6
[  760.718297] Hardware name: xxxxxx
[  760.718298] task: ee2bb4e0 ti: edd34000 task.ti: edd34000
[  760.718299] EIP: 0060:[<f8709078>] EFLAGS: 00000202 CPU: 2
[  760.718305] EIP is at peak_pci_post_irq+0x12/0x1b [peak_pci]
[  760.718306] EAX: ef66ee1c EBX: ef66e800 ECX: f8590003 EDX: f859a000
[  760.718307] ESI: 00000385 EDI: f864a01c EBP: ef66ed40 ESP: edd35efc
[  760.718308]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[  760.718309] CR0: 8005003b CR2: b7749000 CR3: 2eae3000 CR4: 000007f0
[  760.718310] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[  760.718311] DR6: ffff0ff0 DR7: 00000400
[  760.718312] Stack:
[  760.718315]  f88e7adf c121f27b ee2bb413 000344d2 edd83748 00000001 ef08d200 edd83748
[  760.718319]  eea248c0 f4e44900 ee2bb4e0 c1087cf3 c1087d08 f4e44900 eea248c0 ee2bb4e0
[  760.718322]  c1088494 eea248e0 ee2bb4e0 2f86ad9c 00000000 00000000 00000000 00000000
[  760.718323] Call Trace:
[  760.718326]  [<f88e7adf>] ? sja1000_interrupt+0x444/0x456 [sja1000]
[  760.718329]  [<c121f27b>] ? add_interrupt_randomness+0x34/0x131
[  760.718332]  [<c1087cf3>] ? irq_thread_fn+0x21/0x21
[  760.718334]  [<c1087d08>] ? irq_forced_thread_fn+0x15/0x38
[  760.718336]  [<c1088494>] ? irq_thread+0x7e/0x169
[  760.718338]  [<c108857f>] ? irq_thread+0x169/0x169
[  760.718340]  [<c1088416>] ? wake_threads_waitq+0x31/0x31
[  760.718342]  [<c104a79e>] ? kthread+0x68/0x6d
[  760.718344]  [<c13143b7>] ? ret_from_kernel_thread+0x1b/0x28
[  760.718346]  [<c104a736>] ? __kthread_parkme+0x50/0x50
[  760.718364] Code: c3 8b 80 b8 00 00 00 8d 04 90 8a 00 c3 8b 80 b8 00 00 00 8d 04 90 88 08 c3 8b 80 b0 00 00 00 eb 05 8b 08 66 89 11 8b 10 66 8b 0a <8b> 50 08 66 85 d1 75 ee c3 57 56 89 ce 53 89 c3 83 ec 14 31 c0
[  760.718365] NMI backtrace for cpu 3
[  760.718367] CPU: 3 PID: 2867 Comm: irq/19-can2 Not tainted 3.10.11-rt7-can #6
[  760.718368] Hardware name: xxxxxx
[  760.718369] task: edcdc0a0 ti: edd14000 task.ti: edd14000
[  760.718370] EIP: 0060:[<f8709078>] EFLAGS: 00000202 CPU: 3
[  760.718372] EIP is at peak_pci_post_irq+0x12/0x1b [peak_pci]
[  760.718373] EAX: ef66c61c EBX: ef66c000 ECX: f82700c3 EDX: f8270000
[  760.718375] ESI: 00000127 EDI: f827881c EBP: ef66c540 ESP: edd15efc
[  760.718376]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[  760.718377] CR0: 8005003b CR2: b76fc000 CR3: 2eae3000 CR4: 000007f0
[  760.718378] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[  760.718379] DR6: ffff0ff0 DR7: 00000400
[  760.718379] Stack:
[  760.718382]  f88e7adf c121f27b edcdc013 000344d2 edcecdc8 00000001 ef127200 edcecdc8
[  760.718386]  edc229c0 f4e449c0 edcdc0a0 c1087cf3 c1087d08 f4e449c0 edc229c0 edcdc0a0
[  760.718389]  c1088494 edc229e0 edcdc0a0 15b2b9b8 00000000 00000000 00000000 00000000
[  760.718389] Call Trace:
[  760.718392]  [<f88e7adf>] ? sja1000_interrupt+0x444/0x456 [sja1000]
[  760.718394]  [<c121f27b>] ? add_interrupt_randomness+0x34/0x131
[  760.718397]  [<c1087cf3>] ? irq_thread_fn+0x21/0x21
[  760.718398]  [<c1087d08>] ? irq_forced_thread_fn+0x15/0x38
[  760.718400]  [<c1088494>] ? irq_thread+0x7e/0x169
[  760.718402]  [<c108857f>] ? irq_thread+0x169/0x169
[  760.718404]  [<c1088416>] ? wake_threads_waitq+0x31/0x31
[  760.718405]  [<c104a79e>] ? kthread+0x68/0x6d
[  760.718408]  [<c13143b7>] ? ret_from_kernel_thread+0x1b/0x28
[  760.718409]  [<c104a736>] ? __kthread_parkme+0x50/0x50
[  760.718426] Code: c3 8b 80 b8 00 00 00 8d 04 90 8a 00 c3 8b 80 b8 00 00 00 8d 04 90 88 08 c3 8b 80 b0 00 00 00 eb 05 8b 08 66 89 11 8b 10 66 8b 0a <8b> 50 08 66 85 d1 75 ee c3 57 56 89 ce 53 89 c3 83 ec 14 31 c0

No idea ...

Regards,
Oliver


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-13  9:38                                                                       ` Oliver Hartkopp
@ 2013-12-13 10:04                                                                         ` Wolfgang Grandegger
  2013-12-13 10:09                                                                           ` Wolfgang Grandegger
  2013-12-13 10:07                                                                         ` Marc Kleine-Budde
  1 sibling, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-12-13 10:04 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Austin Schuh, Pavel Pisa, linux-can

On Fri, 13 Dec 2013 10:38:49 +0100, Oliver Hartkopp
<socketcan@hartkopp.net> wrote:
> On 12.12.2013 23:56, Wolfgang Grandegger wrote:
>
>> My impression is that the problem is with counting "irqs_unhandled" and
>> "irqs_count", which might not be done atomically. Actually three threads
>> call "note_interrupt". Does that make sense? Hope to find some time
>> tomorrow to use atomic_set and friends to handle these counters.
>
> To hopefully complete the picture some more traces from yesterday
> evening:
>
> [...]
>
> No idea ...


Try protecting note_interrupt() in irq_thread() as shown below:

raw_spin_unlock(&desc->lock);
        if (!noirqdebug) {
                note_interrupt(action->irq, desc, action_ret);

>
> Regards,
> Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-13  9:38                                                                       ` Oliver Hartkopp
  2013-12-13 10:04                                                                         ` Wolfgang Grandegger
@ 2013-12-13 10:07                                                                         ` Marc Kleine-Budde
  2013-12-13 16:22                                                                           ` Oliver Hartkopp
  1 sibling, 1 reply; 66+ messages in thread
From: Marc Kleine-Budde @ 2013-12-13 10:07 UTC (permalink / raw)
  To: Oliver Hartkopp, Wolfgang Grandegger; +Cc: Austin Schuh, Pavel Pisa, linux-can

[-- Attachment #1: Type: text/plain, Size: 1760 bytes --]

On 12/13/2013 10:38 AM, Oliver Hartkopp wrote:
[...]

> So after some time all CAN related interrupts have been disabled ...
> 
> I wondered if the PITA access for consuming the bit is really working.
> Therefore I made the if-statement a while statement here:
> 
> --- linux-source-3.10/drivers/net/can/sja1000/peak_pci.c	2013-09-08 07:10:14.000000000 +0200
> +++ linux-source-3.10-rt/drivers/net/can/sja1000/peak_pci.c	2013-12-13 08:42:15.850192329 +0100
> @@ -539,12 +539,17 @@
>  static void peak_pci_post_irq(const struct sja1000_priv *priv)
>  {
>  	struct peak_pci_chan *chan = priv->priv;
> +#if 0
>  	u16 icr;
>  
>  	/* Select and clear in PITA stored interrupt */
>  	icr = readw(chan->cfg_base + PITA_ICR);
>  	if (icr & chan->icr_mask)
>  		writew(chan->icr_mask, chan->cfg_base + PITA_ICR);
> +#else
> +	while (readw(chan->cfg_base + PITA_ICR) & chan->icr_mask)
> +		writew(chan->icr_mask, chan->cfg_base + PITA_ICR);
> +#endif
>  }
>  
>  static int peak_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> 
> 
> This should usually not have any effect, right?
> But what happened was a big crash after a pretty short time:
> 
> [  760.718091] INFO: rcu_preempt self-detected stall on CPU { 1}  (t=84015 jiffies g=482 c=481 q=2688)

To me it seems that the new while loop keeps spinning and then the RCU
subsystem detects the stall.
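
If the loop is kept for debugging, bounding it avoids the stall. Below is a user-space sketch of the idea, not driver code: fake_icr, fake_readw, fake_writew and ack_bounded are hypothetical stand-ins, and the write-one-to-clear behaviour is an assumption based on how the driver writes icr_mask back to PITA_ICR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the ICR register. Bits in 'stuck' model a
 * source that re-asserts at once, so they never read back as 0. */
struct fake_icr {
	uint16_t pending;
	uint16_t stuck;
};

static uint16_t fake_readw(const struct fake_icr *r)
{
	return r->pending | r->stuck;
}

/* Assumed write-one-to-clear semantics. */
static void fake_writew(struct fake_icr *r, uint16_t val)
{
	r->pending &= (uint16_t)~val;
}

/* Returns true if the masked bit reads back clear after at most
 * max_tries writes, false if it is still set after that. */
static bool ack_bounded(struct fake_icr *r, uint16_t mask, int max_tries)
{
	assert(max_tries > 0);
	for (int i = 0; i < max_tries; i++) {
		if (!(fake_readw(r) & mask))
			return true;
		fake_writew(r, mask);
	}
	return !(fake_readw(r) & mask);
}
```

With a bit that clears normally the function returns true after a single write. With a stuck bit it gives up after max_tries writes and returns false, which could be logged instead of spinning forever.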

Maybe it's time to get in touch with the hardware engineers at peak.

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-13 10:04                                                                         ` Wolfgang Grandegger
@ 2013-12-13 10:09                                                                           ` Wolfgang Grandegger
  2013-12-13 16:25                                                                             ` Oliver Hartkopp
  0 siblings, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-12-13 10:09 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Oliver Hartkopp, Austin Schuh, Pavel Pisa, linux-can

Hi Oliver,

my mailer was too quick...

On Fri, 13 Dec 2013 11:04:42 +0100, Wolfgang Grandegger
<wg@grandegger.com> wrote:
> On Fri, 13 Dec 2013 10:38:49 +0100, Oliver Hartkopp
> <socketcan@hartkopp.net> wrote:
...
>> No idea ...
>
> Try protecting note_interrupt() in irq_thread() as shown below:
>
> raw_spin_unlock(&desc->lock);
>         if (!noirqdebug) {
>                 note_interrupt(action->irq, desc, action_ret);
>

I mean modifying:

        if (!noirqdebug) {
               raw_spin_lock(&desc->lock);
               note_interrupt(action->irq, desc, action_ret);
               raw_spin_unlock(&desc->lock);
        }

here http://lxr.free-electrons.com/source/kernel/irq/manage.c#L865.

Maybe a normal spin_lock is already OK.

Wolfgang.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-13  0:07                                                                       ` Austin Schuh
@ 2013-12-13 16:16                                                                         ` Oliver Hartkopp
  0 siblings, 0 replies; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-13 16:16 UTC (permalink / raw)
  To: Austin Schuh; +Cc: Wolfgang Grandegger, Pavel Pisa, linux-can


On 13.12.2013 01:07, Austin Schuh wrote:
> I have tried to look around to see if I can find documentation on what
> PITA means in this situation so I can be more helpful.  I'm not
> finding anything.

The PITA is an ancient/discontinued PCI chip.

Indeed, I was not able to find it on the Infineon website either - that's
really a shame.

But you can take a look here:
http://pdf.datasheetcatalog.com/datasheet/infineon/1-pita_22p.pdf

Regards,
Oliver


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-13 10:07                                                                         ` Marc Kleine-Budde
@ 2013-12-13 16:22                                                                           ` Oliver Hartkopp
  2013-12-13 17:14                                                                             ` Oliver Hartkopp
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-13 16:22 UTC (permalink / raw)
  To: Marc Kleine-Budde
  Cc: Wolfgang Grandegger, Austin Schuh, Pavel Pisa, linux-can



On 13.12.2013 11:07, Marc Kleine-Budde wrote:
> On 12/13/2013 10:38 AM, Oliver Hartkopp wrote:


>> +#if 0
>>  	u16 icr;
>>  
>>  	/* Select and clear in PITA stored interrupt */
>>  	icr = readw(chan->cfg_base + PITA_ICR);
>>  	if (icr & chan->icr_mask)
>>  		writew(chan->icr_mask, chan->cfg_base + PITA_ICR);
>> +#else
>> +	while (readw(chan->cfg_base + PITA_ICR) & chan->icr_mask)
>> +		writew(chan->icr_mask, chan->cfg_base + PITA_ICR);
>> +#endif

>>
>> This should usually not have any effect, right?
>> But what happened was a big crash after a pretty short time:
>>
>> [  760.718091] INFO: rcu_preempt self-detected stall on CPU { 1}  (t=84015 jiffies g=482 c=481 q=2688)
> 
> For me it seems that the new while loop keeps spinning and then the rcu
> subsystem detects some stalls.
> 
> Maybe it's time to get in touch with the hardware engineers at peak.
> 

I personally do not expect this to be a PEAK problem as the usual case with
mainline Linux (non -rt) works fine.

Maybe it's time to look into other implementations than PEAK/mainline ...

If they all do it the same way I would assume it's a plain -rt irq thread issue.

Regards,
Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-13 10:09                                                                           ` Wolfgang Grandegger
@ 2013-12-13 16:25                                                                             ` Oliver Hartkopp
  2013-12-13 17:33                                                                               ` Wolfgang Grandegger
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-13 16:25 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Austin Schuh, Pavel Pisa, linux-can



On 13.12.2013 11:09, Wolfgang Grandegger wrote:

> I mean modifying:
>
>         if (!noirqdebug) {
>                raw_spin_lock(&desc->lock);
>                note_interrupt(action->irq, desc, action_ret);
>                raw_spin_unlock(&desc->lock);
>         }
>
> here http://lxr.free-electrons.com/source/kernel/irq/manage.c#L865.
>
> Maybe a normal spin_lock is already OK.
>

No change in behaviour when adding this patch.

But I did not use IRQ debugging in my machine ;-)

Regards,
Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-13 16:22                                                                           ` Oliver Hartkopp
@ 2013-12-13 17:14                                                                             ` Oliver Hartkopp
  2013-12-13 21:14                                                                               ` Oliver Hartkopp
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-13 17:14 UTC (permalink / raw)
  To: Marc Kleine-Budde
  Cc: Wolfgang Grandegger, Austin Schuh, Pavel Pisa, linux-can

Answering myself ...

> 
> Maybe it's time to look into other implementations than PEAK/mainline ...
> 

E.g. an EMS_PCI adapter has a PITA-2 too (depending on its HW revision).
There's an EMS PCI driver in mainline and (at least) can4linux.

Both access the registers with 32 bit read/write functions but the peak_pci
only writes 16 bit values?!?

Checking

	http://pdf.datasheetcatalog.com/datasheet/infineon/1-pita_22p.pdf

32 bit should be the right way to do it.
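
To make the width question concrete, here is a small sketch assuming the
register is laid out little-endian, as PCI is. It only shows which bits
a 16 bit access at the register base would see compared to a 32 bit one;
it says nothing about whether the device accepts the narrower bus cycle.
The helper names are made up for illustration:

```c
#include <stdint.h>

/* Read a 16 bit little-endian value from a byte-addressed register image. */
static uint16_t read16_le(const uint8_t *reg)
{
	return (uint16_t)(reg[0] | (reg[1] << 8));
}

/* Read the full 32 bit little-endian value from the same image. */
static uint32_t read32_le(const uint8_t *reg)
{
	return (uint32_t)reg[0] | ((uint32_t)reg[1] << 8) |
	       ((uint32_t)reg[2] << 16) | ((uint32_t)reg[3] << 24);
}

/* True if both access widths agree on the bits selected by a 16 bit mask. */
static int widths_agree(const uint8_t *reg, uint16_t mask)
{
	return (read16_le(reg) & mask) == (read32_le(reg) & mask);
}
```

So for bits in the low half-word the two widths read the same value; any
difference in behaviour would have to come from how the chip handles the
access size itself.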

Regards,
Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-13 16:25                                                                             ` Oliver Hartkopp
@ 2013-12-13 17:33                                                                               ` Wolfgang Grandegger
  0 siblings, 0 replies; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-12-13 17:33 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Austin Schuh, Pavel Pisa, linux-can

On 12/13/2013 05:25 PM, Oliver Hartkopp wrote:
> On 13.12.2013 11:09, Wolfgang Grandegger wrote:
>
>> I mean modifying:
>>
>>         if (!noirqdebug) {
>>                raw_spin_lock(&desc->lock);
>>                note_interrupt(action->irq, desc, action_ret);
>>                raw_spin_unlock(&desc->lock);
>>         }
>>
>> here http://lxr.free-electrons.com/source/kernel/irq/manage.c#L865.
>>
>> Maybe a normal spin_lock is already OK.
>>
> 
> No change in behaviour when adding this patch.
> 
> But I did not use IRQ debugging in my machine ;-)


You mean you use "noirqdebug=1" on your system. But how did you then get
the following output?

 [ 1117.962202] Disabling IRQ #17

Maybe there are more issues with threaded interrupt handling.

Wolfgang.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-13 17:14                                                                             ` Oliver Hartkopp
@ 2013-12-13 21:14                                                                               ` Oliver Hartkopp
  2013-12-14  9:51                                                                                 ` Oliver Hartkopp
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-13 21:14 UTC (permalink / raw)
  To: Wolfgang Grandegger, Pavel Pisa
  Cc: Marc Kleine-Budde, Austin Schuh, linux-can

Hi all,

after some more investigation of the two PITA specifications

https://www.google.de/#q=PCI+Interface+for+Telephony+infineon

http://pdf.datasheetcatalog.com/datasheet/infineon/1-pita_22p.pdf

and

http://pdf.datasheetcatalog.com/datasheet/infineon/1-pita_12p.pdf

I'm not sure *why* the driver works at all.

It's the mix between byte and word accesses and especially the per-device
interrupt bits in the PITA_ICR (Interrupt Control Register).

The interrupt bits in this register GP[0123]_INT are located in the bits
2,3,4,5 in the ICR (1-pita_12p.pdf, p. 191 / 1-pita_22p.pdf p.202)

If my interpretation is correct the

	static const u16 peak_pci_icr_masks[PEAK_PCI_CHAN_MAX] = {
        	0x02, 0x01, 0x40, 0x80
	};

would be completely bogus and would not hit the right bit in

static void peak_pci_post_irq(const struct sja1000_priv *priv)
{
        struct peak_pci_chan *chan = priv->priv;
        u16 icr;

        /* Select and clear in PITA stored interrupt */
        icr = readw(chan->cfg_base + PITA_ICR);
        if (icr & chan->icr_mask)
                writew(chan->icr_mask, chan->cfg_base + PITA_ICR);
}

at all.

Am I wrong? Or is this the wrong specification?
The code in ems_pci.c seems to fit to this PITA spec ...
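
The mismatch described above can be checked mechanically. This sketch
uses only the numbers quoted in this mail (the four driver masks and
GP[0123]_INT at bits 2..5); the function names are hypothetical:

```c
#include <stdint.h>

#define CHAN_MAX 4

/* Masks quoted above from peak_pci.c. */
static const uint16_t icr_masks[CHAN_MAX] = { 0x02, 0x01, 0x40, 0x80 };

/* GP[0..3]_INT at bits 2..5, as read from the data sheets above. */
static uint16_t gp_int_bits(void)
{
	uint16_t bits = 0;

	for (int i = 0; i < CHAN_MAX; i++)
		bits |= (uint16_t)(1u << (2 + i));
	return bits;
}

/* Number of driver masks that touch any GPx_INT bit. */
static int masks_hitting_gp_int(void)
{
	int hits = 0;

	for (int i = 0; i < CHAN_MAX; i++)
		if (icr_masks[i] & gp_int_bits())
			hits++;
	return hits;
}
```

The GPx_INT bits come out as 0x3C, and none of the four masks overlaps
with that, which is exactly the puzzle.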

Regards,
Oliver



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-13 21:14                                                                               ` Oliver Hartkopp
@ 2013-12-14  9:51                                                                                 ` Oliver Hartkopp
  2013-12-20 23:13                                                                                   ` Austin Schuh
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-14  9:51 UTC (permalink / raw)
  To: Wolfgang Grandegger, Pavel Pisa
  Cc: Marc Kleine-Budde, Austin Schuh, linux-can

Ok. I think I got it now:

As long as the PCAN PCI adapter had only 2 channels PEAK obviously used the
PSB4600 PITA v1.2 (1-pita_12p.pdf) where there are two interrupt lines INT0
and INT1 accessible by bit 0 and bit 1 in the ICR register (see page 191).

Because the PSB4600 was discontinued, newer designs use a Lattice
4256V CPLD instead; see the detail photo at

	http://gridconnect.com/pcan/can-adapters/can-mini-pci.html

The CPLD now obviously uses the formerly reserved bits 6+7 for the channels
3+4 in a backward compatible manner. So everything with peak_pci_icr_masks[]
is fine but it's no real PITA anymore :-)
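
As an illustration of that layout, a sketch that decodes an ICR value
into pending channels using the mask table from peak_pci.c. The function
name and the returned bitmap format are assumptions made for this
example only:

```c
#include <stdint.h>

#define NUM_CHANNELS 4

/* Per-channel ICR masks: bits 1 and 0 for the first two channels,
 * the formerly reserved bits 6 and 7 for channels 3 and 4. */
static const uint16_t chan_icr_masks[NUM_CHANNELS] = {
	0x02, 0x01, 0x40, 0x80
};

/* Return a bitmap with bit n set if channel index n has its ICR bit
 * latched. */
static unsigned int pending_channels(uint16_t icr)
{
	unsigned int pending = 0;

	for (unsigned int ch = 0; ch < NUM_CHANNELS; ch++)
		if (icr & chan_icr_masks[ch])
			pending |= 1u << ch;
	return pending;
}
```

An ICR value that only has bits 2..5 set decodes to "no channel pending"
here, which fits the CPLD ignoring the original GPx_INT positions.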

Sorry for the confusion.

Looks like there's some more investigation of the -rt irq threads to do :-(

Regards,
Oliver



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-14  9:51                                                                                 ` Oliver Hartkopp
@ 2013-12-20 23:13                                                                                   ` Austin Schuh
  2013-12-21  8:29                                                                                     ` Wolfgang Grandegger
  2013-12-21 12:55                                                                                     ` Oliver Hartkopp
  0 siblings, 2 replies; 66+ messages in thread
From: Austin Schuh @ 2013-12-20 23:13 UTC (permalink / raw)
  To: Oliver Hartkopp
  Cc: Wolfgang Grandegger, Pavel Pisa, Marc Kleine-Budde, linux-can

I have applied the fix proposed in https://lkml.org/lkml/2013/3/7/222
for the note_interrupt function right now, and will run a test this
weekend to see if it fixes it for sure.  I am now consistently seeing
only 1 / 100000 of the IRQ handler calls being counted as unhandled,
which is a lot better.

I was concerned that if the handler threads were starved, I could
cause a bunch of unhandled interrupts, so I did a test.  I stressed
the system by running 4 realtime tasks (= the number of hyperthreads)
that were a higher priority than the CAN handler tasks.  I get a 'data
overrun interrupt', but the unhandled count only climbs to 3 / 100000.
 I'm no longer worried about that problem, at least with the PEAK
card.
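
For readers following along, the accounting behind those "n / 100000"
numbers can be modelled roughly as below. This is a simplified
user-space model, not the kernel's note_interrupt(): the struct and
function names are hypothetical, the 99900 limit is quoted from memory
of kernel/irq/spurious.c and should be treated as an assumption, and
the kernel's time-based reset of the unhandled counter is left out:

```c
#include <stdbool.h>

/* Simplified, hypothetical model of per-IRQ spurious accounting. */
struct irq_stats {
	unsigned int irq_count;      /* interrupts seen in this window */
	unsigned int irqs_unhandled; /* of those, how many nobody handled */
	bool disabled;
};

#define WINDOW          100000u
#define UNHANDLED_LIMIT  99900u /* assumed threshold, see text above */

/* Record one interrupt; 'handled' says whether a handler claimed it. */
static void model_note_interrupt(struct irq_stats *s, bool handled)
{
	if (!handled)
		s->irqs_unhandled++;
	s->irq_count++;
	if (s->irq_count < WINDOW)
		return;
	/* Window complete: decide, then start a new window. */
	if (s->irqs_unhandled > UNHANDLED_LIMIT)
		s->disabled = true;
	s->irq_count = 0;
	s->irqs_unhandled = 0;
}

/* Feed one full window in which the first 'unhandled' interrupts are
 * not handled. Returns whether the IRQ ended up disabled. */
static bool run_window(struct irq_stats *s, unsigned int unhandled)
{
	for (unsigned int i = 0; i < WINDOW; i++)
		model_note_interrupt(s, i >= unhandled);
	return s->disabled;
}
```

In this model a window with 3 unhandled interrupts is far away from the
limit, while a window where almost every interrupt is counted as
unhandled ends with the line being disabled.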

Oliver, does this patch fix it for you?

If the system survives the weekend, I'm going to email Thomas my results
on Monday and work on getting the fix into mainline.

Austin


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-20 23:13                                                                                   ` Austin Schuh
@ 2013-12-21  8:29                                                                                     ` Wolfgang Grandegger
  2013-12-21 13:12                                                                                       ` Oliver Hartkopp
  2013-12-21 12:55                                                                                     ` Oliver Hartkopp
  1 sibling, 1 reply; 66+ messages in thread
From: Wolfgang Grandegger @ 2013-12-21  8:29 UTC (permalink / raw)
  To: Austin Schuh, Oliver Hartkopp; +Cc: Pavel Pisa, Marc Kleine-Budde, linux-can

Hi Austin,

On 12/21/2013 12:13 AM, Austin Schuh wrote:
> I have applied the fix proposed in https://lkml.org/lkml/2013/3/7/222
> for the note_interrupt function right now, and will run a test this
> weekend to see if it fixes it for sure.  I am now consistently seeing
> only 1 / 100000 of the IRQ handler calls being counted as unhandled,
> which is a lot better.

The patch confirms our observations and I was already thinking of
reporting it on the Linux-RT mailing list (or directly to Thomas). Well,
next time I will hesitate less.

Wolfgang.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-20 23:13                                                                                   ` Austin Schuh
  2013-12-21  8:29                                                                                     ` Wolfgang Grandegger
@ 2013-12-21 12:55                                                                                     ` Oliver Hartkopp
  2013-12-23 15:58                                                                                       ` Oliver Hartkopp
  1 sibling, 1 reply; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-21 12:55 UTC (permalink / raw)
  To: Austin Schuh
  Cc: Wolfgang Grandegger, Pavel Pisa, Marc Kleine-Budde, linux-can

Hi Austin,

I was also adding some per-device counters for handled and non-handled
interrupts, and they indicated that note_interrupt() is the place where
the issue has to be solved.

But I was not able to get further - therefore many thanks for your hint!!

I'll test the patch on Monday to run the system for at least some hours before
I leave the office for XMAS ;-)

Tnx & best regards,
Oliver


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-21  8:29                                                                                     ` Wolfgang Grandegger
@ 2013-12-21 13:12                                                                                       ` Oliver Hartkopp
  0 siblings, 0 replies; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-21 13:12 UTC (permalink / raw)
  To: Wolfgang Grandegger
  Cc: Austin Schuh, Pavel Pisa, Marc Kleine-Budde, linux-can



On 21.12.2013 09:29, Wolfgang Grandegger wrote:
> Hi Austin,
> 
> On 12/21/2013 12:13 AM, Austin Schuh wrote:
>> I have applied the fix proposed in https://lkml.org/lkml/2013/3/7/222
>> for the note_interrupt function right now, and will run a test this
>> weekend to see if it fixes it for sure.  I am now consistently seeing
>> only 1 / 100000 of the IRQ handler calls being counted as unhandled,
>> which is a lot better.
> 
> The patch confirms our observations and I was already thinking of
> reporting it on the Linux-RT mailing list (or directly to Thomas). Well,
> next time I will hesitate less.
> 

:-)

I have a spurious problem with my USB on my i7 Laptop with Intel graphics when
my 8 year old mouse is attached at boot time. I'll test this patch on the
latest 3.13-rc4 too. Maybe this non-deterministic issue is fixed also.

Best regards,
Oliver


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: sja1000 interrupt problem
  2013-12-21 12:55                                                                                     ` Oliver Hartkopp
@ 2013-12-23 15:58                                                                                       ` Oliver Hartkopp
  0 siblings, 0 replies; 66+ messages in thread
From: Oliver Hartkopp @ 2013-12-23 15:58 UTC (permalink / raw)
  To: Austin Schuh, Wolfgang Grandegger
  Cc: Pavel Pisa, Marc Kleine-Budde, linux-can

Hi all,

my tests on these targets:

- Core i7 / 5x PEAK cPCI with linux-3.10.25 (stable)
- Core i7 / 5x PEAK cPCI with linux-3.10.11-rt7 (rt)
- Core i7 / Dell 6510 Laptop linux-3.13.0-rc4 (mainline head)

are all successful so far (Intel i7 M640@2.80GHz).

There was no irq thread issue on the -rt kernel (for at least 4 hours under
heavy load) and my USB mouse worked every time when booting my Laptop (at
least 10 times).

I'll continue applying the patch on my machines to check for any issues. But
so far it looks great.

Best regards,
Oliver

> On 21.12.2013 00:13, Austin Schuh wrote:
>> I have applied the fix proposed in https://lkml.org/lkml/2013/3/7/222
>> for the note_interrupt function right now, and will run a test this
>> weekend to see if it fixes it for sure.  I am now consistently seeing
>> only 1 / 100000 of the IRQ handler calls being counted as unhandled,
>> which is a lot better.


^ permalink raw reply	[flat|nested] 66+ messages in thread
