All of lore.kernel.org
 help / color / mirror / Atom feed
* CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
@ 2022-08-25 13:25 Jacob Kroon
  2022-08-26 11:24 ` Jacob Kroon
  0 siblings, 1 reply; 36+ messages in thread
From: Jacob Kroon @ 2022-08-25 13:25 UTC (permalink / raw)
  To: linux-can

Hi,

I am using a CM-ITC board 
(https://www.compulab.com/products/computer-on-modules/cm-itc/) with an 
application that uses the CAN interface. After a while of successfully 
sending packets, sendto() starts returning ENOBUFS. I wait a whole 
second and try to send, several retries, but I get ENOBUFS every time. 
I'm using kernel 5.15.59, and I've tried both the pch_can and c_can_pci 
driver, but both show the same error.

In the console I see several of:
can0: can_put_echo_skb: BUG! echo_skb 0 is occupied

I've also tried to increase the txqueuelen to 1000, as suggested here

https://stackoverflow.com/questions/40424433/write-no-buffer-space-available-socket-can-linux-can

but I think that if I increase the queuelen the threads just block 
forever in sendto() (sockets are opened in blocking mode)

If I bring down the interface with

ifconfig can0 down
ifconfig can0 up

the transmitting does get unblocked.

Is there anything I can do to debug this further ? Any other ideas ?

Thanks,
Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-08-25 13:25 CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS Jacob Kroon
@ 2022-08-26 11:24 ` Jacob Kroon
  2022-08-29  9:14   ` Jacob Kroon
  0 siblings, 1 reply; 36+ messages in thread
From: Jacob Kroon @ 2022-08-26 11:24 UTC (permalink / raw)
  To: linux-can

On 8/25/22 15:25, Jacob Kroon wrote:
> Hi,
> 
> I am using a CM-ITC board 
> (https://www.compulab.com/products/computer-on-modules/cm-itc/) with an 
> application that uses the CAN interface. After a while of successfully 
> sending packets, sendto() starts returning ENOBUFS. I wait a whole 
> second and try to send, several retries, but I get ENOBUFS every time. 
> I'm using kernel 5.15.59, and I've tried both the pch_can and c_can_pci 
> driver, but both show the same error.
> 
> In the console I see several of:
> can0: can_put_echo_skb: BUG! echo_skb 0 is occupied
> 
> I've also tried to increase the txqueuelen to 1000, as suggested here
> 
> https://stackoverflow.com/questions/40424433/write-no-buffer-space-available-socket-can-linux-can 
> 
> 
> but I think that if I increase the queuelen the threads just block 
> forever in sendto() (sockets are opened in blocking mode)
> 
> If I bring down the interface with
> 
> ifconfig can0 down
> ifconfig can0 up
> 
> the transmitting does get unblocked.
> 
> Is there anything I can do to debug this further ? Any other ideas ?
> 

First I get the print:

can_put_echo_skb: BUG! echo_skb 0 is occupied!

then netif_stop_queue() is called from here:

https://github.com/torvalds/linux/blob/master/drivers/net/can/c_can/c_can_main.c#L469

and then there is no call to netif_start_queue(), so the bus hangs.

Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-08-26 11:24 ` Jacob Kroon
@ 2022-08-29  9:14   ` Jacob Kroon
  2022-08-29 13:20     ` Jacob Kroon
  0 siblings, 1 reply; 36+ messages in thread
From: Jacob Kroon @ 2022-08-29  9:14 UTC (permalink / raw)
  To: linux-can

On 8/26/22 13:24, Jacob Kroon wrote:
> On 8/25/22 15:25, Jacob Kroon wrote:
>> Hi,
>>
>> I am using a CM-ITC board 
>> (https://www.compulab.com/products/computer-on-modules/cm-itc/) with 
>> an application that uses the CAN interface. After a while of 
>> successfully sending packets, sendto() starts returning ENOBUFS. I 
>> wait a whole second and try to send, several retries, but I get 
>> ENOBUFS every time. I'm using kernel 5.15.59, and I've tried both the 
>> pch_can and c_can_pci driver, but both show the same error.
>>
>> In the console I see several of:
>> can0: can_put_echo_skb: BUG! echo_skb 0 is occupied
>>
>> I've also tried to increase the txqueuelen to 1000, as suggested here
>>
>> https://stackoverflow.com/questions/40424433/write-no-buffer-space-available-socket-can-linux-can
>>
>> but I think that if I increase the queuelen the threads just block 
>> forever in sendto() (sockets are opened in blocking mode)
>>
>> If I bring down the interface with
>>
>> ifconfig can0 down
>> ifconfig can0 up
>>
>> the transmitting does get unblocked.
>>
>> Is there anything I can do to debug this further ? Any other ideas ?
>>
> 
> First I get the print:
> 
> can_put_echo_skb: BUG! echo_skb 0 is occupied!
> 
> then netif_stop_queue() is called from here:
> 
> https://github.com/torvalds/linux/blob/master/drivers/net/can/c_can/c_can_main.c#L469
> 
> and then there is no call to netif_start_queue(), so the bus hangs.
> 

Switching back to the pch_can driver. I'm guessing here but it would 
seem that the driver is not receiving the TX interrupt for 
'PCH_TX_OBJ_END' that would wake up the netif queue, since with the 
changes below I can't reproduce the hang, although I'm still seeing a 
lot of "echo_skb <x> is occupied!":

> diff --git a/drivers/net/can/pch_can.c b/drivers/net/can/pch_can.c
> index 964c8a09226a..75ad2272d9b2 100644
> --- a/drivers/net/can/pch_can.c
> +++ b/drivers/net/can/pch_can.c
> @@ -719,8 +720,7 @@ static void pch_can_tx_complete(struct net_device *ndev, u32 int_stat)
>                           PCH_IF_MCONT_DLC);
>         stats->tx_bytes += dlc;
>         stats->tx_packets++;
> -       if (int_stat == PCH_TX_OBJ_END)
> -               netif_wake_queue(ndev);
> +       netif_wake_queue(ndev);
>  }

Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-08-29  9:14   ` Jacob Kroon
@ 2022-08-29 13:20     ` Jacob Kroon
  2022-08-29 13:53       ` Oliver Hartkopp
  0 siblings, 1 reply; 36+ messages in thread
From: Jacob Kroon @ 2022-08-29 13:20 UTC (permalink / raw)
  To: linux-can, wg, mkl

Hi Wolfgang and Marc,

On 8/29/22 11:14, Jacob Kroon wrote:
> On 8/26/22 13:24, Jacob Kroon wrote:
>> On 8/25/22 15:25, Jacob Kroon wrote:
>>> Hi,
>>>
>>> I am using a CM-ITC board 
>>> (https://www.compulab.com/products/computer-on-modules/cm-itc/) with 
>>> an application that uses the CAN interface. After a while of 
>>> successfully sending packets, sendto() starts returning ENOBUFS. I 
>>> wait a whole second and try to send, several retries, but I get 
>>> ENOBUFS every time. I'm using kernel 5.15.59, and I've tried both the 
>>> pch_can and c_can_pci driver, but both show the same error.
>>>
>>> In the console I see several of:
>>> can0: can_put_echo_skb: BUG! echo_skb 0 is occupied
>>>
>>> I've also tried to increase the txqueuelen to 1000, as suggested here
>>>
>>> https://stackoverflow.com/questions/40424433/write-no-buffer-space-available-socket-can-linux-can
>>>
>>> but I think that if I increase the queuelen the threads just block 
>>> forever in sendto() (sockets are opened in blocking mode)
>>>
>>> If I bring down the interface with
>>>
>>> ifconfig can0 down
>>> ifconfig can0 up
>>>
>>> the transmitting does get unblocked.
>>>
>>> Is there anything I can do to debug this further ? Any other ideas ?
>>>
>>
>> First I get the print:
>>
>> can_put_echo_skb: BUG! echo_skb 0 is occupied!
>>
>> then netif_stop_queue() is called from here:
>>
>> https://github.com/torvalds/linux/blob/master/drivers/net/can/c_can/c_can_main.c#L469
>>
>> and then there is no call to netif_start_queue(), so the bus hangs.
>>
> 
> Switching back to the pch_can driver. I'm guessing here but it would 
> seem that the driver is not receiving the TX interrupt for 
> 'PCH_TX_OBJ_END' that would wake up the netif queue, since with the 
> changes below I can't reproduce the hang, although I'm still seeing a 
> lot of "echo_skb <x> is occupied!":
> 
>> diff --git a/drivers/net/can/pch_can.c b/drivers/net/can/pch_can.c
>> index 964c8a09226a..75ad2272d9b2 100644
>> --- a/drivers/net/can/pch_can.c
>> +++ b/drivers/net/can/pch_can.c
>> @@ -719,8 +720,7 @@ static void pch_can_tx_complete(struct net_device 
>> *ndev, u32 int_stat)
>>                           PCH_IF_MCONT_DLC);
>>         stats->tx_bytes += dlc;
>>         stats->tx_packets++;
>> -       if (int_stat == PCH_TX_OBJ_END)
>> -               netif_wake_queue(ndev);
>> +       netif_wake_queue(ndev);
>>  }
> 

Is there a recommended driver for the Intel EG20T PCH CAN controller ? 
pch_can or c_can_pci ?

Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-08-29 13:20     ` Jacob Kroon
@ 2022-08-29 13:53       ` Oliver Hartkopp
  2022-08-30 12:59         ` Jacob Kroon
  0 siblings, 1 reply; 36+ messages in thread
From: Oliver Hartkopp @ 2022-08-29 13:53 UTC (permalink / raw)
  To: Jacob Kroon, linux-can, wg, mkl

Hi Jacob,

On 29.08.22 15:20, Jacob Kroon wrote:
> Hi Wolfgang and Marc,
> 
> On 8/29/22 11:14, Jacob Kroon wrote:
>> On 8/26/22 13:24, Jacob Kroon wrote:
>>> On 8/25/22 15:25, Jacob Kroon wrote:
>>>> Hi,
>>>>
>>>> I am using a CM-ITC board 
>>>> (https://www.compulab.com/products/computer-on-modules/cm-itc/) with 
>>>> an application that uses the CAN interface. After a while of 
>>>> successfully sending packets, sendto() starts returning ENOBUFS. I 
>>>> wait a whole second and try to send, several retries, but I get 
>>>> ENOBUFS every time. I'm using kernel 5.15.59, and I've tried both 
>>>> the pch_can and c_can_pci driver, but both show the same error.
>>>>
>>>> In the console I see several of:
>>>> can0: can_put_echo_skb: BUG! echo_skb 0 is occupied
>>>>
>>>> I've also tried to increase the txqueuelen to 1000, as suggested here
>>>>
>>>> https://stackoverflow.com/questions/40424433/write-no-buffer-space-available-socket-can-linux-can 
>>>>
>>>>
>>>> but I think that if I increase the queuelen the threads just block 
>>>> forever in sendto() (sockets are opened in blocking mode)
>>>>
>>>> If I bring down the interface with
>>>>
>>>> ifconfig can0 down
>>>> ifconfig can0 up
>>>>
>>>> the transmitting does get unblocked.
>>>>
>>>> Is there anything I can do to debug this further ? Any other ideas ?
>>>>
>>>
>>> First I get the print:
>>>
>>> can_put_echo_skb: BUG! echo_skb 0 is occupied!
>>>
>>> then netif_stop_queue() is called from here:
>>>
>>> https://github.com/torvalds/linux/blob/master/drivers/net/can/c_can/c_can_main.c#L469 
>>>
>>>
>>> and then there is no call to netif_start_queue(), so the bus hangs.
>>>
>>
>> Switching back to the pch_can driver. I'm guessing here but it would 
>> seem that the driver is not receiving the TX interrupt for 
>> 'PCH_TX_OBJ_END' that would wake up the netif queue, since with the 
>> changes below I can't reproduce the hang, although I'm still seeing a 
>> lot of "echo_skb <x> is occupied!":
>>
>>> diff --git a/drivers/net/can/pch_can.c b/drivers/net/can/pch_can.c
>>> index 964c8a09226a..75ad2272d9b2 100644
>>> --- a/drivers/net/can/pch_can.c
>>> +++ b/drivers/net/can/pch_can.c
>>> @@ -719,8 +720,7 @@ static void pch_can_tx_complete(struct net_device 
>>> *ndev, u32 int_stat)
>>>                           PCH_IF_MCONT_DLC);
>>>         stats->tx_bytes += dlc;
>>>         stats->tx_packets++;
>>> -       if (int_stat == PCH_TX_OBJ_END)
>>> -               netif_wake_queue(ndev);
>>> +       netif_wake_queue(ndev);
>>>  }
>>
> 
> Is there a recommended driver for the Intel EG20T PCH CAN controller ? 
> pch_can or c_can_pci ?

I'm very happy that you show up with this question!

AFAIK the PCH_CAN driver was contributed by some OKI developers on 
behalf of Intel to support their EG20T platform. This driver has not 
been worked on for years now.

Later the C_CAN driver got its PCI support which should have made the 
PCH_CAN driver obsolete as they are both using the C_CAN IP core.

So my suggestion would be to work with the well maintained c_can_pci and 
mark the pch_can obsolete in the Linux kernel.

IIRC Marc is very experienced with the C_CAN driver but he is still on 
vacation this week.

Regarding your problem:
"BUG! echo_skb 0 is occupied" should never show up but I wonder if you 
probably have another CAN specific problem with your setup.

Did you set up a CAN network with at least two CAN nodes, identical 
bitrate settings and 2x120 Ohms (== 60 Ohms) termination between CAN_L 
and CAN_H?

Do you have another CAN node which can be attached to the EG20T setup 
(e.g. some ECU or an USB CAN adapter)?

Best regards,
Oliver

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-08-29 13:53       ` Oliver Hartkopp
@ 2022-08-30 12:59         ` Jacob Kroon
  2022-08-30 19:15           ` Oliver Hartkopp
  0 siblings, 1 reply; 36+ messages in thread
From: Jacob Kroon @ 2022-08-30 12:59 UTC (permalink / raw)
  To: Oliver Hartkopp, linux-can, wg, mkl

Hi Oliver,

On 8/29/22 15:53, Oliver Hartkopp wrote:
> Hi Jacob,
> 
> On 29.08.22 15:20, Jacob Kroon wrote:
>> Hi Wolfgang and Marc,
>>
>> On 8/29/22 11:14, Jacob Kroon wrote:
>>> On 8/26/22 13:24, Jacob Kroon wrote:
>>>> On 8/25/22 15:25, Jacob Kroon wrote:
>>>>> Hi,
>>>>>
>>>>> I am using a CM-ITC board 
>>>>> (https://www.compulab.com/products/computer-on-modules/cm-itc/) 
>>>>> with an application that uses the CAN interface. After a while of 
>>>>> successfully sending packets, sendto() starts returning ENOBUFS. I 
>>>>> wait a whole second and try to send, several retries, but I get 
>>>>> ENOBUFS every time. I'm using kernel 5.15.59, and I've tried both 
>>>>> the pch_can and c_can_pci driver, but both show the same error.
>>>>>
>>>>> In the console I see several of:
>>>>> can0: can_put_echo_skb: BUG! echo_skb 0 is occupied
>>>>>
>>>>> I've also tried to increase the txqueuelen to 1000, as suggested here
>>>>>
>>>>> https://stackoverflow.com/questions/40424433/write-no-buffer-space-available-socket-can-linux-can
>>>>>
>>>>> but I think that if I increase the queuelen the threads just block 
>>>>> forever in sendto() (sockets are opened in blocking mode)
>>>>>
>>>>> If I bring down the interface with
>>>>>
>>>>> ifconfig can0 down
>>>>> ifconfig can0 up
>>>>>
>>>>> the transmitting does get unblocked.
>>>>>
>>>>> Is there anything I can do to debug this further ? Any other ideas ?
>>>>>
>>>>
>>>> First I get the print:
>>>>
>>>> can_put_echo_skb: BUG! echo_skb 0 is occupied!
>>>>
>>>> then netif_stop_queue() is called from here:
>>>>
>>>> https://github.com/torvalds/linux/blob/master/drivers/net/can/c_can/c_can_main.c#L469
>>>>
>>>> and then there is no call to netif_start_queue(), so the bus hangs.
>>>>
>>>
>>> Switching back to the pch_can driver. I'm guessing here but it would 
>>> seem that the driver is not receiving the TX interrupt for 
>>> 'PCH_TX_OBJ_END' that would wake up the netif queue, since with the 
>>> changes below I can't reproduce the hang, although I'm still seeing a 
>>> lot of "echo_skb <x> is occupied!":
>>>
>>>> diff --git a/drivers/net/can/pch_can.c b/drivers/net/can/pch_can.c
>>>> index 964c8a09226a..75ad2272d9b2 100644
>>>> --- a/drivers/net/can/pch_can.c
>>>> +++ b/drivers/net/can/pch_can.c
>>>> @@ -719,8 +720,7 @@ static void pch_can_tx_complete(struct 
>>>> net_device *ndev, u32 int_stat)
>>>>                           PCH_IF_MCONT_DLC);
>>>>         stats->tx_bytes += dlc;
>>>>         stats->tx_packets++;
>>>> -       if (int_stat == PCH_TX_OBJ_END)
>>>> -               netif_wake_queue(ndev);
>>>> +       netif_wake_queue(ndev);
>>>>  }
>>>
>>
>> Is there a recommended driver for the Intel EG20T PCH CAN controller ? 
>> pch_can or c_can_pci ?
> 
> I'm very happy that you show up with this question!
> 
> AFAIK the PCH_CAN driver was contributed by some OKI developers on 
> behalf of Intel to support their EG20T platform. This driver has not 
> been worked on for years now.
> 
> Later the C_CAN driver got its PCI support which should have made the 
> PCH_CAN driver obsolete as they are both using the C_CAN IP core.
> 
> So my suggestion would be to work with the well maintained c_can_pci and 
> mark the pch_can obsolete in the Linux kernel.
> 

Ok.

> IIRC Marc is very experienced with the C_CAN driver but he is still on 
> vacation this week.
> 
> Regarding your problem:
> "BUG! echo_skb 0 is occupied" should never show up but I wonder if you 
> probably have another CAN specific problem with your setup.
> 
> Did you set up a CAN network with at least two CAN nodes, identical 
> bitrate settings and 2x120 Ohms (== 60 Ohms) termination between CAN_L 
> and CAN_H?
> 

I have double checked that both endpoints of the network are terminated 
with 120Ohm resistors, and when I check CAN_H/CAN_L in an oscilloscope 
they look ok, and within the spec. voltage levels.

> Do you have another CAN node which can be attached to the EG20T setup 
> (e.g. some ECU or an USB CAN adapter)?

Yes I do have a CAN analyser from Microchip. I guess I can record all 
traffic with the analyzer, and compare it to what I see with "candump 
can0" on the host. Or do you have some other suggestion ?

Thanks for responding,
Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-08-30 12:59         ` Jacob Kroon
@ 2022-08-30 19:15           ` Oliver Hartkopp
  2022-09-01  9:38             ` Jacob Kroon
  0 siblings, 1 reply; 36+ messages in thread
From: Oliver Hartkopp @ 2022-08-30 19:15 UTC (permalink / raw)
  To: Jacob Kroon, linux-can, wg, mkl

Hi Jacob,

On 30.08.22 14:59, Jacob Kroon wrote:

> On 8/29/22 15:53, Oliver Hartkopp wrote:

>> Regarding your problem:
>> "BUG! echo_skb 0 is occupied" should never show up but I wonder if you 
>> probably have another CAN specific problem with your setup.
>>
>> Did you set up a CAN network with at least two CAN nodes, identical 
>> bitrate settings and 2x120 Ohms (== 60 Ohms) termination between CAN_L 
>> and CAN_H?
>>
> 
> I have double checked that both endpoints of the network are terminated 
> with 120Ohm resistors, and when I check CAN_H/CAN_L in an oscilloscope 
> they look ok, and within the spec. voltage levels.

https://www.compulab.com/wp-content/uploads/2011/08/CM-iTC-Reference-Guide.pdf

"it is necessary to add transceiver
hardware (see the SB-iTC reference schematic)"

I assume you have a transceiver, right? ;-)

What is the other endpoint? The EG20T and another (automotive) ECU?

>> Do you have another CAN node which can be attached to the EG20T setup 
>> (e.g. some ECU or an USB CAN adapter)?
> 
> Yes I do have a CAN analyser from Microchip. I guess I can record all 
> traffic with the analyzer, and compare it to what I see with "candump 
> can0" on the host. Or do you have some other suggestion ?

Yes, please add the CAN analyzer from Microchip too!

The problem with only two nodes is that you have to be very precise with 
bitrate settings and sampling points so that the receiving node needs to 
properly set the ACK to acknowlege the CAN frame.

I had been working with a MSCAN system some time ago and that wasn't 
able to talk to a commercial CAN tool until I added another node (from 
another CAN tool provider).

Maybe you can make the other node talk to the Microchip CAN analyzer and 
let the EG20T receive that traffic first.

Best,
Oliver

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-08-30 19:15           ` Oliver Hartkopp
@ 2022-09-01  9:38             ` Jacob Kroon
  2022-09-01 16:35               ` Oliver Hartkopp
  2022-09-05 15:54               ` Marc Kleine-Budde
  0 siblings, 2 replies; 36+ messages in thread
From: Jacob Kroon @ 2022-09-01  9:38 UTC (permalink / raw)
  To: Oliver Hartkopp, linux-can, wg, mkl

Hi Oliver,

On 8/30/22 21:15, Oliver Hartkopp wrote:
> Hi Jacob,
> 
> On 30.08.22 14:59, Jacob Kroon wrote:
> 
>> On 8/29/22 15:53, Oliver Hartkopp wrote:
> 
>>> Regarding your problem:
>>> "BUG! echo_skb 0 is occupied" should never show up but I wonder if 
>>> you probably have another CAN specific problem with your setup.
>>>
>>> Did you set up a CAN network with at least two CAN nodes, identical 
>>> bitrate settings and 2x120 Ohms (== 60 Ohms) termination between 
>>> CAN_L and CAN_H?
>>>
>>
>> I have double checked that both endpoints of the network are 
>> terminated with 120Ohm resistors, and when I check CAN_H/CAN_L in an 
>> oscilloscope they look ok, and within the spec. voltage levels.
> 
> https://www.compulab.com/wp-content/uploads/2011/08/CM-iTC-Reference-Guide.pdf
> 
> "it is necessary to add transceiver
> hardware (see the SB-iTC reference schematic)"
> 
> I assume you have a transceiver, right? ;-)
> 

Yes,, all nodes are using a TJA1050 transceiver
(https://www.nxp.com/docs/en/data-sheet/TJA1050.pdf)

> What is the other endpoint? The EG20T and another (automotive) ECU?

Currently I have 4 nodes in the network, EG20T is in one end.

> 
>>> Do you have another CAN node which can be attached to the EG20T setup 
>>> (e.g. some ECU or an USB CAN adapter)?
>>
>> Yes I do have a CAN analyser from Microchip. I guess I can record all 
>> traffic with the analyzer, and compare it to what I see with "candump 
>> can0" on the host. Or do you have some other suggestion ?
> 
> Yes, please add the CAN analyzer from Microchip too!
> 
> The problem with only two nodes is that you have to be very precise with 
> bitrate settings and sampling points so that the receiving node needs to 
> properly set the ACK to acknowlege the CAN frame.
> 
> I had been working with a MSCAN system some time ago and that wasn't 
> able to talk to a commercial CAN tool until I added another node (from 
> another CAN tool provider).
> 
> Maybe you can make the other node talk to the Microchip CAN analyzer and 
> let the EG20T receive that traffic first.
> 

I used "candump can0 -l" on the EG20T host to capture the traffic, and 
then connected an CAN USB analyzer to the network and used that to 
capture the traffic. One thing sticks out. This is the log from the CAN 
USB analyzer:

> ...
> 505.7052;RX;0x464;3;0x01;0x01;0x00;;;;;;
> 505.7052;RX;0x464;3;0x00;0x00;0x00;;;;;;
> 505.7063;RX;0x65;64;;;;;;;;;
> 505.7662;RX;0x440;3;0x32;0x20;0xFA;;;;;;
> 505.7912;RX;0x44C;3;0x35;0x20;0xFA;;;;;;
> 505.9632;RX;0x464;3;0x00;0x00;0x00;;;;;;
> 505.9632;RX;0x464;3;0x01;0x01;0x00;;;;;;
> 505.9752;RX;0x468;3;0x51;0x20;0xFA;;;;;;
> 506.0362;RX;0x440;3;0x32;0x20;0xFA;;;;;;
> 506.0622;RX;0x44C;3;0x35;0x20;0xFA;;;;;;
> 506.2112;RX;0x464;3;0x00;0x00;0x00;;;;;;
> 506.2112;RX;0x464;3;0x00;0x00;0x00;;;;;;
> 506.2462;RX;0x468;3;0x51;0x20;0xFA;;;;;;
> 506.3072;RX;0x440;3;0x32;0x20;0xFA;;;;;;
> 506.3322;RX;0x44C;3;0x35;0x20;0xFA;;;;;;
> 506.4572;RX;0x464;3;0x00;0x00;0x00;;;;;;
> 506.4580;RX;0x464;3;0x00;0x00;0x00;;;;;;
> 506.5162;RX;0x468;3;0x51;0x20;0xFA;;;;;;
> 522.7203;RX;0x1E;1;0xFF;;;;;;;;
> ...

Note the third message from the top. This is what "candump" on the host 
logs:

> ...
> (1662022485.638794) can0 464#010100
> (1662022485.638940) can0 464#000000
> (1662022485.699405) can0 440#3220FA
> (1662022485.725166) can0 44C#3520FA
> (1662022485.896858) can0 464#000000
> (1662022485.897382) can0 464#010100
> (1662022485.909042) can0 468#5120FA
> (1662022485.970036) can0 440#3220FA
> (1662022485.995596) can0 44C#3520FA
> (1662022486.144685) can0 464#000000
> (1662022486.144768) can0 464#000000
> (1662022486.179595) can0 468#5120FA
> (1662022486.240561) can0 440#3220FA
> (1662022486.266274) can0 44C#3520FA
> (1662022486.391248) can0 464#000000
> (1662022486.391469) can0 464#000000
> (1662022486.450115) can0 468#5120FA
> (1662022502.662035) can0 01E#FF
> ...

It fails to see the 3rd message from the previous log. What would that 
indicate ? The CAN analyzer sees the message, but the EG20T doesn't.

Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-01  9:38             ` Jacob Kroon
@ 2022-09-01 16:35               ` Oliver Hartkopp
  2022-09-02 15:13                 ` Jacob Kroon
  2022-09-05 15:54               ` Marc Kleine-Budde
  1 sibling, 1 reply; 36+ messages in thread
From: Oliver Hartkopp @ 2022-09-01 16:35 UTC (permalink / raw)
  To: Jacob Kroon, linux-can, wg, mkl

Hi Jacob,

On 01.09.22 11:38, Jacob Kroon wrote:
> On 8/30/22 21:15, Oliver Hartkopp wrote:

>> I assume you have a transceiver, right? ;-)
>>
> 
> Yes,, all nodes are using a TJA1050 transceiver
> (https://www.nxp.com/docs/en/data-sheet/TJA1050.pdf)

Good!

>> What is the other endpoint? The EG20T and another (automotive) ECU?
> 
> Currently I have 4 nodes in the network, EG20T is in one end.
> 

Ok, that's a good base for testing.

>>>> Do you have another CAN node which can be attached to the EG20T 
>>>> setup (e.g. some ECU or an USB CAN adapter)?
>>>
>>> Yes I do have a CAN analyser from Microchip. I guess I can record all 
>>> traffic with the analyzer, and compare it to what I see with "candump 
>>> can0" on the host. Or do you have some other suggestion ?
>>
>> Yes, please add the CAN analyzer from Microchip too!
>>
>> The problem with only two nodes is that you have to be very precise 
>> with bitrate settings and sampling points so that the receiving node 
>> needs to properly set the ACK to acknowlege the CAN frame.
>>
>> I had been working with a MSCAN system some time ago and that wasn't 
>> able to talk to a commercial CAN tool until I added another node (from 
>> another CAN tool provider).
>>
>> Maybe you can make the other node talk to the Microchip CAN analyzer 
>> and let the EG20T receive that traffic first.
>>
> 
> I used "candump can0 -l" on the EG20T host to capture the traffic, and 
> then connected an CAN USB analyzer to the network and used that to 
> capture the traffic. One thing sticks out. This is the log from the CAN 
> USB analyzer:
> 
>> ...
>> 505.7052;RX;0x464;3;0x01;0x01;0x00;;;;;;
>> 505.7052;RX;0x464;3;0x00;0x00;0x00;;;;;;
>> 505.7063;RX;0x65;64;;;;;;;;;

What should this be?

A length of 64 and no data ??

This is no valid CAN frame.

>> 505.7662;RX;0x440;3;0x32;0x20;0xFA;;;;;;
(..)

> 
> Note the third message from the top. This is what "candump" on the host 
> logs:
> 
>> ...
>> (1662022485.638794) can0 464#010100
>> (1662022485.638940) can0 464#000000
>> (1662022485.699405) can0 440#3220FA

The correct CAN frames are displayed correctly.

>> ...
> 
> It fails to see the 3rd message from the previous log. What would that 
> indicate ? The CAN analyzer sees the message, but the EG20T doesn't.

Don't know if this is an error on the CAN bus. You can also print error 
messages of detected CAN bus problems with adding an error message filter.

See 'candump -h' :

candump -l any,0:0,#FFFFFFFF
          (log error frames and also all data frames)


Best regards,
Oliver

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-01 16:35               ` Oliver Hartkopp
@ 2022-09-02 15:13                 ` Jacob Kroon
  2022-09-02 16:39                   ` Jacob Kroon
  2022-09-05 14:17                   ` Marc Kleine-Budde
  0 siblings, 2 replies; 36+ messages in thread
From: Jacob Kroon @ 2022-09-02 15:13 UTC (permalink / raw)
  To: Oliver Hartkopp, linux-can, wg, mkl

Hi Oliver,

On 9/1/22 18:35, Oliver Hartkopp wrote:
> Hi Jacob,
> 
> On 01.09.22 11:38, Jacob Kroon wrote:
>> On 8/30/22 21:15, Oliver Hartkopp wrote:
> 
>>> I assume you have a transceiver, right? ;-)
>>>
>>
>> Yes,, all nodes are using a TJA1050 transceiver
>> (https://www.nxp.com/docs/en/data-sheet/TJA1050.pdf)
> 
> Good!
> 
>>> What is the other endpoint? The EG20T and another (automotive) ECU?
>>
>> Currently I have 4 nodes in the network, EG20T is in one end.
>>
> 
> Ok, that's a good base for testing.
> 
>>>>> Do you have another CAN node which can be attached to the EG20T 
>>>>> setup (e.g. some ECU or an USB CAN adapter)?
>>>>
>>>> Yes I do have a CAN analyser from Microchip. I guess I can record 
>>>> all traffic with the analyzer, and compare it to what I see with 
>>>> "candump can0" on the host. Or do you have some other suggestion ?
>>>
>>> Yes, please add the CAN analyzer from Microchip too!
>>>
>>> The problem with only two nodes is that you have to be very precise 
>>> with bitrate settings and sampling points so that the receiving node 
>>> needs to properly set the ACK to acknowlege the CAN frame.
>>>
>>> I had been working with a MSCAN system some time ago and that wasn't 
>>> able to talk to a commercial CAN tool until I added another node 
>>> (from another CAN tool provider).
>>>
>>> Maybe you can make the other node talk to the Microchip CAN analyzer 
>>> and let the EG20T receive that traffic first.
>>>
>>
>> I used "candump can0 -l" on the EG20T host to capture the traffic, and 
>> then connected an CAN USB analyzer to the network and used that to 
>> capture the traffic. One thing sticks out. This is the log from the 
>> CAN USB analyzer:
>>
>>> ...
>>> 505.7052;RX;0x464;3;0x01;0x01;0x00;;;;;;
>>> 505.7052;RX;0x464;3;0x00;0x00;0x00;;;;;;
>>> 505.7063;RX;0x65;64;;;;;;;;;
> 
> What should this be?
> 
> A length of 64 and no data ??
> 
> This is no valid CAN frame.
> 
>>> 505.7662;RX;0x440;3;0x32;0x20;0xFA;;;;;;
> (..)
> 
>>
>> Note the third message from the top. This is what "candump" on the 
>> host logs:
>>
>>> ...
>>> (1662022485.638794) can0 464#010100
>>> (1662022485.638940) can0 464#000000
>>> (1662022485.699405) can0 440#3220FA
> 
> The correct CAN frames are displayed correctly.
> 
>>> ...
>>
>> It fails to see the 3rd message from the previous log. What would that 
>> indicate ? The CAN analyzer sees the message, but the EG20T doesn't.
> 
> Don't know if this is an error on the CAN bus. You can also print error 
> messages of detected CAN bus problems with adding an error message filter.
> 
> See 'candump -h' :
> 
> candump -l any,0:0,#FFFFFFFF
>           (log error frames and also all data frames)
> 
> 

Thank you Oliver for all the good hints.

I've done some more logging, but there are no error frames being logged.

I can see that both pch_can and c_can_pci drivers call 
can_put_echo_skb() in their ndo_start_xmit functions, but neither checks 
the return value whether it succeeded or not. Shouldn't both these 
return NETDEV_TX_BUSY if there are no echo slots available ?

One reason I ask is because whenever I strace the application, it would 
seem the problem goes away, and I'm guessing strace:ing will slow down 
my application.

Regards
Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-02 15:13                 ` Jacob Kroon
@ 2022-09-02 16:39                   ` Jacob Kroon
  2022-09-05 14:17                   ` Marc Kleine-Budde
  1 sibling, 0 replies; 36+ messages in thread
From: Jacob Kroon @ 2022-09-02 16:39 UTC (permalink / raw)
  To: Oliver Hartkopp, linux-can, wg, mkl

On 9/2/22 17:13, Jacob Kroon wrote:
> Hi Oliver,
> 
> On 9/1/22 18:35, Oliver Hartkopp wrote:
>> Hi Jacob,
>>
>> On 01.09.22 11:38, Jacob Kroon wrote:
>>> On 8/30/22 21:15, Oliver Hartkopp wrote:
>>
>>>> I assume you have a transceiver, right? ;-)
>>>>
>>>
>>> Yes,, all nodes are using a TJA1050 transceiver
>>> (https://www.nxp.com/docs/en/data-sheet/TJA1050.pdf)
>>
>> Good!
>>
>>>> What is the other endpoint? The EG20T and another (automotive) ECU?
>>>
>>> Currently I have 4 nodes in the network, EG20T is in one end.
>>>
>>
>> Ok, that's a good base for testing.
>>
>>>>>> Do you have another CAN node which can be attached to the EG20T 
>>>>>> setup (e.g. some ECU or an USB CAN adapter)?
>>>>>
>>>>> Yes I do have a CAN analyser from Microchip. I guess I can record 
>>>>> all traffic with the analyzer, and compare it to what I see with 
>>>>> "candump can0" on the host. Or do you have some other suggestion ?
>>>>
>>>> Yes, please add the CAN analyzer from Microchip too!
>>>>
>>>> The problem with only two nodes is that you have to be very precise 
>>>> with bitrate settings and sampling points so that the receiving node 
>>>> needs to properly set the ACK to acknowlege the CAN frame.
>>>>
>>>> I had been working with a MSCAN system some time ago and that wasn't 
>>>> able to talk to a commercial CAN tool until I added another node 
>>>> (from another CAN tool provider).
>>>>
>>>> Maybe you can make the other node talk to the Microchip CAN analyzer 
>>>> and let the EG20T receive that traffic first.
>>>>
>>>
>>> I used "candump can0 -l" on the EG20T host to capture the traffic, 
>>> and then connected an CAN USB analyzer to the network and used that 
>>> to capture the traffic. One thing sticks out. This is the log from 
>>> the CAN USB analyzer:
>>>
>>>> ...
>>>> 505.7052;RX;0x464;3;0x01;0x01;0x00;;;;;;
>>>> 505.7052;RX;0x464;3;0x00;0x00;0x00;;;;;;
>>>> 505.7063;RX;0x65;64;;;;;;;;;
>>
>> What should this be?
>>
>> A length of 64 and no data ??
>>
>> This is no valid CAN frame.
>>
>>>> 505.7662;RX;0x440;3;0x32;0x20;0xFA;;;;;;
>> (..)
>>
>>>
>>> Note the third message from the top. This is what "candump" on the 
>>> host logs:
>>>
>>>> ...
>>>> (1662022485.638794) can0 464#010100
>>>> (1662022485.638940) can0 464#000000
>>>> (1662022485.699405) can0 440#3220FA
>>
>> The correct CAN frames are displayed correctly.
>>
>>>> ...
>>>
>>> It fails to see the 3rd message from the previous log. What would 
>>> that indicate ? The CAN analyzer sees the message, but the EG20T 
>>> doesn't.
>>
>> Don't know if this is an error on the CAN bus. You can also print 
>> error messages of detected CAN bus problems with adding an error 
>> message filter.
>>
>> See 'candump -h' :
>>
>> candump -l any,0:0,#FFFFFFFF
>>           (log error frames and also all data frames)
>>
>>
> 
> Thank you Oliver for all the good hints.
> 
> I've done some more logging, but there are no error frames being logged.
> 
> I can see that both pch_can and c_can_pci drivers call 
> can_put_echo_skb() in their ndo_start_xmit functions, but neither checks 
> the return value whether it succeeded or not. Shouldn't both these 
> return NETDEV_TX_BUSY if there are no echo slots available ?
> 
> One reason I ask is because whenever I strace the application, it would 
> seem the problem goes away, and I'm guessing strace:ing will slow down 
> my application.
> 

I did try the patch below, but then I just get the lockups without the 
warning messages:

> diff --git a/drivers/net/can/pch_can.c b/drivers/net/can/pch_can.c
> index 964c8a09226a..0a230368c443 100644
> --- a/drivers/net/can/pch_can.c
> +++ b/drivers/net/can/pch_can.c
> @@ -889,6 +889,10 @@ static netdev_tx_t pch_xmit(struct sk_buff *skb, struct net_device *ndev)
>                 return NETDEV_TX_OK;
>  
>         tx_obj_no = priv->tx_obj;
> +
> +       if (priv->can.echo_skb[tx_obj_no - PCH_RX_OBJ_END - 1])
> +               return NETDEV_TX_BUSY;
> +
>         if (priv->tx_obj == PCH_TX_OBJ_END) {
>                 if (ioread32(&priv->regs->treq2) & PCH_TREQ2_TX_MASK)
>                         netif_stop_queue(ndev);

Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-02 15:13                 ` Jacob Kroon
  2022-09-02 16:39                   ` Jacob Kroon
@ 2022-09-05 14:17                   ` Marc Kleine-Budde
  1 sibling, 0 replies; 36+ messages in thread
From: Marc Kleine-Budde @ 2022-09-05 14:17 UTC (permalink / raw)
  To: Jacob Kroon; +Cc: Oliver Hartkopp, linux-can

[-- Attachment #1: Type: text/plain, Size: 1418 bytes --]

On 02.09.2022 17:13:46, Jacob Kroon wrote:
> Thank you Oliver for all the good hints.
> 
> I've done some more logging, but there are no error frames being logged.
> 
> I can see that both pch_can and c_can_pci drivers call can_put_echo_skb() in
> their ndo_start_xmit functions, but neither checks the return value whether
> it succeeded or not. Shouldn't both these return NETDEV_TX_BUSY if there are
> no echo slots available ?

No - The TX queue between the kernel and the driver should only be
active in the first place if there is room in the hardware TX queue. If
there is room in the hardware TX queue, there's room in the echo skb
array, too.

> One reason I ask is because whenever I strace the application, it would seem
> the problem goes away, and I'm guessing strace:ing will slow down my
> application.

ACK - strace slows down the app, probably resulting in the TX queue not
run full.

If I find some time I can try to reproduce the problem on a beagle bone,
which uses d_can core. Just got back from holidays, so I'm quite busy
right now. Drop me a note, if you need commercial support.

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-01  9:38             ` Jacob Kroon
  2022-09-01 16:35               ` Oliver Hartkopp
@ 2022-09-05 15:54               ` Marc Kleine-Budde
  2022-09-16  4:14                 ` Jacob Kroon
  1 sibling, 1 reply; 36+ messages in thread
From: Marc Kleine-Budde @ 2022-09-05 15:54 UTC (permalink / raw)
  To: Jacob Kroon; +Cc: Oliver Hartkopp, linux-can, wg

[-- Attachment #1: Type: text/plain, Size: 2693 bytes --]

On 01.09.2022 11:38:31, Jacob Kroon wrote:
> I used "candump can0 -l" on the EG20T host to capture the traffic, and
> then connected an CAN USB analyzer to the network and used that to
> capture the traffic. One thing sticks out. This is the log from the
> CAN USB analyzer:

Who generates these CAN messages?

> > ...
> > 505.7052;RX;0x464;3;0x01;0x01;0x00;;;;;;
> > 505.7052;RX;0x464;3;0x00;0x00;0x00;;;;;;
> > 505.7063;RX;0x65;64;;;;;;;;;

As Oliver pointed out, this doesn't look like a valid CAN frame. Is the
analyzer and/or sender configured for CAN-FD?

> > 505.7662;RX;0x440;3;0x32;0x20;0xFA;;;;;;
> > 505.7912;RX;0x44C;3;0x35;0x20;0xFA;;;;;;
> > 505.9632;RX;0x464;3;0x00;0x00;0x00;;;;;;
> > 505.9632;RX;0x464;3;0x01;0x01;0x00;;;;;;
> > 505.9752;RX;0x468;3;0x51;0x20;0xFA;;;;;;
> > 506.0362;RX;0x440;3;0x32;0x20;0xFA;;;;;;
> > 506.0622;RX;0x44C;3;0x35;0x20;0xFA;;;;;;
> > 506.2112;RX;0x464;3;0x00;0x00;0x00;;;;;;
> > 506.2112;RX;0x464;3;0x00;0x00;0x00;;;;;;
> > 506.2462;RX;0x468;3;0x51;0x20;0xFA;;;;;;
> > 506.3072;RX;0x440;3;0x32;0x20;0xFA;;;;;;
> > 506.3322;RX;0x44C;3;0x35;0x20;0xFA;;;;;;
> > 506.4572;RX;0x464;3;0x00;0x00;0x00;;;;;;
> > 506.4580;RX;0x464;3;0x00;0x00;0x00;;;;;;
> > 506.5162;RX;0x468;3;0x51;0x20;0xFA;;;;;;
> > 522.7203;RX;0x1E;1;0xFF;;;;;;;;
> > ...
> 
> Note the third message from the top. This is what "candump" on the host
> logs:
> 
> > ...
> > (1662022485.638794) can0 464#010100
> > (1662022485.638940) can0 464#000000
> > (1662022485.699405) can0 440#3220FA
> > (1662022485.725166) can0 44C#3520FA
> > (1662022485.896858) can0 464#000000
> > (1662022485.897382) can0 464#010100
> > (1662022485.909042) can0 468#5120FA
> > (1662022485.970036) can0 440#3220FA
> > (1662022485.995596) can0 44C#3520FA
> > (1662022486.144685) can0 464#000000
> > (1662022486.144768) can0 464#000000
> > (1662022486.179595) can0 468#5120FA
> > (1662022486.240561) can0 440#3220FA
> > (1662022486.266274) can0 44C#3520FA
> > (1662022486.391248) can0 464#000000
> > (1662022486.391469) can0 464#000000
> > (1662022486.450115) can0 468#5120FA
> > (1662022502.662035) can0 01E#FF
> > ...
> 
> It fails to see the 3rd message from the previous log. What would that
> indicate ? The CAN analyzer sees the message, but the EG20T doesn't.

Is this error somehow related to the "can0: can_put_echo_skb: BUG!
echo_skb 0 is occupied"?

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-05 15:54               ` Marc Kleine-Budde
@ 2022-09-16  4:14                 ` Jacob Kroon
  2022-09-19 23:24                   ` Jacob Kroon
  2022-09-23 11:36                   ` Marc Kleine-Budde
  0 siblings, 2 replies; 36+ messages in thread
From: Jacob Kroon @ 2022-09-16  4:14 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: Oliver Hartkopp, linux-can, wg, dariobin

Hi Marc and Dario,

(CC:ing patch author Dario)

On 9/5/22 17:54, Marc Kleine-Budde wrote:
> On 01.09.2022 11:38:31, Jacob Kroon wrote:
>> I used "candump can0 -l" on the EG20T host to capture the traffic, and
>> then connected an CAN USB analyzer to the network and used that to
>> capture the traffic. One thing sticks out. This is the log from the
>> CAN USB analyzer:
> 
> Who generates these CAN messages?
> 

The invalid frames in the logs are being sent from the the EG20T host, 
but some of them have also originated from the other nodes in the network.

>>> ...
>>> 505.7052;RX;0x464;3;0x01;0x01;0x00;;;;;;
>>> 505.7052;RX;0x464;3;0x00;0x00;0x00;;;;;;
>>> 505.7063;RX;0x65;64;;;;;;;;;
> 
> As Oliver pointed out, this doesn't look like a valid CAN frame. Is the
> analyzer and/or sender configured for CAN-FD?
> 

No, none of the nodes in the network are sending CAN-FD frames, they are 
all normal CAN frames, max 8 bytes.

>>> 505.7662;RX;0x440;3;0x32;0x20;0xFA;;;;;;
>>> 505.7912;RX;0x44C;3;0x35;0x20;0xFA;;;;;;
>>> 505.9632;RX;0x464;3;0x00;0x00;0x00;;;;;;
>>> 505.9632;RX;0x464;3;0x01;0x01;0x00;;;;;;
>>> 505.9752;RX;0x468;3;0x51;0x20;0xFA;;;;;;
>>> 506.0362;RX;0x440;3;0x32;0x20;0xFA;;;;;;
>>> 506.0622;RX;0x44C;3;0x35;0x20;0xFA;;;;;;
>>> 506.2112;RX;0x464;3;0x00;0x00;0x00;;;;;;
>>> 506.2112;RX;0x464;3;0x00;0x00;0x00;;;;;;
>>> 506.2462;RX;0x468;3;0x51;0x20;0xFA;;;;;;
>>> 506.3072;RX;0x440;3;0x32;0x20;0xFA;;;;;;
>>> 506.3322;RX;0x44C;3;0x35;0x20;0xFA;;;;;;
>>> 506.4572;RX;0x464;3;0x00;0x00;0x00;;;;;;
>>> 506.4580;RX;0x464;3;0x00;0x00;0x00;;;;;;
>>> 506.5162;RX;0x468;3;0x51;0x20;0xFA;;;;;;
>>> 522.7203;RX;0x1E;1;0xFF;;;;;;;;
>>> ...
>>
>> Note the third message from the top. This is what "candump" on the host
>> logs:
>>
>>> ...
>>> (1662022485.638794) can0 464#010100
>>> (1662022485.638940) can0 464#000000
>>> (1662022485.699405) can0 440#3220FA
>>> (1662022485.725166) can0 44C#3520FA
>>> (1662022485.896858) can0 464#000000
>>> (1662022485.897382) can0 464#010100
>>> (1662022485.909042) can0 468#5120FA
>>> (1662022485.970036) can0 440#3220FA
>>> (1662022485.995596) can0 44C#3520FA
>>> (1662022486.144685) can0 464#000000
>>> (1662022486.144768) can0 464#000000
>>> (1662022486.179595) can0 468#5120FA
>>> (1662022486.240561) can0 440#3220FA
>>> (1662022486.266274) can0 44C#3520FA
>>> (1662022486.391248) can0 464#000000
>>> (1662022486.391469) can0 464#000000
>>> (1662022486.450115) can0 468#5120FA
>>> (1662022502.662035) can0 01E#FF
>>> ...
>>
>> It fails to see the 3rd message from the previous log. What would that
>> indicate ? The CAN analyzer sees the message, but the EG20T doesn't.
> 
> Is this error somehow related to the "can0: can_put_echo_skb: BUG!
> echo_skb 0 is occupied"?
> 

Possibly.

What I do know is that if I revert commit:

"can: c_can: cache frames to operate as a true FIFO"
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f

then everything looks good. I don't get any BUG messages, and the host 
has been running overnight without problems, so it seems to have fixed 
the network interface lockup as well.

Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-16  4:14                 ` Jacob Kroon
@ 2022-09-19 23:24                   ` Jacob Kroon
  2022-09-20  1:23                     ` Vincent Mailhol
  2022-09-21  7:25                     ` dariobin
  2022-09-23 11:36                   ` Marc Kleine-Budde
  1 sibling, 2 replies; 36+ messages in thread
From: Jacob Kroon @ 2022-09-19 23:24 UTC (permalink / raw)
  To: Marc Kleine-Budde, dariobin; +Cc: Oliver Hartkopp, linux-can, wg

Hi Marc and Dario,

On 9/16/22 06:14, Jacob Kroon wrote:
...> What I do know is that if I revert commit:
> 
> "can: c_can: cache frames to operate as a true FIFO"
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
> 
> then everything looks good. I don't get any BUG messages, and the host 
> has been running overnight without problems, so it seems to have fixed 
> the network interface lockup as well.

I ran the kernel *with* the commit above, and also with the following patch:

> diff --git a/drivers/net/can/c_can/c_can_main.c b/drivers/net/can/c_can/c_can_main.c
> index 52671d1ea17d..4375dc70e21f 100644
> --- a/drivers/net/can/c_can/c_can_main.c
> +++ b/drivers/net/can/c_can/c_can_main.c
> @@ -1,3 +1,4 @@
> +#define DEBUG
>  /*
>   * CAN bus driver for Bosch C_CAN controller
>   *
> @@ -469,8 +470,15 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff *skb,
>  	if (c_can_get_tx_free(tx_ring) == 0)
>  		netif_stop_queue(dev);
>  
> -	if (idx < c_can_get_tx_tail(tx_ring))
> +	netdev_dbg(dev, "JAKR:%d:%d:%d:%d\n", idx,
> +	                                      c_can_get_tx_head(tx_ring),
> +	                                      c_can_get_tx_tail(tx_ring),
> +	                                      c_can_get_tx_free(tx_ring));
> +
> +	if (idx < c_can_get_tx_tail(tx_ring)) {
>  		cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> +		netdev_dbg(dev, "JAKR:Caching messages\n");
> +	}
>  
>  	/* Store the message in the interface so we can call
>  	 * can_put_echo_skb(). We must do this before we enable

and I've uploaded the entire log I could capture from /dev/kmsg, right 
up to the hang, here:

https://pastebin.com/6hvAcPc9

What looks odd to me right from the start is that sometimes when idx 
rolls over to 0, and *only* when it rolls over to 0, the CAN frame gets 
cached because "idx < c_can_get_tx_tail(tx_ring)".

Is it possible there is some difference between c_can and d_can in how 
the HW buffers are working, which breaks the driver on my particular HW 
setup ?

Regards,
Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-19 23:24                   ` Jacob Kroon
@ 2022-09-20  1:23                     ` Vincent Mailhol
  2022-09-20  5:08                       ` Jacob Kroon
  2022-09-21  7:25                     ` dariobin
  1 sibling, 1 reply; 36+ messages in thread
From: Vincent Mailhol @ 2022-09-20  1:23 UTC (permalink / raw)
  To: Jacob Kroon; +Cc: Marc Kleine-Budde, dariobin, Oliver Hartkopp, linux-can, wg

On Tue. 20 Sep. 2022 at 08:43, Jacob Kroon <jacob.kroon@gmail.com> wrote:
> Hi Marc and Dario,
>
> On 9/16/22 06:14, Jacob Kroon wrote:
> ...> What I do know is that if I revert commit:
> >
> > "can: c_can: cache frames to operate as a true FIFO"
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
> >
> > then everything looks good. I don't get any BUG messages, and the host
> > has been running overnight without problems, so it seems to have fixed
> > the network interface lockup as well.
>
> I ran the kernel *with* the commit above, and also with the following patch:
>
> > diff --git a/drivers/net/can/c_can/c_can_main.c b/drivers/net/can/c_can/c_can_main.c
> > index 52671d1ea17d..4375dc70e21f 100644
> > --- a/drivers/net/can/c_can/c_can_main.c
> > +++ b/drivers/net/can/c_can/c_can_main.c
> > @@ -1,3 +1,4 @@
> > +#define DEBUG
> >  /*
> >   * CAN bus driver for Bosch C_CAN controller
> >   *
> > @@ -469,8 +470,15 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff *skb,
> >       if (c_can_get_tx_free(tx_ring) == 0)
> >               netif_stop_queue(dev);
> >
> > -     if (idx < c_can_get_tx_tail(tx_ring))
> > +     netdev_dbg(dev, "JAKR:%d:%d:%d:%d\n", idx,
> > +                                           c_can_get_tx_head(tx_ring),
> > +                                           c_can_get_tx_tail(tx_ring),
> > +                                           c_can_get_tx_free(tx_ring));
> > +
> > +     if (idx < c_can_get_tx_tail(tx_ring)) {
> >               cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> > +             netdev_dbg(dev, "JAKR:Caching messages\n");
> > +     }
> >
> >       /* Store the message in the interface so we can call
> >        * can_put_echo_skb(). We must do this before we enable
>
> and I've uploaded the entire log I could capture from /dev/kmsg, right
> up to the hang, here:
>
> https://pastebin.com/6hvAcPc9
>
> What looks odd to me right from the start is that sometimes when idx
> rolls over to 0, and *only* when it rolls over to 0, the CAN frame gets
> cached because "idx < c_can_get_tx_tail(tx_ring)".

This looks like a classic integer wraparound issue. A cast to s32
should do the trick:
  (s32)idx - c_can_get_tx_tail(tx_ring) < 0

This is not tested. I let you double check. You can refer to this
function for an example:
https://elixir.bootlin.com/linux/v5.19/source/drivers/net/can/usb/etas_es58x/es58x_core.c#L336

Yours sincerely,
Vincent Mailhol

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-20  1:23                     ` Vincent Mailhol
@ 2022-09-20  5:08                       ` Jacob Kroon
  0 siblings, 0 replies; 36+ messages in thread
From: Jacob Kroon @ 2022-09-20  5:08 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: Marc Kleine-Budde, dariobin, Oliver Hartkopp, linux-can, wg

Hi Vincent,

On 9/20/22 03:23, Vincent Mailhol wrote:
> On Tue. 20 Sep. 2022 at 08:43, Jacob Kroon <jacob.kroon@gmail.com> wrote:
>> Hi Marc and Dario,
>>
>> On 9/16/22 06:14, Jacob Kroon wrote:
>> ...> What I do know is that if I revert commit:
>>>
>>> "can: c_can: cache frames to operate as a true FIFO"
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
>>>
>>> then everything looks good. I don't get any BUG messages, and the host
>>> has been running overnight without problems, so it seems to have fixed
>>> the network interface lockup as well.
>>
>> I ran the kernel *with* the commit above, and also with the following patch:
>>
>>> diff --git a/drivers/net/can/c_can/c_can_main.c b/drivers/net/can/c_can/c_can_main.c
>>> index 52671d1ea17d..4375dc70e21f 100644
>>> --- a/drivers/net/can/c_can/c_can_main.c
>>> +++ b/drivers/net/can/c_can/c_can_main.c
>>> @@ -1,3 +1,4 @@
>>> +#define DEBUG
>>>   /*
>>>    * CAN bus driver for Bosch C_CAN controller
>>>    *
>>> @@ -469,8 +470,15 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff *skb,
>>>        if (c_can_get_tx_free(tx_ring) == 0)
>>>                netif_stop_queue(dev);
>>>
>>> -     if (idx < c_can_get_tx_tail(tx_ring))
>>> +     netdev_dbg(dev, "JAKR:%d:%d:%d:%d\n", idx,
>>> +                                           c_can_get_tx_head(tx_ring),
>>> +                                           c_can_get_tx_tail(tx_ring),
>>> +                                           c_can_get_tx_free(tx_ring));
>>> +
>>> +     if (idx < c_can_get_tx_tail(tx_ring)) {
>>>                cmd &= ~IF_COMM_TXRQST; /* Cache the message */
>>> +             netdev_dbg(dev, "JAKR:Caching messages\n");
>>> +     }
>>>
>>>        /* Store the message in the interface so we can call
>>>         * can_put_echo_skb(). We must do this before we enable
>>
>> and I've uploaded the entire log I could capture from /dev/kmsg, right
>> up to the hang, here:
>>
>> https://pastebin.com/6hvAcPc9
>>
>> What looks odd to me right from the start is that sometimes when idx
>> rolls over to 0, and *only* when it rolls over to 0, the CAN frame gets
>> cached because "idx < c_can_get_tx_tail(tx_ring)".
> 
> This looks like a classic integer wraparound issue. A cast to s32
> should do the trick:
>    (s32)idx - c_can_get_tx_tail(tx_ring) < 0
> 
> This is not tested. I let you double check. You can refer to this
> function for an example:
> https://elixir.bootlin.com/linux/v5.19/source/drivers/net/can/usb/etas_es58x/es58x_core.c#L336
> 

I tried that change but there is no change in behaviour, and I don't see 
why it would make a difference since both expressions "idx" and 
"c_can_get_tx_tail(tx_ring)" are unsigned expressions to begin with.

Regards,
Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-19 23:24                   ` Jacob Kroon
  2022-09-20  1:23                     ` Vincent Mailhol
@ 2022-09-21  7:25                     ` dariobin
  2022-09-21  7:47                       ` Marc Kleine-Budde
  1 sibling, 1 reply; 36+ messages in thread
From: dariobin @ 2022-09-21  7:25 UTC (permalink / raw)
  To: Jacob Kroon, Marc Kleine-Budde; +Cc: Oliver Hartkopp, linux-can, wg

Hi Jacob,

> Il 20/09/2022 01:24 Jacob Kroon <jacob.kroon@gmail.com> ha scritto:
> 
>  
> Hi Marc and Dario,
> 
> On 9/16/22 06:14, Jacob Kroon wrote:
> ...> What I do know is that if I revert commit:
> > 
> > "can: c_can: cache frames to operate as a true FIFO"
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
> > 
> > then everything looks good. I don't get any BUG messages, and the host 
> > has been running overnight without problems, so it seems to have fixed 
> > the network interface lockup as well.

Here's what I think:
If one or more messages are cached, the controller has to transmit more frames 
in the unit of time when they can be transmitted (IF_COMM_TXRQST), different from
when the transmission occurs directly on request from the user space. In the case 
of cached data transmission I therefore think that the controller is more heavily
loaded. Can this shift the balance ?

> 
> I ran the kernel *with* the commit above, and also with the following patch:
> 
> > diff --git a/drivers/net/can/c_can/c_can_main.c b/drivers/net/can/c_can/c_can_main.c
> > index 52671d1ea17d..4375dc70e21f 100644
> > --- a/drivers/net/can/c_can/c_can_main.c
> > +++ b/drivers/net/can/c_can/c_can_main.c
> > @@ -1,3 +1,4 @@
> > +#define DEBUG
> >  /*
> >   * CAN bus driver for Bosch C_CAN controller
> >   *
> > @@ -469,8 +470,15 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff *skb,
> >  	if (c_can_get_tx_free(tx_ring) == 0)
> >  		netif_stop_queue(dev);
> >  
> > -	if (idx < c_can_get_tx_tail(tx_ring))
> > +	netdev_dbg(dev, "JAKR:%d:%d:%d:%d\n", idx,
> > +	                                      c_can_get_tx_head(tx_ring),
> > +	                                      c_can_get_tx_tail(tx_ring),
> > +	                                      c_can_get_tx_free(tx_ring));
> > +
> > +	if (idx < c_can_get_tx_tail(tx_ring)) {
> >  		cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> > +		netdev_dbg(dev, "JAKR:Caching messages\n");
> > +	}
> >  
> >  	/* Store the message in the interface so we can call
> >  	 * can_put_echo_skb(). We must do this before we enable
> 
> and I've uploaded the entire log I could capture from /dev/kmsg, right 
> up to the hang, here:
> 
> https://pastebin.com/6hvAcPc9
> 
> What looks odd to me right from the start is that sometimes when idx 
> rolls over to 0, and *only* when it rolls over to 0, the CAN frame gets 
> cached because "idx < c_can_get_tx_tail(tx_ring)".

If the message were not stored but transmitted, the order of transmission 
would not be respected.

> 
> Is it possible there is some difference between c_can and d_can in how 
> the HW buffers are working, which breaks the driver on my particular HW 
> setup ?
> 

I tested the patch on a beaglebone board without encountering any problems.
There is also a version of the driver I submitted to Xenomai running on a custom
board without problems. But surely the setup and context is different from yours.

What compatible are you using in your device tree?
I used "ti,am3352-d_can".

Thanks and regards,
Dario

> Regards,
> Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-21  7:25                     ` dariobin
@ 2022-09-21  7:47                       ` Marc Kleine-Budde
  2022-09-21  8:26                         ` Jacob Kroon
                                           ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Marc Kleine-Budde @ 2022-09-21  7:47 UTC (permalink / raw)
  To: dariobin; +Cc: Jacob Kroon, Oliver Hartkopp, linux-can, wg

[-- Attachment #1: Type: text/plain, Size: 3890 bytes --]

On 21.09.2022 09:25:41, dariobin@libero.it wrote:
> > On 9/16/22 06:14, Jacob Kroon wrote:
> > ...> What I do know is that if I revert commit:
> > > 
> > > "can: c_can: cache frames to operate as a true FIFO"
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
> > > 
> > > then everything looks good. I don't get any BUG messages, and the host 
> > > has been running overnight without problems, so it seems to have fixed 
> > > the network interface lockup as well.
> 
> Here's what I think:
> If one or more messages are cached, the controller has to transmit more frames 
> in the unit of time when they can be transmitted (IF_COMM_TXRQST), different from
> when the transmission occurs directly on request from the user space. In the case 
> of cached data transmission I therefore think that the controller is more heavily
> loaded. Can this shift the balance ?
> 
> > 
> > I ran the kernel *with* the commit above, and also with the following patch:
> > 
> > > diff --git a/drivers/net/can/c_can/c_can_main.c b/drivers/net/can/c_can/c_can_main.c
> > > index 52671d1ea17d..4375dc70e21f 100644
> > > --- a/drivers/net/can/c_can/c_can_main.c
> > > +++ b/drivers/net/can/c_can/c_can_main.c
> > > @@ -1,3 +1,4 @@
> > > +#define DEBUG
> > >  /*
> > >   * CAN bus driver for Bosch C_CAN controller
> > >   *
> > > @@ -469,8 +470,15 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff *skb,
> > >  	if (c_can_get_tx_free(tx_ring) == 0)
> > >  		netif_stop_queue(dev);
> > >  
> > > -	if (idx < c_can_get_tx_tail(tx_ring))
> > > +	netdev_dbg(dev, "JAKR:%d:%d:%d:%d\n", idx,
> > > +	                                      c_can_get_tx_head(tx_ring),
> > > +	                                      c_can_get_tx_tail(tx_ring),
> > > +	                                      c_can_get_tx_free(tx_ring));
> > > +
> > > +	if (idx < c_can_get_tx_tail(tx_ring)) {
> > >  		cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> > > +		netdev_dbg(dev, "JAKR:Caching messages\n");
> > > +	}
> > >  
> > >  	/* Store the message in the interface so we can call
> > >  	 * can_put_echo_skb(). We must do this before we enable
> > 
> > and I've uploaded the entire log I could capture from /dev/kmsg, right 
> > up to the hang, here:
> > 
> > https://pastebin.com/6hvAcPc9
> > 
> > What looks odd to me right from the start is that sometimes when idx 
> > rolls over to 0, and *only* when it rolls over to 0, the CAN frame gets 
> > cached because "idx < c_can_get_tx_tail(tx_ring)".
> 
> If the message were not stored but transmitted, the order of transmission 
> would not be respected.
> 
> > 
> > Is it possible there is some difference between c_can and d_can in how 
> > the HW buffers are working, which breaks the driver on my particular HW 
> > setup ?
> > 
> 
> I tested the patch on a beaglebone board without encountering any problems.
> There is also a version of the driver I submitted to Xenomai running on a custom
> board without problems. But surely the setup and context is different from yours.
> 
> What compatible are you using in your device tree?
> I used "ti,am3352-d_can".

I think Jacob's board has a c_can core, while the beagle bone uses a
d_can. Maybe there's a subtle difference between these cores?

Dario, do you have access to a real c_can core to test?

As reverting 387da6bc7a82 ("can: c_can: cache frames to operate as a
true FIFO") helps to fix Jacob's problem, a temporary solution might be
to only cache frames on d_can cores.

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-21  7:47                       ` Marc Kleine-Budde
@ 2022-09-21  8:26                         ` Jacob Kroon
  2022-09-21  9:55                         ` Oliver Hartkopp
  2022-09-22  7:20                         ` dariobin
  2 siblings, 0 replies; 36+ messages in thread
From: Jacob Kroon @ 2022-09-21  8:26 UTC (permalink / raw)
  To: Marc Kleine-Budde, dariobin; +Cc: Oliver Hartkopp, linux-can, wg

Hi Marc and Dario,

On 9/21/22 09:47, Marc Kleine-Budde wrote:
> On 21.09.2022 09:25:41, dariobin@libero.it wrote:
>>> On 9/16/22 06:14, Jacob Kroon wrote:
>>> ...> What I do know is that if I revert commit:
>>>>
>>>> "can: c_can: cache frames to operate as a true FIFO"
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
>>>>
>>>> then everything looks good. I don't get any BUG messages, and the host
>>>> has been running overnight without problems, so it seems to have fixed
>>>> the network interface lockup as well.
>>
>> Here's what I think:
>> If one or more messages are cached, the controller has to transmit more frames
>> in the unit of time when they can be transmitted (IF_COMM_TXRQST), different from
>> when the transmission occurs directly on request from the user space. In the case
>> of cached data transmission I therefore think that the controller is more heavily
>> loaded. Can this shift the balance ?
>>
>>>
>>> I ran the kernel *with* the commit above, and also with the following patch:
>>>
>>>> diff --git a/drivers/net/can/c_can/c_can_main.c b/drivers/net/can/c_can/c_can_main.c
>>>> index 52671d1ea17d..4375dc70e21f 100644
>>>> --- a/drivers/net/can/c_can/c_can_main.c
>>>> +++ b/drivers/net/can/c_can/c_can_main.c
>>>> @@ -1,3 +1,4 @@
>>>> +#define DEBUG
>>>>   /*
>>>>    * CAN bus driver for Bosch C_CAN controller
>>>>    *
>>>> @@ -469,8 +470,15 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff *skb,
>>>>   	if (c_can_get_tx_free(tx_ring) == 0)
>>>>   		netif_stop_queue(dev);
>>>>   
>>>> -	if (idx < c_can_get_tx_tail(tx_ring))
>>>> +	netdev_dbg(dev, "JAKR:%d:%d:%d:%d\n", idx,
>>>> +	                                      c_can_get_tx_head(tx_ring),
>>>> +	                                      c_can_get_tx_tail(tx_ring),
>>>> +	                                      c_can_get_tx_free(tx_ring));
>>>> +
>>>> +	if (idx < c_can_get_tx_tail(tx_ring)) {
>>>>   		cmd &= ~IF_COMM_TXRQST; /* Cache the message */
>>>> +		netdev_dbg(dev, "JAKR:Caching messages\n");
>>>> +	}
>>>>   
>>>>   	/* Store the message in the interface so we can call
>>>>   	 * can_put_echo_skb(). We must do this before we enable
>>>
>>> and I've uploaded the entire log I could capture from /dev/kmsg, right
>>> up to the hang, here:
>>>
>>> https://pastebin.com/6hvAcPc9
>>>
>>> What looks odd to me right from the start is that sometimes when idx
>>> rolls over to 0, and *only* when it rolls over to 0, the CAN frame gets
>>> cached because "idx < c_can_get_tx_tail(tx_ring)".
>>
>> If the message were not stored but transmitted, the order of transmission
>> would not be respected.
>>
>>>
>>> Is it possible there is some difference between c_can and d_can in how
>>> the HW buffers are working, which breaks the driver on my particular HW
>>> setup ?
>>>
>>
>> I tested the patch on a beaglebone board without encountering any problems.
>> There is also a version of the driver I submitted to Xenomai running on a custom
>> board without problems. But surely the setup and context is different from yours.
>>
>> What compatible are you using in your device tree?
>> I used "ti,am3352-d_can".
> 
> I think Jacob's board has a c_can core, while the beagle bone uses a
> d_can. Maybe there's a subtle difference between these cores?
> 
> Dario, do you have access to a real c_can core to test?
> 
> As reverting 387da6bc7a82 ("can: c_can: cache frames to operate as a
> true FIFO") helps to fix Jacob's problem, a temporary solution might be
> to only cache frames on d_can cores.
> 

Yes, mine is c_can, my hardware is

https://www.compulab.com/products/computer-on-modules/cm-itc/

which has a Intel Platform Controller Hub (PCH) EG20T that includes the 
CAN bus. A quick Google list the datasheet here:

https://www.versalogic.com/wp-content/themes/vsl-new/assets/resources/support/pdf/Intel_PCH_EG20T_Datasheet.pdf

Regards,
Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-21  7:47                       ` Marc Kleine-Budde
  2022-09-21  8:26                         ` Jacob Kroon
@ 2022-09-21  9:55                         ` Oliver Hartkopp
  2022-09-21 10:32                           ` Marc Kleine-Budde
  2022-09-22  7:20                         ` dariobin
  2 siblings, 1 reply; 36+ messages in thread
From: Oliver Hartkopp @ 2022-09-21  9:55 UTC (permalink / raw)
  To: Marc Kleine-Budde, dariobin; +Cc: Jacob Kroon, linux-can, wg



On 21.09.22 09:47, Marc Kleine-Budde wrote:
> On 21.09.2022 09:25:41, dariobin@libero.it wrote:
>>> On 9/16/22 06:14, Jacob Kroon wrote:
>>> ...> What I do know is that if I revert commit:
>>>>
>>>> "can: c_can: cache frames to operate as a true FIFO"
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
>>>>
>>>> then everything looks good. I don't get any BUG messages, and the host
>>>> has been running overnight without problems, so it seems to have fixed
>>>> the network interface lockup as well.
>>
>> Here's what I think:
>> If one or more messages are cached, the controller has to transmit more frames
>> in the unit of time when they can be transmitted (IF_COMM_TXRQST), different from
>> when the transmission occurs directly on request from the user space. In the case
>> of cached data transmission I therefore think that the controller is more heavily
>> loaded. Can this shift the balance ?
>>
>>>
>>> I ran the kernel *with* the commit above, and also with the following patch:
>>>
>>>> diff --git a/drivers/net/can/c_can/c_can_main.c b/drivers/net/can/c_can/c_can_main.c
>>>> index 52671d1ea17d..4375dc70e21f 100644
>>>> --- a/drivers/net/can/c_can/c_can_main.c
>>>> +++ b/drivers/net/can/c_can/c_can_main.c
>>>> @@ -1,3 +1,4 @@
>>>> +#define DEBUG
>>>>   /*
>>>>    * CAN bus driver for Bosch C_CAN controller
>>>>    *
>>>> @@ -469,8 +470,15 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff *skb,
>>>>   	if (c_can_get_tx_free(tx_ring) == 0)
>>>>   		netif_stop_queue(dev);
>>>>   
>>>> -	if (idx < c_can_get_tx_tail(tx_ring))
>>>> +	netdev_dbg(dev, "JAKR:%d:%d:%d:%d\n", idx,
>>>> +	                                      c_can_get_tx_head(tx_ring),
>>>> +	                                      c_can_get_tx_tail(tx_ring),
>>>> +	                                      c_can_get_tx_free(tx_ring));
>>>> +
>>>> +	if (idx < c_can_get_tx_tail(tx_ring)) {
>>>>   		cmd &= ~IF_COMM_TXRQST; /* Cache the message */
>>>> +		netdev_dbg(dev, "JAKR:Caching messages\n");
>>>> +	}
>>>>   
>>>>   	/* Store the message in the interface so we can call
>>>>   	 * can_put_echo_skb(). We must do this before we enable
>>>
>>> and I've uploaded the entire log I could capture from /dev/kmsg, right
>>> up to the hang, here:
>>>
>>> https://pastebin.com/6hvAcPc9
>>>
>>> What looks odd to me right from the start is that sometimes when idx
>>> rolls over to 0, and *only* when it rolls over to 0, the CAN frame gets
>>> cached because "idx < c_can_get_tx_tail(tx_ring)".
>>
>> If the message were not stored but transmitted, the order of transmission
>> would not be respected.
>>
>>>
>>> Is it possible there is some difference between c_can and d_can in how
>>> the HW buffers are working, which breaks the driver on my particular HW
>>> setup ?
>>>
>>
>> I tested the patch on a beaglebone board without encountering any problems.
>> There is also a version of the driver I submitted to Xenomai running on a custom
>> board without problems. But surely the setup and context is different from yours.
>>
>> What compatible are you using in your device tree?
>> I used "ti,am3352-d_can".
> 
> I think Jacob's board has a c_can core, while the beagle bone uses a
> d_can. Maybe there's a subtle difference between these cores?
> 
> Dario, do you have access to a real c_can core to test?
> 
> As reverting 387da6bc7a82 ("can: c_can: cache frames to operate as a
> true FIFO") helps to fix Jacob's problem, a temporary solution might be
> to only cache frames on d_can cores.

Btw. I uploaded the 'latest' C_CAN manuals on

https://github.com/linux-can/can-doc

... as it could only be found on archive.org :-/

Unfortunately I was not able to find any (latest?) D_CAN manual anymore, 
which was originally hosted at 
http://www.semiconductors.bosch.de/media/en/pdf/ipmodules_1/can/d_can_users_manual_111.pdf

Archive.org did not crawl this PDF ;-(

If someone still has this D_CAN PDF please send a URL or the PDF itself 
to me, so that I can put it there too.

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-21  9:55                         ` Oliver Hartkopp
@ 2022-09-21 10:32                           ` Marc Kleine-Budde
  2022-09-21 10:39                             ` Oliver Hartkopp
  0 siblings, 1 reply; 36+ messages in thread
From: Marc Kleine-Budde @ 2022-09-21 10:32 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: dariobin, Jacob Kroon, linux-can, wg

[-- Attachment #1: Type: text/plain, Size: 994 bytes --]

On 21.09.2022 11:55:59, Oliver Hartkopp wrote:
> Btw. I uploaded the 'latest' C_CAN manuals on
> 
> https://github.com/linux-can/can-doc
> 
> ... as it could only be found on archive.org :-/
> 
> Unfortunately I was not able to find any (latest?) D_CAN manual anymore,
> which was originally hosted at http://www.semiconductors.bosch.de/media/en/pdf/ipmodules_1/can/d_can_users_manual_111.pdf
> 
> Archive.org did not crawl this PDF ;-(
> 
> If someone still has this D_CAN PDF please send a URL or the PDF itself to
> me, so that I can put it there too.

I've found some old stuff on https://www.yumpu.com/user/bosch.semiconductors.de

You've got PR:
https://github.com/linux-can/can-doc/pull/1

Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-21 10:32                           ` Marc Kleine-Budde
@ 2022-09-21 10:39                             ` Oliver Hartkopp
  2022-09-21 10:53                               ` Marc Kleine-Budde
  0 siblings, 1 reply; 36+ messages in thread
From: Oliver Hartkopp @ 2022-09-21 10:39 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: dariobin, Jacob Kroon, linux-can, wg



On 21.09.22 12:32, Marc Kleine-Budde wrote:
> On 21.09.2022 11:55:59, Oliver Hartkopp wrote:
>> Btw. I uploaded the 'latest' C_CAN manuals on
>>
>> https://github.com/linux-can/can-doc
>>
>> ... as it could only be found on archive.org :-/
>>
>> Unfortunately I was not able to find any (latest?) D_CAN manual anymore,
>> which was originally hosted at http://www.semiconductors.bosch.de/media/en/pdf/ipmodules_1/can/d_can_users_manual_111.pdf
>>
>> Archive.org did not crawl this PDF ;-(
>>
>> If someone still has this D_CAN PDF please send a URL or the PDF itself to
>> me, so that I can put it there too.
> 
> I've found some old stuff on https://www.yumpu.com/user/bosch.semiconductors.de
> 
> You've got PR:
> https://github.com/linux-can/can-doc/pull/1

Pulled.

Excellent contribution! Thanks Marc!

Best regards,
Oliver

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-21 10:39                             ` Oliver Hartkopp
@ 2022-09-21 10:53                               ` Marc Kleine-Budde
  2022-09-21 11:00                                 ` Oliver Hartkopp
  0 siblings, 1 reply; 36+ messages in thread
From: Marc Kleine-Budde @ 2022-09-21 10:53 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: dariobin, Jacob Kroon, linux-can, wg

[-- Attachment #1: Type: text/plain, Size: 1354 bytes --]

On 21.09.2022 12:39:06, Oliver Hartkopp wrote:
> 
> 
> On 21.09.22 12:32, Marc Kleine-Budde wrote:
> > On 21.09.2022 11:55:59, Oliver Hartkopp wrote:
> > > Btw. I uploaded the 'latest' C_CAN manuals on
> > > 
> > > https://github.com/linux-can/can-doc
> > > 
> > > ... as it could only be found on archive.org :-/
> > > 
> > > Unfortunately I was not able to find any (latest?) D_CAN manual anymore,
> > > which was originally hosted at http://www.semiconductors.bosch.de/media/en/pdf/ipmodules_1/can/d_can_users_manual_111.pdf
> > > 
> > > Archive.org did not crawl this PDF ;-(
> > > 
> > > If someone still has this D_CAN PDF please send a URL or the PDF itself to
> > > me, so that I can put it there too.
> > 
> > I've found some old stuff on https://www.yumpu.com/user/bosch.semiconductors.de
> > 
> > You've got PR:
> > https://github.com/linux-can/can-doc/pull/1
> 
> Pulled.
> 
> Excellent contribution! Thanks Marc!

Doh! I accidentally committed the readme of your M_CAN history repo.
I'll force push the master to get rid of it. OK?

Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-21 10:53                               ` Marc Kleine-Budde
@ 2022-09-21 11:00                                 ` Oliver Hartkopp
  0 siblings, 0 replies; 36+ messages in thread
From: Oliver Hartkopp @ 2022-09-21 11:00 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: dariobin, Jacob Kroon, linux-can, wg



On 21.09.22 12:53, Marc Kleine-Budde wrote:
> On 21.09.2022 12:39:06, Oliver Hartkopp wrote:
>>
>>
>> On 21.09.22 12:32, Marc Kleine-Budde wrote:
>>> On 21.09.2022 11:55:59, Oliver Hartkopp wrote:
>>>> Btw. I uploaded the 'latest' C_CAN manuals on
>>>>
>>>> https://github.com/linux-can/can-doc
>>>>
>>>> ... as it could only be found on archive.org :-/
>>>>
>>>> Unfortunately I was not able to find any (latest?) D_CAN manual anymore,
>>>> which was originally hosted at http://www.semiconductors.bosch.de/media/en/pdf/ipmodules_1/can/d_can_users_manual_111.pdf
>>>>
>>>> Archive.org did not crawl this PDF ;-(
>>>>
>>>> If someone still has this D_CAN PDF please send a URL or the PDF itself to
>>>> me, so that I can put it there too.
>>>
>>> I've found some old stuff on https://www.yumpu.com/user/bosch.semiconductors.de
>>>
>>> You've got PR:
>>> https://github.com/linux-can/can-doc/pull/1
>>
>> Pulled.
>>
>> Excellent contribution! Thanks Marc!
> 
> Doh! I accidentally committed the readme of your M_CAN history repo.
> I'll force push the master to get rid of it. OK?

NP. Just go for it ;-)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-21  7:47                       ` Marc Kleine-Budde
  2022-09-21  8:26                         ` Jacob Kroon
  2022-09-21  9:55                         ` Oliver Hartkopp
@ 2022-09-22  7:20                         ` dariobin
  2 siblings, 0 replies; 36+ messages in thread
From: dariobin @ 2022-09-22  7:20 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: Jacob Kroon, Oliver Hartkopp, linux-can, wg

Hi Marc,

> Il 21/09/2022 09:47 Marc Kleine-Budde <mkl@pengutronix.de> ha scritto:
> 
>  
> On 21.09.2022 09:25:41, dariobin@libero.it wrote:
> > > On 9/16/22 06:14, Jacob Kroon wrote:
> > > ...> What I do know is that if I revert commit:
> > > > 
> > > > "can: c_can: cache frames to operate as a true FIFO"
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
> > > > 
> > > > then everything looks good. I don't get any BUG messages, and the host 
> > > > has been running overnight without problems, so it seems to have fixed 
> > > > the network interface lockup as well.
> > 
> > Here's what I think:
> > If one or more messages are cached, the controller has to transmit more frames 
> > in the unit of time when they can be transmitted (IF_COMM_TXRQST), different from
> > when the transmission occurs directly on request from the user space. In the case 
> > of cached data transmission I therefore think that the controller is more heavily
> > loaded. Can this shift the balance ?
> > 
> > > 
> > > I ran the kernel *with* the commit above, and also with the following patch:
> > > 
> > > > diff --git a/drivers/net/can/c_can/c_can_main.c b/drivers/net/can/c_can/c_can_main.c
> > > > index 52671d1ea17d..4375dc70e21f 100644
> > > > --- a/drivers/net/can/c_can/c_can_main.c
> > > > +++ b/drivers/net/can/c_can/c_can_main.c
> > > > @@ -1,3 +1,4 @@
> > > > +#define DEBUG
> > > >  /*
> > > >   * CAN bus driver for Bosch C_CAN controller
> > > >   *
> > > > @@ -469,8 +470,15 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff *skb,
> > > >  	if (c_can_get_tx_free(tx_ring) == 0)
> > > >  		netif_stop_queue(dev);
> > > >  
> > > > -	if (idx < c_can_get_tx_tail(tx_ring))
> > > > +	netdev_dbg(dev, "JAKR:%d:%d:%d:%d\n", idx,
> > > > +	                                      c_can_get_tx_head(tx_ring),
> > > > +	                                      c_can_get_tx_tail(tx_ring),
> > > > +	                                      c_can_get_tx_free(tx_ring));
> > > > +
> > > > +	if (idx < c_can_get_tx_tail(tx_ring)) {
> > > >  		cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> > > > +		netdev_dbg(dev, "JAKR:Caching messages\n");
> > > > +	}
> > > >  
> > > >  	/* Store the message in the interface so we can call
> > > >  	 * can_put_echo_skb(). We must do this before we enable
> > > 
> > > and I've uploaded the entire log I could capture from /dev/kmsg, right 
> > > up to the hang, here:
> > > 
> > > https://pastebin.com/6hvAcPc9
> > > 
> > > What looks odd to me right from the start is that sometimes when idx 
> > > rolls over to 0, and *only* when it rolls over to 0, the CAN frame gets 
> > > cached because "idx < c_can_get_tx_tail(tx_ring)".
> > 
> > If the message were not stored but transmitted, the order of transmission 
> > would not be respected.
> > 
> > > 
> > > Is it possible there is some difference between c_can and d_can in how 
> > > the HW buffers are working, which breaks the driver on my particular HW 
> > > setup ?
> > > 
> > 
> > I tested the patch on a beaglebone board without encountering any problems.
> > There is also a version of the driver I submitted to Xenomai running on a custom
> > board without problems. But surely the setup and context is different from yours.
> > 
> > What compatible are you using in your device tree?
> > I used "ti,am3352-d_can".
> 
> I think Jacob's board has a c_can core, while the beagle bone uses a
> d_can. Maybe there's a subtle difference between these cores?
> 
> Dario, do you have access to a real c_can core to test?
No, only d_can core.

> 
> As reverting 387da6bc7a82 ("can: c_can: cache frames to operate as a
> true FIFO") helps to fix Jacob's problem, a temporary solution might be
> to only cache frames on d_can cores.
OK

Thanks and regards,
Dario

> 
> regards,
> Marc
> 
> -- 
> Pengutronix e.K.                 | Marc Kleine-Budde           |
> Embedded Linux                   | https://www.pengutronix.de  |
> Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
> Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-16  4:14                 ` Jacob Kroon
  2022-09-19 23:24                   ` Jacob Kroon
@ 2022-09-23 11:36                   ` Marc Kleine-Budde
  2022-09-23 17:55                     ` dariobin
  1 sibling, 1 reply; 36+ messages in thread
From: Marc Kleine-Budde @ 2022-09-23 11:36 UTC (permalink / raw)
  To: Jacob Kroon; +Cc: Oliver Hartkopp, linux-can, wg, dariobin

[-- Attachment #1: Type: text/plain, Size: 968 bytes --]

On 16.09.2022 06:14:58, Jacob Kroon wrote:
> What I do know is that if I revert commit:
> 
> "can: c_can: cache frames to operate as a true FIFO"
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
> 
> then everything looks good. I don't get any BUG messages, and the host
> has been running overnight without problems, so it seems to have fixed
> the network interface lockup as well.

Jacob, after this mail, I'll send 2 patches. One tries to disable the
cache feature for C_CAN cores, the other shuts a potential race window.
Please test both patches, but only apply one of them at a time. :)

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-23 11:36                   ` Marc Kleine-Budde
@ 2022-09-23 17:55                     ` dariobin
  2022-09-23 19:03                       ` Jacob Kroon
  0 siblings, 1 reply; 36+ messages in thread
From: dariobin @ 2022-09-23 17:55 UTC (permalink / raw)
  To: Marc Kleine-Budde, Jacob Kroon; +Cc: Oliver Hartkopp, linux-can, wg

Hi Michael,

> Il 23/09/2022 13:36 Marc Kleine-Budde <mkl@pengutronix.de> ha scritto:
> 
>  
> On 16.09.2022 06:14:58, Jacob Kroon wrote:
> > What I do know is that if I revert commit:
> > 
> > "can: c_can: cache frames to operate as a true FIFO"
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
> > 
> > then everything looks good. I don't get any BUG messages, and the host
> > has been running overnight without problems, so it seems to have fixed
> > the network interface lockup as well.
> 
> Jacob, after this mail, I'll send 2 patches. One tries to disable the
> cache feature for C_CAN cores, the other shuts a potential race window.

About the "can: c_can: don't cache TX messages for C_CAN cores" patch:
Could it make sense to change the c_can_start_xmit in this way

-       if (idx < c_can_get_tx_tail(tx_ring))
-               cmd &= ~IF_COMM_TXRQST; /* Cache the message */
+       if (idx < c_can_get_tx_tail(tx_ring)) {
+               if (priv->type == BOSCH_D_CAN) {
+                       cmd &= ~IF_COMM_TXRQST; /* Cache the message */
+               } else {
+                       netif_stop_queue(priv->dev);
+                       return NETDEV_TX_BUSY;
+               }
+       }

without changing the c_can_get_tx_{free,busy} routines ?

Thanks and regards,
Dario

> Please test both patches, but only apply one of them at a time. :)
> 
> regards,
> Marc
> 
> -- 
> Pengutronix e.K.                 | Marc Kleine-Budde           |
> Embedded Linux                   | https://www.pengutronix.de  |
> Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
> Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-23 17:55                     ` dariobin
@ 2022-09-23 19:03                       ` Jacob Kroon
  2022-09-23 19:21                         ` Jacob Kroon
  0 siblings, 1 reply; 36+ messages in thread
From: Jacob Kroon @ 2022-09-23 19:03 UTC (permalink / raw)
  To: dariobin, Marc Kleine-Budde; +Cc: Oliver Hartkopp, linux-can, wg

Hi Dario,

On 9/23/22 19:55, dariobin@libero.it wrote:
> Hi Michael,
> 
>> Il 23/09/2022 13:36 Marc Kleine-Budde <mkl@pengutronix.de> ha scritto:
>>
>>   
>> On 16.09.2022 06:14:58, Jacob Kroon wrote:
>>> What I do know is that if I revert commit:
>>>
>>> "can: c_can: cache frames to operate as a true FIFO"
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
>>>
>>> then everything looks good. I don't get any BUG messages, and the host
>>> has been running overnight without problems, so it seems to have fixed
>>> the network interface lockup as well.
>>
>> Jacob, after this mail, I'll send 2 patches. One tries to disable the
>> cache feature for C_CAN cores, the other shuts a potential race window.
> 
> About the "can: c_can: don't cache TX messages for C_CAN cores" patch:
> Could it make sense to change the c_can_start_xmit in this way
> 
> -       if (idx < c_can_get_tx_tail(tx_ring))
> -               cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> +       if (idx < c_can_get_tx_tail(tx_ring)) {
> +               if (priv->type == BOSCH_D_CAN) {
> +                       cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> +               } else {
> +                       netif_stop_queue(priv->dev);
> +                       return NETDEV_TX_BUSY;
> +               }
> +       }
> 
> without changing the c_can_get_tx_{free,busy} routines ?
> 

I can test, so you suggest only doing the above patch, or were there 
other parts from Marc's patch you wanted in ?

Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-23 19:03                       ` Jacob Kroon
@ 2022-09-23 19:21                         ` Jacob Kroon
  2022-09-23 19:45                           ` dariobin
  0 siblings, 1 reply; 36+ messages in thread
From: Jacob Kroon @ 2022-09-23 19:21 UTC (permalink / raw)
  To: dariobin, Marc Kleine-Budde; +Cc: Oliver Hartkopp, linux-can, wg

On 9/23/22 21:03, Jacob Kroon wrote:
> Hi Dario,
> 
> On 9/23/22 19:55, dariobin@libero.it wrote:
>> Hi Michael,
>>
>>> Il 23/09/2022 13:36 Marc Kleine-Budde <mkl@pengutronix.de> ha scritto:
>>>
>>> On 16.09.2022 06:14:58, Jacob Kroon wrote:
>>>> What I do know is that if I revert commit:
>>>>
>>>> "can: c_can: cache frames to operate as a true FIFO"
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
>>>>
>>>> then everything looks good. I don't get any BUG messages, and the host
>>>> has been running overnight without problems, so it seems to have fixed
>>>> the network interface lockup as well.
>>>
>>> Jacob, after this mail, I'll send 2 patches. One tries to disable the
>>> cache feature for C_CAN cores, the other shuts a potential race window.
>>
>> About the "can: c_can: don't cache TX messages for C_CAN cores" patch:
>> Could it make sense to change the c_can_start_xmit in this way
>>
>> -       if (idx < c_can_get_tx_tail(tx_ring))
>> -               cmd &= ~IF_COMM_TXRQST; /* Cache the message */
>> +       if (idx < c_can_get_tx_tail(tx_ring)) {
>> +               if (priv->type == BOSCH_D_CAN) {
>> +                       cmd &= ~IF_COMM_TXRQST; /* Cache the message */
>> +               } else {
>> +                       netif_stop_queue(priv->dev);
>> +                       return NETDEV_TX_BUSY;
>> +               }
>> +       }
>>
>> without changing the c_can_get_tx_{free,busy} routines ?
>>
> 
> I can test, so you suggest only doing the above patch, or were there 
> other parts from Marc's patch you wanted in ?
>

Well I tried only the above, and the patch below

diff --git a/drivers/net/can/c_can/c_can_main.c 
b/drivers/net/can/c_can/c_can_main.c
index a7362af0babb..21d0a55ebbb3 100644
--- a/drivers/net/can/c_can/c_can_main.c
+++ b/drivers/net/can/c_can/c_can_main.c
@@ -468,8 +468,14 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff 
*skb,
  	if (c_can_get_tx_free(tx_ring) == 0)
  		netif_stop_queue(dev);

-	if (idx < c_can_get_tx_tail(tx_ring))
-		cmd &= ~IF_COMM_TXRQST; /* Cache the message */
+	if (idx < c_can_get_tx_tail(tx_ring)) {
+		if (priv->type == BOSCH_D_CAN) {
+			cmd &= ~IF_COMM_TXRQST; /* Cache the message */
+		} else {
+			netif_stop_queue(priv->dev);
+			return NETDEV_TX_BUSY;
+		}
+	}

  	/* Store the message in the interface so we can call
  	 * can_put_echo_skb(). We must do this before we enable
@@ -761,7 +767,7 @@ static void c_can_do_tx(struct net_device *dev)

  	tail = c_can_get_tx_tail(tx_ring);

-	if (tail == 0) {
+	if (priv->type == BOSCH_D_CAN && tail == 0) {
  		u8 head = c_can_get_tx_head(tx_ring);

  		/* Start transmission for all cached messages */


but they both seem to lockup the network.

Jacob

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-23 19:21                         ` Jacob Kroon
@ 2022-09-23 19:45                           ` dariobin
  2022-09-23 20:27                             ` Jacob Kroon
  2022-09-28  8:02                             ` Marc Kleine-Budde
  0 siblings, 2 replies; 36+ messages in thread
From: dariobin @ 2022-09-23 19:45 UTC (permalink / raw)
  To: Jacob Kroon, Marc Kleine-Budde; +Cc: Oliver Hartkopp, linux-can, wg

Hi Jacob,

> Il 23/09/2022 21:21 Jacob Kroon <jacob.kroon@gmail.com> ha scritto:
> 
>  
> On 9/23/22 21:03, Jacob Kroon wrote:
> > Hi Dario,
> > 
> > On 9/23/22 19:55, dariobin@libero.it wrote:
> >> Hi Michael,
> >>
> >>> Il 23/09/2022 13:36 Marc Kleine-Budde <mkl@pengutronix.de> ha scritto:
> >>>
> >>> On 16.09.2022 06:14:58, Jacob Kroon wrote:
> >>>> What I do know is that if I revert commit:
> >>>>
> >>>> "can: c_can: cache frames to operate as a true FIFO"
> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
> >>>>
> >>>> then everything looks good. I don't get any BUG messages, and the host
> >>>> has been running overnight without problems, so it seems to have fixed
> >>>> the network interface lockup as well.
> >>>
> >>> Jacob, after this mail, I'll send 2 patches. One tries to disable the
> >>> cache feature for C_CAN cores, the other shuts a potential race window.
> >>
> >> About the "can: c_can: don't cache TX messages for C_CAN cores" patch:
> >> Could it make sense to change the c_can_start_xmit in this way
> >>
> >> -       if (idx < c_can_get_tx_tail(tx_ring))
> >> -               cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> >> +       if (idx < c_can_get_tx_tail(tx_ring)) {
> >> +               if (priv->type == BOSCH_D_CAN) {
> >> +                       cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> >> +               } else {
> >> +                       netif_stop_queue(priv->dev);
> >> +                       return NETDEV_TX_BUSY;
> >> +               }
> >> +       }
> >>
> >> without changing the c_can_get_tx_{free,busy} routines ?
> >>
> > 
> > I can test, so you suggest only doing the above patch, or were there 
> > other parts from Marc's patch you wanted in ?
> >
> 
> Well I tried only the above, and the patch below
> 
> diff --git a/drivers/net/can/c_can/c_can_main.c 
> b/drivers/net/can/c_can/c_can_main.c
> index a7362af0babb..21d0a55ebbb3 100644
> --- a/drivers/net/can/c_can/c_can_main.c
> +++ b/drivers/net/can/c_can/c_can_main.c
> @@ -468,8 +468,14 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff 
> *skb,
>   	if (c_can_get_tx_free(tx_ring) == 0)
>   		netif_stop_queue(dev);
> 
> -	if (idx < c_can_get_tx_tail(tx_ring))
> -		cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> +	if (idx < c_can_get_tx_tail(tx_ring)) {
> +		if (priv->type == BOSCH_D_CAN) {
> +			cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> +		} else {
> +			netif_stop_queue(priv->dev);
> +			return NETDEV_TX_BUSY;
> +		}
> +	}
> 
>   	/* Store the message in the interface so we can call
>   	 * can_put_echo_skb(). We must do this before we enable
> @@ -761,7 +767,7 @@ static void c_can_do_tx(struct net_device *dev)
> 
>   	tail = c_can_get_tx_tail(tx_ring);
> 
> -	if (tail == 0) {
> +	if (priv->type == BOSCH_D_CAN && tail == 0) {
>   		u8 head = c_can_get_tx_head(tx_ring);
> 
>   		/* Start transmission for all cached messages */
> 
> 
> but they both seem to lockup the network.
> 

I didn't understand if you applied two patches separately or not.
This was the only patch I had thought of. Which was similar to Marc's 
for the interrupt part but differed in the c_can_start_xmit part.

--- a/drivers/net/can/c_can/c_can_main.c
+++ b/drivers/net/can/c_can/c_can_main.c
@@ -464,13 +464,19 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff *skb,
                return NETDEV_TX_BUSY;
 
        idx = c_can_get_tx_head(tx_ring);
+       if (idx < c_can_get_tx_tail(tx_ring)) {
+               if (priv->type == BOSCH_D_CAN) {
+                       cmd &= ~IF_COMM_TXRQST; /* Cache the message */
+               } else {
+                       netif_stop_queue(priv->dev);
+                       return NETDEV_TX_BUSY;
+               }
+       }
+
        tx_ring->head++;
        if (c_can_get_tx_free(tx_ring) == 0)
                netif_stop_queue(dev);
 
-       if (idx < c_can_get_tx_tail(tx_ring))
-               cmd &= ~IF_COMM_TXRQST; /* Cache the message */
-
        /* Store the message in the interface so we can call
         * can_put_echo_skb(). We must do this before we enable
         * transmit as we might race against do_tx().
@@ -723,7 +729,6 @@ static void c_can_do_tx(struct net_device *dev)
        struct c_can_tx_ring *tx_ring = &priv->tx;
        struct net_device_stats *stats = &dev->stats;
        u32 idx, obj, pkts = 0, bytes = 0, pend;
-       u8 tail;
 
        if (priv->msg_obj_tx_last > 32)
                pend = priv->read_reg32(priv, C_CAN_INTPND3_REG);
@@ -759,15 +764,20 @@ static void c_can_do_tx(struct net_device *dev)
        stats->tx_bytes += bytes;
        stats->tx_packets += pkts;
 
-       tail = c_can_get_tx_tail(tx_ring);
+       if (priv->type == BOSCH_D_CAN) {
+               u8 tail;
+
+               tail = c_can_get_tx_tail(tx_ring);
 
-       if (tail == 0) {
-               u8 head = c_can_get_tx_head(tx_ring);
+               if (tail == 0) {
+                       u8 head = c_can_get_tx_head(tx_ring);
 
-               /* Start transmission for all cached messages */
-               for (idx = tail; idx < head; idx++) {
-                       obj = idx + priv->msg_obj_tx_first;
-                       c_can_object_put(dev, IF_NAPI, obj, IF_COMM_TXRQST);
+                       /* Start transmission for all cached messages */
+                       for (idx = tail; idx < head; idx++) {
+                               obj = idx + priv->msg_obj_tx_first;
+                               c_can_object_put(dev, IF_NAPI, obj,
+                                                IF_COMM_TXRQST);
+                       }
                }
        }
 }

I changed it a little from the previous email because I noticed a problem.

Thanks and regards,
Dario

> Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-23 19:45                           ` dariobin
@ 2022-09-23 20:27                             ` Jacob Kroon
  2022-09-24  5:17                               ` Jacob Kroon
  2022-09-28  8:02                             ` Marc Kleine-Budde
  1 sibling, 1 reply; 36+ messages in thread
From: Jacob Kroon @ 2022-09-23 20:27 UTC (permalink / raw)
  To: dariobin, Marc Kleine-Budde; +Cc: Oliver Hartkopp, linux-can, wg

Hi Dario,

On 9/23/22 21:45, dariobin@libero.it wrote:
> Hi Jacob,
> 
>> Il 23/09/2022 21:21 Jacob Kroon <jacob.kroon@gmail.com> ha scritto:
>>
>>   
>> On 9/23/22 21:03, Jacob Kroon wrote:
>>> Hi Dario,
>>>
>>> On 9/23/22 19:55, dariobin@libero.it wrote:
>>>> Hi Michael,
>>>>
>>>>> Il 23/09/2022 13:36 Marc Kleine-Budde <mkl@pengutronix.de> ha scritto:
>>>>>
>>>>> On 16.09.2022 06:14:58, Jacob Kroon wrote:
>>>>>> What I do know is that if I revert commit:
>>>>>>
>>>>>> "can: c_can: cache frames to operate as a true FIFO"
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
>>>>>>
>>>>>> then everything looks good. I don't get any BUG messages, and the host
>>>>>> has been running overnight without problems, so it seems to have fixed
>>>>>> the network interface lockup as well.
>>>>>
>>>>> Jacob, after this mail, I'll send 2 patches. One tries to disable the
>>>>> cache feature for C_CAN cores, the other shuts a potential race window.
>>>>
>>>> About the "can: c_can: don't cache TX messages for C_CAN cores" patch:
>>>> Could it make sense to change the c_can_start_xmit in this way
>>>>
>>>> -       if (idx < c_can_get_tx_tail(tx_ring))
>>>> -               cmd &= ~IF_COMM_TXRQST; /* Cache the message */
>>>> +       if (idx < c_can_get_tx_tail(tx_ring)) {
>>>> +               if (priv->type == BOSCH_D_CAN) {
>>>> +                       cmd &= ~IF_COMM_TXRQST; /* Cache the message */
>>>> +               } else {
>>>> +                       netif_stop_queue(priv->dev);
>>>> +                       return NETDEV_TX_BUSY;
>>>> +               }
>>>> +       }
>>>>
>>>> without changing the c_can_get_tx_{free,busy} routines ?
>>>>
>>>
>>> I can test, so you suggest only doing the above patch, or were there
>>> other parts from Marc's patch you wanted in ?
>>>
>>
>> Well I tried only the above, and the patch below
>>
>> diff --git a/drivers/net/can/c_can/c_can_main.c
>> b/drivers/net/can/c_can/c_can_main.c
>> index a7362af0babb..21d0a55ebbb3 100644
>> --- a/drivers/net/can/c_can/c_can_main.c
>> +++ b/drivers/net/can/c_can/c_can_main.c
>> @@ -468,8 +468,14 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff
>> *skb,
>>    	if (c_can_get_tx_free(tx_ring) == 0)
>>    		netif_stop_queue(dev);
>>
>> -	if (idx < c_can_get_tx_tail(tx_ring))
>> -		cmd &= ~IF_COMM_TXRQST; /* Cache the message */
>> +	if (idx < c_can_get_tx_tail(tx_ring)) {
>> +		if (priv->type == BOSCH_D_CAN) {
>> +			cmd &= ~IF_COMM_TXRQST; /* Cache the message */
>> +		} else {
>> +			netif_stop_queue(priv->dev);
>> +			return NETDEV_TX_BUSY;
>> +		}
>> +	}
>>
>>    	/* Store the message in the interface so we can call
>>    	 * can_put_echo_skb(). We must do this before we enable
>> @@ -761,7 +767,7 @@ static void c_can_do_tx(struct net_device *dev)
>>
>>    	tail = c_can_get_tx_tail(tx_ring);
>>
>> -	if (tail == 0) {
>> +	if (priv->type == BOSCH_D_CAN && tail == 0) {
>>    		u8 head = c_can_get_tx_head(tx_ring);
>>
>>    		/* Start transmission for all cached messages */
>>
>>
>> but they both seem to lockup the network.
>>
> 
> I didn't understand if you applied two patches separately or not.
> This was the only patch I had thought of. Which was similar to Marc's
> for the interrupt part but differed in the c_can_start_xmit part.
> 
> --- a/drivers/net/can/c_can/c_can_main.c
> +++ b/drivers/net/can/c_can/c_can_main.c
> @@ -464,13 +464,19 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff *skb,
>                  return NETDEV_TX_BUSY;
>   
>          idx = c_can_get_tx_head(tx_ring);
> +       if (idx < c_can_get_tx_tail(tx_ring)) {
> +               if (priv->type == BOSCH_D_CAN) {
> +                       cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> +               } else {
> +                       netif_stop_queue(priv->dev);
> +                       return NETDEV_TX_BUSY;
> +               }
> +       }
> +
>          tx_ring->head++;
>          if (c_can_get_tx_free(tx_ring) == 0)
>                  netif_stop_queue(dev);
>   
> -       if (idx < c_can_get_tx_tail(tx_ring))
> -               cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> -
>          /* Store the message in the interface so we can call
>           * can_put_echo_skb(). We must do this before we enable
>           * transmit as we might race against do_tx().
> @@ -723,7 +729,6 @@ static void c_can_do_tx(struct net_device *dev)
>          struct c_can_tx_ring *tx_ring = &priv->tx;
>          struct net_device_stats *stats = &dev->stats;
>          u32 idx, obj, pkts = 0, bytes = 0, pend;
> -       u8 tail;
>   
>          if (priv->msg_obj_tx_last > 32)
>                  pend = priv->read_reg32(priv, C_CAN_INTPND3_REG);
> @@ -759,15 +764,20 @@ static void c_can_do_tx(struct net_device *dev)
>          stats->tx_bytes += bytes;
>          stats->tx_packets += pkts;
>   
> -       tail = c_can_get_tx_tail(tx_ring);
> +       if (priv->type == BOSCH_D_CAN) {
> +               u8 tail;
> +
> +               tail = c_can_get_tx_tail(tx_ring);
>   
> -       if (tail == 0) {
> -               u8 head = c_can_get_tx_head(tx_ring);
> +               if (tail == 0) {
> +                       u8 head = c_can_get_tx_head(tx_ring);
>   
> -               /* Start transmission for all cached messages */
> -               for (idx = tail; idx < head; idx++) {
> -                       obj = idx + priv->msg_obj_tx_first;
> -                       c_can_object_put(dev, IF_NAPI, obj, IF_COMM_TXRQST);
> +                       /* Start transmission for all cached messages */
> +                       for (idx = tail; idx < head; idx++) {
> +                               obj = idx + priv->msg_obj_tx_first;
> +                               c_can_object_put(dev, IF_NAPI, obj,
> +                                                IF_COMM_TXRQST);
> +                       }
>                  }
>          }
>   }
> 
> I changed it a little from the previous email because I noticed a problem.
> 

Okay, the patch above seems to be working better. I'm pretty sure

https://marc.info/?l=linux-can&m=166393304023574&w=2

is good, so I'll keep the machine running overnight with Darios patch 
above instead.

Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-23 20:27                             ` Jacob Kroon
@ 2022-09-24  5:17                               ` Jacob Kroon
  2022-09-28  8:25                                 ` Marc Kleine-Budde
  0 siblings, 1 reply; 36+ messages in thread
From: Jacob Kroon @ 2022-09-24  5:17 UTC (permalink / raw)
  To: dariobin, Marc Kleine-Budde; +Cc: Oliver Hartkopp, linux-can, wg

On 9/23/22 22:27, Jacob Kroon wrote:
> Hi Dario,
> 
> On 9/23/22 21:45, dariobin@libero.it wrote:
>> Hi Jacob,
>>
>>> Il 23/09/2022 21:21 Jacob Kroon <jacob.kroon@gmail.com> ha scritto:
>>>
>>> On 9/23/22 21:03, Jacob Kroon wrote:
>>>> Hi Dario,
>>>>
>>>> On 9/23/22 19:55, dariobin@libero.it wrote:
>>>>> Hi Michael,
>>>>>
>>>>>> Il 23/09/2022 13:36 Marc Kleine-Budde <mkl@pengutronix.de> ha 
>>>>>> scritto:
>>>>>>
>>>>>> On 16.09.2022 06:14:58, Jacob Kroon wrote:
>>>>>>> What I do know is that if I revert commit:
>>>>>>>
>>>>>>> "can: c_can: cache frames to operate as a true FIFO"
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
>>>>>>>
>>>>>>> then everything looks good. I don't get any BUG messages, and the 
>>>>>>> host
>>>>>>> has been running overnight without problems, so it seems to have 
>>>>>>> fixed
>>>>>>> the network interface lockup as well.
>>>>>>
>>>>>> Jacob, after this mail, I'll send 2 patches. One tries to disable the
>>>>>> cache feature for C_CAN cores, the other shuts a potential race 
>>>>>> window.
>>>>>
>>>>> About the "can: c_can: don't cache TX messages for C_CAN cores" patch:
>>>>> Could it make sense to change the c_can_start_xmit in this way
>>>>>
>>>>> -       if (idx < c_can_get_tx_tail(tx_ring))
>>>>> -               cmd &= ~IF_COMM_TXRQST; /* Cache the message */
>>>>> +       if (idx < c_can_get_tx_tail(tx_ring)) {
>>>>> +               if (priv->type == BOSCH_D_CAN) {
>>>>> +                       cmd &= ~IF_COMM_TXRQST; /* Cache the 
>>>>> message */
>>>>> +               } else {
>>>>> +                       netif_stop_queue(priv->dev);
>>>>> +                       return NETDEV_TX_BUSY;
>>>>> +               }
>>>>> +       }
>>>>>
>>>>> without changing the c_can_get_tx_{free,busy} routines ?
>>>>>
>>>>
>>>> I can test, so you suggest only doing the above patch, or were there
>>>> other parts from Marc's patch you wanted in ?
>>>>
>>>
>>> Well I tried only the above, and the patch below
>>>
>>> diff --git a/drivers/net/can/c_can/c_can_main.c
>>> b/drivers/net/can/c_can/c_can_main.c
>>> index a7362af0babb..21d0a55ebbb3 100644
>>> --- a/drivers/net/can/c_can/c_can_main.c
>>> +++ b/drivers/net/can/c_can/c_can_main.c
>>> @@ -468,8 +468,14 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff
>>> *skb,
>>>        if (c_can_get_tx_free(tx_ring) == 0)
>>>            netif_stop_queue(dev);
>>>
>>> -    if (idx < c_can_get_tx_tail(tx_ring))
>>> -        cmd &= ~IF_COMM_TXRQST; /* Cache the message */
>>> +    if (idx < c_can_get_tx_tail(tx_ring)) {
>>> +        if (priv->type == BOSCH_D_CAN) {
>>> +            cmd &= ~IF_COMM_TXRQST; /* Cache the message */
>>> +        } else {
>>> +            netif_stop_queue(priv->dev);
>>> +            return NETDEV_TX_BUSY;
>>> +        }
>>> +    }
>>>
>>>        /* Store the message in the interface so we can call
>>>         * can_put_echo_skb(). We must do this before we enable
>>> @@ -761,7 +767,7 @@ static void c_can_do_tx(struct net_device *dev)
>>>
>>>        tail = c_can_get_tx_tail(tx_ring);
>>>
>>> -    if (tail == 0) {
>>> +    if (priv->type == BOSCH_D_CAN && tail == 0) {
>>>            u8 head = c_can_get_tx_head(tx_ring);
>>>
>>>            /* Start transmission for all cached messages */
>>>
>>>
>>> but they both seem to lockup the network.
>>>
>>
>> I didn't understand if you applied two patches separately or not.
>> This was the only patch I had thought of. Which was similar to Marc's
>> for the interrupt part but differed in the c_can_start_xmit part.
>>
>> --- a/drivers/net/can/c_can/c_can_main.c
>> +++ b/drivers/net/can/c_can/c_can_main.c
>> @@ -464,13 +464,19 @@ static netdev_tx_t c_can_start_xmit(struct 
>> sk_buff *skb,
>>                  return NETDEV_TX_BUSY;
>>          idx = c_can_get_tx_head(tx_ring);
>> +       if (idx < c_can_get_tx_tail(tx_ring)) {
>> +               if (priv->type == BOSCH_D_CAN) {
>> +                       cmd &= ~IF_COMM_TXRQST; /* Cache the message */
>> +               } else {
>> +                       netif_stop_queue(priv->dev);
>> +                       return NETDEV_TX_BUSY;
>> +               }
>> +       }
>> +
>>          tx_ring->head++;
>>          if (c_can_get_tx_free(tx_ring) == 0)
>>                  netif_stop_queue(dev);
>> -       if (idx < c_can_get_tx_tail(tx_ring))
>> -               cmd &= ~IF_COMM_TXRQST; /* Cache the message */
>> -
>>          /* Store the message in the interface so we can call
>>           * can_put_echo_skb(). We must do this before we enable
>>           * transmit as we might race against do_tx().
>> @@ -723,7 +729,6 @@ static void c_can_do_tx(struct net_device *dev)
>>          struct c_can_tx_ring *tx_ring = &priv->tx;
>>          struct net_device_stats *stats = &dev->stats;
>>          u32 idx, obj, pkts = 0, bytes = 0, pend;
>> -       u8 tail;
>>          if (priv->msg_obj_tx_last > 32)
>>                  pend = priv->read_reg32(priv, C_CAN_INTPND3_REG);
>> @@ -759,15 +764,20 @@ static void c_can_do_tx(struct net_device *dev)
>>          stats->tx_bytes += bytes;
>>          stats->tx_packets += pkts;
>> -       tail = c_can_get_tx_tail(tx_ring);
>> +       if (priv->type == BOSCH_D_CAN) {
>> +               u8 tail;
>> +
>> +               tail = c_can_get_tx_tail(tx_ring);
>> -       if (tail == 0) {
>> -               u8 head = c_can_get_tx_head(tx_ring);
>> +               if (tail == 0) {
>> +                       u8 head = c_can_get_tx_head(tx_ring);
>> -               /* Start transmission for all cached messages */
>> -               for (idx = tail; idx < head; idx++) {
>> -                       obj = idx + priv->msg_obj_tx_first;
>> -                       c_can_object_put(dev, IF_NAPI, obj, 
>> IF_COMM_TXRQST);
>> +                       /* Start transmission for all cached messages */
>> +                       for (idx = tail; idx < head; idx++) {
>> +                               obj = idx + priv->msg_obj_tx_first;
>> +                               c_can_object_put(dev, IF_NAPI, obj,
>> +                                                IF_COMM_TXRQST);
>> +                       }
>>                  }
>>          }
>>   }
>>
>> I changed it a little from the previous email because I noticed a 
>> problem.
>>
> 
> Okay, the patch above seems to be working better. I'm pretty sure
> 
> https://marc.info/?l=linux-can&m=166393304023574&w=2
> 
> is good, so I'll keep the machine running overnight with Darios patch 
> above instead.
> 

Machine is still running with CAN network traffic working, so both 
patches at

https://marc.info/?l=linux-can&m=166393304023574&w=2
https://marc.info/?l=linux-can&m=166396200108947&w=2

are working for me.

Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-23 19:45                           ` dariobin
  2022-09-23 20:27                             ` Jacob Kroon
@ 2022-09-28  8:02                             ` Marc Kleine-Budde
  1 sibling, 0 replies; 36+ messages in thread
From: Marc Kleine-Budde @ 2022-09-28  8:02 UTC (permalink / raw)
  To: dariobin; +Cc: Jacob Kroon, Oliver Hartkopp, linux-can, wg

[-- Attachment #1: Type: text/plain, Size: 3483 bytes --]

On 23.09.2022 21:45:04, dariobin@libero.it wrote:
> I didn't understand if you applied two patches separately or not.
> This was the only patch I had thought of. Which was similar to Marc's 
> for the interrupt part but differed in the c_can_start_xmit part.
> 
> --- a/drivers/net/can/c_can/c_can_main.c
> +++ b/drivers/net/can/c_can/c_can_main.c
> @@ -464,13 +464,19 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff *skb,
>                 return NETDEV_TX_BUSY;
>  
>         idx = c_can_get_tx_head(tx_ring);
> +       if (idx < c_can_get_tx_tail(tx_ring)) {
> +               if (priv->type == BOSCH_D_CAN) {
> +                       cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> +               } else {
> +                       netif_stop_queue(priv->dev);
> +                       return NETDEV_TX_BUSY;

This works, but it is frowned upon. You intentionally omit adequate flow
control for the C_CAN core. The overhead for starting the
ndo_start_xmit() callback and aborting it with ETDEV_TX_BUSY is
considered too high.

> +               }
> +       }
> +
>         tx_ring->head++;
>         if (c_can_get_tx_free(tx_ring) == 0)
>                 netif_stop_queue(dev);
>  
> -       if (idx < c_can_get_tx_tail(tx_ring))
> -               cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> -
>         /* Store the message in the interface so we can call
>          * can_put_echo_skb(). We must do this before we enable
>          * transmit as we might race against do_tx().
> @@ -723,7 +729,6 @@ static void c_can_do_tx(struct net_device *dev)
>         struct c_can_tx_ring *tx_ring = &priv->tx;
>         struct net_device_stats *stats = &dev->stats;
>         u32 idx, obj, pkts = 0, bytes = 0, pend;
> -       u8 tail;
>  
>         if (priv->msg_obj_tx_last > 32)
>                 pend = priv->read_reg32(priv, C_CAN_INTPND3_REG);
> @@ -759,15 +764,20 @@ static void c_can_do_tx(struct net_device *dev)
>         stats->tx_bytes += bytes;
>         stats->tx_packets += pkts;
>  
> -       tail = c_can_get_tx_tail(tx_ring);
> +       if (priv->type == BOSCH_D_CAN) {
> +               u8 tail;
> +
> +               tail = c_can_get_tx_tail(tx_ring);
>  
> -       if (tail == 0) {
> -               u8 head = c_can_get_tx_head(tx_ring);
> +               if (tail == 0) {
> +                       u8 head = c_can_get_tx_head(tx_ring);
>  
> -               /* Start transmission for all cached messages */
> -               for (idx = tail; idx < head; idx++) {
> -                       obj = idx + priv->msg_obj_tx_first;
> -                       c_can_object_put(dev, IF_NAPI, obj, IF_COMM_TXRQST);
> +                       /* Start transmission for all cached messages */
> +                       for (idx = tail; idx < head; idx++) {
> +                               obj = idx + priv->msg_obj_tx_first;
> +                               c_can_object_put(dev, IF_NAPI, obj,
> +                                                IF_COMM_TXRQST);
> +                       }
>                 }
>         }
>  }
> 
> I changed it a little from the previous email because I noticed a problem.

Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-24  5:17                               ` Jacob Kroon
@ 2022-09-28  8:25                                 ` Marc Kleine-Budde
  2022-09-28  8:28                                   ` Jacob Kroon
  0 siblings, 1 reply; 36+ messages in thread
From: Marc Kleine-Budde @ 2022-09-28  8:25 UTC (permalink / raw)
  To: Jacob Kroon; +Cc: dariobin, Oliver Hartkopp, linux-can, wg

[-- Attachment #1: Type: text/plain, Size: 692 bytes --]

On 24.09.2022 07:17:25, Jacob Kroon wrote:
> Machine is still running with CAN network traffic working, so both patches
> at
> 
> https://marc.info/?l=linux-can&m=166393304023574&w=2
> https://marc.info/?l=linux-can&m=166396200108947&w=2
> 
> are working for me.

Can I add your Tested-by for my variant of the patch? That is:

| https://lore.kernel.org/all/20220923114223.726808-1-mkl@pengutronix.de

Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS
  2022-09-28  8:25                                 ` Marc Kleine-Budde
@ 2022-09-28  8:28                                   ` Jacob Kroon
  0 siblings, 0 replies; 36+ messages in thread
From: Jacob Kroon @ 2022-09-28  8:28 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: dariobin, Oliver Hartkopp, linux-can, wg

On 9/28/22 10:25, Marc Kleine-Budde wrote:
> On 24.09.2022 07:17:25, Jacob Kroon wrote:
>> Machine is still running with CAN network traffic working, so both patches
>> at
>>
>> https://marc.info/?l=linux-can&m=166393304023574&w=2
>> https://marc.info/?l=linux-can&m=166396200108947&w=2
>>
>> are working for me.
> 
> Can I add your Tested-by for my variant of the patch? That is:
> 
> | https://lore.kernel.org/all/20220923114223.726808-1-mkl@pengutronix.de
> 

Absolutely,

Regards
Jacob

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2022-09-28  8:29 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-25 13:25 CM-ITC, pch_can/c_can_pci, sendto() returning ENOBUFS Jacob Kroon
2022-08-26 11:24 ` Jacob Kroon
2022-08-29  9:14   ` Jacob Kroon
2022-08-29 13:20     ` Jacob Kroon
2022-08-29 13:53       ` Oliver Hartkopp
2022-08-30 12:59         ` Jacob Kroon
2022-08-30 19:15           ` Oliver Hartkopp
2022-09-01  9:38             ` Jacob Kroon
2022-09-01 16:35               ` Oliver Hartkopp
2022-09-02 15:13                 ` Jacob Kroon
2022-09-02 16:39                   ` Jacob Kroon
2022-09-05 14:17                   ` Marc Kleine-Budde
2022-09-05 15:54               ` Marc Kleine-Budde
2022-09-16  4:14                 ` Jacob Kroon
2022-09-19 23:24                   ` Jacob Kroon
2022-09-20  1:23                     ` Vincent Mailhol
2022-09-20  5:08                       ` Jacob Kroon
2022-09-21  7:25                     ` dariobin
2022-09-21  7:47                       ` Marc Kleine-Budde
2022-09-21  8:26                         ` Jacob Kroon
2022-09-21  9:55                         ` Oliver Hartkopp
2022-09-21 10:32                           ` Marc Kleine-Budde
2022-09-21 10:39                             ` Oliver Hartkopp
2022-09-21 10:53                               ` Marc Kleine-Budde
2022-09-21 11:00                                 ` Oliver Hartkopp
2022-09-22  7:20                         ` dariobin
2022-09-23 11:36                   ` Marc Kleine-Budde
2022-09-23 17:55                     ` dariobin
2022-09-23 19:03                       ` Jacob Kroon
2022-09-23 19:21                         ` Jacob Kroon
2022-09-23 19:45                           ` dariobin
2022-09-23 20:27                             ` Jacob Kroon
2022-09-24  5:17                               ` Jacob Kroon
2022-09-28  8:25                                 ` Marc Kleine-Budde
2022-09-28  8:28                                   ` Jacob Kroon
2022-09-28  8:02                             ` Marc Kleine-Budde

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.