All of lore.kernel.org
 help / color / mirror / Atom feed
* peak_pci: TX Frame Loss
@ 2015-11-18 14:51 Andri Yngvason
  2015-11-19  8:38 ` Stephane Grosjean
  2015-12-02 18:09 ` Andri Yngvason
  0 siblings, 2 replies; 17+ messages in thread
From: Andri Yngvason @ 2015-11-18 14:51 UTC (permalink / raw)
  To: linux-can; +Cc: wg, mkl, s.grosjean

Hi all,

We've been experiencing frame loss on transmission in the peak_pci netdev
driver.

The frames are not reported as "dumped" by the netlink interface.

We are running CANopen and this manifests sporadically as nodes dropping off the
network due to failure to answer node guarding RTR and as SDO request timeouts.

Example with can0 and can1 on the same bus where the CANopen master is on can0
and can1 is set up to listen only:
(1446688151.783844)  can0  701  [1] remote request <- node guarding request on can0
(1446688151.784296)  can0  70A  [1] remote request <- another node guarding request
(1446688151.784304)  can1  70A  [1] remote request <- only the latter is seen by can1
(1446688151.784751)  can0  720  [1] remote request
(1446688151.784763)  can1  720  [1] remote request                             
(1446688151.785793)  can1  283  [8] 00 00 00 00 00 00 00 00                    
(1446688151.785792)  can0  283  [8] 00 00 00 00 00 00 00 00                    
(1446688151.786164)  can0  70A  [1] 85 <-- node guarding response
(1446688151.786163)  can1  70A  [1] 85                                         
(1446688151.786641)  can1  720  [1] 85 
(1446688151.786641)  can0  720  [1] 85 <-- node guarding response
(1446688151.787057)  can0  721  [1] remote request                             
(1446688151.787063)  can1  721  [1] remote request                             
(1446688151.787728)  can0  721  [1] 05                                         
(1446688151.787733)  can1  721  [1] 05          

Node 1 never responded because it never received the request.

The node guarding requests are sent in bursts where lower ids appear before
higher ids. A curious observation is that it's always the lowest id that drops
out first. I.e. the first frame in a burst of frames is the one that's lost.

Another interesting thing that we've found out is that if we turn off SMP on the
system, the problem disappears. But obviously we don't want to disable SMP in a
production system. ;)
It helps to set the cpy affinity of all threads and processes that touch the CAN
bus to a single core but sadly it doesn't eliminate the problem.

Our systems are running on kernel version 3.14.3 with the rt patch. I tried
running 4.1.12-rt13 but that did not eliminate the problem. We also tried
running with the pcan netdev driver from peak which does in fact run without
frame loss. Thus, this is probably an issue with either peak_pci or sja1000.

I tried poking around in sja1000.c. I noticed that sja1000_start_xmit() is not
guarded against trying to transmit when the tx buffer is occupied, so I added a
check and a print-out:
diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
index 32bd7f4..adc49db 100644
--- a/drivers/net/can/sja1000/sja1000.c
+++ b/drivers/net/can/sja1000/sja1000.c
@@ -292,6 +292,11 @@ static netdev_tx_t sja1000_start_xmit(struct sk_buff *skb,
 
        netif_stop_queue(dev);
 
+       if (!(priv->read_reg(priv, SJA1000_SR) & SR_TBS)) {
+               netdev_err(dev, "BUG!, TX FIFO full when queue awake!\n");
+               return NETDEV_TX_BUSY;
+       }
+
        fi = dlc = cf->can_dlc;
        id = cf->can_id;

There was no error message in dmesg after frame loss, so that's not the problem.

The CPU is an Intel i7-4700EQ and the CAN interface is a Peak PCIe dual channel.

Does anyone have an idea what might be wrong? :)

Best regards,
Andri

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: peak_pci: TX Frame Loss
  2015-11-18 14:51 peak_pci: TX Frame Loss Andri Yngvason
@ 2015-11-19  8:38 ` Stephane Grosjean
  2015-11-19 10:12   ` Andri Yngvason
  2015-12-02 18:09 ` Andri Yngvason
  1 sibling, 1 reply; 17+ messages in thread
From: Stephane Grosjean @ 2015-11-19  8:38 UTC (permalink / raw)
  To: Andri Yngvason, linux-can; +Cc: wg, mkl

Hi Andri,

Could you first give me the result of sudo lspci -d 1c: -vvv please?

Regards,

Stéphane

Le 18/11/2015 15:51, Andri Yngvason a écrit :
> Hi all,
>
> We've been experiencing frame loss on transmission in the peak_pci netdev
> driver.
>
> The frames are not reported as "dumped" by the netlink interface.
>
> We are running CANopen and this manifests sporadically as nodes dropping off the
> network due to failure to answer node guarding RTR and as SDO request timeouts.
>
> Example with can0 and can1 on the same bus where the CANopen master is on can0
> and can1 is set up to listen only:
> (1446688151.783844)  can0  701  [1] remote request <- node guarding request on can0
> (1446688151.784296)  can0  70A  [1] remote request <- another node guarding request
> (1446688151.784304)  can1  70A  [1] remote request <- only the latter is seen by can1
> (1446688151.784751)  can0  720  [1] remote request
> (1446688151.784763)  can1  720  [1] remote request
> (1446688151.785793)  can1  283  [8] 00 00 00 00 00 00 00 00
> (1446688151.785792)  can0  283  [8] 00 00 00 00 00 00 00 00
> (1446688151.786164)  can0  70A  [1] 85 <-- node guarding response
> (1446688151.786163)  can1  70A  [1] 85
> (1446688151.786641)  can1  720  [1] 85
> (1446688151.786641)  can0  720  [1] 85 <-- node guarding response
> (1446688151.787057)  can0  721  [1] remote request
> (1446688151.787063)  can1  721  [1] remote request
> (1446688151.787728)  can0  721  [1] 05
> (1446688151.787733)  can1  721  [1] 05
>
> Node 1 never responded because it never received the request.
>
> The node guarding requests are sent in bursts where lower ids appear before
> higher ids. A curious observation is that it's always the lowest id that drops
> out first. I.e. the first frame in a burst of frames is the one that's lost.
>
> Another interesting thing that we've found out is that if we turn off SMP on the
> system, the problem disappears. But obviously we don't want to disable SMP in a
> production system. ;)
> It helps to set the cpy affinity of all threads and processes that touch the CAN
> bus to a single core but sadly it doesn't eliminate the problem.
>
> Our systems are running on kernel version 3.14.3 with the rt patch. I tried
> running 4.1.12-rt13 but that did not eliminate the problem. We also tried
> running with the pcan netdev driver from peak which does in fact run without
> frame loss. Thus, this is probably an issue with either peak_pci or sja1000.
>
> I tried poking around in sja1000.c. I noticed that sja1000_start_xmit() is not
> guarded against trying to transmit when the tx buffer is occupied, so I added a
> check and a print-out:
> diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
> index 32bd7f4..adc49db 100644
> --- a/drivers/net/can/sja1000/sja1000.c
> +++ b/drivers/net/can/sja1000/sja1000.c
> @@ -292,6 +292,11 @@ static netdev_tx_t sja1000_start_xmit(struct sk_buff *skb,
>   
>          netif_stop_queue(dev);
>   
> +       if (!(priv->read_reg(priv, SJA1000_SR) & SR_TBS)) {
> +               netdev_err(dev, "BUG!, TX FIFO full when queue awake!\n");
> +               return NETDEV_TX_BUSY;
> +       }
> +
>          fi = dlc = cf->can_dlc;
>          id = cf->can_id;
>
> There was no error message in dmesg after frame loss, so that's not the problem.
>
> The CPU is an Intel i7-4700EQ and the CAN interface is a Peak PCIe dual channel.
>
> Does anyone have an idea what might be wrong? :)
>
> Best regards,
> Andri

--
PEAK-System Technik GmbH
Sitz der Gesellschaft Darmstadt
Handelsregister Darmstadt HRB 9183 
Geschaeftsfuehrung: Alexander Gach, Uwe Wilhelm
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: peak_pci: TX Frame Loss
  2015-11-19  8:38 ` Stephane Grosjean
@ 2015-11-19 10:12   ` Andri Yngvason
  0 siblings, 0 replies; 17+ messages in thread
From: Andri Yngvason @ 2015-11-19 10:12 UTC (permalink / raw)
  To: Stephane Grosjean, linux-can; +Cc: wg, mkl

Hi Stephane,

Quoting Stephane Grosjean (2015-11-19 08:38:17)
> Hi Andri,
> 
> Could you first give me the result of sudo lspci -d 1c: -vvv please?
root@sensorx:~# ./lspci -d 1c: -vvv
03:00.0 Class 0280: Device 001c:0008 (rev 02)
        Subsystem: Device 001c:0005
        Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 19
        Region 0: Memory at f7d10000 (32-bit, non-prefetchable) [size=64K]
        Region 1: Memory at f7d00000 (32-bit, non-prefetchable) [size=64K]
        Capabilities: [50] Power Management version 0
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [70] MSI: Enable- Count=1/2 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [90] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s unlimited, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Vendor Specific Information: ID=0000 Rev=0 Len=00c <?>
        Kernel driver in use: peak_pci
>
[...]

Thanks,
Andri

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: peak_pci: TX Frame Loss
  2015-11-18 14:51 peak_pci: TX Frame Loss Andri Yngvason
  2015-11-19  8:38 ` Stephane Grosjean
@ 2015-12-02 18:09 ` Andri Yngvason
  2015-12-02 19:19   ` Oliver Hartkopp
  1 sibling, 1 reply; 17+ messages in thread
From: Andri Yngvason @ 2015-12-02 18:09 UTC (permalink / raw)
  To: linux-can; +Cc: wg, mkl, s.grosjean, hrafnkell.eiriksson, haukur.hafsteinsson

Quoting Andri Yngvason (2015-11-18 14:51:21)
[...]
> We've been experiencing frame loss on transmission in the peak_pci netdev
> driver.
[...]

Hi all,

I think I've figured it out.

I wrote two programs: One that sends out bursts of can frames with the same
can_id but increasing data[0] and one that listens to can frames and stops
ftrace when a frame is missing. These programs also place ftrace markers.
See link below [1].

In cases where frames were lost sja1000_write_cmdreg() is called from two cores
at the same time.
 
 7)               |                sja1000_interrupt [sja1001]() {
 [...]
 3)               |            raw_sendmsg [can_raw]() {
 [...]
 3)               |                      dev_hard_start_xmit() {
 3)               |                        sja1000_start_xmit [sja1000]() {
 7)   2.071 us    |                  peak_pci_read_reg [peak_pci]();
 3)   3.901 us    |                          peak_pci_read_reg [peak_pci]();
 [...]
 7)   2.066 us    |                  peak_pci_read_reg [peak_pci]();
 3)   0.139 us    |                          can_put_echo_skb [can_dev]();
 3)               |                          sja1000_write_cmdreg [sja1000]() {
 start_xmit acquires spin lock:
 3)   0.050 us    |                            _raw_spin_lock_irqsave();
 start_xmit writes to command register:
 3)   0.043 us    |                            peak_pci_write_reg [peak_pci]();
 3)   2.516 us    |                            peak_pci_read_reg [peak_pci]();
 7)               |                  sja1000_write_cmdreg [sja1000]() {
 sja1000_interrupt from other thread stalls on spin lock:
 7)   2.035 us    |                    _raw_spin_lock_irqsave();
 start_xmit releases spin lock:
 3)   0.083 us    |                            _raw_spin_unlock_irqrestore();
 sja1000_interrupt continues and writes to command register:
 7)   0.057 us    |                    peak_pci_write_reg [peak_pci]();
 3)   4.456 us    |                          }
 3) + 21.813 us   |                        }
 7)   1.657 us    |                    peak_pci_read_reg [peak_pci]();
 3) + 22.632 us   |                      }
 3)   0.044 us    |                      _raw_spin_lock();
 3) + 26.822 us   |                    }
 3)   0.105 us    |                    __local_bh_enable_ip();
 3) + 30.474 us   |                  }
 3) + 30.985 us   |                }
 sja1000_interrupt releases spin lock:
 7)   0.072 us    |                    _raw_spin_unlock_irqrestore();

 As far as locking goes, this looks OK, but maybe the sja1000 (or peak's FPGA
 implementation thereof) hasn't settled after writing to the command register?

 I tried adding a delay to test my theory:
diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
index f968d1e..b5115c2 100644
--- a/drivers/net/can/sja1000/sja1000.c
+++ b/drivers/net/can/sja1000/sja1000.c
@@ -93,6 +93,7 @@ static void sja1000_write_cmdreg(struct sja1000_priv *priv, u8 val)
 	spin_lock_irqsave(&priv->cmdreg_lock, flags);
 	priv->write_reg(priv, SJA1000_CMR, val);
 	priv->read_reg(priv, SJA1000_SR);
+	udelay(10);
 	spin_unlock_irqrestore(&priv->cmdreg_lock, flags);
 }
 
I've been running the test programs for a while now and I haven't lost a frame
yet, so this delay seems to solve the problem.

Do you guys think that this is an acceptable solution?
Should it maybe reside within peak_pci.c instead?

Regards,
Andri

[1] https://gist.github.com/any1/7062000f8c02b9b27bf6

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: peak_pci: TX Frame Loss
  2015-12-02 18:09 ` Andri Yngvason
@ 2015-12-02 19:19   ` Oliver Hartkopp
  2015-12-03  6:37     ` Oliver Hartkopp
  2015-12-03  8:20     ` Marc Kleine-Budde
  0 siblings, 2 replies; 17+ messages in thread
From: Oliver Hartkopp @ 2015-12-02 19:19 UTC (permalink / raw)
  To: Andri Yngvason, linux-can
  Cc: wg, mkl, s.grosjean, hrafnkell.eiriksson, haukur.hafsteinsson



On 12/02/2015 07:09 PM, Andri Yngvason wrote:
> Quoting Andri Yngvason (2015-11-18 14:51:21)
> [...]
>> We've been experiencing frame loss on transmission in the peak_pci netdev
>> driver.
> [...]
> 
> Hi all,
> 
> I think I've figured it out.
> 
> I wrote two programs: One that sends out bursts of can frames with the same
> can_id but increasing data[0] and one that listens to can frames and stops
> ftrace when a frame is missing. These programs also place ftrace markers.
> See link below [1].
> 
> In cases where frames were lost sja1000_write_cmdreg() is called from two cores
> at the same time.
>  
>  7)               |                sja1000_interrupt [sja1001]() {
>  [...]
>  3)               |            raw_sendmsg [can_raw]() {
>  [...]
>  3)               |                      dev_hard_start_xmit() {
>  3)               |                        sja1000_start_xmit [sja1000]() {
>  7)   2.071 us    |                  peak_pci_read_reg [peak_pci]();
>  3)   3.901 us    |                          peak_pci_read_reg [peak_pci]();
>  [...]
>  7)   2.066 us    |                  peak_pci_read_reg [peak_pci]();
>  3)   0.139 us    |                          can_put_echo_skb [can_dev]();
>  3)               |                          sja1000_write_cmdreg [sja1000]() {
>  start_xmit acquires spin lock:
>  3)   0.050 us    |                            _raw_spin_lock_irqsave();
>  start_xmit writes to command register:
>  3)   0.043 us    |                            peak_pci_write_reg [peak_pci]();
>  3)   2.516 us    |                            peak_pci_read_reg [peak_pci]();
>  7)               |                  sja1000_write_cmdreg [sja1000]() {
>  sja1000_interrupt from other thread stalls on spin lock:
>  7)   2.035 us    |                    _raw_spin_lock_irqsave();
>  start_xmit releases spin lock:
>  3)   0.083 us    |                            _raw_spin_unlock_irqrestore();
>  sja1000_interrupt continues and writes to command register:
>  7)   0.057 us    |                    peak_pci_write_reg [peak_pci]();
>  3)   4.456 us    |                          }
>  3) + 21.813 us   |                        }
>  7)   1.657 us    |                    peak_pci_read_reg [peak_pci]();
>  3) + 22.632 us   |                      }
>  3)   0.044 us    |                      _raw_spin_lock();
>  3) + 26.822 us   |                    }
>  3)   0.105 us    |                    __local_bh_enable_ip();
>  3) + 30.474 us   |                  }
>  3) + 30.985 us   |                }
>  sja1000_interrupt releases spin lock:
>  7)   0.072 us    |                    _raw_spin_unlock_irqrestore();
> 
>  As far as locking goes, this looks OK, but maybe the sja1000 (or peak's FPGA
>  implementation thereof) hasn't settled after writing to the command register?
> 
>  I tried adding a delay to test my theory:
> diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
> index f968d1e..b5115c2 100644
> --- a/drivers/net/can/sja1000/sja1000.c
> +++ b/drivers/net/can/sja1000/sja1000.c
> @@ -93,6 +93,7 @@ static void sja1000_write_cmdreg(struct sja1000_priv *priv, u8 val)
>  	spin_lock_irqsave(&priv->cmdreg_lock, flags);
>  	priv->write_reg(priv, SJA1000_CMR, val);
>  	priv->read_reg(priv, SJA1000_SR);
> +	udelay(10);
>  	spin_unlock_irqrestore(&priv->cmdreg_lock, flags);
>  }

Hi Andri,

looking at the code above I wonder whether the

	priv->read_reg(priv, SJA1000_SR);

is a bad hack anyway.

The read_reg() becomes readb(priv->reg_base + (port << 2)) in peak_pci.c and
the result is never used.

What if the entire operation gets 'optimized' at some point?

Can you check what happens if you just replace the priv->read_reg() with your
udelay(10) call?

Best regards,
Oliver


>  
> I've been running the test programs for a while now and I haven't lost a frame
> yet, so this delay seems to solve the problem.
> 
> Do you guys think that this is an acceptable solution?
> Should it maybe reside within peak_pci.c instead?
> 
> Regards,
> Andri
> 
> [1] https://gist.github.com/any1/7062000f8c02b9b27bf6
> --
> To unsubscribe from this list: send the line "unsubscribe linux-can" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: peak_pci: TX Frame Loss
  2015-12-02 19:19   ` Oliver Hartkopp
@ 2015-12-03  6:37     ` Oliver Hartkopp
  2015-12-03 11:23       ` Andri Yngvason
  2015-12-03  8:20     ` Marc Kleine-Budde
  1 sibling, 1 reply; 17+ messages in thread
From: Oliver Hartkopp @ 2015-12-03  6:37 UTC (permalink / raw)
  To: Andri Yngvason, linux-can
  Cc: wg, mkl, s.grosjean, hrafnkell.eiriksson, haukur.hafsteinsson

Some more details:

On 12/02/2015 08:19 PM, Oliver Hartkopp wrote:
> On 12/02/2015 07:09 PM, Andri Yngvason wrote:
>> Quoting Andri Yngvason (2015-11-18 14:51:21)
>> [...]
>>> We've been experiencing frame loss on transmission in the peak_pci netdev
>>> driver.
>> [...]


>>  I tried adding a delay to test my theory:
>> diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
>> index f968d1e..b5115c2 100644
>> --- a/drivers/net/can/sja1000/sja1000.c
>> +++ b/drivers/net/can/sja1000/sja1000.c
>> @@ -93,6 +93,7 @@ static void sja1000_write_cmdreg(struct sja1000_priv *priv, u8 val)
>>  	spin_lock_irqsave(&priv->cmdreg_lock, flags);
>>  	priv->write_reg(priv, SJA1000_CMR, val);
>>  	priv->read_reg(priv, SJA1000_SR);
>> +	udelay(10);
>>  	spin_unlock_irqrestore(&priv->cmdreg_lock, flags);
>>  }
> 
> Hi Andri,
> 
> looking at the code above I wonder whether the
> 
> 	priv->read_reg(priv, SJA1000_SR);
> 
> is a bad hack anyway.
> 
> The read_reg() becomes readb(priv->reg_base + (port << 2)) in peak_pci.c and
> the result is never used.
> 
> What if the entire operation gets 'optimized' at some point?
> 
> Can you check what happens if you just replace the priv->read_reg() with your
> udelay(10) call?
> 

In fact I contributed the first SJA1000 SMP fix myself.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=57c8a456640fa3ca777652f11f2db4179a3e66b6

As we only picked the code suggestion from Klaus the delay was introduced by
priv->read_reg() - which is obviously not the proper way.

The data sheet says:

"Between two commands at least one internal clock cycle is needed in order to
proceed. The internal clock is half of the external oscillator frequency."

The SJA1000 accepts external clocks from 0 to 24 MHz - all our drivers have an
external clock of 16 MHz which leads to an internal clock of 8 MHz.

If we need an additional internal clock cycle we can calculate this with
4 MHz and 1/4.000.000 = 0.000,000,25 => 250ns = 0.25us

Even if someone would use an external clock of 8 MHz a udelay(1) should be enough.

Can you please make a test where you replace the priv->read_reg() just with
udelay(1)?

If it works I would like to contribute a patch to fix my former attempt with a
Reported-by: and Tested-by: tag from you.

Many thanks,
Oliver



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: peak_pci: TX Frame Loss
  2015-12-02 19:19   ` Oliver Hartkopp
  2015-12-03  6:37     ` Oliver Hartkopp
@ 2015-12-03  8:20     ` Marc Kleine-Budde
  1 sibling, 0 replies; 17+ messages in thread
From: Marc Kleine-Budde @ 2015-12-03  8:20 UTC (permalink / raw)
  To: Oliver Hartkopp, Andri Yngvason, linux-can
  Cc: wg, s.grosjean, hrafnkell.eiriksson, haukur.hafsteinsson

[-- Attachment #1: Type: text/plain, Size: 1321 bytes --]

On 12/02/2015 08:19 PM, Oliver Hartkopp wrote:
>> diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
>> index f968d1e..b5115c2 100644
>> --- a/drivers/net/can/sja1000/sja1000.c
>> +++ b/drivers/net/can/sja1000/sja1000.c
>> @@ -93,6 +93,7 @@ static void sja1000_write_cmdreg(struct sja1000_priv *priv, u8 val)
>>  	spin_lock_irqsave(&priv->cmdreg_lock, flags);
>>  	priv->write_reg(priv, SJA1000_CMR, val);
>>  	priv->read_reg(priv, SJA1000_SR);
>> +	udelay(10);
>>  	spin_unlock_irqrestore(&priv->cmdreg_lock, flags);
>>  }
> 
> Hi Andri,
> 
> looking at the code above I wonder whether the
> 
> 	priv->read_reg(priv, SJA1000_SR);
> 
> is a bad hack anyway.
> 
> The read_reg() becomes readb(priv->reg_base + (port << 2)) in peak_pci.c and
> the result is never used.
> 
> What if the entire operation gets 'optimized' at some point?

readX() calls are never optimized out.

> Can you check what happens if you just replace the priv->read_reg() with your
> udelay(10) call?

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: peak_pci: TX Frame Loss
  2015-12-03  6:37     ` Oliver Hartkopp
@ 2015-12-03 11:23       ` Andri Yngvason
  2015-12-03 11:44         ` Marc Kleine-Budde
  2015-12-03 16:37         ` Stephane Grosjean
  0 siblings, 2 replies; 17+ messages in thread
From: Andri Yngvason @ 2015-12-03 11:23 UTC (permalink / raw)
  To: Oliver Hartkopp
  Cc: linux-can, wg, mkl, s.grosjean, hrafnkell.eiriksson, haukur.hafsteinsson

On Thu, Dec 03, 2015 at 07:37:17AM +0100, Oliver Hartkopp wrote:
> Some more details:
> 
> On 12/02/2015 08:19 PM, Oliver Hartkopp wrote:
> > On 12/02/2015 07:09 PM, Andri Yngvason wrote:
> >> Quoting Andri Yngvason (2015-11-18 14:51:21)
> >> [...]
> >>> We've been experiencing frame loss on transmission in the peak_pci netdev
> >>> driver.
> >> [...]
> 
> 
> >>  I tried adding a delay to test my theory:
> >> diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
> >> index f968d1e..b5115c2 100644
> >> --- a/drivers/net/can/sja1000/sja1000.c
> >> +++ b/drivers/net/can/sja1000/sja1000.c
> >> @@ -93,6 +93,7 @@ static void sja1000_write_cmdreg(struct sja1000_priv *priv, u8 val)
> >>  	spin_lock_irqsave(&priv->cmdreg_lock, flags);
> >>  	priv->write_reg(priv, SJA1000_CMR, val);
> >>  	priv->read_reg(priv, SJA1000_SR);
> >> +	udelay(10);
> >>  	spin_unlock_irqrestore(&priv->cmdreg_lock, flags);
> >>  }
> > 
> > Hi Andri,
> > 
> > looking at the code above I wonder whether the
> > 
> > 	priv->read_reg(priv, SJA1000_SR);
> > 
> > is a bad hack anyway.
> > 
> > The read_reg() becomes readb(priv->reg_base + (port << 2)) in peak_pci.c and
> > the result is never used.
> > 
> > What if the entire operation gets 'optimized' at some point?

Hi Oliver,

According to ftrace it takes at least 1us to execute.

> > 
> > Can you check what happens if you just replace the priv->read_reg() with your
> > udelay(10) call?
> > 
> 
> In fact I contributed the first SJA1000 SMP fix myself.
> 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=57c8a456640fa3ca777652f11f2db4179a3e66b6
> 
> As we only picked the code suggestion from Klaus the delay was introduced by
> priv->read_reg() - which is obviously not the proper way.
> 
> The data sheet says:
> 
> "Between two commands at least one internal clock cycle is needed in order to
> proceed. The internal clock is half of the external oscillator frequency."

The upper bound would have been more useful information here. ;)

> 
> The SJA1000 accepts external clocks from 0 to 24 MHz - all our drivers have an
> external clock of 16 MHz which leads to an internal clock of 8 MHz.
> 
> If we need an additional internal clock cycle we can calculate this with
> 4 MHz and 1/4.000.000 = 0.000,000,25 => 250ns = 0.25us
> 
> Even if someone would use an external clock of 8 MHz a udelay(1) should be enough.
> 
> Can you please make a test where you replace the priv->read_reg() just with
> udelay(1)?

I tried it and it's not enough. The sja1000 on the peak pci is an FPGA
implementation that claims to be sja1000 compatible. Is it possible that peak's
FPGA implementation is slower than the original sja1000?

Stéphane, what do you think?

> 
> If it works I would like to contribute a patch to fix my former attempt with a
> Reported-by: and Tested-by: tag from you.
> 
Sure, whatever gets this fixed is fine by me. ;)

Thanks,
Andri

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: peak_pci: TX Frame Loss
  2015-12-03 11:23       ` Andri Yngvason
@ 2015-12-03 11:44         ` Marc Kleine-Budde
  2015-12-08 10:21           ` Stephane Grosjean
  2015-12-03 16:37         ` Stephane Grosjean
  1 sibling, 1 reply; 17+ messages in thread
From: Marc Kleine-Budde @ 2015-12-03 11:44 UTC (permalink / raw)
  To: Andri Yngvason, Oliver Hartkopp
  Cc: linux-can, wg, s.grosjean, hrafnkell.eiriksson, haukur.hafsteinsson

[-- Attachment #1: Type: text/plain, Size: 569 bytes --]

On 12/03/2015 12:23 PM, Andri Yngvason wrote:
> I tried it and it's not enough. The sja1000 on the peak pci is an FPGA
> implementation that claims to be sja1000 compatible. Is it possible that peak's
> FPGA implementation is slower than the original sja1000?

Maybe buggier :)

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: peak_pci: TX Frame Loss
  2015-12-03 11:23       ` Andri Yngvason
  2015-12-03 11:44         ` Marc Kleine-Budde
@ 2015-12-03 16:37         ` Stephane Grosjean
  1 sibling, 0 replies; 17+ messages in thread
From: Stephane Grosjean @ 2015-12-03 16:37 UTC (permalink / raw)
  To: Andri Yngvason, Oliver Hartkopp
  Cc: linux-can, wg, mkl, hrafnkell.eiriksson, haukur.hafsteinsson

Hi,

The question has been forwarded to our hw engineers.
I will give you any feedback I'll have about that.

Regards,

Stéphane

Le 03/12/2015 12:23, Andri Yngvason a écrit :
> On Thu, Dec 03, 2015 at 07:37:17AM +0100, Oliver Hartkopp wrote:
>> Some more details:
>>
>> On 12/02/2015 08:19 PM, Oliver Hartkopp wrote:
>>> On 12/02/2015 07:09 PM, Andri Yngvason wrote:
>>>> Quoting Andri Yngvason (2015-11-18 14:51:21)
>>>> [...]
>>>>> We've been experiencing frame loss on transmission in the peak_pci netdev
>>>>> driver.
>>>> [...]
>>
>>>>   I tried adding a delay to test my theory:
>>>> diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
>>>> index f968d1e..b5115c2 100644
>>>> --- a/drivers/net/can/sja1000/sja1000.c
>>>> +++ b/drivers/net/can/sja1000/sja1000.c
>>>> @@ -93,6 +93,7 @@ static void sja1000_write_cmdreg(struct sja1000_priv *priv, u8 val)
>>>>   	spin_lock_irqsave(&priv->cmdreg_lock, flags);
>>>>   	priv->write_reg(priv, SJA1000_CMR, val);
>>>>   	priv->read_reg(priv, SJA1000_SR);
>>>> +	udelay(10);
>>>>   	spin_unlock_irqrestore(&priv->cmdreg_lock, flags);
>>>>   }
>>> Hi Andri,
>>>
>>> looking at the code above I wonder whether the
>>>
>>> 	priv->read_reg(priv, SJA1000_SR);
>>>
>>> is a bad hack anyway.
>>>
>>> The read_reg() becomes readb(priv->reg_base + (port << 2)) in peak_pci.c and
>>> the result is never used.
>>>
>>> What if the entire operation gets 'optimized' at some point?
> Hi Oliver,
>
> According to ftrace it takes at least 1us to execute.
>
>>> Can you check what happens if you just replace the priv->read_reg() with your
>>> udelay(10) call?
>>>
>> In fact I contributed the first SJA1000 SMP fix myself.
>>
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=57c8a456640fa3ca777652f11f2db4179a3e66b6
>>
>> As we only picked the code suggestion from Klaus the delay was introduced by
>> priv->read_reg() - which is obviously not the proper way.
>>
>> The data sheet says:
>>
>> "Between two commands at least one internal clock cycle is needed in order to
>> proceed. The internal clock is half of the external oscillator frequency."
> The upper bound would have been more useful information here. ;)
>
>> The SJA1000 accepts external clocks from 0 to 24 MHz - all our drivers have an
>> external clock of 16 MHz which leads to an internal clock of 8 MHz.
>>
>> If we need an additional internal clock cycle we can calculate this with
>> 4 MHz and 1/4.000.000 = 0.000,000,25 => 250ns = 0.25us
>>
>> Even if someone would use an external clock of 8 MHz a udelay(1) should be enough.
>>
>> Can you please make a test where you replace the priv->read_reg() just with
>> udelay(1)?
> I tried it and it's not enough. The sja1000 on the peak pci is an FPGA
> implementation that claims to be sja1000 compatible. Is it possible that peak's
> FPGA implementation is slower than the original sja1000?
>
> Stéphane, what do you think?
>
>> If it works I would like to contribute a patch to fix my former attempt with a
>> Reported-by: and Tested-by: tag from you.
>>
> Sure, whatever gets this fixed is fine by me. ;)
>
> Thanks,
> Andri

--
PEAK-System Technik GmbH
Sitz der Gesellschaft Darmstadt
Handelsregister Darmstadt HRB 9183 
Geschaeftsfuehrung: Alexander Gach, Uwe Wilhelm
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: peak_pci: TX Frame Loss
  2015-12-03 11:44         ` Marc Kleine-Budde
@ 2015-12-08 10:21           ` Stephane Grosjean
  2015-12-08 10:50             ` Andri Yngvason
  0 siblings, 1 reply; 17+ messages in thread
From: Stephane Grosjean @ 2015-12-08 10:21 UTC (permalink / raw)
  To: Marc Kleine-Budde, Andri Yngvason, Oliver Hartkopp
  Cc: linux-can, wg, hrafnkell.eiriksson, haukur.hafsteinsson

Hi!

Don't know atm what is buggier, but:

- the issue isn't reproducible with our Windows drivers,
- the issue isn't reproducible with the pcan driver,

IMHO, and after having read and compared the sja1000 and the pcan code, 
I would like clarification on:

- the sja1000_start_xmit() function: what about two (or more) tasks 
running this function on a SMP system, and, of course, writing to the 
same SJA1000? Especially, running the (non-protected) sequence of 
several consecutive "priv->write_reg()" ?

- in the same vein, what happens when an IRQ occurs in the middle of the 
_xmit()  (non-protected) write_reg() sequence?

FYI, I have surrounded the _xmit() function as well as the ISR with 
couples of spin_lock/spin_unlock instructions, rebuild everything and 
have ran Andri's testbed... and guess what?

So, my question is: I'm far from getting the whole knowledge of how the 
net/can core handles multiple access to the same hardware (maybe 
concurrent access is managed somewhere else in the network stack, never 
know), but I don't explain how things can run correctly without 
guaranteeing at least the writing sequences to the SJA1000 registers...

Once again, maybe I'm wrong. ATM, the testbed always runs as it should...

Regards,

Stéphane


Le 03/12/2015 12:44, Marc Kleine-Budde a écrit :
> On 12/03/2015 12:23 PM, Andri Yngvason wrote:
>> I tried it and it's not enough. The sja1000 on the peak pci is an FPGA
>> implementation that claims to be sja1000 compatible. Is it possible that peak's
>> FPGA implementation is slower than the original sja1000?
> Maybe buggier :)
>
> Marc
>

--
PEAK-System Technik GmbH
Sitz der Gesellschaft Darmstadt
Handelsregister Darmstadt HRB 9183 
Geschaeftsfuehrung: Alexander Gach, Uwe Wilhelm
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: peak_pci: TX Frame Loss
  2015-12-08 10:21           ` Stephane Grosjean
@ 2015-12-08 10:50             ` Andri Yngvason
  2015-12-08 11:42               ` Stephane Grosjean
  0 siblings, 1 reply; 17+ messages in thread
From: Andri Yngvason @ 2015-12-08 10:50 UTC (permalink / raw)
  To: Stephane Grosjean
  Cc: Marc Kleine-Budde, Oliver Hartkopp, linux-can, wg,
	hrafnkell.eiriksson, haukur.hafsteinsson

Hi Stephane,

On Tue, Dec 08, 2015 at 11:21:26AM +0100, Stephane Grosjean wrote:
> Hi!
> 
> Don't know atm what is buggier, but:
> 
> - the issue isn't reproducible with our Windows drivers,
> - the issue isn't reproducible with the pcan driver,
You have spin locks sprinkled around the pcan driver even where there isn't a
specific shared resource that's being protected.

However, a nice side-effect of that is that the interrupt routine is not allowed
to run concurrently with the xmit function, so the command register is never
written to so frequently as in the linux-can driver.
> 
[...]
> FYI, I have surrounded the _xmit() function as well as the ISR with couples
> of spin_lock/spin_unlock instructions, rebuild everything and have ran
> Andri's testbed... and guess what?
I don't know. I'm not sure if you hinted at the answer to this question. Did the
problem go away? I guess it did due to the reasons that I mentioned above.

This does not strike at the core of the issue. A targeted solution is to add a
delay.
> 
[...]
> 
> Le 03/12/2015 12:44, Marc Kleine-Budde a écrit :
> >On 12/03/2015 12:23 PM, Andri Yngvason wrote:
> >>I tried it and it's not enough. The sja1000 on the peak pci is an FPGA
> >>implementation that claims to be sja1000 compatible. Is it possible that peak's
> >>FPGA implementation is slower than the original sja1000?
> >Maybe buggier :)
> >
> >Marc
> >
> 

Regards,
Andri

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: peak_pci: TX Frame Loss
  2015-12-08 10:50             ` Andri Yngvason
@ 2015-12-08 11:42               ` Stephane Grosjean
  2015-12-08 12:24                 ` Andri Yngvason
  0 siblings, 1 reply; 17+ messages in thread
From: Stephane Grosjean @ 2015-12-08 11:42 UTC (permalink / raw)
  To: Andri Yngvason
  Cc: Marc Kleine-Budde, Oliver Hartkopp, linux-can, wg,
	hrafnkell.eiriksson, haukur.hafsteinsson

Hi Andri,

So, you mean that, for example, the below sequence (extract from _xmit()):

                 priv->write_reg(priv, SJA1000_FI, fi);
                 priv->write_reg(priv, SJA1000_ID1, (id & 0x000007f8) >> 3);
                 priv->write_reg(priv, SJA1000_ID2, (id & 0x00000007) << 5);
                 ...
                 (1)
                 ...
                 sja1000_write_cmdreg(priv, CMD_TR);

can't be interrupted at all (especially at (1)), by any IRQ which ISR 
could write (again) some new ID?
I mean, shouldn't the SJA1000 and the sequence order of the cmds be 
considered as shared resource?

Regards,

Stéphane

Le 08/12/2015 11:50, Andri Yngvason a écrit :
> Hi Stephane,
>
> On Tue, Dec 08, 2015 at 11:21:26AM +0100, Stephane Grosjean wrote:
>> Hi!
>>
>> Don't know atm what is buggier, but:
>>
>> - the issue isn't reproducible with our Windows drivers,
>> - the issue isn't reproducible with the pcan driver,
> You have spin locks sprinkled around the pcan driver even where there isn't a
> specific shared resource that's being protected.
>
> However, a nice side-effect of that is that the interrupt routine is not allowed
> to run concurrently with the xmit function, so the command register is never
> written to so frequently as in the linux-can driver.
> [...]
>> FYI, I have surrounded the _xmit() function as well as the ISR with couples
>> of spin_lock/spin_unlock instructions, rebuild everything and have ran
>> Andri's testbed... and guess what?
> I don't know. I'm not sure if you hinted at the answer to this question. Did the
> problem go away? I guess it did due to the reasons that I mentioned above.
>
> This does not strike at the core of the issue. A targeted solution is to add a
> delay.
> [...]
>> Le 03/12/2015 12:44, Marc Kleine-Budde a écrit :
>>> On 12/03/2015 12:23 PM, Andri Yngvason wrote:
>>>> I tried it and it's not enough. The sja1000 on the peak pci is an FPGA
>>>> implementation that claims to be sja1000 compatible. Is it possible that peak's
>>>> FPGA implementation is slower than the original sja1000?
>>> Maybe buggier :)
>>>
>>> Marc
>>>
> Regards,
> Andri

--
PEAK-System Technik GmbH
Sitz der Gesellschaft Darmstadt
Handelsregister Darmstadt HRB 9183 
Geschaeftsfuehrung: Alexander Gach, Uwe Wilhelm
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: peak_pci: TX Frame Loss
  2015-12-08 11:42               ` Stephane Grosjean
@ 2015-12-08 12:24                 ` Andri Yngvason
  2015-12-08 14:12                   ` [BULK]Re: " Stephane Grosjean
  2015-12-22  8:13                   ` Stephane Grosjean
  0 siblings, 2 replies; 17+ messages in thread
From: Andri Yngvason @ 2015-12-08 12:24 UTC (permalink / raw)
  To: Stephane Grosjean
  Cc: Marc Kleine-Budde, Oliver Hartkopp, linux-can, wg,
	hrafnkell.eiriksson, haukur.hafsteinsson

Hi Stephane,

On Tue, Dec 08, 2015 at 12:42:58PM +0100, Stephane Grosjean wrote:
> Hi Andri,
> 
> So, you mean that, for example, the below sequence (extract from _xmit()):
> 
>                 priv->write_reg(priv, SJA1000_FI, fi);
>                 priv->write_reg(priv, SJA1000_ID1, (id & 0x000007f8) >> 3);
>                 priv->write_reg(priv, SJA1000_ID2, (id & 0x00000007) << 5);
>                 ...
>                 (1)
>                 ...
>                 sja1000_write_cmdreg(priv, CMD_TR);
> 
> can't be interrupted at all (especially at (1)), by any IRQ which ISR could
> write (again) some new ID?
> I mean, shouldn't the SJA1000 and the sequence order of the cmds be
> considered as shared resource?
> 

AFAIK only _xmit() writes to these registers and I think the _xmit() function is
only ever executed on one core at a time. Executing two of them at the same time
wouldn't make much sense. It would just cause out-of-order transmission.

Also, the interrupt routine does not call _xmit() directly. It just calls
netif_wake_queue() which AFAIK schedules the _xmit() routine to be called later.

I'm just saying that we shouldn't place locks where they are not needed because
it really just obfuscates the real issue. It's the equivalent of leaving the
mess on the ground and just walking around it instead of cleaning it up. Someone
else will step into it.

Besides, ftrace does not show _xmit() being called concurrently for the frames
that are lost.

By the way, I'm not sure that I understood you correctly. Did placing the spin
locks prevent the issue? I'm not sure this "guess what" rhetoric works well on a
mailing list. ;)

Thanks,
Andri

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BULK]Re: peak_pci: TX Frame Loss
  2015-12-08 12:24                 ` Andri Yngvason
@ 2015-12-08 14:12                   ` Stephane Grosjean
  2015-12-22  8:13                   ` Stephane Grosjean
  1 sibling, 0 replies; 17+ messages in thread
From: Stephane Grosjean @ 2015-12-08 14:12 UTC (permalink / raw)
  To: Andri Yngvason
  Cc: Marc Kleine-Budde, Oliver Hartkopp, linux-can, wg,
	hrafnkell.eiriksson, haukur.hafsteinsson

Le 08/12/2015 13:24, Andri Yngvason a écrit :
> Hi Stephane,
>
> On Tue, Dec 08, 2015 at 12:42:58PM +0100, Stephane Grosjean wrote:
>> Hi Andri,
>>
>> So, you mean that, for example, the below sequence (extract from _xmit()):
>>
>>                  priv->write_reg(priv, SJA1000_FI, fi);
>>                  priv->write_reg(priv, SJA1000_ID1, (id & 0x000007f8) >> 3);
>>                  priv->write_reg(priv, SJA1000_ID2, (id & 0x00000007) << 5);
>>                  ...
>>                  (1)
>>                  ...
>>                  sja1000_write_cmdreg(priv, CMD_TR);
>>
>> can't be interrupted at all (especially at (1)), by any IRQ which ISR could
>> write (again) some new ID?
>> I mean, shouldn't the SJA1000 and the sequence order of the cmds be
>> considered as shared resource?
>>
> AFAIK only _xmit() writes to these registers and I think the _xmit() function is
> only ever executed on one core at a time. Executing two of them at the same time
> wouldn't make much sense. It would just cause out-of-order transmission.
>
> Also, the interrupt routine does not call _xmit() directly. It just calls
> netif_wake_queue() which AFAIK schedules the _xmit() routine to be called later.
>
> I'm just saying that we shouldn't place locks where they are not needed because
> it really just obfuscates the real issue. It's the equivalent of leaving the
> mess on the ground and just walking around it instead of cleaning it up. Someone
> else will step into it.
Ok. If you're right with what you have written above, I do also agree 
with you: there is no need to lock the _xmit() function nor the ISR.
> Besides, ftrace does not show _xmit() being called concurrently for the frames
> that are lost.
>
> By the way, I'm not sure that I understood you correctly. Did placing the spin
> locks prevent the issue? I'm not sure this "guess what" rhetoric works well on a
> mailing list. ;)

Yes, it did.

Stéphane
> Thanks,
> Andri
> --
> To unsubscribe from this list: send the line "unsubscribe linux-can" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
PEAK-System Technik GmbH
Sitz der Gesellschaft Darmstadt
Handelsregister Darmstadt HRB 9183 
Geschaeftsfuehrung: Alexander Gach, Uwe Wilhelm
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: peak_pci: TX Frame Loss
  2015-12-08 12:24                 ` Andri Yngvason
  2015-12-08 14:12                   ` [BULK]Re: " Stephane Grosjean
@ 2015-12-22  8:13                   ` Stephane Grosjean
  2015-12-22 11:51                     ` Andri Yngvason
  1 sibling, 1 reply; 17+ messages in thread
From: Stephane Grosjean @ 2015-12-22  8:13 UTC (permalink / raw)
  To: Andri Yngvason
  Cc: Marc Kleine-Budde, Oliver Hartkopp, linux-can, wg,
	hrafnkell.eiriksson, haukur.hafsteinsson

Hi Andri,

We're currently investigating hard onto this issue. On my side, I'm 
checking the software part and I'd like to discuss with you about one 
point:

AFAIK the "sja1000_start_xmit()" function calls are really serialized by 
the network core. So there's actually no need of any lock to share the 
access of the SJA1000 registers, regarding the Tx path. Right. But I 
don't see in this function any test condition over the Transmit Status 
bit before writing the outgoing CAN frame (aka "reg[SJA1000_SR] & SR_TS").

Without testing the Transmit Status bit before writing, wouldn't it be 
possible that any 2nd task overwrite a 1st frame written by a 1st task 
before it has been really sent by the SJA1000?

Regards,

Stéphane

Le 08/12/2015 13:24, Andri Yngvason a écrit :
> Hi Stephane,
>
> On Tue, Dec 08, 2015 at 12:42:58PM +0100, Stephane Grosjean wrote:
>> Hi Andri,
>>
>> So, you mean that, for example, the below sequence (extract from _xmit()):
>>
>>                  priv->write_reg(priv, SJA1000_FI, fi);
>>                  priv->write_reg(priv, SJA1000_ID1, (id & 0x000007f8) >> 3);
>>                  priv->write_reg(priv, SJA1000_ID2, (id & 0x00000007) << 5);
>>                  ...
>>                  (1)
>>                  ...
>>                  sja1000_write_cmdreg(priv, CMD_TR);
>>
>> can't be interrupted at all (especially at (1)), by any IRQ which ISR could
>> write (again) some new ID?
>> I mean, shouldn't the SJA1000 and the sequence order of the cmds be
>> considered as shared resource?
>>
> AFAIK only _xmit() writes to these registers and I think the _xmit() function is
> only ever executed on one core at a time. Executing two of them at the same time
> wouldn't make much sense. It would just cause out-of-order transmission.
>
> Also, the interrupt routine does not call _xmit() directly. It just calls
> netif_wake_queue() which AFAIK schedules the _xmit() routine to be called later.
>
> I'm just saying that we shouldn't place locks where they are not needed because
> it really just obfuscates the real issue. It's the equivalent of leaving the
> mess on the ground and just walking around it instead of cleaning it up. Someone
> else will step into it.
>
> Besides, ftrace does not show _xmit() being called concurrently for the frames
> that are lost.
>
> By the way, I'm not sure that I understood you correctly. Did placing the spin
> locks prevent the issue? I'm not sure this "guess what" rhetoric works well on a
> mailing list. ;)
>
> Thanks,
> Andri

--

We hereby inform you that PEAK-System Europe will be closed from 24 December 2015 untill 4 January 2016. 
During this period there will be only limited software and technical support available.

 We wish you all a merry christmas and a happy new year!

 -The PEAK-System Europe  Team -

--
PEAK-System Technik GmbH
Sitz der Gesellschaft Darmstadt
Handelsregister Darmstadt HRB 9183 
Geschaeftsfuehrung: Alexander Gach, Uwe Wilhelm
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: peak_pci: TX Frame Loss
  2015-12-22  8:13                   ` Stephane Grosjean
@ 2015-12-22 11:51                     ` Andri Yngvason
  0 siblings, 0 replies; 17+ messages in thread
From: Andri Yngvason @ 2015-12-22 11:51 UTC (permalink / raw)
  To: Stephane Grosjean
  Cc: Marc Kleine-Budde, Oliver Hartkopp, linux-can, wg,
	hrafnkell.eiriksson, haukur.hafsteinsson

On Tue, Dec 22, 2015 at 09:13:25AM +0100, Stephane Grosjean wrote:
> Hi Andri,
> 
> We're currently investigating hard onto this issue. On my side, I'm checking
> the software part and I'd like to discuss with you about one point:
> 
> AFAIK the "sja1000_start_xmit()" function calls are really serialized by the
> network core. So there's actually no need of any lock to share the access of
> the SJA1000 registers, regarding the Tx path. Right. But I don't see in this
> function any test condition over the Transmit Status bit before writing the
> outgoing CAN frame (aka "reg[SJA1000_SR] & SR_TS").
>
I tried adding it and it didn't do anything. See original message.
> 
> Without testing the Transmit Status bit before writing, wouldn't it be
> possible that any 2nd task overwrite a 1st frame written by a 1st task
> before it has been really sent by the SJA1000?
> 
I don't think so. The _start_xmit() call is scheduled via the wait queue, so
_start_xmit() being called concurrently would be a sign of a misbehaving wait
queue, but like I said, adding the check did not yield any results.

Thanks & Merry Xmas!
Andri

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2015-12-22 11:53 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-18 14:51 peak_pci: TX Frame Loss Andri Yngvason
2015-11-19  8:38 ` Stephane Grosjean
2015-11-19 10:12   ` Andri Yngvason
2015-12-02 18:09 ` Andri Yngvason
2015-12-02 19:19   ` Oliver Hartkopp
2015-12-03  6:37     ` Oliver Hartkopp
2015-12-03 11:23       ` Andri Yngvason
2015-12-03 11:44         ` Marc Kleine-Budde
2015-12-08 10:21           ` Stephane Grosjean
2015-12-08 10:50             ` Andri Yngvason
2015-12-08 11:42               ` Stephane Grosjean
2015-12-08 12:24                 ` Andri Yngvason
2015-12-08 14:12                   ` [BULK]Re: " Stephane Grosjean
2015-12-22  8:13                   ` Stephane Grosjean
2015-12-22 11:51                     ` Andri Yngvason
2015-12-03 16:37         ` Stephane Grosjean
2015-12-03  8:20     ` Marc Kleine-Budde

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.