[PATCH] can: fix loss of frames due to wrong assumption in raw

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
@ 2015-06-20 17:21 Manfred Schlaegl
  2015-06-20 22:42   ` Oliver Hartkopp
  0 siblings, 1 reply; 21+ messages in thread
From: Manfred Schlaegl @ 2015-06-20 17:21 UTC (permalink / raw)
  To: Wolfgang Grandegger, Marc Kleine-Budde
  Cc: linux-can, netdev, linux-kernel, Manfred Schlaegl,
	Oliver Hartkopp, David S. Miller

[-- Attachment #1: Type: text/plain, Size: 1934 bytes --]

I've detected a massive loss of can frames on i.MX6 using flexcan
driver with 4.1-rc8 and tracked this down to following commit:
514ac99c64b22d83b52dfee3b8becaa69a92bc4a - "can: fix multiple delivery
of a single CAN frame for overlapping CAN filters"

514ac99c64b22d83b52dfee3b8becaa69a92bc4a introduces a frame equality
check. Since the sk_buff pointer is not sufficient to do this (buffers
are reused), the check also compares time stamps.
In short: pointer+time stamp was assumed as unique key to a specific
frame.
The problem with this is, that the time stamp is an optional property
and not set per default.
In our case (flexcan) the time stamp is always zero, so the equality
check is reduced to equality of buffer pointers, resulting in a lot of
dropped frames.

Possible solutions I thought of:
 1. Every driver has to set a time stamp
    (possibly error prone and hard to enforce?)
 2. Change the equality check
 3. Fulfil the requirements of the equality check by setting a
    time stamp per default.

This patch fixes the problem with solution 3. A time stamp is set at
time of allocation in alloc_can_skb.
The time stamp may be overridden later, but the function of the equality
check is ensured.

I'm not really deep in linux network subsystem, so there may exists
more elegant solutions for the problem.

Signed-off-by: Manfred Schlaegl <manfred.schlaegl@gmx.at>
---
 drivers/net/can/dev.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/can/dev.c b/drivers/net/can/dev.c
index b0f6924..282e2e7 100644
--- a/drivers/net/can/dev.c
+++ b/drivers/net/can/dev.c
@@ -575,6 +575,7 @@ struct sk_buff *alloc_can_skb(struct net_device *dev, struct can_frame **cf)
 	if (unlikely(!skb))
 		return NULL;

+	__net_timestamp(skb);
 	skb->protocol = htons(ETH_P_CAN);
 	skb->pkt_type = PACKET_BROADCAST;
 	skb->ip_summed = CHECKSUM_UNNECESSARY;
-- 
1.7.10.4

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-06-20 17:21 [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv Manfred Schlaegl
@ 2015-06-20 22:42   ` Oliver Hartkopp
  0 siblings, 0 replies; 21+ messages in thread
From: Oliver Hartkopp @ 2015-06-20 22:42 UTC (permalink / raw)
  To: Manfred Schlaegl, Wolfgang Grandegger, Marc Kleine-Budde
  Cc: linux-can, netdev, linux-kernel, Manfred Schlaegl, David S. Miller

Hello Manfred,

On 06/20/2015 07:21 PM, Manfred Schlaegl wrote:
> I've detected a massive loss of can frames on i.MX6 using flexcan
> driver with 4.1-rc8 and tracked this down to following commit:
> 514ac99c64b22d83b52dfee3b8becaa69a92bc4a - "can: fix multiple delivery
> of a single CAN frame for overlapping CAN filters"

thanks for detecting this issue!

> 514ac99c64b22d83b52dfee3b8becaa69a92bc4a introduces a frame equality
> check. Since the sk_buff pointer is not sufficient to do this (buffers
> are reused), the check also compares time stamps.
> In short: pointer+time stamp was assumed as unique key to a specific
> frame.
> The problem with this is, that the time stamp is an optional property
> and not set per default.
> In our case (flexcan) the time stamp is always zero, so the equality
> check is reduced to equality of buffer pointers, resulting in a lot of
> dropped frames.

The question is why your system did not generate a timestamp at the time of
skb reception.

Usually when netif_rx(), netif_rx_ni() is invoked the timestamp is set in the
following reception process.

flexcan.c only uses netif_receive_skb() - but all theses functions set the
timestamp

	net_timestamp_check(netdev_tstamp_prequeue, skb);

depending on netdev_tstamp_prequeue which is configured by

/proc/sys/net/core/netdev_tstamp_prequeue

See the idea of netdev_tstamp_prequeue here:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/net/core/sysctl_net_core.c?id=3b098e2d7c693796cc4dffb07caa249fc0f70771

Can you tell me the output of /proc/sys/net/core/netdev_tstamp_prequeue on
your machine?

If it's not '1' can you set it to '1' for a test?

> 
> Possible solutions I thought of:
>  1. Every driver has to set a time stamp
>     (possibly error prone and hard to enforce?)
>  2. Change the equality check
>  3. Fulfil the requirements of the equality check by setting a
>     time stamp per default.
> 
> This patch fixes the problem with solution 3. A time stamp is set at
> time of allocation in alloc_can_skb.

That's a feasible way if won't find a better way to make sure the timestamps
are generally set before the skb is processed in the NET_RX softirq.

> The time stamp may be overridden later, but the function of the equality
> check is ensured.
> 
> I'm not really deep in linux network subsystem, so there may exists
> more elegant solutions for the problem.
> 
> Signed-off-by: Manfred Schlaegl <manfred.schlaegl@gmx.at>
> ---
>  drivers/net/can/dev.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/can/dev.c b/drivers/net/can/dev.c
> index b0f6924..282e2e7 100644
> --- a/drivers/net/can/dev.c
> +++ b/drivers/net/can/dev.c
> @@ -575,6 +575,7 @@ struct sk_buff *alloc_can_skb(struct net_device *dev, struct can_frame **cf)
>  	if (unlikely(!skb))
>  		return NULL;
>  
> +	__net_timestamp(skb);
>  	skb->protocol = htons(ETH_P_CAN);
>  	skb->pkt_type = PACKET_BROADCAST;
>  	skb->ip_summed = CHECKSUM_UNNECESSARY;
> 

Please check the netdev_tstamp_prequeue value first.

If we would need solution 3 the __net_timestamp(skb) should be placed in
alloc_canfd_skb() too.

Thanks again for your investigation!

Best regards,
Oliver

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
@ 2015-06-20 22:42   ` Oliver Hartkopp
  0 siblings, 0 replies; 21+ messages in thread
From: Oliver Hartkopp @ 2015-06-20 22:42 UTC (permalink / raw)
  To: Manfred Schlaegl, Wolfgang Grandegger, Marc Kleine-Budde
  Cc: linux-can, netdev, linux-kernel, Manfred Schlaegl, David S. Miller

Hello Manfred,

On 06/20/2015 07:21 PM, Manfred Schlaegl wrote:
> I've detected a massive loss of can frames on i.MX6 using flexcan
> driver with 4.1-rc8 and tracked this down to following commit:
> 514ac99c64b22d83b52dfee3b8becaa69a92bc4a - "can: fix multiple delivery
> of a single CAN frame for overlapping CAN filters"

thanks for detecting this issue!

> 514ac99c64b22d83b52dfee3b8becaa69a92bc4a introduces a frame equality
> check. Since the sk_buff pointer is not sufficient to do this (buffers
> are reused), the check also compares time stamps.
> In short: pointer+time stamp was assumed as unique key to a specific
> frame.
> The problem with this is, that the time stamp is an optional property
> and not set per default.
> In our case (flexcan) the time stamp is always zero, so the equality
> check is reduced to equality of buffer pointers, resulting in a lot of
> dropped frames.

The question is why your system did not generate a timestamp at the time of
skb reception.

Usually when netif_rx(), netif_rx_ni() is invoked the timestamp is set in the
following reception process.

flexcan.c only uses netif_receive_skb() - but all theses functions set the
timestamp

	net_timestamp_check(netdev_tstamp_prequeue, skb);

depending on netdev_tstamp_prequeue which is configured by

/proc/sys/net/core/netdev_tstamp_prequeue

See the idea of netdev_tstamp_prequeue here:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/net/core/sysctl_net_core.c?id=3b098e2d7c693796cc4dffb07caa249fc0f70771

Can you tell me the output of /proc/sys/net/core/netdev_tstamp_prequeue on
your machine?

If it's not '1' can you set it to '1' for a test?

> 
> Possible solutions I thought of:
>  1. Every driver has to set a time stamp
>     (possibly error prone and hard to enforce?)
>  2. Change the equality check
>  3. Fulfil the requirements of the equality check by setting a
>     time stamp per default.
> 
> This patch fixes the problem with solution 3. A time stamp is set at
> time of allocation in alloc_can_skb.

That's a feasible way if won't find a better way to make sure the timestamps
are generally set before the skb is processed in the NET_RX softirq.

> The time stamp may be overridden later, but the function of the equality
> check is ensured.
> 
> I'm not really deep in linux network subsystem, so there may exists
> more elegant solutions for the problem.
> 
> Signed-off-by: Manfred Schlaegl <manfred.schlaegl@gmx.at>
> ---
>  drivers/net/can/dev.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/can/dev.c b/drivers/net/can/dev.c
> index b0f6924..282e2e7 100644
> --- a/drivers/net/can/dev.c
> +++ b/drivers/net/can/dev.c
> @@ -575,6 +575,7 @@ struct sk_buff *alloc_can_skb(struct net_device *dev, struct can_frame **cf)
>  	if (unlikely(!skb))
>  		return NULL;
>  
> +	__net_timestamp(skb);
>  	skb->protocol = htons(ETH_P_CAN);
>  	skb->pkt_type = PACKET_BROADCAST;
>  	skb->ip_summed = CHECKSUM_UNNECESSARY;
> 

Please check the netdev_tstamp_prequeue value first.

If we would need solution 3 the __net_timestamp(skb) should be placed in
alloc_canfd_skb() too.

Thanks again for your investigation!

Best regards,
Oliver

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-06-20 22:42   ` Oliver Hartkopp
@ 2015-06-22  9:48     ` Manfred Schlaegl
  -1 siblings, 0 replies; 21+ messages in thread
From: Manfred Schlaegl @ 2015-06-22  9:48 UTC (permalink / raw)
  To: Oliver Hartkopp, Wolfgang Grandegger, Marc Kleine-Budde
  Cc: linux-can, netdev, linux-kernel, Manfred Schlaegl, David S. Miller

Hello Oliver,

On 2015-06-21 00:42, Oliver Hartkopp wrote:

>> 514ac99c64b22d83b52dfee3b8becaa69a92bc4a introduces a frame equality
>> check. Since the sk_buff pointer is not sufficient to do this (buffers
>> are reused), the check also compares time stamps.
>> In short: pointer+time stamp was assumed as unique key to a specific
>> frame.
>> The problem with this is, that the time stamp is an optional property
>> and not set per default.
>> In our case (flexcan) the time stamp is always zero, so the equality
>> check is reduced to equality of buffer pointers, resulting in a lot of
>> dropped frames.
> 
> The question is why your system did not generate a timestamp at the time of
> skb reception.
> 
> Usually when netif_rx(), netif_rx_ni() is invoked the timestamp is set in the
> following reception process.
> 
> flexcan.c only uses netif_receive_skb() - but all theses functions set the
> timestamp
> 
> 	net_timestamp_check(netdev_tstamp_prequeue, skb);
> 
> depending on netdev_tstamp_prequeue which is configured by
> 
> /proc/sys/net/core/netdev_tstamp_prequeue
> 
> See the idea of netdev_tstamp_prequeue here:
> 
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/net/core/sysctl_net_core.c?id=3b098e2d7c693796cc4dffb07caa249fc0f70771
> 
Thank you for the background information!
I've also noticed your patch [PATCH - regression 4.1-rc8] can: fix loss of CAN frames in raw_rcv

> Can you tell me the output of /proc/sys/net/core/netdev_tstamp_prequeue on
> your machine?

/proc/sys/net/core/netdev_tstamp_prequeue is set to 1 (unmodified, default)

I tried to dig a little deeper in timestamping:
 1. (net/core/dev.c) I found that static_key_false(&netstamp_needed) is always 0, resulting that the timestamp is never set by net_timestamp_check in netif_receive_skb_internal.
 2. (net/core/dev.c) static_key_false(&netstamp_needed) is 0 because net_enable_timestamp is never called.
 3. (net/core/sock.c) net_enable_timestamp is never called because SK_FLAGS_TIMESTAMP is not set
 4. (net/core/sock.c) SK_FLAGS_TIMESTAMP is not set because neither of SOCK_TIMESTAMP or SOCK_TIMESTAMPING_RX_SOFTWARE is set 
 5. (net/core/sock.c) SOCK_TIMESTAMP or SOCK_TIMESTAMPING_RX_SOFTWARE is not set because timestamping is an optional feature (according to http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/timestamping.txt?id=b953c0d234bc72e8489d3bf51a276c5c4ec85345) not enabled in my use case (even if netdev_tstamp_prequeue is set to 1)

So the original assumption for the was correct: The correctness of the skb equality check depends on a feature that is not enabled by default (respectively user configurable).
Do you agree with this?

> 
> Thanks again for your investigation!
Sure!

Best regards,
Manfred

--
To unsubscribe from this list: send the line "unsubscribe linux-can" in

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
@ 2015-06-22  9:48     ` Manfred Schlaegl
  0 siblings, 0 replies; 21+ messages in thread
From: Manfred Schlaegl @ 2015-06-22  9:48 UTC (permalink / raw)
  To: Oliver Hartkopp, Wolfgang Grandegger, Marc Kleine-Budde
  Cc: linux-can, netdev, linux-kernel, Manfred Schlaegl, David S. Miller

Hello Oliver,

On 2015-06-21 00:42, Oliver Hartkopp wrote:

>> 514ac99c64b22d83b52dfee3b8becaa69a92bc4a introduces a frame equality
>> check. Since the sk_buff pointer is not sufficient to do this (buffers
>> are reused), the check also compares time stamps.
>> In short: pointer+time stamp was assumed as unique key to a specific
>> frame.
>> The problem with this is, that the time stamp is an optional property
>> and not set per default.
>> In our case (flexcan) the time stamp is always zero, so the equality
>> check is reduced to equality of buffer pointers, resulting in a lot of
>> dropped frames.
> 
> The question is why your system did not generate a timestamp at the time of
> skb reception.
> 
> Usually when netif_rx(), netif_rx_ni() is invoked the timestamp is set in the
> following reception process.
> 
> flexcan.c only uses netif_receive_skb() - but all theses functions set the
> timestamp
> 
> 	net_timestamp_check(netdev_tstamp_prequeue, skb);
> 
> depending on netdev_tstamp_prequeue which is configured by
> 
> /proc/sys/net/core/netdev_tstamp_prequeue
> 
> See the idea of netdev_tstamp_prequeue here:
> 
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/net/core/sysctl_net_core.c?id=3b098e2d7c693796cc4dffb07caa249fc0f70771
> 
Thank you for the background information!
I've also noticed your patch [PATCH - regression 4.1-rc8] can: fix loss of CAN frames in raw_rcv

> Can you tell me the output of /proc/sys/net/core/netdev_tstamp_prequeue on
> your machine?

/proc/sys/net/core/netdev_tstamp_prequeue is set to 1 (unmodified, default)

I tried to dig a little deeper in timestamping:
 1. (net/core/dev.c) I found that static_key_false(&netstamp_needed) is always 0, resulting that the timestamp is never set by net_timestamp_check in netif_receive_skb_internal.
 2. (net/core/dev.c) static_key_false(&netstamp_needed) is 0 because net_enable_timestamp is never called.
 3. (net/core/sock.c) net_enable_timestamp is never called because SK_FLAGS_TIMESTAMP is not set
 4. (net/core/sock.c) SK_FLAGS_TIMESTAMP is not set because neither of SOCK_TIMESTAMP or SOCK_TIMESTAMPING_RX_SOFTWARE is set 
 5. (net/core/sock.c) SOCK_TIMESTAMP or SOCK_TIMESTAMPING_RX_SOFTWARE is not set because timestamping is an optional feature (according to http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/timestamping.txt?id=b953c0d234bc72e8489d3bf51a276c5c4ec85345) not enabled in my use case (even if netdev_tstamp_prequeue is set to 1)

So the original assumption for the was correct: The correctness of the skb equality check depends on a feature that is not enabled by default (respectively user configurable).
Do you agree with this?

> 
> Thanks again for your investigation!
Sure!

Best regards,
Manfred

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-06-22  9:48     ` Manfred Schlaegl
@ 2015-06-22 10:24       ` Oliver Hartkopp
  -1 siblings, 0 replies; 21+ messages in thread
From: Oliver Hartkopp @ 2015-06-22 10:24 UTC (permalink / raw)
  To: Manfred Schlaegl, Wolfgang Grandegger, Marc Kleine-Budde
  Cc: linux-can, netdev, linux-kernel, Manfred Schlaegl, David S. Miller

Hello Manfred,

On 22.06.2015 11:48, Manfred Schlaegl wrote:

>> Can you tell me the output of /proc/sys/net/core/netdev_tstamp_prequeue on
>> your machine?
>
> /proc/sys/net/core/netdev_tstamp_prequeue is set to 1 (unmodified, default)
>
> I tried to dig a little deeper in timestamping:
>   1. (net/core/dev.c) I found that static_key_false(&netstamp_needed) is always 0, resulting that the timestamp is never set by net_timestamp_check in netif_receive_skb_internal.
>   2. (net/core/dev.c) static_key_false(&netstamp_needed) is 0 because net_enable_timestamp is never called.
>   3. (net/core/sock.c) net_enable_timestamp is never called because SK_FLAGS_TIMESTAMP is not set
>   4. (net/core/sock.c) SK_FLAGS_TIMESTAMP is not set because neither of SOCK_TIMESTAMP or SOCK_TIMESTAMPING_RX_SOFTWARE is set
>   5. (net/core/sock.c) SOCK_TIMESTAMP or SOCK_TIMESTAMPING_RX_SOFTWARE is not set because timestamping is an optional feature (according to http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/timestamping.txt?id=b953c0d234bc72e8489d3bf51a276c5c4ec85345) not enabled in my use case (even if netdev_tstamp_prequeue is set to 1)
>
> So the original assumption for the was correct: The correctness of the skb equality check depends on a feature that is not enabled by default (respectively user configurable).
> Do you agree with this?

Yes.

But the point becomes an issue when there's no userspace application that 
requires timestamps.

I did my testing wile having at least one "candump" instances running, which 
enables timestamping. So when there's no one requesting timestamps the check 
in can_rcv does not perform properly.

Therefor my patch grabs your idea to set the timestamps for CAN skbs 
unconditionally. But there were some more places in the code where we need to 
take care about that.

Regards,
Oliver


--
To unsubscribe from this list: send the line "unsubscribe linux-can" in

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
@ 2015-06-22 10:24       ` Oliver Hartkopp
  0 siblings, 0 replies; 21+ messages in thread
From: Oliver Hartkopp @ 2015-06-22 10:24 UTC (permalink / raw)
  To: Manfred Schlaegl, Wolfgang Grandegger, Marc Kleine-Budde
  Cc: linux-can, netdev, linux-kernel, Manfred Schlaegl, David S. Miller

Hello Manfred,

On 22.06.2015 11:48, Manfred Schlaegl wrote:

>> Can you tell me the output of /proc/sys/net/core/netdev_tstamp_prequeue on
>> your machine?
>
> /proc/sys/net/core/netdev_tstamp_prequeue is set to 1 (unmodified, default)
>
> I tried to dig a little deeper in timestamping:
>   1. (net/core/dev.c) I found that static_key_false(&netstamp_needed) is always 0, resulting that the timestamp is never set by net_timestamp_check in netif_receive_skb_internal.
>   2. (net/core/dev.c) static_key_false(&netstamp_needed) is 0 because net_enable_timestamp is never called.
>   3. (net/core/sock.c) net_enable_timestamp is never called because SK_FLAGS_TIMESTAMP is not set
>   4. (net/core/sock.c) SK_FLAGS_TIMESTAMP is not set because neither of SOCK_TIMESTAMP or SOCK_TIMESTAMPING_RX_SOFTWARE is set
>   5. (net/core/sock.c) SOCK_TIMESTAMP or SOCK_TIMESTAMPING_RX_SOFTWARE is not set because timestamping is an optional feature (according to http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/timestamping.txt?id=b953c0d234bc72e8489d3bf51a276c5c4ec85345) not enabled in my use case (even if netdev_tstamp_prequeue is set to 1)
>
> So the original assumption for the was correct: The correctness of the skb equality check depends on a feature that is not enabled by default (respectively user configurable).
> Do you agree with this?

Yes.

But the point becomes an issue when there's no userspace application that 
requires timestamps.

I did my testing wile having at least one "candump" instances running, which 
enables timestamping. So when there's no one requesting timestamps the check 
in can_rcv does not perform properly.

Therefor my patch grabs your idea to set the timestamps for CAN skbs 
unconditionally. But there were some more places in the code where we need to 
take care about that.

Regards,
Oliver

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
       [not found]       ` <5588E6FB.5040903@optusnet.com.au>
@ 2015-06-23  8:01         ` Oliver Hartkopp
  2015-06-24  2:13           ` Tom Evans
  0 siblings, 1 reply; 21+ messages in thread
From: Oliver Hartkopp @ 2015-06-23  8:01 UTC (permalink / raw)
  To: tom_usenet; +Cc: linux-can

Hi Tom,

On 23.06.2015 06:56, Tom Evans wrote:
> On 22/06/15 20:24, Oliver Hartkopp wrote:
>> But the point becomes an issue when there's no userspace application that
>> requires timestamps.
>>
>> I did my testing wile having at least one "candump" instances running, which
>> enables timestamping. So when there's no one requesting timestamps the check
>> in can_rcv does not perform properly.
>>
>> Therefor my patch grabs your idea to set the timestamps for CAN skbs
>> unconditionally. But there were some more places in the code where we need to
>> take care about that.
>
> The original patch contains:
>
> +    /* eliminate multiple filter matches for the same skb */
> +    if (this_cpu_ptr(ro->uniq)->skb == oskb &&
> +        ktime_equal(this_cpu_ptr(ro->uniq)->tstamp, oskb->tstamp)) {
> +        return;
> +    } else {
> +        this_cpu_ptr(ro->uniq)->skb = oskb;
> +        this_cpu_ptr(ro->uniq)->tstamp = oskb->tstamp;
> +    }
> +
>
> Mightn't it be more robust if the above check didn't filter out a packet if
> either of the timestamps weren't set (were zero)? Assuming the structures are
> zeroed before this gets set.

Yes. But the source just moved on with the introduction of joined filters.
And when you don't have a unique skb detection the joint filter mechanic fails 
totally.

An alternative to force the timestamp to be set would be to add another 
'counter' to struct can_skb_priv which is just increased by the driver with 
every received CAN frame.

But I wonder if its worth the effort in opposite to just enable timestamping 
in every skb. What do you think?

Regards,
Oliver

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-06-23  8:01         ` Oliver Hartkopp
@ 2015-06-24  2:13           ` Tom Evans
  2015-06-24 19:56             ` Oliver Hartkopp
  0 siblings, 1 reply; 21+ messages in thread
From: Tom Evans @ 2015-06-24  2:13 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: linux-can

On 23/06/15 18:01, Oliver Hartkopp wrote:
> Hi Tom,
>
> On 23.06.2015 06:56, Tom Evans wrote:
>> On 22/06/15 20:24, Oliver Hartkopp wrote:
>>> But the point becomes an issue when there's no userspace application that
>>> requires timestamps.
>>>
>>> I did my testing wile having at least one "candump" instances running, which
>>> enables timestamping. So when there's no one requesting timestamps the check
>>> in can_rcv does not perform properly.
>>>
>>> Therefor my patch grabs your idea to set the timestamps for CAN skbs
>>> unconditionally. But there were some more places in the code where we need to
>>> take care about that.
>>
>> The original patch contains:
>>
>> +    /* eliminate multiple filter matches for the same skb */
>> +    if (this_cpu_ptr(ro->uniq)->skb == oskb &&
>> +        ktime_equal(this_cpu_ptr(ro->uniq)->tstamp, oskb->tstamp)) {
>> +        return;
>> +    } else {
>> +        this_cpu_ptr(ro->uniq)->skb = oskb;
>> +        this_cpu_ptr(ro->uniq)->tstamp = oskb->tstamp;
>> +    }
>> +
>>
>> Mightn't it be more robust if the above check didn't filter out a packet if
>> either of the timestamps weren't set (were zero)? Assuming the structures are
>> zeroed before this gets set.
>
> Yes. But the source just moved on with the introduction of joined filters.
> And when you don't have a unique skb detection the joint filter mechanic fails
> totally.
>
> An alternative to force the timestamp to be set would be to add another
> 'counter' to struct can_skb_priv which is just increased by the driver with
> every received CAN frame.
>
> But I wonder if its worth the effort in opposite to just enable timestamping
> in every skb. What do you think?

 From a programming standpoint, using a "timestamp" as a "unique identifier" 
is opaque and a poor design. If the SKB requires a "unique identifier" then it 
should be provided with something that has that description and purpose.

The timestamp may be optional for a reason. Maybe some platforms simply don't 
provide a high resolution timestamp, and/or the timestamp comes with a high 
overhead. Can you guarantee that all current, future and past platforms 
support a low overhead and high resolution timestamp?

A new "feature" probably shouldn't change something that was optional to being 
compulsory. Would all existing CAN drivers need to be rewritten to add this 
timestamping, or is the timestamp added in higher (and more generic) layers, 
and in the one place?

Tom


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-06-24  2:13           ` Tom Evans
@ 2015-06-24 19:56             ` Oliver Hartkopp
  2015-06-25  8:32               ` [BULK]Re: " Stephane Grosjean
  0 siblings, 1 reply; 21+ messages in thread
From: Oliver Hartkopp @ 2015-06-24 19:56 UTC (permalink / raw)
  To: tom_usenet; +Cc: linux-can

On 24.06.2015 04:13, Tom Evans wrote:
> On 23/06/15 18:01, Oliver Hartkopp wrote:
>> An alternative to force the timestamp to be set would be to add another
>> 'counter' to struct can_skb_priv which is just increased by the driver with
>> every received CAN frame.
>>
>> But I wonder if its worth the effort in opposite to just enable timestamping
>> in every skb. What do you think?
>
>  From a programming standpoint, using a "timestamp" as a "unique identifier"
> is opaque and a poor design. If the SKB requires a "unique identifier" then it
> should be provided with something that has that description and purpose.

Hey Tom,

> The timestamp may be optional for a reason. Maybe some platforms simply don't
> provide a high resolution timestamp, and/or the timestamp comes with a high
> overhead. Can you guarantee that all current, future and past platforms
> support a low overhead and high resolution timestamp?

I have used the CAN stuff on two PowerPC, one ARM and several x86 platforms - 
always with enabled timestamping - without any problems in performance.

> A new "feature" probably shouldn't change something that was optional to being
> compulsory. Would all existing CAN drivers need to be rewritten to add this
> timestamping, or is the timestamp added in higher (and more generic) layers,
> and in the one place?

The point is that out-of-tree drivers (like the PEAK-System driver) would need 
to set the timestamp too to take advantage of the joint filter feature.
The mainline drivers are adapted for now with the stable patch

http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=36c01245eb8046c16eee6431e7dbfbb302635fa8

But I felt not very comfortable with this fix. The point is that I have to 
take care that the processing of an incoming skb in can_receive() in af_can.c 
is done in 'one step' with an unique skb. So marking the skb for the joint 
filter feature in raw_rcv() can probably be done in can_receive().

I'm looking for a SMP save solution for this - to finally revert the fix 
above. So thanks for the 'wake-up call' by naming the timestamping solution  a 
'poor design'  ;-)

Best regards,
Oliver

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BULK]Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-06-24 19:56             ` Oliver Hartkopp
@ 2015-06-25  8:32               ` Stephane Grosjean
  2015-06-25  9:36                 ` Oliver Hartkopp
  0 siblings, 1 reply; 21+ messages in thread
From: Stephane Grosjean @ 2015-06-25  8:32 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: tom_usenet, linux-can

Hi everyone,

I follow this thread carefully and, with your permission, I'd like to 
speak in.

As we always help our customers in using the mainline drivers products 
line too, I just wanted to inform you that I currently have one who 
tells that he's facing some loss of frames with the PEAK-System PCAN-USB 
adapter, in case of "relatively"  high bus load... He's running two 
Kernels (3.19 and the last 4.1 patched with this recent "fix"). At the 
moment, he says he always notes some frame leakage, especially when he 
(for example) "resizes windows on his desktop"...

The curious things are:

0 - frames are lost when running his socket-can application over the 
mainline peak-usb driver
1 - frames are lost when running his socket-can application over our 
"pcan" driver (built in so-called "netdev" mode, that is, rx frames are 
forwarded to socket-can using netif_rx())
2 - *but* no frame are lost when running the "ioctl()" test applications 
above our "pcan" driver (built in so-called "chardev" mode, that is, rx 
frames are stored into internal fifos before being given to applications)
3 - in both cases #0 and #1, "can0" network stats don't show any 
rx_dropped nor rx_fifo_errors errors at all (even if sometimes, he says 
he notes some "real" rx_over_errors  raised up by the peak-usb driver).

I don't succeed to reproduce this issue with my "old" 3.16 Kernel at the 
moment. Before going on with installing the 3.19 Kernel, could the above 
"leakage" be explained by the last changes made in linux-can?

Many thanks for your help,

Regards,

Stéphane

Le 24/06/2015 21:56, Oliver Hartkopp a écrit :
> On 24.06.2015 04:13, Tom Evans wrote:
>> On 23/06/15 18:01, Oliver Hartkopp wrote:
>>> An alternative to force the timestamp to be set would be to add another
>>> 'counter' to struct can_skb_priv which is just increased by the 
>>> driver with
>>> every received CAN frame.
>>>
>>> But I wonder if its worth the effort in opposite to just enable 
>>> timestamping
>>> in every skb. What do you think?
>>
>>  From a programming standpoint, using a "timestamp" as a "unique 
>> identifier"
>> is opaque and a poor design. If the SKB requires a "unique 
>> identifier" then it
>> should be provided with something that has that description and purpose.
>
> Hey Tom,
>
>> The timestamp may be optional for a reason. Maybe some platforms 
>> simply don't
>> provide a high resolution timestamp, and/or the timestamp comes with 
>> a high
>> overhead. Can you guarantee that all current, future and past platforms
>> support a low overhead and high resolution timestamp?
>
> I have used the CAN stuff on two PowerPC, one ARM and several x86 
> platforms - always with enabled timestamping - without any problems in 
> performance.
>
>> A new "feature" probably shouldn't change something that was optional 
>> to being
>> compulsory. Would all existing CAN drivers need to be rewritten to 
>> add this
>> timestamping, or is the timestamp added in higher (and more generic) 
>> layers,
>> and in the one place?
>
> The point is that out-of-tree drivers (like the PEAK-System driver) 
> would need to set the timestamp too to take advantage of the joint 
> filter feature.
> The mainline drivers are adapted for now with the stable patch
>
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=36c01245eb8046c16eee6431e7dbfbb302635fa8 
>
>
> But I felt not very comfortable with this fix. The point is that I 
> have to take care that the processing of an incoming skb in 
> can_receive() in af_can.c is done in 'one step' with an unique skb. So 
> marking the skb for the joint filter feature in raw_rcv() can probably 
> be done in can_receive().
>
> I'm looking for a SMP save solution for this - to finally revert the 
> fix above. So thanks for the 'wake-up call' by naming the timestamping 
> solution  a 'poor design'  ;-)
>
> Best regards,
> Oliver
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-can" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
PEAK-System Technik GmbH
Sitz der Gesellschaft Darmstadt
Handelsregister Darmstadt HRB 9183 
Geschaeftsfuehrung: Alexander Gach, Uwe Wilhelm
--

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BULK]Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-06-25  8:32               ` [BULK]Re: " Stephane Grosjean
@ 2015-06-25  9:36                 ` Oliver Hartkopp
  2015-06-29 16:13                   ` Oliver Hartkopp
  0 siblings, 1 reply; 21+ messages in thread
From: Oliver Hartkopp @ 2015-06-25  9:36 UTC (permalink / raw)
  To: Stephane Grosjean; +Cc: linux-can, tom_usenet



> Stephane Grosjean <s.grosjean@peak-system.com> hat am 25. Juni 2015 um 10:32
> geschrieben:
> 
> 
> Hi everyone,

Hi Stephane,

sorry for the probably bad formated mail - it is via webmail ...

> 
> I follow this thread carefully and, with your permission, I'd like to 
> speak in.
> 
> As we always help our customers in using the mainline drivers products 
> line too, I just wanted to inform you that I currently have one who 
> tells that he's facing some loss of frames with the PEAK-System PCAN-USB 
> adapter, in case of "relatively"  high bus load... He's running two 
> Kernels (3.19 and the last 4.1 patched with this recent "fix"). At the 
> moment, he says he always notes some frame leakage, especially when he 
> (for example) "resizes windows on his desktop"...

The patch

http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit?id=36c01245eb8046c16eee6431e7dbfbb302635fa8

especially adds

   __net_timestamp(skb);

after dev_alloc_skb().

So you need to add this in pcan_netdev_rx() in pcan_netdev.c analogue to the
patch above, e.g. directly after the successful dev_alloc_skb().

Can you check, if this fixes the issue with the (patched) 4.1 kernel?

Many thanks,
Oliver

> 
> The curious things are:
> 
> 0 - frames are lost when running his socket-can application over the 
> mainline peak-usb driver
> 1 - frames are lost when running his socket-can application over our 
> "pcan" driver (built in so-called "netdev" mode, that is, rx frames are 
> forwarded to socket-can using netif_rx())
> 2 - *but* no frame are lost when running the "ioctl()" test applications 
> above our "pcan" driver (built in so-called "chardev" mode, that is, rx 
> frames are stored into internal fifos before being given to applications)
> 3 - in both cases #0 and #1, "can0" network stats don't show any 
> rx_dropped nor rx_fifo_errors errors at all (even if sometimes, he says 
> he notes some "real" rx_over_errors  raised up by the peak-usb driver).
> 
> I don't succeed to reproduce this issue with my "old" 3.16 Kernel at the 
> moment. Before going on with installing the 3.19 Kernel, could the above 
> "leakage" be explained by the last changes made in linux-can?
> 
> Many thanks for your help,
> 
> Regards,
> 
> Stéphane
> 
> Le 24/06/2015 21:56, Oliver Hartkopp a écrit :
> > On 24.06.2015 04:13, Tom Evans wrote:
> >> On 23/06/15 18:01, Oliver Hartkopp wrote:
> >>> An alternative to force the timestamp to be set would be to add another
> >>> 'counter' to struct can_skb_priv which is just increased by the 
> >>> driver with
> >>> every received CAN frame.
> >>>
> >>> But I wonder if its worth the effort in opposite to just enable 
> >>> timestamping
> >>> in every skb. What do you think?
> >>
> >>  From a programming standpoint, using a "timestamp" as a "unique 
> >> identifier"
> >> is opaque and a poor design. If the SKB requires a "unique 
> >> identifier" then it
> >> should be provided with something that has that description and purpose.
> >
> > Hey Tom,
> >
> >> The timestamp may be optional for a reason. Maybe some platforms 
> >> simply don't
> >> provide a high resolution timestamp, and/or the timestamp comes with 
> >> a high
> >> overhead. Can you guarantee that all current, future and past platforms
> >> support a low overhead and high resolution timestamp?
> >
> > I have used the CAN stuff on two PowerPC, one ARM and several x86 
> > platforms - always with enabled timestamping - without any problems in 
> > performance.
> >
> >> A new "feature" probably shouldn't change something that was optional 
> >> to being
> >> compulsory. Would all existing CAN drivers need to be rewritten to 
> >> add this
> >> timestamping, or is the timestamp added in higher (and more generic) 
> >> layers,
> >> and in the one place?
> >
> > The point is that out-of-tree drivers (like the PEAK-System driver) 
> > would need to set the timestamp too to take advantage of the joint 
> > filter feature.
> > The mainline drivers are adapted for now with the stable patch
> >
> > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=36c01245eb8046c16eee6431e7dbfbb302635fa8
> > 
> >
> >
> > But I felt not very comfortable with this fix. The point is that I 
> > have to take care that the processing of an incoming skb in 
> > can_receive() in af_can.c is done in 'one step' with an unique skb. So 
> > marking the skb for the joint filter feature in raw_rcv() can probably 
> > be done in can_receive().
> >
> > I'm looking for a SMP save solution for this - to finally revert the 
> > fix above. So thanks for the 'wake-up call' by naming the timestamping 
> > solution  a 'poor design'  ;-)
> >
> > Best regards,
> > Oliver
> >
> > -- 
> > To unsubscribe from this list: send the line "unsubscribe linux-can" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> PEAK-System Technik GmbH
> Sitz der Gesellschaft Darmstadt
> Handelsregister Darmstadt HRB 9183 
> Geschaeftsfuehrung: Alexander Gach, Uwe Wilhelm
> --
> --
> To unsubscribe from this list: send the line "unsubscribe linux-can" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BULK]Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-06-25  9:36                 ` Oliver Hartkopp
@ 2015-06-29 16:13                   ` Oliver Hartkopp
  2015-07-04 16:54                     ` Oliver Hartkopp
  0 siblings, 1 reply; 21+ messages in thread
From: Oliver Hartkopp @ 2015-06-29 16:13 UTC (permalink / raw)
  To: Stephane Grosjean; +Cc: linux-can, tom_usenet

Hi Stephane,

On 25.06.2015 11:36, Oliver Hartkopp wrote:
>> Stephane Grosjean <s.grosjean@peak-system.com>

 >> (..)
>> tells that he's facing some loss of frames with the PEAK-System PCAN-USB
>> adapter, in case of "relatively"  high bus load... He's running two
>> Kernels (3.19 and the last 4.1 patched with this recent "fix"). At the
>> moment, he says he always notes some frame leakage, especially when he
>> (for example) "resizes windows on his desktop"...
>
> The patch
>
> http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit?id=36c01245eb8046c16eee6431e7dbfbb302635fa8
>
> especially adds
>
>     __net_timestamp(skb);
>
> after dev_alloc_skb().
>
> So you need to add this in pcan_netdev_rx() in pcan_netdev.c analogue to the
> patch above, e.g. directly after the successful dev_alloc_skb().
>
> Can you check, if this fixes the issue with the (patched) 4.1 kernel?

Any feedback?

Today I checked for frame loss with kernel 4.0.6 + all relevant patches and 
had no frame loss with a PCAN USB pro at 100% bus load.

Regards,
Oliver

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BULK]Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-06-29 16:13                   ` Oliver Hartkopp
@ 2015-07-04 16:54                     ` Oliver Hartkopp
  2015-07-05  1:18                       ` Tom Evans
  2015-07-06  7:58                       ` [BULK]Re: " Stephane Grosjean
  0 siblings, 2 replies; 21+ messages in thread
From: Oliver Hartkopp @ 2015-07-04 16:54 UTC (permalink / raw)
  To: Stephane Grosjean, Marc Kleine-Budde
  Cc: linux-can, tom_usenet, Manfred Schlaegl

Hi Stephane,

On 29.06.2015 18:13, Oliver Hartkopp wrote:
> Hi Stephane,
>
> On 25.06.2015 11:36, Oliver Hartkopp wrote:
>>> Stephane Grosjean <s.grosjean@peak-system.com>
>>> (..)
>>> tells that he's facing some loss of frames with the PEAK-System PCAN-USB
>>> adapter, in case of "relatively"  high bus load... He's running two
>>> Kernels (3.19 and the last 4.1 patched with this recent "fix"). At the
>>> moment, he says he always notes some frame leakage, especially when he
>>> (for example) "resizes windows on his desktop"...

Is $COSTUMERs problem fixed now?

While testing the patches

can: fix loss of CAN frames in raw_rcv
http://git.kernel.org/cgit/linux/kernel/git/mkl/linux-can.git/commit/?h=testing&id=36c01245eb8046c16eee6431e7dbfbb302635fa8

and

can: replace timestamp as unique skb attribute
http://git.kernel.org/cgit/linux/kernel/git/mkl/linux-can.git/commit/?h=testing&id=a6ffebda241e513f6e839662d769b60b084d0957

I discovered an increase of out-of-order CAN frame receptions.
My setup is a core i7 with a PCAN USB and a PCAN USB pro connected to my full 
busload CAN source (1MBit/s, ~8008 frames/s).

With 3.16 the out-of-order CAN frame reception is 'relatively' seldom.
It's more with the PCAN USB and very few with PCAN USB pro.

With the latest 4.2-merge the effect is reproducible after some time (~10min).

Examples:

  drop detected: expected 204 received 212
  drop detected: expected 237 received 204
  drop detected: expected 212 received 237
  drop detected: expected 68 received 69
  drop detected: expected 70 received 68
  drop detected: expected 69 received 70

Obviously the frames do not get lost BUT are reordered.

newcounter = frame.data[0];
if (((counter + 1) & 0xFF) != newcounter) {
  printf(" drop detected: expected %u received %u\n", counter + 1, newcounter);
}
counter = newcounter;

I'm a bit confused as this effect seems to increase with Linux kernel version 
numbers. I removed all the changes for 4.1 in the 4.2-merge kernel to have a 
4.0 CAN subsystem (which was stable for a long time) - but even this patch 
removal did not fix the out of order reception.

Good thing: The latest changes in 4.1 (which patches) do not drop CAN frames 
by accident.

Bad thing: Why is the out-of-order reception increasing in Linux?

I /assume/ some changes with RPS (receive packet steering) in the network 
layer ...

Regards,
Oliver

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BULK]Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-07-04 16:54                     ` Oliver Hartkopp
@ 2015-07-05  1:18                       ` Tom Evans
  2015-07-05 18:21                         ` Oliver Hartkopp
  2015-07-06  7:58                       ` [BULK]Re: " Stephane Grosjean
  1 sibling, 1 reply; 21+ messages in thread
From: Tom Evans @ 2015-07-05  1:18 UTC (permalink / raw)
  To: Oliver Hartkopp, Stephane Grosjean, Marc Kleine-Budde
  Cc: linux-can, Manfred Schlaegl

On 5/07/2015 2:54 AM, Oliver Hartkopp wrote:
> Hi Stephane,
> ...
> While testing the patches
> ...
> I discovered an increase of out-of-order CAN frame receptions.
> My setup is a core i7 with a PCAN USB and a PCAN USB pro connected to my
> full busload CAN source (1MBit/s, ~8008 frames/s).

Out of order reception is guaranteed with some CAN hardware and driver 
software, such as the MCP2515 controller and Linux. The chip doesn't 
implement a FIFO, but has two receive buffers which can give message 
swaps quite easily. This can be fixed in the driver, but nobody has. 
Details here:

http://www.microchip.com/forums/m620741.aspx

The PCAN-USB uses an SJA1000 which doesn't have that problem. It has a 
64 byte FIFO. What is inside the PCAN-USB Pro isn't documented on their 
web page, but it may be faster or have less transaction overhead or 
latency or something. The PCAN-USB Pro is "no longer manufactured", the 
replacement "PCAN-USB Pro FD" has an FPGA controller.

 > It's more with the PCAN USB and very few with PCAN USB pro.

I'd guess the Pro has less overhead and can get messages over USB faster 
than the other one.

 > I'm a bit confused as this effect seems to increase with Linux kernel
 > version numbers.

As for the later kernels being worse, that looks like a simple case of 
"bloat", with them taking longer to get around to servicing the 
interrupts and reading the messages. Earlier ones are probably reading 
CAN messages one at a time, with each one getting through the stack 
before the next one arrives. Later kernels are probably reading them in 
bursts. Slower controllers (PCAN-USB) expose this sooner.

Can you drop back to a single core to see if this is a multicore 
problem? It will either fix it or make it worse if it is a loading/delay 
problem.

Reordering packets should be considered a serious bug as some CAN 
protocols can't handle this at all.

Regards,

Tom

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BULK]Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-07-05  1:18                       ` Tom Evans
@ 2015-07-05 18:21                         ` Oliver Hartkopp
  2015-07-06  5:44                           ` Oliver Hartkopp
  0 siblings, 1 reply; 21+ messages in thread
From: Oliver Hartkopp @ 2015-07-05 18:21 UTC (permalink / raw)
  To: Tom Evans, Stephane Grosjean, Marc Kleine-Budde
  Cc: linux-can, Manfred Schlaegl

On 05.07.2015 03:18, Tom Evans wrote:
> On 5/07/2015 2:54 AM, Oliver Hartkopp wrote:
>> Hi Stephane,
>> ...
>> While testing the patches
>> ...
>> I discovered an increase of out-of-order CAN frame receptions.
>> My setup is a core i7 with a PCAN USB and a PCAN USB pro connected to my
>> full busload CAN source (1MBit/s, ~8008 frames/s).
> 
> Out of order reception is guaranteed with some CAN hardware and driver 
> software, such as the MCP2515 controller and Linux. The chip doesn't implement 
> a FIFO, but has two receive buffers which can give message swaps quite easily. 
> This can be fixed in the driver, but nobody has. Details here:
> 
> http://www.microchip.com/forums/m620741.aspx
> 

Ugh. When reading "Out of order reception is guaranteed" I assumed a typo %-(
I don't have any MCP2515 hardware here. Any volunteers out there to fix that?

> The PCAN-USB uses an SJA1000 which doesn't have that problem. It has a 64 byte 
> FIFO. What is inside the PCAN-USB Pro isn't documented on their web page, but 
> it may be faster or have less transaction overhead or latency or something. 

IIRC it's some NXP LPC 17xx CPU with two CAN interfaces. Yes and it should be
faster than the SJA1000/C161 combo inside the PCAN-USB.

> The PCAN-USB Pro is "no longer manufactured", the replacement "PCAN-USB Pro 
> FD" has an FPGA controller.

I did my first tests with the PCAN-USB Pro FD - but to check the effect in
kernel versions < v4.0 I swapped over to the standard USB Pro due to the
missing FD support in older kernels.

>  > It's more with the PCAN USB and very few with PCAN USB pro.
> 
> I'd guess the Pro has less overhead and can get messages over USB faster than 
> the other one.
> 
>  > I'm a bit confused as this effect seems to increase with Linux kernel
>  > version numbers.
> 
> As for the later kernels being worse, that looks like a simple case of 
> "bloat", with them taking longer to get around to servicing the interrupts and 
> reading the messages. Earlier ones are probably reading CAN messages one at a 
> time, with each one getting through the stack before the next one arrives. 
> Later kernels are probably reading them in bursts. Slower controllers 
> (PCAN-USB) expose this sooner.
> 
> Can you drop back to a single core to see if this is a multicore problem? It 
> will either fix it or make it worse if it is a loading/delay problem.

Good idea!

I took my old 2005 Samsung X20 with 1.73GHz Pentium M and Xubuntu 14.04 ...

Both the stock Xubuntu 3.13 and the 4.1.1 did not have the out-of-order
issues. There were 'only' two sporadic drops with the PCAN-USB in more than
three hours of testing:

 drop detected: expected 224 received 18   (50 frames lost)
 drop detected: expected 251 received 252  (1 frame lost)

The drops emerged only on the PCAN USB interface in this case.

> Reordering packets should be considered a serious bug as some CAN protocols 
> can't handle this at all.

Yes definitely, e.g. ISO15765-2 will not work with out-of-order frames.

Going back to the latest 4.2-merge kernel with all the CAN fixes and the core
i7 SMP setup, I tried to assign the interrupt from the USB host controller to
a specific CPU using the documentation in

	https://www.kernel.org/doc/Documentation/IRQ-affinity.txt

My USB controller is on IRQ 28:

# cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       
  0:         23          0          0          0  IR-IO-APIC   2-edge      timer
(..)
 28:      10114      53480     566927    1634485  IR-PCI-MSI 327680-edge      xhci_hcd
(..)

With

# echo 1 > /proc/irq/28/smp_affinity

I assigned the IRQ 28 to CPU0 and it now looks like this:

# cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       
  0:         23          0          0          0  IR-IO-APIC   2-edge      timer
(..)
 28:    9072996      53480     766233    2125901  IR-PCI-MSI 327680-edge      xhci_hcd
(..)

and all the out-of-order receptions were totally gone! \o/

When nailing the CAN controller/driver interrupt to a specific CPU fixes the
out-of-order reception, we need to check whether we can do this by default.

New embedded systems like the imx6-quad will run into this problem otherwise.

Asking google about it lead to

http://stackoverflow.com/questions/11858487/change-smp-affinity-from-linux-device-driver

and finally to irq_set_affinity()

http://lxr.free-electrons.com/source/kernel/irq/manage.c#L182

which can be called by drivers from inside the kernel context.

Do you think this could be a valuable idea to follow?

Regards,
Oliver

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BULK]Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-07-05 18:21                         ` Oliver Hartkopp
@ 2015-07-06  5:44                           ` Oliver Hartkopp
  2015-07-06  6:50                             ` Tom Evans
  0 siblings, 1 reply; 21+ messages in thread
From: Oliver Hartkopp @ 2015-07-06  5:44 UTC (permalink / raw)
  To: Tom Evans, Stephane Grosjean, Marc Kleine-Budde
  Cc: linux-can, Manfred Schlaegl

/me again

On 05.07.2015 20:21, Oliver Hartkopp wrote:

> With
>
> # echo 1 > /proc/irq/28/smp_affinity
>
> I assigned the IRQ 28 to CPU0 and it now looks like this:
>
> # cat /proc/interrupts
>             CPU0       CPU1       CPU2       CPU3
>    0:         23          0          0          0  IR-IO-APIC   2-edge      timer
> (..)
>   28:    9072996      53480     766233    2125901  IR-PCI-MSI 327680-edge      xhci_hcd
> (..)
>
> and all the out-of-order receptions were totally gone! \o/
>

I ran the test for 11 hours now (3 x 8000 = 24.000 frames/s).

No single out-of-order frame this time - not even with the PCAN-USB.

Cheers,
Oliver

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BULK]Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-07-06  5:44                           ` Oliver Hartkopp
@ 2015-07-06  6:50                             ` Tom Evans
  2015-07-06 17:09                               ` Oliver Hartkopp
  0 siblings, 1 reply; 21+ messages in thread
From: Tom Evans @ 2015-07-06  6:50 UTC (permalink / raw)
  To: Oliver Hartkopp, Stephane Grosjean, Marc Kleine-Budde
  Cc: linux-can, Manfred Schlaegl

On 06/07/15 15:44, Oliver Hartkopp wrote:
> /me again
>
> On 05.07.2015 20:21, Oliver Hartkopp wrote:
>
>> With
>>
>> # echo 1 > /proc/irq/28/smp_affinity
>> ..
>> and all the out-of-order receptions were totally gone! \o/
>>
>
> I ran the test for 11 hours now (3 x 8000 = 24.000 frames/s).
>
> No single out-of-order frame this time - not even with the PCAN-USB.

I don't think you can force a specific "CPU Affinity Requirement" on this driver.

CPU Affinity may fix this problem, but I suspect there should be a more 
"standard" fix. This sort of problem (multiple CPUs fielding interrupts and 
getting them out of order) should show up elsewhere. It might even swap bytes 
from a UART.

There must be a "standard synchronization technique" of some sort that 
provides the proper sequencing here. Can anyone else reading this thread who 
knows if this is right please chime in?

Google finds:

http://www.alexonlinux.com/why-interrupt-affinity-with-multiple-cores-is-not-such-a-good-thing

     Moreover, assuming TCP connection object is properly
     protected using synchronization techniques, one of
     the cores will inevitably have to wait for the other,
     adding unnecessary delay to packet processing.

It seems to me that the "CAN connection object" is not "properly protected" in 
the existing driver. That CAN driver may not have had the "SMP Magic Wand" 
waved over it.

Tom

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BULK]Re: [BULK]Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-07-04 16:54                     ` Oliver Hartkopp
  2015-07-05  1:18                       ` Tom Evans
@ 2015-07-06  7:58                       ` Stephane Grosjean
  2015-07-06 17:14                         ` Oliver Hartkopp
  1 sibling, 1 reply; 21+ messages in thread
From: Stephane Grosjean @ 2015-07-06  7:58 UTC (permalink / raw)
  To: Oliver Hartkopp, Marc Kleine-Budde
  Cc: linux-can, tom_usenet, Manfred Schlaegl

Hi Oliver,

${CUSTOMER}'s problem was rather a "loss of frame" issue than an 
"out-of-order" issue... In fact, it seems it was a simple "rookie" error 
of non-testing errno == ENOBUF after having written on the CAN socket.

Regards,

Stéphane

Le 04/07/2015 18:54, Oliver Hartkopp a écrit :
> Hi Stephane,
>
> On 29.06.2015 18:13, Oliver Hartkopp wrote:
>> Hi Stephane,
>>
>> On 25.06.2015 11:36, Oliver Hartkopp wrote:
>>>> Stephane Grosjean <s.grosjean@peak-system.com>
>>>> (..)
>>>> tells that he's facing some loss of frames with the PEAK-System 
>>>> PCAN-USB
>>>> adapter, in case of "relatively"  high bus load... He's running two
>>>> Kernels (3.19 and the last 4.1 patched with this recent "fix"). At the
>>>> moment, he says he always notes some frame leakage, especially when he
>>>> (for example) "resizes windows on his desktop"...
>
> Is $COSTUMERs problem fixed now?
>
> While testing the patches
>
> can: fix loss of CAN frames in raw_rcv
> http://git.kernel.org/cgit/linux/kernel/git/mkl/linux-can.git/commit/?h=testing&id=36c01245eb8046c16eee6431e7dbfbb302635fa8 
>
>
> and
>
> can: replace timestamp as unique skb attribute
> http://git.kernel.org/cgit/linux/kernel/git/mkl/linux-can.git/commit/?h=testing&id=a6ffebda241e513f6e839662d769b60b084d0957 
>
>
> I discovered an increase of out-of-order CAN frame receptions.
> My setup is a core i7 with a PCAN USB and a PCAN USB pro connected to 
> my full busload CAN source (1MBit/s, ~8008 frames/s).
>
> With 3.16 the out-of-order CAN frame reception is 'relatively' seldom.
> It's more with the PCAN USB and very few with PCAN USB pro.
>
> With the latest 4.2-merge the effect is reproducible after some time 
> (~10min).
>
> Examples:
>
>  drop detected: expected 204 received 212
>  drop detected: expected 237 received 204
>  drop detected: expected 212 received 237
>  drop detected: expected 68 received 69
>  drop detected: expected 70 received 68
>  drop detected: expected 69 received 70
>
> Obviously the frames do not get lost BUT are reordered.
>
> newcounter = frame.data[0];
> if (((counter + 1) & 0xFF) != newcounter) {
>  printf(" drop detected: expected %u received %u\n", counter + 1, 
> newcounter);
> }
> counter = newcounter;
>
> I'm a bit confused as this effect seems to increase with Linux kernel 
> version numbers. I removed all the changes for 4.1 in the 4.2-merge 
> kernel to have a 4.0 CAN subsystem (which was stable for a long time) 
> - but even this patch removal did not fix the out of order reception.
>
> Good thing: The latest changes in 4.1 (which patches) do not drop CAN 
> frames by accident.
>
> Bad thing: Why is the out-of-order reception increasing in Linux?
>
> I /assume/ some changes with RPS (receive packet steering) in the 
> network layer ...
>
>
> Regards,
> Oliver
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-can" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
PEAK-System Technik GmbH
Sitz der Gesellschaft Darmstadt
Handelsregister Darmstadt HRB 9183 
Geschaeftsfuehrung: Alexander Gach, Uwe Wilhelm
--

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BULK]Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-07-06  6:50                             ` Tom Evans
@ 2015-07-06 17:09                               ` Oliver Hartkopp
  0 siblings, 0 replies; 21+ messages in thread
From: Oliver Hartkopp @ 2015-07-06 17:09 UTC (permalink / raw)
  To: tom_usenet, Stephane Grosjean, Marc Kleine-Budde
  Cc: linux-can, Manfred Schlaegl

Hi Tom,

On 06.07.2015 08:50, Tom Evans wrote:

> CPU Affinity may fix this problem, but I suspect there should be a more
> "standard" fix. This sort of problem (multiple CPUs fielding interrupts and
> getting them out of order) should show up elsewhere. It might even swap bytes
> from a UART.
>
> There must be a "standard synchronization technique" of some sort that
> provides the proper sequencing here. Can anyone else reading this thread who
> knows if this is right please chime in?
>
> Google finds:
>
> http://www.alexonlinux.com/why-interrupt-affinity-with-multiple-cores-is-not-such-a-good-thing

I found this relating kernel documentation:

https://www.kernel.org/doc/Documentation/networking/scaling.txt

It seems like we should take a look at receive packet steering (RPS) - and how 
other devices like the Intel IGB ethernet driver makes use of it.

Regards,
Oliver

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [BULK]Re: [BULK]Re: [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv
  2015-07-06  7:58                       ` [BULK]Re: " Stephane Grosjean
@ 2015-07-06 17:14                         ` Oliver Hartkopp
  0 siblings, 0 replies; 21+ messages in thread
From: Oliver Hartkopp @ 2015-07-06 17:14 UTC (permalink / raw)
  To: Stephane Grosjean, Marc Kleine-Budde
  Cc: linux-can, tom_usenet, Manfred Schlaegl

On 06.07.2015 09:58, Stephane Grosjean wrote:

> ${CUSTOMER}'s problem was rather a "loss of frame" issue than an
> "out-of-order" issue... In fact, it seems it was a simple "rookie" error of
> non-testing errno == ENOBUF after having written on the CAN socket.

Thanks for the explanation!

No more sleepless nights :-)

Regards,
Oliver

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2015-07-06 17:14 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-20 17:21 [PATCH] can: fix loss of frames due to wrong assumption in raw_rcv Manfred Schlaegl
2015-06-20 22:42 ` Oliver Hartkopp
2015-06-20 22:42   ` Oliver Hartkopp
2015-06-22  9:48   ` Manfred Schlaegl
2015-06-22  9:48     ` Manfred Schlaegl
2015-06-22 10:24     ` Oliver Hartkopp
2015-06-22 10:24       ` Oliver Hartkopp
     [not found]       ` <5588E6FB.5040903@optusnet.com.au>
2015-06-23  8:01         ` Oliver Hartkopp
2015-06-24  2:13           ` Tom Evans
2015-06-24 19:56             ` Oliver Hartkopp
2015-06-25  8:32               ` [BULK]Re: " Stephane Grosjean
2015-06-25  9:36                 ` Oliver Hartkopp
2015-06-29 16:13                   ` Oliver Hartkopp
2015-07-04 16:54                     ` Oliver Hartkopp
2015-07-05  1:18                       ` Tom Evans
2015-07-05 18:21                         ` Oliver Hartkopp
2015-07-06  5:44                           ` Oliver Hartkopp
2015-07-06  6:50                             ` Tom Evans
2015-07-06 17:09                               ` Oliver Hartkopp
2015-07-06  7:58                       ` [BULK]Re: " Stephane Grosjean
2015-07-06 17:14                         ` Oliver Hartkopp

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.