* [ISSUE][4.20.6] mlx5 and checksum failures
@ 2019-02-06 16:16 Ian Kumlien
  2019-02-06 17:18 ` Saeed Mahameed
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Kumlien @ 2019-02-06 16:16 UTC (permalink / raw)
  To: Linux Kernel Network Developers

Hi,

I'm hitting an issue that i think is fixed by the following patch,
i haven't verified it yet - but it looks like it should go on the
stable queue(?)

(And yes, I did look, and couldn't find it ;))

commit e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Mon Dec 3 22:14:04 2018 -0800

    net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames

    When an ethernet frame is padded to meet the minimum ethernet frame
    size, the padding octets are not covered by the hardware checksum.
    Fortunately the padding octets are usually zero's, which don't affect
    checksum. However, we have a switch which pads non-zero octets, this
    causes kernel hardware checksum fault repeatedly.

    Prior to:
    commit '88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE ...")'
    skb checksum was forced to be CHECKSUM_NONE when padding is detected.
    After it, we need to keep skb->csum updated, like what we do for RXFCS.
    However, fixing up CHECKSUM_COMPLETE requires to verify and parse IP
    headers, it is not worthy the effort as the packets are so small that
    CHECKSUM_COMPLETE can't save anything.

    Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Tariq Toukan <tariqt@mellanox.com>
    Cc: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
    Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 1d0bb5ff8c26..f86e4804e83e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -732,6 +732,8 @@ static u8 get_ip_proto(struct sk_buff *skb, int network_depth, __be16 proto)
                                            ((struct ipv6hdr *)ip_p)->nexthdr;
 }

+#define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)
+
 static inline void mlx5e_handle_csum(struct net_device *netdev,
                                     struct mlx5_cqe64 *cqe,
                                     struct mlx5e_rq *rq,
@@ -754,6 +756,17 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
        if (unlikely(test_bit(MLX5E_RQ_STATE_NO_CSUM_COMPLETE, &rq->state)))
                goto csum_unnecessary;

+       /* CQE csum doesn't cover padding octets in short ethernet
+        * frames. And the pad field is appended prior to calculating
+        * and appending the FCS field.
+        *
+        * Detecting these padded frames requires to verify and parse
+        * IP headers, so we simply force all those small frames to be
+        * CHECKSUM_UNNECESSARY even if they are not padded.
+        */
+       if (short_frame(skb->len))
+               goto csum_unnecessary;
+
        if (likely(is_last_ethertype_ip(skb, &network_depth, &proto))) {
                if (unlikely(get_ip_proto(skb, network_depth, proto) == IPPROTO_SCTP))
                        goto csum_unnecessary;
---
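
To make the failure mode in the commit message concrete, here is a minimal
user-space sketch, not the driver's code. It assumes a plain 16-bit ones'
complement sum (the kernel's csum helpers differ in detail), and csum16() and
padding_is_harmless() are hypothetical names made up for illustration. The
ETH_ZLEN and ETH_FCS_LEN values match the kernel headers, and short_frame()
repeats the test from the macro in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ZLEN    60  /* minimum frame length without FCS */
#define ETH_FCS_LEN 4   /* length of the frame check sequence */

/* Same comparison as the short_frame() macro in the patch. */
int short_frame(size_t size)
{
	return size <= ETH_ZLEN + ETH_FCS_LEN;
}

/* 16-bit ones' complement sum over a buffer, folded but not inverted.
 * An odd trailing byte is treated as if followed by a zero byte. */
uint16_t csum16(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(buf != NULL || len == 0);
	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (i < len)
		sum += (uint32_t)buf[i] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Hypothetical helper: returns 1 if a sum over the first data_len bytes
 * (what the device covered) equals a sum over the whole padded buffer
 * (what the stack sees), 0 otherwise. */
int padding_is_harmless(const uint8_t *frame, size_t data_len,
			size_t total_len)
{
	assert(data_len <= total_len);
	return csum16(frame, data_len) == csum16(frame, total_len);
}
```

With all-zero padding the two sums agree, so the mismatch only shows up behind
equipment that pads with non-zero octets; a single non-zero pad byte is enough
to make them differ.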

Kernel log:
[ 3226.017424] bond0: hw csum failure
[ 3226.018387] CPU: 13 PID: 0 Comm: swapper/13 Tainted: G          I     4.20.6-1.el7.elrepo.x86_64 #1
[ 3226.020928] Hardware name: HP ProLiant DL380 G6, BIOS P62 01/22/2015
[ 3226.022649] Call Trace:
[ 3226.023409]  <IRQ>
[ 3226.024039]  dump_stack+0x63/0x88
[ 3226.025066]  netdev_rx_csum_fault+0x3a/0x40
[ 3226.026208]  __skb_checksum_complete+0xd5/0xe0
[ 3226.027418]  nf_ip_checksum+0xc9/0xf0
[ 3226.028474]  nf_checksum+0x2d/0x40
[ 3226.029504]  tcp_packet+0x2ce/0xa20 [nf_conntrack]
[ 3226.030913]  ? tcp_v4_do_rcv+0x77/0x1f0
[ 3226.032094]  ? sock_put+0x19/0x20
[ 3226.033070]  ? nf_ct_deliver_cached_events+0xd0/0x110 [nf_conntrack]
[ 3226.034754]  nf_conntrack_in+0x140/0x510 [nf_conntrack]
[ 3226.036228]  ipv4_conntrack_in+0x14/0x20 [nf_conntrack]
[ 3226.037646]  nf_hook_slow+0x42/0xc0
[ 3226.038626]  ip_rcv+0xb5/0xd0
[ 3226.039480]  ? ip_local_deliver_finish+0x1e0/0x1e0
[ 3226.040767]  __netif_receive_skb_one_core+0x57/0x80
[ 3226.042155]  __netif_receive_skb+0x18/0x60
[ 3226.043275]  netif_receive_skb_internal+0x45/0xf0
[ 3226.044530]  napi_gro_receive+0xd0/0xf0
[ 3226.045665]  mlx5e_handle_rx_cqe+0x1e6/0x540 [mlx5_core]
[ 3226.047167]  mlx5e_poll_rx_cq+0xd6/0x9c0 [mlx5_core]
[ 3226.048516]  mlx5e_napi_poll+0xc2/0xcd0 [mlx5_core]
[ 3226.049836]  ? mlx5_eq_int+0x4b4/0x6c0 [mlx5_core]
[ 3226.051118]  net_rx_action+0x289/0x3d0
[ 3226.052257]  __do_softirq+0xd5/0x2a2
[ 3226.053277]  irq_exit+0xe8/0x100
[ 3226.054183]  do_IRQ+0x59/0xe0
[ 3226.055014]  common_interrupt+0xf/0xf
[ 3226.056038]  </IRQ>
[ 3226.056722] RIP: 0010:cpuidle_enter_state+0xba/0x2f0
[ 3226.058087] Code: d0 95 7e e8 38 07 a1 ff 41 8b 5c 24 04 49 89 c6 66 66 66 66 90 31 ff e8 34 19 a1 ff 80 7d cf 00 0f 85 8c 01 00 00 fb 66 66 90 <66> 66 90 45 85 ed 0f 88 94 01 00 00 4c 2b 75 c0 48 ba cf f7 53 e3
[ 3226.062925] RSP: 0018:ffffc9000c547e50 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd6
[ 3226.064974] RAX: ffff88a3df7a2dc0 RBX: 000000000000000d RCX: 000000000000001f
[ 3226.066866] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
[ 3226.068747] RBP: ffffc9000c547e90 R08: 0000000000000002 R09: ffffffcdc506f2e7
[ 3226.070622] R10: 0000000000000018 R11: 071c71c71c71c71c R12: ffffe8ffffb96f00
[ 3226.072525] R13: 0000000000000004 R14: 000002ef1d9f1e10 R15: ffff88a3d8900000
[ 3226.074479]  cpuidle_enter+0x17/0x20
[ 3226.075463]  call_cpuidle+0x23/0x40
[ 3226.076412]  do_idle+0x1db/0x280
[ 3226.077323]  cpu_startup_entry+0x1d/0x30
[ 3226.078417]  start_secondary+0x1ae/0x200
[ 3226.079490]  secondary_startup_64+0xa4/0xb0

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-06 16:16 [ISSUE][4.20.6] mlx5 and checksum failures Ian Kumlien
@ 2019-02-06 17:18 ` Saeed Mahameed
  2019-02-06 22:03   ` David Miller
  0 siblings, 1 reply; 19+ messages in thread
From: Saeed Mahameed @ 2019-02-06 17:18 UTC (permalink / raw)
  To: ian.kumlien, netdev, davem

On Wed, 2019-02-06 at 17:16 +0100, Ian Kumlien wrote:
> Hi,
> 
> I'm hitting an issue that i think is fixed by the following patch,
> i haven't verified it yet - but it looks like it should go on the
> stable queue(?)
> 
> (And yes, I did look, and couldn't find it ;))
> 

Yes, I couldn't find it either.

It should have been queued up for 4.18 by now.
Dave said he would take care of it; maybe he just forgot or something,
since the patch needed some extra care.

https://patchwork.ozlabs.org/patch/1027837/

Dave, is there anything I can do here?

Thanks,
Saeed.

> commit e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a
> Author: Cong Wang <xiyou.wangcong@gmail.com>
> Date:   Mon Dec 3 22:14:04 2018 -0800
> 
>     net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-06 17:18 ` Saeed Mahameed
@ 2019-02-06 22:03   ` David Miller
  2019-02-06 22:12     ` Ian Kumlien
  0 siblings, 1 reply; 19+ messages in thread
From: David Miller @ 2019-02-06 22:03 UTC (permalink / raw)
  To: saeedm; +Cc: ian.kumlien, netdev

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Wed, 6 Feb 2019 17:18:55 +0000

> On Wed, 2019-02-06 at 17:16 +0100, Ian Kumlien wrote:
>> Hi,
>> 
>> I'm hitting an issue that i think is fixed by the following patch,
>> i haven't verified it yet - but it looks like it should go on the
>> stable queue(?)
>> 
>> (And yes, I did look, and couldn't find it ;))
>> 
> 
> Yes, I couldn't find it either.
> 
> It should have been queued up for 4.18 by now.
> Dave said he would take care of it; maybe he just forgot or something,
> since the patch needed some extra care.
> 
> https://patchwork.ozlabs.org/patch/1027837/
> 
> Dave, is there anything I can do here?

I never handle anything past the most recent two -stable releases,
which right now is 4.20 and 4.19

For anything beyond that you have to contact the person who maintains
that -stable tree.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-06 22:03   ` David Miller
@ 2019-02-06 22:12     ` Ian Kumlien
  2019-02-06 22:22       ` David Miller
  2019-02-06 22:38       ` Cong Wang
  0 siblings, 2 replies; 19+ messages in thread
From: Ian Kumlien @ 2019-02-06 22:12 UTC (permalink / raw)
  To: David Miller; +Cc: Saeed Mahameed, Linux Kernel Network Developers

On Wed, Feb 6, 2019 at 11:03 PM David Miller <davem@davemloft.net> wrote:
> From: Saeed Mahameed <saeedm@mellanox.com>
> Date: Wed, 6 Feb 2019 17:18:55 +0000

> > On Wed, 2019-02-06 at 17:16 +0100, Ian Kumlien wrote:
> >> Hi,
> >>
> >> I'm hitting an issue that i think is fixed by the following patch,
> >> i haven't verified it yet - but it looks like it should go on the
> >> stable queue(?)
> >>
> >> (And yes, I did look, and couldn't find it ;))
> >>
> >
> > Yes, I couldn't find it either.
> >
> > It should have been queued up for 4.18 by now.
> > Dave said he would take care of it; maybe he just forgot or something,
> > since the patch needed some extra care.
> >
> > https://patchwork.ozlabs.org/patch/1027837/
> >
> > Dave, is there anything I can do here?
>
> I never handle anything past the most recent two -stable releases,
> which right now is 4.20 and 4.19

Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things

> For anything beyond that you have to contact the person who maintains
> that -stable tree.

I think it needs to be applied to all -stable since 4.18 :/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-06 22:12     ` Ian Kumlien
@ 2019-02-06 22:22       ` David Miller
  2019-02-06 22:33         ` Ian Kumlien
  2019-02-06 22:38       ` Cong Wang
  1 sibling, 1 reply; 19+ messages in thread
From: David Miller @ 2019-02-06 22:22 UTC (permalink / raw)
  To: ian.kumlien; +Cc: saeedm, netdev

From: Ian Kumlien <ian.kumlien@gmail.com>
Date: Wed, 6 Feb 2019 23:12:53 +0100

> Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things

It's... there:

https://patchwork.ozlabs.org/bundle/davem/stable/?series=&submitter=&state=*&q=&archive=

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-06 22:22       ` David Miller
@ 2019-02-06 22:33         ` Ian Kumlien
  0 siblings, 0 replies; 19+ messages in thread
From: Ian Kumlien @ 2019-02-06 22:33 UTC (permalink / raw)
  To: David Miller; +Cc: Saeed Mahameed, Linux Kernel Network Developers

On Wed, Feb 6, 2019 at 11:22 PM David Miller <davem@davemloft.net> wrote:
>
> From: Ian Kumlien <ian.kumlien@gmail.com>
> Date: Wed, 6 Feb 2019 23:12:53 +0100
>
> > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
>
> It's... there:
>
> https://patchwork.ozlabs.org/bundle/davem/stable/?series=&submitter=&state=*&q=&archive=

F... Sorry, yet again...

I thought that I DID look at patchwork, but apparently accepted patches
weren't listed by default

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-06 22:12     ` Ian Kumlien
  2019-02-06 22:22       ` David Miller
@ 2019-02-06 22:38       ` Cong Wang
  2019-02-06 22:41         ` Ian Kumlien
  1 sibling, 1 reply; 19+ messages in thread
From: Cong Wang @ 2019-02-06 22:38 UTC (permalink / raw)
  To: Ian Kumlien; +Cc: David Miller, Saeed Mahameed, Linux Kernel Network Developers

On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things

It doesn't break anything, packets are _not_ dropped, only that the
warning itself is noisy.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-06 22:38       ` Cong Wang
@ 2019-02-06 22:41         ` Ian Kumlien
  2019-02-06 22:49           ` Cong Wang
  2019-02-07 21:40           ` Marcelo Ricardo Leitner
  0 siblings, 2 replies; 19+ messages in thread
From: Ian Kumlien @ 2019-02-06 22:41 UTC (permalink / raw)
  To: Cong Wang; +Cc: David Miller, Saeed Mahameed, Linux Kernel Network Developers

On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
>
> It doesn't break anything, packets are _not_ dropped, only that the
> warning itself is noisy.

Not my experience: to me it slows the machine down and loses packets.
I don't, however, know if this is the only culprit.

You can actually see it on ping, where it starts out at 0.0xyx and
ends up at ~10ms

But as I said, I assume this is the culprit - further investigation
will be done =)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-06 22:41         ` Ian Kumlien
@ 2019-02-06 22:49           ` Cong Wang
  2019-02-06 23:00             ` Ian Kumlien
  2019-02-07 21:40           ` Marcelo Ricardo Leitner
  1 sibling, 1 reply; 19+ messages in thread
From: Cong Wang @ 2019-02-06 22:49 UTC (permalink / raw)
  To: Ian Kumlien; +Cc: David Miller, Saeed Mahameed, Linux Kernel Network Developers

On Wed, Feb 6, 2019 at 2:41 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
> >
> > It doesn't break anything, packets are _not_ dropped, only that the
> > warning itself is noisy.
>
> Not my experience: to me it slows the machine down and loses packets.
> I don't, however, know if this is the only culprit.

Packet processing could be slowed down by printing out this kernel
warning. Packets should still be delivered to the upper stack; at
least I didn't see any packet drops because of this.

>
> You can actually see it on ping, where it starts out at 0.0xyx and
> ends up at ~10ms
>

I don't understand how it could affect ICMP, it is purely TCP
from my point of view, even the stack trace from you says so. ;)

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-06 22:49           ` Cong Wang
@ 2019-02-06 23:00             ` Ian Kumlien
  2019-02-07  1:00               ` Saeed Mahameed
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Kumlien @ 2019-02-06 23:00 UTC (permalink / raw)
  To: Cong Wang; +Cc: David Miller, Saeed Mahameed, Linux Kernel Network Developers

On Wed, Feb 6, 2019 at 11:49 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Wed, Feb 6, 2019 at 2:41 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
> > >
> > > It doesn't break anything, packets are _not_ dropped, only that the
> > > warning itself is noisy.
> >
> > Not my experience: to me it slows the machine down and loses packets.
> > I don't, however, know if this is the only culprit.
>
> Packet processing could be slowed down by printing out this kernel
> warning. Packets should still be delivered to the upper stack; at
> least I didn't see any packet drops because of this.

I have several machines pushing the same errors currently, while on this
one I was logged in on the serial console and not over ssh like the others.

On the other machines, typing is slow, loses characters and drops the
connection

But, again, I don't know if this is the only culprit, it sure does
fill dmesg though =)
(which suddenly takes minutes to show over a 100gig connection)

> > You can actually see it on ping, where it starts out at 0.0xyx and
> > ends up at ~10ms
>
> I don't understand how it could affect ICMP, it is purely TCP
> from my point of view, even the stack trace from you says so. ;)

It changes directly after the first hw checksum failure, I don't know why =/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-06 23:00             ` Ian Kumlien
@ 2019-02-07  1:00               ` Saeed Mahameed
  2019-02-07 10:16                 ` Ian Kumlien
  0 siblings, 1 reply; 19+ messages in thread
From: Saeed Mahameed @ 2019-02-07  1:00 UTC (permalink / raw)
  To: Ian Kumlien
  Cc: Cong Wang, David Miller, Saeed Mahameed, Linux Kernel Network Developers

On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Wed, Feb 6, 2019 at 11:49 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Wed, Feb 6, 2019 at 2:41 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > >
> > > On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
> > > >
> > > > It doesn't break anything, packets are _not_ dropped, only that the
> > > > warning itself is noisy.
> > >
> > > Not my experience: to me it slows the machine down and loses packets.
> > > I don't, however, know if this is the only culprit.
> >
> > Packet processing could be slowed down by printing out this kernel
> > warning. Packets should still be delivered to the upper stack; at
> > least I didn't see any packet drops because of this.
>
> I have several machines pushing the same errors currently, while on this
> one I was logged in on the serial console and not over ssh like the others.
>
> On the other machines, typing is slow, loses characters and drops the
> connection
>
> But, again, I don't know if this is the only culprit, it sure does
> fill dmesg though =)
> (which suddenly takes minutes to show over a 100gig connection)
>
> > > You can actually see it on ping, where it starts out at 0.0xyx and
> > > ends up at ~10ms
> >
> > I don't understand how it could affect ICMP, it is purely TCP
> > from my point of view, even the stack trace from you says so. ;)
>
> It changes directly after the first hw checksum failure, I don't know why =/

weird, Maybe a real check-summing issue/corruption on the PCI ?!

can you try turning off checksum offloads
ethtool -K ethX  rx off

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-07  1:00               ` Saeed Mahameed
@ 2019-02-07 10:16                 ` Ian Kumlien
  2019-02-07 18:42                   ` Saeed Mahameed
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Kumlien @ 2019-02-07 10:16 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Cong Wang, David Miller, Saeed Mahameed, Linux Kernel Network Developers

On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > It changes directly after the first hw checksum failure, I don't know why =/
>
> weird, Maybe a real check-summing issue/corruption on the PCI ?!

Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine

Just FYI, my dmesg testcase:
time ssh <server> "dmesg && exit"
real    3m5.845s
user    0m0.035s
sys     0m0.041s

> can you try turning off checksum offloads
> ethtool -K ethX  rx off

same test:
real    0m3.408s
user    0m0.022s
sys     0m0.032s

So yes, something in 4.20.6 goes wrong on the receiving part :/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-07 10:16                 ` Ian Kumlien
@ 2019-02-07 18:42                   ` Saeed Mahameed
  2019-02-07 22:01                     ` Ian Kumlien
  0 siblings, 1 reply; 19+ messages in thread
From: Saeed Mahameed @ 2019-02-07 18:42 UTC (permalink / raw)
  To: Ian Kumlien
  Cc: Cong Wang, David Miller, Saeed Mahameed, Linux Kernel Network Developers

On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > It changes directly after the first hw checksum failure, I don't know why =/
> >
> > weird, Maybe a real check-summing issue/corruption on the PCI ?!
>
> Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine
>

Great, the difference is only 120 patches.
that is bisect-able, it will only take 5 iterations to find the
offending commit.
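
As a rough check on the iteration count, assuming a plain binary search over a
linear list of candidates (git bisect on a history with merges can behave
differently): halving 120 candidates takes up to 7 test builds, slightly more
than the 5 estimated above. bisect_steps() is a hypothetical helper written
for this note, not part of git.

```c
#include <assert.h>

/* Smallest k with 2^k >= n_commits: the worst-case number of test
 * builds for a plain binary search over n_commits candidates. */
unsigned int bisect_steps(unsigned int n_commits)
{
	unsigned int steps = 0;
	unsigned long long span = 1;

	assert(n_commits > 0);
	while (span < n_commits) {
		span *= 2;
		steps++;
	}
	return steps;
}
```

On a machine that takes several minutes per reboot, those extra two rounds
are worth budgeting for.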


> Just FYI, my dmesg testcase:
> time ssh <server> "dmesg && exit"
> real    3m5.845s
> user    0m0.035s
> sys     0m0.041s
>
> > can you try turning off checksum offloads
> > ethtool -K ethX  rx off
>
> same test:
> real    0m3.408s
> user    0m0.022s
> sys     0m0.032s
>
> So yes, something in 4.20.6 goes wrong on the receiving part :/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-06 22:41         ` Ian Kumlien
  2019-02-06 22:49           ` Cong Wang
@ 2019-02-07 21:40           ` Marcelo Ricardo Leitner
  1 sibling, 0 replies; 19+ messages in thread
From: Marcelo Ricardo Leitner @ 2019-02-07 21:40 UTC (permalink / raw)
  To: Ian Kumlien
  Cc: Cong Wang, David Miller, Saeed Mahameed, Linux Kernel Network Developers

On Wed, Feb 06, 2019 at 11:41:28PM +0100, Ian Kumlien wrote:
> On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
> >
> > It doesn't break anything, packets are _not_ dropped, only that the
> > warning itself is noisy.
> 
> Not my experience: to me it slows the machine down and loses packets.
> I don't, however, know if this is the only culprit.
> 
> You can actually see it on ping, where it starts out at 0.0xyx and
> ends up at ~10ms

Serial console is a/the killer in these situations.
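
For a sense of scale, the trace at the top of this thread carries its own
timing: its first line is stamped 3226.017424 and its last 3226.079490. The
sketch below only does that subtraction. It assumes the timestamps reflect
console output happening synchronously in the receive path, which nothing in
this thread confirms, and both function names are hypothetical.

```c
#include <assert.h>

/* Seconds between the first and last printk timestamp of one trace. */
double trace_duration(double first_ts, double last_ts)
{
	assert(last_ts >= first_ts);
	return last_ts - first_ts;
}

/* Upper bound on traces per second if each one blocks for 'duration'. */
double max_traces_per_second(double duration)
{
	assert(duration > 0.0);
	return 1.0 / duration;
}
```

trace_duration(3226.017424, 3226.079490) comes to roughly 62 ms, so under the
stated assumption about 16 such warnings per second would be enough to keep
the CPU handling them busy with logging alone.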

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-07 18:42                   ` Saeed Mahameed
@ 2019-02-07 22:01                     ` Ian Kumlien
  2019-02-08 16:29                       ` Ian Kumlien
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Kumlien @ 2019-02-07 22:01 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Cong Wang, David Miller, Saeed Mahameed, Linux Kernel Network Developers

On Thu, Feb 7, 2019 at 7:43 PM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>
> On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > It changes directly after the first hw checksum failure, I don't know why =/
> > >
> > > weird, Maybe a real check-summing issue/corruption on the PCI ?!
> >
> > Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine
> >
>
> Great, the difference is only 120 patches.
> that is bisect-able, it will only take 5 iterations to find the
> offending commit.

I just wish it wasn't a server that takes, what feels like 5 minutes to boot...

All of these seas of sensors 2d and 3d... =P

But, yep, that's the plan

> > Just FYI, my dmesg testcase:
> > time ssh <server> "dmesg && exit"
> > real    3m5.845s
> > user    0m0.035s
> > sys     0m0.041s
> >
> > > can you try turning off checksum offloads
> > > ethtool -K ethX  rx off
> >
> > same test:
> > real    0m3.408s
> > user    0m0.022s
> > sys     0m0.032s
> >
> > So yes, something in 4.20.6 goes wrong on the receiving part :/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-07 22:01                     ` Ian Kumlien
@ 2019-02-08 16:29                       ` Ian Kumlien
  2019-02-09 15:54                         ` Ian Kumlien
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Kumlien @ 2019-02-08 16:29 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Cong Wang, David Miller, Saeed Mahameed, Linux Kernel Network Developers

On Thu, Feb 7, 2019 at 11:01 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> On Thu, Feb 7, 2019 at 7:43 PM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > It changes directly after the first hw checksum failure, I don't know why =/
> > > >
> > > > weird, Maybe a real check-summing issue/corruption on the PCI ?!
> > >
> > > Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine

> > Great, the difference is only 120 patches.
> > that is bisect-able, it will only take 5 iterations to find the
> > offending commit.
>
> I just wish it wasn't a server that takes, what feels like 5 minutes to boot...
>
> All of these seas of sensors 2d and 3d... =P
>
> But, yep, that's the plan

Huh, spent most of the day with two bisects and none of them yielded
any results....

Looks like I'll have to start investigating the elrepo kernel-ml build =(

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-08 16:29                       ` Ian Kumlien
@ 2019-02-09 15:54                         ` Ian Kumlien
  2019-02-13 12:04                           ` Ian Kumlien
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Kumlien @ 2019-02-09 15:54 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Cong Wang, David Miller, Saeed Mahameed, Linux Kernel Network Developers

On Fri, Feb 8, 2019 at 5:29 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> On Thu, Feb 7, 2019 at 11:01 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > On Thu, Feb 7, 2019 at 7:43 PM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > > > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > It changes directly after the first hw checksum failure, I don't know why =/
> > > > >
> > > > > weird, Maybe a real check-summing issue/corruption on the PCI ?!
> > > >
> > > > Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine
>
> > > Great, the difference is only 120 patches.
> > > that is bisect-able, it will only take 5 iterations to find the
> > > offending commit.
> >
> > I just wish it wasn't a server that takes, what feels like 5 minutes to boot...
> >
> > All of these seas of sensors 2d and 3d... =P
> >
> > But, yep, that's the plan
>
> Huh, spent most of the day with two bisects and none of them yielded
> any results....
>
> Looks like I'll have to start investigating the elrepo kernel-ml build =(

Just realized that it's not an entirely fair comparison - since
retpolines weren't enabled, damned old compilers...

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-09 15:54                         ` Ian Kumlien
@ 2019-02-13 12:04                           ` Ian Kumlien
  2019-02-13 23:36                             ` Saeed Mahameed
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Kumlien @ 2019-02-13 12:04 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Cong Wang, David Miller, Saeed Mahameed, Linux Kernel Network Developers

One last update on this, 4.20.8 compiled with the same compiler works
- I still suspect that it was fixed by:
net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames

Anyway, we can forget about it now ;)

On Sat, Feb 9, 2019 at 4:54 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Fri, Feb 8, 2019 at 5:29 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > On Thu, Feb 7, 2019 at 11:01 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > On Thu, Feb 7, 2019 at 7:43 PM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > > On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > > > > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > > It changes directly after the first hw checksum failure, I don't know why =/
> > > > > >
> > > > > > weird, Maybe a real check-summing issue/corruption on the PCI ?!
> > > > >
> > > > > Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine
> >
> > > > Great, the difference is only 120 patches.
> > > > that is bisect-able, it will only take 5 iterations to find the
> > > > offending commit.
> > >
> > > I just wish it wasn't a server that takes, what feels like 5 minutes to boot...
> > >
> > > All of these seas of sensors 2d and 3d... =P
> > >
> > > But, yep, that's the plan
> >
> > Huh, spent most of the day with two bisects and none of them yielded
> > any results....
> >
> > Looks like I'll have to start investigating the elrepo kernel-ml build =(
>
> Just realized that it's not an entirely fair comparison - since
> retpolines weren't enabled, damned old compilers...

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
  2019-02-13 12:04                           ` Ian Kumlien
@ 2019-02-13 23:36                             ` Saeed Mahameed
  0 siblings, 0 replies; 19+ messages in thread
From: Saeed Mahameed @ 2019-02-13 23:36 UTC (permalink / raw)
  To: ian.kumlien, saeedm; +Cc: davem, netdev, xiyou.wangcong

On Wed, 2019-02-13 at 13:04 +0100, Ian Kumlien wrote:
> One last update on this, 4.20.8 compiled with the same compiler works
> - I still suspect that it was fixed by:
> net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames
> 
> Anyway, we can forget about it now ;)

cool, nice to know.

Thanks for the update.


^ permalink raw reply	[flat|nested] 19+ messages in thread

Thread overview: 19+ messages
2019-02-06 16:16 [ISSUE][4.20.6] mlx5 and checksum failures Ian Kumlien
2019-02-06 17:18 ` Saeed Mahameed
2019-02-06 22:03   ` David Miller
2019-02-06 22:12     ` Ian Kumlien
2019-02-06 22:22       ` David Miller
2019-02-06 22:33         ` Ian Kumlien
2019-02-06 22:38       ` Cong Wang
2019-02-06 22:41         ` Ian Kumlien
2019-02-06 22:49           ` Cong Wang
2019-02-06 23:00             ` Ian Kumlien
2019-02-07  1:00               ` Saeed Mahameed
2019-02-07 10:16                 ` Ian Kumlien
2019-02-07 18:42                   ` Saeed Mahameed
2019-02-07 22:01                     ` Ian Kumlien
2019-02-08 16:29                       ` Ian Kumlien
2019-02-09 15:54                         ` Ian Kumlien
2019-02-13 12:04                           ` Ian Kumlien
2019-02-13 23:36                             ` Saeed Mahameed
2019-02-07 21:40           ` Marcelo Ricardo Leitner
