From mboxrd@z Thu Jan 1 00:00:00 1970
From: "ebradsha@gmail.com"
Subject: Frequent NIC lock-ups requiring power cycle
Date: Sun, 1 Jul 2018 04:54:51 -0700
Message-ID:
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============0672820191876286106=="
Return-path:
Received: from us1-rack-dfw2.inumbo.com ([104.130.134.6]) by lists.xenproject.org with esmtp (Exim 4.89) (envelope-from ) id 1fZax7-0004Kj-4i for xen-devel@lists.xenproject.org; Sun, 01 Jul 2018 11:55:37 +0000
Received: by mail-lf0-x241.google.com with SMTP id m13-v6so9901893lfb.12 for ; Sun, 01 Jul 2018 04:55:33 -0700 (PDT)
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: xen-devel-bounces@lists.xenproject.org
Sender: "Xen-devel"
To: xen-devel@lists.xenproject.org
List-Id: xen-devel@lists.xenproject.org

--===============0672820191876286106==
Content-Type: multipart/alternative; boundary="0000000000000879cf056feec2ca"

--0000000000000879cf056feec2ca
Content-Type: text/plain; charset="UTF-8"

We have a server running CentOS 7 with the 4.9.75-29.el7.x86_64 kernel and a Broadcom PCI network card BCM-95720A2003G.

The server receives a fair amount of traffic (~3-4TB per month) with an even split between uploads/downloads and ~40 HTTP requests per second. Not a trivial amount of traffic, but nothing crazy either.

We've had recurring problems where our NIC locks up and the server must be power cycled in order to restore network connectivity. We've seen this with both the on-board Intel NIC and a PCI Broadcom network card (listed above), on CentOS 6 and CentOS 7, and on different physical machines (albeit all of them Dell C6100s with the XS23-TY3 mobo).

Our system logs show the following at the time of the crash (see below).

The issue appears to be related to the Xen kernel and/or the network driver. Given that we've had this same issue across different brands of network cards, I'm guessing it's more likely related to the kernel.

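As a rough sanity check on the traffic figures above, the monthly volume can be converted to an average line rate. This is only a sketch: it assumes decimal terabytes (10**12 bytes) and a 30-day month, and the helper name `average_mbit_per_s` is made up for illustration. It says nothing about peak rates, which may be considerably higher than the average.

```python
SECONDS_PER_DAY = 86_400


def average_mbit_per_s(terabytes_per_month: float, days_in_month: int = 30) -> float:
    """Convert a monthly transfer volume to an average rate in Mbit/s.

    Assumptions (not from the original report): decimal terabytes
    (10**12 bytes) and a 30-day month unless told otherwise.
    """
    bits = terabytes_per_month * 10**12 * 8
    seconds = days_in_month * SECONDS_PER_DAY
    return bits / seconds / 10**6


if __name__ == "__main__":
    # The report quotes roughly 3-4 TB per month.
    for tb in (3, 4):
        print(f"{tb} TB/month ~ {average_mbit_per_s(tb):.1f} Mbit/s average")
```

Under those assumptions, 3-4TB per month averages out to roughly 9 to 12 Mbit/s, a small fraction of a gigabit link.
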
Anyone have any suggestions for how we might resolve or mitigate this issue?

Jun 29 23:03:52 server1 kernel: ------------[ cut here ]------------
Jun 29 23:03:52 server1 kernel: WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x217/0x220
Jun 29 23:03:52 server1 kernel: NETDEV WATCHDOG: p55p1 (tg3): transmit queue 0 timed out
Jun 29 23:03:52 server1 kernel: Modules linked in: br_netfilter xen_blkfront dm_crypt tun drbd lru_cache libcrc32c bridge stp llc ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_$
Jun 29 23:03:52 server1 kernel: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.75-29.el7.x86_64 #1
Jun 29 23:03:52 server1 kernel: Hardware name: Dell XS23-TY3 / , BIOS 1.62 06/24/2011
Jun 29 23:03:52 server1 kernel: ffff88007ca03dd0 ffffffff813f6f05 ffff88007ca03e20 0000000000000000
Jun 29 23:03:52 server1 kernel: ffff88007ca03e10 ffffffff810a7341 0000013c11b64726 0000000000000000
Jun 29 23:03:52 server1 kernel: ffff880074dbe000 0000000000000005 0000000000000000 ffff880074dbe000
Jun 29 23:03:52 server1 kernel: Call Trace:
Jun 29 23:03:52 server1 kernel: <IRQ>
Jun 29 23:03:52 server1 kernel: [<ffffffff813f6f05>] dump_stack+0x63/0x8e
Jun 29 23:03:52 server1 kernel: [<ffffffff810a7341>] __warn+0xd1/0xf0
Jun 29 23:03:52 server1 kernel: [<ffffffff810a73af>] warn_slowpath_fmt+0x4f/0x60
Jun 29 23:03:52 server1 kernel: [<ffffffff811198ea>] ? hrtimer_interrupt+0xca/0x190
Jun 29 23:03:52 server1 kernel: [<ffffffff81787387>] dev_watchdog+0x217/0x220
Jun 29 23:03:52 server1 kernel: [<ffffffff81787170>] ? dev_deactivate_queue.constprop.27+0x60/0x60
Jun 29 23:03:52 server1 kernel: [<ffffffff81116c05>] call_timer_fn+0x35/0x120
Jun 29 23:03:52 server1 kernel: [<ffffffff8111770c>] run_timer_softirq+0x1dc/0x460
Jun 29 23:03:52 server1 kernel: [<ffffffff810228a5>] ? xen_clocksource_read+0x15/0x20
Jun 29 23:03:52 server1 kernel: [<ffffffff81035639>] ? sched_clock+0x9/0x10
Jun 29 23:03:52 server1 kernel: [<ffffffff810d7672>] ? sched_clock_cpu+0x72/0xa0
Jun 29 23:03:52 server1 kernel: [<ffffffff81883881>] __do_softirq+0xd1/0x283
Jun 29 23:03:52 server1 kernel: [<ffffffff810ad479>] irq_exit+0xe9/0x100
Jun 29 23:03:52 server1 kernel: [<ffffffff814ded65>] xen_evtchn_do_upcall+0x35/0x50
Jun 29 23:03:52 server1 kernel: [<ffffffff81880c5e>] xen_do_hypervisor_callback+0x1e/0x40
Jun 29 23:03:52 server1 kernel: <EOI>
Jun 29 23:03:52 server1 kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Jun 29 23:03:52 server1 kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Jun 29 23:03:52 server1 kernel: [<ffffffff81022710>] ? xen_safe_halt+0x10/0x20
Jun 29 23:03:52 server1 kernel: [<ffffffff8187efae>] ? default_idle+0x1e/0xd0
Jun 29 23:03:52 server1 kernel: [<ffffffff810368ff>] ? arch_cpu_idle+0xf/0x20
Jun 29 23:03:52 server1 kernel: [<ffffffff8187f3cc>] ? default_idle_call+0x2c/0x40
Jun 29 23:03:52 server1 kernel: [<ffffffff810ecd2c>] ? cpu_startup_entry+0x1ac/0x240
Jun 29 23:03:52 server1 kernel: [<ffffffff818729b7>] ? rest_init+0x77/0x80
Jun 29 23:03:52 server1 kernel: [<ffffffff81fb0148>] ? start_kernel+0x4ac/0x4b9
Jun 29 23:03:52 server1 kernel: [<ffffffff81fafa8a>] ? set_init_arg+0x55/0x55
Jun 29 23:03:52 server1 kernel: [<ffffffff81faf5d7>] ? x86_64_start_reservations+0x24/0x26
Jun 29 23:03:52 server1 kernel: [<ffffffff81fb6cf7>] ? xen_start_kernel+0x56a/0x576
Jun 29 23:03:52 server1 kernel: ---[ end trace e79c6881e97dc64a ]---
Jun 29 23:03:52 server1 kernel: tg3 0000:03:00.0 p55p1: transmit timed out, resetting
Jun 29 23:03:52 server1 kernel: tg3 0000:03:00.0 p55p1: 0x00000000: 0x165f14e4, 0x00100546, 0x02000000, 0x00800040
Jun 29 23:03:52 server1 kernel: tg3 0000:03:00.0 p55p1: 0x00000010: 0xf9fd000c, 0x00000000, 0xf9ff000c, 0x00000000
Jun 29 23:03:52 server1 kernel: tg3 0000:03:00.0 p55p1: 0x00000460: 0x00000008, 0x00002620, 0x01ff0106, 0x00000000
Jun 29 23:03:52 server1 kernel: tg3 0000:03:00.0 p55p1: 0x00000470: 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff
Jun 29 23:03:52 server1 kernel: tg3 0000:03:00.0 p55p1: 0x00000480: 0x42000000, 0x7fffffff, 0x06000004, 0x7fffffff

--0000000000000879cf056feec2ca
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

--0000000000000879cf056feec2ca--

--===============0672820191876286106==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Disposition: inline

X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX18KWGVuLWRldmVs
IG1haWxpbmcgbGlzdApYZW4tZGV2ZWxAbGlzdHMueGVucHJvamVjdC5vcmcKaHR0cHM6Ly9saXN0
cy54ZW5wcm9qZWN0Lm9yZy9tYWlsbWFuL2xpc3RpbmZvL3hlbi1kZXZlbA==

--===============0672820191876286106==--