From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.9 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A31E1C43381 for ; Sat, 16 Mar 2019 17:08:48 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 616582186A for ; Sat, 16 Mar 2019 17:08:48 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="Xo+jwCRn" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726926AbfCPRIr (ORCPT ); Sat, 16 Mar 2019 13:08:47 -0400 Received: from mail-it1-f195.google.com ([209.85.166.195]:51830 "EHLO mail-it1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726493AbfCPRIr (ORCPT ); Sat, 16 Mar 2019 13:08:47 -0400 Received: by mail-it1-f195.google.com with SMTP id e24so14843806itl.1 for ; Sat, 16 Mar 2019 10:08:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=TBHvIAntNDruE91bz98HoVbqm3FPyam/HGesCBrOnrA=; b=Xo+jwCRnYFpbDRhhd7cXCmcLtkEAfbAAkFY0jVNZpOMuNr5fP8hI+I1CsUlsGk8+vs n2c7o84ARt73jKCMhRiyxKtVLwmdzvDgjVWOoHjGdLkcNow9+TXek+C6N1sNX9ra2wQy KjUD6E/++qjYdsITtW+yu4dJLtXTeOX0QGIzfZWZUuc2Q06OvGTLncvnKfPLf+MegIHW 8FesFcwA9F3B0eGWNQThfdpNwa0JLwUup7R4w0nqfGpuPhXvphrPt3S23E5ioFkFYaAO MbQfCYGPz61SqHuH7Sm77Yj/gPZf6n/OOk1gKaWzD9tGmg/69ICdb+vn8MN+v4vp3i/0 WIfg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=TBHvIAntNDruE91bz98HoVbqm3FPyam/HGesCBrOnrA=; b=NXebH+oBQNVmEzpQw1cvXi1gPtuY9sCEVfLrj2ZAEeRybJcF2zXfWKXNea7GYaVI7/ ovLmdwvFOsnh6T8PLJmcy33qjvDDYkJxGKX4eNlRkKfMlVcAemdEFINGAHFwP/zIDiiC AtmM7xUl9xjsLGvAx8oU5BcifFuFwSUqt16fc+HOXCU+doIaHmwtaNTzTNGhVWqYui7J h5Kzojz8983555Jpy5ikHqN7HcScIxgCMepBPiJ1n1tkxFQ2hBFGuVlr4jilPBjSs1h6 T0vyMyVS25P/4Q0txRBIjnFs87PrI2HyhMQSknfDXehwY5koDkwq/OaWwH+RjaKW0SWo p9Zw== X-Gm-Message-State: APjAAAW2C0kl//9v9TXCwTQt+1Z7+yky994oxmX/KRK0quaWuZMN4PCa 9f/vvOp7mY87Fk/ivvHAddtyiaJRsgQ35iQoF8A= X-Google-Smtp-Source: APXvYqyKK9sjVPSv9h0ZSLG6zwCd3JBruKRX3OTkxd+FyndAXWeI5dNH5sDap1Ucp7jqx68TD+juKwb2RBAT6TDxtCY= X-Received: by 2002:a24:45e3:: with SMTP id c96mr5449007itd.89.1552756125371; Sat, 16 Mar 2019 10:08:45 -0700 (PDT) MIME-Version: 1.0 References: <753b56b8-f1ab-82f5-f9b5-089fbb638989@gmail.com> <02388deb-0a06-95ae-1aac-b39c108fc2e7@gmail.com> <9b34d60d-8de7-5384-3822-98ec79d53e04@gmail.com> <0704f164-aa0a-bcae-a886-a7fc4a4cd52f@gmail.com> <8f910b1339a741cdc780f3948c11a082a8a51b9e.camel@linux.intel.com> In-Reply-To: From: Alexander Duyck Date: Sat, 16 Mar 2019 10:08:33 -0700 Message-ID: Subject: Re: r8169 driver from kernel 5.0 crashing - napi_consume_skb To: Heiner Kallweit Cc: VDR User , Alexander Duyck , Netdev Content-Type: text/plain; charset="UTF-8" Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On Sat, Mar 16, 2019 at 8:12 AM Heiner Kallweit wrote: > > On 16.03.2019 15:38, VDR User wrote: > >> Part of the issue though is that we don't know how reliable that test > >> was. I believe Derek he hasn't had any crashes, but he wasn't confident > >> that it had actually resolved the issue. > > > > Previously I thought I could easily & consistently reproduce the crash > > but the more testing I did, the more I realized that wasn't the case. > > That's why my confidence was low in that reversing commit 5317d5c6d47e > > ("r8169: use napi_consume_skb where possible") fixed it. I felt like I > > needed to do a lot more testing over the weekend to be sure. But, I > > can now confirm that reversing that commit did not solve the problem. > > I didn't ifdown/ifup after the crash so the nic eventually recovered > > on its own I guess. The `ethtool -S` output is: > > > > NIC statistics: > > tx_packets: 5370650 > > rx_packets: 57340787 > > tx_errors: 0 > > rx_errors: 0 > > rx_missed: 26 > > align_errors: 0 > > tx_single_collisions: 0 > > tx_multi_collisions: 0 > > unicast: 57332905 > > broadcast: 6409 > > multicast: 1473 > > tx_aborted: 0 > > tx_underrun: 0 > > > > The dmesg log looks the same as before: > > > > [95579.984062] ------------[ cut here ]------------ > > [95579.984142] NETDEV WATCHDOG: enp4s0 (r8169): transmit queue 0 timed out > > [95579.984224] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:461 > > dev_watchdog+0x1bb/0x1e0 > > [95579.984276] Modules linked in: snd_hda_codec_hdmi > > snd_hda_codec_realtek snd_hda_codec_generic ohci_pci snd_hda_intel > > snd_hda_codec snd_hwdep xhci_pci ohci_hcd ehci_pci xhci_hcd ehci_hcd > > usbcore snd_hda_core usb_common snd_pcm snd_timer snd soundcore nfsd > > auth_rpcgss oid_registry lockd grace sunrpc ip_tables x_tables ipv6 > > [95579.984354] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-amd #1 > > [95579.984387] Hardware name: ECS A75F-A/A75F-A, BIOS 4.6.5 09/14/2011 > > [95579.984422] EIP: dev_watchdog+0x1bb/0x1e0 > > [95579.984454] Code: 8b 50 3c 89 f8 e8 3d aa 0a 00 8b 7e f4 eb a4 89 > > f8 c6 05 e7 1c 6d c1 01 e8 72 4f fd ff 53 50 57 68 78 05 66 c1 e8 25 > > ad ba ff <0f> 0b 83 c4 10 eb c9 eb 1c 8d b4 26 00 00 00 00 8d b4 26 00 > > 00 00 > > [95579.986189] EAX: 0000003b EBX: 00000000 ECX: 00000800 EDX: 00000103 > > [95579.986224] ESI: f4cc2264 EDI: f4cc2000 EBP: f4c99f74 ESP: f4c99f4c > > [95579.986259] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210296 > > [95579.986292] CR0: 80050033 CR2: b7c644f0 CR3: 0dfd2000 CR4: 00000690 > > [95579.986325] Call Trace: > > [95579.986356] > > [95579.986389] ? qdisc_put_unlocked+0x40/0x40 > > [95579.986423] call_timer_fn+0x19/0xa0 > > [95579.986456] run_timer_softirq+0x337/0x380 > > [95579.986488] ? qdisc_put_unlocked+0x40/0x40 > > [95579.986521] ? rcu_process_callbacks+0xcb/0x380 > > [95579.986555] __do_softirq+0xd6/0x21c > > [95579.986586] ? __irqentry_text_end+0x18/0x18 > > [95579.986619] call_on_stack+0x10/0x60 > > [95579.986646] > > [95579.986674] ? irq_exit+0x91/0xc0 > > [95579.986701] ? smp_apic_timer_interrupt+0x56/0xa0 > > [95579.986731] ? apic_timer_interrupt+0xd5/0xdc > > [95579.986761] ? acpi_idle_enter_s2idle+0x60/0x60 > > [95579.986790] ? cpuidle_enter_state+0x122/0x360 > > [95579.986821] ? cpuidle_enter+0xf/0x20 > > [95579.986850] ? call_cpuidle+0x1c/0x40 > > [95579.986878] ? do_idle+0x1e6/0x220 > > [95579.986906] ? cpu_startup_entry+0x25/0x40 > > [95579.986934] ? start_secondary+0x1a5/0x220 > > [95579.986963] ? startup_32_smp+0x15f/0x164 > > [95579.986991] ---[ end trace 2e8d77bb3c9d2d45 ]--- > > > > Please let me know if there's anything I can do to help. > > Derek > > > The other changes between 4.20 and 5.0 don't look like they could cause > the issue. And two critical ones have been reverted already. > So what would need to be done is bisecting the issue. > > Heiner So I took a quick look through the git log. Have you tried taking a look at the following commit as a possible culprit? commit e782410ed2378a2ddac58d944c3cf0c6f96b1ff3 Author: Heiner Kallweit Date: Sat Dec 15 16:25:05 2018 +0100 r8169: improve spurious interrupt detection Improve detection of spurious interrupts by checking against the interrupt mask as currently set in the chip. Signed-off-by: Heiner Kallweit Signed-off-by: David S. Miller This patch seems problematic for a few different reasons. The first being that adding a register read to the interrupt handler will actually increase the latency for your interrupt handler as a MMIO read is not a fast operation compared to a memory read. You already had to read status, and adding a read for the interrupt mask will just double the overhead as the reads have to be performed sequentially. Secondly, are we guaranteed that the read of the register will always be reliable? I know I have seen plenty of hardware issues in the past where reading a register in the middle of an event can sometimes lead to the value read being incorrect. The only other patch in the set between 4.20 and 5.0 that catches my eye is: commit 76085c9e7eaebf0caa2476f77380efc3d55b050c Author: Heiner Kallweit Date: Thu Nov 22 22:03:08 2018 +0100 r8169: replace macro TX_FRAGS_READY_FOR with a function Replace macro TX_FRAGS_READY_FOR with function rtl_tx_slots_avail to make code cleaner and type-safe. Signed-off-by: Heiner Kallweit Signed-off-by: David S. Miller However I don't see anything that should be wrong there that is causing this, but instrumenting the points where we are taking action based on that with a trace_printk probably wouldn't hurt as it would help us to verify that we are stopping things in xmit and not getting around to waking them up for whatever reason. Also a minor optimization you could do on the patch is to replace the call to netif_wake_queue in rtl8169_start_xmit with netif_start_queue as you don't need to actually wake up the queue since you are still in mid transmit so you just need to reset the bit so it doesn't prevent the next transmit. Hope that helps. - Alex