From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=7c8q=RT=vger.kernel.org=netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-3.9 required=3.0 tests=DKIM_SIGNED,DKIM_VALID,
	DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM,
	HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_PASS
	autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id A31E1C43381
	for <netdev@archiver.kernel.org>; Sat, 16 Mar 2019 17:08:48 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 616582186A
	for <netdev@archiver.kernel.org>; Sat, 16 Mar 2019 17:08:48 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="Xo+jwCRn"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726926AbfCPRIr (ORCPT <rfc822;netdev@archiver.kernel.org>);
        Sat, 16 Mar 2019 13:08:47 -0400
Received: from mail-it1-f195.google.com ([209.85.166.195]:51830 "EHLO
        mail-it1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726493AbfCPRIr (ORCPT
        <rfc822;netdev@vger.kernel.org>); Sat, 16 Mar 2019 13:08:47 -0400
Received: by mail-it1-f195.google.com with SMTP id e24so14843806itl.1
        for <netdev@vger.kernel.org>; Sat, 16 Mar 2019 10:08:46 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=TBHvIAntNDruE91bz98HoVbqm3FPyam/HGesCBrOnrA=;
        b=Xo+jwCRnYFpbDRhhd7cXCmcLtkEAfbAAkFY0jVNZpOMuNr5fP8hI+I1CsUlsGk8+vs
         n2c7o84ARt73jKCMhRiyxKtVLwmdzvDgjVWOoHjGdLkcNow9+TXek+C6N1sNX9ra2wQy
         KjUD6E/++qjYdsITtW+yu4dJLtXTeOX0QGIzfZWZUuc2Q06OvGTLncvnKfPLf+MegIHW
         8FesFcwA9F3B0eGWNQThfdpNwa0JLwUup7R4w0nqfGpuPhXvphrPt3S23E5ioFkFYaAO
         MbQfCYGPz61SqHuH7Sm77Yj/gPZf6n/OOk1gKaWzD9tGmg/69ICdb+vn8MN+v4vp3i/0
         WIfg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=TBHvIAntNDruE91bz98HoVbqm3FPyam/HGesCBrOnrA=;
        b=NXebH+oBQNVmEzpQw1cvXi1gPtuY9sCEVfLrj2ZAEeRybJcF2zXfWKXNea7GYaVI7/
         ovLmdwvFOsnh6T8PLJmcy33qjvDDYkJxGKX4eNlRkKfMlVcAemdEFINGAHFwP/zIDiiC
         AtmM7xUl9xjsLGvAx8oU5BcifFuFwSUqt16fc+HOXCU+doIaHmwtaNTzTNGhVWqYui7J
         h5Kzojz8983555Jpy5ikHqN7HcScIxgCMepBPiJ1n1tkxFQ2hBFGuVlr4jilPBjSs1h6
         T0vyMyVS25P/4Q0txRBIjnFs87PrI2HyhMQSknfDXehwY5koDkwq/OaWwH+RjaKW0SWo
         p9Zw==
X-Gm-Message-State: APjAAAW2C0kl//9v9TXCwTQt+1Z7+yky994oxmX/KRK0quaWuZMN4PCa
        9f/vvOp7mY87Fk/ivvHAddtyiaJRsgQ35iQoF8A=
X-Google-Smtp-Source: APXvYqyKK9sjVPSv9h0ZSLG6zwCd3JBruKRX3OTkxd+FyndAXWeI5dNH5sDap1Ucp7jqx68TD+juKwb2RBAT6TDxtCY=
X-Received: by 2002:a24:45e3:: with SMTP id c96mr5449007itd.89.1552756125371;
 Sat, 16 Mar 2019 10:08:45 -0700 (PDT)
MIME-Version: 1.0
References: <CAA7C2qju6tYKNiLuL96801EvWp6YODONd7a0d5JCu5rWdwXf=g@mail.gmail.com>
 <c1a8b731-d115-04af-384e-69835a3f5cc6@gmail.com> <CAA7C2qit32YSTycxXsZhj97kcxZAKkhj9ciU+CO=zbknSnRNzg@mail.gmail.com>
 <753b56b8-f1ab-82f5-f9b5-089fbb638989@gmail.com> <CAA7C2qh0QJdB+Z8J1yYm1QU2SbrXoUzyKYmshWEt3FAGSn5T2Q@mail.gmail.com>
 <02388deb-0a06-95ae-1aac-b39c108fc2e7@gmail.com> <CAA7C2qgC6VfFVidcxOBkeT0a1B1ZL7_O-WfgXZgAWBNvtHJY0A@mail.gmail.com>
 <CAA7C2qgY7F6YeWBCad9jVt86xfpg6ZLeRQTXspU07iz+--YGJw@mail.gmail.com>
 <9b34d60d-8de7-5384-3822-98ec79d53e04@gmail.com> <c32640d3bd55b826cc8566c0799ecdb0483ce6f5.camel@linux.intel.com>
 <0704f164-aa0a-bcae-a886-a7fc4a4cd52f@gmail.com> <8f910b1339a741cdc780f3948c11a082a8a51b9e.camel@linux.intel.com>
 <a77201d9-829b-a4f8-84b7-0ffb33ccb5f3@gmail.com> <b470f0ee6071bded8e78e15e18df14ffbc32cd18.camel@linux.intel.com>
 <CAA7C2qjavUe-Tn2ztZ7ibhWO2_DztnE71oGJ=VgWnbkf901qTQ@mail.gmail.com> <a878bd00-a521-58b5-d54a-b2e4c542dcb0@gmail.com>
In-Reply-To: <a878bd00-a521-58b5-d54a-b2e4c542dcb0@gmail.com>
From:   Alexander Duyck <alexander.duyck@gmail.com>
Date:   Sat, 16 Mar 2019 10:08:33 -0700
Message-ID: <CAKgT0UcS2+HHTnRyVe=f_tJZ07MKV7+XVLJtND2UDkzUbEP52Q@mail.gmail.com>
Subject: Re: r8169 driver from kernel 5.0 crashing - napi_consume_skb
To:     Heiner Kallweit <hkallweit1@gmail.com>
Cc:     VDR User <user.vdr@gmail.com>,
        Alexander Duyck <alexander.h.duyck@linux.intel.com>,
        Netdev <netdev@vger.kernel.org>
Content-Type: text/plain; charset="UTF-8"
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

On Sat, Mar 16, 2019 at 8:12 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>
> On 16.03.2019 15:38, VDR User wrote:
> >> Part of the issue though is that we don't know how reliable that test
> >> was. I believe Derek he hasn't had any crashes, but he wasn't confident
> >> that it had actually resolved the issue.
> >
> > Previously I thought I could easily & consistently reproduce the crash
> > but the more testing I did, the more I realized that wasn't the case.
> > That's why my confidence was low in that reversing commit 5317d5c6d47e
> > ("r8169: use napi_consume_skb where possible") fixed it. I felt like I
> > needed to do a lot more testing over the weekend to be sure. But, I
> > can now confirm that reversing that commit did not solve the problem.
> > I didn't ifdown/ifup after the crash so the nic eventually recovered
> > on its own I guess. The `ethtool -S` output is:
> >
> > NIC statistics:
> >      tx_packets: 5370650
> >      rx_packets: 57340787
> >      tx_errors: 0
> >      rx_errors: 0
> >      rx_missed: 26
> >      align_errors: 0
> >      tx_single_collisions: 0
> >      tx_multi_collisions: 0
> >      unicast: 57332905
> >      broadcast: 6409
> >      multicast: 1473
> >      tx_aborted: 0
> >      tx_underrun: 0
> >
> > The dmesg log looks the same as before:
> >
> > [95579.984062] ------------[ cut here ]------------
> > [95579.984142] NETDEV WATCHDOG: enp4s0 (r8169): transmit queue 0 timed out
> > [95579.984224] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:461
> > dev_watchdog+0x1bb/0x1e0
> > [95579.984276] Modules linked in: snd_hda_codec_hdmi
> > snd_hda_codec_realtek snd_hda_codec_generic ohci_pci snd_hda_intel
> > snd_hda_codec snd_hwdep xhci_pci ohci_hcd ehci_pci xhci_hcd ehci_hcd
> > usbcore snd_hda_core usb_common snd_pcm snd_timer snd soundcore nfsd
> > auth_rpcgss oid_registry lockd grace sunrpc ip_tables x_tables ipv6
> > [95579.984354] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-amd #1
> > [95579.984387] Hardware name: ECS A75F-A/A75F-A, BIOS 4.6.5 09/14/2011
> > [95579.984422] EIP: dev_watchdog+0x1bb/0x1e0
> > [95579.984454] Code: 8b 50 3c 89 f8 e8 3d aa 0a 00 8b 7e f4 eb a4 89
> > f8 c6 05 e7 1c 6d c1 01 e8 72 4f fd ff 53 50 57 68 78 05 66 c1 e8 25
> > ad ba ff <0f> 0b 83 c4 10 eb c9 eb 1c 8d b4 26 00 00 00 00 8d b4 26 00
> > 00 00
> > [95579.986189] EAX: 0000003b EBX: 00000000 ECX: 00000800 EDX: 00000103
> > [95579.986224] ESI: f4cc2264 EDI: f4cc2000 EBP: f4c99f74 ESP: f4c99f4c
> > [95579.986259] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210296
> > [95579.986292] CR0: 80050033 CR2: b7c644f0 CR3: 0dfd2000 CR4: 00000690
> > [95579.986325] Call Trace:
> > [95579.986356]  <SOFTIRQ>
> > [95579.986389]  ? qdisc_put_unlocked+0x40/0x40
> > [95579.986423]  call_timer_fn+0x19/0xa0
> > [95579.986456]  run_timer_softirq+0x337/0x380
> > [95579.986488]  ? qdisc_put_unlocked+0x40/0x40
> > [95579.986521]  ? rcu_process_callbacks+0xcb/0x380
> > [95579.986555]  __do_softirq+0xd6/0x21c
> > [95579.986586]  ? __irqentry_text_end+0x18/0x18
> > [95579.986619]  call_on_stack+0x10/0x60
> > [95579.986646]  </SOFTIRQ>
> > [95579.986674]  ? irq_exit+0x91/0xc0
> > [95579.986701]  ? smp_apic_timer_interrupt+0x56/0xa0
> > [95579.986731]  ? apic_timer_interrupt+0xd5/0xdc
> > [95579.986761]  ? acpi_idle_enter_s2idle+0x60/0x60
> > [95579.986790]  ? cpuidle_enter_state+0x122/0x360
> > [95579.986821]  ? cpuidle_enter+0xf/0x20
> > [95579.986850]  ? call_cpuidle+0x1c/0x40
> > [95579.986878]  ? do_idle+0x1e6/0x220
> > [95579.986906]  ? cpu_startup_entry+0x25/0x40
> > [95579.986934]  ? start_secondary+0x1a5/0x220
> > [95579.986963]  ? startup_32_smp+0x15f/0x164
> > [95579.986991] ---[ end trace 2e8d77bb3c9d2d45 ]---
> >
> > Please let me know if there's anything I can do to help.
> > Derek
> >
> The other changes between 4.20 and 5.0 don't look like they could cause
> the issue. And two critical ones have been reverted already.
> So what would need to be done is bisecting the issue.
>
> Heiner

So I took a quick look through the git log. Have you tried taking a
look at the following commit as a possible culprit?
commit e782410ed2378a2ddac58d944c3cf0c6f96b1ff3
Author: Heiner Kallweit <hkallweit1@gmail.com>
Date:   Sat Dec 15 16:25:05 2018 +0100

    r8169: improve spurious interrupt detection

    Improve detection of spurious interrupts by checking against the
    interrupt mask as currently set in the chip.

    Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

This patch seems problematic for a few different reasons. The first
being that adding a register read to the interrupt handler will
actually increase the latency for your interrupt handler as a MMIO
read is not a fast operation compared to a memory read. You already
had to read status, and adding a read for the interrupt mask will just
double the overhead as the reads have to be performed sequentially.
Secondly, are we guaranteed that the read of the register will always
be reliable? I know I have seen plenty of hardware issues in the past
where reading a register in the middle of an event can sometimes lead
to the value read being incorrect.

The only other patch in the set between 4.20 and 5.0 that catches my eye is:
commit 76085c9e7eaebf0caa2476f77380efc3d55b050c
Author: Heiner Kallweit <hkallweit1@gmail.com>
Date:   Thu Nov 22 22:03:08 2018 +0100

    r8169: replace macro TX_FRAGS_READY_FOR with a function

    Replace macro TX_FRAGS_READY_FOR with function rtl_tx_slots_avail
    to make code cleaner and type-safe.

    Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

However I don't see anything that should be wrong there that is
causing this, but instrumenting the points where we are taking action
based on that with a trace_printk probably wouldn't hurt as it would
help us to verify that we are stopping things in xmit and not getting
around to waking them up for whatever reason. Also a minor
optimization you could do on the patch is to replace the call to
netif_wake_queue in rtl8169_start_xmit with netif_start_queue as you
don't need to actually wake up the queue since you are still in mid
transmit so you just need to reset the bit so it doesn't prevent the
next transmit.

Hope that helps.

- Alex