* r8169: IO_PAGE_FAULT & netdev watchdog
From: Vincent Pelletier @ 2012-05-31 21:31 UTC
To: netdev

Hi.

First of all, I'm running 3.3.4 from Debian experimental (the rest of userland is from sid). I am not subscribed to this list, so please keep me in CC.

I consistently get errors after a few minutes of using btlaunchmanycurses (a multi-torrent downloader). I usually first notice that the network is down (no traffic), then find this in syslog (see below).

Then I run "ifdown eth0; rmmod r8169; modprobe r8169" (which implicitly runs ifup), but the network never comes back (at least, no traffic can go through) until I reboot.

Since www.kerneloops.org is down (apparently for quite some time...), I thought I should report here.

I'm quite sure this problem also occurred on 3.2, but I don't know the exact version I was using at that time. I have only had this motherboard for a few months, and the previous one didn't have an IOMMU, which in my understanding is what causes (well, actually detects) this error.
May 31 22:54:55 x2 kernel: [78579.111904] AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0019 address=0x0000000000003000 flags=0x0050]
May 31 22:55:07 x2 kernel: [78590.832047] ------------[ cut here ]------------
May 31 22:55:07 x2 kernel: [78590.832067] WARNING: at /build/buildd-linux-2.6_3.3.4-1~experimental.1-amd64-_y3OdD/linux-2.6-3.3.4/debian/build/source_amd64_none/net/sched/sch_generic.c:256 dev_watchdog+0xf2/0x151()
May 31 22:55:07 x2 kernel: [78590.832080] Hardware name: GA-990FXA-UD3
May 31 22:55:07 x2 kernel: [78590.832087] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
May 31 22:55:07 x2 kernel: [78590.832093] Modules linked in: pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hrtimer cpufreq_powersave cpufreq_stats cpufreq_userspace cpufreq_conservative xt_multiport iptable_filter ip_tables x_tables tun parport_pc ppdev lp parport binfmt_misc ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc ext3 mbcache jbd dm_crypt raid1 md_mod powernow_k8 mperf adt7475 it87 hwmon_vid snd_emu10k1_synth snd_emux_synth snd_seq_midi_emul snd_seq_virmidi snd_emu10k1 snd_util_mem snd_ac97_codec snd_hwdep snd_pcm_oss snd_mixer_oss joydev snd_pcm snd_page_alloc nouveau snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq video ttm drm_kms_helper drm sp5100_tco i2c_piix4 snd_seq_device k10temp snd_timer i2c_core mxm_wmi snd emu10k1_gp gameport edac_mce_amd edac_core evdev pcspkr wmi processor soundcore ac97_bus button thermal_sys sr_mod cdrom usbhid hid power_supply re
May 31 22:55:07 x2 kernel: iserfs dm_mod nbd usb_storage uas sd_mod crc_t10dif ohci_hcd firewire_ohci firewire_core crc_itu_t ahci libahci ehci_hcd xhci_hcd r8169 mii libata scsi_mod usbcore usb_common [last unloaded: scsi_wait_scan]
May 31 22:55:07 x2 kernel: [78590.832306] Pid: 0, comm: swapper/0 Tainted: G W O 3.3.0-trunk-amd64 #1
May 31 22:55:07 x2 kernel: [78590.832314] Call Trace:
May 31 22:55:07 x2 kernel: [78590.832319] <IRQ> [<ffffffff810387cb>] ? warn_slowpath_common+0x78/0x8c
May 31 22:55:07 x2 kernel: [78590.832339] [<ffffffff81038877>] ? warn_slowpath_fmt+0x45/0x4a
May 31 22:55:07 x2 kernel: [78590.832349] [<ffffffff812aa28d>] ? netif_tx_lock+0x40/0x76
May 31 22:55:07 x2 kernel: [78590.832363] [<ffffffff812aa3ff>] ? dev_watchdog+0xf2/0x151
May 31 22:55:07 x2 kernel: [78590.832374] [<ffffffff81043ef1>] ? run_timer_softirq+0x19a/0x261
May 31 22:55:07 x2 kernel: [78590.832383] [<ffffffff812aa30d>] ? netif_tx_unlock+0x4a/0x4a
May 31 22:55:07 x2 kernel: [78590.832395] [<ffffffff8103de20>] ? __do_softirq+0xb9/0x177
May 31 22:55:07 x2 kernel: [78590.832405] [<ffffffff8106d15b>] ? timekeeping_get_ns+0xd/0x2a
May 31 22:55:07 x2 kernel: [78590.832417] [<ffffffff81358b5c>] ? call_softirq+0x1c/0x30
May 31 22:55:07 x2 kernel: [78590.832428] [<ffffffff8100fa35>] ? do_softirq+0x3c/0x7b
May 31 22:55:07 x2 kernel: [78590.832438] [<ffffffff8103e088>] ? irq_exit+0x3c/0x96
May 31 22:55:07 x2 kernel: [78590.832447] [<ffffffff8100f763>] ? do_IRQ+0x82/0x98
May 31 22:55:07 x2 kernel: [78590.832459] [<ffffffff8135282e>] ? common_interrupt+0x6e/0x6e
May 31 22:55:07 x2 kernel: [78590.832464] <EOI> [<ffffffff8102b0c8>] ? native_safe_halt+0x2/0x3
May 31 22:55:07 x2 kernel: [78590.832481] [<ffffffff81014798>] ? default_idle+0x47/0x7f
May 31 22:55:07 x2 kernel: [78590.832490] [<ffffffff8101488f>] ? amd_e400_idle+0xbf/0xe4
May 31 22:55:07 x2 kernel: [78590.832500] [<ffffffff8100d252>] ? cpu_idle+0xaf/0xf7
May 31 22:55:07 x2 kernel: [78590.832510] [<ffffffff8169ab37>] ? start_kernel+0x3bd/0x3c8
May 31 22:55:07 x2 kernel: [78590.832519] [<ffffffff8169a140>] ? early_idt_handlers+0x140/0x140
May 31 22:55:07 x2 kernel: [78590.832529] [<ffffffff8169a3c3>] ? x86_64_start_kernel+0x104/0x111
May 31 22:55:07 x2 kernel: [78590.832537] ---[ end trace 627ebd8c70d61b1a ]---
May 31 22:55:07 x2 kernel: [78590.848660] r8169 0000:05:00.0: eth0: link up
May 31 22:55:19 x2 kernel: [78602.848659] r8169 0000:05:00.0: eth0: link up
May 31 22:55:31 x2 kernel: [78614.848656] r8169 0000:05:00.0: eth0: link up
May 31 22:55:43 x2 kernel: [78626.848800] r8169 0000:05:00.0: eth0: link up
May 31 22:55:55 x2 ovpn-nexedi[2610]: NOTE: OpenVPN 2.1 requires '--script-security 2' or higher to call user-defined scripts or executables
May 31 22:56:31 x2 kernel: [78674.848666] r8169 0000:05:00.0: eth0: link up
May 31 22:57:19 x2 kernel: [78722.848598] r8169 0000:05:00.0: eth0: link up
May 31 22:58:07 x2 kernel: [78770.848662] r8169 0000:05:00.0: eth0: link up
May 31 22:58:17 x2 avahi-daemon[2744]: Withdrawing address record for 192.168.0.16 on eth0.
May 31 22:58:17 x2 avahi-daemon[2744]: Leaving mDNS multicast group on interface eth0.IPv4 with address 192.168.0.16.
May 31 22:58:17 x2 avahi-daemon[2744]: Interface eth0.IPv4 no longer relevant for mDNS.
May 31 22:58:17 x2 avahi-daemon[2744]: Interface eth0.IPv6 no longer relevant for mDNS.
May 31 22:58:17 x2 avahi-daemon[2744]: Leaving mDNS multicast group on interface eth0.IPv6 with address fe80::52e5:49ff:feb4:ed6f.
May 31 22:58:17 x2 avahi-daemon[2744]: Withdrawing address record for fe80::52e5:49ff:feb4:ed6f on eth0.
May 31 22:58:25 x2 avahi-daemon[2744]: Withdrawing workstation service for tun0.
May 31 22:59:29 x2 avahi-daemon[2744]: Withdrawing workstation service for eth0.
May 31 22:59:33 x2 kernel: [78856.929121] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
May 31 22:59:33 x2 kernel: [78856.929312] r8169 0000:05:00.0: irq 41 for MSI/MSI-X
May 31 22:59:33 x2 kernel: [78856.930671] r8169 0000:05:00.0: eth0: RTL8168evl/8111evl at 0xffffc90000c1e000, 50:e5:49:b4:ed:6f, XID 0c900880 IRQ 41
May 31 22:59:33 x2 kernel: [78856.930685] r8169 0000:05:00.0: eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
May 31 22:59:33 x2 avahi-daemon[2744]: Joining mDNS multicast group on interface eth0.IPv4 with address 192.168.0.16.
May 31 22:59:33 x2 kernel: [78857.169029] r8169 0000:05:00.0: eth0: link down
May 31 22:59:33 x2 kernel: [78857.169043] r8169 0000:05:00.0: eth0: link down
May 31 22:59:33 x2 kernel: [78857.171749] ADDRCONF(NETDEV_UP): eth0: link is not ready
May 31 22:59:33 x2 avahi-daemon[2744]: New relevant interface eth0.IPv4 for mDNS.
May 31 22:59:33 x2 avahi-daemon[2744]: Registering new address record for 192.168.0.16 on eth0.IPv4.
May 31 22:59:36 x2 kernel: [78859.538358] r8169 0000:05:00.0: eth0: link up
May 31 22:59:36 x2 kernel: [78859.539012] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
May 31 22:59:37 x2 avahi-daemon[2744]: Joining mDNS multicast group on interface eth0.IPv6 with address fe80::52e5:49ff:feb4:ed6f.
May 31 22:59:37 x2 avahi-daemon[2744]: New relevant interface eth0.IPv6 for mDNS.
May 31 22:59:37 x2 avahi-daemon[2744]: Registering new address record for fe80::52e5:49ff:feb4:ed6f on eth0.*.
May 31 22:59:46 x2 kernel: [78870.104066] eth0: no IPv6 routers present
May 31 23:00:00 x2 kernel: [78883.792620] r8169 0000:05:00.0: eth0: link up
May 31 23:00:37 x2 kerneloops: Submitted 2 kernel oopses to www.kerneloops.org
May 31 23:00:48 x2 kernel: [78931.792643] r8169 0000:05:00.0: eth0: link up
May 31 23:01:21 x2 kernel: [78965.124469] r8169 0000:05:00.0: eth0: link down
May 31 23:01:26 x2 kernel: [78969.278184] r8169 0000:05:00.0: eth0: link up
May 31 23:01:27 x2 kerneloops: Submitted 1 kernel oopses to www.kerneloops.org
May 31 23:01:44 x2 kernel: [78987.792649] r8169 0000:05:00.0: eth0: link up
May 31 23:02:32 x2 kernel: [79035.792636] r8169 0000:05:00.0: eth0: link up
May 31 23:02:54 x2 shutdown[9402]: shutting down for system reboot

Regards,
--
Vincent Pelletier
* Re: r8169: IO_PAGE_FAULT & netdev watchdog
From: Francois Romieu @ 2012-06-01 12:59 UTC
To: Vincent Pelletier; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 2734 bytes --]

Vincent Pelletier <plr.vincent@gmail.com> :
[...]
> I'm getting consistently errors when using btlaunchmanycurses (multi-torrent
> downloader) after a few minutes. I usually first notice the network being down
> (no trafic) then find this in syslog (see at bottom).
>
> Then, I "ifdown eth0;rmmod r8169;modprobe r8169" (which implicitely ifup's),
> but network never comes back - at least no trafic can go through - until
> reboot.

Does the same thing happen if you reset and remove the PCI device through sysfs, then ask the PCI bridge to rescan it?

> www.kerneloops.org being down (aparently for quite some time...) I though I
> should report here.
>
> I'm quite sure this problem also occured on 3.2, but I don't know the exact
> version I was using at that time. I only have this motherboard since a few
> months, and previous one didn't have an IOMMU - which in my understanding is
> what causes (well, detects actually) this error.

https://bugzilla.kernel.org/show_bug.cgi?id=42899 contains similar, if not identical, IOMMU messages (this bug report is messy, but it may be of interest to add yourself to its Cc: list, btw).

AFAIUI the IOMMU complains because the r8169 tried to perform a read access. The target address matches the start of one of the descriptor rings. However, it happens long after the r8169 initialized the chipset, and the driver would work rather poorly if it could not access its descriptor rings.

The r8169 bug is real, but the IOMMU message seems rather useless, if not bogus.
> May 31 22:54:55 x2 kernel: [78579.111904] AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0019 address=0x0000000000003000 flags=0x0050]
> May 31 22:55:07 x2 kernel: [78590.832047] ------------[ cut here ]------------
> May 31 22:55:07 x2 kernel: [78590.832067] WARNING: at /build/buildd-linux-2.6_3.3.4-1~experimental.1-amd64-_y3OdD/linux-2.6-3.3.4/debian/build/source_amd64_none/net/sched/sch_generic.c:256 dev_watchdog+0xf2/0x151()
> May 31 22:55:07 x2 kernel: [78590.832080] Hardware name: GA-990FXA-UD3
> May 31 22:55:07 x2 kernel: [78590.832087] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out

You can apply the attached patch but it may not do much for your problem. The patch below could make a difference though. Does it?

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index bbacb37..da46588 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -3766,6 +3766,7 @@ static void rtl_init_rxcfg(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_22:
 	case RTL_GIGA_MAC_VER_23:
 	case RTL_GIGA_MAC_VER_24:
+	case RTL_GIGA_MAC_VER_34:
 		RTL_W32(RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
 		break;
 	default:

--
Ueimor

[-- Attachment #2: 0001-PATCH-r8169-fix-unsigned-int-wraparound-with-TSO.patch --]
[-- Type: text/plain, Size: 4949 bytes --]

>From 3068d55417db4c8e3414ce840afb932fdf1f0f76 Mon Sep 17 00:00:00 2001
Message-Id: <3068d55417db4c8e3414ce840afb932fdf1f0f76.1338553193.git.romieu@fr.zoreil.com>
From: Julien Ducourthial <jducourt@free.fr>
Date: Fri, 1 Jun 2012 14:17:43 +0200
Subject: [PATCH] [PATCH] r8169: fix unsigned int wraparound with TSO
X-Organisation: Land of Sunshine Inc.

[ Upstream commit 477206a018f902895bfcd069dd820bfe94c187b1 ]

The r8169 may get stuck or show bad behaviour after activating TSO :
the net_device is not stopped when it has no more TX descriptors.
This problem comes from TX_BUFS_AVAIL which may reach -1 when all transmit descriptors are in use. The patch simply tries to keep positive values. Tested with 8111d(onboard) on a D510MO, and with 8111e(onboard) on a Zotac 890GXITX. Signed-off-by: Julien Ducourthial <jducourt@free.fr> Acked-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: David S. Miller <davem@davemloft.net> --- drivers/net/ethernet/realtek/r8169.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index f545093..ce6b44d 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -61,8 +61,12 @@ #define R8169_MSG_DEFAULT \ (NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN) -#define TX_BUFFS_AVAIL(tp) \ - (tp->dirty_tx + NUM_TX_DESC - tp->cur_tx - 1) +#define TX_SLOTS_AVAIL(tp) \ + (tp->dirty_tx + NUM_TX_DESC - tp->cur_tx) + +/* A skbuff with nr_frags needs nr_frags+1 entries in the tx queue */ +#define TX_FRAGS_READY_FOR(tp,nr_frags) \ + (TX_SLOTS_AVAIL(tp) >= (nr_frags + 1)) /* Maximum number of multicast addresses to filter (vs. Rx-all-multicast). The RTL chips use a 64 element hash table based on the Ethernet CRC. */ @@ -5115,7 +5119,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb, u32 opts[2]; int frags; - if (unlikely(TX_BUFFS_AVAIL(tp) < skb_shinfo(skb)->nr_frags)) { + if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) { netif_err(tp, drv, dev, "BUG! Tx Ring full when queue awake!\n"); goto err_stop_0; } @@ -5169,7 +5173,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb, mmiowb(); - if (TX_BUFFS_AVAIL(tp) < MAX_SKB_FRAGS) { + if (!TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) { /* Avoid wrongly optimistic queue wake-up: rtl_tx thread must * not miss a ring update when it notices a stopped queue. */ @@ -5183,7 +5187,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb, * can't. 
*/ smp_mb(); - if (TX_BUFFS_AVAIL(tp) >= MAX_SKB_FRAGS) + if (TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) netif_wake_queue(dev); } @@ -5306,7 +5310,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp) */ smp_mb(); if (netif_queue_stopped(dev) && - (TX_BUFFS_AVAIL(tp) >= MAX_SKB_FRAGS)) { + TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) { netif_wake_queue(dev); } /* -- 1.7.10.2 --- drivers/net/ethernet/realtek/r8169.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index da46588..59dd29e 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -62,8 +62,12 @@ #define R8169_MSG_DEFAULT \ (NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN) -#define TX_BUFFS_AVAIL(tp) \ - (tp->dirty_tx + NUM_TX_DESC - tp->cur_tx - 1) +#define TX_SLOTS_AVAIL(tp) \ + (tp->dirty_tx + NUM_TX_DESC - tp->cur_tx) + +/* A skbuff with nr_frags needs nr_frags+1 entries in the tx queue */ +#define TX_FRAGS_READY_FOR(tp,nr_frags) \ + (TX_SLOTS_AVAIL(tp) >= (nr_frags + 1)) /* Maximum number of multicast addresses to filter (vs. Rx-all-multicast). The RTL chips use a 64 element hash table based on the Ethernet CRC. */ @@ -5513,7 +5517,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb, u32 opts[2]; int frags; - if (unlikely(TX_BUFFS_AVAIL(tp) < skb_shinfo(skb)->nr_frags)) { + if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) { netif_err(tp, drv, dev, "BUG! 
Tx Ring full when queue awake!\n"); goto err_stop_0; } @@ -5561,10 +5565,10 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb, RTL_W8(TxPoll, NPQ); - if (TX_BUFFS_AVAIL(tp) < MAX_SKB_FRAGS) { + if (!TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) { netif_stop_queue(dev); smp_rmb(); - if (TX_BUFFS_AVAIL(tp) >= MAX_SKB_FRAGS) + if (TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) netif_wake_queue(dev); } @@ -5666,7 +5670,7 @@ static void rtl8169_tx_interrupt(struct net_device *dev, tp->dirty_tx = dirty_tx; smp_wmb(); if (netif_queue_stopped(dev) && - (TX_BUFFS_AVAIL(tp) >= MAX_SKB_FRAGS)) { + TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) { netif_wake_queue(dev); } /* -- 1.7.10.2
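A minimal standalone sketch of the arithmetic described in the commit message above, rewritten as C functions over a hypothetical struct tx_ring (not the driver's real structure). NUM_TX_DESC = 64 and MAX_SKB_FRAGS = 18 are illustrative values, not necessarily those of the kernels discussed in this thread. It shows why the old TX_BUFFS_AVAIL expression "reaches -1": with unsigned counters, a completely full ring makes it evaluate to UINT_MAX, so the "< MAX_SKB_FRAGS" test never stops the queue, while the new TX_SLOTS_AVAIL evaluates to 0.

```c
#include <assert.h>
#include <limits.h> /* UINT_MAX: the value the old expression reaches */

/* Illustrative constants; not taken from a specific kernel build. */
#define NUM_TX_DESC   64u
#define MAX_SKB_FRAGS 18u

/* Hypothetical stand-in for the two ring counters of struct rtl8169_private.
 * Both are free-running unsigned counters: only their difference matters. */
struct tx_ring {
	unsigned int cur_tx;   /* descriptors handed to the NIC so far */
	unsigned int dirty_tx; /* descriptors reclaimed so far */
};

/* Old TX_BUFFS_AVAIL(): keeps one slot in reserve, so it evaluates to
 * (unsigned int)-1 == UINT_MAX once all NUM_TX_DESC descriptors are used. */
unsigned int tx_buffs_avail(const struct tx_ring *tp)
{
	return tp->dirty_tx + NUM_TX_DESC - tp->cur_tx - 1;
}

/* New TX_SLOTS_AVAIL(): plain count of free descriptors, 0 when full. */
unsigned int tx_slots_avail(const struct tx_ring *tp)
{
	return tp->dirty_tx + NUM_TX_DESC - tp->cur_tx;
}

/* New TX_FRAGS_READY_FOR(): an skb with nr_frags fragments needs
 * nr_frags + 1 descriptors. */
int tx_frags_ready_for(const struct tx_ring *tp, unsigned int nr_frags)
{
	return tx_slots_avail(tp) >= nr_frags + 1;
}

/* Old stop condition in rtl8169_start_xmit(). */
int old_stops_queue(const struct tx_ring *tp)
{
	return tx_buffs_avail(tp) < MAX_SKB_FRAGS;
}

/* New stop condition in rtl8169_start_xmit(). */
int new_stops_queue(const struct tx_ring *tp)
{
	/* At most NUM_TX_DESC descriptors can be outstanding. */
	assert(tp->cur_tx - tp->dirty_tx <= NUM_TX_DESC);
	return !tx_frags_ready_for(tp, MAX_SKB_FRAGS);
}
```

For example, with cur_tx = 64 and dirty_tx = 0 (every descriptor in use), tx_buffs_avail() returns UINT_MAX and old_stops_queue() returns 0, whereas tx_slots_avail() returns 0 and new_stops_queue() returns 1. Because only the difference of the two counters matters, the same holds after they wrap past UINT_MAX.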
* Re: r8169: IO_PAGE_FAULT & netdev watchdog
From: Vincent Pelletier @ 2012-06-01 19:20 UTC
To: Francois Romieu; +Cc: netdev

Thanks for the quick reply.

On Friday 01 June 2012 14:59:49, you wrote:
> Same thing if you reset and remove the pci device through sysfs then ask
> the PCI bridge to scan it again ?

I hadn't tried it before (I should have, I know). Results:
rmmod; reset; modprobe -> doesn't work
rmmod; reset; remove; rescan -> doesn't work either (?!)

> https://bugzilla.kernel.org/show_bug.cgi?id=42899 contains similar if not
> identical IOMMU messages (this #bz is messy but it may be of intereset to
> add yourself to the Cc: list btw).

I found it shortly after my post (while watching the archives, in case someone replied without CC :) ). I posted on that bug, as I couldn't find a way to just add myself to the bug's CC list.

> The r8169 bug is real but the IOMMU message seems rather useless if not
> bogus.

Just being curious, feel free to skip my questions:
If it's bogus, could it be a misinterpretation of its state when the error occurs (I don't know how the CPU knows a fault happened; I guess some IRQ, plus some registers containing the error status, the faulting address, and some process/context identifier)? Or a hardware bug? Or an MMU misconfiguration for some reason?
If it's not bogus, would it be the sign of a firmware bug (accessing some unpredictable memory under certain conditions)?

> You can apply the attached patch but it may not do much for your problem.
> The patch below could make a difference though. Does it ?

I'll try each of them, then both together. Given the poor result I got from reset/remove/rescan, I guess I should reboot between attempts, right? Should I prevent the original module from auto-loading at boot? Maybe more than just r8169?

Regards,
--
Vincent Pelletier
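A small sketch of the sysfs sequence discussed above (reset, remove, then rescan), written in C for illustration. The attribute files reset, remove and rescan are the standard PCI sysfs files; the function names, and the split into a device directory plus a rescan path, are hypothetical choices made so the sequence can also be exercised against scratch files. On a real system it would run as root after rmmod r8169, with dev_dir set to /sys/bus/pci/devices/0000:05:00.0 and rescan_path set to /sys/bus/pci/rescan. Note that the reset file only exists when the kernel knows a reset method for the device.

```c
#include <assert.h>
#include <stdio.h>

/* Write "1" to a sysfs attribute. Returns 0 on success, -1 on failure. */
int sysfs_write_one(const char *path)
{
	FILE *f = fopen(path, "w");
	int rc = 0;

	if (f == NULL)
		return -1;
	if (fputs("1\n", f) == EOF)
		rc = -1;
	/* stdio buffers the data, so the actual write(), and any error the
	 * kernel returns for it, happens at fclose() time: check it too. */
	if (fclose(f) != 0)
		rc = -1;
	return rc;
}

/* Reset and hot-remove one PCI function, then rescan the buses.
 * dev_dir:     e.g. "/sys/bus/pci/devices/0000:05:00.0"
 * rescan_path: e.g. "/sys/bus/pci/rescan"
 * Returns 0 on success, or the number (1-3) of the step that failed. */
int pci_reset_remove_rescan(const char *dev_dir, const char *rescan_path)
{
	char path[4096];

	assert(dev_dir != NULL && rescan_path != NULL);

	snprintf(path, sizeof path, "%s/reset", dev_dir);  /* step 1 */
	if (sysfs_write_one(path) != 0)
		return 1;

	snprintf(path, sizeof path, "%s/remove", dev_dir); /* step 2 */
	if (sysfs_write_one(path) != 0)
		return 2;

	if (sysfs_write_one(rescan_path) != 0)             /* step 3 */
		return 3;
	return 0;
}
```

Returning the number of the failing step makes it easy to tell, for instance, a missing reset attribute (step 1) from a refused rescan (step 3).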
* Re: r8169: IO_PAGE_FAULT & netdev watchdog
From: Francois Romieu @ 2012-06-01 20:13 UTC
To: Vincent Pelletier; +Cc: netdev

Vincent Pelletier <plr.vincent@gmail.com> :
[...]
> If it's bogus, could it be a mis-interpretation of its state when the error
> occurs (I don't know how CPU knows a fault happened, I guess some IRQ + some
> register contain error status, address of error, some process/context
> identifier) ?

See the "AMD I/O Virtualization Technology (IOMMU) Specification".

> Or hardware bug ? Or MMU misconfiguration for some reason ?

I don't have time to poke deeply enough into the IOMMU code.

[...]
> If it's not bogus, would it be the sign of firmware bug (accessing some
> unpredictable memory upon certain conditions) ?

That's what I thought at first. Or that I should have added something to the r8169 driver. However, it's quite reproducible, the failing address is the address of one of the mapped Rx or Tx descriptor rings (I don't remember which one; see the PR on kernel.org bugzilla), and it does not fit the timing pattern.

[...]
> I'll try either and both. Given the poor result I got from
> reset/remove/rescan, I guess I should reboot between attempts, right ?

Yes. The inlined patch could help avoid the problem, but it is not meant to help a failed network adapter recover.

> Should I prevent original module auto-loading at boot ? Maybe more than just
> r8169 ?

It should not be required. YMMV.

--
Ueimor
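Following the pointer to the AMD IOMMU specification, here is a sketch of how the flags field of the AMD-Vi IO_PAGE_FAULT line could be decoded. The bit positions are my reading of the specification's IO_PAGE_FAULT event layout, assuming the kernel prints bits 16 to 27 of the event's second dword as flags; check them against the specification revision at hand before relying on them. The struct and function names are hypothetical. Under that layout, flags=0x0050 from the log in this thread decodes as a read of a present page that failed a permission check, which is consistent with the reading that the r8169 attempted a read access.

```c
#include <assert.h>
#include <stdio.h>

/* IO_PAGE_FAULT flag bits, as I read the AMD IOMMU specification
 * (assumption: verify against the spec revision at hand). */
#define IOPF_FLAG_I  0x008u /* the request was an interrupt, not a memory access */
#define IOPF_FLAG_PR 0x010u /* the page was marked present */
#define IOPF_FLAG_RW 0x020u /* 1 = write access, 0 = read access */
#define IOPF_FLAG_PE 0x040u /* the access violated the page permissions */
#define IOPF_FLAG_TR 0x100u /* the request was a translation request */

/* Hypothetical decoded view of the flags field. */
struct iopf_info {
	int write;
	int present;
	int perm_error;
	int interrupt;
	int translation;
};

struct iopf_info iopf_decode(unsigned int flags)
{
	struct iopf_info info = {
		.write       = (flags & IOPF_FLAG_RW) != 0,
		.present     = (flags & IOPF_FLAG_PR) != 0,
		.perm_error  = (flags & IOPF_FLAG_PE) != 0,
		.interrupt   = (flags & IOPF_FLAG_I) != 0,
		.translation = (flags & IOPF_FLAG_TR) != 0,
	};
	return info;
}

void iopf_print(unsigned int flags)
{
	struct iopf_info info = iopf_decode(flags);

	/* The field is 12 bits wide; anything above that is not a flag. */
	assert((flags & ~0xfffu) == 0);
	printf("flags=0x%04x: %s access, page %s%s%s%s\n", flags,
	       info.write ? "write" : "read",
	       info.present ? "present" : "not present",
	       info.perm_error ? ", permission violation" : "",
	       info.interrupt ? ", interrupt request" : "",
	       info.translation ? ", translation request" : "");
}
```

With the value from the log, iopf_print(0x0050) prints "flags=0x0050: read access, page present, permission violation".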
* Re: r8169: IO_PAGE_FAULT & netdev watchdog
From: Vincent Pelletier @ 2012-06-02 9:08 UTC
To: Francois Romieu; +Cc: netdev

[-- Attachment #1: Type: Text/Plain, Size: 799 bytes --]

On Friday 01 June 2012 14:59:49, Francois Romieu wrote:
> You can apply the attached patch but it may not do much for your problem.

After failing to build the module on its own in a way that the Debian-provided kernel would accept loading, I fell back to building a vanilla kernel plus the proposed patches.

I first went for 3.4, but realised the patch you attached was already applied there. So I went with 3.3.7, and the patch failed to apply, at least partly because 3.3.7 lacks "r8169: fix early queue wake-up."[1]. I solved the conflicts manually, but I'm not sure of the result. Could you confirm that the attached patch should give the expected result? Or should I stick to 3.4 and only test the inlined patch?

[1] ae1f23fb433ac0aaff8aeaa5a7b14348e9aa8277

Regards,
--
Vincent Pelletier

[-- Attachment #2: for_3.3.7.patch --]
[-- Type: text/x-patch, Size: 1564 bytes --]

--- drivers/net/ethernet/realtek/r8169.c.orig 2012-06-02 10:26:45.000000000 +0200 +++ drivers/net/ethernet/realtek/r8169.c 2012-06-02 10:58:37.000000000 +0200 @@ -62,8 +62,12 @@ #define R8169_MSG_DEFAULT \ (NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN) -#define TX_BUFFS_AVAIL(tp) \ - (tp->dirty_tx + NUM_TX_DESC - tp->cur_tx - 1) +#define TX_SLOTS_AVAIL(tp) \ + (tp->dirty_tx + NUM_TX_DESC - tp->cur_tx) + +/* A skbuff with nr_frags needs nr_frags+1 entries in the tx queue */ +#define TX_FRAGS_READY_FOR(tp,nr_frags) \ + (TX_SLOTS_AVAIL(tp) >= (nr_frags + 1)) /* Maximum number of multicast addresses to filter (vs. Rx-all-multicast). The RTL chips use a 64 element hash table based on the Ethernet CRC.
*/ @@ -5513,7 +5517,7 @@ u32 opts[2]; int frags; - if (unlikely(TX_BUFFS_AVAIL(tp) < skb_shinfo(skb)->nr_frags)) { + if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) { netif_err(tp, drv, dev, "BUG! Tx Ring full when queue awake!\n"); goto err_stop_0; } @@ -5561,10 +5565,10 @@ RTL_W8(TxPoll, NPQ); - if (TX_BUFFS_AVAIL(tp) < MAX_SKB_FRAGS) { + if (!TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) { netif_stop_queue(dev); smp_rmb(); - if (TX_BUFFS_AVAIL(tp) >= MAX_SKB_FRAGS) + if (TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) netif_wake_queue(dev); } @@ -5666,7 +5670,7 @@ tp->dirty_tx = dirty_tx; smp_wmb(); if (netif_queue_stopped(dev) && - (TX_BUFFS_AVAIL(tp) >= MAX_SKB_FRAGS)) { + TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) { netif_wake_queue(dev); } /*
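Both versions of the patch pair a queue stop in the transmit path with a re-check of the free slot count, and a check of the stopped state in the completion path, with memory barriers in between (smp_rmb()/smp_wmb() in the 3.3.7 backport, smp_mb() on both sides in the upstream version quoted earlier in the thread). Below is a user-space analogue of the full-barrier variant, using C11 atomics with sequentially consistent fences in place of smp_mb(). It is a sketch, not kernel code: the struct, the function names and the constants are hypothetical, and descriptor contents are not modeled. The paired fences ensure the two paths cannot both miss each other's update: either the transmit path sees the freed slots and wakes the queue itself, or the completion path sees the stopped queue and wakes it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_TX_DESC   64u /* illustrative ring size */
#define MAX_SKB_FRAGS 18u /* illustrative; a packet may need up to 19 slots */

struct txq {
	atomic_uint cur_tx;   /* advanced only by the transmit path */
	atomic_uint dirty_tx; /* advanced only by the completion path */
	atomic_bool stopped;  /* stands in for netif_queue_stopped() */
};

void txq_init(struct txq *q)
{
	atomic_init(&q->cur_tx, 0u);
	atomic_init(&q->dirty_tx, 0u);
	atomic_init(&q->stopped, false);
}

unsigned int txq_slots_avail(struct txq *q)
{
	return atomic_load_explicit(&q->dirty_tx, memory_order_relaxed) + NUM_TX_DESC -
	       atomic_load_explicit(&q->cur_tx, memory_order_relaxed);
}

bool txq_ready_for(struct txq *q, unsigned int nr_frags)
{
	return txq_slots_avail(q) >= nr_frags + 1;
}

bool txq_stopped(struct txq *q)
{
	return atomic_load_explicit(&q->stopped, memory_order_relaxed);
}

/* Transmit path: queue one packet of nr_frags + 1 descriptors.
 * Returns false if called while the ring lacks room (the driver's
 * "BUG! Tx Ring full when queue awake!" case). */
bool txq_xmit(struct txq *q, unsigned int nr_frags)
{
	if (!txq_ready_for(q, nr_frags))
		return false;
	atomic_fetch_add_explicit(&q->cur_tx, nr_frags + 1, memory_order_relaxed);

	if (!txq_ready_for(q, MAX_SKB_FRAGS)) {
		atomic_store_explicit(&q->stopped, true, memory_order_relaxed);
		/* Like smp_mb(): make the stop visible before re-reading dirty_tx. */
		atomic_thread_fence(memory_order_seq_cst);
		if (txq_ready_for(q, MAX_SKB_FRAGS))
			atomic_store_explicit(&q->stopped, false, memory_order_relaxed);
	}
	return true;
}

/* Completion path: reclaim n descriptors, then wake the queue if needed. */
void txq_complete(struct txq *q, unsigned int n)
{
	atomic_fetch_add_explicit(&q->dirty_tx, n, memory_order_relaxed);
	/* Like smp_mb(): make dirty_tx visible before looking at the stop flag. */
	atomic_thread_fence(memory_order_seq_cst);
	if (txq_stopped(q) && txq_ready_for(q, MAX_SKB_FRAGS))
		atomic_store_explicit(&q->stopped, false, memory_order_relaxed);
}
```

Driven single-threaded with 3-fragment packets (4 slots each), the queue stops after the twelfth packet, when 16 slots remain, and is woken again once completions bring the free count back to 19 (MAX_SKB_FRAGS + 1).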
* Re: r8169: IO_PAGE_FAULT & netdev watchdog
From: Francois Romieu @ 2012-06-02 10:56 UTC
To: Vincent Pelletier; +Cc: netdev

Vincent Pelletier <plr.vincent@gmail.com> :
[...]
> So I went with 3.3.7, and patch failed to apply, at least partly because
> 3.3.7 lacks "r8169: fix early queue wake-up."[1].

And partly because the patch I sent also included its own content in the commit message. :o/

> I solved the conflicts manually, but I'm not sure of the result. Could you
> confirm attached patch might give expected result ?

Yes.

> Or should I stick to 3.4 and only test inlined patch ?

If the inlined patch makes a difference, you should see it with 3.4.

My life is a bit easier when you work somewhere in the main branch (or in davem's -next, but that is not relevant for regression fixes).

--
Ueimor
* Re: r8169: IO_PAGE_FAULT & netdev watchdog
From: Vincent Pelletier @ 2012-06-02 13:42 UTC
To: Francois Romieu; +Cc: netdev

On Saturday 02 June 2012 12:56:45, you wrote:
> And partly because the patch I sent included its content in the commit
> message as well. :o/

I noticed the repetition after trying to apply it on 3.4, and dropped one copy. Only then did I realise it was really already applied.

> If the inlined patch makes a difference, you should see it with 3.4.

It made a difference when testing with netcat: on an unmodified vanilla 3.3.7, network traffic drops to 0 within seconds (up to around 10s). With the patch, it stayed stable for 10 minutes, until I killed nc. I reproduced this with 3.4 as well (no patch = bug, patch = no problem). In both versions without the patch, I got the watchdog warning 10 minutes after the traffic dropped, though without the IO_PAGE_FAULT message.

I first spent quite some time testing with nc in UDP mode, and couldn't reproduce the issue (then I switched to TCP, as described above). Does that make any sense?

I also noticed that the significant lag at bootup when eth0 is brought up is much reduced on the patched kernel. Does that make sense?

FWIW, the commands I used were based on:
nc -l -p 5555 < /dev/zero > /dev/null
with or without the -u flag, and of course the equivalent client-side command, so the connection was used full-duplex at maximum speed: 450Mb/s in TCP, 800+Mb/s in UDP, each way. UDP was limited by CPU on one side (~450Mb/s upload from that box, 800Mb/s download, 100% CPU on it). Values are as reported by nload & htop.

All tests were done in runlevel 2, with rsyslog manually started with its init script.

> My life is a bit easier when you work somewhere in the main branch
> (or in davem's -next but it is not relevant for regression fixes).

I'm not sure: does the 3.4 tarball from kernel.org qualify as the "main branch"? If not, which git repository and branch should I use?

Regards,
--
Vincent Pelletier