* r8169: IO_PAGE_FAULT & netdev watchdog
@ 2012-05-31 21:31 Vincent Pelletier
  2012-06-01 12:59 ` Francois Romieu
  0 siblings, 1 reply; 7+ messages in thread
From: Vincent Pelletier @ 2012-05-31 21:31 UTC
  To: netdev

Hi.

First of all, I'm running 3.3.4 as of debian experimental (the rest of
userland is from sid). I am not subscribed to this list, so please keep me
in CC.

I consistently get errors after a few minutes when using btlaunchmanycurses
(a multi-torrent downloader). I usually first notice the network being down
(no traffic), then find this in syslog (see at bottom).

Then, I "ifdown eth0;rmmod r8169;modprobe r8169" (which implicitly ifup's),
but the network never comes back - at least no traffic can go through - until
reboot.

www.kerneloops.org being down (apparently for quite some time...), I thought I
should report here.

I'm quite sure this problem also occurred on 3.2, but I don't know the exact
version I was using at that time. I have only had this motherboard for a few
months, and the previous one didn't have an IOMMU - which in my understanding is
what causes (well, detects actually) this error.

May 31 22:54:55 x2 kernel: [78579.111904] AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0019 address=0x0000000000003000 flags=0x0050]
May 31 22:55:07 x2 kernel: [78590.832047] ------------[ cut here ]------------
May 31 22:55:07 x2 kernel: [78590.832067] WARNING: at /build/buildd-linux-2.6_3.3.4-1~experimental.1-amd64-_y3OdD/linux-2.6-3.3.4/debian/build/source_amd64_none/net/sched/sch_generic.c:256 dev_watchdog+0xf2/0x151()
May 31 22:55:07 x2 kernel: [78590.832080] Hardware name: GA-990FXA-UD3
May 31 22:55:07 x2 kernel: [78590.832087] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
May 31 22:55:07 x2 kernel: [78590.832093] Modules linked in: pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hrtimer cpufreq_powersave cpufreq_stats cpufreq_userspace cpufreq_conservative xt_multiport iptable_filter ip_tables x_tables tun parport_pc ppdev lp parport binfmt_misc ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp 
libiscsi scsi_transport_iscsi fuse nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc ext3 mbcache jbd dm_crypt raid1 md_mod powernow_k8 mperf adt7475 it87 hwmon_vid snd_emu10k1_synth snd_emux_synth snd_seq_midi_emul snd_seq_virmidi snd_emu10k1 snd_util_mem snd_ac97_codec snd_hwdep snd_pcm_oss snd_mixer_oss joydev snd_pcm snd_page_alloc nouveau snd_seq_midi 
snd_seq_midi_event snd_rawmidi snd_seq video ttm drm_kms_helper drm sp5100_tco i2c_piix4 snd_seq_device k10temp snd_timer i2c_core mxm_wmi snd emu10k1_gp gameport edac_mce_amd edac_core evdev pcspkr wmi processor soundcore ac97_bus button thermal_sys sr_mod cdrom usbhid hid power_supply re
May 31 22:55:07 x2 kernel: iserfs dm_mod nbd usb_storage uas sd_mod crc_t10dif ohci_hcd firewire_ohci firewire_core crc_itu_t ahci libahci ehci_hcd xhci_hcd r8169 mii libata scsi_mod usbcore usb_common [last unloaded: scsi_wait_scan]
May 31 22:55:07 x2 kernel: [78590.832306] Pid: 0, comm: swapper/0 Tainted: G        W  O 3.3.0-trunk-amd64 #1
May 31 22:55:07 x2 kernel: [78590.832314] Call Trace:
May 31 22:55:07 x2 kernel: [78590.832319]  <IRQ>  [<ffffffff810387cb>] ? warn_slowpath_common+0x78/0x8c
May 31 22:55:07 x2 kernel: [78590.832339]  [<ffffffff81038877>] ? warn_slowpath_fmt+0x45/0x4a
May 31 22:55:07 x2 kernel: [78590.832349]  [<ffffffff812aa28d>] ? netif_tx_lock+0x40/0x76
May 31 22:55:07 x2 kernel: [78590.832363]  [<ffffffff812aa3ff>] ? dev_watchdog+0xf2/0x151
May 31 22:55:07 x2 kernel: [78590.832374]  [<ffffffff81043ef1>] ? run_timer_softirq+0x19a/0x261
May 31 22:55:07 x2 kernel: [78590.832383]  [<ffffffff812aa30d>] ? netif_tx_unlock+0x4a/0x4a
May 31 22:55:07 x2 kernel: [78590.832395]  [<ffffffff8103de20>] ? __do_softirq+0xb9/0x177
May 31 22:55:07 x2 kernel: [78590.832405]  [<ffffffff8106d15b>] ? timekeeping_get_ns+0xd/0x2a
May 31 22:55:07 x2 kernel: [78590.832417]  [<ffffffff81358b5c>] ? call_softirq+0x1c/0x30
May 31 22:55:07 x2 kernel: [78590.832428]  [<ffffffff8100fa35>] ? do_softirq+0x3c/0x7b
May 31 22:55:07 x2 kernel: [78590.832438]  [<ffffffff8103e088>] ? irq_exit+0x3c/0x96
May 31 22:55:07 x2 kernel: [78590.832447]  [<ffffffff8100f763>] ? do_IRQ+0x82/0x98
May 31 22:55:07 x2 kernel: [78590.832459]  [<ffffffff8135282e>] ? common_interrupt+0x6e/0x6e
May 31 22:55:07 x2 kernel: [78590.832464]  <EOI>  [<ffffffff8102b0c8>] ? native_safe_halt+0x2/0x3
May 31 22:55:07 x2 kernel: [78590.832481]  [<ffffffff81014798>] ? default_idle+0x47/0x7f
May 31 22:55:07 x2 kernel: [78590.832490]  [<ffffffff8101488f>] ? amd_e400_idle+0xbf/0xe4
May 31 22:55:07 x2 kernel: [78590.832500]  [<ffffffff8100d252>] ? cpu_idle+0xaf/0xf7
May 31 22:55:07 x2 kernel: [78590.832510]  [<ffffffff8169ab37>] ? start_kernel+0x3bd/0x3c8
May 31 22:55:07 x2 kernel: [78590.832519]  [<ffffffff8169a140>] ? early_idt_handlers+0x140/0x140
May 31 22:55:07 x2 kernel: [78590.832529]  [<ffffffff8169a3c3>] ? x86_64_start_kernel+0x104/0x111
May 31 22:55:07 x2 kernel: [78590.832537] ---[ end trace 627ebd8c70d61b1a ]---
May 31 22:55:07 x2 kernel: [78590.848660] r8169 0000:05:00.0: eth0: link up
May 31 22:55:19 x2 kernel: [78602.848659] r8169 0000:05:00.0: eth0: link up
May 31 22:55:31 x2 kernel: [78614.848656] r8169 0000:05:00.0: eth0: link up
May 31 22:55:43 x2 kernel: [78626.848800] r8169 0000:05:00.0: eth0: link up
May 31 22:55:55 x2 ovpn-nexedi[2610]: NOTE: OpenVPN 2.1 requires '--script-security 2' or higher to call user-defined scripts or executables
May 31 22:56:31 x2 kernel: [78674.848666] r8169 0000:05:00.0: eth0: link up
May 31 22:57:19 x2 kernel: [78722.848598] r8169 0000:05:00.0: eth0: link up
May 31 22:58:07 x2 kernel: [78770.848662] r8169 0000:05:00.0: eth0: link up
May 31 22:58:17 x2 avahi-daemon[2744]: Withdrawing address record for 192.168.0.16 on eth0.
May 31 22:58:17 x2 avahi-daemon[2744]: Leaving mDNS multicast group on interface eth0.IPv4 with address 192.168.0.16.
May 31 22:58:17 x2 avahi-daemon[2744]: Interface eth0.IPv4 no longer relevant for mDNS.
May 31 22:58:17 x2 avahi-daemon[2744]: Interface eth0.IPv6 no longer relevant for mDNS.
May 31 22:58:17 x2 avahi-daemon[2744]: Leaving mDNS multicast group on interface eth0.IPv6 with address fe80::52e5:49ff:feb4:ed6f.
May 31 22:58:17 x2 avahi-daemon[2744]: Withdrawing address record for fe80::52e5:49ff:feb4:ed6f on eth0.
May 31 22:58:25 x2 avahi-daemon[2744]: Withdrawing workstation service for tun0.
May 31 22:59:29 x2 avahi-daemon[2744]: Withdrawing workstation service for eth0.
May 31 22:59:33 x2 kernel: [78856.929121] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
May 31 22:59:33 x2 kernel: [78856.929312] r8169 0000:05:00.0: irq 41 for MSI/MSI-X
May 31 22:59:33 x2 kernel: [78856.930671] r8169 0000:05:00.0: eth0: RTL8168evl/8111evl at 0xffffc90000c1e000, 50:e5:49:b4:ed:6f, XID 0c900880 IRQ 41
May 31 22:59:33 x2 kernel: [78856.930685] r8169 0000:05:00.0: eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
May 31 22:59:33 x2 avahi-daemon[2744]: Joining mDNS multicast group on interface eth0.IPv4 with address 192.168.0.16.
May 31 22:59:33 x2 kernel: [78857.169029] r8169 0000:05:00.0: eth0: link down
May 31 22:59:33 x2 kernel: [78857.169043] r8169 0000:05:00.0: eth0: link down
May 31 22:59:33 x2 kernel: [78857.171749] ADDRCONF(NETDEV_UP): eth0: link is not ready
May 31 22:59:33 x2 avahi-daemon[2744]: New relevant interface eth0.IPv4 for mDNS.
May 31 22:59:33 x2 avahi-daemon[2744]: Registering new address record for 192.168.0.16 on eth0.IPv4.
May 31 22:59:36 x2 kernel: [78859.538358] r8169 0000:05:00.0: eth0: link up
May 31 22:59:36 x2 kernel: [78859.539012] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
May 31 22:59:37 x2 avahi-daemon[2744]: Joining mDNS multicast group on interface eth0.IPv6 with address fe80::52e5:49ff:feb4:ed6f.
May 31 22:59:37 x2 avahi-daemon[2744]: New relevant interface eth0.IPv6 for mDNS.
May 31 22:59:37 x2 avahi-daemon[2744]: Registering new address record for fe80::52e5:49ff:feb4:ed6f on eth0.*.
May 31 22:59:46 x2 kernel: [78870.104066] eth0: no IPv6 routers present
May 31 23:00:00 x2 kernel: [78883.792620] r8169 0000:05:00.0: eth0: link up
May 31 23:00:37 x2 kerneloops: Submitted 2 kernel oopses to www.kerneloops.org
May 31 23:00:48 x2 kernel: [78931.792643] r8169 0000:05:00.0: eth0: link up
May 31 23:01:21 x2 kernel: [78965.124469] r8169 0000:05:00.0: eth0: link down
May 31 23:01:26 x2 kernel: [78969.278184] r8169 0000:05:00.0: eth0: link up
May 31 23:01:27 x2 kerneloops: Submitted 1 kernel oopses to www.kerneloops.org
May 31 23:01:44 x2 kernel: [78987.792649] r8169 0000:05:00.0: eth0: link up
May 31 23:02:32 x2 kernel: [79035.792636] r8169 0000:05:00.0: eth0: link up
May 31 23:02:54 x2 shutdown[9402]: shutting down for system reboot

Regards,
-- 
Vincent Pelletier


* Re: r8169: IO_PAGE_FAULT & netdev watchdog
  2012-05-31 21:31 r8169: IO_PAGE_FAULT & netdev watchdog Vincent Pelletier
@ 2012-06-01 12:59 ` Francois Romieu
  2012-06-01 19:20   ` Vincent Pelletier
  2012-06-02  9:08   ` Vincent Pelletier
  0 siblings, 2 replies; 7+ messages in thread
From: Francois Romieu @ 2012-06-01 12:59 UTC
  To: Vincent Pelletier; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 2734 bytes --]

Vincent Pelletier <plr.vincent@gmail.com> :
[...]
> I consistently get errors after a few minutes when using btlaunchmanycurses
> (a multi-torrent downloader). I usually first notice the network being down
> (no traffic), then find this in syslog (see at bottom).
> 
> Then, I "ifdown eth0;rmmod r8169;modprobe r8169" (which implicitly ifup's),
> but the network never comes back - at least no traffic can go through - until
> reboot.

Same thing if you reset and remove the pci device through sysfs then ask
the PCI bridge to scan it again ?

> www.kerneloops.org being down (apparently for quite some time...), I thought I
> should report here.
> 
> I'm quite sure this problem also occurred on 3.2, but I don't know the exact
> version I was using at that time. I have only had this motherboard for a few
> months, and the previous one didn't have an IOMMU - which in my understanding is
> what causes (well, detects actually) this error.

https://bugzilla.kernel.org/show_bug.cgi?id=42899 contains similar if not
identical IOMMU messages (this #bz is messy but it may be of interest to
add yourself to the Cc: list btw).
AFAIUI the IOMMU complains because the r8169 tried to perform a read access.
The target address matches the start of one of the descriptor rings. However it
happens long after the r8169 initialized the chipset, and the driver would
work rather poorly if it could not access its descriptor rings. The r8169
bug is real but the IOMMU message seems rather useless if not bogus.
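
For anyone comparing their own logs against this report, here is a minimal
sketch in C that pulls the fields out of an "AMD-Vi: Event logged
[IO_PAGE_FAULT ...]" syslog line and checks whether the faulting address sits
at the start of a 4 KiB page. The struct and function names are hypothetical
and only for illustration. The flags word is kept raw, since decoding its
bits needs the AMD IOMMU specification.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields of one "AMD-Vi: Event logged [IO_PAGE_FAULT ...]" syslog line.
 * Struct and function names are hypothetical, for illustration only. */
struct io_page_fault {
	unsigned int bus, dev, fn;	/* PCI address of the faulting device */
	unsigned int domain;		/* IOMMU protection domain id */
	unsigned long long address;	/* I/O virtual address of the access */
	unsigned int flags;		/* raw flags word, left undecoded here */
};

/* Returns 0 on success, -1 if the line holds no parsable IO_PAGE_FAULT. */
int parse_io_page_fault(const char *line, struct io_page_fault *ev)
{
	const char *p;

	assert(line != NULL && ev != NULL);

	p = strstr(line, "IO_PAGE_FAULT device=");
	if (!p)
		return -1;

	/* %x accepts an optional "0x" prefix, so both "05" and "0x0019" parse. */
	if (sscanf(p,
		   "IO_PAGE_FAULT device=%x:%x.%x domain=%x address=%llx flags=%x",
		   &ev->bus, &ev->dev, &ev->fn,
		   &ev->domain, &ev->address, &ev->flags) != 6)
		return -1;
	return 0;
}

/* Offset of the faulting address inside its 4 KiB page; 0 means the access
 * targeted the very start of a page. */
unsigned int page_offset_4k(unsigned long long address)
{
	return (unsigned int)(address & 0xfffULL);
}
```

Fed the IO_PAGE_FAULT line from the original report, this gives device
05:00.0, domain 0x19, address 0x3000 and a page offset of 0.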

> May 31 22:54:55 x2 kernel: [78579.111904] AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0019 address=0x0000000000003000 flags=0x0050]
> May 31 22:55:07 x2 kernel: [78590.832047] ------------[ cut here ]------------
> May 31 22:55:07 x2 kernel: [78590.832067] WARNING: at /build/buildd-linux-2.6_3.3.4-1~experimental.1-amd64-_y3OdD/linux-2.6-3.3.4/debian/build/source_amd64_none/net/sched/sch_generic.c:256 dev_watchdog+0xf2/0x151()
> May 31 22:55:07 x2 kernel: [78590.832080] Hardware name: GA-990FXA-UD3
> May 31 22:55:07 x2 kernel: [78590.832087] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out

You can apply the attached patch but it may not do much for your problem.

The patch below could make a difference though. Does it ?

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index bbacb37..da46588 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -3766,6 +3766,7 @@ static void rtl_init_rxcfg(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_22:
 	case RTL_GIGA_MAC_VER_23:
 	case RTL_GIGA_MAC_VER_24:
+	case RTL_GIGA_MAC_VER_34:
 		RTL_W32(RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
 		break;
 	default:


-- 
Ueimor

[-- Attachment #2: 0001-PATCH-r8169-fix-unsigned-int-wraparound-with-TSO.patch --]
[-- Type: text/plain, Size: 4949 bytes --]

>From 3068d55417db4c8e3414ce840afb932fdf1f0f76 Mon Sep 17 00:00:00 2001
Message-Id: <3068d55417db4c8e3414ce840afb932fdf1f0f76.1338553193.git.romieu@fr.zoreil.com>
From: Julien Ducourthial <jducourt@free.fr>
Date: Fri, 1 Jun 2012 14:17:43 +0200
Subject: [PATCH] [PATCH] r8169: fix unsigned int wraparound with TSO
X-Organisation: Land of Sunshine Inc.

[ Upstream commit 477206a018f902895bfcd069dd820bfe94c187b1 ]

The r8169 may get stuck or show bad behaviour after activating TSO :
the net_device is not stopped when it has no more TX descriptors.
This problem comes from TX_BUFS_AVAIL which may reach -1 when all
transmit descriptors are in use. The patch simply tries to keep positive
values.

Tested with 8111d(onboard) on a D510MO, and with 8111e(onboard) on a
Zotac 890GXITX.

Signed-off-by: Julien Ducourthial <jducourt@free.fr>
Acked-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/ethernet/realtek/r8169.c |   16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index f545093..ce6b44d 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -61,8 +61,12 @@
 #define R8169_MSG_DEFAULT \
 	(NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN)

-#define TX_BUFFS_AVAIL(tp) \
-	(tp->dirty_tx + NUM_TX_DESC - tp->cur_tx - 1)
+#define TX_SLOTS_AVAIL(tp) \
+	(tp->dirty_tx + NUM_TX_DESC - tp->cur_tx)
+
+/* A skbuff with nr_frags needs nr_frags+1 entries in the tx queue */
+#define TX_FRAGS_READY_FOR(tp,nr_frags) \
+	(TX_SLOTS_AVAIL(tp) >= (nr_frags + 1))

 /* Maximum number of multicast addresses to filter (vs. Rx-all-multicast).
    The RTL chips use a 64 element hash table based on the Ethernet CRC. */
@@ -5115,7 +5119,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 	u32 opts[2];
 	int frags;

-	if (unlikely(TX_BUFFS_AVAIL(tp) < skb_shinfo(skb)->nr_frags)) {
+	if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) {
 		netif_err(tp, drv, dev, "BUG! Tx Ring full when queue awake!\n");
 		goto err_stop_0;
 	}
@@ -5169,7 +5173,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,

 	mmiowb();

-	if (TX_BUFFS_AVAIL(tp) < MAX_SKB_FRAGS) {
+	if (!TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) {
 		/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
 		 * not miss a ring update when it notices a stopped queue.
 		 */
@@ -5183,7 +5187,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 		 * can't.
 		 */
 		smp_mb();
-		if (TX_BUFFS_AVAIL(tp) >= MAX_SKB_FRAGS)
+		if (TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS))
 			netif_wake_queue(dev);
 	}

@@ -5306,7 +5310,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 		 */
 		smp_mb();
 		if (netif_queue_stopped(dev) &&
-		    (TX_BUFFS_AVAIL(tp) >= MAX_SKB_FRAGS)) {
+		    TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) {
 			netif_wake_queue(dev);
 		}
 		/*
--
1.7.10.2
---
 drivers/net/ethernet/realtek/r8169.c |   16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index da46588..59dd29e 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -62,8 +62,12 @@
 #define R8169_MSG_DEFAULT \
 	(NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN)
 
-#define TX_BUFFS_AVAIL(tp) \
-	(tp->dirty_tx + NUM_TX_DESC - tp->cur_tx - 1)
+#define TX_SLOTS_AVAIL(tp) \
+	(tp->dirty_tx + NUM_TX_DESC - tp->cur_tx)
+
+/* A skbuff with nr_frags needs nr_frags+1 entries in the tx queue */
+#define TX_FRAGS_READY_FOR(tp,nr_frags) \
+	(TX_SLOTS_AVAIL(tp) >= (nr_frags + 1))
 
 /* Maximum number of multicast addresses to filter (vs. Rx-all-multicast).
    The RTL chips use a 64 element hash table based on the Ethernet CRC. */
@@ -5513,7 +5517,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 	u32 opts[2];
 	int frags;
 
-	if (unlikely(TX_BUFFS_AVAIL(tp) < skb_shinfo(skb)->nr_frags)) {
+	if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) {
 		netif_err(tp, drv, dev, "BUG! Tx Ring full when queue awake!\n");
 		goto err_stop_0;
 	}
@@ -5561,10 +5565,10 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 
 	RTL_W8(TxPoll, NPQ);
 
-	if (TX_BUFFS_AVAIL(tp) < MAX_SKB_FRAGS) {
+	if (!TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) {
 		netif_stop_queue(dev);
 		smp_rmb();
-		if (TX_BUFFS_AVAIL(tp) >= MAX_SKB_FRAGS)
+		if (TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS))
 			netif_wake_queue(dev);
 	}
 
@@ -5666,7 +5670,7 @@ static void rtl8169_tx_interrupt(struct net_device *dev,
 		tp->dirty_tx = dirty_tx;
 		smp_wmb();
 		if (netif_queue_stopped(dev) &&
-		    (TX_BUFFS_AVAIL(tp) >= MAX_SKB_FRAGS)) {
+		    TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) {
 			netif_wake_queue(dev);
 		}
 		/*
-- 
1.7.10.2
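
The wraparound described in the commit message above can be reproduced
outside the kernel. The sketch below assumes free-running 32-bit unsigned
ring counters (consistent with the patch title) and a ring of 64 descriptors.
The ring size, the function names and the max_frags parameter are
illustrative assumptions, not taken from the driver source.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ring size for the illustration; the driver defines its own. */
#define NUM_TX_DESC 64u

/* Old macro as a function: with a completely full ring
 * (cur_tx - dirty_tx == NUM_TX_DESC) the result is 0 - 1, which is
 * UINT32_MAX in unsigned arithmetic rather than -1. */
uint32_t tx_buffs_avail_old(uint32_t dirty_tx, uint32_t cur_tx)
{
	return (uint32_t)(dirty_tx + NUM_TX_DESC - cur_tx - 1);
}

/* New macro from the patch: number of free slots. It cannot wrap as long
 * as at most NUM_TX_DESC descriptors are in flight. */
uint32_t tx_slots_avail(uint32_t dirty_tx, uint32_t cur_tx)
{
	assert((uint32_t)(cur_tx - dirty_tx) <= NUM_TX_DESC);
	return (uint32_t)(dirty_tx + NUM_TX_DESC - cur_tx);
}

/* A skb with nr_frags fragments needs nr_frags + 1 descriptors. */
int tx_frags_ready_for(uint32_t dirty_tx, uint32_t cur_tx, uint32_t nr_frags)
{
	return tx_slots_avail(dirty_tx, cur_tx) >= nr_frags + 1;
}

/* Old stop condition: stop the queue when fewer than max_frags are free.
 * With a full ring the huge wrapped value makes this return 0, so the
 * queue is left running although no descriptor is available. */
int old_check_stops_queue(uint32_t dirty_tx, uint32_t cur_tx,
			  uint32_t max_frags)
{
	return tx_buffs_avail_old(dirty_tx, cur_tx) < max_frags;
}
```

With dirty_tx = 0 and cur_tx = 64, the old expression yields UINT32_MAX, so
the old check never stops the queue. The new helper reports 0 free slots and
"not ready" for any fragment count.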



* Re: r8169: IO_PAGE_FAULT & netdev watchdog
  2012-06-01 12:59 ` Francois Romieu
@ 2012-06-01 19:20   ` Vincent Pelletier
  2012-06-01 20:13     ` Francois Romieu
  2012-06-02  9:08   ` Vincent Pelletier
  1 sibling, 1 reply; 7+ messages in thread
From: Vincent Pelletier @ 2012-06-01 19:20 UTC
  To: Francois Romieu; +Cc: netdev

Thanks for the quick reply.

On Friday 01 June 2012 14:59:49, you wrote:
> Same thing if you reset and remove the pci device through sysfs then ask
> the PCI bridge to scan it again ?

I didn't try it before - but I should have, I know this.
rmmod; reset; modprobe -> doesn't work
rmmod; reset; remove; rescan -> doesn't work either (?!)
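
For reference, the reset / remove / rescan sequence above can be sketched in
C as writes of "1" to the standard PCI sysfs attributes. This is only an
illustration under assumptions: the helper names are hypothetical, the
rescan is done bus-wide through /sys/bus/pci/rescan rather than on the parent
bridge, and the "reset" attribute only exists when the device supports a
reset method. It needs root, and the driver should be unloaded first.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "/sys/bus/pci/devices/<bdf>/<attr>" into buf.
 * Returns 0 on success, -1 on empty input or if the path does not fit. */
int pci_sysfs_path(char *buf, size_t size, const char *bdf, const char *attr)
{
	int n;

	assert(buf != NULL && bdf != NULL && attr != NULL);
	if (strlen(bdf) == 0 || strlen(attr) == 0)
		return -1;
	n = snprintf(buf, size, "/sys/bus/pci/devices/%s/%s", bdf, attr);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Writes "1" to a sysfs attribute; returns 0 on success, -1 on error. */
int sysfs_write_one(const char *path)
{
	FILE *f = fopen(path, "w");
	int rc = 0;

	if (!f)
		return -1;
	if (fputs("1", f) == EOF)
		rc = -1;
	if (fclose(f) == EOF)
		rc = -1;
	return rc;
}

/* reset -> remove -> rescan for a device such as "0000:05:00.0". */
int pci_reset_remove_rescan(const char *bdf)
{
	char path[128];

	if (pci_sysfs_path(path, sizeof(path), bdf, "reset") ||
	    sysfs_write_one(path))
		return -1;
	if (pci_sysfs_path(path, sizeof(path), bdf, "remove") ||
	    sysfs_write_one(path))
		return -1;
	/* Bus-wide rescan, which re-discovers the removed device. */
	return sysfs_write_one("/sys/bus/pci/rescan");
}
```

Calling pci_reset_remove_rescan("0000:05:00.0") followed by "modprobe r8169"
corresponds to the second line above.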

> https://bugzilla.kernel.org/show_bug.cgi?id=42899 contains similar if not
> identical IOMMU messages (this #bz is messy but it may be of interest to
> add yourself to the Cc: list btw).

I found it a bit after my post (while watching the archives, in case someone
replied without CC :) ). I posted on that bug as I couldn't find a way to just
add myself to the bug's CC list.

> The r8169 bug is real but the IOMMU message seems rather useless if not
> bogus.

Just being curious, feel free to skip over my questions:
If it's bogus, could it be a mis-interpretation of its state when the error 
occurs (I don't know how CPU knows a fault happened, I guess some IRQ + some 
register contain error status, address of error, some process/context 
identifier) ? Or hardware bug ? Or MMU misconfiguration for some reason ?
If it's not bogus, would it be the sign of firmware bug (accessing some 
unpredictable memory upon certain conditions) ?

> You can apply the attached patch but it may not do much for your problem.
> The patch below could make a difference though. Does it ?

I'll try either and both. Given the poor result I got from 
reset/remove/rescan, I guess I should reboot between attempts, right ?
Should I prevent original module auto-loading at boot ? Maybe more than just 
r8169 ?

Regards,
-- 
Vincent Pelletier


* Re: r8169: IO_PAGE_FAULT & netdev watchdog
  2012-06-01 19:20   ` Vincent Pelletier
@ 2012-06-01 20:13     ` Francois Romieu
  0 siblings, 0 replies; 7+ messages in thread
From: Francois Romieu @ 2012-06-01 20:13 UTC
  To: Vincent Pelletier; +Cc: netdev

Vincent Pelletier <plr.vincent@gmail.com> :
[...]
> If it's bogus, could it be a mis-interpretation of its state when the error 
> occurs (I don't know how CPU knows a fault happened, I guess some IRQ + some 
> register contain error status, address of error, some process/context 
> identifier) ?

See "AMD I/O Virtualization Technology (IOMMU) Specification".

> Or hardware bug ? Or MMU misconfiguration for some reason ?

I don't have time to poke deeply enough into the iommu code.

[...]
> If it's not bogus, would it be the sign of firmware bug (accessing some 
> unpredictable memory upon certain conditions) ?

That's what I thought first. Or I should have added something to the r8169
driver. However it's quite reproducible, the failing address is one of the
mapped Rx or Tx descriptor ring addresses - don't remember which one, see
the PR at korg - and it does not fit the timing pattern.

[...]
> I'll try either and both. Given the poor result I got from 
> reset/remove/rescan, I guess I should reboot between attempts, right ?

Yes. The inlined patch could help avoid the problem, but it is not
supposed to help a failed network adapter recover.

> Should I prevent original module auto-loading at boot ? Maybe more than just 
> r8169 ?

It should not be required. YMMV.

-- 
Ueimor


* Re: r8169: IO_PAGE_FAULT & netdev watchdog
  2012-06-01 12:59 ` Francois Romieu
  2012-06-01 19:20   ` Vincent Pelletier
@ 2012-06-02  9:08   ` Vincent Pelletier
  2012-06-02 10:56     ` Francois Romieu
  1 sibling, 1 reply; 7+ messages in thread
From: Vincent Pelletier @ 2012-06-02  9:08 UTC
  To: Francois Romieu; +Cc: netdev

[-- Attachment #1: Type: Text/Plain, Size: 799 bytes --]

On Friday 01 June 2012 14:59:49, Francois Romieu wrote:
> You can apply the attached patch but it may not do much for your problem.

After failing to build the module alone in a way that it would load into the
Debian-provided kernel, I fell back to building a vanilla kernel + the proposed
patches.

I first went for 3.4, but realised the patch you attached was already applied 
there.

So I went with 3.3.7, and patch failed to apply, at least partly because 3.3.7 
lacks "r8169: fix early queue wake-up."[1] . I solved the conflicts manually, 
but I'm not sure of the result. Could you confirm attached patch might give 
expected result ? Or should I stick to 3.4 and only test inlined patch ?

[1] ae1f23fb433ac0aaff8aeaa5a7b14348e9aa8277

Regards,
-- 
Vincent Pelletier

[-- Attachment #2: for_3.3.7.patch --]
[-- Type: text/x-patch, Size: 1564 bytes --]

--- drivers/net/ethernet/realtek/r8169.c.orig	2012-06-02 10:26:45.000000000 +0200
+++ drivers/net/ethernet/realtek/r8169.c	2012-06-02 10:58:37.000000000 +0200
@@ -62,8 +62,12 @@
 #define R8169_MSG_DEFAULT \
 	(NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN)
 
-#define TX_BUFFS_AVAIL(tp) \
-	(tp->dirty_tx + NUM_TX_DESC - tp->cur_tx - 1)
+#define TX_SLOTS_AVAIL(tp) \
+	(tp->dirty_tx + NUM_TX_DESC - tp->cur_tx)
+
+/* A skbuff with nr_frags needs nr_frags+1 entries in the tx queue */
+#define TX_FRAGS_READY_FOR(tp,nr_frags) \
+	(TX_SLOTS_AVAIL(tp) >= (nr_frags + 1))
 
 /* Maximum number of multicast addresses to filter (vs. Rx-all-multicast).
    The RTL chips use a 64 element hash table based on the Ethernet CRC. */
@@ -5513,7 +5517,7 @@
 	u32 opts[2];
 	int frags;
 
-	if (unlikely(TX_BUFFS_AVAIL(tp) < skb_shinfo(skb)->nr_frags)) {
+	if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) {
 		netif_err(tp, drv, dev, "BUG! Tx Ring full when queue awake!\n");
 		goto err_stop_0;
 	}
@@ -5561,10 +5565,10 @@
 
 	RTL_W8(TxPoll, NPQ);
 
-	if (TX_BUFFS_AVAIL(tp) < MAX_SKB_FRAGS) {
+	if (!TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) {
 		netif_stop_queue(dev);
 		smp_rmb();
-		if (TX_BUFFS_AVAIL(tp) >= MAX_SKB_FRAGS)
+		if (TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS))
 			netif_wake_queue(dev);
 	}
 
@@ -5666,7 +5670,7 @@
 		tp->dirty_tx = dirty_tx;
 		smp_wmb();
 		if (netif_queue_stopped(dev) &&
-		    (TX_BUFFS_AVAIL(tp) >= MAX_SKB_FRAGS)) {
+		    TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) {
 			netif_wake_queue(dev);
 		}
 		/*


* Re: r8169: IO_PAGE_FAULT & netdev watchdog
  2012-06-02  9:08   ` Vincent Pelletier
@ 2012-06-02 10:56     ` Francois Romieu
  2012-06-02 13:42       ` Vincent Pelletier
  0 siblings, 1 reply; 7+ messages in thread
From: Francois Romieu @ 2012-06-02 10:56 UTC
  To: Vincent Pelletier; +Cc: netdev

Vincent Pelletier <plr.vincent@gmail.com> :
[...]
> So I went with 3.3.7, and patch failed to apply, at least partly because
> 3.3.7 lacks "r8169: fix early queue wake-up."[1].

And partly because the patch I sent included its content in the commit
message as well. :o/

> I solved the conflicts manually, but I'm not sure of the result. Could you
> confirm attached patch might give expected result ?

Yes.

> Or should I stick to 3.4 and only test inlined patch ?

If the inlined patch makes a difference, you should see it with 3.4.

My life is a bit easier when you work somewhere in the main branch
(or in davem's -next but it is not relevant for regression fixes).

-- 
Ueimor


* Re: r8169: IO_PAGE_FAULT & netdev watchdog
  2012-06-02 10:56     ` Francois Romieu
@ 2012-06-02 13:42       ` Vincent Pelletier
  0 siblings, 0 replies; 7+ messages in thread
From: Vincent Pelletier @ 2012-06-02 13:42 UTC
  To: Francois Romieu; +Cc: netdev

On Saturday 02 June 2012 12:56:45, you wrote:
> And partly because the patch I sent included its content in the commit
> message as well. :o/

I noticed the repetition after trying to apply on 3.4, and dropped one. And 
only then realised it was really already applied.

> If the inlined patch makes a difference, you should see it with 3.4.

It made a difference, when testing with netcat: without any change over
vanilla 3.3.7, network traffic drops to 0 in a matter of seconds (up to around
10s). With it, it stayed stable for 10 minutes, until I killed nc.
I reproduced this with 3.4 as well (no patch = bug, patch = no problem).
In both versions without the patch, I got the watchdog warning 10 minutes after
the traffic drop - though without the IO_PAGE_FAULT message.

I spent quite some time testing with nc in UDP mode first, and couldn't 
reproduce the issue (then I switched to TCP as said above). Does that make any 
sense ?
I also noticed that the significant lag at bootup when eth0 is brought up is
much reduced on the patched kernel. Does that make sense ?

FWIW, the commands I used were based on:
  nc -l -p 5555 < /dev/zero > /dev/null

With/without -u flag, and of course client-side equivalent command so the 
connection was used full-duplex at maximum speed: 450Mb/s in TCP, 800+Mb/s in 
UDP, each way. UDP was limited by CPU on one side (~450Mb/s upload from 
that box, 800Mb/s download, 100% cpu on it).
Values are as reported by nload & htop.
All tests were done in runlevel 2, with rsyslog manually started with its init 
script.

> My life is a bit easier when you work somewhere in the main branch
> (or in davem's -next but it is not relevant for regression fixes).

I'm not sure: does 3.4 tarball from kernel.org qualify as "main branch" ? 
Otherwise, which git repos & branch should I use ?

Regards,
-- 
Vincent Pelletier
