All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net 2/2]  8139cp: reset BQL when ring tx ring cleared
@ 2013-05-20 16:54 Stephen Hemminger
  2013-05-20 21:03 ` David Miller
  2013-05-20 21:37 ` Francois Romieu
  0 siblings, 2 replies; 15+ messages in thread
From: Stephen Hemminger @ 2013-05-20 16:54 UTC (permalink / raw)
  To: David Miller; +Cc: Francois Romieu, netdev

This patch cures transmit timeout's with DHCP observed
while running under KVM. When the transmit ring is cleaned out,
the Byte Queue Limit values need to be reset.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

--- a/drivers/net/ethernet/realtek/8139cp.c	2013-05-20 09:29:49.689900862 -0700
+++ b/drivers/net/ethernet/realtek/8139cp.c	2013-05-20 09:29:50.857880563 -0700
@@ -1141,6 +1141,7 @@ static void cp_clean_rings (struct cp_pr
 			cp->dev->stats.tx_dropped++;
 		}
 	}
+	netdev_reset_queue(cp->dev);
 
 	memset(cp->rx_ring, 0, sizeof(struct cp_desc) * CP_RX_RING_SIZE);
 	memset(cp->tx_ring, 0, sizeof(struct cp_desc) * CP_TX_RING_SIZE);

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2] 8139cp: reset BQL when ring tx ring cleared
  2013-05-20 16:54 [PATCH net 2/2] 8139cp: reset BQL when ring tx ring cleared Stephen Hemminger
@ 2013-05-20 21:03 ` David Miller
  2013-05-20 21:37 ` Francois Romieu
  1 sibling, 0 replies; 15+ messages in thread
From: David Miller @ 2013-05-20 21:03 UTC (permalink / raw)
  To: stephen; +Cc: romieu, netdev

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Mon, 20 May 2013 09:54:43 -0700

> This patch cures transmit timeout's with DHCP observed
> while running under KVM. When the transmit ring is cleaned out,
> the Byte Queue Limit values need to be reset.
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

Applied.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2]  8139cp: reset BQL when ring tx ring cleared
  2013-05-20 16:54 [PATCH net 2/2] 8139cp: reset BQL when ring tx ring cleared Stephen Hemminger
  2013-05-20 21:03 ` David Miller
@ 2013-05-20 21:37 ` Francois Romieu
  2013-05-21  0:27   ` Stephen Hemminger
  1 sibling, 1 reply; 15+ messages in thread
From: Francois Romieu @ 2013-05-20 21:37 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev

Stephen Hemminger <stephen@networkplumber.org> :
> This patch cures transmit timeout's with DHCP observed
> while running under KVM. When the transmit ring is cleaned out,
> the Byte Queue Limit values need to be reset.
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> 
> --- a/drivers/net/ethernet/realtek/8139cp.c	2013-05-20 09:29:49.689900862 -0700
> +++ b/drivers/net/ethernet/realtek/8139cp.c	2013-05-20 09:29:50.857880563 -0700
> @@ -1141,6 +1141,7 @@ static void cp_clean_rings (struct cp_pr
>  			cp->dev->stats.tx_dropped++;
>  		}
>  	}
> +	netdev_reset_queue(cp->dev);
>  
>  	memset(cp->rx_ring, 0, sizeof(struct cp_desc) * CP_RX_RING_SIZE);
>  	memset(cp->tx_ring, 0, sizeof(struct cp_desc) * CP_TX_RING_SIZE);

It's a good catch but I believe that you are not telling us the whole
story. :o)

Where does cp_clean_rings get called ?

- error path of cp_refill_rx
  - cp_init_rings
    - cp_alloc_rings
      - cp_open
        - in cp_change_mtu, after cp_close, whence cp_stop_hw
    - cp_tx_timeout
      -> after cp_stop_hw
- cp_free_rings
  - error path of cp_open
    -> after cp_stop_hw
  - cp_close
    -> after cp_stop_hw
- cp_tx_timeout
  -> after cp_stop_hw

cp_stop_hw includes netdev_reset_queue.

You have imho exhibited a start_xmit after cp_stop_hw race - not sure if
it happens in cp_tx_timeout or cp_change_mtu. Reverting the analysis above,
I have not found a place where cp_stop_hw could be called without being
followed by a cp_clean_rings. The netdev_reset_queue in cp_stop_hw, now
useless, should thus be removed.

Does it make sense ?

-- 
Ueimor

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2]  8139cp: reset BQL when ring tx ring cleared
  2013-05-20 21:37 ` Francois Romieu
@ 2013-05-21  0:27   ` Stephen Hemminger
  2015-09-14 12:05     ` David Woodhouse
  0 siblings, 1 reply; 15+ messages in thread
From: Stephen Hemminger @ 2013-05-21  0:27 UTC (permalink / raw)
  To: Francois Romieu; +Cc: David Miller, netdev

On Mon, 20 May 2013 23:37:28 +0200
Francois Romieu <romieu@fr.zoreil.com> wrote:

> cp_stop_hw includes netdev_reset_queue.
> 
> You have imho exhibited a start_xmit after cp_stop_hw race - not sure if
> it happens in cp_tx_timeout or cp_change_mtu. Reverting the analysis above,
> I have not found a place where cp_stop_hw could be called without being
> followed by a cp_clean_rings. The netdev_reset_queue in cp_stop_hw, now
> useless, should thus be removed.
> 
> Does it make sense ?

Your right, you could probably remove it.

It doesn't solve the problem, still seeing transmit timeouts.
Looks like what happens with DHCP is something else.


 zebra[2210]: interface eth0 index 2 changed <UP,BROADCAST,RUNNING,MULTICAST>.

[   23.960059] ------------[ cut here ]------------
[   23.960107] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x243/0x250()
[   23.960110] Hardware name: Bochs 
[   23.960114] Modules linked in: 9p 9pnet fscache ipv6 tun mperf cpufreq_stats cpufreq_ondemand freq_table cpufreq_userspace cpufreq_conservative virtio_uio(O) cpufreq_powersave igb_uio(O) uio cirrus ttm drm_kms_helper drm sysimgblt sysfillrect virtio_balloon i2c_piix4 i2c_core hid_generic syscopyarea psmouse evdev serio_raw microcode pcspkr intel_agp intel_gtt agpgart processor thermal_sys button vfat fat usb_storage ohci_hcd squashfs loop overlayfs raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov usbhid hid raid6_pq async_tx raid1 raid0 multipath linear md_mod pata_acpi ata_generic virtio_blk virtio_net 8139cp ata_piix 8139too mii virtio_pci virtio_ring virtio floppy
[   23.960177] Pid: 0, comm: swapper/3 Tainted: G           O 3.9.3-1-amd64-vyatta #1
[   23.960179] Call Trace:
[   23.960181]  <IRQ>  [<ffffffff8105479b>] ? warn_slowpath_common+0x7b/0xc0
[   23.960212]  [<ffffffff81054895>] ? warn_slowpath_fmt+0x45/0x50
[   23.960228]  [<ffffffff814dfb45>] ? _raw_spin_lock+0x5/0x10
[   23.960232]  [<ffffffff8143dde3>] ? dev_watchdog+0x243/0x250
[   23.960245]  [<ffffffff810704c0>] ? __queue_work+0x2b0/0x2b0
[   23.960253]  [<ffffffff8143dba0>] ? __netdev_watchdog_up+0x70/0x70
[   23.960257]  [<ffffffff8143dba0>] ? __netdev_watchdog_up+0x70/0x70
[   23.960262]  [<ffffffff81064773>] ? call_timer_fn+0x43/0x130
[   23.960265]  [<ffffffff81064c27>] ? run_timer_softirq+0x237/0x280 
[   23.960269]  [<ffffffff8107b200>] ? update_rmtp+0x80/0x80
[   23.960272]  [<ffffffff814dfb45>] ? _raw_spin_lock+0x5/0x10
[   23.960276]  [<ffffffff8105ca41>] ? __do_softirq+0xe1/0x270
[   23.960280]  [<ffffffff8105ccd5>] ? irq_exit+0xb5/0xc0
[   23.960291]  [<ffffffff81039fd8>] ? smp_apic_timer_interrupt+0x68/0xa0
[   23.960302]  [<ffffffff814e151d>] ? apic_timer_interrupt+0x6d/0x80
[   23.960304]  <EOI>  [<ffffffff81042152>] ? native_safe_halt+0x2/0x10
[   23.960318]  [<ffffffff8101e1ef>] ? default_idle+0x3f/0xf0
[   23.960323]  [<ffffffff8101db48>] ? cpu_idle+0x88/0xd0
[   23.960330]  [<ffffffff814d547a>] ? start_secondary+0x211/0x216
[   23.960333] ---[ end trace a519f7f509aceb9a ]--- 
[   23.960407] 8139cp 0000:00:03.0 eth0: Transmit timeout, status  d   3b    4 80ff
[   35.960137] 8139cp 0000:00:03.0 eth0: Transmit timeout, status  d   3b    5 80ff
[   47.960108] 8139cp 0000:00:03.0 eth0: Transmit timeout, status  d   3b    5 80ff
[   59.960117] 8139cp 0000:00:03.0 eth0: Transmit timeout, status  d   3b    5 80ff
[   71.960115] 8139cp 0000:00:03.0 eth0: Transmit timeout, status  d   3b    5 80ff
[   83.960117] 8139cp 0000:00:03.0 eth0: Transmit timeout, status  d   3b    5 80ff
[   95.960131] 8139cp 0000:00:03.0 eth0: Transmit timeout, status  d   3b    5 80ff
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
~                                                                                                                                                    
"messages" 35 lines, 2958 characters written

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2]  8139cp: reset BQL when ring tx ring cleared
  2013-05-21  0:27   ` Stephen Hemminger
@ 2015-09-14 12:05     ` David Woodhouse
  2015-09-14 21:59       ` Francois Romieu
  0 siblings, 1 reply; 15+ messages in thread
From: David Woodhouse @ 2015-09-14 12:05 UTC (permalink / raw)
  To: Stephen Hemminger, Francois Romieu; +Cc: David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 3679 bytes --]

On Mon, 2013-05-20 at 17:27 -0700, Stephen Hemminger wrote:
> On Mon, 20 May 2013 23:37:28 +0200
> Francois Romieu <romieu@fr.zoreil.com> wrote:
> 
> > cp_stop_hw includes netdev_reset_queue.
> > 
> > You have imho exhibited a start_xmit after cp_stop_hw race - not sure if
> > it happens in cp_tx_timeout or cp_change_mtu. Reverting the analysis above,
> > I have not found a place where cp_stop_hw could be called without being
> > followed by a cp_clean_rings. The netdev_reset_queue in cp_stop_hw, now
> > useless, should thus be removed.
> > 
> > Does it make sense ?
> 
> Your right, you could probably remove it.
> 
> It doesn't solve the problem, still seeing transmit timeouts.
> Looks like what happens with DHCP is something else.

Did you ever work this out? I'm seeing something similar on the inward
-facing interface on my home router under high load — and it doesn't
automatically recover.



[308309.340644] ------------[ cut here ]------------
[308309.345379] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x103/0x190()
[308309.352789] Hardware name: Geos
[308309.356020] NETDEV WATCHDOG: eth1 (8139cp): transmit queue 0 timed out
[308309.362733] Modules linked in: sch_fq_codel sch_teql gpio_keys_polled leds_gpio geodewdt solos_pci ledtrig_heartbeat gpio_cs5535 cs5535_clockevt 8139cp ip6t_REJECT ip6t_rt ip6t_hbh ip6t_mh ip6t_ipv6header ip6t_frag ip6t_eui64 ip6t_ah ip6table_raw ip6table_mangle ip6table_filter ip6_tables nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp xt_HL xt_hl xt_ecn ipt_ECN xt_CLASSIFY xt_time xt_tcpmss xt_statistic xt_mark xt_length xt_DSCP xt_dscp cs5535_mfgpt cs5535_mfd mfd_core ipt_MASQUERADE nf_nat xt_recent xt_helper xt_connmark xt_connbytes pptp l2tp_ppp pppoe xt_conntrack xt_CT iptable_raw xt_state nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack pppox pppoatm ipt_REJECT xt_TCPMSS xt_comment xt_multiport xt_mac xt_limit iptable_mangle iptable_filter ip_tables xt_tcpudp x_tables nsc_gpio ip_gre gre sit l2tp_netlink l2tp_core ppp_mppe tunnel4 tun ppp_async ppp_generic slhc br2684 atm crc_ccitt ipv6 input_polldev msr input_core sha1_generic geode_aes ecb arc4 aes_i586 ohci_hcd ehci_hcd usbcore usb_common
[308309.457239] Pid: 0, comm: swapper Not tainted 3.7.1 #1
[308309.462463] Call Trace:
[308309.465020]  [<c10272e7>] ? warn_slowpath_common+0x87/0xb0
[308309.470691]  [<c11e3d43>] ? dev_watchdog+0x103/0x190
[308309.475755]  [<c10273a3>] ? warn_slowpath_fmt+0x33/0x40
[308309.481159]  [<c11e3d43>] ? dev_watchdog+0x103/0x190
[308309.486244]  [<c11e3c40>] ? pfifo_fast_dequeue+0xd0/0xd0
[308309.491751]  [<c1030f3c>] ? call_timer_fn.isra.42+0x1c/0x80
[308309.497422]  [<c11d1394>] ? process_backlog+0x54/0xe0
[308309.502674]  [<c10310ca>] ? run_timer_softirq+0x12a/0x160
[308309.508169]  [<c11e3c40>] ? pfifo_fast_dequeue+0xd0/0xd0
[308309.513697]  [<c102cfcd>] ? __do_softirq+0x6d/0x110
[308309.518675]  [<c102cf60>] ? __tasklet_schedule+0x40/0x40
[308309.524178]  <IRQ>  [<c102d121>] ? irq_exit+0x31/0x60
[308309.529359]  [<c1003c7d>] ? do_IRQ+0x8d/0xb0
[308309.533723]  [<c1003c7d>] ? do_IRQ+0x8d/0xb0
[308309.538201]  [<c1254fe9>] ? common_interrupt+0x29/0x2e
[308309.543440]  [<c1050000>] ? rt_mutex_adjust_prio_chain+0x180/0x280
[308309.549829]  [<c10085c4>] ? default_idle+0x14/0x30
[308309.554719]  [<c1008e1f>] ? cpu_idle+0x2f/0x50
[308309.559259]  [<c131e878>] ? start_kernel+0x286/0x28b
[308309.564414]  [<c131e440>] ? repair_env_string+0x4d/0x4d
[308309.569729] ---[ end trace 2e18cc211cee6089 ]---
[308309.574551] 8139cp 0000:00:0b.0 eth1: Transmit timeout, status  c   2b    0 80ff

-- 
dwmw2



[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2]  8139cp: reset BQL when ring tx ring cleared
  2015-09-14 12:05     ` David Woodhouse
@ 2015-09-14 21:59       ` Francois Romieu
  2015-09-17 11:36         ` David Woodhouse
  2015-09-17 12:02         ` David Woodhouse
  0 siblings, 2 replies; 15+ messages in thread
From: Francois Romieu @ 2015-09-14 21:59 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Stephen Hemminger, David Miller, netdev

David Woodhouse <dwmw2@infradead.org> :
[...]
> Did you ever work this out ?

Not specifically.

> I'm seeing something similar on the inward -facing interface on my home
> router under high load — and it doesn't automatically recover.
[...]
> [308309.457239] Pid: 0, comm: swapper Not tainted 3.7.1 #1

It's unrelated but you may have already noticed 
7fe0ee099ad5e3dea88d4ee1b6f20246b1ca57c3 ("net: 8139cp: fix a BUG_ON
triggered by wrong bytes_compl").

[...]
> [308309.574551] 8139cp 0000:00:0b.0 eth1: Transmit timeout, status  c   2b    0 80ff

Rx and Tx are enabled.

Instant (untested) hack below.

diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
index d79e33b..09ee6fd 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -129,6 +129,9 @@ MODULE_PARM_DESC (multicast_filter_limit, "8139cp: maximum number of filtered mu
 /* Time in jiffies before concluding the transmitter is hung. */
 #define TX_TIMEOUT		(6*HZ)
 
+/* TODO: calibrate. It ought to be related to the PCI bus frequency. */
+#define CP_EARLY_TIMEOUT	(8 * 1024)
+
 /* hardware minimum and maximum for a single frame's data payload */
 #define CP_MIN_MTU		60	/* TODO: allow lower, but pad */
 #define CP_MAX_MTU		4096
@@ -146,9 +149,11 @@ enum {
 	TxConfig	= 0x40, /* Tx configuration */
 	ChipVersion	= 0x43, /* 8-bit chip version, inside TxConfig */
 	RxConfig	= 0x44, /* Rx configuration */
+	TimerCount	= 0x48, /* 32 bit general purpose timer. */
 	RxMissed	= 0x4C,	/* 24 bits valid, write clears */
 	Cfg9346		= 0x50, /* EEPROM select/control; Cfg reg [un]lock */
 	Config1		= 0x52, /* Config1 */
+	TimerInt	= 0x54, /* TimerCount IRQ triggering timeout value */
 	Config3		= 0x59, /* Config3 */
 	Config4		= 0x5A, /* Config4 */
 	MultiIntr	= 0x5C, /* Multiple interrupt select */
@@ -283,7 +288,8 @@ enum {
 	LANWake         = (1 << 1),  /* Enable LANWake signal */
 	PMEStatus	= (1 << 0),  /* PME status can be reset by PCI RST# */
 
-	cp_norx_intr_mask = PciErr | LinkChg | TxOK | TxErr | TxEmpty,
+	cp_norx_intr_mask = PciErr | TimerIntr | LinkChg |
+			    TxOK | TxErr | TxEmpty,
 	cp_rx_intr_mask = RxOK | RxErr | RxEmpty | RxFIFOOvr,
 	cp_intr_mask = cp_rx_intr_mask | cp_norx_intr_mask,
 };
@@ -608,6 +614,15 @@ static irqreturn_t cp_interrupt (int irq, void *dev_instance)
 
 	if (status & (TxOK | TxErr | TxEmpty | SWInt))
 		cp_tx(cp);
+
+	if ((status & TimerIntr) && (cp->tx_head != cp->tx_tail)) {
+		if (net_ratelimit()) {
+			netdev_info(dev, "Timeout head=%08x, tail=%08x\n",
+				    cp->tx_head, cp->tx_tail);
+		}
+		cp_tx(cp);
+	}
+
 	if (status & LinkChg)
 		mii_check_media(&cp->mii_if, netif_msg_link(cp), false);
 
@@ -885,6 +900,8 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb,
 out_unlock:
 	spin_unlock_irqrestore(&cp->lock, intr_flags);
 
+	cpw32(TimerCount, CP_EARLY_TIMEOUT);
+
 	cpw8(TxPoll, NormalTxPoll);
 
 	return NETDEV_TX_OK;
@@ -1064,6 +1081,8 @@ static void cp_init_hw (struct cp_private *cp)
 
 	cpw16(MultiIntr, 0);
 
+	cpw32(TimerInt, CP_EARLY_TIMEOUT);
+
 	cpw8_f(Cfg9346, Cfg9346_Lock);
 }
 

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2]  8139cp: reset BQL when ring tx ring cleared
  2015-09-14 21:59       ` Francois Romieu
@ 2015-09-17 11:36         ` David Woodhouse
  2015-09-17 15:06           ` David Woodhouse
  2015-09-17 12:02         ` David Woodhouse
  1 sibling, 1 reply; 15+ messages in thread
From: David Woodhouse @ 2015-09-17 11:36 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Stephen Hemminger, David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 8925 bytes --]

On Mon, 2015-09-14 at 23:59 +0200, Francois Romieu wrote:
> 
> [...]
> > [308309.574551] 8139cp 0000:00:0b.0 eth1: Transmit timeout, status 
> c   2b    0 80ff
> 
> Rx and Tx are enabled.
> 
> Instant (untested) hack below.

Thanks; I'll try that. In fact since updating to 4.2 the problem has
got worse — now the whole machine dies:

[  232.064630] ------------[ cut here ]------------                             
[  232.069282] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x1e5/0x200()                                                                 
[  232.077840] NETDEV WATCHDOG: eth1 (8139cp): transmit queue 0 timed out       
[  232.084380] Modules linked in: sch_teql 8139cp mii iptable_nat pppoe nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_CT solos_pci pppox ppp_async nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat_ftp nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack iptable_raw iptable_mangle iptable_filter ip_tables crc_ccitt act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress ledtrig_heartbeat ledtrig_gpio ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables pppoatm ppp_generic slhc br2684 atm geode_aes cbc arc4 aes_i586                                                               
[  232.157787] CPU: 0 PID: 0 Comm: swapper Not tainted 4.2.0-gx+ #25            
[  232.163982]  00000000 c10313eb dee95000 00000000 ffffff32 00000258 c1031446 00000009                                                                         
[  232.171988]  dec0bf74 c13c3afc dec0bf8c c1272ef5 c13bfe82 0000012f c13c3afc dee95000                                                                         
[  232.179978]  e04cfd3c 00000000 dee95000 dee95240 00000258 80000100 c1272d10 dee95000                                                                         
[  232.188012] Call Trace:                                                      
[  232.190482]  [<c10313eb>] ? warn_slowpath_common+0x5b/0x90                   
[  232.196063]  [<c1031446>] ? warn_slowpath_fmt+0x26/0x30                      
[  232.201307]  [<c1272ef5>] ? dev_watchdog+0x1e5/0x200                         
[  232.206317]  [<c1272d10>] ? qdisc_rcu_free+0x30/0x30                         
[  232.211307]  [<c105b19e>] ? call_timer_fn.isra.7+0xe/0x60                    
[  232.216811]  [<c1272d10>] ? qdisc_rcu_free+0x30/0x30                         
[  232.221794]  [<c105b33d>] ? run_timer_softirq+0xfd/0x1b0                     
[  232.227221]  [<c1033777>] ? __do_softirq+0xa7/0x190                          
[  232.232117]  [<c10336d0>] ? __hrtimer_tasklet_trampoline+0x20/0x20           
[  232.238395]  [<c100336b>] ? do_softirq_own_stack+0x1b/0x20                   
[  232.243881]  <IRQ>  [<c1003165>] ? do_IRQ+0x35/0xa0                          
[  232.248904]  [<c131e2a9>] ? common_interrupt+0x29/0x30                       
[  232.254141]  [<c104007b>] ? put_unbound_pool+0x17b/0x1a0                     
[  232.259470]  [<c1008bf2>] ? default_idle+0x2/0x10                            
[  232.264213]  [<c1009106>] ? arch_cpu_idle+0x6/0x10                           
[  232.269026]  [<c104eee5>] ? cpu_startup_entry+0xf5/0x190                     
[  232.274459]  [<c142a9b9>] ? start_kernel+0x2e5/0x2e8                         
[  232.279432] ---[ end trace 30ae4e701c36b431 ]---                             
[  232.284167] 8139cp 0000:00:0b.0 eth1: Transmit timeout, status  c   2b    1 80ac                                                                             
[  260.106382] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper:0]
[  260.113515] Modules linked in: sch_teql 8139cp mii iptable_nat pppoe nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_CT solos_pci pppox ppp_async nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat_ftp nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack iptable_raw iptable_mangle iptable_filter ip_tables crc_ccitt act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress ledtrig_heartbeat ledtrig_gpio ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables pppoatm ppp_generic slhc br2684 atm geode_aes cbc arc4 aes_i586                                                               
[  260.116369] CPU: 0 PID: 0 Comm: swapper Tainted: G        W       4.2.0-gx+ #25                                                                              
[  260.116369] task: c13f7540 ti: c13f0000 task.ti: c13f0000                    
[  260.116369] EIP: 0060:[<c131d6fa>] EFLAGS: 00200292 CPU: 0                   
[  260.116369] EIP is at _raw_spin_unlock_irqrestore+0xa/0x10                   
[  260.116369] EAX: 00000000 EBX: 00200292 ECX: e04d8100 EDX: 00200292          
[  260.116369] ESI: 00000000 EDI: ffffff32 EBP: 00000258 ESP: dec0bf90          
[  260.116369]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068                    
[  260.116369] CR0: 8005003b CR2: 098c4c50 CR3: 1ef88000 CR4: 00000090          
[  260.116369] Stack:                                                           
[  260.116369]  dee95000 c1272ecb dee95000 dee95240 00000258 80000100 c1272d10 dee95000                                                                         
[  260.116369]  00000100 c105b19e 00000410 c1272d10 c105b33d 00000000 c146d524 00000001                                                                         
[  260.116369]  00000002 c1033777 00000002 ffffe577 c13f2000 0000000a 00200000 c13f1f70                                                                         
[  260.116369] Call Trace:                                                      
[  260.116369]  [<c1272ecb>] ? dev_watchdog+0x1bb/0x200                         
[  260.116369]  [<c1272d10>] ? qdisc_rcu_free+0x30/0x30                         
[  260.116369]  [<c105b19e>] ? call_timer_fn.isra.7+0xe/0x60                    
[  260.116369]  [<c1272d10>] ? qdisc_rcu_free+0x30/0x30                         
[  260.116369]  [<c105b33d>] ? run_timer_softirq+0xfd/0x1b0                     
[  260.116369]  [<c1033777>] ? __do_softirq+0xa7/0x190                          
[  260.116369]  [<c10336d0>] ? __hrtimer_tasklet_trampoline+0x20/0x20           
[  260.116369]  [<c100336b>] ? do_softirq_own_stack+0x1b/0x20                   
[  260.116369]  <IRQ>                                                           
[  260.116369]  [<c1003165>] ? do_IRQ+0x35/0xa0                                 
[  260.116369]  [<c131e2a9>] ? common_interrupt+0x29/0x30                       
[  260.116369]  [<c104007b>] ? put_unbound_pool+0x17b/0x1a0                     
[  260.116369]  [<c1008bf2>] ? default_idle+0x2/0x10                            
[  260.116369]  [<c1009106>] ? arch_cpu_idle+0x6/0x10                           
[  260.116369]  [<c104eee5>] ? cpu_startup_entry+0xf5/0x190                     
[  260.116369]  [<c142a9b9>] ? start_kernel+0x2e5/0x2e8                         
[  260.116369] Code: 00 b8 01 00 00 00 c3 8d 76 00 8d bc 27 00 00 00 00 e8 db 2e d3 ff c3 8d 76 00 8d bc 27 00 00 00 00 53 89 d3 e8 c8 2e d3 ff 53 9d <5b> c3 8d 74 26 00 e8 bb 2e d3 ff fb c3 89 f6 8d bc 27 00 00 00                          

At this point even sysrq on the serial console isn't working; I'm
going to have to go and physically reset it.

(gdb) list *dev_watchdog+0x1bb
0xc1272ecb is in dev_watchdog (net/sched/sch_generic.c:304).
299				}
300	
301				if (some_queue_timedout) {
302					WARN_ONCE(1, KERN_INFO
"NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
303					       dev->name,
netdev_drivername(dev), i);
304					dev->netdev_ops
->ndo_tx_timeout(dev);
305				}
306				if (!mod_timer(&dev->watchdog_timer,
307					       round_jiffies(jiffies
+
308							     dev
->watchdog_timeo)))

So that _raw_spin_unlock_irqrestore() is going to be a tailcall from
the end of cp_tx_timeout(). Moderately confused that it dies on *unlock*.

_raw_spin_lock_irqrestore+0xa is the instruction after a 'popf', which
makes me wonder if it dies in an IRQ storm. Although then *some* of the
NMI watchdog invocations would surely show an IRQ on the stack, but
they
don't; they're all right on the same instruction.


-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2]  8139cp: reset BQL when ring tx ring cleared
  2015-09-14 21:59       ` Francois Romieu
  2015-09-17 11:36         ` David Woodhouse
@ 2015-09-17 12:02         ` David Woodhouse
  1 sibling, 0 replies; 15+ messages in thread
From: David Woodhouse @ 2015-09-17 12:02 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Stephen Hemminger, David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 10600 bytes --]

On Mon, 2015-09-14 at 23:59 +0200, Francois Romieu wrote:
> Instant (untested) hack below.

That seems to trigger a lot, but ultimately doesn't help...

[  250.998980] 8139cp 0000:00:0b.0 eth1: Timeout head=0000000b, tail=0000000a   
[  252.637287] net_ratelimit: 5 callbacks suppressed                            
[  252.642022] 8139cp 0000:00:0b.0 eth1: Timeout head=0000003f, tail=0000003e   
[  252.973255] 8139cp 0000:00:0b.0 eth1: Timeout head=00000028, tail=00000027   
[  253.911945] 8139cp 0000:00:0b.0 eth1: Timeout head=00000010, tail=0000000f   
[  254.151013] 8139cp 0000:00:0b.0 eth1: Timeout head=0000000e, tail=0000000d   
[  255.551730] 8139cp 0000:00:0b.0 eth1: Timeout head=00000025, tail=00000024   
[  255.568070] 8139cp 0000:00:0b.0 eth1: Timeout head=00000027, tail=00000024   
[  255.575717] 8139cp 0000:00:0b.0 eth1: Timeout head=0000002a, tail=00000024   
[  255.583035] 8139cp 0000:00:0b.0 eth1: Timeout head=0000002b, tail=00000024   
[  255.590361] 8139cp 0000:00:0b.0 eth1: Timeout head=0000002c, tail=00000024   
[  255.598080] 8139cp 0000:00:0b.0 eth1: Timeout head=0000002e, tail=00000024   
[  267.066384] ------------[ cut here ]------------                             
[  267.071053] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x1e5/0x200()                                                                 
[  267.079526] NETDEV WATCHDOG: eth1 (8139cp): transmit queue 0 timed out       
[  267.086051] Modules linked in: 8139cp sch_teql mii iptable_nat pppoe nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_CT solos_pci pppox ppp_async nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat_ftp nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack iptable_raw iptable_mangle iptable_filter ip_tables crc_ccitt act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress ledtrig_heartbeat ledtrig_gpio ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables pppoatm ppp_generic slhc br2684 atm geode_aes cbc arc4 aes_i586 [last unloaded: 8139cp]                                       
[  267.161698] CPU: 0 PID: 0 Comm: swapper Not tainted 4.2.0-gx+ #26            
[  267.167800]  00000000 c10313eb ddc53000 00000000 fffffde1 00000258 c1031446 00000009                                                                         
[  267.171408]  dec0bf74 c13c3afc dec0bf8c c1272ef5 c13bfe82 0000012f c13c3afc ddc53000                                                                         
[  267.183847]  e06f9dec 00000000 ddc53000 ddc53240 00000258 80000100 c1272d10 ddc53000                                                                         
[  267.191812] Call Trace:                                                      
[  267.194376]  [<c10313eb>] ? warn_slowpath_common+0x5b/0x90                   
[  267.199874]  [<c1031446>] ? warn_slowpath_fmt+0x26/0x30                      
[  267.205200]  [<c1272ef5>] ? dev_watchdog+0x1e5/0x200                         
[  267.210179]  [<c1272d10>] ? qdisc_rcu_free+0x30/0x30                         
[  267.215250]  [<c105b19e>] ? call_timer_fn.isra.7+0xe/0x60                    
[  267.220661]  [<c1272d10>] ? qdisc_rcu_free+0x30/0x30                         
[  267.225739]  [<c105b33d>] ? run_timer_softirq+0xfd/0x1b0                     
[  267.231071]  [<c1033777>] ? __do_softirq+0xa7/0x190                          
[  267.236054]  [<c10336d0>] ? __hrtimer_tasklet_trampoline+0x20/0x20           
[  267.242274]  [<c100336b>] ? do_softirq_own_stack+0x1b/0x20                   
[  267.247768]  <IRQ>  [<c1003165>] ? do_IRQ+0x35/0xa0                          
[  267.252248]  [<c131e2a9>] ? common_interrupt+0x29/0x30                       
[  267.258062]  [<c104007b>] ? put_unbound_pool+0x17b/0x1a0                     
[  267.263391]  [<c1008bf2>] ? default_idle+0x2/0x10                            
[  267.268186]  [<c1009106>] ? arch_cpu_idle+0x6/0x10                           
[  267.272999]  [<c104eee5>] ? cpu_startup_entry+0xf5/0x190                     
[  267.278410]  [<c142a9b9>] ? start_kernel+0x2e5/0x2e8                         
[  267.283378] ---[ end trace a08600e9030733fc ]---                             
[  267.288100] cp_tx_timeout                                                    
[  267.290750] 8139cp 0000:00:0b.0 eth1: Transmit timeout, status  c   2b    1 c0ac                                                                             
[  267.298166] will lock...                                                     
[  267.300709] Handling tx timeout, flags 200286                                
[  267.305281] Will wake queue...                                               
[  267.308153] Will unlock... flags 200286                                      
[  292.120424] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper:0]
[  292.127561] Modules linked in: 8139cp sch_teql mii iptable_nat pppoe nf_nat_i...
[  292.130412] CPU: 0 PID: 0 Comm: swapper Tainted: G        W       4.2.0-gx+ #26                                                                              
[  292.130412] task: c13f7540 ti: c13f0000 task.ti: c13f0000                    
[  292.130412] EIP: 0060:[<c131d6fa>] EFLAGS: 00200286 CPU: 0                   
[  292.130412] EIP is at _raw_spin_unlock_irqrestore+0xa/0x10                   
[  292.130412] EAX: 00000000 EBX: 00200286 ECX: c1409ee4 EDX: 00200286          
[  292.130412] ESI: 00200286 EDI: ddc53428 EBP: ddc53420 ESP: dec0bf68          
[  292.130412]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068                    
[  292.130412] CR0: 8005003b CR2: 0805244c CR3: 1edf6000 CR4: 00000090          
[  292.130412] Stack:                                                           
[  292.130412]  ddc53000 e06f8055 e06f9e2e 00200286 e06f9e18 e06f9f0c 00200286 ddc53000                                                                         
[  292.130412]  00000000 fffffde1 00000258 c1272ecb ddc53000 ddc53240 00000258 80000100                                                                         
[  292.130412]  c1272d10 ddc53000 00000100 c105b19e 00000442 c1272d10 c105b33d ddc7c444                                                                         
[  292.130412] Call Trace:                                                      
[  292.130412]  [<e06f8055>] ? cp_tx_timeout+0x1a5/0x1c0 [8139cp]               
[  292.130412]  [<c1272ecb>] ? dev_watchdog+0x1bb/0x200                         
[  292.130412]  [<c1272d10>] ? qdisc_rcu_free+0x30/0x30                         
[  292.130412]  [<c105b19e>] ? call_timer_fn.isra.7+0xe/0x60                    
[  292.130412]  [<c1272d10>] ? qdisc_rcu_free+0x30/0x30                         
[  292.130412]  [<c105b33d>] ? run_timer_softirq+0xfd/0x1b0                     
[  292.130412]  [<c1033777>] ? __do_softirq+0xa7/0x190                          
[  292.130412]  [<c10336d0>] ? __hrtimer_tasklet_trampoline+0x20/0x20           
[  292.130412]  [<c100336b>] ? do_softirq_own_stack+0x1b/0x20                   
[  292.130412]  <IRQ>                                                           
[  292.130412]  [<c1003165>] ? do_IRQ+0x35/0xa0                                 
[  292.130412]  [<c131e2a9>] ? common_interrupt+0x29/0x30                       
[  292.130412]  [<c104007b>] ? put_unbound_pool+0x17b/0x1a0                     
[  292.130412]  [<c1008bf2>] ? default_idle+0x2/0x10                            
[  292.130412]  [<c1009106>] ? arch_cpu_idle+0x6/0x10                           
[  292.130412]  [<c104eee5>] ? cpu_startup_entry+0xf5/0x190                     
[  292.130412]  [<c142a9b9>] ? start_kernel+0x2e5/0x2e8                         
[  292.130412] Code: 00 b8 01 00 00 00 c3 8d 76 00 8d bc 27 00 00 00 00 e8 db 2e d3 ff c3 8d 76 00 8d bc 27 00 00 00 00 53 89 d3 e8 c8 2e d3 ff 53 9d <5b> c3 8d 74 26 00 e8 bb 2e d3 ff fb c3 89 f6 8d bc 27 00 00 00                          
[  292.130412] Kernel panic - not syncing: softlockup: hung tasks               
[  292.130412] CPU: 0 PID: 0 Comm: swapper Tainted: G        W    L  4.2.0-gx+ #26                                                                              
[  292.130412]  00000000 c1319a5f c13f7540 00000000 00000000 000000f9 c1071439 c1399de4                                                                         
[  292.130412]  dec0bf2c c140b660 c10712f0 00000001 c140aac0 c105bf8e 00000000 03bbae55                                                                         
[  292.130412]  00000044 c140aacc 03bbae55 00000044 00000003 c141b860 c141b801 ffffffff                                                                         
[  292.130412] Call Trace:                                                      
[  292.130412]  [<c1319a5f>] ? panic+0x76/0x161                                 
[  292.130412]  [<c1071439>] ? watchdog_timer_fn+0x149/0x150                    
[  292.130412]  [<c10712f0>] ? watchdog_cleanup+0x10/0x10                       
[  292.130412]  [<c105bf8e>] ? __hrtimer_run_queues.constprop.7+0xae/0x180      
[  292.130412]  [<c105c187>] ? hrtimer_interrupt+0x87/0x1d0                     
[  292.130412]  [<c106817f>] ? tick_handle_oneshot_broadcast+0xcf/0x130         
[  292.130412]  [<c100388a>] ? timer_interrupt+0xa/0x10                         
[  292.130412]  [<c1054d6f>] ? handle_irq_event_percpu+0x4f/0xf0                
[  292.130412]  [<c1054e3d>] ? handle_irq_event+0x2d/0x50                       
[  292.130412]  [<c1056c39>] ? handle_level_irq+0x69/0xf0                       
[  292.130412]  [<c1056bd0>] ? handle_simple_irq+0x80/0x80                      
[  292.130412]  [<c10033b3>] ? handle_irq+0x43/0x70                             
[  292.130412]  <IRQ>  [<c100315c>] ? do_IRQ+0x2c/0xa0                          
[  292.130412]  [<c131e2a9>] ? common_interrupt+0x29/0x30                       
[  292.130412]  [<c105007b>] ? remove_waiter+0x11b/0x120                        
[  292.130412]  [<c131d6fa>] ? _raw_spin_unlock_irqrestore+0xa/0x10             
[  292.130412]  [<e06f8055>] ? cp_tx_timeout+0x1a5/0x1c0 [8139cp]               

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2]  8139cp: reset BQL when ring tx ring cleared
  2015-09-17 11:36         ` David Woodhouse
@ 2015-09-17 15:06           ` David Woodhouse
  2015-09-17 20:44             ` Francois Romieu
  0 siblings, 1 reply; 15+ messages in thread
From: David Woodhouse @ 2015-09-17 15:06 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Stephen Hemminger, David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 4079 bytes --]

On Thu, 2015-09-17 at 12:36 +0100, David Woodhouse wrote:
> 
> Thanks; I'll try that. In fact since updating to 4.2 the problem has
> got worse — now the whole machine dies:

There is something very strange going on here. I've found two ways to
make it stop crashing when cp_tx_timeout() hits the 'popf' when
unlocking the spinlock.

The first is to comment out the whole of cp_tx_timeout() and let it
happen once. Then put that code *back* again and reload the module.
Then it can work fine.

The second way is to comment out the WARN_ONCE in dev_watchdog().
I remain utterly bemused; I have no idea what's going on there.

But that aside, even when it survives running cp_tx_timeout(), it still
doesn't *work* — it looks like TX is indeed working and has recovered,
but we are not *receiving* any packets.

I can't actually trigger the TX timeout at all with debugging enabled;
I've hacked things so that cp_set_wol() will also call cp_tx_timeout()
and simulate it. And now I see this...

[ 4358.499474] 8139cp 0000:00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b
[ 4358.499488] 8139cp 0000:00:0b.0 eth1: tx done, slot 35
[ 4358.513663] 8139cp 0000:00:0b.0 eth1: tx queued, slot 37, skblen 54
[ 4358.513692] 8139cp 0000:00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b
[ 4358.513705] 8139cp 0000:00:0b.0 eth1: tx done, slot 36
[ 4358.518880] 8139cp 0000:00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b
[ 4358.518900] 8139cp 0000:00:0b.0 eth1: rx slot 1 status 0x32014040 len 60
[ 4358.523601] 8139cp 0000:00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b
[ 4358.526910] 8139cp 0000:00:0b.0 eth1: rx slot 2 status 0x32036052 len 78
[ 4358.547898] 8139cp 0000:00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b
[ 4358.547996] 8139cp 0000:00:0b.0 eth1: rx slot 3 status 0x32036052 len 78
[ 4358.580526] 8139cp 0000:00:0b.0 eth1: tx queued, slot 38, skblen 70
[ 4358.580555] 8139cp 0000:00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b
[ 4358.580569] 8139cp 0000:00:0b.0 eth1: tx done, slot 37
[ 4358.601912] 8139cp 0000:00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b
[ 4358.601932] 8139cp 0000:00:0b.0 eth1: rx slot 4 status 0x32036052 len 78
[ 4358.650678] 8139cp 0000:00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b
[ 4358.650698] 8139cp 0000:00:0b.0 eth1: rx slot 5 status 0x320145a5 len 1441
[ 4358.665572] will lock...
[ 4358.668222] Handling tx timeout, flags 282
[ 4358.672494] nway_reset
[ 4358.674858] Will wake queue...
[ 4358.677919] Will unlock... flags 282
[ 4358.681525] did unlock...
[ 4358.684198] 8139cp 0000:00:0b.0 eth1: Transmit timeout handled, status  c   2b    0 80ff
[ 4358.708234] 8139cp 0000:00:0b.0 eth1: tx queued, slot 1, skblen 92
[ 4358.714567] 8139cp 0000:00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b
[ 4358.722405] 8139cp 0000:00:0b.0 eth1: tx done, slot 0
[ 4358.747412] 8139cp 0000:00:0b.0 eth1: tx queued, slot 2, skblen 106
[ 4358.753736] 8139cp 0000:00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b
[ 4358.756824] 8139cp 0000:00:0b.0 eth1: tx done, slot 1
[ 4358.814961] 8139cp 0000:00:0b.0 eth1: tx queued, slot 3, skblen 173
[ 4358.821291] 8139cp 0000:00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b
[ 4358.824186] 8139cp 0000:00:0b.0 eth1: tx done, slot 2
[ 4358.834352] 8139cp 0000:00:0b.0 eth1: tx queued, slot 4, skblen 86
[ 4358.840579] 8139cp 0000:00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b
[ 4358.844216] 8139cp 0000:00:0b.0 eth1: tx done, slot 3
[ 4358.853615] 8139cp 0000:00:0b.0 eth1: tx queued, slot 5, skblen 54
[ 4358.859822] 8139cp 0000:00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b
[ 4358.863497] 8139cp 0000:00:0b.0 eth1: tx done, slot 4
[ 4358.873111] 8139cp 0000:00:0b.0 eth1: tx queued, slot 6, skblen 66

-- 
-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2]  8139cp: reset BQL when ring tx ring cleared
  2015-09-17 15:06           ` David Woodhouse
@ 2015-09-17 20:44             ` Francois Romieu
  2015-09-17 22:14               ` David Woodhouse
  0 siblings, 1 reply; 15+ messages in thread
From: Francois Romieu @ 2015-09-17 20:44 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Stephen Hemminger, David Miller, netdev

David Woodhouse <dwmw2@infradead.org> :
> On Thu, 2015-09-17 at 12:36 +0100, David Woodhouse wrote:
> > 
> > Thanks; I'll try that. In fact since updating to 4.2 the problem has
> > got worse — now the whole machine dies:
> 
> There is something very strange going on here. I've found two ways to
> make it stop crashing when cp_tx_timeout() hits the 'popf' when
> unlocking the spinlock.

cp_tx_timeout takes lock, disables irq, calls cp_clean_rings, thus
plain dev_kfree_skb if a skb is still referenced in one of the
rx/tx ring. You may replace it with dev_kfree_skb_any.

-- 
Ueimor

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2]  8139cp: reset BQL when ring tx ring cleared
  2015-09-17 20:44             ` Francois Romieu
@ 2015-09-17 22:14               ` David Woodhouse
  2015-09-17 23:44                 ` Francois Romieu
  0 siblings, 1 reply; 15+ messages in thread
From: David Woodhouse @ 2015-09-17 22:14 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Stephen Hemminger, David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 2139 bytes --]

On Thu, 2015-09-17 at 22:44 +0200, Francois Romieu wrote:
> David Woodhouse <dwmw2@infradead.org> :
> > On Thu, 2015-09-17 at 12:36 +0100, David Woodhouse wrote:
> > > 
> > > Thanks; I'll try that. In fact since updating to 4.2 the problem has
> > > got worse — now the whole machine dies:
> > 
> > There is something very strange going on here. I've found two ways to
> > make it stop crashing when cp_tx_timeout() hits the 'popf' when
> > unlocking the spinlock.
> 
> cp_tx_timeout takes lock, disables irq, calls cp_clean_rings, thus
> plain dev_kfree_skb if a skb is still referenced in one of the
> rx/tx ring. You may replace it with dev_kfree_skb_any.

Well spotted; I've made that change locally. Although I don't think it
explains the symptoms. Not that I'm sure what *could*.

I've also found that adding a call to __cp_set_rx_mode() seems to fix
the RX after reset, in some tests. Especially the simulated one via the
hack in cp_set_wol(). I think that's necessary, if not sufficient — at
least on real hardware. I didn't see the problem at all when running in
qemu.

Sometimes, though, it still dies in an interrupt storm after re
-enabling IRQs:

[  900.004214] 8139cp 0000:00:0b.0 eth1: Transmit timeout, status  c   2b    0 80ff
[  900.011725] will lock...
[  900.014273] Handling tx timeout, flags 200296
[  900.018774] Will wake queue...
[  900.021645] Will unlock... flags 200296
[  900.021645] 8139cp 0000:00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b
[  900.021645] 8139cp 0000:00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b
... 
[  901.628439] 8139cp 0000:00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b
[  901.636291] 8139cp 0000:00:0b.0 eth1: intr, status 0011 enable 80ff cmd 0c cpcmd 002b
...
[  901.966243] 8139cp 0000:00:0b.0 eth1: intr, status 0011 enable 80ff cmd 0c cpcmd 002b
[  901.968353] 8139cp 0000:00:0b.0 eth1: intr, status 0051 enable 80ff cmd 0c cpcmd 002b
... forever...

And of course, even if I fix the TX timeout handling, I'd still like to
know why it's happening in the first place...

-- 
dwmw2


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2]  8139cp: reset BQL when ring tx ring cleared
  2015-09-17 22:14               ` David Woodhouse
@ 2015-09-17 23:44                 ` Francois Romieu
  2015-09-18  1:04                   ` David Woodhouse
  2015-09-18 22:42                   ` David Woodhouse
  0 siblings, 2 replies; 15+ messages in thread
From: Francois Romieu @ 2015-09-17 23:44 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Stephen Hemminger, David Miller, netdev

David Woodhouse <dwmw2@infradead.org> :
[...]
> And of course, even if I fix the TX timeout handling, I'd still like to
> know why it's happening in the first place...

So do I.

The TxDmaOkLowDesc register may tell if the Tx dma part is still making
any progress. I have added a TxPoll request. See below.

diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
index d79e33b..963d2a2 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -129,6 +129,9 @@ MODULE_PARM_DESC (multicast_filter_limit, "8139cp: maximum number of filtered mu
 /* Time in jiffies before concluding the transmitter is hung. */
 #define TX_TIMEOUT		(6*HZ)
 
+/* TODO: calibrate. It ought to be related to the PCI bus frequency. */
+#define CP_EARLY_TIMEOUT	(16 * 1024)
+
 /* hardware minimum and maximum for a single frame's data payload */
 #define CP_MIN_MTU		60	/* TODO: allow lower, but pad */
 #define CP_MAX_MTU		4096
@@ -146,9 +149,11 @@ enum {
 	TxConfig	= 0x40, /* Tx configuration */
 	ChipVersion	= 0x43, /* 8-bit chip version, inside TxConfig */
 	RxConfig	= 0x44, /* Rx configuration */
+	TimerCount	= 0x48, /* 32 bit general purpose timer. */
 	RxMissed	= 0x4C,	/* 24 bits valid, write clears */
 	Cfg9346		= 0x50, /* EEPROM select/control; Cfg reg [un]lock */
 	Config1		= 0x52, /* Config1 */
+	TimerInt	= 0x54, /* TimerCount IRQ triggering timeout value */
 	Config3		= 0x59, /* Config3 */
 	Config4		= 0x5A, /* Config4 */
 	MultiIntr	= 0x5C, /* Multiple interrupt select */
@@ -157,6 +162,7 @@ enum {
 	NWayAdvert	= 0x66, /* MII ADVERTISE */
 	NWayLPAR	= 0x68, /* MII LPA */
 	NWayExpansion	= 0x6A, /* MII Expansion */
+	TxDmaOkLowDesc	= 0x82,	/* Low 16 bit address of a Tx descriptor. */
 	Config5		= 0xD8,	/* Config5 */
 	TxPoll		= 0xD9,	/* Tell chip to check Tx descriptors for work */
 	RxMaxSize	= 0xDA, /* Max size of an Rx packet (8169 only) */
@@ -283,7 +289,8 @@ enum {
 	LANWake         = (1 << 1),  /* Enable LANWake signal */
 	PMEStatus	= (1 << 0),  /* PME status can be reset by PCI RST# */
 
-	cp_norx_intr_mask = PciErr | LinkChg | TxOK | TxErr | TxEmpty,
+	cp_norx_intr_mask = PciErr | TimerIntr | LinkChg |
+			    TxOK | TxErr | TxEmpty,
 	cp_rx_intr_mask = RxOK | RxErr | RxEmpty | RxFIFOOvr,
 	cp_intr_mask = cp_rx_intr_mask | cp_norx_intr_mask,
 };
@@ -608,6 +615,18 @@ static irqreturn_t cp_interrupt (int irq, void *dev_instance)
 
 	if (status & (TxOK | TxErr | TxEmpty | SWInt))
 		cp_tx(cp);
+
+	if ((status & TimerIntr) && (cp->tx_head != cp->tx_tail)) {
+		if (net_ratelimit()) {
+			netdev_info(dev, "Timeout head=%08x, tail=%08x desc=%04x\n",
+				    cp->tx_head, cp->tx_tail,
+				    cpr16(TxDmaOkLowDesc));
+		}
+		cp_tx(cp);
+		if (cp->tx_head != cp->tx_tail)
+			cpw8_f(TxPoll, NormalTxPoll);
+	}
+
 	if (status & LinkChg)
 		mii_check_media(&cp->mii_if, netif_msg_link(cp), false);
 
@@ -885,6 +904,8 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb,
 out_unlock:
 	spin_unlock_irqrestore(&cp->lock, intr_flags);
 
+	cpw32(TimerCount, CP_EARLY_TIMEOUT);
+
 	cpw8(TxPoll, NormalTxPoll);
 
 	return NETDEV_TX_OK;
@@ -1064,6 +1085,8 @@ static void cp_init_hw (struct cp_private *cp)
 
 	cpw16(MultiIntr, 0);
 
+	cpw32(TimerInt, CP_EARLY_TIMEOUT);
+
 	cpw8_f(Cfg9346, Cfg9346_Lock);
 }
 

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2]  8139cp: reset BQL when ring tx ring cleared
  2015-09-17 23:44                 ` Francois Romieu
@ 2015-09-18  1:04                   ` David Woodhouse
  2015-09-18  7:40                     ` David Woodhouse
  2015-09-18 22:42                   ` David Woodhouse
  1 sibling, 1 reply; 15+ messages in thread
From: David Woodhouse @ 2015-09-18  1:04 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Stephen Hemminger, David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 16836 bytes --]

On Fri, 2015-09-18 at 01:44 +0200, Francois Romieu wrote:
> The TxDmaOkLowDesc register may tell if the Tx dma part is still 
> making any progress. I have added a TxPoll request. See below.

I've just added that into the original TX timeout handler, since that
doesn't seem to be crashing the box for me as long as I avoid the IRQ
storm.

Not sure what we learn from it ('desc 6550' printed as hex)... I've
also made it dump the TX descriptor ring (skb, addr, opts1, opts2):

[ 1733.027156] 8139cp 0000:00:0b.0 eth1: Transmit timeout, status  c   2b head 25 tail 22 desc 6550    0 80ff
[ 1733.036819] TX ring 00:   (null) 1dd68d8c 3000cb67 0                         
[ 1733.037040] TX ring 01:   (null) 1dd8774c 3000cb67 0                         
[ 1733.037040] TX ring 02:   (null) 1dd86e7c 30000a08 0                         
[ 1733.037040] TX ring 03:   (null) 1dd865ac 30000a08 0                         
[ 1733.037040] TX ring 04:   (null) 1dd8540c 30000a08 0                         
[ 1733.037040] TX ring 05:   (null) 1dd85cdc 30000a08 0                         
[ 1733.037040] TX ring 06:   (null) 1dd84b3c 30000a08 0                         
[ 1733.037040] TX ring 07:   (null) 1dd8399c 30000a08 0                         
[ 1733.037040] TX ring 08:   (null) 1dd8426c 30000a08 0                         
[ 1733.037040] TX ring 09:   (null) 1dd830cc 30000a08 0                         
[ 1733.037040] TX ring 10:   (null) 1dd81f2c 30000a08 0                         
[ 1733.037040] TX ring 11:   (null) 1dd8165c 30000a08 0                         
[ 1733.037040] TX ring 12:   (null) 1dd827fc 30000a08 0                         
[ 1733.037040] TX ring 13:   (null) 1dd80d8c 30000a08 0                         
[ 1733.037040] TX ring 14:   (null) 1dd804bc 30000a08 0                         
[ 1733.037040] TX ring 15:   (null) 1dd6ee7c 30000a08 0                         
[ 1733.037040] TX ring 16:   (null) 1dd6f74c 30000a08 0                         
[ 1733.037040] TX ring 17:   (null) 1dd6e5ac 30000a08 0                         
[ 1733.037040] TX ring 18:   (null) 1dd6dcdc 30000a08 0                         
[ 1733.037040] TX ring 19:   (null) 1dd6cb3c 3000f804 0                         
[ 1733.037040] TX ring 20:   (null) 1dd8de02 300083da 0                         
[ 1733.037040] TX ring 21:   (null) 1dd8da02 30000000 0                         
[ 1733.037040] TX ring 22: defcc240 1dd6d40c b00005ea 0                         
[ 1733.037040] TX ring 23: decdb900 1dd6b99c b0000046 0                         
[ 1733.037040] TX ring 24: ddd27c00 1dd6b0cc b00005ea 0                         
[ 1733.037040] TX ring 25:   (null) 1dd6e5ac 3000ca24 0                         
[ 1733.037040] TX ring 26:   (null) 1dd6dcdc 3000ca24 0                         
[ 1733.037040] TX ring 27:   (null) 1dd6d40c 3000ca24 0                         
[ 1733.037040] TX ring 28:   (null) 1dd6cb3c 3000ca24 0                         
[ 1733.037040] TX ring 29:   (null) 1dd6c26c 3000ca24 0                         
[ 1733.037040] TX ring 30:   (null) 1dd6b99c 3000ca24 0                         
[ 1733.037040] TX ring 31:   (null) 1dd6b0cc 3000ca24 0                         
[ 1733.037040] TX ring 32:   (null) 1dd6a7fc 3000ca24 0                         
[ 1733.037040] TX ring 33:   (null) 1dd69f2c 3000ca24 0                         
[ 1733.037040] TX ring 34:   (null) 1dd68d8c 3000ca24 0                         
[ 1733.037040] TX ring 35:   (null) 1dd6965c 3000cb67 0                         
[ 1733.037040] TX ring 36:   (null) 1dd684bc 3000cb67 0                         
[ 1733.037040] TX ring 37:   (null) 1dd86e7c 3000cb67 0                         
[ 1733.037040] TX ring 38:   (null) 1dd8774c 3000cb67 0                         
[ 1733.037040] TX ring 39:   (null) 1dd865ac 3000cb67 0                         
[ 1733.037040] TX ring 40:   (null) 1dd85cdc 3000cb67 0                         
[ 1733.037040] TX ring 41:   (null) 1dd8540c 3000cb67 0                         
[ 1733.037040] TX ring 42:   (null) 1dd84b3c 3000cb67 0                         
[ 1733.037040] TX ring 43:   (null) 1dd8426c 3000cb67 0                         
[ 1733.037040] TX ring 44:   (null) 1dd8399c 3000cb67 0                         
[ 1733.037040] TX ring 45:   (null) 1dd830cc 3000cb67 0                         
[ 1733.037040] TX ring 46:   (null) 1dd81f2c 3000cb67 0                         
[ 1733.037040] TX ring 47:   (null) 1dd827fc 3000cb67 0                         
[ 1733.037040] TX ring 48:   (null) 1dd8165c 3000cb67 0                         
[ 1733.037040] TX ring 49:   (null) 1dd80d8c 3000cb67 0                         
[ 1733.037040] TX ring 50:   (null) 1dd804bc 3000cb67 0                         
[ 1733.037040] TX ring 51:   (null) 1dd6f74c 3000cb67 0                         
[ 1733.037040] TX ring 52:   (null) 1dd6ee7c 3000cb67 0                         
[ 1733.037040] TX ring 53:   (null) 1dd6dcdc 3000cb67 0                         
[ 1733.037040] TX ring 54:   (null) 1dd6d40c 3000cb67 0                         
[ 1733.037040] TX ring 55:   (null) 1dd6e5ac 3000cb67 0                         
[ 1733.037040] TX ring 56:   (null) 1dd6cb3c 3000cb67 0                         
[ 1733.037040] TX ring 57:   (null) 1dd6c26c 3000cb67 0                         
[ 1733.037040] TX ring 58:   (null) 1dd6b99c 3000cb67 0                         
[ 1733.037040] TX ring 59:   (null) 1dd6b0cc 3000cb67 0                         
[ 1733.037040] TX ring 60:   (null) 1dd6a7fc 3000cb67 0                         
[ 1733.037040] TX ring 61:   (null) 1dd69f2c 3000cb67 0                         
[ 1733.037040] TX ring 62:   (null) 1dd6965c 3000cb67 0                         
[ 1733.037040] TX ring 63:   (null) 1dd684bc 7000cb67 0           


And another time...
[ 3436.024610] 8139cp 0000:00:0b.0 eth1: Transmit timeout, status  c   2b head 6 tail 1 desc 1400    0 80ff                                                     
[ 3436.034101] TX ring 00:   (null) 1dc4dcdc 30000888 0                         
[ 3436.034494] TX ring 01: deffff00 1dc64e02 b000006a 0                         
[ 3436.034494] TX ring 02: ddc35180 1dc4d40c b0000036 0                         
[ 3436.034494] TX ring 03: ddc35480 1dc4cb3c b0000036 0                         
[ 3436.034494] TX ring 04: defff780 1dc4b99c b0000249 0                         
[ 3436.034494] TX ring 05: defff900 1dc4c26c b00005ea 0                         
[ 3436.034494] TX ring 06:   (null) 1dc64e02 30003c30 0                         
[ 3436.034494] TX ring 07:   (null) 1dc5165c 30003c30 0                         
[ 3436.034494] TX ring 08:   (null) 1dc64e02 30000000 0                         
[ 3436.034494] TX ring 09:   (null) 1dc64e0a 30000000 0                         
[ 3436.034494] TX ring 10:   (null) 1dc504bc 30000000 0                         
[ 3436.034494] TX ring 11:   (null) 1dc64e0a 30000000 0                         
[ 3436.034494] TX ring 12:   (null) 1dc64e02 30004881 0                         
[ 3436.034494] TX ring 13:   (null) 1dc64e0a 3000fb66 0                         
[ 3436.034494] TX ring 14:   (null) 1dc64e02 3000c031 0                         
[ 3436.034494] TX ring 15:   (null) 1dc64e02 3000c031 0                         
[ 3436.034494] TX ring 16:   (null) 1dc64e0a 30000100 0                         
[ 3436.034494] TX ring 17:   (null) 1dc64e0a 3000a684 0                         
[ 3436.034494] TX ring 18:   (null) 1dc50d8c 30000100 0                         
[ 3436.034494] TX ring 19:   (null) 1dc5f74c 30000100 0                         
[ 3436.034494] TX ring 20:   (null) 1dc64e02 30000100 0                         
[ 3436.034494] TX ring 21:   (null) 1dc64e02 30000327 0                         
[ 3436.034494] TX ring 22:   (null) 1dc64e02 30002100 0                         
[ 3436.034494] TX ring 23:   (null) 1dc64e0a 3000bd84 0                         
[ 3436.034494] TX ring 24:   (null) 1dc64e0a 30000000 0                         
[ 3436.034494] TX ring 25:   (null) 1dc64e0a 30000000 0                         
[ 3436.034494] TX ring 26:   (null) 1dc64e02 30004881 0                         
[ 3436.034494] TX ring 27:   (null) 1dc64e0a 30000a00 0                         
[ 3436.034494] TX ring 28:   (null) 1dc64e02 30000a00 0                         
[ 3436.034494] TX ring 29:   (null) 1dc64e0a 3000ed83 0                         
[ 3436.034494] TX ring 30:   (null) 1dc64e02 30009600 0                         
[ 3436.034494] TX ring 31:   (null) 1dc64e0a 30001b98 0                         
[ 3436.034494] TX ring 32:   (null) 1dc64e02 30000008 0                         
[ 3436.034494] TX ring 33:   (null) 1dc64e0a 30000000 0                         
[ 3436.034494] TX ring 34:   (null) 1dc64e02 3000dd86 0                         
[ 3436.034494] TX ring 35:   (null) 1dc64e0a 30000000 0                         
[ 3436.034494] TX ring 36:   (null) 1dc64e02 30000000 0                         
[ 3436.034494] TX ring 37:   (null) 1dc64e0a 30000000 0                         
[ 3436.034494] TX ring 38:   (null) 1dc64e02 3000b963 0                         
[ 3436.034494] TX ring 39:   (null) 1dc64e0a 30000100 0                         
[ 3436.034494] TX ring 40:   (null) 1dc64e02 3000b263 0                         
[ 3436.034494] TX ring 41:   (null) 1dc64e0a 3000b008 0                         
[ 3436.034494] TX ring 42:   (null) 1dc64e02 30000000 0                         
[ 3436.034494] TX ring 43:   (null) 1dc64e0a 30004011 0                         
[ 3436.034494] TX ring 44:   (null) 1dc64e02 30000000 0                         
[ 3436.034494] TX ring 45:   (null) 1dc64e0a 30004011 0                         
[ 3436.034494] TX ring 46:   (null) 1dc64e02 30000000 0                         
[ 3436.034494] TX ring 47:   (null) 1dc64e0a 30004011 0                         
[ 3436.034494] TX ring 48:   (null) 1dc5e5ac 30005f06 0                         
[ 3436.034494] TX ring 49:   (null) 1dc5ee7c 30002fc0 0                         
[ 3436.034494] TX ring 50:   (null) 1dc5dcdc 3000b008 0                         
[ 3436.034494] TX ring 51:   (null) 1dc5d40c 30006364 0                         
[ 3436.034494] TX ring 52:   (null) 1dc5cb3c 30006364 0                         
[ 3436.034494] TX ring 53:   (null) 1dc5c26c 30000300 0                         
[ 3436.034494] TX ring 54:   (null) 1dc5b99c 30000300 0                         
[ 3436.034494] TX ring 55:   (null) 1dc5b0cc 30003500 0                         
[ 3436.034494] TX ring 56:   (null) 1dc64e02 30000888 0                         
[ 3436.034494] TX ring 57:   (null) 1dc5a7fc 3000f1d3 0                         
[ 3436.034494] TX ring 58:   (null) 1dc59f2c 3000f1d3 0                         
[ 3436.034494] TX ring 59:   (null) 1dc5965c 3000f1d3 0                         
[ 3436.034494] TX ring 60:   (null) 1dc58d8c 3000f1d3 0                         
[ 3436.034494] TX ring 61:   (null) 1dc4f74c 30002e33 0                         
[ 3436.034494] TX ring 62:   (null) 1dc4ee7c 30002e33 0                         
[ 3436.034494] TX ring 63:   (null) 1dc4e5ac 70000888 0                         

... and another (without all the extra debugging, and without the
WARN_ON, it does seem to survive without triggering the IRQ storm)

[ 3551.027167] 8139cp 0000:00:0b.0 eth1: Transmit timeout, status  c   2b head 19 tail 16 desc 14f0    0 80ff                                                  
[ 3551.036837] TX ring 00:   (null) 1dce04bc 30000000 0                         
[ 3551.037058] TX ring 01:   (null) 1dce0d8c 30000000 0                         
[ 3551.037058] TX ring 02:   (null) 1dd6f74c 30000b66 0                         
[ 3551.037058] TX ring 03:   (null) 1dd6ee7c 30000b66 0                         
[ 3551.037058] TX ring 04:   (null) 1dd6e5ac 30000b66 0                         
[ 3551.037058] TX ring 05:   (null) 1dd6d40c 30000b66 0                         
[ 3551.037058] TX ring 06:   (null) 1dd6dcdc 30000b66 0                         
[ 3551.037058] TX ring 07:   (null) 1dd6c26c 30000000 0                         
[ 3551.037058] TX ring 08:   (null) 1dd6cb3c 30000000 0                         
[ 3551.037058] TX ring 09:   (null) 1dd6b99c 3000f804 0                         
[ 3551.037058] TX ring 10:   (null) 1dd6b0cc 30001105 0                         
[ 3551.037058] TX ring 11:   (null) 1dd6a7fc 30001105 0                         
[ 3551.037058] TX ring 12:   (null) 1dd69f2c 30001105 0                         
[ 3551.037058] TX ring 13:   (null) 1dd6965c 3000493b 0                         
[ 3551.037058] TX ring 14:   (null) 1dd68d8c 3000cea5 0                         
[ 3551.037058] TX ring 15:   (null) 1dc4f74c 30001b55 0                         
[ 3551.037058] TX ring 16: ddd2f9c0 1dd684bc b00005ea 0                         
[ 3551.037058] TX ring 17: defc76c0 1dc4e5ac b00005ea 0                         
[ 3551.037058] TX ring 18: ddd38300 1dc4ee7c b00005ea 0                         
[ 3551.037058] TX ring 19:   (null) 1dc49f2c 3000b838 0                         
[ 3551.037058] TX ring 20:   (null) 1dc48d8c 300035ae 0                         
[ 3551.037058] TX ring 21:   (null) 1dc4965c 300035ae 0                         
[ 3551.037058] TX ring 22:   (null) 1dc484bc 300085b9 0                         
[ 3551.037058] TX ring 23:   (null) 2774c 300085b9 0                            
[ 3551.037058] TX ring 24:   (null) 26e7c 300085b9 0                            
[ 3551.037058] TX ring 25:   (null) 25cdc 3000d5c4 0                            
[ 3551.037058] TX ring 26:   (null) 1dc1fe02 300025d0 0                         
[ 3551.037058] TX ring 27:   (null) 265ac 300025d0 0                            
[ 3551.037058] TX ring 28:   (null) 2540c 3000b838 0                            
[ 3551.037058] TX ring 29:   (null) 2426c 3000b838 0                            
[ 3551.037058] TX ring 30:   (null) 24b3c 3000b838 0                            
[ 3551.037058] TX ring 31:   (null) 2399c 3000b838 0                            
[ 3551.037058] TX ring 32:   (null) 230cc 3000b838 0                            
[ 3551.037058] TX ring 33:   (null) 227fc 30000040 0                            
[ 3551.037058] TX ring 34:   (null) 21f2c 30000040 0                            
[ 3551.037058] TX ring 35:   (null) 2165c 30000040 0                            
[ 3551.037058] TX ring 36:   (null) 204bc 3000b838 0                            
[ 3551.037058] TX ring 37:   (null) 20d8c 3000b838 0                            
[ 3551.037058] TX ring 38:   (null) 1dc4f74c 3000b838 0                         
[ 3551.037058] TX ring 39:   (null) 1dc4ee7c 3000b838 0                         
[ 3551.037058] TX ring 40:   (null) 1dc4dcdc 3000b838 0                         
[ 3551.037058] TX ring 41:   (null) 1dc4d40c 3000b838 0                         
[ 3551.037058] TX ring 42:   (null) 1dc4e5ac 3000b838 0                         
[ 3551.037058] TX ring 43:   (null) 1dc4c26c 3000b838 0                         
[ 3551.037058] TX ring 44:   (null) 1dc4b99c 3000b838 0                         
[ 3551.037058] TX ring 45:   (null) 1dc4cb3c 3000b838 0                         
[ 3551.037058] TX ring 46:   (null) 1dc4b0cc 30000008 0                         
[ 3551.037058] TX ring 47:   (null) 1dc4a7fc 30000008 0                         
[ 3551.037058] TX ring 48:   (null) 1dc49f2c 30000008 0                         
[ 3551.037058] TX ring 49:   (null) 1dc48d8c 30000a00 0                         
[ 3551.037058] TX ring 50:   (null) 1dc4965c 30000a00 0                         
[ 3551.037058] TX ring 51:   (null) 1dc484bc 30000a00 0                         
[ 3551.037058] TX ring 52:   (null) 1dce774c 30000a00 0                         
[ 3551.037058] TX ring 53:   (null) 1dce6e7c 30000a00 0                         
[ 3551.037058] TX ring 54:   (null) 1dce5cdc 30000a00 0                         
[ 3551.037058] TX ring 55:   (null) 1dce65ac 30000000 0                         
[ 3551.037058] TX ring 56:   (null) 1dce540c 30000000 0                         
[ 3551.037058] TX ring 57:   (null) 1dce426c 30000a66 0                         
[ 3551.037058] TX ring 58:   (null) 1dce4b3c 30000a66 0                         
[ 3551.037058] TX ring 59:   (null) 1dce399c 30000a66 0                         
[ 3551.037058] TX ring 60:   (null) 1dce30cc 30000a66 0                         
[ 3551.037058] TX ring 61:   (null) 1dce27fc 30000000 0                         
[ 3551.037058] TX ring 62:   (null) 1dce165c 30000000 0                         
[ 3551.037058] TX ring 63:   (null) 1dce1f2c 70000000 0                         



-- 
dwmw2


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2]  8139cp: reset BQL when ring tx ring cleared
  2015-09-18  1:04                   ` David Woodhouse
@ 2015-09-18  7:40                     ` David Woodhouse
  0 siblings, 0 replies; 15+ messages in thread
From: David Woodhouse @ 2015-09-18  7:40 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Stephen Hemminger, David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 8006 bytes --]

On Fri, 2015-09-18 at 02:04 +0100, David Woodhouse wrote:
> On Fri, 2015-09-18 at 01:44 +0200, Francois Romieu wrote:
> > The TxDmaOkLowDesc register may tell if the Tx dma part is still 
> > making any progress. I have added a TxPoll request. See below.
> 
> I've just added that into the original TX timeout handler, since that
> doesn't seem to be crashing the box for me as long as I avoid the IRQ
> storm.
> 
> Not sure what we learn from it ('desc 6550' printed as hex)... I've
> also made it dump the TX descriptor ring (skb, addr, opts1, opts2):

The TxDmaOkLowDesc values look sane to me; they always match the low
bits of the last descriptor that *should* have been sent, according to
the ring dumps.

I made it store and dump the original contents of the TX descriptor
ring too (before it gets overwritten by the hardware). So we can see
what *was* being transmitted.

It isn't any more enlightening. I also tried just prodding the hardware
by writing to the TxPoll register. That wasn't sufficient to restart
it.

[26589.024750] 8139cp 0000:00:0b.0 eth1: Transmit timeout, status  c   2b head 52 tail 45 desc 96c0    0 80ff
[26589.034632] TX ring 00:   (null) 1de16e7c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 01:   (null) 1de165ac 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 02:   (null) 1de1540c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 03:   (null) 1de15cdc 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 04:   (null) 1de14b3c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 05:   (null) 1de1399c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 06:   (null) 1de130cc 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 07:   (null) 1de1426c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 08:   (null) 1de127fc 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 09:   (null) 1de11f2c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 10:   (null) 1de10d8c 30003130 0 (b0000056 0)            
[26589.034632] TX ring 11:   (null) 1de1165c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 12:   (null) 1df4774c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 13:   (null) 1de104bc 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 14:   (null) 1df46e7c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 15:   (null) 1df465ac 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 16:   (null) 1df45cdc 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 17:   (null) 1df44b3c 30003130 0 (b0000056 0)            
[26589.034632] TX ring 18:   (null) 1df4426c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 19:   (null) 1df4540c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 20:   (null) 1df4399c 30004954 0 (b00005ea 0)            
[26589.034632] TX ring 21:   (null) 1df430cc 30004954 0 (b00005ea 0)            
[26589.034632] TX ring 22:   (null) 1df427fc 30004954 0 (b00005ea 0)            
[26589.034632] TX ring 23:   (null) 1df4165c 30004954 0 (b00005ea 0)            
[26589.034632] TX ring 24:   (null) 1df41f2c 30004954 0 (b00005ea 0)            
[26589.034632] TX ring 25:   (null) 1df40d8c 30004954 0 (b00005ea 0)            
[26589.034632] TX ring 26:   (null) 1dcc9602 3000796b 0 (b000006a 0)            
[26589.034632] TX ring 27:   (null) 1de1774c 30000000 0 (b0000097 0)            
[26589.034632] TX ring 28:   (null) 1de16e7c 30000000 0 (b0000467 0)            
[26589.034632] TX ring 29:   (null) 1df404bc 30000000 0 (b0000557 0)            
[26589.034632] TX ring 30:   (null) 1de165ac 30000000 0 (b00000e7 0)            
[26589.034632] TX ring 31:   (null) 1de1540c 30000000 0 (b00000d7 0)            
[26589.034632] TX ring 32:   (null) 1dcc9602 300087b2 0 (b0000062 0)            
[26589.034632] TX ring 33:   (null) 1de15cdc 300087b2 0 (b0000046 0)            
[26589.034632] TX ring 34:   (null) 1de14b3c 30000540 0 (b000008b 0)            
[26589.034632] TX ring 35:   (null) 1de1426c 3000709e 0 (b0000097 0)            
[26589.034632] TX ring 36:   (null) 1de1399c 3000709e 0 (b0000097 0)            
[26589.034632] TX ring 37:   (null) 1de130cc 3000c169 0 (b0000084 0)            
[26589.034632] TX ring 38:   (null) 1de127fc 3000a5eb 0 (b00005ea 0)            
[26589.034632] TX ring 39:   (null) 1de11f2c 3000d4a3 0 (b0000557 0)            
[26589.034632] TX ring 40:   (null) 1de10d8c 3000b57e 0 (b0000046 0)            
[26589.034632] TX ring 41:   (null) 1de104bc 30000000 0 (b00000a7 0)            
[26589.034632] TX ring 42:   (null) 1dce774c 30004000 0 (b0000046 0)            
[26589.034632] TX ring 43:   (null) 1dce6e7c 300076f6 0 (b0000046 0)            
[26589.034632] TX ring 44:   (null) 1dce5cdc 30002034 0 (b0000096 0)            
[26589.034632] TX ring 45: dde4c3c0 1dce540c b0000557 0 (b0000557 0)            
[26589.034632] TX ring 46: dddc1900 1dce426c b0000097 0 (b0000097 0)            
[26589.034632] TX ring 47: defb3600 1dce4b3c b0000097 0 (b0000097 0)            
[26589.034632] TX ring 48: dde5b240 1dcc9602 b000006a 0 (b000006a 0)            
[26589.034632] TX ring 49: dde5b540 1dce30cc b0000097 0 (b0000097 0)            
[26589.034632] TX ring 50: dde5b600 1dce399c b0000097 0 (b0000097 0)            
[26589.034632] TX ring 51: dde5b180 1dce1f2c b00005ea 0 (b00005ea 0)            
[26589.034632] TX ring 52:   (null) 1df4540c 30003130 0 (b0000056 0)            
[26589.034632] TX ring 53:   (null) 1df45cdc 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 54:   (null) 1df4426c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 55:   (null) 1df44b3c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 56:   (null) 1df4399c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 57:   (null) 1df430cc 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 58:   (null) 1df427fc 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 59:   (null) 1df41f2c 30003130 0 (b0000056 0)            
[26589.034632] TX ring 60:   (null) 1df4165c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 61:   (null) 1df404bc 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 62:   (null) 1df40d8c 30003130 0 (b00005ea 0)            
[26589.034632] TX ring 63:   (null) 1de1774c 70003130 0 (f00005ea 0)            

Those values we see in the low 16 bits of the opts1 field after TX
(0x3130 in the last 12 or so immediately above) look bogus. Those
are supposed to be the frame length that actually got transmitted. But
they bear no relation to the length of the packet it was actually asked
to transmit. And sometimes it's zero (but not really correlated with
the TX hang). We use this value in cp_tx() for the dma_unmap call, as
the length to unmap. I've augmented the debugging in cp_tx() and we
really are seeing the same bogus values there; I'm surprised we've
never seen DMA API debugging trip up on this.

At this point I have no idea what the device is putting in the "TX
status" after a transmit, but it doesn't seem to match what's in the
datasheet.

[26906.679823] 8139cp 0000:00:0b.0 eth1: tx done, slot 35, status 0x3000f804    
[26906.686910] 8139cp 0000:00:0b.0 eth1: tx done, slot 36, status 0x3000f804    
[26906.693894] 8139cp 0000:00:0b.0 eth1: tx done, slot 37, status 0x30001e69    
[26906.700912] 8139cp 0000:00:0b.0 eth1: tx done, slot 38, status 0x3000ca24    
[26906.707800] 8139cp 0000:00:0b.0 eth1: tx done, slot 39, status 0x3000ca24    
[26906.714902] 8139cp 0000:00:0b.0 eth1: tx done, slot 40, status 0x3000ca24    
[26906.721787] 8139cp 0000:00:0b.0 eth1: tx done, slot 41, status 0x30004006    
[26906.728942] 8139cp 0000:00:0b.0 eth1: tx done, slot 42, status 0x30004006    
[26906.735921] 8139cp 0000:00:0b.0 eth1: tx done, slot 43, status 0x30004006    
[26906.742948] 8139cp 0000:00:0b.0 eth1: tx done, slot 44, status 0x30004006    

-- 
dwmw2


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2]  8139cp: reset BQL when ring tx ring cleared
  2015-09-17 23:44                 ` Francois Romieu
  2015-09-18  1:04                   ` David Woodhouse
@ 2015-09-18 22:42                   ` David Woodhouse
  1 sibling, 0 replies; 15+ messages in thread
From: David Woodhouse @ 2015-09-18 22:42 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Stephen Hemminger, David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 21249 bytes --]

On Fri, 2015-09-18 at 01:44 +0200, Francois Romieu wrote:
> The TxDmaOkLowDesc register may tell if the Tx dma part is still 
> making any progress. I have added a TxPoll request. See below.

It isn't making any progress. And TxPoll doesn't help. The only thing
I've found that restarts it is to clear and then set the TxOn bit in
the Cmd register, which resets the entire ring.

I briefly wondered if it was triggered by our constantly banging on the
TxPoll register even when the ring is already running, so I put in an
optimisation to avoid that except in the case where we've had a TxEmpty
(0x80) interrupt. That doesn't help either.

When the problem happens, we've put a new descriptor into the ring and
written to TxPoll to start the processing. We get a TxEmpty interrupt
*without* TxDone, while the descriptor is still marked as being owned
by the hardware.

At other times we do get a TxEmpty without TxDone, but usually the
TxDone happens shortly thereafter. In the problematic case, the TxDone
never happens — and the offending descriptor is never given back.

(Note that I also fixed the off-by-one in the 'tx queued' debugging
message)

[35322.861870] 8139cp 0000:00:0b.0 eth1: tx queued, slot 13, skblen 1294
[35322.861870] 8139cp 0000:00:0b.0 eth1: tx kicking tx_poll, head 14 tail 13 desc 54c0 poll 00
[35322.861870] 8139cp 0000:00:0b.0 eth1: intr, status 0484 cmd 0c cpcmd 002b
[35322.861870] 8139cp 0000:00:0b.0 eth1: tx done, slot 13, status 0x30003a6e dec 54d0
[35322.861870] 8139cp 0000:00:0b.0 eth1: irq not kicking tx_poll, head 14 tail 14 desc 54d0 poll 00
[35322.875014] 8139cp 0000:00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b
[35322.913285] 8139cp 0000:00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b
[35322.917250] 8139cp 0000:00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b
[35322.943075] 8139cp 0000:00:0b.0 eth1: tx queued, slot 14, skblen 70
[35322.943103] 8139cp 0000:00:0b.0 eth1: tx kicking tx_poll, head 15 tail 14 desc 54d0 poll 00
[35322.943138] 8139cp 0000:00:0b.0 eth1: intr, status 0484 cmd 0c cpcmd 002b
[35322.943164] 8139cp 0000:00:0b.0 eth1: tx done, slot 14, status 0x30008001 dec 54e0
[35322.943190] 8139cp 0000:00:0b.0 eth1: irq not kicking tx_poll, head 15 tail 15 desc 54e0 poll 00
[35322.947071] 8139cp 0000:00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b
[35322.959487] 8139cp 0000:00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b
[35323.001723] 8139cp 0000:00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b
[35323.041541] 8139cp 0000:00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b
[35323.041541] 8139cp 0000:00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b
[35323.052386] 8139cp 0000:00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b
[35323.113828] 8139cp 0000:00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b
[35323.114824] 8139cp 0000:00:0b.0 eth1: tx queued, slot 15, skblen 1294
[35323.114851] 8139cp 0000:00:0b.0 eth1: tx kicking tx_poll, head 16 tail 15 desc 54e0 poll 00
[35323.114887] 8139cp 0000:00:0b.0 eth1: intr, status 0080 cmd 0c cpcmd 002b
[35323.114921] 8139cp 0000:00:0b.0 eth1: Invalid TxEmpty, should have seen 15 at ddea54f0 status  c   2b    0 80ff desc 54e0 poll 00
[35323.124559] 8139cp 0000:00:0b.0 eth1: irq kicking tx_poll, head 16 tail 15 desc 54e0 poll 00
[35323.126775] 8139cp 0000:00:0b.0 eth1: tx queued, slot 16, skblen 1294
[35323.126803] 8139cp 0000:00:0b.0 eth1: tx not kicking tx_poll, head 17 tail 15 desc 54e0 poll 00
[35323.127154] 8139cp 0000:00:0b.0 eth1: tx queued, slot 17, skblen 1294
[35323.127181] 8139cp 0000:00:0b.0 eth1: tx not kicking tx_poll, head 18 tail 15 desc 54e0 poll 00
[35323.127776] 8139cp 0000:00:0b.0 eth1: tx queued, slot 18, skblen 1294
[35323.127802] 8139cp 0000:00:0b.0 eth1: tx not kicking tx_poll, head 19 tail 15 desc 54e0 poll 00
[35323.147218] 8139cp 0000:00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b
[35323.247288] 8139cp 0000:00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b
[35323.314456] 8139cp 0000:00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b
...
[35331.995477] 8139cp 0000:00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b
[35332.024428] 8139cp 0000:00:0b.0 eth1: Transmit timeout, status  c   2b    0 80ff desc 54e0 poll 00
[35332.033424] TX ring 00 @ddea5400:   (null) 1ddd65ac 30006362 0 (b000050e 0)
[35332.040399] TX ring 01 @ddea5410:   (null) 1ddd5cdc 30006362 0 (b000050e 0)
[35332.043399] TX ring 02 @ddea5420:   (null) 1ed9165c 30006362 0 (b000050e 0)
[35332.043399] TX ring 03 @ddea5430:   (null) 1ddd540c 30006362 0 (b000050e 0)
[35332.043399] TX ring 04 @ddea5440:   (null) 1ddd4b3c 30006362 0 (b000050e 0)
[35332.043399] TX ring 05 @ddea5450:   (null) 1ddd426c 30006362 0 (b000050e 0)
[35332.043399] TX ring 06 @ddea5460:   (null) 1ddd399c 30006362 0 (b000050e 0)
[35332.043399] TX ring 07 @ddea5470:   (null) 1ddd27fc 30006362 0 (b000050e 0)
[35332.043399] TX ring 08 @ddea5480:   (null) 1ddd1f2c 30006362 0 (b000050e 0)
[35332.043399] TX ring 09 @ddea5490:   (null) 1ddd165c 30006362 0 (b000050e 0)
[35332.043399] TX ring 10 @ddea54a0:   (null) 1ddd0d8c 30006362 0 (b0000046 0)
[35332.043399] TX ring 11 @ddea54b0:   (null) 1ddd30cc 3000a7a6 0 (b00000ad 0)
[35332.043399] TX ring 12 @ddea54c0:   (null) 1de22002 30005b5a 0 (b000006a 0)
[35332.043399] TX ring 13 @ddea54d0:   (null) 1ed9774c 30003a6e 0 (b000050e 0)
[35332.043399] TX ring 14 @ddea54e0:   (null) 1ed96e7c 30008001 0 (b0000046 0)
[35332.043399] TX ring 15 @ddea54f0: c0014780 1ed965ac b000050e 0 (b000050e 0)
[35332.043399] TX ring 16 @ddea5500: dddbfa80 1ed95cdc b000050e 0 (b000050e 0)
[35332.043399] TX ring 17 @ddea5510: ddd59180 1ed9426c b000050e 0 (b000050e 0)
[35332.043399] TX ring 18 @ddea5520: ddca66c0 1ed94b3c b000050e 0 (b000050e 0)
[35332.043399] TX ring 19 @ddea5530:   (null) 1ddd04bc 3000d3d2 0 (b000050e 0)
[35332.043399] TX ring 20 @ddea5540:   (null) 1ddd165c 3000d3d2 0 (b000050e 0)
[35332.043399] TX ring 21 @ddea5550:   (null) 1ed9774c 3000d3d2 0 (b0000056 0)
[35332.043399] TX ring 22 @ddea5560:   (null) 1ed965ac 3000d3d2 0 (b000050e 0)
[35332.043399] TX ring 23 @ddea5570:   (null) 1ed95cdc 3000d3d2 0 (b000050e 0)
[35332.043399] TX ring 24 @ddea5580:   (null) 1ed96e7c 3000d3d2 0 (b000050e 0)
[35332.043399] TX ring 25 @ddea5590:   (null) 1ed9540c 3000d3d2 0 (b000050e 0)
[35332.043399] TX ring 26 @ddea55a0:   (null) 1ed9426c 3000d3d2 0 (b000050e 0)
[35332.043399] TX ring 27 @ddea55b0:   (null) 1ed9399c 3000d3d2 0 (b000050e 0)
[35332.043399] TX ring 28 @ddea55c0:   (null) 1ed930cc 3000d3d2 0 (b000050e 0)
[35332.043399] TX ring 29 @ddea55d0:   (null) 1ed927fc 3000d3d2 0 (b000050e 0)
[35332.043399] TX ring 30 @ddea55e0:   (null) 1ed94b3c 3000d3d2 0 (b000050e 0)
[35332.043399] TX ring 31 @ddea55f0:   (null) 1ed91f2c 3000d3d2 0 (b000050e 0)
[35332.043399] TX ring 32 @ddea5600:   (null) 1ed90d8c 3000d3d2 0 (b000050e 0)
[35332.043399] TX ring 33 @ddea5610:   (null) 1ed904bc 3000d3d2 0 (b000050e 0)
[35332.043399] TX ring 34 @ddea5620:   (null) 1ed9165c 3000d3d2 0 (b000050e 0)
[35332.043399] TX ring 35 @ddea5630:   (null) 1ddd774c 30001716 0 (b000050e 0)
[35332.043399] TX ring 36 @ddea5640:   (null) 1ddd65ac 30001716 0 (b000050e 0)
[35332.043399] TX ring 37 @ddea5650:   (null) 1ddd6e7c 30001716 0 (b000050e 0)
[35332.043399] TX ring 38 @ddea5660:   (null) 1ddd5cdc 30001716 0 (b000050e 0)
[35332.043399] TX ring 39 @ddea5670:   (null) 1ddd540c 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 40 @ddea5680:   (null) 1ddd4b3c 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 41 @ddea5690:   (null) 1ddd399c 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 42 @ddea56a0:   (null) 1ddd426c 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 43 @ddea56b0:   (null) 1ddd27fc 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 44 @ddea56c0:   (null) 1ddd30cc 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 45 @ddea56d0:   (null) 1ddd1f2c 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 46 @ddea56e0:   (null) 1ddd165c 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 47 @ddea56f0:   (null) 1ddd0d8c 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 48 @ddea5700:   (null) 1ddd04bc 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 49 @ddea5710:   (null) 1ed96e7c 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 50 @ddea5720:   (null) 1ed9774c 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 51 @ddea5730:   (null) 1ed965ac 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 52 @ddea5740:   (null) 1ed9540c 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 53 @ddea5750:   (null) 1ed95cdc 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 54 @ddea5760:   (null) 1ed94b3c 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 55 @ddea5770:   (null) 1ed9426c 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 56 @ddea5780:   (null) 1ed9399c 3000a7a6 0 (b000050e 0)
[35332.043399] TX ring 57 @ddea5790:   (null) 1ed930cc 30006362 0 (b000050e 0)
[35332.043399] TX ring 58 @ddea57a0:   (null) 1ed91f2c 30006362 0 (b000050e 0)
[35332.043399] TX ring 59 @ddea57b0:   (null) 1ed927fc 30006362 0 (b000050e 0)
[35332.043399] TX ring 60 @ddea57c0:   (null) 1ed90d8c 30006362 0 (b000050e 0)
[35332.043399] TX ring 61 @ddea57d0:   (null) 1ed904bc 30006362 0 (b000050e 0)
[35332.043399] TX ring 62 @ddea57e0:   (null) 1ddd774c 30006362 0 (b000050e 0)
[35332.043399] TX ring 63 @ddea57f0:   (null) 1ddd6e7c 70006362 0 (f000050e 0)

At this point I'm mostly out of ideas. My 'Invalid TxEmpty' check does
reliably trigger when the problem is happening, but it has false
positives. Perhaps we could set the TimerIntr when that happens, and
then reset the TX ring if the TxDone doesn't come within a certain
amount of time. Although on a half-duplex link or with flow control
it's not clear what the upper bound of that time would be.

At least with the patches I've sent so far, it does manage to recover
when the timeout happens.

This is what I'm testing, for reference:

diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
index d12fc50..4440459 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -45,7 +45,7 @@
 	  default, use ethtool to turn it on.
 
  */
-
+#define DEBUG
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #define DRV_NAME		"8139cp"
@@ -157,6 +157,7 @@ enum {
 	NWayAdvert	= 0x66, /* MII ADVERTISE */
 	NWayLPAR	= 0x68, /* MII LPA */
 	NWayExpansion	= 0x6A, /* MII Expansion */
+	TxDmaOkLowDesc  = 0x82, /* Low 16 bit address of a Tx descriptor. */
 	Config5		= 0xD8,	/* Config5 */
 	TxPoll		= 0xD9,	/* Tell chip to check Tx descriptors for work */
 	RxMaxSize	= 0xDA, /* Max size of an Rx packet (8169 only) */
@@ -341,10 +342,11 @@ struct cp_private {
 	unsigned		tx_tail;
 	struct cp_desc		*tx_ring;
 	struct sk_buff		*tx_skb[CP_TX_RING_SIZE];
-
+	struct cp_desc tx_ring_shadow[CP_TX_RING_SIZE];
 	unsigned		rx_buf_sz;
 	unsigned		wol_enabled : 1; /* Is Wake-on-LAN enabled? */
-
+	unsigned                tx_ring_running : 1;
+	int                     tx_ring_seen;
 	dma_addr_t		ring_dma;
 
 	struct mii_if_info	mii_if;
@@ -504,8 +506,8 @@ rx_status_loop:
 			goto rx_next;
 		}
 
-		netif_dbg(cp, rx_status, dev, "rx slot %d status 0x%x len %d\n",
-			  rx_tail, status, len);
+		netif_dbg(cp, rx_status, dev, "rx (%d) slot %d status 0x%x len %d\n",
+			  rx, rx_tail, status, len);
 
 		new_skb = napi_alloc_skb(napi, buflen);
 		if (!new_skb) {
@@ -554,6 +556,7 @@ rx_next:
 	/* if we did not reach work limit, then we're done with
 	 * this round of polling
 	 */
+	netif_dbg(cp, rx_status, dev, "rx %d of %d\n", rx, budget);
 	if (rx < budget) {
 		unsigned long flags;
 
@@ -566,6 +569,7 @@ rx_next:
 		cpw16_f(IntrMask, cp_intr_mask);
 		spin_unlock_irqrestore(&cp->lock, flags);
 	}
+	netif_dbg(cp, rx_status, dev, "rx done %d of %d\n", rx, budget);
 
 	return rx;
 }
@@ -606,9 +610,38 @@ static irqreturn_t cp_interrupt (int irq, void *dev_instance)
 			__napi_schedule(&cp->napi);
 		}
 	}
+
 	if (status & (TxOK | TxErr | TxEmpty | SWInt))
 		handled |= cp_tx(cp);
 
+	if (status & TxEmpty) {
+		handled = 1;
+		if (cp->tx_head == cp->tx_tail) {
+			/* Out of descriptors and we have nothing more for it.
+			   Let it stop. */
+			cp->tx_ring_running = 0;
+			netif_dbg(cp, tx_queued, cp->dev, "irq not kicking tx_poll, head %d tail %d desc %04x poll %02x", cp->tx_head, cp->tx_tail, cpr16(TxDmaOkLowDesc), cpr8(TxPoll));
+			cp->tx_head = cp->tx_tail = 0;
+		} else {
+			if (cp->tx_ring_seen >= 0) {
+				/* We *know* that tx_ring_seen was in the queue when
+				   we prodded it to start, and yet it's claiming that
+				   it's out of descriptors already! */
+				netdev_warn(dev, "Invalid TxEmpty, should have seen %d at %p status %2x %4x %4x %4x desc %4x poll %02x\n",
+					    cp->tx_ring_seen, &cp->tx_ring[cp->tx_ring_seen],
+					    cpr8(Cmd), cpr16(CpCmd),
+					    cpr16(IntrStatus), cpr16(IntrMask), cpr16(TxDmaOkLowDesc), cpr8(TxPoll));
+			}
+			/* The hardware raced with us adding a new descriptor, and
+			   we didn't get the IRQ in time so we didn't prod it.
+			   Prod it now to restart */
+			cp->tx_ring_running = 1;
+			cp->tx_ring_seen = cp->tx_head ? cp->tx_head - 1 : (CP_TX_RING_SIZE - 1);;
+			netif_dbg(cp, tx_queued, cp->dev, "irq kicking tx_poll, head %d tail %d desc %04x poll %02x", cp->tx_head, cp->tx_tail, cpr16(TxDmaOkLowDesc), cpr8(TxPoll));
+			cpw8(TxPoll, NormalTxPoll);
+		}
+	}
+
 	if (status & LinkChg) {
 		handled = 1;
 		mii_check_media(&cp->mii_if, netif_msg_link(cp), false);
@@ -665,6 +698,11 @@ static int cp_tx (struct cp_private *cp)
 		if (status & DescOwn)
 			break;
 
+		/* If it's processed the last descriptor we *knew*
+		 * was in place when we last started it, note that. */
+		if (tx_tail == cp->tx_ring_seen)
+			cp->tx_ring_seen = -1;
+
 		handled = 1;
 		skb = cp->tx_skb[tx_tail];
 		BUG_ON(!skb);
@@ -692,12 +730,12 @@ static int cp_tx (struct cp_private *cp)
 				cp->dev->stats.tx_packets++;
 				cp->dev->stats.tx_bytes += skb->len;
 				netif_dbg(cp, tx_done, cp->dev,
-					  "tx done, slot %d\n", tx_tail);
+					  "tx done, slot %d, status 0x%x dec %04x\n", tx_tail, status, cpr16(TxDmaOkLowDesc));
 			}
 			bytes_compl += skb->len;
 			pkts_compl++;
 			dev_kfree_skb_irq(skb);
-		}
+		} else netif_dbg(cp, tx_done, cp->dev, "tx partial done, stlot %d, status 0x%x", tx_tail, status);
 
 		cp->tx_skb[tx_tail] = NULL;
 
@@ -773,6 +811,7 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb,
 
 		txd->opts2 = opts2;
 		txd->addr = cpu_to_le64(mapping);
+		
 		wmb();
 
 		flags = eor | len | DescOwn | FirstFrag | LastFrag;
@@ -790,13 +829,16 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb,
 		}
 
 		txd->opts1 = cpu_to_le32(flags);
+		cp->tx_ring_shadow[entry].opts2 = opts2;
+		cp->tx_ring_shadow[entry].addr = cpu_to_le64(mapping);
+		cp->tx_ring_shadow[entry].opts1 = cpu_to_le32(flags);
 		wmb();
 
 		cp->tx_skb[entry] = skb;
 		entry = NEXT_TX(entry);
 	} else {
 		struct cp_desc *txd;
-		u32 first_len, first_eor;
+		u32 first_len, first_eor, ctrl;
 		dma_addr_t first_mapping;
 		int frag, first_entry = entry;
 		const struct iphdr *ip = ip_hdr(skb);
@@ -817,7 +859,6 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb,
 		for (frag = 0; frag < skb_shinfo(skb)->nr_frags; frag++) {
 			const skb_frag_t *this_frag = &skb_shinfo(skb)->frags[frag];
 			u32 len;
-			u32 ctrl;
 			dma_addr_t mapping;
 
 			len = skb_frag_size(this_frag);
@@ -854,6 +895,9 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb,
 			wmb();
 
 			txd->opts1 = cpu_to_le32(ctrl);
+			cp->tx_ring_shadow[entry].opts2 = opts2;
+			cp->tx_ring_shadow[entry].addr = cpu_to_le64(mapping);
+			cp->tx_ring_shadow[entry].opts1 = cpu_to_le32(ctrl);
 			wmb();
 
 			cp->tx_skb[entry] = skb;
@@ -863,37 +907,48 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb,
 		txd = &cp->tx_ring[first_entry];
 		txd->opts2 = opts2;
 		txd->addr = cpu_to_le64(first_mapping);
+		cp->tx_ring_shadow[first_entry].opts2 = opts2;
+		cp->tx_ring_shadow[first_entry].addr = cpu_to_le64(first_mapping);
+
 		wmb();
 
 		if (skb->ip_summed == CHECKSUM_PARTIAL) {
 			if (ip->protocol == IPPROTO_TCP)
-				txd->opts1 = cpu_to_le32(first_eor | first_len |
+				ctrl = (first_eor | first_len |
 							 FirstFrag | DescOwn |
 							 IPCS | TCPCS);
 			else if (ip->protocol == IPPROTO_UDP)
-				txd->opts1 = cpu_to_le32(first_eor | first_len |
+				ctrl = (first_eor | first_len |
 							 FirstFrag | DescOwn |
 							 IPCS | UDPCS);
 			else
 				BUG();
 		} else
-			txd->opts1 = cpu_to_le32(first_eor | first_len |
+			ctrl = (first_eor | first_len |
 						 FirstFrag | DescOwn);
+		txd->opts1 = cpu_to_le32(ctrl);
+		cp->tx_ring_shadow[first_entry].opts1 = cpu_to_le32(ctrl);
 		wmb();
 	}
 	cp->tx_head = entry;
 
 	netdev_sent_queue(dev, skb->len);
 	netif_dbg(cp, tx_queued, cp->dev, "tx queued, slot %d, skblen %d\n",
-		  entry, skb->len);
+		  entry ? entry - 1 : CP_TX_RING_SIZE-1, skb->len);
 	if (TX_BUFFS_AVAIL(cp) <= (MAX_SKB_FRAGS + 1))
 		netif_stop_queue(dev);
 
 out_unlock:
+	if (!cp->tx_ring_running) {
+		cp->tx_ring_running = 1;
+		cp->tx_ring_seen = cp->tx_head ? cp->tx_head - 1 : (CP_TX_RING_SIZE - 1);;
+		netif_dbg(cp, tx_queued, cp->dev, "tx kicking tx_poll, head %d tail %d desc %04x poll %02x", cp->tx_head, cp->tx_tail, cpr16(TxDmaOkLowDesc), cpr8(TxPoll));
+		cpw8(TxPoll, NormalTxPoll);
+	} else {
+		netif_dbg(cp, tx_queued, cp->dev, "tx not kicking tx_poll, head %d tail %d desc %04x poll %02x", cp->tx_head, cp->tx_tail, cpr16(TxDmaOkLowDesc), cpr8(TxPoll));
+	}
 	spin_unlock_irqrestore(&cp->lock, intr_flags);
 
-	cpw8(TxPoll, NormalTxPoll);
-
 	return NETDEV_TX_OK;
 out_dma_error:
 	dev_kfree_skb_any(skb);
@@ -1035,7 +1090,8 @@ static inline void cp_start_hw (struct cp_private *cp)
 	 * This variant appears to work fine.
 	 */
 	cpw8(Cmd, RxOn | TxOn);
-
+	cp->tx_ring_running = 0;
+	cp->tx_ring_seen = -1;
 	netdev_reset_queue(cp->dev);
 }
 
@@ -1057,7 +1113,7 @@ static void cp_init_hw (struct cp_private *cp)
 	cpw32_f (MAC0 + 4, le32_to_cpu (*(__le32 *) (dev->dev_addr + 4)));
 
 	cp_start_hw(cp);
-	cpw8(TxThresh, 0x06); /* XXX convert magic num to a constant */
+	cpw8(TxThresh, 0x2f); /* XXX convert magic num to a constant */
 
 	__cp_set_rx_mode(dev);
 	cpw32_f (TxConfig, IFG | (TX_DMA_BURST << TxDMAShift));
@@ -1255,26 +1311,74 @@ static int cp_close (struct net_device *dev)
 static void cp_tx_timeout(struct net_device *dev)
 {
 	struct cp_private *cp = netdev_priv(dev);
+	dma_addr_t ring_dma;
 	unsigned long flags;
 	int rc;
+	int i;
 
-	netdev_warn(dev, "Transmit timeout, status %2x %4x %4x %4x\n",
+	netdev_warn(dev, "Transmit timeout, status %2x %4x %4x %4x desc %02x poll %02x\n",
 		    cpr8(Cmd), cpr16(CpCmd),
-		    cpr16(IntrStatus), cpr16(IntrMask));
+		    cpr16(IntrStatus), cpr16(IntrMask),
+		    cpr16(TxDmaOkLowDesc), cpr8(TxPoll));
 
 	spin_lock_irqsave(&cp->lock, flags);
 
+	for (i = 0; i < CP_TX_RING_SIZE; i++) {
+		printk("TX ring %02d @%p: %p %llx %x %x (%x %x)\n",
+		       i, &cp->tx_ring[i], cp->tx_skb[i], (unsigned long long)cp->tx_ring[i].addr,
+		       cp->tx_ring[i].opts1, cp->tx_ring[i].opts2,
+		       cp->tx_ring_shadow[i].opts1, cp->tx_ring_shadow[i].opts2);
+	}
+//static void cp_stop_hw (struct cp_private *cp)
+{
+	cpw8(Cmd, RxOn);
+
+	cp->tx_head = cp->tx_tail = 0;
+
+}
+//static void cp_clean_rings (struct cp_private *cp)
+{
+	struct cp_desc *desc;
+
+	for (i = 0; i < CP_TX_RING_SIZE; i++) {
+		if (cp->tx_skb[i]) {
+			struct sk_buff *skb = cp->tx_skb[i];
+
+			desc = cp->tx_ring + i;
+			dma_unmap_single(&cp->pdev->dev,le64_to_cpu(desc->addr),
+					 le32_to_cpu(desc->opts1) & 0xffff,
+					 PCI_DMA_TODEVICE);
+			if (le32_to_cpu(desc->opts1) & LastFrag)
+				dev_kfree_skb_any(skb);
+			cp->dev->stats.tx_dropped++;
+			cp->tx_skb[i] = NULL;
+		}
+		if (i == CP_TX_RING_SIZE - 1)
+			cp->tx_ring[i].opts1 = cpu_to_le32(RingEnd);
+		else
+			cp->tx_ring[i].opts1 = cpu_to_le32(0);
+		cp->tx_ring[i].opts2 = cpu_to_le32(0);
+		cp->tx_ring[i].addr = cpu_to_le64(0);
+	}
+	netdev_reset_queue(cp->dev);
+}
+
+	cpw8(Cmd, RxOn | TxOn);
+	cp->tx_ring_running = 0;
+	cp->tx_ring_seen = -1;
+
+
+#if 0
 	cp_stop_hw(cp);
 	cp_clean_rings(cp);
 	rc = cp_init_rings(cp);
 	cp_start_hw(cp);
+#endif
 	__cp_set_rx_mode(dev);
 	cpw16_f(IntrMask, cp_norx_intr_mask);
-
-	netif_wake_queue(dev);
 	napi_schedule_irqoff(&cp->napi);
-
 	spin_unlock_irqrestore(&cp->lock, flags);
+	netif_wake_queue(dev);
 }
 
 static int cp_change_mtu(struct net_device *dev, int new_mtu)
@@ -1989,7 +2093,7 @@ static int cp_init_one (struct pci_dev *pdev, const struct pci_device_id *ent)
 		    cpu_to_le16(read_eeprom (regs, i + 7, addr_len));
 
 	dev->netdev_ops = &cp_netdev_ops;
-	netif_napi_add(dev, &cp->napi, cp_rx_poll, 16);
+	netif_napi_add(dev, &cp->napi, cp_rx_poll, NAPI_POLL_WEIGHT);
 	dev->ethtool_ops = &cp_ethtool_ops;
 	dev->watchdog_timeo = TX_TIMEOUT;
 

-- 
dwmw2


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-09-18 22:42 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-20 16:54 [PATCH net 2/2] 8139cp: reset BQL when ring tx ring cleared Stephen Hemminger
2013-05-20 21:03 ` David Miller
2013-05-20 21:37 ` Francois Romieu
2013-05-21  0:27   ` Stephen Hemminger
2015-09-14 12:05     ` David Woodhouse
2015-09-14 21:59       ` Francois Romieu
2015-09-17 11:36         ` David Woodhouse
2015-09-17 15:06           ` David Woodhouse
2015-09-17 20:44             ` Francois Romieu
2015-09-17 22:14               ` David Woodhouse
2015-09-17 23:44                 ` Francois Romieu
2015-09-18  1:04                   ` David Woodhouse
2015-09-18  7:40                     ` David Woodhouse
2015-09-18 22:42                   ` David Woodhouse
2015-09-17 12:02         ` David Woodhouse

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.