netdev.vger.kernel.org archive mirror
* 3.9.5+:  Crash in tcp_input.c:4810.
@ 2013-06-17 18:08 Ben Greear
  2013-06-17 18:17 ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Greear @ 2013-06-17 18:08 UTC (permalink / raw)
  To: netdev

This is from a 3.9.5+ kernel with local patches.  We saw this crash during
a weekend run where we had TCP traffic trying to run on 128+ wifi station
interfaces as the interfaces associated over and over again (the AP
could handle no more than 127 stations and would dis-associate others
when the 128th tried to associate).

The code in question is this from the tcp_collapse() method:

		skb_reserve(nskb, header);
		memcpy(nskb->head, skb->head, header);
		memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
		TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
		__skb_queue_before(list, skb, nskb);
		skb_set_owner_r(nskb, sk);

		/* Copy data, releasing collapsed skbs. */
		while (copy > 0) {
			int offset = start - TCP_SKB_CB(skb)->seq;
			int size = TCP_SKB_CB(skb)->end_seq - start;

			BUG_ON(offset < 0);



------------[ cut here ]------------
kernel BUG at /home/greearb/git/linux-3.9.dev.y/net/ipv4/tcp_input.c:4810!
invalid opcode: 0000 [#1] PREEMPT SMP
Modules linked in: nf_nat_ipv4 nf_nat 8021q garp stp mrp llc fuse macvlan wanlink(O) pktgen lockd sunrpc f71882fg e1000e ath9k ath9k_common ath9k_hw ath 
mac80211 snd_hda_codec_realtek coretemp snd_hda_intel hwmon snd_hda_codec snd_hwdep mperf intel_powerclamp snd_seq snd_seq_device snd_pcm cfg80211 ptp pps_core 
snd_page_alloc snd_timer kvm cdc_acm i2c_i801 gpio_ich iTCO_wdt iTCO_vendor_support snd soundcore ppdev microcode pcspkr serio_raw lpc_ich parport_pc parport 
uinput ipv6 i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: iptable_nat]
CPU 1
Pid: 0, comm: swapper/1 Tainted: G        WC O 3.9.5+ #80 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8155a9e9>]  [<ffffffff8155a9e9>] tcp_collapse+0x267/0x37a
RSP: 0018:ffff88022bc83608  EFLAGS: 00010297
RAX: 0000000000001100 RBX: ffff8801b8f08730 RCX: 0000000000000000
RDX: 00000000fffffa4d RSI: ffff8801b8f086c0 RDI: ffff880219adbe00
RBP: ffff88022bc83668 R08: 000000009efbe0a8 R09: ffff8801d25eb328
R10: ffffffff8109d762 R11: ffff88021791ff00 R12: 000000009efba1f9
R13: ffff8801d25eb300 R14: ffff880219adbe00 R15: 0000000000000df0
FS:  0000000000000000(0000) GS:ffff88022bc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000286f350 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper/1 (pid: 0, threadinfo ffff880222162000, task ffff88022215ddc0)
Stack:
  ffff88022bc83618 ffff8801000000d0 9efc19bcfffffa4d 0000000000000000
  ffff8801b8f086c0 ffff880219adbe28 ffff88022bc83698 ffff8801b8f086c0
  ffff8801b8f08c88 ffff8801b8f08c88 0000000000000a80 ffff8801c7841d00
Call Trace:
  <IRQ>
  [<ffffffff8155b275>] tcp_try_rmem_schedule+0x1c7/0x26d
  [<ffffffff8155b60c>] tcp_data_queue+0x1a9/0xa7e
  [<ffffffff8155e9f5>] tcp_rcv_established+0x63b/0x696
  [<ffffffff81566647>] tcp_v4_do_rcv+0x1bd/0x37d
  [<ffffffff815687f7>] tcp_v4_rcv+0x4ed/0x7d7
  [<ffffffff815384f0>] ? nf_hook_slow+0x102/0x113
  [<ffffffff815489fc>] ? xfrm4_policy_check.clone.0+0x4f/0x4f
  [<ffffffff81548b18>] ip_local_deliver_finish+0x11c/0x199
  [<ffffffff815489fc>] ? xfrm4_policy_check.clone.0+0x4f/0x4f
  [<ffffffff815489fc>] ? xfrm4_policy_check.clone.0+0x4f/0x4f
  [<ffffffff81548be1>] NF_HOOK.clone.1+0x4c/0x53
  [<ffffffff81548c36>] ip_local_deliver+0x4e/0x52
  [<ffffffff815488a6>] ip_rcv_finish+0x2da/0x2f2
  [<ffffffff815485cc>] ? inet_add_protocol+0x48/0x48
  [<ffffffff81548be1>] NF_HOOK.clone.1+0x4c/0x53
  [<ffffffff81548e76>] ip_rcv+0x23c/0x26a
  [<ffffffff8150f392>] __netif_receive_skb_core+0x4e7/0x558
  [<ffffffff8150f451>] __netif_receive_skb+0x4e/0x5e
  [<ffffffff81511657>] netif_receive_skb+0x5b/0x90
  [<ffffffffa0559fe2>] ? ieee80211_data_to_8023+0x2eb/0x370 [cfg80211]
  [<ffffffff815ca369>] ? _raw_read_unlock+0x24/0x2f
  [<ffffffffa07afa4d>] ieee80211_deliver_skb+0xcd/0x108 [mac80211]
  [<ffffffffa07b130d>] ieee80211_rx_handlers+0x1305/0x18c9 [mac80211]
  [<ffffffffa07b21cf>] ieee80211_prepare_and_rx_handle+0x8fe/0x96a [mac80211]
  [<ffffffffa07b29c4>] ieee80211_rx+0x6e9/0x759 [mac80211]
  [<ffffffff81307afc>] ? swiotlb_map_page+0x67/0xbb
  [<ffffffffa0971f83>] ath_rx_tasklet+0xfce/0x10a7 [ath9k]
  [<ffffffffa09703b5>] ath9k_tasklet+0xf9/0x150 [ath9k]
  [<ffffffff8109d6d3>] tasklet_action+0x7d/0xcc
  [<ffffffff8109db2c>] __do_softirq+0x114/0x254
  [<ffffffff815ca27d>] ? _raw_spin_unlock+0x24/0x2f
  [<ffffffff8109dcfe>] irq_exit+0x4b/0xa8
  [<ffffffff815d271d>] do_IRQ+0x9d/0xb4
  [<ffffffff815ca7ed>] common_interrupt+0x6d/0x6d
  <EOI>
  [<ffffffff810c6b5c>] ? set_next_entity+0x28/0x7e
  [<ffffffff814c74b6>] ? cpuidle_wrap_enter+0x43/0x78
  [<ffffffff814c74af>] ? cpuidle_wrap_enter+0x3c/0x78
  [<ffffffff814c74fb>] cpuidle_enter_tk+0x10/0x12
  [<ffffffff814c6fb5>] cpuidle_enter_state+0x17/0x3f
  [<ffffffff814c7734>] cpuidle_idle_call+0xba/0xfa
  [<ffffffff810177dd>] cpu_idle+0x65/0xb5
  [<ffffffff815c35d3>] start_secondary+0x211/0x213
  [<ffffffff81b34b86>] ? regulator_init_complete+0x62/0x157
Code: 89 30 4d 89 75 08 ff 43 10 48 8b 75 c0 e8 30 d0 ff ff e9 ee 00 00 00 4d 8d 4d 28 44 89 e2 41 2b 51 18 45 8b 41 1c 89 55 b0 79 04 <0f> 0b eb fe 45 29 e0 45 
85 c0 7e 4d 45 39 f8 4c 89 f7 4c 89 4d
RIP  [<ffffffff8155a9e9>] tcp_collapse+0x267/0x37a
  RSP <ffff88022bc83608>
---[ end trace f30d144e49d988df ]---
Kernel panic - not syncing: Fatal exception in interrupt
drm_kms_helper: panic occurred, switching back to text console
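
A note on reading the oops above: the `Code:` bytes just before the trapping `<0f> 0b` appear to subtract `TCP_SKB_CB(skb)->seq` from `start` into `edx`, so the low 32 bits of RDX would be the negative `offset` that tripped the `BUG_ON()`. That is an inference from the dump, not something confirmed in this thread. A minimal sketch, using a hypothetical helper `reg_to_s32()`, of reinterpreting such a register value as the kernel's `int`:

```c
#include <stdint.h>

/* Hypothetical helper (not a kernel API): keep the low 32 bits of a
 * 64-bit register value and view them as a signed 32-bit integer,
 * the way `int offset` is stored in the lower half of RDX. */
static int32_t reg_to_s32(uint64_t reg)
{
	uint32_t low = (uint32_t)reg;

	/* Values with the top bit set represent negatives in two's
	 * complement; convert explicitly to avoid relying on
	 * implementation-defined narrowing. */
	if (low >= 0x80000000u)
		return (int32_t)((int64_t)low - 4294967296LL);
	return (int32_t)low;
}
```

Under that assumption, `reg_to_s32(0x00000000fffffa4dULL)` evaluates to -1459 for the RDX value in this oops.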

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: 3.9.5+:  Crash in tcp_input.c:4810.
  2013-06-17 18:08 3.9.5+: Crash in tcp_input.c:4810 Ben Greear
@ 2013-06-17 18:17 ` Eric Dumazet
  2013-06-21 19:26   ` Ben Greear
  2013-07-01 18:10   ` Ben Greear
  0 siblings, 2 replies; 16+ messages in thread
From: Eric Dumazet @ 2013-06-17 18:17 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev

On Mon, 2013-06-17 at 11:08 -0700, Ben Greear wrote:
> This is from a 3.9.5+ kernel with local patches.  We saw this crash during
> a weekend run where we had TCP traffic trying to run on 128+ wifi station
> interfaces as the interfaces associated over and over again (the AP
> could handle no more than 127 stations and would dis-associate others
> when the 128th tried to associate).
> 
> The code in question is this from the tcp_collapse() method:
> 
> 		skb_reserve(nskb, header);
> 		memcpy(nskb->head, skb->head, header);
> 		memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
> 		TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
> 		__skb_queue_before(list, skb, nskb);
> 		skb_set_owner_r(nskb, sk);
> 
> 		/* Copy data, releasing collapsed skbs. */
> 		while (copy > 0) {
> 			int offset = start - TCP_SKB_CB(skb)->seq;
> 			int size = TCP_SKB_CB(skb)->end_seq - start;
> 
> 			BUG_ON(offset < 0);
> 
> 
> 
> ------------[ cut here ]------------
> kernel BUG at /home/greearb/git/linux-3.9.dev.y/net/ipv4/tcp_input.c:4810!
> invalid opcode: 0000 [#1] PREEMPT SMP
> Modules linked in: nf_nat_ipv4 nf_nat 8021q garp stp mrp llc fuse macvlan wanlink(O) pktgen lockd sunrpc f71882fg e1000e ath9k ath9k_common ath9k_hw ath 
> mac80211 snd_hda_codec_realtek coretemp snd_hda_intel hwmon snd_hda_codec snd_hwdep mperf intel_powerclamp snd_seq snd_seq_device snd_pcm cfg80211 ptp pps_core 
> snd_page_alloc snd_timer kvm cdc_acm i2c_i801 gpio_ich iTCO_wdt iTCO_vendor_support snd soundcore ppdev microcode pcspkr serio_raw lpc_ich parport_pc parport 
> uinput ipv6 i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: iptable_nat]
> CPU 1
> Pid: 0, comm: swapper/1 Tainted: G        WC O 3.9.5+ #80 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
> RIP: 0010:[<ffffffff8155a9e9>]  [<ffffffff8155a9e9>] tcp_collapse+0x267/0x37a
> RSP: 0018:ffff88022bc83608  EFLAGS: 00010297
> RAX: 0000000000001100 RBX: ffff8801b8f08730 RCX: 0000000000000000
> RDX: 00000000fffffa4d RSI: ffff8801b8f086c0 RDI: ffff880219adbe00
> RBP: ffff88022bc83668 R08: 000000009efbe0a8 R09: ffff8801d25eb328
> R10: ffffffff8109d762 R11: ffff88021791ff00 R12: 000000009efba1f9
> R13: ffff8801d25eb300 R14: ffff880219adbe00 R15: 0000000000000df0
> FS:  0000000000000000(0000) GS:ffff88022bc80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 000000000286f350 CR3: 0000000001a0c000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper/1 (pid: 0, threadinfo ffff880222162000, task ffff88022215ddc0)
> Stack:
>   ffff88022bc83618 ffff8801000000d0 9efc19bcfffffa4d 0000000000000000
>   ffff8801b8f086c0 ffff880219adbe28 ffff88022bc83698 ffff8801b8f086c0
>   ffff8801b8f08c88 ffff8801b8f08c88 0000000000000a80 ffff8801c7841d00
> Call Trace:
>   <IRQ>
>   [<ffffffff8155b275>] tcp_try_rmem_schedule+0x1c7/0x26d
>   [<ffffffff8155b60c>] tcp_data_queue+0x1a9/0xa7e
>   [<ffffffff8155e9f5>] tcp_rcv_established+0x63b/0x696
>   [<ffffffff81566647>] tcp_v4_do_rcv+0x1bd/0x37d
>   [<ffffffff815687f7>] tcp_v4_rcv+0x4ed/0x7d7
>   [<ffffffff815384f0>] ? nf_hook_slow+0x102/0x113
>   [<ffffffff815489fc>] ? xfrm4_policy_check.clone.0+0x4f/0x4f
>   [<ffffffff81548b18>] ip_local_deliver_finish+0x11c/0x199
>   [<ffffffff815489fc>] ? xfrm4_policy_check.clone.0+0x4f/0x4f
>   [<ffffffff815489fc>] ? xfrm4_policy_check.clone.0+0x4f/0x4f
>   [<ffffffff81548be1>] NF_HOOK.clone.1+0x4c/0x53
>   [<ffffffff81548c36>] ip_local_deliver+0x4e/0x52
>   [<ffffffff815488a6>] ip_rcv_finish+0x2da/0x2f2
>   [<ffffffff815485cc>] ? inet_add_protocol+0x48/0x48
>   [<ffffffff81548be1>] NF_HOOK.clone.1+0x4c/0x53
>   [<ffffffff81548e76>] ip_rcv+0x23c/0x26a
>   [<ffffffff8150f392>] __netif_receive_skb_core+0x4e7/0x558
>   [<ffffffff8150f451>] __netif_receive_skb+0x4e/0x5e
>   [<ffffffff81511657>] netif_receive_skb+0x5b/0x90
>   [<ffffffffa0559fe2>] ? ieee80211_data_to_8023+0x2eb/0x370 [cfg80211]
>   [<ffffffff815ca369>] ? _raw_read_unlock+0x24/0x2f
>   [<ffffffffa07afa4d>] ieee80211_deliver_skb+0xcd/0x108 [mac80211]
>   [<ffffffffa07b130d>] ieee80211_rx_handlers+0x1305/0x18c9 [mac80211]
>   [<ffffffffa07b21cf>] ieee80211_prepare_and_rx_handle+0x8fe/0x96a [mac80211]
>   [<ffffffffa07b29c4>] ieee80211_rx+0x6e9/0x759 [mac80211]
>   [<ffffffff81307afc>] ? swiotlb_map_page+0x67/0xbb
>   [<ffffffffa0971f83>] ath_rx_tasklet+0xfce/0x10a7 [ath9k]
>   [<ffffffffa09703b5>] ath9k_tasklet+0xf9/0x150 [ath9k]
>   [<ffffffff8109d6d3>] tasklet_action+0x7d/0xcc
>   [<ffffffff8109db2c>] __do_softirq+0x114/0x254
>   [<ffffffff815ca27d>] ? _raw_spin_unlock+0x24/0x2f
>   [<ffffffff8109dcfe>] irq_exit+0x4b/0xa8
>   [<ffffffff815d271d>] do_IRQ+0x9d/0xb4
>   [<ffffffff815ca7ed>] common_interrupt+0x6d/0x6d
>   <EOI>
>   [<ffffffff810c6b5c>] ? set_next_entity+0x28/0x7e
>   [<ffffffff814c74b6>] ? cpuidle_wrap_enter+0x43/0x78
>   [<ffffffff814c74af>] ? cpuidle_wrap_enter+0x3c/0x78
>   [<ffffffff814c74fb>] cpuidle_enter_tk+0x10/0x12
>   [<ffffffff814c6fb5>] cpuidle_enter_state+0x17/0x3f
>   [<ffffffff814c7734>] cpuidle_idle_call+0xba/0xfa
>   [<ffffffff810177dd>] cpu_idle+0x65/0xb5
>   [<ffffffff815c35d3>] start_secondary+0x211/0x213
>   [<ffffffff81b34b86>] ? regulator_init_complete+0x62/0x157
> Code: 89 30 4d 89 75 08 ff 43 10 48 8b 75 c0 e8 30 d0 ff ff e9 ee 00 00 00 4d 8d 4d 28 44 89 e2 41 2b 51 18 45 8b 41 1c 89 55 b0 79 04 <0f> 0b eb fe 45 29 e0 45 
> 85 c0 7e 4d 45 39 f8 4c 89 f7 4c 89 4d
> RIP  [<ffffffff8155a9e9>] tcp_collapse+0x267/0x37a
>   RSP <ffff88022bc83608>
> ---[ end trace f30d144e49d988df ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> drm_kms_helper: panic occurred, switching back to text console
> 
> Thanks,
> Ben
> 

Thanks Ben

The same problem was reported today and is under investigation.


* Re: 3.9.5+:  Crash in tcp_input.c:4810.
  2013-06-17 18:17 ` Eric Dumazet
@ 2013-06-21 19:26   ` Ben Greear
  2013-07-01 18:10   ` Ben Greear
  1 sibling, 0 replies; 16+ messages in thread
From: Ben Greear @ 2013-06-21 19:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On 06/17/2013 11:17 AM, Eric Dumazet wrote:
> On Mon, 2013-06-17 at 11:08 -0700, Ben Greear wrote:
>> This is from a 3.9.5+ kernel with local patches.  We saw this crash during
>> a weekend run where we had TCP traffic trying to run on 128+ wifi station
>> interfaces as the interfaces associated over and over again (the AP
>> could handle no more than 127 stations and would dis-associate others
>> when the 128th tried to associate).
>>
>> The code in question is this from the tcp_collapse() method:
>>
>> 		skb_reserve(nskb, header);
>> 		memcpy(nskb->head, skb->head, header);
>> 		memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
>> 		TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
>> 		__skb_queue_before(list, skb, nskb);
>> 		skb_set_owner_r(nskb, sk);
>>
>> 		/* Copy data, releasing collapsed skbs. */
>> 		while (copy > 0) {
>> 			int offset = start - TCP_SKB_CB(skb)->seq;
>> 			int size = TCP_SKB_CB(skb)->end_seq - start;
>>
>> 			BUG_ON(offset < 0);


It took about 3 days of running the same torture test (on 3.9.6+ this time),
but we saw this crash again.

No other kernel splats seen before this (at least for several hours).

Since it is rare, maybe we could change it to a WARN_ON, and take whatever
measures are needed to continue running?


------------[ cut here ]------------
kernel BUG at /home/greearb/git/linux-3.9.dev.y/net/ipv4/tcp_input.c:4810!
invalid opcode: 0000 [#1] PREEMPT SMP
Modules linked in: nf_nat_ipv4 nf_nat 8021q garp stp mrp llc fuse macvlan wanlink(O) pktgen lockd sunrpc f71882fg cdc_acm snd_hda_codec_realtek snd_hda_intel 
snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_page_alloc snd_timer snd ath9k soundcore serio_raw gpio_ich pcspkr ath9k_common coretemp hwmon mperf 
intel_powerclamp ath9k_hw ath kvm mac80211 e1000e ptp cfg80211 ppdev iTCO_wdt iTCO_vendor_support parport_pc microcode lpc_ich i2c_i801 pps_core parport uinput 
ipv6 i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: iptable_nat]
CPU 3
Pid: 2443, comm: btserver Tainted: G        WC O 3.9.6+ #84 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8155ac89>]  [<ffffffff8155ac89>] tcp_collapse+0x267/0x37a
RSP: 0000:ffff88022bd83608  EFLAGS: 00010287
RAX: 0000000000001100 RBX: ffff8801850fd170 RCX: 0000000000000000
RDX: 00000000fffff4a5 RSI: ffff8801850fd100 RDI: ffff88009ff47700
RBP: ffff88022bd83668 R08: 000000009a632eec R09: ffff880196284428
R10: ffffffffa04047c0 R11: ffff880217f50000 R12: 000000009a62ea89
R13: ffff880196284400 R14: ffff88009ff47700 R15: 0000000000000df0
FS:  00007f74ca9eb740(0000) GS:ffff88022bd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000007e4dec8 CR3: 00000002179eb000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process btserver (pid: 2443, threadinfo ffff88021773c000, task ffff88021be6c650)
Stack:
  ffff88022bd83618 ffff8801000000d0 9a636da8fffff4a5 0000000000000000
  ffff8801850fd100 ffff88009ff47728 ffff88022bd83698 ffff8801850fd100
  ffff8801850fd6c8 ffff8801850fd6c8 0000000000000a80 ffff8801007d9400
Call Trace:
  <IRQ>
  [<ffffffff8155b515>] tcp_try_rmem_schedule+0x1c7/0x26d
  [<ffffffff8155b8ac>] tcp_data_queue+0x1a9/0xa7e
  [<ffffffff8155ec95>] tcp_rcv_established+0x63b/0x696
  [<ffffffff815668e7>] tcp_v4_do_rcv+0x1bd/0x37d
  [<ffffffff81568a97>] tcp_v4_rcv+0x4ed/0x7d7
  [<ffffffff81538790>] ? nf_hook_slow+0x102/0x113
  [<ffffffff81548c9c>] ? xfrm4_policy_check.clone.0+0x4f/0x4f
  [<ffffffff81548db8>] ip_local_deliver_finish+0x11c/0x199
  [<ffffffff81548c9c>] ? xfrm4_policy_check.clone.0+0x4f/0x4f
  [<ffffffff81548c9c>] ? xfrm4_policy_check.clone.0+0x4f/0x4f
  [<ffffffff81548e81>] NF_HOOK.clone.1+0x4c/0x53
  [<ffffffff81548ed6>] ip_local_deliver+0x4e/0x52
  [<ffffffff81548b46>] ip_rcv_finish+0x2da/0x2f2
  [<ffffffff8154886c>] ? inet_add_protocol+0x48/0x48
  [<ffffffff81548e81>] NF_HOOK.clone.1+0x4c/0x53
  [<ffffffff81549116>] ip_rcv+0x23c/0x26a
  [<ffffffff8150f632>] __netif_receive_skb_core+0x4e7/0x558
  [<ffffffff8150f6f1>] __netif_receive_skb+0x4e/0x5e
  [<ffffffff815118f7>] netif_receive_skb+0x5b/0x90
  [<ffffffffa027d04a>] ? ieee80211_data_to_8023+0x2eb/0x370 [cfg80211]
  [<ffffffff815ca611>] ? _raw_read_unlock+0x24/0x2f
  [<ffffffffa03cda4d>] ieee80211_deliver_skb+0xcd/0x108 [mac80211]
  [<ffffffffa03cf30d>] ieee80211_rx_handlers+0x1305/0x18c9 [mac80211]
  [<ffffffffa093b66e>] ? ath_txq_schedule+0x762/0x899 [ath9k]
  [<ffffffff81104823>] ? handle_irq_event+0x4c/0x61
  [<ffffffffa03d01cf>] ieee80211_prepare_and_rx_handle+0x8fe/0x96a [mac80211]
  [<ffffffffa03d09c4>] ieee80211_rx+0x6e9/0x759 [mac80211]
  [<ffffffff81307b1c>] ? swiotlb_map_page+0x67/0xbb
  [<ffffffffa0938f83>] ath_rx_tasklet+0xfce/0x10a7 [ath9k]
  [<ffffffffa09373b5>] ath9k_tasklet+0xf9/0x150 [ath9k]
  [<ffffffff8109d6d3>] tasklet_action+0x7d/0xcc
  [<ffffffff8109db2c>] __do_softirq+0x114/0x254
  [<ffffffff815ca525>] ? _raw_spin_unlock+0x24/0x2f
  [<ffffffff8109dcfe>] irq_exit+0x4b/0xa8
  [<ffffffff815d29dd>] do_IRQ+0x9d/0xb4
  [<ffffffff815caaad>] common_interrupt+0x6d/0x6d
  <EOI>
  [<ffffffff815d0e80>] ? sysret_audit+0x17/0x21
Code: 89 30 4d 89 75 08 ff 43 10 48 8b 75 c0 e8 30 d0 ff ff e9 ee 00 00 00 4d 8d 4d 28 44 89 e2 41 2b 51 18 45 8b 41 1c 89 55 b0 79 04 <0f> 0b eb fe 45 29 e0 45 
85 c0 7e 4d 45 39 f8 4c 89 f7 4c 89 4d
RIP  [<ffffffff8155ac89>] tcp_collapse+0x267/0x37a
  RSP <ffff88022bd83608>
---[ end trace 31987c0a8f390662 ]---


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: 3.9.5+:  Crash in tcp_input.c:4810.
  2013-06-17 18:17 ` Eric Dumazet
  2013-06-21 19:26   ` Ben Greear
@ 2013-07-01 18:10   ` Ben Greear
  2013-07-03  1:04     ` Eric Dumazet
  1 sibling, 1 reply; 16+ messages in thread
From: Ben Greear @ 2013-07-01 18:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On 06/17/2013 11:17 AM, Eric Dumazet wrote:
> On Mon, 2013-06-17 at 11:08 -0700, Ben Greear wrote:
>> This is from a 3.9.5+ kernel with local patches.  We saw this crash during
>> a weekend run where we had TCP traffic trying to run on 128+ wifi station
>> interfaces as the interfaces associated over and over again (the AP
>> could handle no more than 127 stations and would dis-associate others
>> when the 128th tried to associate).
>>
>> The code in question is this from the tcp_collapse() method:
>>
>> 		skb_reserve(nskb, header);
>> 		memcpy(nskb->head, skb->head, header);
>> 		memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
>> 		TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
>> 		__skb_queue_before(list, skb, nskb);
>> 		skb_set_owner_r(nskb, sk);
>>
>> 		/* Copy data, releasing collapsed skbs. */
>> 		while (copy > 0) {
>> 			int offset = start - TCP_SKB_CB(skb)->seq;
>> 			int size = TCP_SKB_CB(skb)->end_seq - start;
>>
>> 			BUG_ON(offset < 0);

Here's some more info on this.  This is from 3.9.8, plus some local patches,
plus the debugging patch below.

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a2f267a..63f7704 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4810,7 +4810,15 @@ restart:
                          int offset = start - TCP_SKB_CB(skb)->seq;
                          int size = TCP_SKB_CB(skb)->end_seq - start;

-                       BUG_ON(offset < 0);
+                       if (WARN_ON(offset < 0)) {
+                               /* We see a crash here (when using BUG_ON) every few days under
+                                * some torture tests.  I'm not sure how to clean this up properly,
+                                * so just return and hope thinks keep muddling through. --Ben
+                                */
+                               printk("offset: %i  start: %i seq: %i size: %i copy: %i\n",
+                                      offset, start, TCP_SKB_CB(skb)->seq, size, copy);
+                               return;
+                       }
                         if (size > 0) {
                                  size = min(copy, size);
                                  if (skb_copy_bits(skb, offset, skb_put(nskb, size), size))

There are some fairly nasty ath9k errors right before this in the logs, but
I am not certain they are the cause, since in previous cases where we saw the
tcp_collapse issue I did not see these errors.

wiphy1: start_sw_scan: running-other-vifs: 0  running-station-vifs: 138, associated-stations: 128 scanning current channel: 5745 MHz
sta303: authenticate with 4c:60:de:43:ae:d5
sta303: send auth to 4c:60:de:43:ae:d5 (try 1/3)
sta334: 4c:60:de:43:ae:d5 denied authentication (status 1)
sta303: 4c:60:de:43:ae:d5 denied authentication (status 1)
wiphy1: start_sw_scan: running-other-vifs: 0  running-station-vifs: 138, associated-stations: 128 scanning current channel: 5745 MHz
sta330: authenticate with 4c:60:de:43:ae:d5
sta330: send auth to 4c:60:de:43:ae:d5 (try 1/3)
sta323: authenticate with 4c:60:de:43:ae:d5
sta323: send auth to 4c:60:de:43:ae:d5 (try 1/3)
sta316: authenticate with 4c:60:de:43:ae:d5
sta316: send auth to 4c:60:de:43:ae:d5 (try 1/3)
sta316: 4c:60:de:43:ae:d5 denied authentication (status 1)
sta323: 4c:60:de:43:ae:d5 denied authentication (status 1)
sta330: 4c:60:de:43:ae:d5 denied authentication (status 1)
ath: wiphy1: soft tx hang: queue: 2 pending-frames: 124, resetting chip
ath: wiphy1: Pending frames still exist on txq: 2 after drain: 124  axq-depth: 0  ampdu-depth: 0
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux-3.9.dev.y/net/ipv4/tcp_input.c:4813 tcp_collapse+0x289/0x3bf()
Hardware name: To be filled by O.E.M.
Modules linked in: nfsv3 nfs_acl nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat 8021q garp stp mrp llc fuse macvlan wanlink(O) pktgen lockd sunrpc f71882fg 
ath9k ath9k_common ath9k_hw ath e1000e mac80211 snd_hda_codec_realtek coretemp hwmon cfg80211 mperf snd_hda_intel intel_powerclamp snd_hda_codec snd_hwdep 
snd_seq snd_seq_device snd_pcm snd_page_alloc snd_timer snd kvm iTCO_wdt cdc_acm ptp gpio_ich pps_core soundcore iTCO_vendor_support ppdev lpc_ich parport_pc 
pcspkr i2c_i801 microcode serio_raw parport uinput ipv6 i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: iptable_nat]
Pid: 10149, comm: btserver Tainted: G        WC O 3.9.8+ #97
Call Trace:
  <IRQ>  [<ffffffff81096291>] warn_slowpath_common+0x85/0x9f
  [<ffffffff810962c5>] warn_slowpath_null+0x1a/0x1c
  [<ffffffff8155af57>] tcp_collapse+0x289/0x3bf
  [<ffffffff8155b806>] tcp_try_rmem_schedule+0x1c7/0x26d
  [<ffffffff8155bb9d>] tcp_data_queue+0x1a9/0xa7e
  [<ffffffff8155efbf>] tcp_rcv_established+0x63b/0x696
  [<ffffffff81566c1f>] tcp_v4_do_rcv+0x1bd/0x37d
  [<ffffffff81568dcf>] tcp_v4_rcv+0x4ed/0x7d7
  [<ffffffff815389c4>] ? nf_hook_slow+0x102/0x113
  [<ffffffff81548f38>] ? xfrm4_policy_check.clone.0+0x4f/0x4f
  [<ffffffff81549054>] ip_local_deliver_finish+0x11c/0x199
  [<ffffffff81548f38>] ? xfrm4_policy_check.clone.0+0x4f/0x4f
  [<ffffffff81548f38>] ? xfrm4_policy_check.clone.0+0x4f/0x4f
  [<ffffffff8154911d>] NF_HOOK.clone.1+0x4c/0x53
  [<ffffffff81549172>] ip_local_deliver+0x4e/0x52
  [<ffffffff81548de2>] ip_rcv_finish+0x2da/0x2f2
  [<ffffffff81548b08>] ? inet_add_protocol+0x48/0x48
  [<ffffffff8154911d>] NF_HOOK.clone.1+0x4c/0x53
  [<ffffffff815493b2>] ip_rcv+0x23c/0x26a
  [<ffffffff8150f80e>] __netif_receive_skb_core+0x4e7/0x558
  [<ffffffff8150f8cd>] __netif_receive_skb+0x4e/0x5e
  [<ffffffff81511ad3>] netif_receive_skb+0x5b/0x90
  [<ffffffffa05a104a>] ? ieee80211_data_to_8023+0x2eb/0x370 [cfg80211]
  [<ffffffff815cab71>] ? _raw_read_unlock+0x24/0x2f
  [<ffffffffa0712a4d>] ieee80211_deliver_skb+0xcd/0x108 [mac80211]
  [<ffffffffa071430d>] ieee80211_rx_handlers+0x1305/0x18c9 [mac80211]
  [<ffffffff8109dce0>] ? local_bh_enable_ip+0xe/0x10
  [<ffffffff810a4cac>] ? del_timer+0x46/0x52
  [<ffffffffa07151cf>] ieee80211_prepare_and_rx_handle+0x8fe/0x96a [mac80211]
  [<ffffffffa07159c4>] ieee80211_rx+0x6e9/0x759 [mac80211]
  [<ffffffff81307762>] ? swiotlb_tbl_unmap_single+0xc4/0xc9
  [<ffffffffa09d7f6f>] ath_rx_tasklet+0xfce/0x10a7 [ath9k]
  [<ffffffffa09d63a1>] ath9k_tasklet+0xf9/0x150 [ath9k]
  [<ffffffff8109d5af>] tasklet_action+0x7d/0xcc
  [<ffffffff8109da08>] __do_softirq+0x114/0x254
  [<ffffffff815caa85>] ? _raw_spin_unlock+0x24/0x2f
  [<ffffffff8109dbda>] irq_exit+0x4b/0xa8
  [<ffffffff815d2f1d>] do_IRQ+0x9d/0xb4
  [<ffffffff815cafed>] common_interrupt+0x6d/0x6d
  <EOI>  [<ffffffff815d13c0>] ? sysret_audit+0x17/0x21
---[ end trace 32d17d795371ef40 ]---
offset: -1459  start: -1146162927 seq: -1146161468 size: 16047 copy: 3576
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux-3.9.dev.y/net/ipv4/tcp_input.c:4813 tcp_collapse+0x289/0x3bf()
Hardware name: To be filled by O.E.M.
Modules linked in: nfsv3 nfs_acl nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat 8021q garp stp mrp llc fuse macvlan wanlink(O) pktgen lockd sunrpc f71882fg 
ath9k ath9k_common ath9k_hw ath e1000e mac80211 snd_hda_codec_realtek coretemp hwmon cfg80211 mperf snd_hda_intel intel_powerclamp snd_hda_codec snd_hwdep 
snd_seq snd_seq_device snd_pcm snd_page_alloc snd_timer snd kvm iTCO_wdt cdc_acm ptp gpio_ich pps_core soundcore iTCO_vendor_support ppdev lpc_ich parport_pc 
pcspkr i2c_i801 microcode serio_raw parport uinput ipv6 i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: iptable_nat]
Pid: 10149, comm: btserver Tainted: G        WC O 3.9.8+ #97
Call Trace:
  <IRQ>  [<ffffffff81096291>] warn_slowpath_common+0x85/0x9f
  [<ffffffff810962c5>] warn_slowpath_null+0x1a/0x1c
  [<ffffffff8155af57>] tcp_collapse+0x289/0x3bf
  [<ffffffff8155b806>] tcp_try_rmem_schedule+0x1c7/0x26d
  [<ffffffff8155bfe9>] tcp_data_queue+0x5f5/0xa7e
  [<ffffffff8155efbf>] tcp_rcv_established+0x63b/0x696
  [<ffffffff81566c1f>] tcp_v4_do_rcv+0x1bd/0x37d
  [<ffffffff81568dcf>] tcp_v4_rcv+0x4ed/0x7d7
  [<ffffffff815389c4>] ? nf_hook_slow+0x102/0x113
  [<ffffffff81548f38>] ? xfrm4_policy_check.clone.0+0x4f/0x4f
  [<ffffffff81549054>] ip_local_deliver_finish+0x11c/0x199
  [<ffffffff81548f38>] ? xfrm4_policy_check.clone.0+0x4f/0x4f
  [<ffffffff81548f38>] ? xfrm4_policy_check.clone.0+0x4f/0x4f
  [<ffffffff8154911d>] NF_HOOK.clone.1+0x4c/0x53
  [<ffffffff81549172>] ip_local_deliver+0x4e/0x52
  [<ffffffff81548de2>] ip_rcv_finish+0x2da/0x2f2
  [<ffffffff81548b08>] ? inet_add_protocol+0x48/0x48
  [<ffffffff8154911d>] NF_HOOK.clone.1+0x4c/0x53
  [<ffffffff815493b2>] ip_rcv+0x23c/0x26a
  [<ffffffff8150f80e>] __netif_receive_skb_core+0x4e7/0x558
  [<ffffffff8150f8cd>] __netif_receive_skb+0x4e/0x5e
  [<ffffffff81511ad3>] netif_receive_skb+0x5b/0x90
  [<ffffffffa05a104a>] ? ieee80211_data_to_8023+0x2eb/0x370 [cfg80211]
  [<ffffffff815cab71>] ? _raw_read_unlock+0x24/0x2f
  [<ffffffffa0712a4d>] ieee80211_deliver_skb+0xcd/0x108 [mac80211]
  [<ffffffffa071430d>] ieee80211_rx_handlers+0x1305/0x18c9 [mac80211]
  [<ffffffff810a4cac>] ? del_timer+0x46/0x52
  [<ffffffffa07151cf>] ieee80211_prepare_and_rx_handle+0x8fe/0x96a [mac80211]
  [<ffffffffa07159c4>] ieee80211_rx+0x6e9/0x759 [mac80211]
  [<ffffffff81307762>] ? swiotlb_tbl_unmap_single+0xc4/0xc9
  [<ffffffffa09d7f6f>] ath_rx_tasklet+0xfce/0x10a7 [ath9k]
  [<ffffffffa09d63a1>] ath9k_tasklet+0xf9/0x150 [ath9k]
  [<ffffffff8109d5af>] tasklet_action+0x7d/0xcc
  [<ffffffff8109da08>] __do_softirq+0x114/0x254
  [<ffffffff815caa85>] ? _raw_spin_unlock+0x24/0x2f
  [<ffffffff8109dbda>] irq_exit+0x4b/0xa8
  [<ffffffff815d2f1d>] do_IRQ+0x9d/0xb4
  [<ffffffff815cafed>] common_interrupt+0x6d/0x6d
  <EOI>  [<ffffffff815d13c0>] ? sysret_audit+0x17/0x21
---[ end trace 32d17d795371ef41 ]---
offset: -1459  start: -1146162927 seq: -1146161468 size: 16047 copy: 3576
...
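
Because the debugging `printk()` formats the u32 sequence numbers with `%i`, `start` and `seq` show up as large negative integers. A minimal sketch of the arithmetic behind the logged line follows; the helper names `seq_from_printed()` and `seq_delta()` are hypothetical and exist only for illustration:

```c
#include <stdint.h>

/* Hypothetical helper: undo the %i formatting of a u32 sequence number.
 * Converting a negative value to uint32_t reduces it modulo 2^32. */
static uint32_t seq_from_printed(int32_t printed)
{
	return (uint32_t)printed;
}

/* Same arithmetic as `int offset = start - TCP_SKB_CB(skb)->seq;`:
 * the unsigned subtraction wraps modulo 2^32 and the result is then
 * viewed as a signed 32-bit value. */
static int32_t seq_delta(uint32_t a, uint32_t b)
{
	uint32_t d = a - b;

	if (d >= 0x80000000u)
		return (int32_t)((int64_t)d - 4294967296LL);
	return (int32_t)d;
}
```

With the logged values, `start` is 3148804369 and `seq` is 3148805828 as unsigned numbers, and `seq_delta(start, seq)` gives -1459, matching the logged `offset`. In other words, the skb being examined begins 1459 bytes of sequence space past `start`.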

There were 80 total splats of this nature grouped together, and then
the system recovered and continued to function normally as far as I
can tell.  The later splats are a bit farther apart...maybe the
TCP connection is dying.

It appears my 'work-around' is poor at best, but I'd rather kill
a TCP connection and spam the logs than crash the OS.

I'd be more than happy to add more/different debugging code.

We will also attempt to run the same test on an un-patched upstream
kernel in case the bug is in local patches.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: 3.9.5+:  Crash in tcp_input.c:4810.
  2013-07-01 18:10   ` Ben Greear
@ 2013-07-03  1:04     ` Eric Dumazet
  2013-07-03  3:21       ` Ben Greear
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2013-07-03  1:04 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev

On Mon, 2013-07-01 at 11:10 -0700, Ben Greear wrote:

> offset: -1459  start: -1146162927 seq: -1146161468 size: 16047 copy: 3576
> ...
> 
> There were 80 total splats of this nature grouped together, and then
> the system recovered and continued to function normally as far as I
> can tell.  The later splats are a bit farther apart...maybe the
> TCP connection is dying.
> 
> It appears my 'work-around' is poor at best, but I'd rather kill
> a TCP connection and spam the logs than crash the OS.
> 
> I'd be more than happy to add more/different debugging code.

It would be nice to pinpoint the origin of the bug. Really.

This BUG_ON() is at least 7 years old. I do not think the invariant has
changed?

Sure, we can avoid crashes, but it looks like we could randomly corrupt
TCP payload or whatever kernel memory, if it turns out it's caused by a
buggy driver.

Is it happening while collapsing the receive queue, or the ofo queue?

In the receive queue, all skbs skb2 following skb1 must have

TCP_SKB_CB(skb1)->end_seq >= TCP_SKB_CB(skb2)->seq

Only on the ofo queue could this not be respected, and that should be
handled properly in tcp_collapse_ofo_queue()
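
The invariant above can be sketched as a user-space check over a simplified queue. This is only an illustration: `struct seg`, `seq_before()` and `first_gap()` are hypothetical names, and the wraparound-safe comparison is written in the spirit of the kernel's `before()` helper rather than copied from it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the seq/end_seq pair kept in TCP_SKB_CB(). */
struct seg {
	uint32_t seq;
	uint32_t end_seq;
};

/* Wraparound-safe "a comes before b" in 32-bit sequence space. */
static int seq_before(uint32_t a, uint32_t b)
{
	return (uint32_t)(a - b) >= 0x80000000u;
}

/* Returns the index of the first segment whose seq lies beyond the
 * previous segment's end_seq (a hole), or -1 if every neighbouring
 * pair satisfies prev.end_seq >= next.seq. */
static long first_gap(const struct seg *q, size_t n)
{
	assert(q != NULL || n == 0);

	for (size_t i = 1; i < n; i++) {
		if (seq_before(q[i - 1].end_seq, q[i].seq))
			return (long)i;
	}
	return -1;
}
```

A queue such as `{100,200}, {200,300}, {301,400}` would be reported as having a hole at index 2, which is the kind of layout that makes `start - seq` negative in the copy loop.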

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 28af45a..d77f1f0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4457,7 +4457,12 @@ restart:
 			int offset = start - TCP_SKB_CB(skb)->seq;
 			int size = TCP_SKB_CB(skb)->end_seq - start;
 
-			BUG_ON(offset < 0);
+			if (unlikely(offset < 0)) {
+				pr_err("tcp_collapse() bug on %s offset:%d size:%d copy:%d skb->len %u truesize %u, nskb->len %u\n",
+					list == &sk->sk_receive_queue ? "receive_queue" : "ofo_queue",
+					offset, size, copy, skb->len, skb->truesize, nskb->len);
+				return;
+			}
 			if (size > 0) {
 				size = min(copy, size);
 				if (skb_copy_bits(skb, offset, skb_put(nskb, size), size))


* Re: 3.9.5+:  Crash in tcp_input.c:4810.
  2013-07-03  1:04     ` Eric Dumazet
@ 2013-07-03  3:21       ` Ben Greear
  2013-07-03  4:41         ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Greear @ 2013-07-03  3:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On 07/02/2013 06:04 PM, Eric Dumazet wrote:
> On Mon, 2013-07-01 at 11:10 -0700, Ben Greear wrote:
>
>> offset: -1459  start: -1146162927 seq: -1146161468 size: 16047 copy: 3576
>> ...
>>
>> There were 80 total splats of this nature grouped together, and then
>> the system recovered and continued to function normally as far as I
>> can tell.  The later splats are a bit farther apart...maybe the
>> TCP connection is dying.
>>
>> It appears my 'work-around' is poor at best, but I'd rather kill
>> a TCP connection and spam the logs than crash the OS.
>>
>> I'd be more than happy to add more/different debugging code.
>
> It would be nice to pinpoint the origin of the bug. Really.
>
> This BUG_ON() is at least 7 years old. I do not think invariant has
> changed ?
>
> Sure we can avoid crashes but it looks like we could randomly corrupt
> tcp payload or whatever kernel memory, if it turns out its caused by a
> buggy driver.
>
> Is it happening while collapsing the receive queue, or the ofo queue ?

What kinds of things could a driver do to cause this?  Maybe modify an
skb after it has sent it up the stack, or something like that?

We haven't been able to reproduce on a clean 3.10 yet...but it often takes days,
so we'll leave the test up through end of this week if we don't hit it
sooner...

I'll add your patch to my 3.9 tree.

Thanks,
Ben


> In receive queue, all skbs skb2 following skb1 must have
>
> TCP_SKB_CB(skb1)->end_seq >= TCP_SKB_CB(skb2)->seq
>
> Only on ofo, we could have this not respected, and it should be handled
> properly in tcp_collapse_ofo_queue()
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 28af45a..d77f1f0 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4457,7 +4457,12 @@ restart:
>   			int offset = start - TCP_SKB_CB(skb)->seq;
>   			int size = TCP_SKB_CB(skb)->end_seq - start;
>
> -			BUG_ON(offset < 0);
> +			if (unlikely(offset < 0)) {
> +				pr_err("tcp_collapse() bug on %s offset:%d size:%d copy:%d skb->len %u truesize %u, nskb->len %u\n",
> +					list == &sk->sk_receive_queue ? "receive_queue" : "ofo_queue",
> +					offset, size, copy, skb->len, skb->truesize, nskb->len);
> +				return;
> +			}
>   			if (size > 0) {
>   				size = min(copy, size);
>   				if (skb_copy_bits(skb, offset, skb_put(nskb, size), size))
>


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: 3.9.5+:  Crash in tcp_input.c:4810.
  2013-07-03  3:21       ` Ben Greear
@ 2013-07-03  4:41         ` Eric Dumazet
  2013-07-03  4:49           ` Ben Greear
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2013-07-03  4:41 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev

On Tue, 2013-07-02 at 20:21 -0700, Ben Greear wrote:

> What kinds of things could a driver do to cause this.  Maybe modify an
> skb after it has sent it up the stack, or something like that?
> 

It might be a genuine TCP bug, who knows...


> We haven't been able to reproduce on a clean 3.10 yet...but it often takes days,
> so we'll leave the test up through end of this week if we don't hit it
> sooner...

TCP collapse is/should be really rare, but losses can of course trigger
it.

> 
> I'll add your patch to my 3.9 tree.

Thanks


* Re: 3.9.5+:  Crash in tcp_input.c:4810.
  2013-07-03  4:41         ` Eric Dumazet
@ 2013-07-03  4:49           ` Ben Greear
  2013-07-03  5:02             ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Greear @ 2013-07-03  4:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On 07/02/2013 09:41 PM, Eric Dumazet wrote:
> On Tue, 2013-07-02 at 20:21 -0700, Ben Greear wrote:
>
>> What kinds of things could a driver do to cause this.  Maybe modify an
>> skb after it has sent it up the stack, or something like that?
>>
>
> It might be a genuine TCP bug, who knows...
>
>
>> We haven't been able to reproduce on a clean 3.10 yet...but it often takes days,
>> so we'll leave the test up through end of this week if we don't hit it
>> sooner...
>
> TCP collapse is/should be really rare, but losses can of course trigger
> it.

Well, network emulators are easy to come by in the office....  Maybe running
a bunch of TCP connections through a lossy network would exercise this code
path a bit?  Aside from random pkt loss, any other types of network conditions
that might help trigger this faster?

I'll set up some tests using some wired ethernet...if we can trigger it there
then we at least know it doesn't depend on ath9k...

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: 3.9.5+:  Crash in tcp_input.c:4810.
  2013-07-03  4:49           ` Ben Greear
@ 2013-07-03  5:02             ` Eric Dumazet
  2013-07-08 17:23               ` Ben Greear
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2013-07-03  5:02 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev

On Tue, 2013-07-02 at 21:49 -0700, Ben Greear wrote:

> Well, network emulators are easy to come by in the office....  Maybe running
> a bunch of TCP connections through a lossy network would exercise this code
> path a bit?  Aside from random pkt loss, any other types of network conditions
> that might help trigger this faster?
> 
> I'll set up some tests using some wired ethernet...if we can trigger it there
> then we at least know it doesn't depend on ath9k...

I tried a lot of things, including netem with many reorders and/or
packetdrill tests, but so far not a single warning from tcp_collapse()


* Re: 3.9.5+:  Crash in tcp_input.c:4810.
  2013-07-03  5:02             ` Eric Dumazet
@ 2013-07-08 17:23               ` Ben Greear
  2013-07-08 18:21                 ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Greear @ 2013-07-08 17:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On 07/02/2013 10:02 PM, Eric Dumazet wrote:
> On Tue, 2013-07-02 at 21:49 -0700, Ben Greear wrote:
>
>> Well, network emulators are easy to come by in the office....  Maybe running
>> a bunch of TCP connections through a lossy network would exercise this code
>> path a bit?  Aside from random pkt loss, any other types of network conditions
>> that might help trigger this faster?
>>
>> I'll set up some tests using some wired ethernet...if we can trigger it there
>> then we at least know it doesn't depend on ath9k...
>
> I tried a lot of things, including netem with many reorders and/or
> packetdrill tests, but so far not a single warning from tcp_collapse()

We ran a 5+ day test using un-modified 3.10 kernel and did not trigger
the bug.

So, I'm guessing the problem is either fixed upstream or is exacerbated
or caused by our local patches.  Sometime soon we'll start porting local
patches to newer kernels...we'll see what happens then.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: 3.9.5+:  Crash in tcp_input.c:4810.
  2013-07-08 17:23               ` Ben Greear
@ 2013-07-08 18:21                 ` Eric Dumazet
  2013-07-08 18:30                   ` Ben Greear
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2013-07-08 18:21 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev

On Mon, 2013-07-08 at 10:23 -0700, Ben Greear wrote:

> We ran a 5+ day test using un-modified 3.10 kernel and did not trigger
> the bug.

Using wired ethernet only, or any kind of adapters, including ath9k ?

> So, I'm guessing the problem is either fixed upstream or is exacerbated
> or caused by our local patches.  Sometime soon we'll start porting local
> patches to newer kernels...we'll see what happens then.

Thanks


* Re: 3.9.5+:  Crash in tcp_input.c:4810.
  2013-07-08 18:21                 ` Eric Dumazet
@ 2013-07-08 18:30                   ` Ben Greear
  2013-07-08 19:01                     ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Greear @ 2013-07-08 18:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On 07/08/2013 11:21 AM, Eric Dumazet wrote:
> On Mon, 2013-07-08 at 10:23 -0700, Ben Greear wrote:
>
>> We ran a 5+ day test using un-modified 3.10 kernel and did not trigger
>> the bug.
>
> Using wired ethernet only, or any kind of adapters, including ath9k ?

Exact same hardware and configuration:

ath9k, around 240 wifi
stations trying to connect to APs that can handle a bit less
than 240 total, starting TCP traffic when stations are connected.
It appears that the constant churn of stations going up and down
is key, but of course that is par for the course, especially in
the wifi stack.

Some of our local wifi patches make the system work considerably faster when
we have hundreds of wifi stations, so timing will be different on upstream
kernels, and of course we could have bugs :)

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: 3.9.5+:  Crash in tcp_input.c:4810.
  2013-07-08 18:30                   ` Ben Greear
@ 2013-07-08 19:01                     ` Eric Dumazet
  2013-07-08 19:59                       ` Ben Greear
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2013-07-08 19:01 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev

On Mon, 2013-07-08 at 11:30 -0700, Ben Greear wrote:
> On 07/08/2013 11:21 AM, Eric Dumazet wrote:
> > On Mon, 2013-07-08 at 10:23 -0700, Ben Greear wrote:
> >
> >> We ran a 5+ day test using un-modified 3.10 kernel and did not trigger
> >> the bug.
> >
> > Using wired ethernet only, or any kind of adapters, including ath9k ?
> 
> Exact same hardware and configuration:
> 
> ath9k, around 240 wifi
> stations trying to connect to APs that can handle a bit less
> than 240 total, starting TCP traffic when stations are connected.
> It appears that the constant churn of stations going up and down
> is key, but of course that is par for the course, especially in
> the wifi stack.
> 
> Some of our local wifi patches make the system work considerably faster when
> we have hundreds of wifi stations, so timing will be different on upstream
> kernels, and of course we could have bugs :)

There is this code in ath9k that aggregates two frags.

drivers/net/wireless/ath/ath9k/recv.c line 1298 contains:

RX_STAT_INC(rx_frags);

Could you check these stats (I do not know if they are reported by
ethtool -S or another debugging facility) and see whether rx_frags
ever increases?


* Re: 3.9.5+:  Crash in tcp_input.c:4810.
  2013-07-08 19:01                     ` Eric Dumazet
@ 2013-07-08 19:59                       ` Ben Greear
  2013-07-08 20:10                         ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Greear @ 2013-07-08 19:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On 07/08/2013 12:01 PM, Eric Dumazet wrote:
> On Mon, 2013-07-08 at 11:30 -0700, Ben Greear wrote:
>> On 07/08/2013 11:21 AM, Eric Dumazet wrote:
>>> On Mon, 2013-07-08 at 10:23 -0700, Ben Greear wrote:
>>>
>>>> We ran a 5+ day test using un-modified 3.10 kernel and did not trigger
>>>> the bug.
>>>
>>> Using wired ethernet only, or any kind of adapters, including ath9k ?
>>
>> Exact same hardware and configuration:
>>
>> ath9k, around 240 wifi
>> stations trying to connect to APs that can handle a bit less
>> than 240 total, starting TCP traffic when stations are connected.
>> It appears that the constant churn of stations going up and down
>> is key, but of course that is par for the course, especially in
>> the wifi stack.
>>
>> Some of our local wifi patches make the system work considerably faster when
>> we have hundreds of wifi stations, so timing will be different on upstream
>> kernels, and of course we could have bugs :)
>
> There is this thing in ath9k about aggregating two frags
>
> drivers/net/wireless/ath/ath9k/recv.c line 1298 contains :
>
> RX_STAT_INC(rx_frags);
>
> Could you check these stats (I do not know if they are reported by
> ethtool -S or another debugging facility) and check if rx_frags is ever
> increasing ?

They are in debugfs, and they appear to increase fairly often, for
instance:

[root@lec2010-ath9k-1 lanforge]# cat /debug/ieee80211/wiphy0/ath9k/recv|tail -5
            RX-Pkts-All :  288009442
           RX-Bytes-All : 4067932166
             RX-Beacons :   14826735
               RX-Frags :       3944
            RX-Spectral :          0

I don't have the stats from the system that reproduced the bug
(it has been rebooted), but if I do see the bug again, I'll
grab the rx-frags and other stats just in case it shows
some anomaly.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: 3.9.5+:  Crash in tcp_input.c:4810.
  2013-07-08 19:59                       ` Ben Greear
@ 2013-07-08 20:10                         ` Eric Dumazet
  2013-07-08 20:17                           ` Ben Greear
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2013-07-08 20:10 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev

On Mon, 2013-07-08 at 12:59 -0700, Ben Greear wrote:

> > There is this thing in ath9k about aggregating two frags
> >
> > drivers/net/wireless/ath/ath9k/recv.c line 1298 contains :
> >
> > RX_STAT_INC(rx_frags);
> >
> > Could you check these stats (I do not know if they are reported by
> > ethtool -S or another debugging facility) and check if rx_frags is ever
> > increasing ?
> 
> They are in debugfs, and they appear to increase fairly often, for
> instance:
> 
> [root@lec2010-ath9k-1 lanforge]# cat /debug/ieee80211/wiphy0/ath9k/recv|tail -5
>             RX-Pkts-All :  288009442
>            RX-Bytes-All : 4067932166
>              RX-Beacons :   14826735
>                RX-Frags :       3944
>             RX-Spectral :          0
> 
> I don't have the stats from the system that reproduced the bug
> (it has been rebooted), but if I do see the bug again, I'll
> grab the rx-frags and other stats just in case it shows
> some anomaly.
> 

Reading this code again, I believe the following patch is needed.

Could you test it?

Thanks

diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
index 8be2b5d..f642f04 100644
--- a/drivers/net/wireless/ath/ath9k/recv.c
+++ b/drivers/net/wireless/ath/ath9k/recv.c
@@ -1317,7 +1317,8 @@ int ath_rx_tasklet(struct ath_softc *sc, int flush, bool hp)
 		if (sc->rx.frag) {
 			int space = skb->len - skb_tailroom(hdr_skb);
 
-			if (pskb_expand_head(hdr_skb, 0, space, GFP_ATOMIC) < 0) {
+			if (space > 0 &&
+			    pskb_expand_head(hdr_skb, 0, space, GFP_ATOMIC) < 0) {
 				dev_kfree_skb(skb);
 				RX_STAT_INC(rx_oom_err);
 				goto requeue_drop_frag;
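
For illustration only, the guard in the patch above can be modelled in
user space like this. struct buf, buf_tailroom() and buf_append() are
hypothetical names, and realloc() merely stands in for the idea of
growing the head; this is a sketch of the "expand only when space > 0"
logic, not of how pskb_expand_head() behaves.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space stand-in for an skb's linear buffer. */
struct buf {
	unsigned char *data;
	size_t len; /* bytes in use */
	size_t cap; /* bytes allocated */
};

static size_t buf_tailroom(const struct buf *b)
{
	return b->cap - b->len;
}

/*
 * Append a fragment, growing the buffer only when the existing
 * tailroom is too small. Returns 0 on success, or -1 if the
 * allocation fails (the buffer is left unchanged in that case).
 */
static int buf_append(struct buf *b, const unsigned char *frag,
		      size_t frag_len)
{
	/* Signed on purpose: zero or negative means there is room. */
	long space = (long)frag_len - (long)buf_tailroom(b);

	if (space > 0) {
		unsigned char *p = realloc(b->data, b->cap + (size_t)space);

		if (!p)
			return -1;
		b->data = p;
		b->cap += (size_t)space;
	}
	memcpy(b->data + b->len, frag, frag_len);
	b->len += frag_len;
	return 0;
}
```

The point of the signed 'space' is the same as in the patch: when the
fragment already fits, the value is zero or negative and the resize
call is skipped rather than being handed a non-positive size.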


* Re: 3.9.5+:  Crash in tcp_input.c:4810.
  2013-07-08 20:10                         ` Eric Dumazet
@ 2013-07-08 20:17                           ` Ben Greear
  0 siblings, 0 replies; 16+ messages in thread
From: Ben Greear @ 2013-07-08 20:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-wireless

On 07/08/2013 01:10 PM, Eric Dumazet wrote:
> On Mon, 2013-07-08 at 12:59 -0700, Ben Greear wrote:
>
>>> There is this thing in ath9k about aggregating two frags
>>>
>>> drivers/net/wireless/ath/ath9k/recv.c line 1298 contains :
>>>
>>> RX_STAT_INC(rx_frags);
>>>
>>> Could you check these stats (I do not know if they are reported by
>>> ethtool -S or another debugging facility) and check if rx_frags is ever
>>> increasing ?
>>
>> They are in debugfs, and they appear to increase fairly often, for
>> instance:
>>
>> [root@lec2010-ath9k-1 lanforge]# cat /debug/ieee80211/wiphy0/ath9k/recv|tail -5
>>              RX-Pkts-All :  288009442
>>             RX-Bytes-All : 4067932166
>>               RX-Beacons :   14826735
>>                 RX-Frags :       3944
>>              RX-Spectral :          0
>>
>> I don't have the stats from the system that reproduced the bug
>> (it has been rebooted), but if I do see the bug again, I'll
>> grab the rx-frags and other stats just in case it shows
>> some anomaly.
>>
>
> Reading this code again, I believe following patch is needed.
>
> Could you test it ?

Sure, will do.  Adding the linux-wireless mailing list as well.

Thanks,
Ben

>
> Thanks
>
> diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
> index 8be2b5d..f642f04 100644
> --- a/drivers/net/wireless/ath/ath9k/recv.c
> +++ b/drivers/net/wireless/ath/ath9k/recv.c
> @@ -1317,7 +1317,8 @@ int ath_rx_tasklet(struct ath_softc *sc, int flush, bool hp)
>   		if (sc->rx.frag) {
>   			int space = skb->len - skb_tailroom(hdr_skb);
>
> -			if (pskb_expand_head(hdr_skb, 0, space, GFP_ATOMIC) < 0) {
> +			if (space > 0 &&
> +			    pskb_expand_head(hdr_skb, 0, space, GFP_ATOMIC) < 0) {
>   				dev_kfree_skb(skb);
>   				RX_STAT_INC(rx_oom_err);
>   				goto requeue_drop_frag;
>


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Thread overview: 16+ messages
2013-06-17 18:08 3.9.5+: Crash in tcp_input.c:4810 Ben Greear
2013-06-17 18:17 ` Eric Dumazet
2013-06-21 19:26   ` Ben Greear
2013-07-01 18:10   ` Ben Greear
2013-07-03  1:04     ` Eric Dumazet
2013-07-03  3:21       ` Ben Greear
2013-07-03  4:41         ` Eric Dumazet
2013-07-03  4:49           ` Ben Greear
2013-07-03  5:02             ` Eric Dumazet
2013-07-08 17:23               ` Ben Greear
2013-07-08 18:21                 ` Eric Dumazet
2013-07-08 18:30                   ` Ben Greear
2013-07-08 19:01                     ` Eric Dumazet
2013-07-08 19:59                       ` Ben Greear
2013-07-08 20:10                         ` Eric Dumazet
2013-07-08 20:17                           ` Ben Greear
