* Instability during large transfers
@ 2017-02-16 18:38 Samuel Holland
  2017-02-17  4:48 ` Samuel Holland
  0 siblings, 1 reply; 7+ messages in thread
From: Samuel Holland @ 2017-02-16 18:38 UTC (permalink / raw)
  To: wireguard

Hello,

Since I started using wireguard in August 2016, my main firewall has
been resetting every few weeks to a month. Since the switch from my
previous openvpn setup to wireguard coincided with some hardware
changes, I haven't been confident about the source of the crashes.

However, the last three crashes have all been directly connected to a large
file transfer, that is, to sustained (even if not high) bandwidth use. (For
comparison, this firewall can handle >400Mbps VPN traffic over a gigabit
WAN link.) The first was a `zfs send` job over SSH, around 15Mbps, where
the panic happened after around 18 hours. The second was a video
transfer over SSH, around 10Mbps, and the panic happened after about 30
minutes. The most recent was another `zfs send` over SSH, this time
40-50Mbps.

The transfer started at 2017-02-15 20:23:39. At 2017-02-16 01:32:13, the
firewall reset, and it came back up at 01:32:47. Through the magic of
wireguard, the SSH connection stayed intact. The transfer continued, and
my sshd logs continued to show key rotations until 04:43:55. This
coincides with the following warning from the firewall at 04:44:50:

[11547.285960] ------------[ cut here ]------------
[11547.285976] WARNING: CPU: 1 PID: 0 at kernel/workqueue.c:1440 __queue_work+0x1e0/0x450
[11547.285980] Modules linked in: cfg80211 rfkill wireguard(O) bonding mei_txe mei
[11547.285995] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O 4.7.10-hardened #1
[11547.285999] Hardware name: To be filled by O.E.M. To be filled by O.E.M./D180S/D190S/D290S Series, BIOS DB1F1P05_x64 12/23/2014
[11547.286003]  0000000000000000 ffffffff814b4d1f 0000000000000007 0000000000000000
[11547.286010]  0000000000000000 ffffffff810d8059 ffffe8ffffc07900 0000000000000004
[11547.286016]  0000000000000000 ffff880077a9b600 ffffe8ffffc07a38 00000000000105b0
[11547.286023] Call Trace:
[11547.286027]  <IRQ>  [<ffffffff814b4d1f>] ? dump_stack+0x47/0x68
[11547.286040]  [<ffffffff810d8059>] ? __warn+0xb9/0xe0
[11547.286046]  [<ffffffff810f1b00>] ? __queue_work+0x1e0/0x450
[11547.286052]  [<ffffffff810f1d80>] ? queue_work_on+0x10/0x20
[11547.286059]  [<ffffffff8118d788>] ? padata_do_parallel+0xe8/0x110
[11547.286071]  [<ffffffffa004f4c5>] ? packet_consume_data+0x4b5/0x760 [wireguard]
[11547.286080]  [<ffffffffa004fe90>] ? packet_send_queue+0x330/0x600 [wireguard]
[11547.286087]  [<ffffffff8195a2c9>] ? vlan_dev_hard_start_xmit+0x89/0x100
[11547.286093]  [<ffffffff818ce2b1>] ? fib_validate_source+0x101/0x470
[11547.286102]  [<ffffffffa0050a50>] ? packet_receive+0x440/0x470 [wireguard]
[11547.286110]  [<ffffffffa0050a66>] ? packet_receive+0x456/0x470 [wireguard]
[11547.286116]  [<ffffffff818bddeb>] ? udp_queue_rcv_skb+0x1fb/0x470
[11547.286122]  [<ffffffff818be487>] ? __udp4_lib_rcv+0x427/0x9b0
[11547.286129]  [<ffffffff8188f9d3>] ? ip_local_deliver_finish+0x63/0x1b0
[11547.286135]  [<ffffffff8188fca6>] ? ip_local_deliver+0x56/0xd0
[11547.286140]  [<ffffffff8188f970>] ? ip_rcv_finish+0x390/0x390
[11547.286146]  [<ffffffff8188ff64>] ? ip_rcv+0x244/0x370
[11547.286151]  [<ffffffff8188f5e0>] ? inet_del_offload+0x30/0x30
[11547.286158]  [<ffffffff81828089>] ? __netif_receive_skb_core+0x909/0xa20
[11547.286163]  [<ffffffff81955fd3>] ? br_handle_vlan+0xe3/0x180
[11547.286168]  [<ffffffff81955fd3>] ? br_handle_vlan+0xe3/0x180
[11547.286174]  [<ffffffff8182820a>] ? netif_receive_skb_internal+0x1a/0x80
[11547.286181]  [<ffffffff819495ce>] ? br_pass_frame_up+0x8e/0x130
[11547.286187]  [<ffffffff8110f645>] ? find_busiest_group+0xe5/0x950
[11547.286192]  [<ffffffff819562dd>] ? br_allowed_ingress+0x26d/0x390
[11547.286198]  [<ffffffff8194987f>] ? br_handle_frame_finish+0x20f/0x4f0
[11547.286205]  [<ffffffff81949cdf>] ? br_handle_frame+0x13f/0x2d0
[11547.286210]  [<ffffffff81827a7d>] ? __netif_receive_skb_core+0x2fd/0xa20
[11547.286216]  [<ffffffff818befca>] ? udp_gro_receive+0x4a/0x110
[11547.286222]  [<ffffffff818c7309>] ? inet_gro_receive+0x1b9/0x240
[11547.286227]  [<ffffffff8182820a>] ? netif_receive_skb_internal+0x1a/0x80
[11547.286232]  [<ffffffff818290f6>] ? napi_gro_receive+0xb6/0x100
[11547.286240]  [<ffffffff816bde99>] ? igb_poll+0x699/0xde0
[11547.286247]  [<ffffffff8112902d>] ? rcu_report_qs_rnp+0x11d/0x170
[11547.286252]  [<ffffffff81828969>] ? net_rx_action+0x219/0x350
[11547.286258]  [<ffffffff810dd2da>] ? __do_softirq+0xea/0x280
[11547.286263]  [<ffffffff810dd59c>] ? irq_exit+0x8c/0x90
[11547.286270]  [<ffffffff8108a2a9>] ? smp_apic_timer_interrupt+0x39/0x50
[11547.286277]  [<ffffffff8196b35b>] ? apic_timer_interrupt+0x8b/0x90
[11547.286280]  <EOI>  [<ffffffff817c03ec>] ? poll_idle+0xc/0x60
[11547.286291]  [<ffffffff817bff91>] ? cpuidle_enter_state+0xf1/0x2b0
[11547.286298]  [<ffffffff811171d9>] ? cpu_startup_entry+0x259/0x2e0
[11547.286304]  [<ffffffff8108839d>] ? start_secondary+0x1ad/0x270
[11547.286308] ---[ end trace 0439ccab88b5fb0e ]---

At that point, the VPN stopped responding, the SSH connection to the
server behind the firewall was broken, and the zfs send job failed.

When I woke up this morning, none of the firewall's wireguard peers were
able to ping or SSH into it or any machine behind it. However, they all
reported their last handshake as being less than the persistent keepalive
interval (25s) ago, including my laptop, which had been suspended all
night (so it was able to complete a new handshake from a new IP, but not
to transfer any data).
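
(For reference, those last-handshake times come from something like the
following on each peer; "wg0" here is a placeholder for the actual
interface name:)

$ wg show wg0 latest-handshakes
$ wg show wg0 transfer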

All machines involved are running the same kernel (Linux 4.7.10-hardened
from Gentoo's sys-kernel/hardened-sources package), and they all have
wireguard version 0.0.20170214 loaded. They had all been rebooted since
upgrading to that wireguard version, so within the last two days.

I will enable netconsole on the firewall now that I can hopefully
reproduce the panic, but since the networking setup there is rather
complicated (vlans on top of bridges) I'm not sure if I will get all of
the panic messages.
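
(A rough sketch of the netconsole setup I have in mind; the device name,
addresses, ports, and MAC below are placeholders for the actual
log-collecting host:)

$ modprobe netconsole netconsole=+6665@10.0.0.2/eth0,6666@10.0.0.1/aa:bb:cc:dd:ee:ff
  (the leading '+' requests extended console messages; the collector just
  needs something listening on UDP port 6666 and writing the output to a file)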

I have seen no wireguard-related warnings or panics on any other
machine, but this firewall machine has by far the most traffic and the
weakest CPU (http://ark.intel.com/products/78866).

Is there any more information I can provide to help resolve this?

Thanks,
Samuel


* Re: Instability during large transfers
  2017-02-16 18:38 Instability during large transfers Samuel Holland
@ 2017-02-17  4:48 ` Samuel Holland
  2017-02-17 13:36   ` Jason A. Donenfeld
  0 siblings, 1 reply; 7+ messages in thread
From: Samuel Holland @ 2017-02-17  4:48 UTC (permalink / raw)
  To: wireguard

[-- Attachment #1: Type: text/plain, Size: 729 bytes --]

On 02/16/17 12:38, Samuel Holland wrote:
> I will enable netconsole on the firewall now that I can hopefully
> reproduce the panic, but since the networking setup there is rather
> complicated (vlans on top of bridges) I'm not sure if I will get all
> of the panic messages.

Success! I reproduced the panic and got full logs.

I ran:
$ ssh <server> cat /dev/zero | dd of=/dev/null bs=4096 status=progress
108295966720 bytes (108 GB, 101 GiB) copied, 17437 s, 6.2 MB/s
packet_write_wait: Connection to <server> port 22: Broken pipe

26440386+4 records in
26440386+4 records out
108299829248 bytes (108 GB, 101 GiB) copied, 18413 s, 5.9 MB/s

Attached is the extended netconsole output from the firewall.

Hope this helps,
Samuel

[-- Attachment #2: wireguardpanic.gz --]
[-- Type: application/gzip, Size: 3740 bytes --]


* Re: Instability during large transfers
  2017-02-17  4:48 ` Samuel Holland
@ 2017-02-17 13:36   ` Jason A. Donenfeld
  2017-02-17 17:37     ` Samuel Holland
  2017-03-01 22:44     ` Samuel Holland
  0 siblings, 2 replies; 7+ messages in thread
From: Jason A. Donenfeld @ 2017-02-17 13:36 UTC (permalink / raw)
  To: Samuel Holland; +Cc: WireGuard mailing list

Hey Samuel,

Thanks very much for the excellent debugging output. I'll try to
reproduce this as well on my systems.

The stack trace does indicate that the OOPS is happening in padata,
not in wireguard, so I wonder if this is some bug caused either by
grsecurity or by something else that was then fixed, but since your
kernel is a bit old (4.7.10) maybe the fix didn't make it. In either
case, I'll try to reproduce on that kernel and on newer kernels and
will get back to you.

I presume you have most PaX options turned on?

Thanks,
Jason


* Re: Instability during large transfers
  2017-02-17 13:36   ` Jason A. Donenfeld
@ 2017-02-17 17:37     ` Samuel Holland
  2017-03-01 22:44     ` Samuel Holland
  1 sibling, 0 replies; 7+ messages in thread
From: Samuel Holland @ 2017-02-17 17:37 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list

[-- Attachment #1: Type: text/plain, Size: 869 bytes --]

Hello,

On 02/17/17 07:36, Jason A. Donenfeld wrote:
> The stack trace does indicate that the OOPS is happening in padata,
> not in wireguard, so I wonder if this is some bug caused either by
> grsecurity or by something else that was then fixed, but since your
> kernel is a bit old (4.7.10) maybe the fix didn't make it. In either
> case, I'll try to reproduce on that kernel and on newer kernels and
> will get back to you.

There do not appear to be any relevant changes to padata in the past few
years, and grsecurity doesn't look like it affects padata much, but that
doesn't rule it out:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/?qt=grep&q=padata
https://grsecurity.net/changelog-test.txt

> I presume you have most PaX options turned on?

Attached is my config.gz (it's the same on all machines).

> Thanks,
> Jason

Thanks,
Samuel


[-- Attachment #2: config-4.7.10-hardened.gz --]
[-- Type: application/gzip, Size: 26725 bytes --]


* Re: Instability during large transfers
  2017-02-17 13:36   ` Jason A. Donenfeld
  2017-02-17 17:37     ` Samuel Holland
@ 2017-03-01 22:44     ` Samuel Holland
  2017-03-21 15:06       ` Samuel Holland
  1 sibling, 1 reply; 7+ messages in thread
From: Samuel Holland @ 2017-03-01 22:44 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list

[-- Attachment #1: Type: text/plain, Size: 1811 bytes --]

On 02/17/17 07:36, Jason A. Donenfeld wrote:
> Thanks very much for the excellent debugging output. I'll try to
> reproduce this as well on my systems.

I assume you have not been able to reproduce this issue.

> The stack trace does indicate that the OOPS is happening in padata,
> not in wireguard, so I wonder if this is some bug caused either by
> grsecurity or by something else that was then fixed, but since your
> kernel is a bit old (4.7.10) maybe the fix didn't make it. In either
> case, I'll try to reproduce on that kernel and on newer kernels and
> will get back to you.
>
> I presume you have most PaX options turned on?

Since this is on 4.7.10 (that is, pre-4.9), this is not related to the
other bug recently reported.

I have disabled all grsecurity/PaX options in my kernel config
(attached) and was able to trigger the bug again. This is with WireGuard
commit f97b7e34bda436ac4572697a8770837eec7470b6 and debugging enabled.
Again attached is the dmesg.

I used the same `ssh <server> cat /dev/zero | dd of=/dev/null` pipeline as before. This time
I got "192656101376 bytes (193 GB, 179 GiB) copied, 41643 s, 4.6 MB/s"
before the connection was broken.

Interestingly, when the firewall came back up, I again had the issue
where devices were continuing to handshake, but no data went through
(and I could confirm this with the wireguard debug output in dmesg).
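
(For reference, one way to toggle that wireguard debug output at runtime,
assuming the kernel has CONFIG_DYNAMIC_DEBUG enabled:)

$ echo 'module wireguard +p' > /sys/kernel/debug/dynamic_debug/control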

I was unable to reproduce this issue with a spare laptop (ThinkPad
X220), even after leaving it running for about three days. Since the
router has a rather weak Atom CPU (http://ark.intel.com/products/78866),
I suspect a race condition exposed by the higher load may be involved.

Is there anything else I can do to debug this? Enable some kernel
debugging option? Try a vanilla kernel? Try a newer kernel?
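
(For concreteness, these are the sorts of options I have in mind; this is
just a sketch of candidates, not a list I have already tested:)

CONFIG_DEBUG_LIST=y
CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_WORK=y
CONFIG_DEBUG_OBJECTS_TIMERS=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_PAGEALLOC=y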

> Thanks, Jason

Thanks,
Samuel Holland

[-- Attachment #2: panic_grsec_disabled.config.gz --]
[-- Type: application/gzip, Size: 25965 bytes --]

[-- Attachment #3: panic_grsec_disabled.dmesg.gz --]
[-- Type: application/gzip, Size: 46450 bytes --]


* Re: Instability during large transfers
  2017-03-01 22:44     ` Samuel Holland
@ 2017-03-21 15:06       ` Samuel Holland
  2017-03-21 17:33         ` Jason A. Donenfeld
  0 siblings, 1 reply; 7+ messages in thread
From: Samuel Holland @ 2017-03-21 15:06 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list

[-- Attachment #1: Type: text/plain, Size: 1780 bytes --]

Hello,

On 03/01/17 16:44, Samuel Holland wrote:
> Is there anything else I can do to debug this? Enable some kernel
> debugging option? Try a vanilla kernel? Try a newer kernel?

I've upgraded to vanilla (kernel.org) Linux 4.9.15, and while I've been
unable to trigger the same panic so far, I'm getting a similar linked
list corruption warning. This time, instead of failing to pass data once
the warning hits, some packets are coming in out of order with 3000-5000
ms latency. Attached are the dmesg output, kernel config, and ping logs
from inside the VPN. The actual link latency between peers is around 10 ms.
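
(The VPN ping logs were gathered with something along these lines; the
peer address is a placeholder for the actual VPN-internal IP:)

$ ping -D <vpn-peer-ip> | tee vpn_ping.log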

As I still haven't been able to reproduce this issue on another machine,
I cannot rule out hardware issues, but this exact machine worked fine
for a year straight (2x 6 months uptime) using OpenVPN. So I'm looking
at either WireGuard or padata as the cause.

The kernel has linked list debugging turned on.

--------

I upgraded again to 4.9.16, and enabled every debug and self-test option
that I could find that seemed relevant. The stack traces and symptoms
still look similar, except that now, when it actually panics (instead of
just warning), I don't get a log; the netconsole output just suddenly
cuts over to the next boot.

You can also see how it stops sending keepalive packets once the
warnings hit.

At this point, unless you want me to try something, I'm going to stop
sending logs. I'm not very familiar with the kernel's
timer/workqueue/crypto subsystems, so there's not much debugging I can
do on my own until I have time to do research (and I have no idea when
that will be). But if you can't reproduce the issue, then it will be
rather difficult for you to debug as well.

I'd still like to get this fixed, and I'll gladly help any way I can.

Thanks,
Samuel Holland

[-- Attachment #2: warning_4.9.15.config.xz --]
[-- Type: application/x-xz, Size: 23520 bytes --]

[-- Attachment #3: warning_4.9.15.dmesg.xz --]
[-- Type: application/x-xz, Size: 18196 bytes --]

[-- Attachment #4: warning_4.9.15.dmesg_secondboot.xz --]
[-- Type: application/x-xz, Size: 3424 bytes --]

[-- Attachment #5: warning_4.9.15.vpn_ping.xz --]
[-- Type: application/x-xz, Size: 1708 bytes --]

[-- Attachment #6: warning_4.9.16.dmesg.xz --]
[-- Type: application/x-xz, Size: 8904 bytes --]


* Re: Instability during large transfers
  2017-03-21 15:06       ` Samuel Holland
@ 2017-03-21 17:33         ` Jason A. Donenfeld
  0 siblings, 0 replies; 7+ messages in thread
From: Jason A. Donenfeld @ 2017-03-21 17:33 UTC (permalink / raw)
  To: Samuel Holland; +Cc: WireGuard mailing list

Hey Samuel,

Thanks for sticking with it... This is a super hard bug to track down!
I wish I could reproduce it.

Can you try two odd things, separately, and report back?

1. Compile with DEBUG_ATOMIC_SLEEP. Do you get any additional warnings?
2. Hack kernel/sched/Makefile to remove "-fno-omit-frame-pointer" and
see if the backtraces you get are a bit more coherent (a rough sketch of
both follows below).
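
(A rough sketch of both, assuming a build from a kernel source tree; the
sed invocation is just one way to drop that flag:)

$ ./scripts/config --enable DEBUG_ATOMIC_SLEEP
$ sed -i 's/ -fno-omit-frame-pointer//' kernel/sched/Makefile
$ make olddefconfig && make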

Feel free to find me on IRC.

Jason


Thread overview: 7+ messages
2017-02-16 18:38 Instability during large transfers Samuel Holland
2017-02-17  4:48 ` Samuel Holland
2017-02-17 13:36   ` Jason A. Donenfeld
2017-02-17 17:37     ` Samuel Holland
2017-03-01 22:44     ` Samuel Holland
2017-03-21 15:06       ` Samuel Holland
2017-03-21 17:33         ` Jason A. Donenfeld
