* Flood ping can cause oom when handshake fails
@ 2017-09-22 12:58 Yousong Zhou
  2017-09-22 13:19 ` Jason A. Donenfeld
       [not found] ` <59e5680d-da17-a8c4-0c16-08f0b27a4f75@gmail.com>
  0 siblings, 2 replies; 5+ messages in thread
From: Yousong Zhou @ 2017-09-22 12:58 UTC (permalink / raw)
  To: wireguard

Hi, I have encountered a few issues when running WireGuard on VoCore,
a small ramips device with 16MB flash and 32MB RAM
(https://wiki.openwrt.org/toh/vocore/vocore).

  root@LEDE:/# uname -a
  Linux LEDE 4.9.49 #0 Fri Sep 15 05:14:29 2017 mips GNU/Linux
  root@LEDE:/# opkg list-installed | grep -i wireguard
  kmod-wireguard - 4.9.49+0.0.20170907-1
  luci-app-wireguard - git-17.259.19938-f36f198-1
  luci-proto-wireguard - git-17.259.19938-f36f198-1
  wireguard - 0.0.20170907-1
  wireguard-tools - 0.0.20170907-1
  root@LEDE:/# wg show
  interface: air
    public key: eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
    private key: (hidden)
    listening port: 21841

  peer: ffffffffffffffffffffffffffffffffffffffffffff
    endpoint: iiiiiiiiiiii:ppppp
    allowed ips: 0.0.0.0/0
    latest handshake: 4 minutes, 35 seconds ago
    transfer: 520 B received, 872 B sent

WAN is a wired vlan interface, eth0.1, bearing the default route.
Traffic is marked by iptables rules and routed through the wireguard
interface with simple policy routing rules (see the sketch below).
The setup works quite well on another ar71xx-based device (in case it
matters, the wan interface there is a regular device, eth1).
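For the record, the policy routing amounts to something like the
following sketch; the mark value, table number, and source subnet
below are made up for illustration:

  # mark traffic that should go through the tunnel (values are made up)
  iptables -t mangle -A PREROUTING -s 192.168.1.0/24 -j MARK --set-mark 0x1
  # send marked traffic through the wireguard interface "air"
  ip rule add fwmark 0x1 table 100
  ip route add default dev air table 100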

The first issue is that occasionally wireguard fails to send
handshake initiation packets to the remote.  I came to this
conclusion from two observations:
 - Tearing down and then bringing up ("ifup air") the local wireguard
device did not trigger an update of the "latest handshake" timestamp
on the remote
 - Wireguard packets can be captured on eth0.1 but not on the remote

The second issue is that when the handshake fails, flood ping traffic
that was expected to be forwarded through the wireguard interface can
cause an oom and hang the device.  There is a [kworker] process taking
up high cpu usage.

WireGuard is a very nice and convenient solution.  If there are any
further steps or info required to debug this, I am ready to help ;)

                yousong


* Re: Flood ping can cause oom when handshake fails
  2017-09-22 12:58 Flood ping can cause oom when handshake fails Yousong Zhou
@ 2017-09-22 13:19 ` Jason A. Donenfeld
  2017-09-22 13:38   ` Yousong Zhou
  2017-10-23  9:52   ` Yousong Zhou
       [not found] ` <59e5680d-da17-a8c4-0c16-08f0b27a4f75@gmail.com>
  1 sibling, 2 replies; 5+ messages in thread
From: Jason A. Donenfeld @ 2017-09-22 13:19 UTC (permalink / raw)
  To: Yousong Zhou; +Cc: WireGuard mailing list

Hi Yousong,

Thanks for the report.

On Fri, Sep 22, 2017 at 2:58 PM, Yousong Zhou <yszhou4tech@gmail.com> wrote:
> The first issue is that occasionally wireguard fails to send
> handshake initiation packets to the remote.  I came to this
> conclusion from two observations:
>  - Tearing down and then bringing up ("ifup air") the local wireguard
> device did not trigger an update of the "latest handshake" timestamp
> on the remote


The handshake will not actually occur until you try to send data over
the interface. So after bringing the interface up, send a ping; then
you'll have the handshake. If you'd like the handshake to happen
immediately, and packets to be sent persistently (for example, to
keep NAT mappings alive), there's the persistent-keepalive option.
See the wg(8) man page for details.
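A minimal sketch, with the peer tunnel address and the keepalive
interval below as placeholders:

  # trigger the handshake by sending traffic over the tunnel
  ping -c 1 10.0.0.1
  # or let wireguard keep the session alive on its own
  wg set air peer ffffffffffffffffffffffffffffffffffffffffffff \
      persistent-keepalive 25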

>  - Wireguard packets can be captured on eth0.1 but not on the remote

I'm not sure I understood this point. Can you elaborate?

> The second issue is that when the handshake fails, flood ping traffic
> that was expected to be forwarded through the wireguard interface can
> cause an oom and hang the device.  There is a [kworker] process taking
> up high cpu usage.

That's very interesting. Here's what I suspect is happening: before
there's a handshake, outgoing packets are queued up to be sent for
when a handshake does occur. Right now I allow queueing up a whopping
1024 packets before they're rotated out and freed LIFO. This is
obviously silly for low-RAM situations like yours, and I should make
that mechanism a bit smarter. I'll do that for the next snapshot. I
assume that the high-CPU kworker is a last-minute attempt at memory
compaction, or something of that sort. However, it'd be good to know
-- could you find more information about that process? Perhaps
/proc/pid/stack or related things in there?
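Something like this should be enough to gather that (the pid is
whatever top reports for the busy kworker):

  # identify the spinning kworker, then dump its kernel stack
  top -b -n 1 | grep kworker
  cat /proc/<pid>/stack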

Additionally, I see that you're running 20170907, which is an older
snapshot. If you update to the newer one (20170918), I'd be interested
to learn if the behavior is different.

Jason


* Re: Flood ping can cause oom when handshake fails
  2017-09-22 13:19 ` Jason A. Donenfeld
@ 2017-09-22 13:38   ` Yousong Zhou
  2017-10-23  9:52   ` Yousong Zhou
  1 sibling, 0 replies; 5+ messages in thread
From: Yousong Zhou @ 2017-09-22 13:38 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list

On 22 September 2017 at 21:19, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> [...]
>>  - Wireguard packets can be captured on eth0.1 but not on the remote
>
> I'm not sure I understood this point. Can you elaborate?

The VoCore device has a built-in switch whose cpu port, as I
understand it, appears as eth0 in the Linux system.  Port 4 of the
switch is configured to belong to VLAN 1 and appears as eth0.1.
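For reference, the switch and vlan layout can be inspected on the
device itself with something like the following (the switch name
"switch0" is an assumption; check the actual name in /etc/config/network):

  # dump the switch's vlan and port configuration
  swconfig dev switch0 show
  cat /etc/config/network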

The default route in the main table points to eth0.1 and traffic
flows through it just fine.  I have the following observations when
the handshake fails:

 - I can send udp traffic with netcat from my laptop to the remote
peer's endpoint.  The traffic can be captured on the remote and is
ignored there, as expected.  This means traffic to the remote
endpoint can go through eth0.1 just fine.
 - I can also capture wireguard traffic on eth0.1 with tcpdump, but
the same traffic cannot be captured on the remote, which means the
udp traffic sent from the kernel very likely failed to be sent out on
the wire (see the sketch below).
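Concretely, the checks were along these lines (the remote endpoint
address and the probe port are placeholders):

  # on the laptop: plain udp to the remote endpoint
  echo probe | nc -u <endpoint-ip> <port>
  # on the remote: confirm the probe arrives
  tcpdump -ni eth0 udp port <port>
  # on the VoCore: wireguard's udp leaves eth0.1 but never shows up remotely
  tcpdump -ni eth0.1 udp port 21841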

                yousong


* Fwd: Flood ping can cause oom when handshake fails
       [not found]   ` <CAECwjAgTb1qtiUabMBbg_6cnA+V0YQLd=316o_QU25Ffkxn4ow@mail.gmail.com>
@ 2017-09-22 13:53     ` Yousong Zhou
  0 siblings, 0 replies; 5+ messages in thread
From: Yousong Zhou @ 2017-09-22 13:53 UTC (permalink / raw)
  To: WireGuard mailing list

Sorry, my previous mail accidentally dropped off the list.

                yousong


---------- Forwarded message ----------
From: Yousong Zhou <yszhou4tech@gmail.com>
Date: 22 September 2017 at 21:22
Subject: Re: Flood ping can cause oom when handshake fails
To: Aaron Jones <aaronmdjones@gmail.com>


On 22 September 2017 at 21:15, Aaron Jones <aaronmdjones@gmail.com> wrote:
> On 22/09/17 12:58, Yousong Zhou wrote:
>> The first issue is that occasionally wireguard fails to send
>> handshake initiation packets to the remote.  I came to this
>> conclusion from two observations:
>>
>> - Tearing down and then bringing up ("ifup air") the local wireguard
>> device did not trigger an update of the "latest handshake" timestamp
>> on the remote
>
> WireGuard does not negotiate sessions when the interface is configured;
> it negotiates when it is required to do so (when you send a packet to
> the tunnel address of the peer and there is no session with that peer).
> So if you want to see whether negotiation is being performed, issue a
> ping immediately after reconfiguring the interface.
>
>> - Wireguard packets can be captured on eth0.1 but not on the
>> remote
>>

Yes, I am aware of the "silence is a virtue" feature described in the
technical paper.  That's why I kept (flood) pinging, trying to trigger
the handshake.  Tearing down and bringing up the interface was to make
sure that the udp traffic captured on eth0.1 was handshake setup, not
data packets.
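In other words, the verification loop was roughly as follows (the
peer tunnel address is a placeholder):

  ifdown air && ifup air
  # generate tunnel traffic so wireguard must initiate a handshake
  ping -f 10.0.0.1 &
  # right after ifup there is no session, so the udp packets leaving
  # eth0.1 can only be handshake initiations
  tcpdump -ni eth0.1 udp port 21841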

                yousong


* Re: Flood ping can cause oom when handshake fails
  2017-09-22 13:19 ` Jason A. Donenfeld
  2017-09-22 13:38   ` Yousong Zhou
@ 2017-10-23  9:52   ` Yousong Zhou
  1 sibling, 0 replies; 5+ messages in thread
From: Yousong Zhou @ 2017-10-23  9:52 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list

Hi,

Sorry for the late reply.  I have been away from the problematic
device for a while and only recently found the time to debug this
further...

The oom issue caused by staged packets is gone now that the queue
length has been changed from 1024 to 128.
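For anyone who wants to reproduce the original symptom, a rough
recipe (the peer tunnel address is a placeholder):

  # with the handshake failing, flood ping through the tunnel
  ping -f 10.0.0.1 &
  # watch free memory drain while packets are staged for the handshake
  while true; do grep MemFree /proc/meminfo; sleep 1; done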

The handshake failure issue persists, but it's more an issue of the
network infrastructure than of wireguard itself.  Previously I thought
the handshake packets might not actually be making it onto the wire,
perhaps because wireguard could not work with the switch and vlan
setup, but I could not confirm that guess because I cannot observe
traffic on the intermediate devices.

I just captured and replayed the udp handshake packets with varying
TTL settings (with corrected ip checksums), and it seems that some
intermediate devices simply drop these packets silently!  The more
annoying part is that it's not deterministic behaviour: other traffic
like tcp and icmp flows through just fine most of the time, while the
udp port 21841 I am using for wireguard fails much more frequently.
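The probing can be approximated with Linux traceroute's fixed-port
udp mode, assuming that tool is available on the testing machine (the
endpoint address is a placeholder):

  # probe hop by hop how far udp to the wireguard port gets
  traceroute -U -p 21841 <endpoint-ip>
  # compare against a udp port that is known to pass
  traceroute -U -p 53 <endpoint-ip>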

I guess a stealthy channel is not part of the game with wireguard at
the moment...  Thanks for the good work though, it is really awesome.

Regards,
                yousong



