netdev.vger.kernel.org archive mirror
* Packets being dropped somewhere in the kernel, between iptables and packet capture layers
@ 2019-01-28  0:17 Niklas Hambüchen
  2019-01-28  6:21 ` Florian Westphal
  0 siblings, 1 reply; 3+ messages in thread
From: Niklas Hambüchen @ 2019-01-28  0:17 UTC (permalink / raw)
  To: netdev

Hello network developers,

I'm sending this to netdev@vger.kernel.org even though http://vger.kernel.org/lkml/ still suggests linux-net@vger.kernel.org, because the latter seems to be inactive since 2011 and full of spam, and I got "unresolvable address" for it. Perhaps somebody should update the page that recommends it.
Nevertheless, please let me know if this is the wrong place.

## Problem

Some DNS requests on my machine take 5 seconds when a specific network interface is used.
This makes using a web browser a total pain.

I suspect this is a problem in the kernel, somewhere between the iptables layer and the level on which Wireshark/tcpdump operates.

I give the reasoning for this suspicion below.

It happens in around 5% of the invocations of `dig ifconfig.me @1.1.1.1`.
(It doesn't matter which nameserver or domain is used, I'm just using `1.1.1.1` as an example because it can be nicely grepped in logs.)
My usual DNS resolution time is 1 ms.

I can reproduce the problem in less than 1 second by running `while true; do date; time dig ifconfig.me @1.1.1.1; done`.
A couple of successful invocations fly by, then `dig` hangs for 5 seconds.

I'm fairly certain 5 seconds is a timeout implemented by `dig`/`glibc`; it's what it does when it doesn't get a response to its UDP requests.
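One way to test this theory (untested sketch; `+timeout` and `+tries` are standard `dig` options): shrink the retry timer and see whether the hang shrinks with it.

```
# If the 5 s delay is dig's own retry timer, shrinking the timer
# should shrink the hang from ~5 s to ~1 s:
time dig +timeout=1 +tries=2 ifconfig.me @1.1.1.1
```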

`strace` demonstrates that the userspace side does what it should:

```
sendmsg(20<socket:[92070424]>, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, msg_iov(1)=[{"\372\f\1 \0\1\0\0\0\0\0\1\10ifconfig\2me\0\0\1\0\1\0\0)"..., 40}], msg_controllen=0, msg_flags=0}, 0) = 40
epoll_wait(
  -- here it hangs
  -- 5 seconds hang here
<... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
  -- dig's 5 second timeout triggers, it retries
sendmsg(20<socket:[92070424]>, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, msg_iov(1)=[{"\372\f\1 \0\1\0\0\0\0\0\1\10ifconfig\2me\0\0\1\0\1\0\0)"..., 40}], msg_controllen=0, msg_flags=0}, 0) = 40
  -- 1 ms later
recvmsg(20<socket:[92070424]>, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, msg_iov(1)=[{"\372\f\201\200\0\1\0\4\0\0\0\1\10ifconfig\2me\0\0\1\0\1\300\f\0"..., 65535}], msg_controllen=56, [{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */}, {cmsg_len=17, cmsg_level=SOL_IP, cmsg_type=IP_TOS, {tos=0}}], msg_flags=0}, 0) = 104
```

I have recorded this in Wireshark.
It shows only 1 DNS request / UDP packet being sent, not 2, even though the userspace program clearly tells the kernel to send 2 packets with its 2 `sendmsg()` syscalls.
The one request that does appear in Wireshark is answered within 1 ms, as expected.

So it seems that the kernel never sends the first packet.

(Side info:
Enabling the TCP mode of dig via `time dig +tcp ifconfig.me` changes the timeout to 1 second.
This suggests that the problem is with sending packets _in general_, not only UDP.
EDNS is generally used, but `+noedns` doesn't change the timeouts, no matter if `+tcp` is used or not.
)

I further discovered the problem is specific to the network interface I use.

My machine has 2 Ethernet ports:

* Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
* Intel Ethernet Connection (2) I218-V

The problem occurs _only_ on the Realtek port.

Details about the hardware from `lshw`:

```
        *-network
             description: Ethernet interface
             product: Ethernet Connection (2) I218-V
             vendor: Intel Corporation
             physical id: 19
             bus info: pci@0000:00:19.0
             logical name: enp0s25
             version: 00
             serial: d0:50:99:5c:cf:1c
             capacity: 1Gbit/s
             width: 32 bits
             clock: 33MHz
             capabilities: pm msi bus_master cap_list ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
             configuration: autonegotiation=on broadcast=yes driver=e1000e driverversion=3.2.6-k firmware=0.1-4 latency=0 link=no multicast=yes port=twisted pair
             resources: irq:42 memory:ef500000-ef51ffff memory:ef538000-ef538fff ioport:f040(size=32)
...
           *-network
                description: Ethernet interface
                product: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
                vendor: Realtek Semiconductor Co., Ltd.
                physical id: 0
                bus info: pci@0000:03:00.0
                logical name: enp3s0
                version: 11
                serial: d0:50:99:5c:cf:1a
                size: 1Gbit/s
                capacity: 1Gbit/s
                width: 64 bits
                clock: 33MHz
                capabilities: pm msi pciexpress msix vpd bus_master cap_list ethernet physical tp mii 10bt 10bt-fd 100bt 100bt-fd 1000bt 1000bt-fd autonegotiation
                configuration: autonegotiation=on broadcast=yes driver=r8169 driverversion=2.3LK-NAPI duplex=full firmware=rtl8168g-2_0.0.1 02/06/13 ip=192.168.1.100 latency=0 link=yes multicast=yes port=MII speed=1Gbit/s
                resources: irq:37 ioport:d000(size=256) memory:ef400000-ef400fff memory:e2100000-e2103fff
```

User `longxia` from the `##linux` freenode IRC channel made the great suggestion to use `iptables` logging to check whether the packet missing in Wireshark shows up there.


## Investigation with `iptables` logging facilities

Everything below is for UDP. `enp3s0` is the Realtek Ethernet port.

I used `iptables -F && iptables -I OUTPUT 1 -j LOG && iptables -I INPUT 1 -j LOG`.

I observe during the hang:

```
Jan 27 23:24:16.210162 ares kernel: IN= OUT=enp3s0 SRC=192.168.1.100 DST=1.1.1.1 LEN=68 TOS=0x00 PREC=0x00 TTL=64 ID=57836 PROTO=UDP SPT=33554 DPT=53 LEN=48
# hang here
Jan 27 23:24:21.210106 ares kernel: IN= OUT=enp3s0 SRC=192.168.1.100 DST=1.1.1.1 LEN=68 TOS=0x00 PREC=0x00 TTL=64 ID=57843 PROTO=UDP SPT=33554 DPT=53 LEN=48
Jan 27 23:24:21.210645 ares kernel: IN=enp3s0 OUT= MAC=d0:50:99:5c:cf:1a:44:d9:e7:41:32:61:08:00 SRC=1.1.1.1 DST=192.168.1.100 LEN=132 TOS=0x00 PREC=0x00 TTL=54 ID=24125 DF PROTO=UDP SPT=53 DPT=33554 LEN=112
```

In the corresponding Wireshark capture below, we can see (with filter `dns.qry.name == "ifconfig.me"`) the two packets from after the hang.
But the first outgoing packet is not present!

```
2784  22:24:21.208712960  192.168.1.100 1.1.1.1 DNS 82  Standard query 0x8a93 A ifconfig.me OPT
2785  22:24:21.210606902  1.1.1.1 192.168.1.100 DNS 146 Standard query response 0x8a93 A ifconfig.me A 216.239.36.21 A 216.239.38.21 A 216.239.32.21 A 216.239.34.21 OPT
```

So I suspect that either:

* the packet gets lost in the kernel code path between the iptables layer and the layer that Wireshark attaches to
* the packet gets sent, but for whatever reason doesn't show up in Wireshark (and the problem is the network not giving a response; unlikely, given that the other network card works)
* the packet gets dropped by the kernel for some legitimate reason (e.g. hardware returning an error), but the kernel fails to report this error anywhere

Which of these could it be?
How should I investigate deeper?

Thanks a lot,
Niklas


PS:
You can find a version-controlled investigation log of this problem, including this email in nice formatting, at https://github.com/nh2/dig-5-seconds-hang.


* Re: Packets being dropped somewhere in the kernel, between iptables and packet capture layers
  2019-01-28  0:17 Packets being dropped somewhere in the kernel, between iptables and packet capture layers Niklas Hambüchen
@ 2019-01-28  6:21 ` Florian Westphal
  2019-01-29  0:45   ` Niklas Hambüchen
  0 siblings, 1 reply; 3+ messages in thread
From: Florian Westphal @ 2019-01-28  6:21 UTC (permalink / raw)
  To: Niklas Hambüchen; +Cc: netdev

Niklas Hambüchen <mail@nh2.me> wrote:
> I'm sending this to netdev@vger.kernel.org even though http://vger.kernel.org/lkml/ still suggests linux-net@vger.kernel.org, because the latter seems to be inactive since 2011 and full of spam, and I got "unresolvable address" for it. Perhaps somebody should update the page that recommends it.
> Nevertheless, please let me know if here is the wrong place.

This problem is known; I asked for test feedback on this patch but never
got a response:

    netfilter: nf_nat: return the same reply tuple for matching CTs
    
    It is possible that two concurrent packets originating from the same
    socket of a connection-less protocol (e.g. UDP) can end up having
    different IP_CT_DIR_REPLY tuples which results in one of the packets
    being dropped.
    
    To illustrate this, consider the following simplified scenario:
    
    1. No DNAT/SNAT/MASQUERADE rules are installed, but the nf_nat module
       is loaded.
    2. Packet A and B are sent at the same time from two different threads
       via the same UDP socket which hasn't been used before (=no CT has
       been created before). Both packets have the same IP_CT_DIR_ORIGINAL
       tuple.
    3. CT of A has been created and confirmed, afterwards get_unique_tuple
       is called for B. Because IP_CT_DIR_REPLY tuple (the inverse of
       the IP_CT_DIR_ORIGINAL tuple) is already taken by the A's confirmed
       CT (nf_nat_used_tuple finds it), get_unique_tuple calls UDP's
       unique_tuple which returns a different IP_CT_DIR_REPLY tuple (usually
       with src port = 1024)
    4. B's CT cannot get confirmed in __nf_conntrack_confirm due to
       the found IP_CT_DIR_ORIGINAL tuple of A and the different
       IP_CT_DIR_REPLY tuples, thus the packet B gets dropped.
    
    This patch modifies nf_conntrack_tuple_taken so it doesn't consider
    colliding reply tuples if the IP_CT_DIR_ORIGINAL tuples are equal.
    
    Then, at insert time, either clash resolution is possible (new packet
    has the existing/older conntrack assigned to it), or it has to be
    dropped.
    
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 741b533148ba..07847a612adf 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1007,6 +1007,22 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 		}
 
 		if (nf_ct_key_equal(h, tuple, zone, net)) {
+			/* If the origin tuples are identical, we can ignore
+			 * this clashing entry: they refer to the same flow.
+			 * Do not apply nat clash resolution in this case and
+			 * let nf_ct_resolve_clash() deal with this.
+			 *
+			 * This can happen with UDP in particular, e.g. when
+			 * more than one packet is sent from same socket in
+			 * different threads.
+			 *
+			 * We would now mangle our entry and would then have to
+			 * discard it at conntrack confirm time.
+			 */
+			if (nf_ct_tuple_equal(&ignored_conntrack->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
+					      &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple))
+				continue;
+
 			NF_CT_STAT_INC_ATOMIC(net, found);
 			rcu_read_unlock();
 			return 1;
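If this clash is what you are hitting, it should also be visible as drops in conntrack's statistics while the hang reproduces (a sketch; assumes conntrack-tools, or just procfs, and root):

```
# Per-CPU conntrack counters; watch insert_failed and drop increment
# during the hang:
conntrack -S

# The same counters without conntrack-tools:
cat /proc/net/stat/nf_conntrack
```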


* Re: Packets being dropped somewhere in the kernel, between iptables and packet capture layers
  2019-01-28  6:21 ` Florian Westphal
@ 2019-01-29  0:45   ` Niklas Hambüchen
  0 siblings, 0 replies; 3+ messages in thread
From: Niklas Hambüchen @ 2019-01-29  0:45 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

> Some DNS requests on my machine take 5 seconds when a specific network interface is used.

My big apologies for the confusion. I have found the reason.

There was a scripted

     tc qdisc add dev enp3s0 root netem loss 5%

in effect that I wasn't aware of while investigating.
This dropped 5% of the packets, and also explains why it was on that interface only.
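For anyone hitting something similar: a forgotten qdisc like this is easy to spot, since anything other than the interface's default qdisc deserves a look.

```
# Show the queueing disciplines installed on the interface;
# a "netem loss 5%" line here would have given the game away:
tc qdisc show dev enp3s0

# And to remove it once found:
tc qdisc del dev enp3s0 root netem
```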

Well, at least I now know very well at which layer `tc` operates.

My bad, and sorry for the noise.

