* Packets being dropped somewhere in the kernel, between iptables and packet capture layers
From: Niklas Hambüchen @ 2019-01-28 0:17 UTC (permalink / raw)
To: netdev
Hello network developers,
I'm sending this to netdev@vger.kernel.org even though http://vger.kernel.org/lkml/ still suggests linux-net@vger.kernel.org: the latter appears to have been inactive since 2011, is full of spam, and my mail to it bounced with "unresolvable address". Perhaps somebody should update the page that recommends it.
Nevertheless, please let me know if here is the wrong place.
## Problem
Some DNS requests on my machine take 5 seconds when a specific network interface is used.
This makes using a web browser a total pain.
I suspect this is a problem in the kernel, somewhere between the iptables layer and the level on which Wireshark/tcpdump operates.
I give the reasoning for this suspicion below.
It happens in around 5% of the invocations of `dig ifconfig.me @1.1.1.1`.
(It doesn't matter which nameserver or domain is used, I'm just using `1.1.1.1` as an example because it can be nicely grepped in logs.)
My usual DNS resolution time is 1 ms.
I can reproduce the problem in less than 1 second by running `while true; do date; time dig ifconfig.me @1.1.1.1; done`.
A couple of successful invocations flush by, then `dig` hangs for 5 seconds.
I'm fairly certain the 5 seconds is a timeout implemented by `dig`/`glibc`: it is how long they wait for a response to a UDP request before retrying.
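That timeout-and-retry behaviour can be sketched roughly as follows (a minimal Python illustration, not glibc's or dig's actual code; the server address, timeout, and attempt count are parameters, not values taken from either implementation):

```python
import socket

def udp_query(server, payload, timeout=5.0, attempts=2):
    """Send a UDP datagram and wait for a reply, retrying on timeout.

    A minimal sketch of a resolver's timeout/retry loop. Returns the
    reply bytes, or None if every attempt timed out.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        for _ in range(attempts):
            sock.sendto(payload, server)
            try:
                reply, _addr = sock.recvfrom(65535)
                return reply
            except socket.timeout:
                # No answer within `timeout`: resend, just like dig does.
                continue
        return None
    finally:
        sock.close()
```

When the first datagram is lost but the retry is answered, the call returns after roughly `timeout` seconds, which is exactly the shape of the 5-second stalls above.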
`strace` demonstrates that the userspace side does what it should:
```
sendmsg(20<socket:[92070424]>, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, msg_iov(1)=[{"\372\f\1 \0\1\0\0\0\0\0\1\10ifconfig\2me\0\0\1\0\1\0\0)"..., 40}], msg_controllen=0, msg_flags=0}, 0) = 40
epoll_wait(
-- hangs here for 5 seconds
<... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
-- dig's 5 second timeout triggers, it retries
sendmsg(20<socket:[92070424]>, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, msg_iov(1)=[{"\372\f\1 \0\1\0\0\0\0\0\1\10ifconfig\2me\0\0\1\0\1\0\0)"..., 40}], msg_controllen=0, msg_flags=0}, 0) = 40
-- 1 ms later
recvmsg(20<socket:[92070424]>, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, msg_iov(1)=[{"\372\f\201\200\0\1\0\4\0\0\0\1\10ifconfig\2me\0\0\1\0\1\300\f\0"..., 65535}], msg_controllen=56, [{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */}, {cmsg_len=17, cmsg_level=SOL_IP, cmsg_type=IP_TOS, {tos=0}}], msg_flags=0}, 0) = 104
```
I have recorded this in Wireshark.
It shows only 1 DNS request being made / UDP packet being sent, not 2, even though the userspace program clearly tells the kernel to send 2 packets with its 2 `sendmsg()` syscalls.
The one request that does appear in Wireshark is answered within 1 ms, as expected.
So it seems that the kernel never sends the first packet.
(Side info:
Enabling the TCP mode of dig via `time dig +tcp ifconfig.me` changes the timeout to 1 second.
This suggests that the problem is with sending packets _in general_, not only UDP.
EDNS is generally used, but `+noedns` doesn't change the timeouts, regardless of whether `+tcp` is used.
)
I further discovered that the problem is specific to the network interface in use.
My machine has 2 Ethernet ports:
* Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
* Intel Ethernet Connection (2) I218-V
The problem occurs _only_ on the Realtek port.
Details about the hardware from `lshw`:
```
*-network
description: Ethernet interface
product: Ethernet Connection (2) I218-V
vendor: Intel Corporation
physical id: 19
bus info: pci@0000:00:19.0
logical name: enp0s25
version: 00
serial: d0:50:99:5c:cf:1c
capacity: 1Gbit/s
width: 32 bits
clock: 33MHz
capabilities: pm msi bus_master cap_list ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
configuration: autonegotiation=on broadcast=yes driver=e1000e driverversion=3.2.6-k firmware=0.1-4 latency=0 link=no multicast=yes port=twisted pair
resources: irq:42 memory:ef500000-ef51ffff memory:ef538000-ef538fff ioport:f040(size=32)
...
*-network
description: Ethernet interface
product: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
vendor: Realtek Semiconductor Co., Ltd.
physical id: 0
bus info: pci@0000:03:00.0
logical name: enp3s0
version: 11
serial: d0:50:99:5c:cf:1a
size: 1Gbit/s
capacity: 1Gbit/s
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress msix vpd bus_master cap_list ethernet physical tp mii 10bt 10bt-fd 100bt 100bt-fd 1000bt 1000bt-fd autonegotiation
configuration: autonegotiation=on broadcast=yes driver=r8169 driverversion=2.3LK-NAPI duplex=full firmware=rtl8168g-2_0.0.1 02/06/13 ip=192.168.1.100 latency=0 link=yes multicast=yes port=MII speed=1Gbit/s
resources: irq:37 ioport:d000(size=256) memory:ef400000-ef400fff memory:e2100000-e2103fff
```
User `longxia` from the `##linux` freenode IRC channel made the helpful suggestion of using `iptables` logging to see whether the packet missing from Wireshark is present there.
## Investigation with `iptables` logging facilities
All the below is for `UDP`. `enp3s0` is the Realtek Ethernet port.
I used `iptables -F && iptables -I OUTPUT 1 -j LOG && iptables -I INPUT 1 -j LOG`.
I observe during the hang:
```
Jan 27 23:24:16.210162 ares kernel: IN= OUT=enp3s0 SRC=192.168.1.100 DST=1.1.1.1 LEN=68 TOS=0x00 PREC=0x00 TTL=64 ID=57836 PROTO=UDP SPT=33554 DPT=53 LEN=48
# hang here
Jan 27 23:24:21.210106 ares kernel: IN= OUT=enp3s0 SRC=192.168.1.100 DST=1.1.1.1 LEN=68 TOS=0x00 PREC=0x00 TTL=64 ID=57843 PROTO=UDP SPT=33554 DPT=53 LEN=48
Jan 27 23:24:21.210645 ares kernel: IN=enp3s0 OUT= MAC=d0:50:99:5c:cf:1a:44:d9:e7:41:32:61:08:00 SRC=1.1.1.1 DST=192.168.1.100 LEN=132 TOS=0x00 PREC=0x00 TTL=54 ID=24125 DF PROTO=UDP SPT=53 DPT=33554 LEN=112
```
In the corresponding Wireshark capture below, we can see (with filter `dns.qry.name == "ifconfig.me"`) the two packets from after the hang.
But the first outgoing packet is not present!
```
2784 22:24:21.208712960 192.168.1.100 1.1.1.1 DNS 82 Standard query 0x8a93 A ifconfig.me OPT
2785 22:24:21.210606902 1.1.1.1 192.168.1.100 DNS 146 Standard query response 0x8a93 A ifconfig.me A 216.239.36.21 A 216.239.38.21 A 216.239.32.21 A 216.239.34.21 OPT
```
So I suspect that either:
* the packet gets lost in the kernel code path between the iptables layer and the layer that Wireshark attaches to
* the packet gets sent, but for whatever reason doesn't show up in Wireshark (and the problem is the network not giving a response; unlikely, given that the other network card works)
* the packet gets dropped by the kernel for some legitimate reason (e.g. hardware returning an error), but the kernel fails to report this error anywhere
Which of these could it be?
How should I investigate deeper?
Thanks a lot,
Niklas
PS:
You can find a version-controlled investigation log of this problem, including this email in nice formatting, at https://github.com/nh2/dig-5-seconds-hang.
* Re: Packets being dropped somewhere in the kernel, between iptables and packet capture layers
From: Florian Westphal @ 2019-01-28 6:21 UTC (permalink / raw)
To: Niklas Hambüchen; +Cc: netdev
Niklas Hambüchen <mail@nh2.me> wrote:
> I'm sending this to netdev@vger.kernel.org even though http://vger.kernel.org/lkml/ still suggests linux-net@vger.kernel.org, because the latter seems to be inactive since 2011 and full of spam, and I got "unresolvable address" for it. Perhaps somebody should update the page that recommends it.
> Nevertheless, please let me know if here is the wrong place.
This problem is known; I asked for test feedback on this patch but never
got a response:
netfilter: nf_nat: return the same reply tuple for matching CTs
It is possible that two concurrent packets originating from the same
socket of a connection-less protocol (e.g. UDP) can end up having
different IP_CT_DIR_REPLY tuples which results in one of the packets
being dropped.
To illustrate this, consider the following simplified scenario:
1. No DNAT/SNAT/MASQUERADE rules are installed, but the nf_nat module
is loaded.
2. Packet A and B are sent at the same time from two different threads
via the same UDP socket which hasn't been used before (=no CT has
been created before). Both packets have the same IP_CT_DIR_ORIGINAL
tuple.
3. CT of A has been created and confirmed, afterwards get_unique_tuple
is called for B. Because IP_CT_DIR_REPLY tuple (the inverse of
the IP_CT_DIR_ORIGINAL tuple) is already taken by the A's confirmed
CT (nf_nat_used_tuple finds it), get_unique_tuple calls UDP's
unique_tuple which returns a different IP_CT_DIR_REPLY tuple (usually
with src port = 1024)
4. B's CT cannot get confirmed in __nf_conntrack_confirm due to
the found IP_CT_DIR_ORIGINAL tuple of A and the different
IP_CT_DIR_REPLY tuples, thus the packet B gets dropped.
This patch modifies nf_conntrack_tuple_taken so it doesn't consider
colliding reply tuples if the IP_CT_DIR_ORIGINAL tuples are equal.
Then, at insert time, either clash resolution is possible (new packet
has the existing/older conntrack assigned to it), or it has to be
dropped.
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 741b533148ba..07847a612adf 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1007,6 +1007,22 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
}
if (nf_ct_key_equal(h, tuple, zone, net)) {
+ /* If the origin tuples are identical, we can ignore
+ * this clashing entry: they refer to the same flow.
+ * Do not apply nat clash resolution in this case and
+ * let nf_ct_resolve_clash() deal with this.
+ *
+ * This can happen with UDP in particular, e.g. when
+ * more than one packet is sent from same socket in
+ * different threads.
+ *
+ * We would now mangle our entry and would then have to
+ * discard it at conntrack confirm time.
+ */
+ if (nf_ct_tuple_equal(&ignored_conntrack->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
+ &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple))
+ continue;
+
NF_CT_STAT_INC_ATOMIC(net, found);
rcu_read_unlock();
return 1;
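The userspace pattern in step 2 above (two threads sending concurrently through the same fresh, unconnected UDP socket) can be reproduced with a short sketch like the following. Note the caveat: this only exercises the triggering packet pattern; actually observing the drop requires the nf_nat module to be loaded and no pre-existing conntrack entry for the flow.

```python
import socket
import threading

# A local receiver stands in for the DNS server, to keep this self-contained.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.settimeout(2.0)
server = rx.getsockname()

# One unconnected, previously unused UDP socket shared by both threads,
# matching step 2 of the scenario: no conntrack entry exists yet, so both
# packets race through get_unique_tuple / conntrack confirmation.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send(tag):
    tx.sendto(tag, server)

threads = [threading.Thread(target=send, args=(t,)) for t in (b"A", b"B")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# On an unaffected system both datagrams arrive; on an affected one,
# the second packet may be dropped at conntrack confirm time.
got = sorted(rx.recvfrom(64)[0] for _ in range(2))
print(got)
```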
* Re: Packets being dropped somewhere in the kernel, between iptables and packet capture layers
From: Niklas Hambüchen @ 2019-01-29 0:45 UTC (permalink / raw)
To: netdev; +Cc: Florian Westphal
> Some DNS requests on my machine take 5 seconds when a specific network interface is used.
My sincere apologies for the confusion. I have found the cause.
There was a scripted
tc qdisc add dev enp3s0 root netem loss 5%
in effect that I wasn't aware of while investigating.
This dropped 5% of the packets, which also explains why it happened on that interface only.
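With an independent 5% per-packet loss, roughly 1 in 20 queries loses its first datagram and hits dig's timeout, which matches the "around 5% of invocations" observed at the start. A quick simulation (illustrative only, not part of the original report):

```python
import random

def fraction_of_slow_queries(n_queries, loss=0.05, seed=42):
    """Fraction of queries whose first datagram is dropped, i.e. that
    would stall for dig's full timeout, under independent per-packet loss."""
    rng = random.Random(seed)
    slow = sum(1 for _ in range(n_queries) if rng.random() < loss)
    return slow / n_queries

print(fraction_of_slow_queries(100_000))  # close to 0.05
```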
Well, at least I now know very well at which level `tc` operates.
My bad, and sorry for the noise.