netdev.vger.kernel.org archive mirror
* epoll_wait() (and epoll_pwait) stall for 206 ms per call on sockets with a small-ish snd/rcv buffer.
@ 2019-07-06  1:45 Carlo Wood
  2019-07-06 16:16 ` Kernel BUG: " Carlo Wood
  0 siblings, 1 reply; 4+ messages in thread
From: Carlo Wood @ 2019-07-06  1:45 UTC (permalink / raw)
  To: davem, netdev

Hi all,

hopefully I'm reporting this in the right place.

While working on my C++ networking library, I ran into a problem that
I have traced back to the kernel itself.

The problem is related to setting a socket buffer size (snd/rcv).

At first I noticed the problem while testing the thread-safety of my
lock-free stream buffer implementation (a std::streambuf interface
that allows reading and writing by different threads at the same
time). For that test I read data from a socket into a buffer and
write it back to the same socket from the same buffer. This resulted
in a bandwidth of only a few tens of kB/s, which after investigation
turned out to be caused by epoll_pwait() not returning for 205 to
208 ms per call, even though data was already available.

At first I thought it could be a bug in libev, which I was using as a
wrapper around epoll, so I ripped that out and wrote my own code to
interface with epoll directly, but with the same result. Next I
changed the test into a one-way stream: just send data from one
socket to another (not back); again the same result. Then I wrote a
standalone, single-threaded test application in C. There I discovered
that first calling connect() on a socket and then (immediately after)
setting the socket buffer sizes caused the stall to occur. Assuming
that this ordering is simply not allowed, I thought I had solved it.
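
To be explicit, the ordering that avoids the stall on the client side
is roughly the following. This is only a simplified sketch of what my
test does (the function name is illustrative; address handling and
all error checking are left out):

    #include <sys/socket.h>
    #include <netinet/in.h>

    /* Sketch: set the snd/rcv buffer sizes BEFORE connect(), not after it. */
    int make_client_socket(const struct sockaddr_in* addr, int bufsize)
    {
      int fd = socket(AF_INET, SOCK_STREAM, 0);
      setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
      setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));
      connect(fd, (const struct sockaddr*)addr, sizeof(*addr));
      return fd;
    }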

But going back to sending data back and forth over the TCP connection
made the problem return.

So I again wrote a standalone, single-threaded C program that
reproduces the problem.

You can find this program here (I'm afraid of triggering spam filters
by attaching it, but let me know if attaching it is OK):
https://github.com/CarloWood/ai-evio-testsuite/blob/e6e36b66256ec5d1760d2b7b96b34d60a12477aa/src/epoll_bug.c

With socket snd/rcv buffers smaller than 32 kB I always get the
problem; with 64 kB (65536 bytes) I don't, on my box.

I tested this with kernels 4.15 and 5.2.0-rc7, and had someone else
confirm it with 5.1.15.

When running the program as-is, it starts printing a flood of
"epoll_wait() was delayed!" messages, meaning that epoll_wait() didn't
return within 10 ms. The actual delay can be seen with, for example,
strace -T. You can also uncomment the commented-out printf's, which
print the number of bytes written to and read from each socket, and
therefore the number of bytes that are still "in the pipeline".
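
The "was delayed" check is essentially nothing more than timing the
epoll_wait() call itself, roughly like this (an illustrative sketch,
not the exact code from the test program):

    #include <stdio.h>
    #include <time.h>
    #include <sys/epoll.h>

    /* Call epoll_wait() and warn when it took longer than 10 ms. */
    int timed_epoll_wait(int epfd, struct epoll_event* events, int maxevents)
    {
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      int n = epoll_wait(epfd, events, maxevents, -1);
      clock_gettime(CLOCK_MONOTONIC, &t1);
      long ms = (t1.tv_sec - t0.tv_sec) * 1000 +
                (t1.tv_nsec - t0.tv_nsec) / 1000000;
      if (ms > 10)
        printf("epoll_wait() was delayed! (%ld ms)\n", ms);
      return n;
    }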

What I observe is that the number of bytes in the pipeline stays very
constant (the program just writes until no more can be written, and
then, on the other end, reads until there is nothing left to read,
all single-threaded), until suddenly that amount jumps UP to about
2.5 MB. That amount is then SLOWLY drained to roughly 1.5 MB (without
epoll reporting that more can be written), at which point writing
suddenly becomes possible again and the pipeline fills up once more.

I have no idea where those megabytes of data are buffered; the socket
buffers are WAY smaller.

Obviously I have no idea whether this is a problem with epoll itself
or with TCP/IP in this case. I am not a kernel hacker, so all I could
do was provide this little test case that allows you to reproduce it.

-- 
Carlo Wood <carlo@alinoe.com>


* Re: Kernel BUG: epoll_wait() (and epoll_pwait) stall for 206 ms per call on sockets with a small-ish snd/rcv buffer.
  2019-07-06  1:45 epoll_wait() (and epoll_pwait) stall for 206 ms per call on sockets with a small-ish snd/rcv buffer Carlo Wood
@ 2019-07-06 16:16 ` Carlo Wood
  2019-07-06 18:19   ` Carlo Wood
  0 siblings, 1 reply; 4+ messages in thread
From: Carlo Wood @ 2019-07-06 16:16 UTC (permalink / raw)
  To: davem, netdev

I improved the test case a bit:

https://github.com/CarloWood/ai-evio-testsuite/blob/2a9ae49e3ae39eea7cb1d105883254370f53831b/src/epoll_bug.c

If the bug doesn't show, please increase burst_size and/or decrease
sndbuf_size and rcvbuf_size.

The output that I get with VERBOSE defined looks, for example, like
this:

[...snip...]
wrote 34784 bytes to 6 (total written now 9665792), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9665792), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9700576), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9700576), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9735360), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9735360), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9770144), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9770144), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9804928), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9804928), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9839712), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9839712), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9874496), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9874496), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9909280), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9909280), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9944064), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9944064), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9978848), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9978848), left in pipe line: 0
wrote 21152 bytes to 6 (total written now 10000000), now in pipe line: 21152
Read 21152 bytes from fd 5 (total read now 10000000), left in pipe line: 0
epoll_wait() stalled 193 milliseconds!
Read 21888 bytes from fd 6 (total read now 70912), left in pipe line: 21888
epoll_wait() stalled 255 milliseconds!
Read 21888 bytes from fd 6 (total read now 92800), left in pipe line: 0
write(5, buf, 9907200) = 43776 (total written now 136576), now in pipe line: 43776
epoll_wait() stalled 211 milliseconds!
Read 21888 bytes from fd 6 (total read now 114688), left in pipe line: 21888
epoll_wait() stalled 207 milliseconds!
write(5, buf, 9863424) = 38272 (total written now 174848), now in pipe line: 60160
Read 16384 bytes from fd 6 (total read now 131072), left in pipe line: 43776
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 152960), left in pipe line: 21888
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 174848), left in pipe line: 0
write(5, buf, 9825152) = 43776 (total written now 218624), now in pipe line: 43776
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 196736), left in pipe line: 21888
epoll_wait() stalled 211 milliseconds!
Read 21888 bytes from fd 6 (total read now 218624), left in pipe line: 0
write(5, buf, 9781376) = 43776 (total written now 262400), now in pipe line: 43776
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 240512), left in pipe line: 21888
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 262400), left in pipe line: 0
write(5, buf, 9737600) = 43776 (total written now 306176), now in pipe line: 43776
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 284288), left in pipe line: 21888
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 306176), left in pipe line: 0
write(5, buf, 9693824) = 43776 (total written now 349952), now in pipe line: 43776
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 328064), left in pipe line: 21888
epoll_wait() stalled 207 milliseconds!
write(5, buf, 9650048) = 38272 (total written now 388224), now in pipe line: 60160
Read 16384 bytes from fd 6 (total read now 344448), left in pipe line: 43776
epoll_wait() stalled 207 milliseconds!
... etc. etc.


It seems that the problem always occurs right after the program stops
writing data in one direction, and then shows up for the way back.

In the case above burst_size is set to 10000000 bytes, and the
program writes that amount and reads it successfully on the other
side.

What I think is going on, however, is that the data on the way back
is stalling, during which the "forward" burst finishes (without the
epoll_wait() stalls it is VERY fast). Above you see:

Read 21152 bytes from fd 5 (total read now 10000000), left in pipe line: 0
epoll_wait() stalled 193 milliseconds!

But since the stall always seems to be about 207 ms, I'm pretty sure
that it actually started 14 ms before that, and that those 14 ms were
needed to finish the complete burst in one direction.

-- 
Carlo Wood <carlo@alinoe.com>


* Re: Kernel BUG: epoll_wait() (and epoll_pwait) stall for 206 ms per call on sockets with a small-ish snd/rcv buffer.
  2019-07-06 16:16 ` Kernel BUG: " Carlo Wood
@ 2019-07-06 18:19   ` Carlo Wood
  2019-07-08 20:11     ` Neal Cardwell
  0 siblings, 1 reply; 4+ messages in thread
From: Carlo Wood @ 2019-07-06 18:19 UTC (permalink / raw)
  To: davem, netdev

While investigating this further, I read on
http://www.masterraghu.com/subjects/np/introduction/unix_network_programming_v1.3/ch07lev1sec5.html
under "SO_RCVBUF and SO_SNDBUF Socket Options":

    When setting the size of the TCP socket receive buffer, the
    ordering of the function calls is important. This is because of
    TCP's window scale option (Section 2.6), which is exchanged with
    the peer on the SYN segments when the connection is established.
    For a client, this means the SO_RCVBUF socket option must be set
    before calling connect. For a server, this means the socket option
    must be set for the listening socket before calling listen. Setting
    this option for the connected socket will have no effect whatsoever
    on the possible window scale option because accept does not return
    with the connected socket until TCP's three-way handshake is
    complete. That is why this option must be set for the listening
    socket. (The sizes of the socket buffers are always inherited from
    the listening socket by the newly created connected socket: pp.
    462–463 of TCPv2.)

As mentioned in a previous post, I had already discovered the need to
set the socket buffers before connect(), but I didn't know about
setting them before the call to listen() in order to have the buffer
sizes inherited by the accepted sockets.
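
So on the server side the ordering that works is roughly this (again
only a simplified sketch; the function name is illustrative and error
checking is omitted):

    #include <sys/socket.h>
    #include <netinet/in.h>

    /* Sketch: set the buffer sizes on the listen socket before listen(),
       so that the sockets returned by accept() inherit them. */
    int make_listen_socket(const struct sockaddr_in* addr, int bufsize)
    {
      int fd = socket(AF_INET, SOCK_STREAM, 0);
      setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
      setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));
      bind(fd, (const struct sockaddr*)addr, sizeof(*addr));
      listen(fd, 4);
      return fd;
    }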

After fixing this in my test program, all problems disappeared when
keeping the send and receive buffers the same on both sides.

However, when setting the send and receive buffers only on the client
socket (not on the accepted or listen socket), epoll_wait() still
stalls 43 ms per call when SO_SNDBUF is smaller than 33182 bytes.

Here is the latest version of my test program:

https://github.com/CarloWood/ai-evio-testsuite/blob/master/src/epoll_bug.c

I have to retract most of my "bug" report; it might not really be a
bug at all... but nevertheless, what remains strange is that setting
the socket buffer sizes on the accepted sockets can have such a
crippling effect, while the quoted website states:

    Setting this option for the connected socket will have no effect
    whatsoever on the possible window scale option because accept does
    not return with the connected socket until TCP's three-way
    handshake is complete.

And when setting the socket buffer sizes only for the client socket
(which I use to send back the received data, so it is the sending
side now), why does epoll_wait() stall 43 ms per call while the
receiving side is using the default (much larger) socket buffer
sizes?

That 43 ms is STILL crippling, slowing the transmission of the data
down to a trickle compared to what it should be.
 
-- 
Carlo Wood <carlo@alinoe.com>


* Re: Kernel BUG: epoll_wait() (and epoll_pwait) stall for 206 ms per call on sockets with a small-ish snd/rcv buffer.
  2019-07-06 18:19   ` Carlo Wood
@ 2019-07-08 20:11     ` Neal Cardwell
  0 siblings, 0 replies; 4+ messages in thread
From: Neal Cardwell @ 2019-07-08 20:11 UTC (permalink / raw)
  To: Carlo Wood; +Cc: David Miller, Netdev

On Sat, Jul 6, 2019 at 2:19 PM Carlo Wood <carlo@alinoe.com> wrote:
>
> While investigating this further, I read on
> http://www.masterraghu.com/subjects/np/introduction/unix_network_programming_v1.3/ch07lev1sec5.html
> under "SO_RCVBUF and SO_SNDBUF Socket Options":
>
>     When setting the size of the TCP socket receive buffer, the
>     ordering of the function calls is important. This is because of
>     TCP's window scale option (Section 2.6), which is exchanged with
>     the peer on the SYN segments when the connection is established.
>     For a client, this means the SO_RCVBUF socket option must be set
>     before calling connect. For a server, this means the socket option
>     must be set for the listening socket before calling listen. Setting
>     this option for the connected socket will have no effect whatsoever
>     on the possible window scale option because accept does not return
>     with the connected socket until TCP's three-way handshake is
>     complete. That is why this option must be set for the listening
>     socket. (The sizes of the socket buffers are always inherited from
>     the listening socket by the newly created connected socket: pp.
>     462–463 of TCPv2.)
>
> As mentioned in a previous post, I had already discovered the need to
> set the socket buffers before connect(), but I didn't know about
> setting them before the call to listen() in order to have the buffer
> sizes inherited by the accepted sockets.
>
> After fixing this in my test program, all problems disappeared when
> keeping the send and receive buffers the same on both sides.
>
> However, when setting the send and receive buffers only on the client
> socket (not on the accepted or listen socket), epoll_wait() still
> stalls 43 ms per call when SO_SNDBUF is smaller than 33182 bytes.
>
> Here is the latest version of my test program:
>
> https://github.com/CarloWood/ai-evio-testsuite/blob/master/src/epoll_bug.c
>
> I have to retract most of my "bug" report; it might not really be a
> bug at all... but nevertheless, what remains strange is that setting
> the socket buffer sizes on the accepted sockets can have such a
> crippling effect, while the quoted website states:
>
>     Setting this option for the connected socket will have no effect
>     whatsoever on the possible window scale option because accept does
>     not return with the connected socket until TCP's three-way
>     handshake is complete.
>
> And when setting the socket buffer sizes only for the client socket
> (which I use to send back the received data, so it is the sending
> side now), why does epoll_wait() stall 43 ms per call while the
> receiving side is using the default (much larger) socket buffer
> sizes?
>
> That 43 ms is STILL crippling, slowing the transmission of the data
> down to a trickle compared to what it should be.

Based on the magic numbers you cite, including the fact that this
test program seems to send traffic over a loopback device (presumably
MTU=65536), that epoll_wait() stalls for 43 ms (slightly longer than
the typical Linux delayed ACK timer), and that the problem only
happens if SO_SNDBUF is smaller than 33182 bytes (half the MTU), a
guess would be the following. By making the SO_SNDBUF artificially
that low, you are asking the kernel to only allow your sending socket
to buffer less than a single MTU of data, which means that the
packets the sender sends are going to be smaller than the MTU. The
receiver may then tend to eventually (after the initial quick ACKs
expire) delay its ACKs in the hope of receiving a full MSS of data
(see __tcp_ack_snd_check()). And since the SO_SNDBUF is so small in
this case, the sender cannot write() or transmit anything else until
the receiver sends a delayed ACK for that data ~40ms later, allowing
the sending socket to free the previously sent data and trigger the
sender's next EPOLLOUT and write().

You could try grabbing a packet capture of the traffic over your
loopback device during the test to see if it matches that theory:
  tcpdump  -i lo -s 100 -w /tmp/test.pcap
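
If you want to poke at this from the application side as well, one
rough idea (just a sketch, not something I have tested against your
program): have the receiving side keep requesting immediate ACKs with
TCP_QUICKACK. On Linux this flag is not permanent and gets cleared
again, so it has to be re-set around the reads; if the stalls
disappear when the receiver does this, that would also point at
delayed ACKs.

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* Sketch: ask the kernel to ACK immediately on this socket.
       TCP_QUICKACK is only a temporary hint, so call this again
       after every read(). */
    static void enable_quickack(int fd)
    {
      int one = 1;
      setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
    }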

cheers,
neal

