netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jiri Slaby <jirislaby@kernel.org>
To: Eric Dumazet <edumazet@google.com>,
	"David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org, Soheil Hassas Yeganeh <soheil@google.com>,
	Neal Cardwell <ncardwell@google.com>,
	Yuchung Cheng <ycheng@google.com>,
	eric.dumazet@gmail.com
Subject: Re: [PATCH net-next] tcp: get rid of sysctl_tcp_adv_win_scale
Date: Fri, 3 Nov 2023 08:07:21 +0100	[thread overview]
Message-ID: <875400ba-58d8-44a0-8fe9-334e322bd1db@kernel.org> (raw)
In-Reply-To: <c798f412-ac14-4997-9431-c98d1b8e16d8@kernel.org>

On 03. 11. 23, 7:56, Jiri Slaby wrote:
> On 03. 11. 23, 7:10, Jiri Slaby wrote:
>> On 17. 07. 23, 17:29, Eric Dumazet wrote:
>>> With modern NIC drivers shifting to full page allocations per
>>> received frame, we face the following issue:
>>>
>>> TCP has one per-netns sysctl used to tweak how to translate
>>> a memory use into an expected payload (RWIN), in RX path.
>>>
>>> tcp_win_from_space() implementation is limited to few cases.
>>>
>>> For hosts dealing with various MSS, we either under estimate
>>> or over estimate the RWIN we send to the remote peers.
>>>
>>> For instance with the default sysctl_tcp_adv_win_scale value,
>>> we expect to store 50% of payload per allocated chunk of memory.
>>>
>>> For the typical use of MTU=1500 traffic, and order-0 pages allocations
>>> by NIC drivers, we are sending too big RWIN, leading to potential
>>> tcp collapse operations, which are extremely expensive and source
>>> of latency spikes.
>>>
>>> This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
>>> uses a per socket scaling factor, so that we can precisely
>>> adjust the RWIN based on effective skb->len/skb->truesize ratio.
>>>
>>> This patch alone can double TCP receive performance when receivers
>>> are too slow to drain their receive queue, or by allowing
>>> a bigger RWIN when MSS is close to PAGE_SIZE.
>>
>> Hi,
>>
>> I bisected a python-eventlet test failure:
>>  > =================================== FAILURES 
>> ===================================
>>  > _______________________ TestGreenSocket.test_full_duplex 
>> _______________________
>>  >
>>  > self = <tests.greenio_test.TestGreenSocket 
>> testMethod=test_full_duplex>
>>  >
>>  >     def test_full_duplex(self):
>>  > ...
>>  > >       large_evt.wait()
>>  >
>>  > tests/greenio_test.py:424:
>>  > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
>> _ _ _ _ _ _
>>  > eventlet/greenthread.py:181: in wait
>>  >     return self._exit_event.wait()
>>  > eventlet/event.py:125: in wait
>>  >     result = hub.switch()
>> ...
>>  > E       tests.TestIsTakingTooLong: 1
>>  >
>>  > eventlet/hubs/hub.py:313: TestIsTakingTooLong
>>
>> to this commit. With the commit, the test takes > 1.5 s. Without the 
>> commit it takes only < 300 ms. And they set timeout to 1 s.
>>
>> The reduced self-stadning test case:
>> #!/usr/bin/python3
>> import eventlet
>> from eventlet.green import select, socket, time, ssl
>>
>> def bufsized(sock, size=1):
>>      """ Resize both send and receive buffers on a socket.
>>      Useful for testing trampoline.  Returns the socket.
>>
>>      >>> import socket
>>      >>> sock = bufsized(socket.socket(socket.AF_INET, 
>> socket.SOCK_STREAM))
>>      """
>>      sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
>>      sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
>>      return sock
>>
>> def min_buf_size():
>>      """Return the minimum buffer size that the platform supports."""
>>      test_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>      test_sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1)
>>      return test_sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
>>
>> def test_full_duplex():
>>      large_data = b'*' * 10 * min_buf_size()
>>      listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>      listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
>>      listener.bind(('127.0.0.1', 0))
>>      listener.listen(50)
>>      bufsized(listener)
>>
>>      def send_large(sock):
>>          sock.sendall(large_data)
>>
>>      def read_large(sock):
>>          result = sock.recv(len(large_data))
>>          while len(result) < len(large_data):
>>              result += sock.recv(len(large_data))
>>          assert result == large_data
>>
>>      def server():
>>          (sock, addr) = listener.accept()
>>          sock = bufsized(sock)
>>          send_large_coro = eventlet.spawn(send_large, sock)
>>          eventlet.sleep(0)
>>          result = sock.recv(10)
>>          expected = b'hello world'
>>          while len(result) < len(expected):
>>              result += sock.recv(10)
>>          assert result == expected
>>          send_large_coro.wait()
>>
>>      server_evt = eventlet.spawn(server)
>>      client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>      client.connect(('127.0.0.1', listener.getsockname()[1]))
>>      bufsized(client)
>>      large_evt = eventlet.spawn(read_large, client)
>>      eventlet.sleep(0)
>>      client.sendall(b'hello world')
>>      server_evt.wait()
>>      large_evt.wait()
>>      client.close()
>>
>> test_full_duplex()
>>
>> =====================================
>>
>> I speak neither python nor networking, so any ideas :)? Is the test 
>> simply wrong?
> 
> strace -rT -e trace=network:
> 
> GOOD:
>  > 0.000000 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3 
> <0.000063>
>  > 0.000406 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000042>
>  > 0.000097 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [4608], [4]) = 0 
> <0.000012>
>  > 0.000101 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3 
> <0.000015>
>  > 0.000058 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 <0.000009>
>  > 0.000035 bind(3, {sa_family=AF_INET, sin_port=htons(0), 
> sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000027>
>  > 0.000058 listen(3, 50)       = 0 <0.000014>
>  > 0.000029 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
>  > 0.000023 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000008>
>  > 0.000052 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 5 
> <0.000014>
>  > 0.000050 getsockname(3, {sa_family=AF_INET, sin_port=htons(44313), 
> sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 <0.000011>
>  > 0.000037 connect(5, {sa_family=AF_INET, sin_port=htons(44313), 
> sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in 
> progress) <0.000070>
>  > 0.000210 accept4(3, {sa_family=AF_INET, sin_port=htons(56062), 
> sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = 6 <0.000012>
>  > 0.000040 getsockname(6, {sa_family=AF_INET, sin_port=htons(44313), 
> sin_addr=inet_addr("127.0.0.1")}, [128 => 16]) = 0 <0.000007>
>  > 0.000062 setsockopt(6, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000007>
>  > 0.000020 setsockopt(6, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000007>
>  > 0.000082 getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000007>
>  > 0.000023 connect(5, {sa_family=AF_INET, sin_port=htons(44313), 
> sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000008>
>  > 0.000022 setsockopt(5, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000020>
>  > 0.000036 setsockopt(5, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000007>
>  > 0.000061 sendto(6, "********************************"..., 46080, 0, 
> NULL, 0) = 32768 <0.000049>
>  > 0.000135 sendto(6, "********************************"..., 13312, 0, 
> NULL, 0) = 13312 <0.000017>
>  > 0.000087 recvfrom(6, 0x7f78e58af890, 10, 0, NULL, NULL) = -1 EAGAIN 
> (Resource temporarily unavailable) <0.000010>
>  > 0.000125 recvfrom(5, "********************************"..., 46080, 0, 
> NULL, NULL) = 32768 <0.000032>
>  > 0.000066 recvfrom(5, 0x55fdb41bb880, 46080, 0, NULL, NULL) = -1 
> EAGAIN (Resource temporarily unavailable) <0.000011>
>  > 0.000075 sendto(5, "hello world", 11, 0, NULL, 0) = 11 <0.000023>
>  > 0.000117 recvfrom(6, "hello worl", 10, 0, NULL, NULL) = 10 <0.000015>
>  > 0.000050 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000011>
>  > 0.000212 recvfrom(5, "********************************"..., 46080, 0, 
> NULL, NULL) = 13312 <0.000019>
>  > 0.050676 +++ exited with 0 +++
> 
> 
> BAD:
>  > 0.000000 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3 
> <0.000045>
>  > 0.000244 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000015>
>  > 0.000057 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [4608], [4]) = 0 
> <0.000013>
>  > 0.000104 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3 
> <0.000016>
>  > 0.000065 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 <0.000010>
>  > 0.000038 bind(3, {sa_family=AF_INET, sin_port=htons(0), 
> sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000031>
>  > 0.000068 listen(3, 50)       = 0 <0.000014>
>  > 0.000032 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000010>
>  > 0.000030 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000018>
>  > 0.000060 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 5 
> <0.000023>
>  > 0.000071 getsockname(3, {sa_family=AF_INET, sin_port=htons(45901), 
> sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 <0.000019>
>  > 0.000068 connect(5, {sa_family=AF_INET, sin_port=htons(45901), 
> sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in 
> progress) <0.000074>
>  > 0.000259 accept4(3, {sa_family=AF_INET, sin_port=htons(35002), 
> sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = 6 <0.000014>
>  > 0.000051 getsockname(6, {sa_family=AF_INET, sin_port=htons(45901), 
> sin_addr=inet_addr("127.0.0.1")}, [128 => 16]) = 0 <0.000010>
>  > 0.000082 setsockopt(6, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
>  > 0.000040 setsockopt(6, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000009>
>  > 0.000104 getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000009>
>  > 0.000028 connect(5, {sa_family=AF_INET, sin_port=htons(45901), 
> sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000009>
>  > 0.000026 setsockopt(5, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
>  > 0.000024 setsockopt(5, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000008>
>  > 0.000071 sendto(6, "********************************"..., 46080, 0, 
> NULL, 0) = 16640 <0.000026>
>  > 0.000117 sendto(6, "********************************"..., 29440, 0, 
> NULL, 0) = 16640 <0.000017>
>  > 0.000041 sendto(6, "********************************"..., 12800, 0, 
> NULL, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000009>
>  > 0.000075 recvfrom(6, 0x7f4db88a38c0, 10, 0, NULL, NULL) = -1 EAGAIN 
> (Resource temporarily unavailable) <0.000010>
>  > 0.000086 recvfrom(5, "********************************"..., 46080, 0, 
> NULL, NULL) = 16640 <0.000018>
>  > 0.000044 recvfrom(5, 0x55a64d59b2a0, 46080, 0, NULL, NULL) = -1 
> EAGAIN (Resource temporarily unavailable) <0.000009>
>  > 0.000059 sendto(5, "hello world", 11, 0, NULL, 0) = 11 <0.000018>
>  > 0.000093 recvfrom(6, "hello worl", 10, 0, NULL, NULL) = 10 <0.000009>
>  > 0.000029 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000009>
>  > 0.206685 recvfrom(5, "********************************"..., 46080, 0, 
> NULL, NULL) = 16640 <0.000116>
>  > 0.000306 recvfrom(5, 0x55a64d5a7600, 46080, 0, NULL, NULL) = -1 
> EAGAIN (Resource temporarily unavailable) <0.000013>
>  > 0.000208 sendto(6, "********************************"..., 12800, 0, 
> NULL, 0) = 12800 <0.000025>
>  > 0.206317 recvfrom(5, "********************************"..., 46080, 0, 
> NULL, NULL) = 2304 <0.000171>
>  > 0.000304 recvfrom(5, 0x55a64d597170, 46080, 0, NULL, NULL) = -1 
> EAGAIN (Resource temporarily unavailable) <0.000029>
>  > 0.206161 recvfrom(5, "********************************"..., 46080, 0, 
> NULL, NULL) = 2304 <0.000082>
>  > 0.000212 recvfrom(5, 0x55a64d5a0ed0, 46080, 0, NULL, NULL) = -1 
> EAGAIN (Resource temporarily unavailable) <0.000034>
>  > 0.206572 recvfrom(5, "********************************"..., 46080, 0, 
> NULL, NULL) = 2304 <0.000146>
>  > 0.000274 recvfrom(5, 0x55a64d597170, 46080, 0, NULL, NULL) = -1 
> EAGAIN (Resource temporarily unavailable) <0.000029>
>  > 0.206604 recvfrom(5, "********************************"..., 46080, 0, 
> NULL, NULL) = 2304 <0.000162>
>  > 0.000270 recvfrom(5, 0x55a64d5a20d0, 46080, 0, NULL, NULL) = -1 
> EAGAIN (Resource temporarily unavailable) <0.000016>
>  > 0.206164 recvfrom(5, "********************************"..., 46080, 0, 
> NULL, NULL) = 2304 <0.000116>
>  > 0.000291 recvfrom(5, "********************************"..., 46080, 0, 
> NULL, NULL) = 1280 <0.000038>
>  > 0.052224 +++ exited with 0 +++
> 
> I.e. recvfrom() returns -EAGAIN and takes 200 ms.

Ah, no, those 200 ms are not spent in recvfrom() but in poll:
 > 0.000029 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000009>
 > 0.000040 epoll_wait(4, [{events=EPOLLIN, data={u32=5, 
u64=139942919405573}}], 1023, 60000) = 1 <0.204038>
 > 0.204258 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b4049c) = 0 <0.000045>
 > 0.000104 recvfrom(5, "********************************"..., 46080, 0, 
NULL, NULL) = 16640 <0.000078>
 > 0.000264 recvfrom(5, 0x5603425f3550, 46080, 0, NULL, NULL) = -1 
EAGAIN (Resource temporarily unavailable) <0.000025>
 > 0.000127 epoll_ctl(4, EPOLL_CTL_ADD, 5, 
{events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5, 
u64=94570884890629}}) = 0 <0.000031>
 > 0.000112 epoll_wait(4, [{events=EPOLLOUT, data={u32=6, 
u64=139942919405574}}], 1023, 60000) = 1 <0.000026>
 > 0.000083 epoll_ctl(4, EPOLL_CTL_DEL, 6, 0x7ffd49b404fc) = 0 <0.000025>
 > 0.000063 sendto(6, "********************************"..., 12800, 0, 
NULL, 0) = 12800 <0.000028>
 > 0.000226 epoll_wait(4, [], 1023, 0) = 0 <0.000007>
 > 0.000029 epoll_wait(4, [{events=EPOLLIN, data={u32=5, 
u64=94570884890629}}], 1023, 60000) = 1 <0.205476>
 > 0.205728 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000077>
 > 0.000157 recvfrom(5, "********************************"..., 46080, 0, 
NULL, NULL) = 2304 <0.000066>
 > 0.000180 recvfrom(5, 0x5603425e30c0, 46080, 0, NULL, NULL) = -1 
EAGAIN (Resource temporarily unavailable) <0.000026>
 > 0.000139 epoll_ctl(4, EPOLL_CTL_ADD, 5, 
{events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5, u64=5}}) = 0 
<0.000030>
 > 0.000104 epoll_wait(4, [{events=EPOLLIN, data={u32=5, u64=5}}], 1023, 
60000) = 1 <0.205881>
 > 0.206222 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000086>
 > 0.000189 recvfrom(5, "********************************"..., 46080, 0, 
NULL, NULL) = 2304 <0.000027>
 > 0.000153 recvfrom(5, 0x5603425ece20, 46080, 0, NULL, NULL) = -1 
EAGAIN (Resource temporarily unavailable) <0.000017>
 > 0.000074 epoll_ctl(4, EPOLL_CTL_ADD, 5, 
{events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5, u64=5}}) = 0 
<0.000018>
 > 0.000067 epoll_wait(4, [{events=EPOLLIN, data={u32=5, u64=5}}], 1023, 
60000) = 1 <0.205749>
 > 0.206088 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000197>
 > 0.000391 recvfrom(5, "********************************"..., 46080, 0, 
NULL, NULL) = 2304 <0.000080>
 > 0.000210 recvfrom(5, 0x5603425e30c0, 46080, 0, NULL, NULL) = -1 
EAGAIN (Resource temporarily unavailable) <0.000027>
 > 0.000174 epoll_ctl(4, EPOLL_CTL_ADD, 5, 
{events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5, u64=5}}) = 0 
<0.000031>
 > 0.000127 epoll_wait(4, [{events=EPOLLIN, data={u32=5, u64=5}}], 1023, 
60000) = 1 <0.205471>
 > 0.205783 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000205>


-- 
js
suse labs


  reply	other threads:[~2023-11-03  7:07 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-17 15:29 [PATCH net-next] tcp: get rid of sysctl_tcp_adv_win_scale Eric Dumazet
2023-07-17 16:52 ` Soheil Hassas Yeganeh
2023-07-17 17:17   ` Eric Dumazet
2023-07-17 17:20     ` Soheil Hassas Yeganeh
2023-07-19  2:10 ` patchwork-bot+netdevbpf
2023-07-20 15:41 ` Paolo Abeni
2023-07-20 15:50   ` Eric Dumazet
2023-11-03  6:10 ` Jiri Slaby
2023-11-03  6:56   ` Jiri Slaby
2023-11-03  7:07     ` Jiri Slaby [this message]
2023-11-03  8:17       ` Eric Dumazet
2023-11-03  9:27         ` Michal Kubecek
2023-11-03  9:53           ` Eric Dumazet
2023-11-03 10:14         ` Jiri Slaby
2023-11-03 10:27           ` Eric Dumazet
2023-11-03 11:07             ` Jiri Slaby
2024-04-02 15:28 ` shironeko
2024-04-02 15:50   ` Eric Dumazet
2024-04-02 16:23     ` Eric Dumazet
2024-04-02 16:29       ` shironeko
2024-04-06  0:22         ` shironeko
2024-04-06  6:11           ` Eric Dumazet
2024-04-06  7:07             ` Eric Dumazet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=875400ba-58d8-44a0-8fe9-334e322bd1db@kernel.org \
    --to=jirislaby@kernel.org \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=eric.dumazet@gmail.com \
    --cc=kuba@kernel.org \
    --cc=ncardwell@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=soheil@google.com \
    --cc=ycheng@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).