* [PATCH] net: do not set SOCK_RCVBUF_LOCK if sk_rcvbuf isn't reduced
@ 2022-02-15 10:36 kerneljasonxing
  2022-02-15 15:25 ` Eric Dumazet
From: kerneljasonxing @ 2022-02-15 10:36 UTC
To: davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, edumazet, pabeni, weiwan, aahringo,
	yangbo.lu, fw, xiangxia.m.yue, tglx
Cc: netdev, linux-kernel, bpf, kerneljasonxing, Jason Xing

From: Jason Xing <xingwanli@kuaishou.com>

Normally, user doesn't care the logic behind the kernel if they're
trying to set receive buffer via setsockopt. However, if the new value
of the receive buffer is not smaller than the initial value which is
sysctl_tcp_rmem[1] implemented in tcp_rcv_space_adjust(), the server's
wscale will shrink and then lead to the bad bandwidth. I think it is
not appropriate.

Here are some numbers:
$ sysctl -a | grep rmem
net.core.rmem_default = 212992
net.core.rmem_max = 40880000
net.ipv4.tcp_rmem = 4096	425984	40880000

Case 1
on the server side
    # iperf -s -p 5201
on the client side
    # iperf -c [client ip] -p 5201
It turns out that the bandwidth is 9.34 Gbits/sec while the wscale of
server side is 10. It's good.

Case 2
on the server side
    # iperf -s -p 5201 -w 425984
on the client side
    # iperf -c [client ip] -p 5201
It turns out that the bandwidth is reduced to 2.73 Gbits/sec while the
wscale is 2, even though the receive buffer is not changed at all at the
very beginning.

Therefore, I added one condition where only user is trying to set a
smaller rx buffer. After this patch is applied, the bandwidth of case 2
is recovered to 9.34 Gbits/sec.

Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")
Signed-off-by: Jason Xing <xingwanli@kuaishou.com>
---
 net/core/filter.c | 7 ++++---
 net/core/sock.c   | 8 +++++---
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 4603b7c..99f5d9c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4795,9 +4795,10 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 		case SO_RCVBUF:
 			val = min_t(u32, val, sysctl_rmem_max);
 			val = min_t(int, val, INT_MAX / 2);
-			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
-			WRITE_ONCE(sk->sk_rcvbuf,
-				   max_t(int, val * 2, SOCK_MIN_RCVBUF));
+			val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
+			if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
+				sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+			WRITE_ONCE(sk->sk_rcvbuf, val);
 			break;
 		case SO_SNDBUF:
 			val = min_t(u32, val, sysctl_wmem_max);
diff --git a/net/core/sock.c b/net/core/sock.c
index 4ff806d..e5e9cb0 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -923,8 +923,6 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
 	 * as a negative value.
 	 */
 	val = min_t(int, val, INT_MAX / 2);
-	sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
-
 	/* We double it on the way in to account for "struct sk_buff" etc.
 	 * overhead. Applications assume that the SO_RCVBUF setting they make
 	 * will allow that much actual data to be received on that socket.
@@ -935,7 +933,11 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
 	 * And after considering the possible alternatives, returning the value
 	 * we actually used in getsockopt is the most desirable behavior.
 	 */
-	WRITE_ONCE(sk->sk_rcvbuf, max_t(int, val * 2, SOCK_MIN_RCVBUF));
+	val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
+	if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
+		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+
+	WRITE_ONCE(sk->sk_rcvbuf, val);
 }
 
 void sock_set_rcvbuf(struct sock *sk, int val)
-- 
1.8.3.1
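For readers following the locking decision in this patch, the sketch below models it in plain userspace C. It is not kernel code and only mirrors the arithmetic visible in the diff. The names `rcvbuf_model`, `model_effective_rcvbuf()`, `model_set_rcvbuf_old()` and `model_set_rcvbuf_patched()` are hypothetical, and `MODEL_SOCK_MIN_RCVBUF` is an assumed placeholder (the real SOCK_MIN_RCVBUF depends on sizeof(struct sk_buff)). The model assumes a non-negative request and takes rmem_max and tcp_rmem[1] as ordinary parameters.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Assumed placeholder for the kernel's SOCK_MIN_RCVBUF (not the real value). */
#define MODEL_SOCK_MIN_RCVBUF 2304

struct rcvbuf_model {
	int  rcvbuf;  /* models sk->sk_rcvbuf */
	bool locked;  /* models SOCK_RCVBUF_LOCK being set in sk->sk_userlocks */
};

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* Clamp and double the request, as both the old and the patched code do. */
static int model_effective_rcvbuf(int val, int rmem_max)
{
	assert(val >= 0);                 /* the model ignores negative requests */
	val = min_int(val, rmem_max);     /* cap at net.core.rmem_max */
	val = min_int(val, INT_MAX / 2);  /* keep val * 2 from overflowing */
	return max_int(val * 2, MODEL_SOCK_MIN_RCVBUF);
}

/* Behaviour before the patch: every SO_RCVBUF call sets the lock. */
static struct rcvbuf_model model_set_rcvbuf_old(int val, int rmem_max)
{
	struct rcvbuf_model m;

	m.rcvbuf = model_effective_rcvbuf(val, rmem_max);
	m.locked = true;
	return m;
}

/* Behaviour proposed by the patch: lock only if the doubled value is
 * below tcp_rmem[1] (passed in here as tcp_rmem_default).
 */
static struct rcvbuf_model model_set_rcvbuf_patched(int val, int rmem_max,
						    int tcp_rmem_default)
{
	struct rcvbuf_model m;

	m.rcvbuf = model_effective_rcvbuf(val, rmem_max);
	m.locked = m.rcvbuf < tcp_rmem_default;
	return m;
}
```

With the sysctl values quoted in the patch description, a request of 425984 gives an effective buffer of 851968 in both variants. The old path sets the lock; the patched path leaves it clear, because the comparison uses the doubled value and 851968 is not below tcp_rmem[1] = 425984.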
* Re: [PATCH] net: do not set SOCK_RCVBUF_LOCK if sk_rcvbuf isn't reduced
  2022-02-15 10:36 [PATCH] net: do not set SOCK_RCVBUF_LOCK if sk_rcvbuf isn't reduced kerneljasonxing
@ 2022-02-15 15:25 ` Eric Dumazet
  2022-02-15 16:00   ` Jason Xing
From: Eric Dumazet @ 2022-02-15 15:25 UTC
To: Jason Xing
Cc: David Miller, Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Paolo Abeni, Wei Wang, Alexander Aring,
	Yangbo Lu, Florian Westphal, Tonghao Zhang, Thomas Gleixner,
	netdev, LKML, bpf, Jason Xing

On Tue, Feb 15, 2022 at 2:37 AM <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <xingwanli@kuaishou.com>
>
> Normally, user doesn't care the logic behind the kernel if they're
> trying to set receive buffer via setsockopt. However, if the new value
> of the receive buffer is not smaller than the initial value which is
> sysctl_tcp_rmem[1] implemented in tcp_rcv_space_adjust(), the server's
> wscale will shrink and then lead to the bad bandwidth. I think it is
> not appropriate.

Then do not use SO_RCVBUF ?

It is working as intended really.

>
> Here are some numbers:
> $ sysctl -a | grep rmem
> net.core.rmem_default = 212992
> net.core.rmem_max = 40880000
> net.ipv4.tcp_rmem = 4096	425984	40880000
>
> Case 1
> on the server side
>     # iperf -s -p 5201
> on the client side
>     # iperf -c [client ip] -p 5201
> It turns out that the bandwidth is 9.34 Gbits/sec while the wscale of
> server side is 10. It's good.
>
> Case 2
> on the server side
>     # iperf -s -p 5201 -w 425984
> on the client side
>     # iperf -c [client ip] -p 5201
> It turns out that the bandwidth is reduced to 2.73 Gbits/sec while the
> wscale is 2, even though the receive buffer is not changed at all at the
> very beginning.

Great, you discovered auto tuning is working as intended.

>
> Therefore, I added one condition where only user is trying to set a
> smaller rx buffer. After this patch is applied, the bandwidth of case 2
> is recovered to 9.34 Gbits/sec.
>
> Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")

This commit has nothing to do with your patch or feature.

> Signed-off-by: Jason Xing <xingwanli@kuaishou.com>
> ---
>  net/core/filter.c | 7 ++++---
>  net/core/sock.c   | 8 +++++---
>  2 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 4603b7c..99f5d9c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4795,9 +4795,10 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
>  		case SO_RCVBUF:
>  			val = min_t(u32, val, sysctl_rmem_max);
>  			val = min_t(int, val, INT_MAX / 2);
> -			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> -			WRITE_ONCE(sk->sk_rcvbuf,
> -				   max_t(int, val * 2, SOCK_MIN_RCVBUF));
> +			val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> +			if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> +				sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> +			WRITE_ONCE(sk->sk_rcvbuf, val);
>  			break;
>  		case SO_SNDBUF:
>  			val = min_t(u32, val, sysctl_wmem_max);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 4ff806d..e5e9cb0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -923,8 +923,6 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
>  	 * as a negative value.
>  	 */
>  	val = min_t(int, val, INT_MAX / 2);
> -	sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> -
>  	/* We double it on the way in to account for "struct sk_buff" etc.
>  	 * overhead. Applications assume that the SO_RCVBUF setting they make
>  	 * will allow that much actual data to be received on that socket.
> @@ -935,7 +933,11 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
>  	 * And after considering the possible alternatives, returning the value
>  	 * we actually used in getsockopt is the most desirable behavior.
>  	 */
> -	WRITE_ONCE(sk->sk_rcvbuf, max_t(int, val * 2, SOCK_MIN_RCVBUF));
> +	val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> +	if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> +		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> +
> +	WRITE_ONCE(sk->sk_rcvbuf, val);
>  }
>
>  void sock_set_rcvbuf(struct sock *sk, int val)

You are breaking applications that want to set sk->sk_rcvbuf to a fixed value,
to control memory usage on millions of active sockets in a host.

I think that you want new functionality, with new SO_ socket options,
targeting net-next tree (No spurious Fixes: tag)

For instance letting an application set or unset SOCK_RCVBUF_LOCK
would be more useful, and would not break applications.
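The two review points above, that the lock is what keeps auto-tuning from growing the buffer and that a fixed size bounds memory across many sockets, can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `sock_model`, `MODEL_RCVBUF_LOCK` and every `model_*` function are hypothetical names, and `wanted` stands in for whatever size the real tcp_rcv_space_adjust() would derive from its own measurements. The explicit unlock step models the kind of opt-out suggested in the review, not an existing interface.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MODEL_RCVBUF_LOCK 0x2u  /* hypothetical stand-in for SOCK_RCVBUF_LOCK */

struct sock_model {
	int      rcvbuf;     /* models sk->sk_rcvbuf, in bytes */
	unsigned userlocks;  /* models sk->sk_userlocks */
};

/* SO_RCVBUF as reviewed: store the doubled size and set the lock. */
static void model_set_rcvbuf(struct sock_model *sk, int bytes)
{
	assert(bytes > 0 && bytes <= INT_MAX / 2);
	sk->rcvbuf = bytes * 2;
	sk->userlocks |= MODEL_RCVBUF_LOCK;
}

/* Hypothetical opt-out: the application clears the lock again. */
static void model_unlock_rcvbuf(struct sock_model *sk)
{
	sk->userlocks &= ~MODEL_RCVBUF_LOCK;
}

/* One auto-tuning step: grow towards `wanted`, capped at tcp_rmem[2],
 * but never touch a locked buffer. Returns true if the buffer grew.
 */
static bool model_autotune(struct sock_model *sk, int wanted, int tcp_rmem_max)
{
	if (sk->userlocks & MODEL_RCVBUF_LOCK)
		return false;
	if (wanted > tcp_rmem_max)
		wanted = tcp_rmem_max;
	if (wanted <= sk->rcvbuf)
		return false;
	sk->rcvbuf = wanted;
	return true;
}

/* Scenario helper: set the size, optionally unlock, run one tuning step. */
static int model_rcvbuf_after(int requested, bool unlock, int wanted,
			      int tcp_rmem_max)
{
	struct sock_model sk = { 0, 0 };

	model_set_rcvbuf(&sk, requested);
	if (unlock)
		model_unlock_rcvbuf(&sk);
	model_autotune(&sk, wanted, tcp_rmem_max);
	return sk.rcvbuf;
}

/* Upper bound on receive-buffer memory if every socket fills its buffer. */
static unsigned long long model_worst_case_bytes(unsigned long long nsockets,
						 int rcvbuf)
{
	return nsockets * (unsigned long long)rcvbuf;
}
```

With the numbers from the patch description, a locked socket stays at 851968 bytes however much the tuner wants, so one million such sockets are bounded by 851968000000 bytes. Once unlocked, the same socket can grow to tcp_rmem[2] = 40880000 bytes, which is the memory bound applications lose if the lock is skipped silently.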
* Re: [PATCH] net: do not set SOCK_RCVBUF_LOCK if sk_rcvbuf isn't reduced
  2022-02-15 15:25 ` Eric Dumazet
@ 2022-02-15 16:00   ` Jason Xing
From: Jason Xing @ 2022-02-15 16:00 UTC
To: Eric Dumazet
Cc: David Miller, Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Paolo Abeni, Wei Wang, Alexander Aring,
	Yangbo Lu, Florian Westphal, Tonghao Zhang, Thomas Gleixner,
	netdev, LKML, bpf, Jason Xing

On Tue, Feb 15, 2022 at 11:25 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Feb 15, 2022 at 2:37 AM <kerneljasonxing@gmail.com> wrote:
> >
> > From: Jason Xing <xingwanli@kuaishou.com>
> >
> > Normally, user doesn't care the logic behind the kernel if they're
> > trying to set receive buffer via setsockopt. However, if the new value
> > of the receive buffer is not smaller than the initial value which is
> > sysctl_tcp_rmem[1] implemented in tcp_rcv_space_adjust(), the server's
> > wscale will shrink and then lead to the bad bandwidth. I think it is
> > not appropriate.
>
> Then do not use SO_RCVBUF ?
>
> It is working as intended really.
>
> >
> > Here are some numbers:
> > $ sysctl -a | grep rmem
> > net.core.rmem_default = 212992
> > net.core.rmem_max = 40880000
> > net.ipv4.tcp_rmem = 4096	425984	40880000
> >
> > Case 1
> > on the server side
> >     # iperf -s -p 5201
> > on the client side
> >     # iperf -c [client ip] -p 5201
> > It turns out that the bandwidth is 9.34 Gbits/sec while the wscale of
> > server side is 10. It's good.
> >
> > Case 2
> > on the server side
> >     # iperf -s -p 5201 -w 425984
> > on the client side
> >     # iperf -c [client ip] -p 5201
> > It turns out that the bandwidth is reduced to 2.73 Gbits/sec while the
> > wscale is 2, even though the receive buffer is not changed at all at the
> > very beginning.
>
> Great, you discovered auto tuning is working as intended.
>

Thanks.

> >
> > Therefore, I added one condition where only user is trying to set a
> > smaller rx buffer. After this patch is applied, the bandwidth of case 2
> > is recovered to 9.34 Gbits/sec.
> >
> > Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")
>
> This commit has nothing to do with your patch or feature.
>
> > Signed-off-by: Jason Xing <xingwanli@kuaishou.com>
> > ---
> >  net/core/filter.c | 7 ++++---
> >  net/core/sock.c   | 8 +++++---
> >  2 files changed, 9 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 4603b7c..99f5d9c 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -4795,9 +4795,10 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
> >  		case SO_RCVBUF:
> >  			val = min_t(u32, val, sysctl_rmem_max);
> >  			val = min_t(int, val, INT_MAX / 2);
> > -			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> > -			WRITE_ONCE(sk->sk_rcvbuf,
> > -				   max_t(int, val * 2, SOCK_MIN_RCVBUF));
> > +			val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> > +			if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> > +				sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> > +			WRITE_ONCE(sk->sk_rcvbuf, val);
> >  			break;
> >  		case SO_SNDBUF:
> >  			val = min_t(u32, val, sysctl_wmem_max);
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index 4ff806d..e5e9cb0 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -923,8 +923,6 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
> >  	 * as a negative value.
> >  	 */
> >  	val = min_t(int, val, INT_MAX / 2);
> > -	sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> > -
> >  	/* We double it on the way in to account for "struct sk_buff" etc.
> >  	 * overhead. Applications assume that the SO_RCVBUF setting they make
> >  	 * will allow that much actual data to be received on that socket.
> > @@ -935,7 +933,11 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
> >  	 * And after considering the possible alternatives, returning the value
> >  	 * we actually used in getsockopt is the most desirable behavior.
> >  	 */
> > -	WRITE_ONCE(sk->sk_rcvbuf, max_t(int, val * 2, SOCK_MIN_RCVBUF));
> > +	val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> > +	if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> > +		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> > +
> > +	WRITE_ONCE(sk->sk_rcvbuf, val);
> >  }
> >
> >  void sock_set_rcvbuf(struct sock *sk, int val)
>
> You are breaking applications that want to set sk->sk_rcvbuf to a fixed value,
> to control memory usage on millions of active sockets in a host.
>
> I think that you want new functionality, with new SO_ socket options,
> targeting net-next tree (No spurious Fixes: tag)
>

I agree with you.

> For instance letting an application set or unset SOCK_RCVBUF_LOCK
> would be more useful, and would not break applications.

I would like to change the title for the v2 patch, since the title of
this version is a little bit misleading, to something like "net: introduce
SO_RCVBUFAUTO to unset SOCK_RCVBUF_LOCK flag".

Eric, what do you think of the new name of this socket option, like
SO_RCVBUFAUTO?

Thanks,
Jason