* wrong smp_mb__after_atomic() in tcp_check_space() ? @ 2017-01-23 14:30 Oleg Nesterov 2017-01-23 16:56 ` Jason Baron 0 siblings, 1 reply; 6+ messages in thread From: Oleg Nesterov @ 2017-01-23 14:30 UTC (permalink / raw) To: Jason Baron, David S. Miller; +Cc: Herbert Xu, Yauheni Kaliuta, netdev Hello, smp_mb__after_atomic() looks wrong and misleading, sock_reset_flag() does the non-atomic __clear_bit() and thus it can not guarantee test_bit(SOCK_NOSPACE) (non-atomic too) won't be reordered. It was added by 3c7151275c0c9a "tcp: add memory barriers to write space paths" and the patch looks correct in that we need the barriers in tcp_check_space() and tcp_poll() in theory, so it seems tcp_check_space() needs smp_mb() ? Oleg. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: wrong smp_mb__after_atomic() in tcp_check_space() ? 2017-01-23 14:30 wrong smp_mb__after_atomic() in tcp_check_space() ? Oleg Nesterov @ 2017-01-23 16:56 ` Jason Baron 2017-01-23 17:18 ` Oleg Nesterov 2017-01-23 18:04 ` Eric Dumazet 0 siblings, 2 replies; 6+ messages in thread From: Jason Baron @ 2017-01-23 16:56 UTC (permalink / raw) To: Oleg Nesterov, David S. Miller Cc: Herbert Xu, Yauheni Kaliuta, netdev, Eric Dumazet On 01/23/2017 09:30 AM, Oleg Nesterov wrote: > Hello, > > smp_mb__after_atomic() looks wrong and misleading, sock_reset_flag() does the > non-atomic __clear_bit() and thus it can not guarantee test_bit(SOCK_NOSPACE) > (non-atomic too) won't be reordered. > Indeed. Here's a bit of discussion on it: http://marc.info/?l=linux-netdev&m=146662325920596&w=2 > It was added by 3c7151275c0c9a "tcp: add memory barriers to write space paths" > and the patch looks correct in that we need the barriers in tcp_check_space() > and tcp_poll() in theory, so it seems tcp_check_space() needs smp_mb() ? > Yes, I think it should be upgraded to an smp_mb() there. If you agree with this analysis, I will send a patch to upgrade it. Note, I did not actually run into this race in practice. Thanks, -Jason ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: wrong smp_mb__after_atomic() in tcp_check_space() ? 2017-01-23 16:56 ` Jason Baron @ 2017-01-23 17:18 ` Oleg Nesterov 2017-01-23 18:04 ` Eric Dumazet 1 sibling, 0 replies; 6+ messages in thread From: Oleg Nesterov @ 2017-01-23 17:18 UTC (permalink / raw) To: Jason Baron Cc: David S. Miller, Herbert Xu, Yauheni Kaliuta, netdev, Eric Dumazet On 01/23, Jason Baron wrote: > > >It was added by 3c7151275c0c9a "tcp: add memory barriers to write space paths" > >and the patch looks correct in that we need the barriers in tcp_check_space() > >and tcp_poll() in theory, so it seems tcp_check_space() needs smp_mb() ? > > > > Yes, I think it should be upgraded to an smp_mb() there. If you agree with > this analysis, I will send a patch to upgrade it. I think this makes sense, if nothing else to remove the obviously wrong smp_mb__after_atomic() ;) > Note, I did not actually > run into this race in practice. Yes, I am not sure it is actually necessary in practice, and even if tcp_check_space() races with tcp_poll() and misses SOCK_NOSPACE it should be probably called again later, but I know nothing about networking code. We noticed this by accident, we have a bug report which really looks as a missed wakeup in epoll, but this kernel is old and in particular it lacks the commit 128dd1759 "epoll: prevent missed events on EPOLL_CTL_MOD" which looks more promising. But we are not sure it can fully explain the problem we can't reproduce, so we are looking for anything else which may contribute to the problem. Oleg. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: wrong smp_mb__after_atomic() in tcp_check_space() ? 2017-01-23 16:56 ` Jason Baron 2017-01-23 17:18 ` Oleg Nesterov @ 2017-01-23 18:04 ` Eric Dumazet 2017-01-23 18:45 ` Jason Baron 2017-01-24 9:18 ` Oleg Nesterov 1 sibling, 2 replies; 6+ messages in thread From: Eric Dumazet @ 2017-01-23 18:04 UTC (permalink / raw) To: Jason Baron Cc: Oleg Nesterov, David S. Miller, Herbert Xu, Yauheni Kaliuta, netdev On Mon, 2017-01-23 at 11:56 -0500, Jason Baron wrote: > On 01/23/2017 09:30 AM, Oleg Nesterov wrote: > > Hello, > > > > smp_mb__after_atomic() looks wrong and misleading, sock_reset_flag() does the > > non-atomic __clear_bit() and thus it can not guarantee test_bit(SOCK_NOSPACE) > > (non-atomic too) won't be reordered. > > > > Indeed. Here's a bit of discussion on it: > http://marc.info/?l=linux-netdev&m=146662325920596&w=2 > > > It was added by 3c7151275c0c9a "tcp: add memory barriers to write space paths" > > and the patch looks correct in that we need the barriers in tcp_check_space() > > and tcp_poll() in theory, so it seems tcp_check_space() needs smp_mb() ? > > > > Yes, I think it should be upgraded to an smp_mb() there. If you agree > with this analysis, I will send a patch to upgrade it. Note, I did not > actually run into this race in practice. SOCK_QUEUE_SHRUNK is used locally in TCP, it is not used by tcp_poll(). (Otherwise it would be using atomic set/clear operations) I do not see obvious reason why we have this smp_mb__after_atomic() in tcp_check_space(). But looking at this code, it seems we lack one barrier if sk_sndbuf is ever increased. Fortunately this almost never happen during TCP session lifetime... diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index bfa165cc455ad0a9aea44964aa663dbe6085aebd..3692e9f4c852cebf8c4d46c141f112e75e4ae66d 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -331,8 +331,13 @@ static void tcp_sndbuf_expand(struct sock *sk) sndmem = ca_ops->sndbuf_expand ? ca_ops->sndbuf_expand(sk) : 2; sndmem *= nr_segs * per_mss; - if (sk->sk_sndbuf < sndmem) + if (sk->sk_sndbuf < sndmem) { sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]); + /* Paired with second sk_stream_is_writeable(sk) + * test from tcp_poll() + */ + smp_wmb(); + } } /* 2. Tuning advertised window (window_clamp, rcv_ssthresh) ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: wrong smp_mb__after_atomic() in tcp_check_space() ? 2017-01-23 18:04 ` Eric Dumazet @ 2017-01-23 18:45 ` Jason Baron 2017-01-24 9:18 ` Oleg Nesterov 1 sibling, 0 replies; 6+ messages in thread From: Jason Baron @ 2017-01-23 18:45 UTC (permalink / raw) To: Eric Dumazet Cc: Oleg Nesterov, David S. Miller, Herbert Xu, Yauheni Kaliuta, netdev On 01/23/2017 01:04 PM, Eric Dumazet wrote: > On Mon, 2017-01-23 at 11:56 -0500, Jason Baron wrote: >> On 01/23/2017 09:30 AM, Oleg Nesterov wrote: >>> Hello, >>> >>> smp_mb__after_atomic() looks wrong and misleading, sock_reset_flag() does the >>> non-atomic __clear_bit() and thus it can not guarantee test_bit(SOCK_NOSPACE) >>> (non-atomic too) won't be reordered. >>> >> >> Indeed. Here's a bit of discussion on it: >> http://marc.info/?l=linux-netdev&m=146662325920596&w=2 >> >>> It was added by 3c7151275c0c9a "tcp: add memory barriers to write space paths" >>> and the patch looks correct in that we need the barriers in tcp_check_space() >>> and tcp_poll() in theory, so it seems tcp_check_space() needs smp_mb() ? >>> >> >> Yes, I think it should be upgraded to an smp_mb() there. If you agree >> with this analysis, I will send a patch to upgrade it. Note, I did not >> actually run into this race in practice. > > SOCK_QUEUE_SHRUNK is used locally in TCP, it is not used by tcp_poll(). > > (Otherwise it would be using atomic set/clear operations) > > I do not see obvious reason why we have this smp_mb__after_atomic() in > tcp_check_space(). > > The idea of the smp_mb__after_atomic() in tcp_check_space() was to ensure that the 'read' of SOCK_NOSPACE there didn't happen before any of the 'write' to make sk_stream_is_writeable() true. Otherwise, we could miss doing the wakeup from tcp_check_space(). There is probably an argument here that there will likely be a subsequent call to tcp_check_space() that will see the SOCK_NOSPACE bit set, but in theory we could have a small send buffer, or a lot of data could be ack'd in one go. What I missed in the original patch was that sock_reset_flag() isn't an atomic operation and thus the smp_mb__after_atomic() is wrong. > But looking at this code, it seems we lack one barrier if sk_sndbuf is > ever increased. Fortunately this almost never happen during TCP session > lifetime... > But the wakeup from sk->sk_write_space(sk) will imply a smp_wmb() as per the comment in __wake_up() ? Thanks, -Jason > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index bfa165cc455ad0a9aea44964aa663dbe6085aebd..3692e9f4c852cebf8c4d46c141f112e75e4ae66d 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -331,8 +331,13 @@ static void tcp_sndbuf_expand(struct sock *sk) > sndmem = ca_ops->sndbuf_expand ? ca_ops->sndbuf_expand(sk) : 2; > sndmem *= nr_segs * per_mss; > > - if (sk->sk_sndbuf < sndmem) > + if (sk->sk_sndbuf < sndmem) { > sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]); > + /* Paired with second sk_stream_is_writeable(sk) > + * test from tcp_poll() > + */ > + smp_wmb(); > + } > } > > /* 2. Tuning advertised window (window_clamp, rcv_ssthresh) > > ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: wrong smp_mb__after_atomic() in tcp_check_space() ? 2017-01-23 18:04 ` Eric Dumazet 2017-01-23 18:45 ` Jason Baron @ 2017-01-24 9:18 ` Oleg Nesterov 1 sibling, 0 replies; 6+ messages in thread From: Oleg Nesterov @ 2017-01-24 9:18 UTC (permalink / raw) To: Eric Dumazet Cc: Jason Baron, David S. Miller, Herbert Xu, Yauheni Kaliuta, netdev On 01/23, Eric Dumazet wrote: > > On Mon, 2017-01-23 at 11:56 -0500, Jason Baron wrote: > > On 01/23/2017 09:30 AM, Oleg Nesterov wrote: > > > Hello, > > > > > > smp_mb__after_atomic() looks wrong and misleading, sock_reset_flag() does the > > > non-atomic __clear_bit() and thus it can not guarantee test_bit(SOCK_NOSPACE) > > > (non-atomic too) won't be reordered. > > > > > > > Indeed. Here's a bit of discussion on it: > > http://marc.info/?l=linux-netdev&m=146662325920596&w=2 > > > > > It was added by 3c7151275c0c9a "tcp: add memory barriers to write space paths" > > > and the patch looks correct in that we need the barriers in tcp_check_space() > > > and tcp_poll() in theory, so it seems tcp_check_space() needs smp_mb() ? > > > > > > > Yes, I think it should be upgraded to an smp_mb() there. If you agree > > with this analysis, I will send a patch to upgrade it. Note, I did not > > actually run into this race in practice. > > SOCK_QUEUE_SHRUNK is used locally in TCP, it is not used by tcp_poll(). > > (Otherwise it would be using atomic set/clear operations) > > I do not see obvious reason why we have this smp_mb__after_atomic() in > tcp_check_space(). It is not that we need to serialize __clear_bit(SOCK_QUEUE_SHRUNK) and test_bit(SOCK_NOSPACE), we do not care if they are reordered. But we need to ensure that either tcp_poll() sees sk_stream_is_writeable() or tcp_check_space() sees SOCK_NOSPACE and calls tcp_new_space(). > But looking at this code, it seems we lack one barrier if sk_sndbuf is > ever increased. Fortunately this almost never happen during TCP session > lifetime... > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index bfa165cc455ad0a9aea44964aa663dbe6085aebd..3692e9f4c852cebf8c4d46c141f112e75e4ae66d 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -331,8 +331,13 @@ static void tcp_sndbuf_expand(struct sock *sk) > sndmem = ca_ops->sndbuf_expand ? ca_ops->sndbuf_expand(sk) : 2; > sndmem *= nr_segs * per_mss; > > - if (sk->sk_sndbuf < sndmem) > + if (sk->sk_sndbuf < sndmem) { > sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]); > + /* Paired with second sk_stream_is_writeable(sk) > + * test from tcp_poll() > + */ > + smp_wmb(); > + } > } I do not think we need the additional barrier here. If we are going to call sk->sk_write_space() we rely on wq_has_sleeper() which has a barrier which also pairs with the 2nd check in tcp_poll(). Oleg. ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2017-01-24 9:18 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-01-23 14:30 wrong smp_mb__after_atomic() in tcp_check_space() ? Oleg Nesterov 2017-01-23 16:56 ` Jason Baron 2017-01-23 17:18 ` Oleg Nesterov 2017-01-23 18:04 ` Eric Dumazet 2017-01-23 18:45 ` Jason Baron 2017-01-24 9:18 ` Oleg Nesterov
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.