From mboxrd@z Thu Jan 1 00:00:00 1970 From: Eric Dumazet Subject: Re: [PATCH v2 net-next] tcp: set SOCK_NOSPACE under memory pressure Date: Wed, 06 May 2015 09:13:53 -0700 Message-ID: <1430928833.14545.36.camel@edumazet-glaptop2.roam.corp.google.com> References: <20150506155223.9743A2027@prod-mail-relay06.akamai.com> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit Cc: davem@davemloft.net, netdev@vger.kernel.org To: Jason Baron Return-path: Received: from mail-pa0-f54.google.com ([209.85.220.54]:36450 "EHLO mail-pa0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752285AbbEFQNz (ORCPT ); Wed, 6 May 2015 12:13:55 -0400 Received: by pabsx10 with SMTP id sx10so13438591pab.3 for ; Wed, 06 May 2015 09:13:55 -0700 (PDT) In-Reply-To: <20150506155223.9743A2027@prod-mail-relay06.akamai.com> Sender: netdev-owner@vger.kernel.org List-ID: On Wed, 2015-05-06 at 15:52 +0000, Jason Baron wrote: > From: Jason Baron > > Under tcp memory pressure, calling epoll_wait() in edge triggered > mode after -EAGAIN, can result in an indefinite hang in epoll_wait(), > even when there is sufficient memory available to continue making > progress. The problem is that when __sk_mem_schedule() returns 0 > under memory pressure, we do not set the SOCK_NOSPACE flag in the > tcp write paths (tcp_sendmsg() or do_tcp_sendpages()). Then, since > SOCK_NOSPACE is used to trigger wakeups when incoming acks create > sufficient new space in the write queue, all outstanding packets > are acked, but we never wake up with the the EPOLLOUT that we are > expecting from epoll_wait(). > > This issue is currently limited to epoll() when used in edge trigger > mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE. > This is sufficient for poll()/select() and epoll() in level trigger > mode. However, in edge trigger mode, epoll() is relying on the write > path to set SOCK_NOSPACE. EPOLL(7) says that in edge-trigger mode we > can only call epoll_wait() after read/write return -EAGAIN. Thus, in > the case of the socket write, we are relying on the fact that > tcp_sendmsg()/network write paths are going to issue a wakeup for > us at some point in the future when we get -EAGAIN. > > Normally, epoll() edge trigger works fine when we've exceeded the > sk->sndbuf because in that case we do set SOCK_NOSPACE. However, when > we return -EAGAIN from the write path b/c we are over the tcp memory > limits and not b/c we are over the sndbuf, we are never going to get > another wakeup. > > I can reproduce this issue, using SO_SNDBUF, since __sk_mem_schedule() > will return 0, or failure more readily with SO_SNDBUF: > > 1) create socket and set SO_SNDBUF to N > 2) add socket as edge trigger > 3) write to socket and block in epoll on -EAGAIN > 4) cause tcp mem pressure via: echo "" > net.ipv4.tcp_mem > > The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory() > when the socket is non-blocking. Note that SOCK_NOSPACE, in addition > to waking up outstanding waiters is also used to expand the size of > the sk->sndbuf. However, we will not expand it by setting it in this > case because tcp_should_expand_sndbuf(), ensures that no expansion > occurs when we are under tcp memory pressure. > > Note that we could still hang if sk->sk_wmem_queue is 0, when we get > the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we > are waiting for and event that will never happen. I believe > that this case is harder to hit (and did not hit in my testing), > in that over the tcp 'soft' memory limits, we continue to guarantee a > minimum write buffer size. Perhaps, we could return -ENOSPC in this > case, or maybe we simply issue a wakeup in this case, such that we > keep retrying the write. Note that this case is not specific to > epoll() ET, but rather would affect blocking sockets as well. So I > view this patch as bringing epoll() edge-trigger into sync with the > current poll()/select()/epoll() level trigger and blocking sockets > behavior. > > Signed-off-by: Jason Baron > --- Thanks Jason for this extremely clear changelog ;) Acked-by: Eric Dumazet