From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 68ECEC48BC2 for ; Sun, 27 Jun 2021 08:11:09 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3ED0161C34 for ; Sun, 27 Jun 2021 08:11:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230013AbhF0INc (ORCPT ); Sun, 27 Jun 2021 04:13:32 -0400 Received: from mail.kernel.org ([198.145.29.99]:42438 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229507AbhF0INb (ORCPT ); Sun, 27 Jun 2021 04:13:31 -0400 Received: by mail.kernel.org (Postfix) with ESMTPSA id 090F661C32; Sun, 27 Jun 2021 08:11:06 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1624781467; bh=WNn1g6kdCSHd3xWcHByeDlzEeojulDSBssUamtJ6PgA=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=XxzylgjunLF+94kbeLOPB4AoEnt1QpfBOsc/nqgArJmwVmJYpMqvTNgS1+oklrlgj Oz/hkw9dWZaaKpLoau9T+0eVAQmepx+XdhTVhnslTJrQKRzCuonvR+fbySu7Uv49bg NYA6NUQoreVRGX8XBd07H61D3LXzBE+WFxrvl8lcCzTNglq1zyTWlmI+eG8gH14/i+ DCHwjnQOoRW4XI3zeKaItRzBs8Pd+uuBUP31ygr8EBfOhC9rDchqy0p9tervqLbobJ BSse7/6YoSHRxmIHuKBpwgLnzBuKUgiphbyWp9N35+4a4rM6Fz08cZj/2mhjlmQkeN j5/h7R13KswGg== Date: Sun, 27 Jun 2021 11:11:04 +0300 From: Leon Romanovsky To: Jakub Kicinski Cc: davem@davemloft.net, netdev@vger.kernel.org, willemb@google.com, eric.dumazet@gmail.com, dsahern@gmail.com, yoshfuji@linux-ipv6.org, brouer@redhat.com, Dave Jones Subject: Re: [PATCH net-next v5] net: ip: avoid OOM kills with large UDP sends over loopback Message-ID: References: <20210624175724.2389359-1-kuba@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20210624175724.2389359-1-kuba@kernel.org> Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On Thu, Jun 24, 2021 at 10:57:24AM -0700, Jakub Kicinski wrote: > Dave observed number of machines hitting OOM on the UDP send > path. The workload seems to be sending large UDP packets over > loopback. Since loopback has MTU of 64k kernel will try to > allocate an skb with up to 64k of head space. This has a good > chance of failing under memory pressure. What's worse if > the message length is <32k the allocation may trigger an > OOM killer. > > This is entirely avoidable, we can use an skb with page frags. > > af_unix solves a similar problem by limiting the head > length to SKB_MAX_ALLOC. This seems like a good and simple > approach. It means that UDP messages > 16kB will now > use fragments if underlying device supports SG, if extra > allocator pressure causes regressions in real workloads > we can switch to trying the large allocation first and > falling back. > > v4: pre-calculate all the additions to alloclen so > we can be sure it won't go over order-2 > v5: s/< MAX/<= MAX/ ; reorder variable declaration Jakub, Can you please put changelog under "---" below SOB? It gives no value to see all vX in the final git log. Thanks > > Reported-by: Dave Jones > Signed-off-by: Jakub Kicinski > --- > net/ipv4/ip_output.c | 32 ++++++++++++++++++-------------- > net/ipv6/ip6_output.c | 32 +++++++++++++++++--------------- > 2 files changed, 35 insertions(+), 29 deletions(-) > > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c > index c3efc7d658f6..f5c398036efa 100644 > --- a/net/ipv4/ip_output.c > +++ b/net/ipv4/ip_output.c > @@ -1050,11 +1050,11 @@ static int __ip_append_data(struct sock *sk, > if (copy < length) > copy = maxfraglen - skb->len; > if (copy <= 0) { > + unsigned int alloclen, alloc_extra; > char *data; > unsigned int datalen; > unsigned int fraglen; > unsigned int fraggap; > - unsigned int alloclen; > unsigned int pagedlen; > struct sk_buff *skb_prev; > alloc_new_skb: > @@ -1074,35 +1074,39 @@ static int __ip_append_data(struct sock *sk, > fraglen = datalen + fragheaderlen; > pagedlen = 0; > > + alloc_extra = hh_len + 15; > + alloc_extra += exthdrlen; > + > + /* The last fragment gets additional space at tail. > + * Note, with MSG_MORE we overallocate on fragments, > + * because we have no idea what fragment will be > + * the last. > + */ > + if (datalen == length + fraggap) > + alloc_extra += rt->dst.trailer_len; > + > if ((flags & MSG_MORE) && > !(rt->dst.dev->features&NETIF_F_SG)) > alloclen = mtu; > - else if (!paged) > + else if (!paged && > + (fraglen + alloc_extra <= SKB_MAX_ALLOC || > + !(rt->dst.dev->features & NETIF_F_SG))) > alloclen = fraglen; > else { > alloclen = min_t(int, fraglen, MAX_HEADER); > pagedlen = fraglen - alloclen; > } > > - alloclen += exthdrlen; > - > - /* The last fragment gets additional space at tail. > - * Note, with MSG_MORE we overallocate on fragments, > - * because we have no idea what fragment will be > - * the last. > - */ > - if (datalen == length + fraggap) > - alloclen += rt->dst.trailer_len; > + alloclen += alloc_extra; > > if (transhdrlen) { > - skb = sock_alloc_send_skb(sk, > - alloclen + hh_len + 15, > + skb = sock_alloc_send_skb(sk, alloclen, > (flags & MSG_DONTWAIT), &err); > } else { > skb = NULL; > if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <= > 2 * sk->sk_sndbuf) > - skb = alloc_skb(alloclen + hh_len + 15, > + skb = alloc_skb(alloclen, > sk->sk_allocation); > if (unlikely(!skb)) > err = -ENOBUFS; > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c > index ff4f9ebcf7f6..afe887dd5e35 100644 > --- a/net/ipv6/ip6_output.c > +++ b/net/ipv6/ip6_output.c > @@ -1551,11 +1551,11 @@ static int __ip6_append_data(struct sock *sk, > copy = maxfraglen - skb->len; > > if (copy <= 0) { > + unsigned int alloclen, alloc_extra; > char *data; > unsigned int datalen; > unsigned int fraglen; > unsigned int fraggap; > - unsigned int alloclen; > unsigned int pagedlen; > alloc_new_skb: > /* There's no room in the current skb */ > @@ -1582,17 +1582,28 @@ static int __ip6_append_data(struct sock *sk, > fraglen = datalen + fragheaderlen; > pagedlen = 0; > > + alloc_extra = hh_len; > + alloc_extra += dst_exthdrlen; > + alloc_extra += rt->dst.trailer_len; > + > + /* We just reserve space for fragment header. > + * Note: this may be overallocation if the message > + * (without MSG_MORE) fits into the MTU. > + */ > + alloc_extra += sizeof(struct frag_hdr); > + > if ((flags & MSG_MORE) && > !(rt->dst.dev->features&NETIF_F_SG)) > alloclen = mtu; > - else if (!paged) > + else if (!paged && > + (fraglen + alloc_extra <= SKB_MAX_ALLOC || > + !(rt->dst.dev->features & NETIF_F_SG))) > alloclen = fraglen; > else { > alloclen = min_t(int, fraglen, MAX_HEADER); > pagedlen = fraglen - alloclen; > } > - > - alloclen += dst_exthdrlen; > + alloclen += alloc_extra; > > if (datalen != length + fraggap) { > /* > @@ -1602,30 +1613,21 @@ static int __ip6_append_data(struct sock *sk, > datalen += rt->dst.trailer_len; > } > > - alloclen += rt->dst.trailer_len; > fraglen = datalen + fragheaderlen; > > - /* > - * We just reserve space for fragment header. > - * Note: this may be overallocation if the message > - * (without MSG_MORE) fits into the MTU. > - */ > - alloclen += sizeof(struct frag_hdr); > - > copy = datalen - transhdrlen - fraggap - pagedlen; > if (copy < 0) { > err = -EINVAL; > goto error; > } > if (transhdrlen) { > - skb = sock_alloc_send_skb(sk, > - alloclen + hh_len, > + skb = sock_alloc_send_skb(sk, alloclen, > (flags & MSG_DONTWAIT), &err); > } else { > skb = NULL; > if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <= > 2 * sk->sk_sndbuf) > - skb = alloc_skb(alloclen + hh_len, > + skb = alloc_skb(alloclen, > sk->sk_allocation); > if (unlikely(!skb)) > err = -ENOBUFS; > -- > 2.31.1 >