* [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
@ 2021-06-21 23:13 Jakub Kicinski
  2021-06-22 10:07 ` Paolo Abeni
  2021-06-22 14:12 ` Eric Dumazet
  0 siblings, 2 replies; 12+ messages in thread
From: Jakub Kicinski @ 2021-06-21 23:13 UTC
  To: davem
  Cc: netdev, willemb, eric.dumazet, dsahern, yoshfuji, Jakub Kicinski,
	Dave Jones

Dave observed a number of machines hitting OOM on the UDP send
path. The workload seems to be sending large UDP packets over
loopback. Since loopback has an MTU of 64k, the kernel will try to
allocate an skb with up to 64k of head space. This has a good
chance of failing under memory pressure. What's worse, if
the message length is <32k the allocation may trigger the
OOM killer.

This is entirely avoidable; we can use an skb with frags.

The scenario is unlikely and always using frags requires
an extra allocation, so opt for the fallback rather
than always using a frag'ed/paged skb when the payload is large.

Note that the size heuristic (header_len > PAGE_SIZE)
is not entirely accurate; __alloc_skb() will add ~400B
to the size. An occasional order-1 allocation should be fine,
though; we are primarily concerned with order-3.
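
Illustrative arithmetic for the heuristic (assuming 4 KiB pages; not
part of the patch):

/* A ~30 kB UDP send over loopback asks for a linear area of roughly
 * msg_len + transport/link headers. __alloc_skb() then adds the
 * skb_shared_info plus alignment overhead (~400 B) and kmalloc rounds
 * the sum up to the next power-of-two bucket:
 *
 *   30000 + ~400 -> 32768-byte bucket -> get_order(32768) == 3
 *
 * Order-3 is PAGE_ALLOC_COSTLY_ORDER, the largest order for which the
 * page allocator still invokes the OOM killer; sends above ~60 kB land
 * in order-4/5 buckets, which simply fail with NULL instead.
 */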

Reported-by: Dave Jones <dsj@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/net/sock.h    | 11 +++++++++++
 net/ipv4/ip_output.c  | 19 +++++++++++++++++--
 net/ipv6/ip6_output.c | 19 +++++++++++++++++--
 3 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 7a7058f4f265..4134fb718b97 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -924,6 +924,17 @@ static inline gfp_t sk_gfp_mask(const struct sock *sk, gfp_t gfp_mask)
 	return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
 }
 
+static inline void sk_allocation_push(struct sock *sk, gfp_t flag, gfp_t *old)
+{
+	*old = sk->sk_allocation;
+	sk->sk_allocation |= flag;
+}
+
+static inline void sk_allocation_pop(struct sock *sk, gfp_t old)
+{
+	sk->sk_allocation = old;
+}
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	WRITE_ONCE(sk->sk_ack_backlog, sk->sk_ack_backlog - 1);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c3efc7d658f6..a300c2c65d57 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1095,9 +1095,24 @@ static int __ip_append_data(struct sock *sk,
 				alloclen += rt->dst.trailer_len;
 
 			if (transhdrlen) {
-				skb = sock_alloc_send_skb(sk,
-						alloclen + hh_len + 15,
+				size_t header_len = alloclen + hh_len + 15;
+				gfp_t sk_allocation;
+
+				if (header_len > PAGE_SIZE)
+					sk_allocation_push(sk, __GFP_NORETRY,
+							   &sk_allocation);
+				skb = sock_alloc_send_skb(sk, header_len,
 						(flags & MSG_DONTWAIT), &err);
+				if (header_len > PAGE_SIZE) {
+					BUILD_BUG_ON(MAX_HEADER >= PAGE_SIZE);
+
+					sk_allocation_pop(sk, sk_allocation);
+					if (unlikely(!skb) && !paged &&
+					    rt->dst.dev->features & NETIF_F_SG) {
+						paged = true;
+						goto alloc_new_skb;
+					}
+				}
 			} else {
 				skb = NULL;
 				if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <=
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ff4f9ebcf7f6..9fd167db07e4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1618,9 +1618,24 @@ static int __ip6_append_data(struct sock *sk,
 				goto error;
 			}
 			if (transhdrlen) {
-				skb = sock_alloc_send_skb(sk,
-						alloclen + hh_len,
+				size_t header_len = alloclen + hh_len;
+				gfp_t sk_allocation;
+
+				if (header_len > PAGE_SIZE)
+					sk_allocation_push(sk, __GFP_NORETRY,
+							   &sk_allocation);
+				skb = sock_alloc_send_skb(sk, header_len,
 						(flags & MSG_DONTWAIT), &err);
+				if (header_len > PAGE_SIZE) {
+					BUILD_BUG_ON(MAX_HEADER >= PAGE_SIZE);
+
+					sk_allocation_pop(sk, sk_allocation);
+					if (unlikely(!skb) && !paged &&
+					    rt->dst.dev->features & NETIF_F_SG) {
+						paged = true;
+						goto alloc_new_skb;
+					}
+				}
 			} else {
 				skb = NULL;
 				if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <=
-- 
2.31.1
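
For context, a minimal reproducer for the reported condition could look
like the following. This is a hypothetical test program (destination
port is arbitrary; no listener is needed for UDP), not part of the
submission:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
	char buf[30000] = { 0 };	/* <32k: head is order-3, OOM-kill eligible */
	struct sockaddr_in dst = {
		.sin_family = AF_INET,
		.sin_port = htons(9999),
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
	};
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return 1;
	for (;;)	/* each send allocates a fresh ~32 kB linear skb head */
		sendto(fd, buf, sizeof(buf), 0,
		       (struct sockaddr *)&dst, sizeof(dst));
}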



* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-21 23:13 [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback Jakub Kicinski
@ 2021-06-22 10:07 ` Paolo Abeni
  2021-06-22 16:57   ` Jakub Kicinski
  2021-06-22 14:12 ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Paolo Abeni @ 2021-06-22 10:07 UTC
  To: Jakub Kicinski, davem
  Cc: netdev, willemb, eric.dumazet, dsahern, yoshfuji, Dave Jones

On Mon, 2021-06-21 at 16:13 -0700, Jakub Kicinski wrote:
> Dave observed a number of machines hitting OOM on the UDP send
> path. The workload seems to be sending large UDP packets over
> loopback. Since loopback has an MTU of 64k, the kernel will try to
> allocate an skb with up to 64k of head space. This has a good
> chance of failing under memory pressure. What's worse, if
> the message length is <32k the allocation may trigger the
> OOM killer.

Out of sheer curiosity, are there a large number of UDP sockets in such
a workload? Did you increase rmem_default/rmem_max? If so, could tuning
udp_mem help?

> include/net/sock.h    | 11 +++++++++++
>  net/ipv4/ip_output.c  | 19 +++++++++++++++++--
>  net/ipv6/ip6_output.c | 19 +++++++++++++++++--
>  3 files changed, 45 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 7a7058f4f265..4134fb718b97 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -924,6 +924,17 @@ static inline gfp_t sk_gfp_mask(const struct sock *sk, gfp_t gfp_mask)
>  	return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
>  }
>  
> +static inline void sk_allocation_push(struct sock *sk, gfp_t flag, gfp_t *old)
> +{
> +	*old = sk->sk_allocation;
> +	sk->sk_allocation |= flag;
> +}
> +
> +static inline void sk_allocation_pop(struct sock *sk, gfp_t old)
> +{
> +	sk->sk_allocation = old;
> +}
> +
>  static inline void sk_acceptq_removed(struct sock *sk)
>  {
>  	WRITE_ONCE(sk->sk_ack_backlog, sk->sk_ack_backlog - 1);
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index c3efc7d658f6..a300c2c65d57 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1095,9 +1095,24 @@ static int __ip_append_data(struct sock *sk,
>  				alloclen += rt->dst.trailer_len;
>  
>  			if (transhdrlen) {
> -				skb = sock_alloc_send_skb(sk,
> -						alloclen + hh_len + 15,
> +				size_t header_len = alloclen + hh_len + 15;
> +				gfp_t sk_allocation;
> +
> +				if (header_len > PAGE_SIZE)
> +					sk_allocation_push(sk, __GFP_NORETRY,
> +							   &sk_allocation);

Could an additional __GFP_NOWARN be relevant here?

Thanks!

Paolo



* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-21 23:13 [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback Jakub Kicinski
  2021-06-22 10:07 ` Paolo Abeni
@ 2021-06-22 14:12 ` Eric Dumazet
  2021-06-22 16:54   ` Jakub Kicinski
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2021-06-22 14:12 UTC
  To: Jakub Kicinski, davem
  Cc: netdev, willemb, eric.dumazet, dsahern, yoshfuji, Dave Jones



On 6/22/21 1:13 AM, Jakub Kicinski wrote:
> Dave observed a number of machines hitting OOM on the UDP send
> path. The workload seems to be sending large UDP packets over
> loopback. Since loopback has an MTU of 64k, the kernel will try to
> allocate an skb with up to 64k of head space. This has a good
> chance of failing under memory pressure. What's worse, if
> the message length is <32k the allocation may trigger the
> OOM killer.
> 
> This is entirely avoidable; we can use an skb with frags.
> 
> The scenario is unlikely and always using frags requires
> an extra allocation, so opt for the fallback rather
> than always using a frag'ed/paged skb when the payload is large.
> 
> Note that the size heuristic (header_len > PAGE_SIZE)
> is not entirely accurate; __alloc_skb() will add ~400B
> to the size. An occasional order-1 allocation should be fine,
> though; we are primarily concerned with order-3.
> 
> Reported-by: Dave Jones <dsj@fb.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  include/net/sock.h    | 11 +++++++++++
>  net/ipv4/ip_output.c  | 19 +++++++++++++++++--
>  net/ipv6/ip6_output.c | 19 +++++++++++++++++--
>  3 files changed, 45 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 7a7058f4f265..4134fb718b97 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -924,6 +924,17 @@ static inline gfp_t sk_gfp_mask(const struct sock *sk, gfp_t gfp_mask)
>  	return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
>  }
>  
> +static inline void sk_allocation_push(struct sock *sk, gfp_t flag, gfp_t *old)
> +{
> +	*old = sk->sk_allocation;
> +	sk->sk_allocation |= flag;
> +}
> +

This is not thread safe.

Remember UDP sendmsg() does not lock the socket for non-corking sends.
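
Concretely, an interleaving that breaks it might look like this
(illustrative timeline, not code from the patch):

/* Two lockless sendmsg() calls racing on the same socket:
 *
 *   CPU0 (send A)                      CPU1 (send B)
 *   push: old_a = base;
 *         sk_allocation |= __GFP_NORETRY
 *                                      push: old_b = base | __GFP_NORETRY;
 *                                            sk_allocation |= __GFP_NORETRY
 *   sock_alloc_send_skb(...)
 *   pop:  sk_allocation = old_a
 *         (flag cleared under B's feet)
 *                                      sock_alloc_send_skb(...)
 *                                      (runs WITHOUT __GFP_NORETRY)
 *                                      pop: sk_allocation = old_b
 *                                      (__GFP_NORETRY now stuck on sk)
 */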

> +static inline void sk_allocation_pop(struct sock *sk, gfp_t old)
> +{
> +	sk->sk_allocation = old;
> +}
> +
>  static inline void sk_acceptq_removed(struct sock *sk)
>  {
>  	WRITE_ONCE(sk->sk_ack_backlog, sk->sk_ack_backlog - 1);
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index c3efc7d658f6..a300c2c65d57 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1095,9 +1095,24 @@ static int __ip_append_data(struct sock *sk,
>  				alloclen += rt->dst.trailer_len;
>  
>  			if (transhdrlen) {
> -				skb = sock_alloc_send_skb(sk,
> -						alloclen + hh_len + 15,
> +				size_t header_len = alloclen + hh_len + 15;
> +				gfp_t sk_allocation;
> +
> +				if (header_len > PAGE_SIZE)
> +					sk_allocation_push(sk, __GFP_NORETRY,
> +							   &sk_allocation);
> +				skb = sock_alloc_send_skb(sk, header_len,
>  						(flags & MSG_DONTWAIT), &err);
> +				if (header_len > PAGE_SIZE) {
> +					BUILD_BUG_ON(MAX_HEADER >= PAGE_SIZE);
> +
> +					sk_allocation_pop(sk, sk_allocation);
> +					if (unlikely(!skb) && !paged &&
> +					    rt->dst.dev->features & NETIF_F_SG) {
> +						paged = true;
> +						goto alloc_new_skb;
> +					}
> +				}


What about using sock_alloc_send_pskb(... PAGE_ALLOC_COSTLY_ORDER)
(as we did in unix_dgram_sendmsg() for large packets) for SG-enabled interfaces?

We do not _have_ to put all the payload in skb linear part,
we could instead use page frags (order-0 if high order pages are not available)


>  			} else {
>  				skb = NULL;
>  				if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <=


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 14:12 ` Eric Dumazet
@ 2021-06-22 16:54   ` Jakub Kicinski
  2021-06-22 17:19     ` Jakub Kicinski
  2021-06-22 17:48     ` Eric Dumazet
  0 siblings, 2 replies; 12+ messages in thread
From: Jakub Kicinski @ 2021-06-22 16:54 UTC
  To: Eric Dumazet; +Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones

On Tue, 22 Jun 2021 16:12:11 +0200 Eric Dumazet wrote:
> On 6/22/21 1:13 AM, Jakub Kicinski wrote:
> > Dave observed a number of machines hitting OOM on the UDP send
> > path. The workload seems to be sending large UDP packets over
> > loopback. Since loopback has an MTU of 64k, the kernel will try to
> > allocate an skb with up to 64k of head space. This has a good
> > chance of failing under memory pressure. What's worse, if
> > the message length is <32k the allocation may trigger the
> > OOM killer.
> > 
> > This is entirely avoidable; we can use an skb with frags.
> > 
> > The scenario is unlikely and always using frags requires
> > an extra allocation, so opt for the fallback rather
> > than always using a frag'ed/paged skb when the payload is large.
> > 
> > Note that the size heuristic (header_len > PAGE_SIZE)
> > is not entirely accurate; __alloc_skb() will add ~400B
> > to the size. An occasional order-1 allocation should be fine,
> > though; we are primarily concerned with order-3.
> > 
> > Reported-by: Dave Jones <dsj@fb.com>
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>

> > +static inline void sk_allocation_push(struct sock *sk, gfp_t flag, gfp_t *old)
> > +{
> > +	*old = sk->sk_allocation;
> > +	sk->sk_allocation |= flag;
> > +}
> > +  
> 
> This is not thread safe.
> 
> Remember UDP sendmsg() does not lock the socket for non-corking sends.

Ugh, you're right :(

> > +static inline void sk_allocation_pop(struct sock *sk, gfp_t old)
> > +{
> > +	sk->sk_allocation = old;
> > +}
> > +
> >  static inline void sk_acceptq_removed(struct sock *sk)
> >  {
> >  	WRITE_ONCE(sk->sk_ack_backlog, sk->sk_ack_backlog - 1);
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index c3efc7d658f6..a300c2c65d57 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -1095,9 +1095,24 @@ static int __ip_append_data(struct sock *sk,
> >  				alloclen += rt->dst.trailer_len;
> >  
> >  			if (transhdrlen) {
> > -				skb = sock_alloc_send_skb(sk,
> > -						alloclen + hh_len + 15,
> > +				size_t header_len = alloclen + hh_len + 15;
> > +				gfp_t sk_allocation;
> > +
> > +				if (header_len > PAGE_SIZE)
> > +					sk_allocation_push(sk, __GFP_NORETRY,
> > +							   &sk_allocation);
> > +				skb = sock_alloc_send_skb(sk, header_len,
> >  						(flags & MSG_DONTWAIT), &err);
> > +				if (header_len > PAGE_SIZE) {
> > +					BUILD_BUG_ON(MAX_HEADER >= PAGE_SIZE);
> > +
> > +					sk_allocation_pop(sk, sk_allocation);
> > +					if (unlikely(!skb) && !paged &&
> > +					    rt->dst.dev->features & NETIF_F_SG) {
> > +						paged = true;
> > +						goto alloc_new_skb;
> > +					}
> > +				}  
> 
> 
> What about using sock_alloc_send_pskb(... PAGE_ALLOC_COSTLY_ORDER)
> > (as we did in unix_dgram_sendmsg() for large packets) for SG-enabled interfaces?

PAGE_ALLOC_COSTLY_ORDER in itself is more of a problem than a solution.
AFAIU the app sends messages primarily above the ~60kB mark, which is
above COSTLY, and those do not trigger OOM kills. All OOM kills we see
have order=3. Checking with Rik and Johannes W, that's expected: the OOM
killer is only invoked for allocations <= COSTLY, larger ones will just
return NULL and let us deal with it (e.g. by falling back).

So adding GFP_NORETRY is key for 0 < order <= COSTLY,
skb_page_frag_refill()-style.
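
For reference, a simplified sketch of that skb_page_frag_refill()
policy (the real helper is in net/core/sock.c; the refill/reuse logic
is elided and the helper name here is invented):

/* Opportunistic high-order attempt: no direct reclaim, no retries,
 * no warnings; fall back to a plain order-0 page on failure. */
static struct page *frag_page_alloc(gfp_t gfp, unsigned int order)
{
	struct page *page = NULL;

	if (order)
		page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
				   __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY,
				   order);
	if (!page)
		page = alloc_pages(gfp, 0);
	return page;
}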

> We do not _have_ to put all the payload in skb linear part,
> we could instead use page frags (order-0 if high order pages are not available)


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 10:07 ` Paolo Abeni
@ 2021-06-22 16:57   ` Jakub Kicinski
  0 siblings, 0 replies; 12+ messages in thread
From: Jakub Kicinski @ 2021-06-22 16:57 UTC
  To: Paolo Abeni
  Cc: davem, netdev, willemb, eric.dumazet, dsahern, yoshfuji, Dave Jones

On Tue, 22 Jun 2021 12:07:27 +0200 Paolo Abeni wrote:
> On Mon, 2021-06-21 at 16:13 -0700, Jakub Kicinski wrote:
> > Dave observed a number of machines hitting OOM on the UDP send
> > path. The workload seems to be sending large UDP packets over
> > loopback. Since loopback has an MTU of 64k, the kernel will try to
> > allocate an skb with up to 64k of head space. This has a good
> > chance of failing under memory pressure. What's worse, if
> > the message length is <32k the allocation may trigger the
> > OOM killer.
> 
> Out of sheer curiosity, are there a large number of UDP sockets in such
> a workload? Did you increase rmem_default/rmem_max? If so, could tuning
> udp_mem help?

It's a handful of sockets, < 10.

> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index c3efc7d658f6..a300c2c65d57 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -1095,9 +1095,24 @@ static int __ip_append_data(struct sock *sk,
> >  				alloclen += rt->dst.trailer_len;
> >  
> >  			if (transhdrlen) {
> > -				skb = sock_alloc_send_skb(sk,
> > -						alloclen + hh_len + 15,
> > +				size_t header_len = alloclen + hh_len + 15;
> > +				gfp_t sk_allocation;
> > +
> > +				if (header_len > PAGE_SIZE)
> > +					sk_allocation_push(sk, __GFP_NORETRY,
> > +							   &sk_allocation);  
> 
> Could an additional __GFP_NOWARN be relevant here?

We always set GFP_NOWARN for heads thru kmalloc_reserve().
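
(For reference, __alloc_skb() takes the head through kmalloc_reserve(),
whose first attempt always masks the warning off; simplified excerpt
from net/core/skbuff.c:)

/* Try a regular allocation first; only dip into pfmemalloc reserves
 * when the caller is entitled to them. The NOWARN is unconditional. */
obj = kmalloc_node_track_caller(size,
				flags | __GFP_NOMEMALLOC | __GFP_NOWARN,
				node);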


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 16:54   ` Jakub Kicinski
@ 2021-06-22 17:19     ` Jakub Kicinski
  2021-06-22 17:49       ` Eric Dumazet
  2021-06-22 17:48     ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2021-06-22 17:19 UTC
  To: Eric Dumazet; +Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones

On Tue, 22 Jun 2021 09:54:22 -0700 Jakub Kicinski wrote:
> > > +static inline void sk_allocation_push(struct sock *sk, gfp_t flag, gfp_t *old)
> > > +{
> > > +	*old = sk->sk_allocation;
> > > +	sk->sk_allocation |= flag;
> > > +}
> > > +    
> > 
> > This is not thread safe.
> > 
> > Remember UDP sendmsg() does not lock the socket for non-corking sends.  
> 
> Ugh, you're right :(

Hm, isn't it buggy to call sock_alloc_send_[p]skb() without holding the
lock in the first place, then? The knee-jerk fix would be to add another
layer of specialization to the helpers:

diff --git a/include/net/sock.h b/include/net/sock.h
index 7a7058f4f265..06f031705418 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1714,9 +1725,20 @@ int sock_gettstamp(struct socket *sock, void __user *userstamp,
 		   bool timeval, bool time32);
 struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
 				    int noblock, int *errcode);
-struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
-				     unsigned long data_len, int noblock,
-				     int *errcode, int max_page_order);
+struct sk_buff *__sock_alloc_send_pskb(struct sock *sk,
+				       unsigned long header_len,
+				       unsigned long data_len, int noblock,
+				       int *errcode, int max_page_order,
+				       gfp_t gfp_flags);
+
+static inline struct sk_buff *
+sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
+		     unsigned long data_len, int noblock, int *errcode)
+{
+	return __sock_alloc_send_pskb(sk, header_len, data_len,
+				      noblock, errcode, 0, sk->sk_allocation);
+}
+
 void *sock_kmalloc(struct sock *sk, int size, gfp_t priority);
 void sock_kfree_s(struct sock *sk, void *mem, int size);
 void sock_kzfree_s(struct sock *sk, void *mem, int size);
diff --git a/net/core/sock.c b/net/core/sock.c
index 946888afef88..64b7271a7d21 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2331,9 +2331,11 @@ static long sock_wait_for_wmem(struct sock *sk, long timeo)
  *	Generic send/receive buffer handlers
  */
 
-struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
-				     unsigned long data_len, int noblock,
-				     int *errcode, int max_page_order)
+struct sk_buff *__sock_alloc_send_pskb(struct sock *sk,
+				       unsigned long header_len,
+				       unsigned long data_len, int noblock,
+				       int *errcode, int max_page_order,
+				       gfp_t gfp_flags)
 {
 	struct sk_buff *skb;
 	long timeo;
@@ -2362,7 +2364,7 @@ struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
 		timeo = sock_wait_for_wmem(sk, timeo);
 	}
 	skb = alloc_skb_with_frags(header_len, data_len, max_page_order,
-				   errcode, sk->sk_allocation);
+				   errcode, gfp_flags);
 	if (skb)
 		skb_set_owner_w(skb, sk);
 	return skb;
@@ -2373,7 +2375,7 @@ struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
 	*errcode = err;
 	return NULL;
 }
-EXPORT_SYMBOL(sock_alloc_send_pskb);
+EXPORT_SYMBOL(__sock_alloc_send_pskb);
 
 struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
 				    int noblock, int *errcode)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c3efc7d658f6..211f1ea6cf2a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1095,9 +1095,22 @@ static int __ip_append_data(struct sock *sk,
 				alloclen += rt->dst.trailer_len;
 
 			if (transhdrlen) {
-				skb = sock_alloc_send_skb(sk,
-						alloclen + hh_len + 15,
-						(flags & MSG_DONTWAIT), &err);
+				bool sg = rt->dst.dev->features & NETIF_F_SG;
+				size_t header_len = alloclen + hh_len + 15;
+				gfp_t sk_allocation;
+
+				sk_allocation = sk->sk_allocation;
+				if (header_len > PAGE_SIZE && sg)
+					sk_allocation |= __GFP_NORETRY;
+
+				skb = __sock_alloc_send_pskb(sk, header_len, 0,
+						(flags & MSG_DONTWAIT), &err,
+							     0, sk_allocation);
+				if (unlikely(!skb) && !paged && sg) {
+					BUILD_BUG_ON(MAX_HEADER >= PAGE_SIZE);
+					paged = true;
+					goto alloc_new_skb;
+				}
 			} else {
 				skb = NULL;
 				if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <=


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 16:54   ` Jakub Kicinski
  2021-06-22 17:19     ` Jakub Kicinski
@ 2021-06-22 17:48     ` Eric Dumazet
  2021-06-22 18:09       ` Jakub Kicinski
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2021-06-22 17:48 UTC
  To: Jakub Kicinski, Eric Dumazet
  Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones



On 6/22/21 6:54 PM, Jakub Kicinski wrote:
> On Tue, 22 Jun 2021 16:12:11 +0200 Eric Dumazet wrote:
>> On 6/22/21 1:13 AM, Jakub Kicinski wrote:
>>> Dave observed a number of machines hitting OOM on the UDP send
>>> path. The workload seems to be sending large UDP packets over
>>> loopback. Since loopback has an MTU of 64k, the kernel will try to
>>> allocate an skb with up to 64k of head space. This has a good
>>> chance of failing under memory pressure. What's worse, if
>>> the message length is <32k the allocation may trigger the
>>> OOM killer.
>>>
>>> This is entirely avoidable; we can use an skb with frags.
>>>
>>> The scenario is unlikely and always using frags requires
>>> an extra allocation, so opt for the fallback rather
>>> than always using a frag'ed/paged skb when the payload is large.
>>>
>>> Note that the size heuristic (header_len > PAGE_SIZE)
>>> is not entirely accurate; __alloc_skb() will add ~400B
>>> to the size. An occasional order-1 allocation should be fine,
>>> though; we are primarily concerned with order-3.
>>>
>>> Reported-by: Dave Jones <dsj@fb.com>
>>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> 
>>> +static inline void sk_allocation_push(struct sock *sk, gfp_t flag, gfp_t *old)
>>> +{
>>> +	*old = sk->sk_allocation;
>>> +	sk->sk_allocation |= flag;
>>> +}
>>> +  
>>
>> This is not thread safe.
>>
>> Remember UDP sendmsg() does not lock the socket for non-corking sends.
> 
> Ugh, you're right :(
> 
>>> +static inline void sk_allocation_pop(struct sock *sk, gfp_t old)
>>> +{
>>> +	sk->sk_allocation = old;
>>> +}
>>> +
>>>  static inline void sk_acceptq_removed(struct sock *sk)
>>>  {
>>>  	WRITE_ONCE(sk->sk_ack_backlog, sk->sk_ack_backlog - 1);
>>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>>> index c3efc7d658f6..a300c2c65d57 100644
>>> --- a/net/ipv4/ip_output.c
>>> +++ b/net/ipv4/ip_output.c
>>> @@ -1095,9 +1095,24 @@ static int __ip_append_data(struct sock *sk,
>>>  				alloclen += rt->dst.trailer_len;
>>>  
>>>  			if (transhdrlen) {
>>> -				skb = sock_alloc_send_skb(sk,
>>> -						alloclen + hh_len + 15,
>>> +				size_t header_len = alloclen + hh_len + 15;
>>> +				gfp_t sk_allocation;
>>> +
>>> +				if (header_len > PAGE_SIZE)
>>> +					sk_allocation_push(sk, __GFP_NORETRY,
>>> +							   &sk_allocation);
>>> +				skb = sock_alloc_send_skb(sk, header_len,
>>>  						(flags & MSG_DONTWAIT), &err);
>>> +				if (header_len > PAGE_SIZE) {
>>> +					BUILD_BUG_ON(MAX_HEADER >= PAGE_SIZE);
>>> +
>>> +					sk_allocation_pop(sk, sk_allocation);
>>> +					if (unlikely(!skb) && !paged &&
>>> +					    rt->dst.dev->features & NETIF_F_SG) {
>>> +						paged = true;
>>> +						goto alloc_new_skb;
>>> +					}
>>> +				}  
>>
>>
>> What about using sock_alloc_send_pskb(... PAGE_ALLOC_COSTLY_ORDER)
>> (as we did in unix_dgram_sendmsg() for large packets) for SG-enabled interfaces?
> 
> PAGE_ALLOC_COSTLY_ORDER in itself is more of a problem than a solution.
> AFAIU the app sends messages primarily above the ~60kB mark, which is
> above COSTLY, and those do not trigger OOM kills. All OOM kills we see
> have order=3. Checking with Rik and Johannes W, that's expected: the OOM
> killer is only invoked for allocations <= COSTLY, larger ones will just
> return NULL and let us deal with it (e.g. by falling back).

I really thought alloc_skb_with_frags() was already handling low-memory conditions.

(alloc_skb_with_frags() is called from sock_alloc_send_pskb())

If it is not, let's fix it, because af_unix sockets will have the same issue?



> 
> So adding GFP_NORETRY is key for 0 < order <= COSTLY,
> skb_page_frag_refill()-style.
> 
>> We do not _have_ to put all the payload in skb linear part,
>> we could instead use page frags (order-0 if high order pages are not available)


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 17:19     ` Jakub Kicinski
@ 2021-06-22 17:49       ` Eric Dumazet
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2021-06-22 17:49 UTC
  To: Jakub Kicinski, Eric Dumazet
  Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones



On 6/22/21 7:19 PM, Jakub Kicinski wrote:
> On Tue, 22 Jun 2021 09:54:22 -0700 Jakub Kicinski wrote:
>>>> +static inline void sk_allocation_push(struct sock *sk, gfp_t flag, gfp_t *old)
>>>> +{
>>>> +	*old = sk->sk_allocation;
>>>> +	sk->sk_allocation |= flag;
>>>> +}
>>>> +    
>>>
>>> This is not thread safe.
>>>
>>> Remember UDP sendmsg() does not lock the socket for non-corking sends.  
>>
>> Ugh, you're right :(
> 
> Hm, isn't it buggy to call sock_alloc_send_[p]skb() without holding the
> lock in the first place, then? The knee-jerk fix would be to add another
> layer of specialization to the helpers:

It is not buggy. Please elaborate if you found that it is.




* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 17:48     ` Eric Dumazet
@ 2021-06-22 18:09       ` Jakub Kicinski
  2021-06-22 18:47         ` Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2021-06-22 18:09 UTC
  To: Eric Dumazet; +Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones

On Tue, 22 Jun 2021 19:48:43 +0200 Eric Dumazet wrote:
> >> What about using sock_alloc_send_pskb(... PAGE_ALLOC_COSTLY_ORDER)
> >> (as we did in unix_dgram_sendmsg() for large packets) for SG-enabled interfaces?
> > 
> > PAGE_ALLOC_COSTLY_ORDER in itself is more of a problem than a solution.
> > AFAIU the app sends messages primarily above the ~60kB mark, which is
> > above COSTLY, and those do not trigger OOM kills. All OOM kills we see
> > have order=3. Checking with Rik and Johannes W, that's expected: the OOM
> > killer is only invoked for allocations <= COSTLY, larger ones will just
> > return NULL and let us deal with it (e.g. by falling back).
> 
> I really thought alloc_skb_with_frags() was already handling low-memory conditions.
> 
> (alloc_skb_with_frags() is called from sock_alloc_send_pskb())
> 
> If it is not, let's fix it, because af_unix sockets will have the same issue?

af_unix seems to cap at SKB_MAX_ALLOC which is order 2, AFAICT.

Perhaps that's a good enough fix in practice given we see OOMs with
order=3 only?

I'll review callers of alloc_skb_with_frags() and see if they depend
on the explicit geometry of the skb or if we can safely fall back to pages.


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 18:09       ` Jakub Kicinski
@ 2021-06-22 18:47         ` Eric Dumazet
  2021-06-22 19:04           ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2021-06-22 18:47 UTC
  To: Jakub Kicinski, Eric Dumazet
  Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones



On 6/22/21 8:09 PM, Jakub Kicinski wrote:
> On Tue, 22 Jun 2021 19:48:43 +0200 Eric Dumazet wrote:
>>>> What about using sock_alloc_send_pskb(... PAGE_ALLOC_COSTLY_ORDER)
>>>> (as we did in unix_dgram_sendmsg() for large packets) for SG-enabled interfaces?
>>>
>>> PAGE_ALLOC_COSTLY_ORDER in itself is more of a problem than a solution.
>>> AFAIU the app sends messages primarily above the ~60kB mark, which is
>>> above COSTLY, and those do not trigger OOM kills. All OOM kills we see
>>> have order=3. Checking with Rik and Johannes W, that's expected: the OOM
>>> killer is only invoked for allocations <= COSTLY, larger ones will just
>>> return NULL and let us deal with it (e.g. by falling back).
>>
>> I really thought alloc_skb_with_frags() was already handling low-memory conditions.
>>
>> (alloc_skb_with_frags() is called from sock_alloc_send_pskb())
>>
>> If it is not, let's fix it, because af_unix sockets will have the same issue?
> 
> af_unix seems to cap at SKB_MAX_ALLOC which is order 2, AFAICT.

It does not cap at SKB_MAX_ALLOC.

It definitely attempts big allocations if you send 64KB datagrams.

Please look at commit d14b56f508ad70eca3e659545aab3c45200f258c
    net: cleanup gfp mask in alloc_skb_with_frags

This explains why we do not have __GFP_NORETRY there.
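
Roughly, the frag loop there already degrades on its own; a simplified
sketch of the high-order path after that commit (surrounding loop and
error handling elided):

/* High-order frag attempt: direct reclaim masked out, no warning,
 * hence no need for __GFP_NORETRY; on failure just lower the order.
 * The skb head itself (alloc_skb(header_len, gfp_mask)) gets no such
 * easing, which is the gap the UDP patch is chasing. */
page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
		   __GFP_COMP | __GFP_NOWARN, order);
if (!page) {
	/* Do not retry other high order allocations */
	order = 1;
	max_page_order = 0;
}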

> 
> Perhaps that's a good enough fix in practice given we see OOMs with
> order=3 only?
> 
> I'll review callers of alloc_skb_with_frags() and see if they depend
> on the explicit geometry of the skb or if we can safely fall back to pages.
> 


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 18:47         ` Eric Dumazet
@ 2021-06-22 19:04           ` Jakub Kicinski
  2021-06-22 19:51             ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2021-06-22 19:04 UTC
  To: Eric Dumazet; +Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones

On Tue, 22 Jun 2021 20:47:57 +0200 Eric Dumazet wrote:
> On 6/22/21 8:09 PM, Jakub Kicinski wrote:
> > On Tue, 22 Jun 2021 19:48:43 +0200 Eric Dumazet wrote:  
> >> I really thought alloc_skb_with_frags() was already handling low-memory conditions.
> >>
> >> (alloc_skb_with_frags() is called from sock_alloc_send_pskb())
> >>
> >> If it is not, let's fix it, because af_unix sockets will have the same issue?
> > 
> > af_unix seems to cap at SKB_MAX_ALLOC which is order 2, AFAICT.  
> 
> It does not cap at SKB_MAX_ALLOC.
> 
> It definitely attempts big allocations if you send 64KB datagrams.
> 
> Please look at commit d14b56f508ad70eca3e659545aab3c45200f258c
>     net: cleanup gfp mask in alloc_skb_with_frags
> 
> This explains why we do not have __GFP_NORETRY there.

Ah, right, slight misunderstanding.

Just to be 100% clear: for UDP send we are allocating up to 64kB
in the _head_, AFAICT. Allocation of the head does not clear GFP_WAIT.

Your memory was correct, alloc_skb_with_frags() does handle low memory
when it comes to allocating frags. And what I was saying is that af_unix
won't have the same problem as UDP, as it caps the head's size at
SKB_MAX_ALLOC and frags are allocated with fallback.

For the UDP case we can either adapt the af_unix approach and cap the
head size to SKB_MAX_ALLOC, or try to allocate the full skb and fall back.
Having alloc_skb_with_frags() itself re-balance head <> data
automatically does not feel right, no?


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 19:04           ` Jakub Kicinski
@ 2021-06-22 19:51             ` Jakub Kicinski
  0 siblings, 0 replies; 12+ messages in thread
From: Jakub Kicinski @ 2021-06-22 19:51 UTC
  To: Eric Dumazet; +Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones

On Tue, 22 Jun 2021 12:04:26 -0700 Jakub Kicinski wrote:
> For the UDP case we can either adapt the af_unix approach and cap the
> head size to SKB_MAX_ALLOC, or try to allocate the full skb and fall back.
> Having alloc_skb_with_frags() itself re-balance head <> data
> automatically does not feel right, no?

Actually, looking closer at the UDP code, it appears it only uses the
giant head it allocated if the underlying device doesn't have SG.
We can make the head smaller and probably only improve performance
for 99% of deployments. I'll send a v2.
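
A rough sketch of where that could go (hypothetical, not the posted v2;
only names from the patch above are used):

/* On an SG-capable device, start with a paged skb instead of falling
 * back to one: headers stay linear, payload goes to page frags. */
if (rt->dst.dev->features & NETIF_F_SG && header_len > PAGE_SIZE)
	paged = true;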

