* [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
@ 2021-06-21 23:13 Jakub Kicinski
  2021-06-22 10:07 ` Paolo Abeni
  2021-06-22 14:12 ` Eric Dumazet
  0 siblings, 2 replies; 12+ messages in thread
From: Jakub Kicinski @ 2021-06-21 23:13 UTC
  To: davem
  Cc: netdev, willemb, eric.dumazet, dsahern, yoshfuji, Jakub Kicinski,
	Dave Jones

Dave observed a number of machines hitting OOM on the UDP send
path. The workload seems to be sending large UDP packets over
loopback. Since loopback has an MTU of 64k, the kernel will try to
allocate an skb with up to 64k of head space. This has a good
chance of failing under memory pressure. What's worse, if
the message length is <32k the allocation may trigger the
OOM killer.

This is entirely avoidable; we can use an skb with frags.

The scenario is unlikely and always using frags requires
an extra allocation, so opt for the fallback rather
than always using a frag'ed/paged skb when the payload is large.

Note that the size heuristic (header_len > PAGE_SIZE)
is not entirely accurate; __alloc_skb() will add ~400B
to the size. An occasional order-1 allocation should be fine,
though; we are primarily concerned with order-3.
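
Illustrative arithmetic for the heuristic (assuming 4 KiB pages; not
part of the patch):

/* A ~30 kB UDP send over loopback asks for a linear area of roughly
 * msg_len + transport/link headers. __alloc_skb() then adds the
 * skb_shared_info plus alignment overhead (~400 B) and kmalloc rounds
 * the sum up to the next power-of-two bucket:
 *
 *   30000 + ~400 -> 32768-byte bucket -> get_order(32768) == 3
 *
 * Order-3 is PAGE_ALLOC_COSTLY_ORDER, the largest order for which the
 * page allocator still invokes the OOM killer; sends above ~60 kB land
 * in order-4/5 buckets, which simply fail with NULL instead.
 */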

Reported-by: Dave Jones <dsj@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/net/sock.h    | 11 +++++++++++
 net/ipv4/ip_output.c  | 19 +++++++++++++++++--
 net/ipv6/ip6_output.c | 19 +++++++++++++++++--
 3 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 7a7058f4f265..4134fb718b97 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -924,6 +924,17 @@ static inline gfp_t sk_gfp_mask(const struct sock *sk, gfp_t gfp_mask)
 	return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
 }
 
+static inline void sk_allocation_push(struct sock *sk, gfp_t flag, gfp_t *old)
+{
+	*old = sk->sk_allocation;
+	sk->sk_allocation |= flag;
+}
+
+static inline void sk_allocation_pop(struct sock *sk, gfp_t old)
+{
+	sk->sk_allocation = old;
+}
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	WRITE_ONCE(sk->sk_ack_backlog, sk->sk_ack_backlog - 1);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c3efc7d658f6..a300c2c65d57 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1095,9 +1095,24 @@ static int __ip_append_data(struct sock *sk,
 				alloclen += rt->dst.trailer_len;
 
 			if (transhdrlen) {
-				skb = sock_alloc_send_skb(sk,
-						alloclen + hh_len + 15,
+				size_t header_len = alloclen + hh_len + 15;
+				gfp_t sk_allocation;
+
+				if (header_len > PAGE_SIZE)
+					sk_allocation_push(sk, __GFP_NORETRY,
+							   &sk_allocation);
+				skb = sock_alloc_send_skb(sk, header_len,
 						(flags & MSG_DONTWAIT), &err);
+				if (header_len > PAGE_SIZE) {
+					BUILD_BUG_ON(MAX_HEADER >= PAGE_SIZE);
+
+					sk_allocation_pop(sk, sk_allocation);
+					if (unlikely(!skb) && !paged &&
+					    rt->dst.dev->features & NETIF_F_SG) {
+						paged = true;
+						goto alloc_new_skb;
+					}
+				}
 			} else {
 				skb = NULL;
 				if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <=
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ff4f9ebcf7f6..9fd167db07e4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1618,9 +1618,24 @@ static int __ip6_append_data(struct sock *sk,
 				goto error;
 			}
 			if (transhdrlen) {
-				skb = sock_alloc_send_skb(sk,
-						alloclen + hh_len,
+				size_t header_len = alloclen + hh_len;
+				gfp_t sk_allocation;
+
+				if (header_len > PAGE_SIZE)
+					sk_allocation_push(sk, __GFP_NORETRY,
+							   &sk_allocation);
+				skb = sock_alloc_send_skb(sk, header_len,
 						(flags & MSG_DONTWAIT), &err);
+				if (header_len > PAGE_SIZE) {
+					BUILD_BUG_ON(MAX_HEADER >= PAGE_SIZE);
+
+					sk_allocation_pop(sk, sk_allocation);
+					if (unlikely(!skb) && !paged &&
+					    rt->dst.dev->features & NETIF_F_SG) {
+						paged = true;
+						goto alloc_new_skb;
+					}
+				}
 			} else {
 				skb = NULL;
 				if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <=
-- 
2.31.1
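
For context, a minimal reproducer for the reported condition could look
like the following. This is a hypothetical test program (destination
port is arbitrary; no listener is needed for UDP), not part of the
submission:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
	char buf[30000] = { 0 };	/* <32k: head is order-3, OOM-kill eligible */
	struct sockaddr_in dst = {
		.sin_family = AF_INET,
		.sin_port = htons(9999),
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
	};
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return 1;
	for (;;)	/* each send allocates a fresh ~32 kB linear skb head */
		sendto(fd, buf, sizeof(buf), 0,
		       (struct sockaddr *)&dst, sizeof(dst));
}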



* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-21 23:13 [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback Jakub Kicinski
@ 2021-06-22 10:07 ` Paolo Abeni
  2021-06-22 16:57   ` Jakub Kicinski
  2021-06-22 14:12 ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Paolo Abeni @ 2021-06-22 10:07 UTC
  To: Jakub Kicinski, davem
  Cc: netdev, willemb, eric.dumazet, dsahern, yoshfuji, Dave Jones

On Mon, 2021-06-21 at 16:13 -0700, Jakub Kicinski wrote:
> Dave observed a number of machines hitting OOM on the UDP send
> path. The workload seems to be sending large UDP packets over
> loopback. Since loopback has an MTU of 64k, the kernel will try to
> allocate an skb with up to 64k of head space. This has a good
> chance of failing under memory pressure. What's worse, if
> the message length is <32k the allocation may trigger the
> OOM killer.

Out of sheer curiosity, are there a large number of UDP sockets in such
a workload? Did you increase rmem_default/rmem_max? If so, could tuning
udp_mem help?

> include/net/sock.h    | 11 +++++++++++
>  net/ipv4/ip_output.c  | 19 +++++++++++++++++--
>  net/ipv6/ip6_output.c | 19 +++++++++++++++++--
>  3 files changed, 45 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 7a7058f4f265..4134fb718b97 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -924,6 +924,17 @@ static inline gfp_t sk_gfp_mask(const struct sock *sk, gfp_t gfp_mask)
>  	return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
>  }
>  
> +static inline void sk_allocation_push(struct sock *sk, gfp_t flag, gfp_t *old)
> +{
> +	*old = sk->sk_allocation;
> +	sk->sk_allocation |= flag;
> +}
> +
> +static inline void sk_allocation_pop(struct sock *sk, gfp_t old)
> +{
> +	sk->sk_allocation = old;
> +}
> +
>  static inline void sk_acceptq_removed(struct sock *sk)
>  {
>  	WRITE_ONCE(sk->sk_ack_backlog, sk->sk_ack_backlog - 1);
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index c3efc7d658f6..a300c2c65d57 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1095,9 +1095,24 @@ static int __ip_append_data(struct sock *sk,
>  				alloclen += rt->dst.trailer_len;
>  
>  			if (transhdrlen) {
> -				skb = sock_alloc_send_skb(sk,
> -						alloclen + hh_len + 15,
> +				size_t header_len = alloclen + hh_len + 15;
> +				gfp_t sk_allocation;
> +
> +				if (header_len > PAGE_SIZE)
> +					sk_allocation_push(sk, __GFP_NORETRY,
> +							   &sk_allocation);

Could an additional __GFP_NOWARN be relevant here?

Thanks!

Paolo



* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-21 23:13 [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback Jakub Kicinski
  2021-06-22 10:07 ` Paolo Abeni
@ 2021-06-22 14:12 ` Eric Dumazet
  2021-06-22 16:54   ` Jakub Kicinski
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2021-06-22 14:12 UTC
  To: Jakub Kicinski, davem
  Cc: netdev, willemb, eric.dumazet, dsahern, yoshfuji, Dave Jones



On 6/22/21 1:13 AM, Jakub Kicinski wrote:
> Dave observed a number of machines hitting OOM on the UDP send
> path. The workload seems to be sending large UDP packets over
> loopback. Since loopback has an MTU of 64k, the kernel will try to
> allocate an skb with up to 64k of head space. This has a good
> chance of failing under memory pressure. What's worse, if
> the message length is <32k the allocation may trigger the
> OOM killer.
> 
> This is entirely avoidable; we can use an skb with frags.
> 
> The scenario is unlikely and always using frags requires
> an extra allocation, so opt for the fallback rather
> than always using a frag'ed/paged skb when the payload is large.
> 
> Note that the size heuristic (header_len > PAGE_SIZE)
> is not entirely accurate; __alloc_skb() will add ~400B
> to the size. An occasional order-1 allocation should be fine,
> though; we are primarily concerned with order-3.
> 
> Reported-by: Dave Jones <dsj@fb.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  include/net/sock.h    | 11 +++++++++++
>  net/ipv4/ip_output.c  | 19 +++++++++++++++++--
>  net/ipv6/ip6_output.c | 19 +++++++++++++++++--
>  3 files changed, 45 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 7a7058f4f265..4134fb718b97 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -924,6 +924,17 @@ static inline gfp_t sk_gfp_mask(const struct sock *sk, gfp_t gfp_mask)
>  	return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
>  }
>  
> +static inline void sk_allocation_push(struct sock *sk, gfp_t flag, gfp_t *old)
> +{
> +	*old = sk->sk_allocation;
> +	sk->sk_allocation |= flag;
> +}
> +

This is not thread safe.

Remember UDP sendmsg() does not lock the socket for non-corking sends.
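
Concretely, an interleaving that breaks it might look like this
(illustrative timeline, not code from the patch):

/* Two lockless sendmsg() calls racing on the same socket:
 *
 *   CPU0 (send A)                      CPU1 (send B)
 *   push: old_a = base;
 *         sk_allocation |= __GFP_NORETRY
 *                                      push: old_b = base | __GFP_NORETRY;
 *                                            sk_allocation |= __GFP_NORETRY
 *   sock_alloc_send_skb(...)
 *   pop:  sk_allocation = old_a
 *         (flag cleared under B's feet)
 *                                      sock_alloc_send_skb(...)
 *                                      (runs WITHOUT __GFP_NORETRY)
 *                                      pop: sk_allocation = old_b
 *                                      (__GFP_NORETRY now stuck on sk)
 */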

> +static inline void sk_allocation_pop(struct sock *sk, gfp_t old)
> +{
> +	sk->sk_allocation = old;
> +}
> +
>  static inline void sk_acceptq_removed(struct sock *sk)
>  {
>  	WRITE_ONCE(sk->sk_ack_backlog, sk->sk_ack_backlog - 1);
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index c3efc7d658f6..a300c2c65d57 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1095,9 +1095,24 @@ static int __ip_append_data(struct sock *sk,
>  				alloclen += rt->dst.trailer_len;
>  
>  			if (transhdrlen) {
> -				skb = sock_alloc_send_skb(sk,
> -						alloclen + hh_len + 15,
> +				size_t header_len = alloclen + hh_len + 15;
> +				gfp_t sk_allocation;
> +
> +				if (header_len > PAGE_SIZE)
> +					sk_allocation_push(sk, __GFP_NORETRY,
> +							   &sk_allocation);
> +				skb = sock_alloc_send_skb(sk, header_len,
>  						(flags & MSG_DONTWAIT), &err);
> +				if (header_len > PAGE_SIZE) {
> +					BUILD_BUG_ON(MAX_HEADER >= PAGE_SIZE);
> +
> +					sk_allocation_pop(sk, sk_allocation);
> +					if (unlikely(!skb) && !paged &&
> +					    rt->dst.dev->features & NETIF_F_SG) {
> +						paged = true;
> +						goto alloc_new_skb;
> +					}
> +				}


What about using sock_alloc_send_pskb(... PAGE_ALLOC_COSTLY_ORDER)
(as we did in unix_dgram_sendmsg() for large packets) for SG-enabled interfaces?

We do not _have_ to put all the payload in skb linear part,
we could instead use page frags (order-0 if high order pages are not available)


>  			} else {
>  				skb = NULL;
>  				if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <=


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 14:12 ` Eric Dumazet
@ 2021-06-22 16:54   ` Jakub Kicinski
  2021-06-22 17:19     ` Jakub Kicinski
  2021-06-22 17:48     ` Eric Dumazet
  0 siblings, 2 replies; 12+ messages in thread
From: Jakub Kicinski @ 2021-06-22 16:54 UTC
  To: Eric Dumazet; +Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones

On Tue, 22 Jun 2021 16:12:11 +0200 Eric Dumazet wrote:
> On 6/22/21 1:13 AM, Jakub Kicinski wrote:
> > Dave observed a number of machines hitting OOM on the UDP send
> > path. The workload seems to be sending large UDP packets over
> > loopback. Since loopback has an MTU of 64k, the kernel will try to
> > allocate an skb with up to 64k of head space. This has a good
> > chance of failing under memory pressure. What's worse, if
> > the message length is <32k the allocation may trigger the
> > OOM killer.
> > 
> > This is entirely avoidable; we can use an skb with frags.
> > 
> > The scenario is unlikely and always using frags requires
> > an extra allocation, so opt for the fallback rather
> > than always using a frag'ed/paged skb when the payload is large.
> > 
> > Note that the size heuristic (header_len > PAGE_SIZE)
> > is not entirely accurate; __alloc_skb() will add ~400B
> > to the size. An occasional order-1 allocation should be fine,
> > though; we are primarily concerned with order-3.
> > 
> > Reported-by: Dave Jones <dsj@fb.com>
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>

> > +static inline void sk_allocation_push(struct sock *sk, gfp_t flag, gfp_t *old)
> > +{
> > +	*old = sk->sk_allocation;
> > +	sk->sk_allocation |= flag;
> > +}
> > +  
> 
> This is not thread safe.
> 
> Remember UDP sendmsg() does not lock the socket for non-corking sends.

Ugh, you're right :(

> > +static inline void sk_allocation_pop(struct sock *sk, gfp_t old)
> > +{
> > +	sk->sk_allocation = old;
> > +}
> > +
> >  static inline void sk_acceptq_removed(struct sock *sk)
> >  {
> >  	WRITE_ONCE(sk->sk_ack_backlog, sk->sk_ack_backlog - 1);
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index c3efc7d658f6..a300c2c65d57 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -1095,9 +1095,24 @@ static int __ip_append_data(struct sock *sk,
> >  				alloclen += rt->dst.trailer_len;
> >  
> >  			if (transhdrlen) {
> > -				skb = sock_alloc_send_skb(sk,
> > -						alloclen + hh_len + 15,
> > +				size_t header_len = alloclen + hh_len + 15;
> > +				gfp_t sk_allocation;
> > +
> > +				if (header_len > PAGE_SIZE)
> > +					sk_allocation_push(sk, __GFP_NORETRY,
> > +							   &sk_allocation);
> > +				skb = sock_alloc_send_skb(sk, header_len,
> >  						(flags & MSG_DONTWAIT), &err);
> > +				if (header_len > PAGE_SIZE) {
> > +					BUILD_BUG_ON(MAX_HEADER >= PAGE_SIZE);
> > +
> > +					sk_allocation_pop(sk, sk_allocation);
> > +					if (unlikely(!skb) && !paged &&
> > +					    rt->dst.dev->features & NETIF_F_SG) {
> > +						paged = true;
> > +						goto alloc_new_skb;
> > +					}
> > +				}  
> 
> 
> What about using sock_alloc_send_pskb(... PAGE_ALLOC_COSTLY_ORDER)
> > (as we did in unix_dgram_sendmsg() for large packets) for SG-enabled interfaces?

PAGE_ALLOC_COSTLY_ORDER in itself is more of a problem than a solution.
AFAIU the app sends messages primarily above the ~60kB mark, which is
above COSTLY, and those do not trigger OOM kills. All OOM kills we see
have order=3. Checking with Rik and Johannes W, that's expected: the OOM
killer is only invoked for allocations <= COSTLY, larger ones will just
return NULL and let us deal with it (e.g. by falling back).

So adding GFP_NORETRY is key for 0 < order <= COSTLY,
skb_page_frag_refill()-style.
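
For reference, a simplified sketch of that skb_page_frag_refill()
policy (the real helper is in net/core/sock.c; the refill/reuse logic
is elided and the helper name here is invented):

/* Opportunistic high-order attempt: no direct reclaim, no retries,
 * no warnings; fall back to a plain order-0 page on failure. */
static struct page *frag_page_alloc(gfp_t gfp, unsigned int order)
{
	struct page *page = NULL;

	if (order)
		page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
				   __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY,
				   order);
	if (!page)
		page = alloc_pages(gfp, 0);
	return page;
}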

> We do not _have_ to put all the payload in skb linear part,
> we could instead use page frags (order-0 if high order pages are not available)


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 10:07 ` Paolo Abeni
@ 2021-06-22 16:57   ` Jakub Kicinski
  0 siblings, 0 replies; 12+ messages in thread
From: Jakub Kicinski @ 2021-06-22 16:57 UTC
  To: Paolo Abeni
  Cc: davem, netdev, willemb, eric.dumazet, dsahern, yoshfuji, Dave Jones

On Tue, 22 Jun 2021 12:07:27 +0200 Paolo Abeni wrote:
> On Mon, 2021-06-21 at 16:13 -0700, Jakub Kicinski wrote:
> > Dave observed a number of machines hitting OOM on the UDP send
> > path. The workload seems to be sending large UDP packets over
> > loopback. Since loopback has an MTU of 64k, the kernel will try to
> > allocate an skb with up to 64k of head space. This has a good
> > chance of failing under memory pressure. What's worse, if
> > the message length is <32k the allocation may trigger the
> > OOM killer.
> 
> Out of sheer curiosity, are there a large number of UDP sockets in such
> a workload? Did you increase rmem_default/rmem_max? If so, could tuning
> udp_mem help?

It's a handful of sockets, < 10.

> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index c3efc7d658f6..a300c2c65d57 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -1095,9 +1095,24 @@ static int __ip_append_data(struct sock *sk,
> >  				alloclen += rt->dst.trailer_len;
> >  
> >  			if (transhdrlen) {
> > -				skb = sock_alloc_send_skb(sk,
> > -						alloclen + hh_len + 15,
> > +				size_t header_len = alloclen + hh_len + 15;
> > +				gfp_t sk_allocation;
> > +
> > +				if (header_len > PAGE_SIZE)
> > +					sk_allocation_push(sk, __GFP_NORETRY,
> > +							   &sk_allocation);  
> 
> Could an additional __GFP_NOWARN be relevant here?

We always set GFP_NOWARN for heads thru kmalloc_reserve().
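
(For reference, __alloc_skb() takes the head through kmalloc_reserve(),
whose first attempt always masks the warning off; simplified excerpt
from net/core/skbuff.c:)

/* Try a regular allocation first; only dip into pfmemalloc reserves
 * when the caller is entitled to them. The NOWARN is unconditional. */
obj = kmalloc_node_track_caller(size,
				flags | __GFP_NOMEMALLOC | __GFP_NOWARN,
				node);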


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 16:54   ` Jakub Kicinski
@ 2021-06-22 17:19     ` Jakub Kicinski
  2021-06-22 17:49       ` Eric Dumazet
  2021-06-22 17:48     ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2021-06-22 17:19 UTC
  To: Eric Dumazet; +Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones

On Tue, 22 Jun 2021 09:54:22 -0700 Jakub Kicinski wrote:
> > > +static inline void sk_allocation_push(struct sock *sk, gfp_t flag, gfp_t *old)
> > > +{
> > > +	*old = sk->sk_allocation;
> > > +	sk->sk_allocation |= flag;
> > > +}
> > > +    
> > 
> > This is not thread safe.
> > 
> > Remember UDP sendmsg() does not lock the socket for non-corking sends.  
> 
> Ugh, you're right :(

Hm, isn't it buggy to call sock_alloc_send_[p]skb() without holding the
lock in the first place, then? The knee-jerk fix would be to add another
layer of specialization to the helpers:

diff --git a/include/net/sock.h b/include/net/sock.h
index 7a7058f4f265..06f031705418 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1714,9 +1725,20 @@ int sock_gettstamp(struct socket *sock, void __user *userstamp,
 		   bool timeval, bool time32);
 struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
 				    int noblock, int *errcode);
-struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
-				     unsigned long data_len, int noblock,
-				     int *errcode, int max_page_order);
+struct sk_buff *__sock_alloc_send_pskb(struct sock *sk,
+				       unsigned long header_len,
+				       unsigned long data_len, int noblock,
+				       int *errcode, int max_page_order,
+				       gfp_t gfp_flags);
+
+static inline struct sk_buff *
+sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
+		     unsigned long data_len, int noblock, int *errcode)
+{
+	return __sock_alloc_send_pskb(sk, header_len, data_len,
+				      noblock, errcode, 0, sk->sk_allocation);
+}
+
 void *sock_kmalloc(struct sock *sk, int size, gfp_t priority);
 void sock_kfree_s(struct sock *sk, void *mem, int size);
 void sock_kzfree_s(struct sock *sk, void *mem, int size);
diff --git a/net/core/sock.c b/net/core/sock.c
index 946888afef88..64b7271a7d21 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2331,9 +2331,11 @@ static long sock_wait_for_wmem(struct sock *sk, long timeo)
  *	Generic send/receive buffer handlers
  */
 
-struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
-				     unsigned long data_len, int noblock,
-				     int *errcode, int max_page_order)
+struct sk_buff *__sock_alloc_send_pskb(struct sock *sk,
+				       unsigned long header_len,
+				       unsigned long data_len, int noblock,
+				       int *errcode, int max_page_order,
+				       gfp_t gfp_flags)
 {
 	struct sk_buff *skb;
 	long timeo;
@@ -2362,7 +2364,7 @@ struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
 		timeo = sock_wait_for_wmem(sk, timeo);
 	}
 	skb = alloc_skb_with_frags(header_len, data_len, max_page_order,
-				   errcode, sk->sk_allocation);
+				   errcode, gfp_flags);
 	if (skb)
 		skb_set_owner_w(skb, sk);
 	return skb;
@@ -2373,7 +2375,7 @@ struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
 	*errcode = err;
 	return NULL;
 }
-EXPORT_SYMBOL(sock_alloc_send_pskb);
+EXPORT_SYMBOL(__sock_alloc_send_pskb);
 
 struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
 				    int noblock, int *errcode)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c3efc7d658f6..211f1ea6cf2a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1095,9 +1095,22 @@ static int __ip_append_data(struct sock *sk,
 				alloclen += rt->dst.trailer_len;
 
 			if (transhdrlen) {
-				skb = sock_alloc_send_skb(sk,
-						alloclen + hh_len + 15,
-						(flags & MSG_DONTWAIT), &err);
+				bool sg = rt->dst.dev->features & NETIF_F_SG;
+				size_t header_len = alloclen + hh_len + 15;
+				gfp_t sk_allocation;
+
+				sk_allocation = sk->sk_allocation;
+				if (header_len > PAGE_SIZE && sg)
+					sk_allocation |= __GFP_NORETRY;
+
+				skb = __sock_alloc_send_pskb(sk, header_len, 0,
+						(flags & MSG_DONTWAIT), &err,
+							     0, sk_allocation);
+				if (unlikely(!skb) && !paged && sg) {
+					BUILD_BUG_ON(MAX_HEADER >= PAGE_SIZE);
+					paged = true;
+					goto alloc_new_skb;
+				}
 			} else {
 				skb = NULL;
 				if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <=


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 16:54   ` Jakub Kicinski
  2021-06-22 17:19     ` Jakub Kicinski
@ 2021-06-22 17:48     ` Eric Dumazet
  2021-06-22 18:09       ` Jakub Kicinski
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2021-06-22 17:48 UTC
  To: Jakub Kicinski, Eric Dumazet
  Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones



On 6/22/21 6:54 PM, Jakub Kicinski wrote:
> On Tue, 22 Jun 2021 16:12:11 +0200 Eric Dumazet wrote:
>> On 6/22/21 1:13 AM, Jakub Kicinski wrote:
>>> Dave observed a number of machines hitting OOM on the UDP send
>>> path. The workload seems to be sending large UDP packets over
>>> loopback. Since loopback has an MTU of 64k, the kernel will try to
>>> allocate an skb with up to 64k of head space. This has a good
>>> chance of failing under memory pressure. What's worse, if
>>> the message length is <32k the allocation may trigger the
>>> OOM killer.
>>>
>>> This is entirely avoidable; we can use an skb with frags.
>>>
>>> The scenario is unlikely and always using frags requires
>>> an extra allocation, so opt for the fallback rather
>>> than always using a frag'ed/paged skb when the payload is large.
>>>
>>> Note that the size heuristic (header_len > PAGE_SIZE)
>>> is not entirely accurate; __alloc_skb() will add ~400B
>>> to the size. An occasional order-1 allocation should be fine,
>>> though; we are primarily concerned with order-3.
>>>
>>> Reported-by: Dave Jones <dsj@fb.com>
>>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> 
>>> +static inline void sk_allocation_push(struct sock *sk, gfp_t flag, gfp_t *old)
>>> +{
>>> +	*old = sk->sk_allocation;
>>> +	sk->sk_allocation |= flag;
>>> +}
>>> +  
>>
>> This is not thread safe.
>>
>> Remember UDP sendmsg() does not lock the socket for non-corking sends.
> 
> Ugh, you're right :(
> 
>>> +static inline void sk_allocation_pop(struct sock *sk, gfp_t old)
>>> +{
>>> +	sk->sk_allocation = old;
>>> +}
>>> +
>>>  static inline void sk_acceptq_removed(struct sock *sk)
>>>  {
>>>  	WRITE_ONCE(sk->sk_ack_backlog, sk->sk_ack_backlog - 1);
>>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>>> index c3efc7d658f6..a300c2c65d57 100644
>>> --- a/net/ipv4/ip_output.c
>>> +++ b/net/ipv4/ip_output.c
>>> @@ -1095,9 +1095,24 @@ static int __ip_append_data(struct sock *sk,
>>>  				alloclen += rt->dst.trailer_len;
>>>  
>>>  			if (transhdrlen) {
>>> -				skb = sock_alloc_send_skb(sk,
>>> -						alloclen + hh_len + 15,
>>> +				size_t header_len = alloclen + hh_len + 15;
>>> +				gfp_t sk_allocation;
>>> +
>>> +				if (header_len > PAGE_SIZE)
>>> +					sk_allocation_push(sk, __GFP_NORETRY,
>>> +							   &sk_allocation);
>>> +				skb = sock_alloc_send_skb(sk, header_len,
>>>  						(flags & MSG_DONTWAIT), &err);
>>> +				if (header_len > PAGE_SIZE) {
>>> +					BUILD_BUG_ON(MAX_HEADER >= PAGE_SIZE);
>>> +
>>> +					sk_allocation_pop(sk, sk_allocation);
>>> +					if (unlikely(!skb) && !paged &&
>>> +					    rt->dst.dev->features & NETIF_F_SG) {
>>> +						paged = true;
>>> +						goto alloc_new_skb;
>>> +					}
>>> +				}  
>>
>>
>> What about using sock_alloc_send_pskb(... PAGE_ALLOC_COSTLY_ORDER)
>> (as we did in unix_dgram_sendmsg() for large packets) for SG-enabled interfaces?
> 
> PAGE_ALLOC_COSTLY_ORDER in itself is more of a problem than a solution.
> AFAIU the app sends messages primarily above the ~60kB mark, which is
> above COSTLY, and those do not trigger OOM kills. All OOM kills we see
> have order=3. Checking with Rik and Johannes W, that's expected: the OOM
> killer is only invoked for allocations <= COSTLY, larger ones will just
> return NULL and let us deal with it (e.g. by falling back).

I really thought alloc_skb_with_frags() was already handling low-memory conditions.

(alloc_skb_with_frags() is called from sock_alloc_send_pskb())

If it is not, let's fix it, because af_unix sockets will have the same issue?



> 
> So adding GFP_NORETRY is key for 0 < order <= COSTLY,
> skb_page_frag_refill()-style.
> 
>> We do not _have_ to put all the payload in skb linear part,
>> we could instead use page frags (order-0 if high order pages are not available)


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 17:19     ` Jakub Kicinski
@ 2021-06-22 17:49       ` Eric Dumazet
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2021-06-22 17:49 UTC
  To: Jakub Kicinski, Eric Dumazet
  Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones



On 6/22/21 7:19 PM, Jakub Kicinski wrote:
> On Tue, 22 Jun 2021 09:54:22 -0700 Jakub Kicinski wrote:
>>>> +static inline void sk_allocation_push(struct sock *sk, gfp_t flag, gfp_t *old)
>>>> +{
>>>> +	*old = sk->sk_allocation;
>>>> +	sk->sk_allocation |= flag;
>>>> +}
>>>> +    
>>>
>>> This is not thread safe.
>>>
>>> Remember UDP sendmsg() does not lock the socket for non-corking sends.  
>>
>> Ugh, you're right :(
> 
> Hm, isn't it buggy to call sock_alloc_send_[p]skb() without holding the
> lock in the first place, then? The knee-jerk fix would be to add another
> layer of specialization to the helpers:

It is not buggy. Please elaborate if you found that it is.




* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 17:48     ` Eric Dumazet
@ 2021-06-22 18:09       ` Jakub Kicinski
  2021-06-22 18:47         ` Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2021-06-22 18:09 UTC
  To: Eric Dumazet; +Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones

On Tue, 22 Jun 2021 19:48:43 +0200 Eric Dumazet wrote:
> >> What about using sock_alloc_send_pskb(... PAGE_ALLOC_COSTLY_ORDER)
> >> (as we did in unix_dgram_sendmsg() for large packets) for SG-enabled interfaces?
> > 
> > PAGE_ALLOC_COSTLY_ORDER in itself is more of a problem than a solution.
> > AFAIU the app sends messages primarily above the ~60kB mark, which is
> > above COSTLY, and those do not trigger OOM kills. All OOM kills we see
> > have order=3. Checking with Rik and Johannes W, that's expected: the OOM
> > killer is only invoked for allocations <= COSTLY, larger ones will just
> > return NULL and let us deal with it (e.g. by falling back).
> 
> I really thought alloc_skb_with_frags() was already handling low-memory conditions.
> 
> (alloc_skb_with_frags() is called from sock_alloc_send_pskb())
> 
> If it is not, let's fix it, because af_unix sockets will have the same issue?

af_unix seems to cap at SKB_MAX_ALLOC which is order 2, AFAICT.

Perhaps that's a good enough fix in practice given we see OOMs with
order=3 only?

I'll review callers of alloc_skb_with_frags() and see if they depend
on the explicit geometry of the skb or if we can safely fall back to pages.


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 18:09       ` Jakub Kicinski
@ 2021-06-22 18:47         ` Eric Dumazet
  2021-06-22 19:04           ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2021-06-22 18:47 UTC
  To: Jakub Kicinski, Eric Dumazet
  Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones



On 6/22/21 8:09 PM, Jakub Kicinski wrote:
> On Tue, 22 Jun 2021 19:48:43 +0200 Eric Dumazet wrote:
>>>> What about using sock_alloc_send_pskb(... PAGE_ALLOC_COSTLY_ORDER)
>>>> (as we did in unix_dgram_sendmsg() for large packets) for SG-enabled interfaces?
>>>
>>> PAGE_ALLOC_COSTLY_ORDER in itself is more of a problem than a solution.
>>> AFAIU the app sends messages primarily above the ~60kB mark, which is
>>> above COSTLY, and those do not trigger OOM kills. All OOM kills we see
>>> have order=3. Checking with Rik and Johannes W, that's expected: the OOM
>>> killer is only invoked for allocations <= COSTLY, larger ones will just
>>> return NULL and let us deal with it (e.g. by falling back).
>>
>> I really thought alloc_skb_with_frags() was already handling low-memory conditions.
>>
>> (alloc_skb_with_frags() is called from sock_alloc_send_pskb())
>>
>> If it is not, let's fix it, because af_unix sockets will have the same issue?
> 
> af_unix seems to cap at SKB_MAX_ALLOC which is order 2, AFAICT.

It does not cap at SKB_MAX_ALLOC.

It definitely attempts big allocations if you send 64KB datagrams.

Please look at commit d14b56f508ad70eca3e659545aab3c45200f258c
    net: cleanup gfp mask in alloc_skb_with_frags

This explains why we do not have __GFP_NORETRY there.
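
Roughly, the frag loop there already degrades on its own; a simplified
sketch of the high-order path after that commit (surrounding loop and
error handling elided):

/* High-order frag attempt: direct reclaim masked out, no warning,
 * hence no need for __GFP_NORETRY; on failure just lower the order.
 * The skb head itself (alloc_skb(header_len, gfp_mask)) gets no such
 * easing, which is the gap the UDP patch is chasing. */
page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
		   __GFP_COMP | __GFP_NOWARN, order);
if (!page) {
	/* Do not retry other high order allocations */
	order = 1;
	max_page_order = 0;
}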

> 
> Perhaps that's a good enough fix in practice given we see OOMs with
> order=3 only?
> 
> I'll review callers of alloc_skb_with_frags() and see if they depend
> on the explicit geometry of the skb or if we can safely fall back to pages.
> 


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 18:47         ` Eric Dumazet
@ 2021-06-22 19:04           ` Jakub Kicinski
  2021-06-22 19:51             ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2021-06-22 19:04 UTC
  To: Eric Dumazet; +Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones

On Tue, 22 Jun 2021 20:47:57 +0200 Eric Dumazet wrote:
> On 6/22/21 8:09 PM, Jakub Kicinski wrote:
> > On Tue, 22 Jun 2021 19:48:43 +0200 Eric Dumazet wrote:  
> >> I really thought alloc_skb_with_frags() was already handling low-memory conditions.
> >>
> >> (alloc_skb_with_frags() is called from sock_alloc_send_pskb())
> >>
> >> If it is not, let's fix it, because af_unix sockets will have the same issue?
> > 
> > af_unix seems to cap at SKB_MAX_ALLOC which is order 2, AFAICT.  
> 
> It does not cap at SKB_MAX_ALLOC.
> 
> It definitely attempts big allocations if you send 64KB datagrams.
> 
> Please look at commit d14b56f508ad70eca3e659545aab3c45200f258c
>     net: cleanup gfp mask in alloc_skb_with_frags
> 
> This explains why we do not have __GFP_NORETRY there.

Ah, right, slight misunderstanding.

Just to be 100% clear: for UDP send we are allocating up to 64kB
in the _head_, AFAICT. Allocation of the head does not clear GFP_WAIT.

Your memory was correct, alloc_skb_with_frags() does handle low memory
when it comes to allocating frags. And what I was saying is that af_unix
won't have the same problem as UDP, as it caps the head's size at
SKB_MAX_ALLOC and frags are allocated with fallback.

For the UDP case we can either adapt the af_unix approach and cap the
head size to SKB_MAX_ALLOC, or try to allocate the full skb and fall back.
Having alloc_skb_with_frags() itself re-balance head <> data
automatically does not feel right, no?


* Re: [PATCH net-next] ip: avoid OOM kills with large UDP sends over loopback
  2021-06-22 19:04           ` Jakub Kicinski
@ 2021-06-22 19:51             ` Jakub Kicinski
  0 siblings, 0 replies; 12+ messages in thread
From: Jakub Kicinski @ 2021-06-22 19:51 UTC
  To: Eric Dumazet; +Cc: davem, netdev, willemb, dsahern, yoshfuji, Dave Jones

On Tue, 22 Jun 2021 12:04:26 -0700 Jakub Kicinski wrote:
> For the UDP case we can either adapt the af_unix approach and cap the
> head size to SKB_MAX_ALLOC, or try to allocate the full skb and fall back.
> Having alloc_skb_with_frags() itself re-balance head <> data
> automatically does not feel right, no?

Actually, looking closer at the UDP code, it appears it only uses the
giant head it allocated if the underlying device doesn't have SG.
We can make the head smaller and probably only improve performance
for 99% of deployments. I'll send a v2.
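
A rough sketch of where that could go (hypothetical, not the posted v2;
only names from the patch above are used):

/* On an SG-capable device, start with a paged skb instead of falling
 * back to one: headers stay linear, payload goes to page frags. */
if (rt->dst.dev->features & NETIF_F_SG && header_len > PAGE_SIZE)
	paged = true;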

