linux-kernel.vger.kernel.org archive mirror
* [PATCH IPV6 1/1] ipv6: allocate enough headroom in ip6_finish_output2()
       [not found] <cover.1625665132.git.vvs@virtuozzo.com>
@ 2021-07-07 14:04 ` Vasily Averin
  2021-07-07 14:45   ` David Ahern
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-07-07 14:04 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski
  Cc: netdev, linux-kernel

When the TEE target mirrors traffic to another interface, the sk_buff may
not have enough headroom to be processed correctly.
ip_finish_output2() detects this situation for ipv4 and allocates a
new skb with enough headroom. However, ipv6 lacks this logic in
ip6_finish_output2(), which leads to skb_under_panic:

 skbuff: skb_under_panic: text:ffffffffc0866ad4 len:96 put:24
 head:ffff97be85e31800 data:ffff97be85e317f8 tail:0x58 end:0xc0 dev:gre0
 ------------[ cut here ]------------
 kernel BUG at net/core/skbuff.c:110!
 invalid opcode: 0000 [#1] SMP PTI
 CPU: 2 PID: 393 Comm: kworker/2:2 Tainted: G           OE     5.13.0 #13
 Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
 Workqueue: ipv6_addrconf addrconf_dad_work
 RIP: 0010:skb_panic+0x48/0x4a
 Call Trace:
  skb_push.cold.111+0x10/0x10
  ipgre_header+0x24/0xf0 [ip_gre]
  neigh_connected_output+0xae/0xf0
  ip6_finish_output2+0x1a8/0x5a0
  ip6_output+0x5c/0x110
  nf_dup_ipv6+0x158/0x1000 [nf_dup_ipv6]
  tee_tg6+0x2e/0x40 [xt_TEE]
  ip6t_do_table+0x294/0x470 [ip6_tables]
  nf_hook_slow+0x44/0xc0
  nf_hook.constprop.34+0x72/0xe0
  ndisc_send_skb+0x20d/0x2e0
  ndisc_send_ns+0xd1/0x210
  addrconf_dad_work+0x3c8/0x540
  process_one_work+0x1d1/0x370
  worker_thread+0x30/0x390
  kthread+0x116/0x130
  ret_from_fork+0x22/0x30

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/ip6_output.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ff4f9eb..e5af740 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -61,9 +61,24 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 	struct dst_entry *dst = skb_dst(skb);
 	struct net_device *dev = dst->dev;
 	const struct in6_addr *nexthop;
+	unsigned int hh_len = LL_RESERVED_SPACE(dev);
 	struct neighbour *neigh;
 	int ret;
 
+	/* Be paranoid, rather than too clever. */
+	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
+		struct sk_buff *skb2;
+
+		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
+		if (!skb2) {
+			kfree_skb(skb);
+			return -ENOMEM;
+		}
+		if (skb->sk)
+			skb_set_owner_w(skb2, skb->sk);
+		consume_skb(skb);
+		skb = skb2;
+	}
 	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr)) {
 		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
 
-- 
1.8.3.1



* Re: [PATCH IPV6 1/1] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-07 14:04 ` [PATCH IPV6 1/1] ipv6: allocate enough headroom in ip6_finish_output2() Vasily Averin
@ 2021-07-07 14:45   ` David Ahern
  2021-07-07 16:42     ` Jakub Kicinski
  0 siblings, 1 reply; 106+ messages in thread
From: David Ahern @ 2021-07-07 14:45 UTC (permalink / raw)
  To: Vasily Averin, David S. Miller, Hideaki YOSHIFUJI, David Ahern,
	Jakub Kicinski
  Cc: netdev, linux-kernel

On 7/7/21 8:04 AM, Vasily Averin wrote:
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index ff4f9eb..e5af740 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -61,9 +61,24 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
>  	struct dst_entry *dst = skb_dst(skb);
>  	struct net_device *dev = dst->dev;
>  	const struct in6_addr *nexthop;
> +	unsigned int hh_len = LL_RESERVED_SPACE(dev);
>  	struct neighbour *neigh;
>  	int ret;
>  
> +	/* Be paranoid, rather than too clever. */
> +	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
> +		struct sk_buff *skb2;
> +
> +		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));

why not use hh_len here?


> +		if (!skb2) {
> +			kfree_skb(skb);
> +			return -ENOMEM;
> +		}
> +		if (skb->sk)
> +			skb_set_owner_w(skb2, skb->sk);
> +		consume_skb(skb);
> +		skb = skb2;
> +	}
>  	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr)) {
>  		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
>  
> 



* Re: [PATCH IPV6 1/1] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-07 14:45   ` David Ahern
@ 2021-07-07 16:42     ` Jakub Kicinski
  2021-07-07 17:41       ` Eric Dumazet
  0 siblings, 1 reply; 106+ messages in thread
From: Jakub Kicinski @ 2021-07-07 16:42 UTC (permalink / raw)
  To: David Ahern
  Cc: Vasily Averin, David S. Miller, Hideaki YOSHIFUJI, David Ahern,
	netdev, linux-kernel

On Wed, 7 Jul 2021 08:45:13 -0600 David Ahern wrote:
> On 7/7/21 8:04 AM, Vasily Averin wrote:
> > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> > index ff4f9eb..e5af740 100644
> > --- a/net/ipv6/ip6_output.c
> > +++ b/net/ipv6/ip6_output.c
> > @@ -61,9 +61,24 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
> >  	struct dst_entry *dst = skb_dst(skb);
> >  	struct net_device *dev = dst->dev;
> >  	const struct in6_addr *nexthop;
> > +	unsigned int hh_len = LL_RESERVED_SPACE(dev);
> >  	struct neighbour *neigh;
> >  	int ret;
> >  
> > +	/* Be paranoid, rather than too clever. */
> > +	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
> > +		struct sk_buff *skb2;
> > +
> > +		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));  
> 
> why not use hh_len here?

Is there a reason for the new skb? Why not pskb_expand_head()?

> > +		if (!skb2) {
> > +			kfree_skb(skb);
> > +			return -ENOMEM;
> > +		}
> > +		if (skb->sk)
> > +			skb_set_owner_w(skb2, skb->sk);
> > +		consume_skb(skb);
> > +		skb = skb2;
> > +	}
> >  	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr)) {
> >  		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));




* Re: [PATCH IPV6 1/1] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-07 16:42     ` Jakub Kicinski
@ 2021-07-07 17:41       ` Eric Dumazet
  2021-07-07 17:53         ` Vasily Averin
                           ` (3 more replies)
  0 siblings, 4 replies; 106+ messages in thread
From: Eric Dumazet @ 2021-07-07 17:41 UTC (permalink / raw)
  To: Jakub Kicinski, David Ahern
  Cc: Vasily Averin, David S. Miller, Hideaki YOSHIFUJI, David Ahern,
	netdev, linux-kernel



On 7/7/21 6:42 PM, Jakub Kicinski wrote:
> On Wed, 7 Jul 2021 08:45:13 -0600 David Ahern wrote:
>> On 7/7/21 8:04 AM, Vasily Averin wrote:
>>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>>> index ff4f9eb..e5af740 100644
>>> --- a/net/ipv6/ip6_output.c
>>> +++ b/net/ipv6/ip6_output.c
>>> @@ -61,9 +61,24 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
>>>  	struct dst_entry *dst = skb_dst(skb);
>>>  	struct net_device *dev = dst->dev;
>>>  	const struct in6_addr *nexthop;
>>> +	unsigned int hh_len = LL_RESERVED_SPACE(dev);
>>>  	struct neighbour *neigh;
>>>  	int ret;
>>>  
>>> +	/* Be paranoid, rather than too clever. */
>>> +	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
>>> +		struct sk_buff *skb2;
>>> +
>>> +		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));  
>>
>> why not use hh_len here?
> 
> Is there a reason for the new skb? Why not pskb_expand_head()?


pskb_expand_head() might crash, if skb is shared.

We possibly can add a helper, factorizing all this,
and eventually use pskb_expand_head() if safe.
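
A minimal userspace sketch of the rule described above: rebuild the buffer
in place only when there is a single holder, otherwise work on a private
copy. struct model_buf and the model_* functions are hypothetical and are
not kernel APIs; the copy made in the shared case is a simplification and
does not claim to match what skb_clone() or pskb_expand_head() do internally.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model of a reference-counted packet buffer. */
struct model_buf {
	int users;		/* number of holders, similar in spirit to skb->users */
	size_t headroom;	/* free bytes in front of the payload */
	size_t len;		/* payload length */
	unsigned char *mem;	/* headroom + len bytes */
};

struct model_buf *model_alloc(size_t headroom, size_t len)
{
	struct model_buf *b = malloc(sizeof(*b));
	size_t size = headroom + len;

	if (!b)
		return NULL;
	b->mem = calloc(size ? size : 1, 1);
	if (!b->mem) {
		free(b);
		return NULL;
	}
	b->users = 1;
	b->headroom = headroom;
	b->len = len;
	return b;
}

/* Drop one reference; free the buffer when the last holder is gone. */
void model_put(struct model_buf *b)
{
	assert(b && b->users > 0);
	if (--b->users == 0) {
		free(b->mem);
		free(b);
	}
}

/*
 * Grow the headroom by delta bytes. The caller's reference is consumed on
 * failure and in the shared case; the return value is the buffer to use,
 * or NULL on allocation failure.
 */
struct model_buf *model_expand_head(struct model_buf *b, size_t delta)
{
	struct model_buf *nb;
	unsigned char *mem;
	size_t size = b->headroom + delta + b->len;

	if (b->users > 1) {
		/* Shared: other holders still rely on the old layout. */
		nb = model_alloc(b->headroom + delta, b->len);
		if (nb)
			memcpy(nb->mem + nb->headroom,
			       b->mem + b->headroom, b->len);
		model_put(b);
		return nb;
	}

	/* Sole owner: rebuilding the data area in place is safe. */
	mem = calloc(size ? size : 1, 1);
	if (!mem) {
		model_put(b);
		return NULL;
	}
	memcpy(mem + b->headroom + delta, b->mem + b->headroom, b->len);
	free(b->mem);
	b->mem = mem;
	b->headroom += delta;
	return b;
}
```

In this model a shared buffer is never modified: the other holder keeps the
old layout, and only the caller's reference moves to the enlarged copy.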


* Re: [PATCH IPV6 1/1] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-07 17:41       ` Eric Dumazet
@ 2021-07-07 17:53         ` Vasily Averin
  2021-07-07 18:30         ` Jakub Kicinski
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-07 17:53 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, David Ahern
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, netdev, linux-kernel

On 7/7/21 8:41 PM, Eric Dumazet wrote:
> On 7/7/21 6:42 PM, Jakub Kicinski wrote:
>> On Wed, 7 Jul 2021 08:45:13 -0600 David Ahern wrote:
>>> On 7/7/21 8:04 AM, Vasily Averin wrote:
>>>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>>>> index ff4f9eb..e5af740 100644
>>>> --- a/net/ipv6/ip6_output.c
>>>> +++ b/net/ipv6/ip6_output.c
>>>> @@ -61,9 +61,24 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
>>>>  	struct dst_entry *dst = skb_dst(skb);
>>>>  	struct net_device *dev = dst->dev;
>>>>  	const struct in6_addr *nexthop;
>>>> +	unsigned int hh_len = LL_RESERVED_SPACE(dev);
>>>>  	struct neighbour *neigh;
>>>>  	int ret;
>>>>  
>>>> +	/* Be paranoid, rather than too clever. */
>>>> +	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
>>>> +		struct sk_buff *skb2;
>>>> +
>>>> +		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));  
>>>
>>> why not use hh_len here?
>>
>> Is there a reason for the new skb? Why not pskb_expand_head()?
> 
> pskb_expand_head() might crash, if skb is shared.
> 
> We possibly can add a helper, factorizing all this,
> and eventually use pskb_expand_head() if safe.

Thank you for the feedback, I'll do it in the second version.
	Vasily Averin


* Re: [PATCH IPV6 1/1] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-07 17:41       ` Eric Dumazet
  2021-07-07 17:53         ` Vasily Averin
@ 2021-07-07 18:30         ` Jakub Kicinski
  2021-07-07 18:50           ` Eric Dumazet
  2021-07-09  9:04         ` [PATCH IPV6 v2 0/4] " Vasily Averin
       [not found]         ` <cover.1625818825.git.vvs@virtuozzo.com>
  3 siblings, 1 reply; 106+ messages in thread
From: Jakub Kicinski @ 2021-07-07 18:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Ahern, Vasily Averin, David S. Miller, Hideaki YOSHIFUJI,
	David Ahern, netdev, linux-kernel

On Wed, 7 Jul 2021 19:41:44 +0200 Eric Dumazet wrote:
> On 7/7/21 6:42 PM, Jakub Kicinski wrote:
> > On Wed, 7 Jul 2021 08:45:13 -0600 David Ahern wrote:  
> >> why not use hh_len here?  
> > 
> > Is there a reason for the new skb? Why not pskb_expand_head()?  
> 
> 
> pskb_expand_head() might crash, if skb is shared.
> 
> We possibly can add a helper, factorizing all this,
> and eventually use pskb_expand_head() if safe.

Is there a strategically placed skb_share_check() somewhere further
down? Otherwise there seems to be a lot of questionable skb_cow*()
calls, also __skb_linearize() and skb_pad() are risky, no?
Or is it that shared skbs are uncommon and syzbot doesn't hit them?


* Re: [PATCH IPV6 1/1] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-07 18:30         ` Jakub Kicinski
@ 2021-07-07 18:50           ` Eric Dumazet
  0 siblings, 0 replies; 106+ messages in thread
From: Eric Dumazet @ 2021-07-07 18:50 UTC (permalink / raw)
  To: Jakub Kicinski, Eric Dumazet
  Cc: David Ahern, Vasily Averin, David S. Miller, Hideaki YOSHIFUJI,
	David Ahern, netdev, linux-kernel



On 7/7/21 8:30 PM, Jakub Kicinski wrote:
> On Wed, 7 Jul 2021 19:41:44 +0200 Eric Dumazet wrote:
>> On 7/7/21 6:42 PM, Jakub Kicinski wrote:
>>> On Wed, 7 Jul 2021 08:45:13 -0600 David Ahern wrote:  
>>>> why not use hh_len here?  
>>>
>>> Is there a reason for the new skb? Why not pskb_expand_head()?  
>>
>>
>> pskb_expand_head() might crash, if skb is shared.
>>
>> We possibly can add a helper, factorizing all this,
>> and eventually use pskb_expand_head() if safe.
> 
> Is there a strategically placed skb_share_check() somewhere further
> down? Otherwise there seems to be a lot of questionable skb_cow*()
> calls, also __skb_linearize() and skb_pad() are risky, no?
> Or is it that shared skbs are uncommon and syzbot doesn't hit them?
> 

Some of us try hard to remove skb_get() occurrences,
but they tend to re-appear fast :/

Refs: commit a516993f0ac1694673412eb2d16a091eafa77d2a
("net: fix wrong skb_get() usage / crash in IGMP/MLD parsing code") 



* [PATCH IPV6 v2 0/4] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-07 17:41       ` Eric Dumazet
  2021-07-07 17:53         ` Vasily Averin
  2021-07-07 18:30         ` Jakub Kicinski
@ 2021-07-09  9:04         ` Vasily Averin
  2021-07-12  6:44           ` [PATCH IPV6 v3 0/1] " Vasily Averin
                             ` (3 more replies)
       [not found]         ` <cover.1625818825.git.vvs@virtuozzo.com>
  3 siblings, 4 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-09  9:04 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Recently Syzkaller found one more issue on RHEL7-based OpenVZ kernels.
While investigating it, I found that upstream is affected too.

The TEE target sends an skb with small headroom to another interface that
requires more headroom.

ipv4 handles this problem in ip_finish_output2() and creates a new skb with enough headroom,
but ip6_finish_output2() lacks this logic.
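
A minimal userspace sketch of the headroom arithmetic, using the numbers
from the skb_under_panic report in this thread (put:24, with data ending
8 bytes below head, which implies 16 bytes of headroom before the push).
struct model_skb and the model_* functions are hypothetical stand-ins, not
kernel APIs; they only mirror the idea behind skb_headroom() and the check
that skb_push() trips over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sk_buff pointers that matter here. */
struct model_skb {
	uintptr_t head;	/* start of the allocated buffer */
	uintptr_t data;	/* start of the packet data */
};

/* Free bytes between head and data, the quantity skb_headroom() reports. */
size_t model_headroom(const struct model_skb *skb)
{
	assert(skb != NULL);
	return (size_t)(skb->data - skb->head);
}

/* Extra bytes required to fit `needed` bytes of header; <= 0 means none. */
long model_delta(const struct model_skb *skb, size_t needed)
{
	return (long)needed - (long)model_headroom(skb);
}

/*
 * Move data down by len bytes to make room for a header.
 * Returns 0 on success, or -1 (leaving the buffer untouched) in the
 * situation where the kernel would report skb_under_panic.
 */
int model_push(struct model_skb *skb, size_t len)
{
	if (model_headroom(skb) < len)
		return -1;
	skb->data -= len;
	return 0;
}
```

With 16 bytes of headroom and a 24-byte link-layer header, model_delta()
reports that 8 more bytes are required, which is the shortfall visible as
data sitting 8 bytes below head in the report.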

Syzkaller created a C reproducer; it can be found in the v1 cover letter.

v2 changes:
 a new helper was created and used in ip6_finish_output2() and in ip6_xmit()
 small refactoring in the changed functions: commonly used dereferences were replaced by variables

ToDo:
 choose a proper name for the helper,
 move it to a proper place,
 use it in other similar places:
   pptp_xmit
   vrf_finish_output
   ax25_transmit_buffer
   ax25_rt_build_path
   bpf_out_neigh_v6
   bpf_out_neigh_v4
   ip_finish_output2
   ip6_tnl_xmit
   ipip6_tunnel_xmit
   ip_vs_prepare_tunneled_skb

Vasily Averin (4):
  ipv6: allocate enough headroom in ip6_finish_output2()
  ipv6: use new helper skb_expand_head() in ip6_xmit()
  ipv6: ip6_finish_output2 refactoring
  ipv6: ip6_xmit refactoring

 net/ipv6/ip6_output.c | 89 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 59 insertions(+), 30 deletions(-)

-- 
1.8.3.1



* [PATCH IPV6 v2 1/4] ipv6: allocate enough headroom in ip6_finish_output2()
       [not found]         ` <cover.1625818825.git.vvs@virtuozzo.com>
@ 2021-07-09  9:04           ` Vasily Averin
  2021-07-09 17:58             ` David Miller
  2021-07-09  9:04           ` [PATCH IPV6 v2 2/4] ipv6: use new helper skb_expand_head() in ip6_xmit() Vasily Averin
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-07-09  9:04 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

When the TEE target mirrors traffic to another interface, the sk_buff may
not have enough headroom to be processed correctly.
ip_finish_output2() detects this situation for ipv4 and allocates a
new skb with enough headroom. However, ipv6 lacks this logic in
ip6_finish_output2(), which leads to skb_under_panic:

 skbuff: skb_under_panic: text:ffffffffc0866ad4 len:96 put:24
 head:ffff97be85e31800 data:ffff97be85e317f8 tail:0x58 end:0xc0 dev:gre0
 ------------[ cut here ]------------
 kernel BUG at net/core/skbuff.c:110!
 invalid opcode: 0000 [#1] SMP PTI
 CPU: 2 PID: 393 Comm: kworker/2:2 Tainted: G           OE     5.13.0 #13
 Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
 Workqueue: ipv6_addrconf addrconf_dad_work
 RIP: 0010:skb_panic+0x48/0x4a
 Call Trace:
  skb_push.cold.111+0x10/0x10
  ipgre_header+0x24/0xf0 [ip_gre]
  neigh_connected_output+0xae/0xf0
  ip6_finish_output2+0x1a8/0x5a0
  ip6_output+0x5c/0x110
  nf_dup_ipv6+0x158/0x1000 [nf_dup_ipv6]
  tee_tg6+0x2e/0x40 [xt_TEE]
  ip6t_do_table+0x294/0x470 [ip6_tables]
  nf_hook_slow+0x44/0xc0
  nf_hook.constprop.34+0x72/0xe0
  ndisc_send_skb+0x20d/0x2e0
  ndisc_send_ns+0xd1/0x210
  addrconf_dad_work+0x3c8/0x540
  process_one_work+0x1d1/0x370
  worker_thread+0x30/0x390
  kthread+0x116/0x130
  ret_from_fork+0x22/0x30

This patch implements a new helper that tries to expand the headroom of the
current skb; if that is not possible (shared skb), it creates a new one.

v2 open questions:
- the current helper name, skb_expand_head, is bad
  and should be changed to a better one. Any suggestions?
- proper location for the new helper:
   in net/core/skbuff.c, right below skb_realloc_headroom()?
- is it acceptable to free the original skb inside the helper?
  Or should that be left to the caller instead?

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/ip6_output.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ff4f9eb..6c5f85f 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -56,14 +56,48 @@
 #include <net/lwtunnel.h>
 #include <net/ip_tunnels.h>
 
+static inline struct sk_buff *skb_expand_head(struct sk_buff *skb, int delta)
+{
+	/* pskb_expand_head() might crash, if skb is shared */
+	if (skb_shared(skb)) {
+		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
+
+		if (likely(nskb)) {
+			if (skb->sk)
+				skb_set_owner_w(nskb, skb->sk);
+			consume_skb(skb);
+		} else {
+			kfree_skb(skb);
+		}
+		skb = nskb;
+	}
+	if (skb &&
+	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
+		kfree_skb(skb);
+		skb = NULL;
+	}
+	return skb;
+}
+
 static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
 	struct net_device *dev = dst->dev;
+	unsigned int hh_len = LL_RESERVED_SPACE(dev);
+	int delta = hh_len - skb_headroom(skb);
 	const struct in6_addr *nexthop;
 	struct neighbour *neigh;
 	int ret;
 
+	/* Be paranoid, rather than too clever. */
+	if (unlikely(delta  > 0) && dev->header_ops)
+		skb = skb_expand_head(skb, delta);
+
+	if (!skb) {
+		IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTDISCARDS);
+		return -ENOMEM;
+	}
+
 	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr)) {
 		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
 
-- 
1.8.3.1



* [PATCH IPV6 v2 2/4] ipv6: use new helper skb_expand_head() in ip6_xmit()
       [not found]         ` <cover.1625818825.git.vvs@virtuozzo.com>
  2021-07-09  9:04           ` [PATCH IPV6 v2 1/4] ipv6: allocate enough headroom in ip6_finish_output2() Vasily Averin
@ 2021-07-09  9:04           ` Vasily Averin
  2021-07-09  9:05           ` [PATCH IPV6 v2 3/4] ipv6: ip6_finish_output2 refactoring Vasily Averin
  2021-07-09  9:05           ` [PATCH IPV6 v2 4/4] ipv6: ip6_xmit refactoring Vasily Averin
  3 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-09  9:04 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

The following functions can be changed in the same way:
pptp_xmit
vrf_finish_output
ax25_transmit_buffer
ax25_rt_build_path
bpf_out_neigh_v6
bpf_out_neigh_v4
ip_finish_output2
ip6_tnl_xmit
ipip6_tunnel_xmit
ip_vs_prepare_tunneled_skb

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/ip6_output.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6c5f85f..9418802 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -278,25 +278,21 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	struct ipv6hdr *hdr;
 	u8  proto = fl6->flowi6_proto;
 	int seg_len = skb->len;
-	int hlimit = -1;
+	int delta, hlimit = -1;
 	u32 mtu;
 
 	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dst->dev);
 	if (opt)
 		head_room += opt->opt_nflen + opt->opt_flen;
 
-	if (unlikely(skb_headroom(skb) < head_room)) {
-		struct sk_buff *skb2 = skb_realloc_headroom(skb, head_room);
-		if (!skb2) {
-			IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
+	delta = head_room - skb_headroom(skb);
+	if (unlikely(delta > 0)) {
+		skb = skb_expand_head(skb, delta);
+		if (!skb) {
+			IP6_INC_STATS(net, ip6_dst_idev(dst),
 				      IPSTATS_MIB_OUTDISCARDS);
-			kfree_skb(skb);
 			return -ENOBUFS;
 		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	if (opt) {
-- 
1.8.3.1



* [PATCH IPV6 v2 3/4] ipv6: ip6_finish_output2 refactoring
       [not found]         ` <cover.1625818825.git.vvs@virtuozzo.com>
  2021-07-09  9:04           ` [PATCH IPV6 v2 1/4] ipv6: allocate enough headroom in ip6_finish_output2() Vasily Averin
  2021-07-09  9:04           ` [PATCH IPV6 v2 2/4] ipv6: use new helper skb_expand_head() in ip6_xmit() Vasily Averin
@ 2021-07-09  9:05           ` Vasily Averin
  2021-07-09  9:05           ` [PATCH IPV6 v2 4/4] ipv6: ip6_xmit refactoring Vasily Averin
  3 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-09  9:05 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Commonly used dereferences were replaced by variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/ip6_output.c | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 9418802..9ae3baa 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -83,9 +83,11 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 {
 	struct dst_entry *dst = skb_dst(skb);
 	struct net_device *dev = dst->dev;
+	struct inet6_dev *idev = ip6_dst_idev(dst);
 	unsigned int hh_len = LL_RESERVED_SPACE(dev);
 	int delta = hh_len - skb_headroom(skb);
-	const struct in6_addr *nexthop;
+	const struct in6_addr *daddr, *nexthop;
+	struct ipv6hdr *hdr;
 	struct neighbour *neigh;
 	int ret;
 
@@ -94,18 +96,17 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 		skb = skb_expand_head(skb, delta);
 
 	if (!skb) {
-		IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTDISCARDS);
+		IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
 		return -ENOMEM;
 	}
 
-	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr)) {
-		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
-
+	hdr = ipv6_hdr(skb);
+	daddr = &hdr->daddr;
+	if (ipv6_addr_is_multicast(daddr)) {
 		if (!(dev->flags & IFF_LOOPBACK) && sk_mc_loop(sk) &&
 		    ((mroute6_is_socket(net, skb) &&
 		     !(IP6CB(skb)->flags & IP6SKB_FORWARDED)) ||
-		     ipv6_chk_mcast_addr(dev, &ipv6_hdr(skb)->daddr,
-					 &ipv6_hdr(skb)->saddr))) {
+		     ipv6_chk_mcast_addr(dev, daddr, &hdr->saddr))) {
 			struct sk_buff *newskb = skb_clone(skb, GFP_ATOMIC);
 
 			/* Do not check for IFF_ALLMULTI; multicast routing
@@ -116,7 +117,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 					net, sk, newskb, NULL, newskb->dev,
 					dev_loopback_xmit);
 
-			if (ipv6_hdr(skb)->hop_limit == 0) {
+			if (hdr->hop_limit == 0) {
 				IP6_INC_STATS(net, idev,
 					      IPSTATS_MIB_OUTDISCARDS);
 				kfree_skb(skb);
@@ -125,9 +126,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 		}
 
 		IP6_UPD_PO_STATS(net, idev, IPSTATS_MIB_OUTMCAST, skb->len);
-
-		if (IPV6_ADDR_MC_SCOPE(&ipv6_hdr(skb)->daddr) <=
-		    IPV6_ADDR_SCOPE_NODELOCAL &&
+		if (IPV6_ADDR_MC_SCOPE(daddr) <= IPV6_ADDR_SCOPE_NODELOCAL &&
 		    !(dev->flags & IFF_LOOPBACK)) {
 			kfree_skb(skb);
 			return 0;
@@ -142,10 +141,10 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 	}
 
 	rcu_read_lock_bh();
-	nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
-	neigh = __ipv6_neigh_lookup_noref(dst->dev, nexthop);
+	nexthop = rt6_nexthop((struct rt6_info *)dst, daddr);
+	neigh = __ipv6_neigh_lookup_noref(dev, nexthop);
 	if (unlikely(!neigh))
-		neigh = __neigh_create(&nd_tbl, nexthop, dst->dev, false);
+		neigh = __neigh_create(&nd_tbl, nexthop, dev, false);
 	if (!IS_ERR(neigh)) {
 		sock_confirm_neigh(skb, neigh);
 		ret = neigh_output(neigh, skb, false);
@@ -154,7 +153,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 	}
 	rcu_read_unlock_bh();
 
-	IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
+	IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTNOROUTES);
 	kfree_skb(skb);
 	return -EINVAL;
 }
-- 
1.8.3.1



* [PATCH IPV6 v2 4/4] ipv6: ip6_xmit refactoring
       [not found]         ` <cover.1625818825.git.vvs@virtuozzo.com>
                             ` (2 preceding siblings ...)
  2021-07-09  9:05           ` [PATCH IPV6 v2 3/4] ipv6: ip6_finish_output2 refactoring Vasily Averin
@ 2021-07-09  9:05           ` Vasily Averin
  3 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-09  9:05 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Commonly used dereferences were replaced by variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/ip6_output.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 9ae3baa..5e33429 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -273,6 +273,8 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	const struct ipv6_pinfo *np = inet6_sk(sk);
 	struct in6_addr *first_hop = &fl6->daddr;
 	struct dst_entry *dst = skb_dst(skb);
+	struct net_device *dev = dst->dev;
+	struct inet6_dev *idev = ip6_dst_idev(dst);
 	unsigned int head_room;
 	struct ipv6hdr *hdr;
 	u8  proto = fl6->flowi6_proto;
@@ -280,7 +282,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	int delta, hlimit = -1;
 	u32 mtu;
 
-	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dst->dev);
+	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dev);
 	if (opt)
 		head_room += opt->opt_nflen + opt->opt_flen;
 
@@ -288,8 +290,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	if (unlikely(delta > 0)) {
 		skb = skb_expand_head(skb, delta);
 		if (!skb) {
-			IP6_INC_STATS(net, ip6_dst_idev(dst),
-				      IPSTATS_MIB_OUTDISCARDS);
+			IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
 			return -ENOBUFS;
 		}
 	}
@@ -333,8 +334,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 
 	mtu = dst_mtu(dst);
 	if ((skb->len <= mtu) || skb->ignore_df || skb_is_gso(skb)) {
-		IP6_UPD_PO_STATS(net, ip6_dst_idev(skb_dst(skb)),
-			      IPSTATS_MIB_OUT, skb->len);
+		IP6_UPD_PO_STATS(net, idev, IPSTATS_MIB_OUT, skb->len);
 
 		/* if egress device is enslaved to an L3 master device pass the
 		 * skb to its handler for processing
@@ -347,17 +347,17 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 		 * we promote our socket to non const
 		 */
 		return NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_OUT,
-			       net, (struct sock *)sk, skb, NULL, dst->dev,
+			       net, (struct sock *)sk, skb, NULL, dev,
 			       dst_output);
 	}
 
-	skb->dev = dst->dev;
+	skb->dev = dev;
 	/* ipv6_local_error() does not require socket lock,
 	 * we promote our socket to non const
 	 */
 	ipv6_local_error((struct sock *)sk, EMSGSIZE, fl6, mtu);
 
-	IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_FRAGFAILS);
+	IP6_INC_STATS(net, idev, IPSTATS_MIB_FRAGFAILS);
 	kfree_skb(skb);
 	return -EMSGSIZE;
 }
-- 
1.8.3.1



* Re: [PATCH IPV6 v2 1/4] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-09  9:04           ` [PATCH IPV6 v2 1/4] ipv6: allocate enough headroom in ip6_finish_output2() Vasily Averin
@ 2021-07-09 17:58             ` David Miller
  2021-07-10  2:53               ` Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: David Miller @ 2021-07-09 17:58 UTC (permalink / raw)
  To: vvs; +Cc: yoshfuji, dsahern, kuba, eric.dumazet, netdev, linux-kernel


Please do not use inline in foo.c files, let the compiler decide.

Thank you.



* Re: [PATCH IPV6 v2 1/4] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-09 17:58             ` David Miller
@ 2021-07-10  2:53               ` Vasily Averin
  0 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-10  2:53 UTC (permalink / raw)
  To: David Miller; +Cc: yoshfuji, dsahern, kuba, eric.dumazet, netdev, linux-kernel

Dear David,
I'm happy to hear from you again.

On 7/9/21 8:58 PM, David Miller wrote:
> Please do not use inline in foo.c files, let the compiler decide.

Thank you for the hint; I did not know that and will follow it next time.
This time I'm going to move this helper somewhere anyway:
either to net/core/skbuff.c as an exported function, where it will lose inline anyway,
or to include/linux/skbuff.h, where inline is (it seems?) acceptable.

Could you please help me find a better name for this helper?

I would like to change its current name: 'skb_expand_head' looks very similar
to the widely used 'pskb_expand_head' but has different semantics.
I am afraid they could be accidentally confused in the future.

Thank you,
	Vasily Averin.


* [PATCH IPV6 v3 0/1] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-09  9:04         ` [PATCH IPV6 v2 0/4] " Vasily Averin
@ 2021-07-12  6:44           ` Vasily Averin
       [not found]           ` <cover.1626069562.git.vvs@virtuozzo.com>
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-12  6:44 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Recently Syzkaller found one more issue on RHEL7-based OpenVZ kernels.
While investigating it, I found that upstream is affected too.

The TEE target sends an skb with small headroom to another interface that
requires more headroom.

ipv4 handles this problem in ip_finish_output2() and creates a new skb with enough headroom,
but ip6_finish_output2() lacks this logic.

Syzkaller created a C reproducer; it can be found in the v1 cover letter:
https://lkml.org/lkml/2021/7/7/467

v3 changes:
 I now think it is better to separate the bugfix itself from the creation of the new helper.
 The bugfix no longer creates a new inline function. Unlike v1, it creates a new skb
 only when necessary, i.e. only for a shared skb.
 On failure it updates the IPSTATS_MIB_OUTDISCARDS counter.
 The patch set with the new helper will be sent separately.

v2 changes:
 a new helper was created and used in ip6_finish_output2() and in ip6_xmit()
 small refactoring in the changed functions: commonly used dereferences were replaced by variables

Vasily Averin (1):
  ipv6: allocate enough headroom in ip6_finish_output2()

 net/ipv6/ip6_output.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

-- 
1.8.3.1



* [PATCH IPV6 v3 1/1] ipv6: allocate enough headroom in ip6_finish_output2()
       [not found]           ` <cover.1626069562.git.vvs@virtuozzo.com>
@ 2021-07-12  6:45             ` Vasily Averin
  2021-07-12 18:30               ` patchwork-bot+netdevbpf
  2021-07-13  7:46               ` Vasily Averin
  0 siblings, 2 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-12  6:45 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

When the TEE target mirrors traffic to another interface, the sk_buff may
not have enough headroom to be processed correctly.
ip_finish_output2() detects this situation for ipv4 and allocates a
new skb with enough headroom. However, ipv6 lacks this logic in
ip6_finish_output2(), and this leads to skb_under_panic:

 skbuff: skb_under_panic: text:ffffffffc0866ad4 len:96 put:24
 head:ffff97be85e31800 data:ffff97be85e317f8 tail:0x58 end:0xc0 dev:gre0
 ------------[ cut here ]------------
 kernel BUG at net/core/skbuff.c:110!
 invalid opcode: 0000 [#1] SMP PTI
 CPU: 2 PID: 393 Comm: kworker/2:2 Tainted: G           OE     5.13.0 #13
 Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
 Workqueue: ipv6_addrconf addrconf_dad_work
 RIP: 0010:skb_panic+0x48/0x4a
 Call Trace:
  skb_push.cold.111+0x10/0x10
  ipgre_header+0x24/0xf0 [ip_gre]
  neigh_connected_output+0xae/0xf0
  ip6_finish_output2+0x1a8/0x5a0
  ip6_output+0x5c/0x110
  nf_dup_ipv6+0x158/0x1000 [nf_dup_ipv6]
  tee_tg6+0x2e/0x40 [xt_TEE]
  ip6t_do_table+0x294/0x470 [ip6_tables]
  nf_hook_slow+0x44/0xc0
  nf_hook.constprop.34+0x72/0xe0
  ndisc_send_skb+0x20d/0x2e0
  ndisc_send_ns+0xd1/0x210
  addrconf_dad_work+0x3c8/0x540
  process_one_work+0x1d1/0x370
  worker_thread+0x30/0x390
  kthread+0x116/0x130
  ret_from_fork+0x22/0x30

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/ip6_output.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ff4f9eb..0efcb9b 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -60,10 +60,38 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 {
 	struct dst_entry *dst = skb_dst(skb);
 	struct net_device *dev = dst->dev;
+	unsigned int hh_len = LL_RESERVED_SPACE(dev);
+	int delta = hh_len - skb_headroom(skb);
 	const struct in6_addr *nexthop;
 	struct neighbour *neigh;
 	int ret;
 
+	/* Be paranoid, rather than too clever. */
+	if (unlikely(delta > 0) && dev->header_ops) {
+		/* pskb_expand_head() might crash, if skb is shared */
+		if (skb_shared(skb)) {
+			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
+
+			if (likely(nskb)) {
+				if (skb->sk)
+					skb_set_owner_w(skb, skb->sk);
+				consume_skb(skb);
+			} else {
+				kfree_skb(skb);
+			}
+			skb = nskb;
+		}
+		if (skb &&
+		    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
+			kfree_skb(skb);
+			skb = NULL;
+		}
+		if (!skb) {
+			IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTDISCARDS);
+			return -ENOMEM;
+		}
+	}
+
 	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr)) {
 		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET 0/7] skbuff: introduce pskb_realloc_headroom()
  2021-07-09  9:04         ` [PATCH IPV6 v2 0/4] " Vasily Averin
  2021-07-12  6:44           ` [PATCH IPV6 v3 0/1] " Vasily Averin
       [not found]           ` <cover.1626069562.git.vvs@virtuozzo.com>
@ 2021-07-12 13:26           ` Vasily Averin
       [not found]           ` <cover.1626093470.git.vvs@virtuozzo.com>
  3 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-12 13:26 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Joerg Reuter, Ralf Baechle, linux-hams,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, bpf,
	linux-kernel

Currently, if an skb does not have enough headroom, skb_realloc_headroom() is called.
This is not optimal because it creates a new skb.

Unlike skb_realloc_headroom(), the new helper pskb_realloc_headroom()
avoids allocating a new skb when possible,
copies skb->sk to the new skb when one is needed,
and frees the original skb on failure.
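
The ownership contract described above can be sketched as a small userspace
model. Everything below is hypothetical (struct toy_pkt and the toy_* helpers
are not kernel APIs), and the shared case is simplified to a plain private
copy. The point is only the calling convention: the caller's reference is
either returned or released, so the caller never frees anything on failure.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, much-simplified packet buffer; not the kernel's sk_buff. */
struct toy_pkt {
	unsigned char *head;	/* start of the allocation */
	size_t headroom;	/* free bytes before the payload */
	size_t len;		/* payload length */
	int users;		/* reference count */
};

static struct toy_pkt *toy_alloc(size_t headroom, size_t len)
{
	struct toy_pkt *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->head = calloc(1, headroom + len);
	if (!p->head) {
		free(p);
		return NULL;
	}
	p->headroom = headroom;
	p->len = len;
	p->users = 1;
	return p;
}

/* Drop one reference; release the packet when the last one goes away. */
static void toy_put(struct toy_pkt *p)
{
	assert(p->users > 0);
	if (--p->users == 0) {
		free(p->head);
		free(p);
	}
}

/* Ensure at least `headroom` free bytes in front of the payload.
 * The caller's reference is always consumed: it is either handed back
 * through the return value or dropped before NULL is returned.
 */
static struct toy_pkt *toy_expand_head(struct toy_pkt *p, size_t headroom)
{
	unsigned char *nhead;

	if (headroom <= p->headroom)
		return p;	/* already large enough */

	if (p->users > 1) {
		/* Shared: other holders must not see a change, so copy. */
		struct toy_pkt *n = toy_alloc(headroom, p->len);

		if (n)
			memcpy(n->head + headroom, p->head + p->headroom, p->len);
		toy_put(p);	/* give up our reference in both cases */
		return n;
	}

	/* Private: grow the data area, keep the same packet object. */
	nhead = malloc(headroom + p->len);
	if (!nhead) {
		toy_put(p);
		return NULL;
	}
	memcpy(nhead + headroom, p->head + p->headroom, p->len);
	free(p->head);
	p->head = nhead;
	p->headroom = headroom;
	return p;
}
```

A caller then only has to check the result for NULL, which is the shape the
converted call sites in this series take.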

This helps to simplify ip[6]_finish_output2(), ip6_xmit() and a few other
functions in vrf, ax25 and bpf.

There are a few other cases where this helper can be used, but they require
additional investigation.

NB: patch "ipv6: use pskb_realloc_headroom in ip6_finish_output2" depends on 
patch "ipv6: allocate enough headroom in ip6_finish_output2()" submitted separately
https://lkml.org/lkml/2021/7/12/732

Vasily Averin (7):
  skbuff: introduce pskb_realloc_headroom()
  ipv6: use pskb_realloc_headroom in ip6_finish_output2
  ipv6: use pskb_realloc_headroom in ip6_xmit
  ipv4: use pskb_realloc_headroom in ip_finish_output2
  vrf: use pskb_realloc_headroom in vrf_finish_output
  ax25: use pskb_realloc_headroom
  bpf: use pskb_realloc_headroom in bpf_out_neigh_v4/6

 drivers/net/vrf.c      | 14 +++------
 include/linux/skbuff.h |  2 ++
 net/ax25/ax25_out.c    | 13 +++------
 net/ax25/ax25_route.c  | 13 +++------
 net/core/filter.c      | 22 +++-----------
 net/core/skbuff.c      | 41 ++++++++++++++++++++++++++
 net/ipv4/ip_output.c   | 12 ++------
 net/ipv6/ip6_output.c  | 78 ++++++++++++++++++--------------------------------
 8 files changed, 89 insertions(+), 106 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET 1/7] skbuff: introduce pskb_realloc_headroom()
       [not found]           ` <cover.1626093470.git.vvs@virtuozzo.com>
@ 2021-07-12 13:26             ` Vasily Averin
  2021-07-12 17:53               ` Jakub Kicinski
  2021-07-12 13:26             ` [PATCH NET 2/7] ipv6: use pskb_realloc_headroom in ip6_finish_output2 Vasily Averin
                               ` (5 subsequent siblings)
  6 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-07-12 13:26 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Joerg Reuter, Ralf Baechle, linux-hams,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, bpf,
	linux-kernel

Unlike skb_realloc_headroom(), the new helper avoids allocating a new skb
when possible, copies skb->sk to the new skb when one is needed, and frees
the original skb on failure.

This helps to simplify ip[6]_finish_output2() and a few other similar cases.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 include/linux/skbuff.h |  2 ++
 net/core/skbuff.c      | 41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index dbf820a..381a219 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1174,6 +1174,8 @@ static inline struct sk_buff *__pskb_copy(struct sk_buff *skb, int headroom,
 int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, gfp_t gfp_mask);
 struct sk_buff *skb_realloc_headroom(struct sk_buff *skb,
 				     unsigned int headroom);
+struct sk_buff *pskb_realloc_headroom(struct sk_buff *skb,
+				      unsigned int headroom);
 struct sk_buff *skb_copy_expand(const struct sk_buff *skb, int newheadroom,
 				int newtailroom, gfp_t priority);
 int __must_check skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist *sg,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bbc3b4b..13cbe98 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1769,6 +1769,47 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
 EXPORT_SYMBOL(skb_realloc_headroom);
 
 /**
+ *	pskb_realloc_headroom - reallocate header of &sk_buff
+ *	@skb: buffer to reallocate
+ *	@headroom: needed headroom
+ *
+ *	Unlike skb_realloc_headroom, this one does not allocate a new skb
+ *	if possible; copies skb->sk to new skb as needed
+ *	and frees original skb in case of failures.
+ *
+ *	It expects increased headroom, and generates warning otherwise.
+ */
+
+struct sk_buff *pskb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
+{
+	int delta = headroom - skb_headroom(skb);
+
+	if (WARN_ONCE(delta <= 0, "%s expect positive delta", __func__))
+		return skb;
+
+	/* pskb_expand_head() might crash, if skb is shared */
+	if (skb_shared(skb)) {
+		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
+
+		if (likely(nskb)) {
+			if (skb->sk)
+				skb_set_owner_w(nskb, skb->sk);
+			consume_skb(skb);
+		} else {
+			kfree_skb(skb);
+		}
+		skb = nskb;
+	}
+	if (skb &&
+	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
+		kfree_skb(skb);
+		skb = NULL;
+	}
+	return skb;
+}
+EXPORT_SYMBOL(pskb_realloc_headroom);
+
+/**
  *	skb_copy_expand	-	copy and expand sk_buff
  *	@skb: buffer to copy
  *	@newheadroom: new free bytes at head
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET 2/7] ipv6: use pskb_realloc_headroom in ip6_finish_output2
       [not found]           ` <cover.1626093470.git.vvs@virtuozzo.com>
  2021-07-12 13:26             ` [PATCH NET 1/7] " Vasily Averin
@ 2021-07-12 13:26             ` Vasily Averin
  2021-07-12 13:26             ` [PATCH NET 3/7] ipv6: use pskb_realloc_headroom in ip6_xmit refactoring Vasily Averin
                               ` (4 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-12 13:26 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Unlike skb_realloc_headroom, new helper pskb_realloc_headroom
does not allocate a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
NB: this patch depends on
"ipv6: allocate enough headroom in ip6_finish_output2()"
https://lkml.org/lkml/2021/7/12/732

 net/ipv6/ip6_output.c | 50 ++++++++++++++++----------------------------------
 1 file changed, 16 insertions(+), 34 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 0efcb9b..2054da3 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -60,46 +60,30 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 {
 	struct dst_entry *dst = skb_dst(skb);
 	struct net_device *dev = dst->dev;
+	struct inet6_dev *idev = ip6_dst_idev(dst);
 	unsigned int hh_len = LL_RESERVED_SPACE(dev);
-	int delta = hh_len - skb_headroom(skb);
-	const struct in6_addr *nexthop;
+	const struct in6_addr *daddr, *nexthop;
+	struct ipv6hdr *hdr;
 	struct neighbour *neigh;
 	int ret;
 
 	/* Be paranoid, rather than too clever. */
-	if (unlikely(delta > 0) && dev->header_ops) {
-		/* pskb_expand_head() might crash, if skb is shared */
-		if (skb_shared(skb)) {
-			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
+	if (unlikely(hh_len > skb_headroom(skb)) && dev->header_ops) {
+		skb = pskb_realloc_headroom(skb, hh_len);
 
-			if (likely(nskb)) {
-				if (skb->sk)
-					skb_set_owner_w(skb, skb->sk);
-				consume_skb(skb);
-			} else {
-				kfree_skb(skb);
-			}
-			skb = nskb;
-		}
-		if (skb &&
-		    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
-			kfree_skb(skb);
-			skb = NULL;
-		}
 		if (!skb) {
-			IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTDISCARDS);
+			IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
 			return -ENOMEM;
 		}
 	}
 
-	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr)) {
-		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
-
+	hdr = ipv6_hdr(skb);
+	daddr = &hdr->daddr;
+	if (ipv6_addr_is_multicast(daddr)) {
 		if (!(dev->flags & IFF_LOOPBACK) && sk_mc_loop(sk) &&
 		    ((mroute6_is_socket(net, skb) &&
 		     !(IP6CB(skb)->flags & IP6SKB_FORWARDED)) ||
-		     ipv6_chk_mcast_addr(dev, &ipv6_hdr(skb)->daddr,
-					 &ipv6_hdr(skb)->saddr))) {
+		     ipv6_chk_mcast_addr(dev, daddr, &hdr->saddr))) {
 			struct sk_buff *newskb = skb_clone(skb, GFP_ATOMIC);
 
 			/* Do not check for IFF_ALLMULTI; multicast routing
@@ -110,7 +94,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 					net, sk, newskb, NULL, newskb->dev,
 					dev_loopback_xmit);
 
-			if (ipv6_hdr(skb)->hop_limit == 0) {
+			if (hdr->hop_limit == 0) {
 				IP6_INC_STATS(net, idev,
 					      IPSTATS_MIB_OUTDISCARDS);
 				kfree_skb(skb);
@@ -119,9 +103,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 		}
 
 		IP6_UPD_PO_STATS(net, idev, IPSTATS_MIB_OUTMCAST, skb->len);
-
-		if (IPV6_ADDR_MC_SCOPE(&ipv6_hdr(skb)->daddr) <=
-		    IPV6_ADDR_SCOPE_NODELOCAL &&
+		if (IPV6_ADDR_MC_SCOPE(daddr) <= IPV6_ADDR_SCOPE_NODELOCAL &&
 		    !(dev->flags & IFF_LOOPBACK)) {
 			kfree_skb(skb);
 			return 0;
@@ -136,10 +118,10 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 	}
 
 	rcu_read_lock_bh();
-	nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
-	neigh = __ipv6_neigh_lookup_noref(dst->dev, nexthop);
+	nexthop = rt6_nexthop((struct rt6_info *)dst, daddr);
+	neigh = __ipv6_neigh_lookup_noref(dev, nexthop);
 	if (unlikely(!neigh))
-		neigh = __neigh_create(&nd_tbl, nexthop, dst->dev, false);
+		neigh = __neigh_create(&nd_tbl, nexthop, dev, false);
 	if (!IS_ERR(neigh)) {
 		sock_confirm_neigh(skb, neigh);
 		ret = neigh_output(neigh, skb, false);
@@ -148,7 +130,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 	}
 	rcu_read_unlock_bh();
 
-	IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
+	IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTNOROUTES);
 	kfree_skb(skb);
 	return -EINVAL;
 }
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET 3/7] ipv6: use pskb_realloc_headroom in ip6_xmit refactoring
       [not found]           ` <cover.1626093470.git.vvs@virtuozzo.com>
  2021-07-12 13:26             ` [PATCH NET 1/7] " Vasily Averin
  2021-07-12 13:26             ` [PATCH NET 2/7] ipv6: use pskb_realloc_headroom in ip6_finish_output2 Vasily Averin
@ 2021-07-12 13:26             ` Vasily Averin
  2021-07-12 13:27             ` [PATCH NET 4/7] ipv4: use pskb_realloc_headroom in ip_finish_output2 Vasily Averin
                               ` (3 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-12 13:26 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Unlike skb_realloc_headroom, new helper pskb_realloc_headroom
does not allocate a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/ip6_output.c | 28 ++++++++++++----------------
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 2054da3..052723c 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -250,6 +250,8 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	const struct ipv6_pinfo *np = inet6_sk(sk);
 	struct in6_addr *first_hop = &fl6->daddr;
 	struct dst_entry *dst = skb_dst(skb);
+	struct net_device *dev = dst->dev;
+	struct inet6_dev *idev = ip6_dst_idev(dst);
 	unsigned int head_room;
 	struct ipv6hdr *hdr;
 	u8  proto = fl6->flowi6_proto;
@@ -257,22 +259,17 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	int hlimit = -1;
 	u32 mtu;
 
-	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dst->dev);
+	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dev);
 	if (opt)
 		head_room += opt->opt_nflen + opt->opt_flen;
 
-	if (unlikely(skb_headroom(skb) < head_room)) {
-		struct sk_buff *skb2 = skb_realloc_headroom(skb, head_room);
-		if (!skb2) {
-			IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-				      IPSTATS_MIB_OUTDISCARDS);
-			kfree_skb(skb);
+	if (unlikely(head_room > skb_headroom(skb))) {
+		skb = pskb_realloc_headroom(skb, head_room);
+
+		if (!skb) {
+			IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
 			return -ENOBUFS;
 		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	if (opt) {
@@ -314,8 +311,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 
 	mtu = dst_mtu(dst);
 	if ((skb->len <= mtu) || skb->ignore_df || skb_is_gso(skb)) {
-		IP6_UPD_PO_STATS(net, ip6_dst_idev(skb_dst(skb)),
-			      IPSTATS_MIB_OUT, skb->len);
+		IP6_UPD_PO_STATS(net, idev, IPSTATS_MIB_OUT, skb->len);
 
 		/* if egress device is enslaved to an L3 master device pass the
 		 * skb to its handler for processing
@@ -328,17 +324,17 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 		 * we promote our socket to non const
 		 */
 		return NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_OUT,
-			       net, (struct sock *)sk, skb, NULL, dst->dev,
+			       net, (struct sock *)sk, skb, NULL, dev,
 			       dst_output);
 	}
 
-	skb->dev = dst->dev;
+	skb->dev = dev;
 	/* ipv6_local_error() does not require socket lock,
 	 * we promote our socket to non const
 	 */
 	ipv6_local_error((struct sock *)sk, EMSGSIZE, fl6, mtu);
 
-	IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_FRAGFAILS);
+	IP6_INC_STATS(net, idev, IPSTATS_MIB_FRAGFAILS);
 	kfree_skb(skb);
 	return -EMSGSIZE;
 }
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET 4/7] ipv4: use pskb_realloc_headroom in ip_finish_output2
       [not found]           ` <cover.1626093470.git.vvs@virtuozzo.com>
                               ` (2 preceding siblings ...)
  2021-07-12 13:26             ` [PATCH NET 3/7] ipv6: use pskb_realloc_headroom in ip6_xmit refactoring Vasily Averin
@ 2021-07-12 13:27             ` Vasily Averin
  2021-07-12 13:27             ` [PATCH NET 5/7] vrf: use pskb_realloc_headroom in vrf_finish_output Vasily Averin
                               ` (2 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-12 13:27 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Unlike skb_realloc_headroom, new helper pskb_realloc_headroom
does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv4/ip_output.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c3efc7d..0f66483 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -198,19 +198,11 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
 	} else if (rt->rt_type == RTN_BROADCAST)
 		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTBCAST, skb->len);
 
-	/* Be paranoid, rather than too clever. */
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
+		skb = pskb_realloc_headroom(skb, hh_len);
 
-		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
-		if (!skb2) {
-			kfree_skb(skb);
+		if (!skb)
 			return -ENOMEM;
-		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET 5/7] vrf: use pskb_realloc_headroom in vrf_finish_output
       [not found]           ` <cover.1626093470.git.vvs@virtuozzo.com>
                               ` (3 preceding siblings ...)
  2021-07-12 13:27             ` [PATCH NET 4/7] ipv4: use pskb_realloc_headroom in ip_finish_output2 Vasily Averin
@ 2021-07-12 13:27             ` Vasily Averin
  2021-07-12 13:27             ` [PATCH NET 6/7] ax25: use pskb_realloc_headroom Vasily Averin
  2021-07-12 13:27             ` [PATCH NET 7/7] bpf: use pskb_realloc_headroom in bpf_out_neigh_v4/6 Vasily Averin
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-12 13:27 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Unlike skb_realloc_headroom, new helper pskb_realloc_headroom
does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 drivers/net/vrf.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 28a6c4c..74b9538 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -863,18 +863,12 @@ static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *s
 
 	/* Be paranoid, rather than too clever. */
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
+		skb = pskb_realloc_headroom(skb, hh_len);
 
-		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
-		if (!skb2) {
-			ret = -ENOMEM;
-			goto err;
+		if (!skb) {
+			dev->stats.tx_errors++;
+			return -ENOMEM;
 		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	rcu_read_lock_bh();
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET 6/7] ax25: use pskb_realloc_headroom
       [not found]           ` <cover.1626093470.git.vvs@virtuozzo.com>
                               ` (4 preceding siblings ...)
  2021-07-12 13:27             ` [PATCH NET 5/7] vrf: use pskb_realloc_headroom in vrf_finish_output Vasily Averin
@ 2021-07-12 13:27             ` Vasily Averin
  2021-07-12 13:27             ` [PATCH NET 7/7] bpf: use pskb_realloc_headroom in bpf_out_neigh_v4/6 Vasily Averin
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-12 13:27 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Joerg Reuter, Ralf Baechle, linux-hams, linux-kernel

Use pskb_realloc_headroom() in ax25_transmit_buffer and ax25_rt_build_path.
Unlike skb_realloc_headroom, new helper pskb_realloc_headroom
does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ax25/ax25_out.c   | 13 ++++---------
 net/ax25/ax25_route.c | 13 ++++---------
 2 files changed, 8 insertions(+), 18 deletions(-)

diff --git a/net/ax25/ax25_out.c b/net/ax25/ax25_out.c
index f53751b..1f1e581 100644
--- a/net/ax25/ax25_out.c
+++ b/net/ax25/ax25_out.c
@@ -336,18 +336,13 @@ void ax25_transmit_buffer(ax25_cb *ax25, struct sk_buff *skb, int type)
 
 	headroom = ax25_addr_size(ax25->digipeat);
 
-	if (skb_headroom(skb) < headroom) {
-		if ((skbn = skb_realloc_headroom(skb, headroom)) == NULL) {
+	if (unlikely(skb_headroom(skb) < headroom)) {
+		skb = pskb_realloc_headroom(skb, headroom);
+
+		if (!skb) {
 			printk(KERN_CRIT "AX.25: ax25_transmit_buffer - out of memory\n");
-			kfree_skb(skb);
 			return;
 		}
-
-		if (skb->sk != NULL)
-			skb_set_owner_w(skbn, skb->sk);
-
-		consume_skb(skb);
-		skb = skbn;
 	}
 
 	ptr = skb_push(skb, headroom);
diff --git a/net/ax25/ax25_route.c b/net/ax25/ax25_route.c
index b40e0bc..8f54547 100644
--- a/net/ax25/ax25_route.c
+++ b/net/ax25/ax25_route.c
@@ -447,18 +447,13 @@ struct sk_buff *ax25_rt_build_path(struct sk_buff *skb, ax25_address *src,
 
 	len = digi->ndigi * AX25_ADDR_LEN;
 
-	if (skb_headroom(skb) < len) {
-		if ((skbn = skb_realloc_headroom(skb, len)) == NULL) {
+	if (unlikely(skb_headroom(skb) < len)) {
+		skb = pskb_realloc_headroom(skb, len);
+
+		if (!skb) {
 			printk(KERN_CRIT "AX.25: ax25_dg_build_path - out of memory\n");
 			return NULL;
 		}
-
-		if (skb->sk != NULL)
-			skb_set_owner_w(skbn, skb->sk);
-
-		consume_skb(skb);
-
-		skb = skbn;
 	}
 
 	bp = skb_push(skb, len);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET 7/7] bpf: use pskb_realloc_headroom in bpf_out_neigh_v4/6
       [not found]           ` <cover.1626093470.git.vvs@virtuozzo.com>
                               ` (5 preceding siblings ...)
  2021-07-12 13:27             ` [PATCH NET 6/7] ax25: use pskb_realloc_headroom Vasily Averin
@ 2021-07-12 13:27             ` Vasily Averin
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-12 13:27 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, bpf,
	linux-kernel

Unlike skb_realloc_headroom, new helper pskb_realloc_headroom
does not allocate a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/filter.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 65ab4e2..cf6cd93 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2179,17 +2179,10 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
 	skb->tstamp = 0;
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
+		skb = pskb_realloc_headroom(skb, hh_len);
 
-		skb2 = skb_realloc_headroom(skb, hh_len);
-		if (unlikely(!skb2)) {
-			kfree_skb(skb);
+		if (!skb)
 			return -ENOMEM;
-		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	rcu_read_lock_bh();
@@ -2286,17 +2279,10 @@ static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb,
 	skb->tstamp = 0;
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
+		skb = pskb_realloc_headroom(skb, hh_len);
 
-		skb2 = skb_realloc_headroom(skb, hh_len);
-		if (unlikely(!skb2)) {
-			kfree_skb(skb);
+		if (!skb)
 			return -ENOMEM;
-		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	rcu_read_lock_bh();
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET 1/7] skbuff: introduce pskb_realloc_headroom()
  2021-07-12 13:26             ` [PATCH NET 1/7] " Vasily Averin
@ 2021-07-12 17:53               ` Jakub Kicinski
  2021-07-12 18:45                 ` Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Jakub Kicinski @ 2021-07-12 17:53 UTC (permalink / raw)
  To: Vasily Averin
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Eric Dumazet,
	netdev, Joerg Reuter, Ralf Baechle, linux-hams,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, bpf,
	linux-kernel

On Mon, 12 Jul 2021 16:26:44 +0300 Vasily Averin wrote:
>  /**
> + *	pskb_realloc_headroom - reallocate header of &sk_buff
> + *	@skb: buffer to reallocate
> + *	@headroom: needed headroom
> + *
> + *	Unlike skb_realloc_headroom, this one does not allocate a new skb
> + *	if possible; copies skb->sk to new skb as needed
> + *	and frees original scb in case of failures.
> + *
> + *	It expect increased headroom, and generates warning otherwise.
> + */
> +
> +struct sk_buff *pskb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)

I saw you asked about naming in a different sub-thread, what do you
mean by "'pskb_expand_head' have different semantic"? AFAIU the 'p'
in pskb stands for "private", meaning not shared. In fact
skb_realloc_headroom() should really be pskb... but it predates the 
'pskb' naming pattern by quite a while. Long story short
skb_expand_head() seems like a good name. With the current patch
pskb_realloc_headroom() vs skb_realloc_headroom() would give people
exactly the opposite intuition of what the code does.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH IPV6 v3 1/1] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-12  6:45             ` [PATCH IPV6 v3 1/1] " Vasily Averin
@ 2021-07-12 18:30               ` patchwork-bot+netdevbpf
  2021-07-13  7:46               ` Vasily Averin
  1 sibling, 0 replies; 106+ messages in thread
From: patchwork-bot+netdevbpf @ 2021-07-12 18:30 UTC (permalink / raw)
  To: Vasily Averin
  Cc: davem, yoshfuji, dsahern, kuba, eric.dumazet, netdev, linux-kernel

Hello:

This patch was applied to netdev/net.git (refs/heads/master):

On Mon, 12 Jul 2021 09:45:06 +0300 you wrote:
> When TEE target mirrors traffic to another interface, sk_buff may
> not have enough headroom to be processed correctly.
> ip_finish_output2() detect this situation for ipv4 and allocates
> new skb with enogh headroom. However ipv6 lacks this logic in
> ip_finish_output2 and it leads to skb_under_panic:
> 
>  skbuff: skb_under_panic: text:ffffffffc0866ad4 len:96 put:24
>  head:ffff97be85e31800 data:ffff97be85e317f8 tail:0x58 end:0xc0 dev:gre0
> 
> [...]

Here is the summary with links:
  - [IPV6,v3,1/1] ipv6: allocate enough headroom in ip6_finish_output2()
    https://git.kernel.org/netdev/net/c/5796015fa968

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET 1/7] skbuff: introduce pskb_realloc_headroom()
  2021-07-12 17:53               ` Jakub Kicinski
@ 2021-07-12 18:45                 ` Vasily Averin
  2021-07-13 20:57                   ` [PATCH NET v2 0/7] skbuff: introduce skb_expand_head() Vasily Averin
       [not found]                   ` <cover.1626206993.git.vvs@virtuozzo.com>
  0 siblings, 2 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-12 18:45 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Eric Dumazet,
	netdev, Joerg Reuter, Ralf Baechle, linux-hams,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, bpf,
	linux-kernel

On 7/12/21 8:53 PM, Jakub Kicinski wrote:
> I saw you asked about naming in a different sub-thread, what do you
> mean by "'pskb_expand_head' have different semantic"? AFAIU the 'p'
> in pskb stands for "private", meaning not shared. In fact
> skb_realloc_headroom() should really be pskb... but it predates the 
> 'pskb' naming pattern by quite a while. Long story short
> skb_expand_head() seems like a good name. With the current patch
> pskb_realloc_headroom() vs skb_realloc_headroom() would give people
> exactly the opposite intuition of what the code does.

Thank you for the feedback.
I'll change the helper name back to skb_expand_head() in the next patch version.

	Vasily Averin

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH IPV6 v3 1/1] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-12  6:45             ` [PATCH IPV6 v3 1/1] " Vasily Averin
  2021-07-12 18:30               ` patchwork-bot+netdevbpf
@ 2021-07-13  7:46               ` Vasily Averin
  2021-07-13 12:01                 ` [PATCH NET v4 0/1] " Vasily Averin
                                   ` (2 more replies)
  1 sibling, 3 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-13  7:46 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

I've found 2 problems in this patch
and I'm going to send a new patch version soon.

On 7/12/21 9:45 AM, Vasily Averin wrote:
> index ff4f9eb..0efcb9b 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -60,10 +60,38 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
>  {
>  	struct dst_entry *dst = skb_dst(skb);
>  	struct net_device *dev = dst->dev;
> +	unsigned int hh_len = LL_RESERVED_SPACE(dev);
> +	int delta = hh_len - skb_headroom(skb);
>  	const struct in6_addr *nexthop;
>  	struct neighbour *neigh;
>  	int ret;
>  
> +	/* Be paranoid, rather than too clever. */
> +	if (unlikely(delta > 0) && dev->header_ops) {
> +		/* pskb_expand_head() might crash, if skb is shared */
> +		if (skb_shared(skb)) {
> +			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
> +
> +			if (likely(nskb)) {
> +				if (skb->sk)
> +					skb_set_owner_w(skb, skb->sk);

The owner needs to be set on nskb here, not on skb.

> +				consume_skb(skb);
> +			} else {
> +				kfree_skb(skb);

It is quite strange to call consume_skb() in one case and kfree_skb() in the other.
We know that the original skb was shared, so we should not call kfree_skb() here.

Btw, I've noticed a similar problem in a few other places:
pptp_xmit, pvc_xmit and ip_vs_prepare_tunneled_skb
call consume_skb() on success and kfree_skb() on the error path.
It looks like a potential bug to me.

> +			}
> +			skb = nskb;
> +		}
> +		if (skb &&
> +		    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
> +			kfree_skb(skb);
> +			skb = NULL;
> +		}
> +		if (!skb) {
> +			IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTDISCARDS);
> +			return -ENOMEM;
> +		}
> +	}
> +
>  	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr)) {
>  		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
>  
> 


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v4 0/1] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-13  7:46               ` Vasily Averin
@ 2021-07-13 12:01                 ` Vasily Averin
       [not found]                 ` <cover.1626177047.git.vvs@virtuozzo.com>
  2021-07-13 12:31                 ` [PATCH IPV6 v3 1/1] ipv6: allocate enough headroom in ip6_finish_output2() Vasily Averin
  2 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-13 12:01 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Recently Syzkaller found one more issue on RHEL7-based OpenVz kernels.
While investigating it, I found that upstream is affected too.

The TEE target sends an skb with small headroom to another interface that requires
more headroom.

ipv4 handles this problem in ip_finish_output2() and creates a new skb with enough headroom,
but ip6_finish_output2() lacks this logic.
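
To make the failure concrete, here is a small userspace sketch of the headroom
arithmetic. It is not kernel code: struct buf_model, model_headroom() and
model_delta() are hypothetical names used only for illustration. The numbers
are taken from the skb_under_panic report in the patch description, where a
24-byte push (put:24) left data 8 bytes before head, so the skb must have had
only 16 bytes of headroom.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of an skb's linear buffer; not a kernel type. */
struct buf_model {
	size_t head;	/* address of the start of the buffer */
	size_t data;	/* address of the start of the payload, data >= head */
};

/* Free bytes in front of the payload, analogous to skb_headroom(). */
static size_t model_headroom(const struct buf_model *b)
{
	assert(b->data >= b->head);
	return b->data - b->head;
}

/*
 * Extra bytes required before 'need' bytes can be pushed in front of the
 * payload. A result <= 0 means the push already fits.
 */
static long model_delta(const struct buf_model *b, size_t need)
{
	return (long)need - (long)model_headroom(b);
}
```

With head = 0x1800 and data = 0x1810, model_headroom() is 16 and
model_delta(&b, 24) is 8: the push needs 8 bytes more than are available,
which matches data ending up at ...17f8 while head is ...1800 in the report.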

Syzkaller created a C reproducer; it can be found in the v1 cover letter:
https://lkml.org/lkml/2021/7/7/467

v4 changes:
 fixed skb_set_owner_w() call: it should set sk on new nskb

v3 changes:
 I now think it is better to separate the bugfix itself from the creation of the new helper.
 The bugfix no longer creates a new inline function. Unlike v1, it creates a new skb
 only when necessary, i.e. only for a shared skb.
 In case of failure it updates the IPSTATS_MIB_OUTDISCARDS counter.
 The patch set with the new helper will be sent separately.

v2 changes: 
 a new helper was created and used in ip6_finish_output2() and in ip6_xmit()
 small refactoring in the changed functions: commonly used dereferences were replaced by variables


Vasily Averin (1):
  ipv6: allocate enough headroom in ip6_finish_output2()

 net/ipv6/ip6_output.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v4 1/1] ipv6: allocate enough headroom in ip6_finish_output2()
       [not found]                 ` <cover.1626177047.git.vvs@virtuozzo.com>
@ 2021-07-13 12:01                   ` Vasily Averin
  2021-07-18 10:44                     ` Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-07-13 12:01 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

When TEE target mirrors traffic to another interface, sk_buff may
not have enough headroom to be processed correctly.
ip_finish_output2() detects this situation for ipv4 and allocates
a new skb with enough headroom. However, ipv6 lacks this logic in
ip6_finish_output2() and it leads to skb_under_panic:

 skbuff: skb_under_panic: text:ffffffffc0866ad4 len:96 put:24
 head:ffff97be85e31800 data:ffff97be85e317f8 tail:0x58 end:0xc0 dev:gre0
 ------------[ cut here ]------------
 kernel BUG at net/core/skbuff.c:110!
 invalid opcode: 0000 [#1] SMP PTI
 CPU: 2 PID: 393 Comm: kworker/2:2 Tainted: G           OE     5.13.0 #13
 Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
 Workqueue: ipv6_addrconf addrconf_dad_work
 RIP: 0010:skb_panic+0x48/0x4a
 Call Trace:
  skb_push.cold.111+0x10/0x10
  ipgre_header+0x24/0xf0 [ip_gre]
  neigh_connected_output+0xae/0xf0
  ip6_finish_output2+0x1a8/0x5a0
  ip6_output+0x5c/0x110
  nf_dup_ipv6+0x158/0x1000 [nf_dup_ipv6]
  tee_tg6+0x2e/0x40 [xt_TEE]
  ip6t_do_table+0x294/0x470 [ip6_tables]
  nf_hook_slow+0x44/0xc0
  nf_hook.constprop.34+0x72/0xe0
  ndisc_send_skb+0x20d/0x2e0
  ndisc_send_ns+0xd1/0x210
  addrconf_dad_work+0x3c8/0x540
  process_one_work+0x1d1/0x370
  worker_thread+0x30/0x390
  kthread+0x116/0x130
  ret_from_fork+0x22/0x30

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/ip6_output.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ff4f9eb..25144c7 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -60,10 +60,38 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 {
 	struct dst_entry *dst = skb_dst(skb);
 	struct net_device *dev = dst->dev;
+	unsigned int hh_len = LL_RESERVED_SPACE(dev);
+	int delta = hh_len - skb_headroom(skb);
 	const struct in6_addr *nexthop;
 	struct neighbour *neigh;
 	int ret;
 
+	/* Be paranoid, rather than too clever. */
+	if (unlikely(delta > 0) && dev->header_ops) {
+		/* pskb_expand_head() might crash, if skb is shared */
+		if (skb_shared(skb)) {
+			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
+
+			if (likely(nskb)) {
+				if (skb->sk)
+					skb_set_owner_w(nskb, skb->sk);
+				consume_skb(skb);
+			} else {
+				kfree_skb(skb);
+			}
+			skb = nskb;
+		}
+		if (skb &&
+		    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
+			kfree_skb(skb);
+			skb = NULL;
+		}
+		if (!skb) {
+			IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTDISCARDS);
+			return -ENOMEM;
+		}
+	}
+
 	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr)) {
 		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH IPV6 v3 1/1] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-13  7:46               ` Vasily Averin
  2021-07-13 12:01                 ` [PATCH NET v4 0/1] " Vasily Averin
       [not found]                 ` <cover.1626177047.git.vvs@virtuozzo.com>
@ 2021-07-13 12:31                 ` Vasily Averin
  2 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-13 12:31 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

On 7/13/21 10:46 AM, Vasily Averin wrote:
>> +			if (likely(nskb)) {
>> +				if (skb->sk)
>> +					skb_set_owner_w(skb, skb->sk);
> 
> need to assign sk not to skb but to nskb 
> 
>> +				consume_skb(skb);
>> +			} else {
>> +				kfree_skb(skb);

Please disregard this, I was wrong here.
> It is quite strange to call consume_skb() on one case and kfree_skb() in another one.
> We know that original skb was shared so we should not call kfree_skb here.
> 
> Btw I've noticed similar problem in few other cases:
> in pptp_xmit, pvc_xmit, ip_vs_prepare_tunneled_skb
> they call consume_skb() in case of success and kfree_skb on error path.
> It looks like potential bug for me.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v2 0/7] skbuff: introduce skb_expand_head()
  2021-07-12 18:45                 ` Vasily Averin
@ 2021-07-13 20:57                   ` Vasily Averin
  2021-08-02  8:52                     ` [PATCH NET v3 " Vasily Averin
       [not found]                     ` <cover.1627891754.git.vvs@virtuozzo.com>
       [not found]                   ` <cover.1626206993.git.vvs@virtuozzo.com>
  1 sibling, 2 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-13 20:57 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Joerg Reuter, Ralf Baechle, linux-hams,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, bpf,
	linux-kernel

Currently, if an skb does not have enough headroom, skb_realloc_headroom() is called.
This is not optimal because it creates a new skb.

Unlike skb_realloc_headroom(), the new helper skb_expand_head()
does not allocate a new skb if possible,
copies skb->sk to the new skb when needed,
and frees the original skb in case of failure.

This helps to simplify ip[6]_finish_output2(), ip6_xmit() and a few other
functions in vrf, ax25 and bpf.

There are a few other cases where this helper could be used, but they require
additional investigation.
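
The calling convention described above can be sketched in userspace as
follows. This is only a model under stated assumptions: struct pkt,
pkt_alloc(), pkt_free() and pkt_expand_head() are hypothetical names, the
shared-skb clone path and the skb->sk ownership transfer are left out, and
malloc() stands in for the kernel allocator. What it does show is the part
callers rely on: on success the same object comes back with more headroom,
and on failure the original has already been freed and NULL is returned.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for an skb; not the kernel structure. */
struct pkt {
	unsigned char *head;	/* start of the allocation */
	size_t headroom;	/* free bytes in front of the payload */
	size_t len;		/* payload length */
};

static struct pkt *pkt_alloc(size_t headroom, size_t len)
{
	struct pkt *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->head = calloc(1, headroom + len);
	if (!p->head) {
		free(p);
		return NULL;
	}
	p->headroom = headroom;
	p->len = len;
	return p;
}

static void pkt_free(struct pkt *p)
{
	if (p) {
		free(p->head);
		free(p);
	}
}

/*
 * Grow the headroom of p to 'headroom' bytes, keeping the payload.
 * On failure p is freed and NULL is returned, so the caller must not
 * touch p again and has nothing left to free.
 */
static struct pkt *pkt_expand_head(struct pkt *p, size_t headroom)
{
	unsigned char *nhead;

	assert(headroom > p->headroom);	/* helper expects an increase */
	nhead = malloc(headroom + p->len);
	if (!nhead) {
		pkt_free(p);
		return NULL;
	}
	memcpy(nhead + headroom, p->head + p->headroom, p->len);
	free(p->head);
	p->head = nhead;
	p->headroom = headroom;
	return p;
}
```

A caller in this model writes "p = pkt_expand_head(p, needed); if (!p) return
-ENOMEM;" with no separate free on the error path, which is the shape the
converted callers in this series take.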

v2 changes:
 - helper's name was changed to skb_expand_head
 - fixed few mistakes inside skb_expand_head():
    skb_set_owner_w should set sk on nskb
    kfree was replaced by kfree_skb()
    improved warning message
 - added minor refactoring in changed functions in vrf and bpf patches
 - removed kfree_skb() in ax25_rt_build_path caller ax25_ip_xmit

NB: patch "ipv6: use skb_expand_head in ip6_finish_output2" depends on 
patch "ipv6: allocate enough headroom in ip6_finish_output2()" submitted separately
https://lkml.org/lkml/2021/7/12/732

Vasily Averin (7):
  skbuff: introduce skb_expand_head()
  ipv6: use skb_expand_head in ip6_finish_output2
  ipv6: use skb_expand_head in ip6_xmit refactoring
  ipv4: use skb_expand_head in ip_finish_output2
  vrf: use skb_expand_head in vrf_finish_output
  ax25: use skb_expand_head
  bpf: use skb_expand_head in bpf_out_neigh_v4/6

 drivers/net/vrf.c      | 21 +++++---------
 include/linux/skbuff.h |  1 +
 net/ax25/ax25_out.c    | 12 ++------
 net/ax25/ax25_route.c  | 13 ++-------
 net/core/filter.c      | 27 ++++-------------
 net/core/skbuff.c      | 42 +++++++++++++++++++++++++++
 net/ipv4/ip_output.c   | 13 ++-------
 net/ipv6/ip6_output.c  | 78 +++++++++++++++++---------------------------------
 8 files changed, 90 insertions(+), 117 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v2 1/7] skbuff: introduce skb_expand_head()
       [not found]                   ` <cover.1626206993.git.vvs@virtuozzo.com>
@ 2021-07-13 20:57                     ` Vasily Averin
  2021-07-13 20:58                     ` [PATCH NET v2 2/7] ipv6: use skb_expand_head in ip6_finish_output2 Vasily Averin
                                       ` (5 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-13 20:57 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Joerg Reuter, Ralf Baechle, linux-hams,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, bpf,
	linux-kernel

Like skb_realloc_headroom(), the new helper increases the headroom of the specified skb.
Unlike skb_realloc_headroom(), it does not allocate a new skb if possible,
copies skb->sk to the new skb when needed, and frees the original skb in case
of failure.

This helps to simplify ip[6]_finish_output2() and a few other similar cases.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c      | 42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index dbf820a..0003307 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1174,6 +1174,7 @@ static inline struct sk_buff *__pskb_copy(struct sk_buff *skb, int headroom,
 int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, gfp_t gfp_mask);
 struct sk_buff *skb_realloc_headroom(struct sk_buff *skb,
 				     unsigned int headroom);
+struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom);
 struct sk_buff *skb_copy_expand(const struct sk_buff *skb, int newheadroom,
 				int newtailroom, gfp_t priority);
 int __must_check skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist *sg,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bbc3b4b..a7997c2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1769,6 +1769,48 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
 EXPORT_SYMBOL(skb_realloc_headroom);
 
 /**
+ *	skb_expand_head - reallocate header of &sk_buff
+ *	@skb: buffer to reallocate
+ *	@headroom: needed headroom
+ *
+ *	Unlike skb_realloc_headroom, this one does not allocate a new skb
+ *	if possible; copies skb->sk to new skb as needed
+ *	and frees original skb in case of failures.
+ *
+ *	It expects increased headroom and generates a warning otherwise.
+ */
+
+struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
+{
+	int delta = headroom - skb_headroom(skb);
+
+	if (WARN_ONCE(delta <= 0,
+		      "%s is expecting an increase in the headroom", __func__))
+		return skb;
+
+	/* pskb_expand_head() might crash, if skb is shared */
+	if (skb_shared(skb)) {
+		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
+
+		if (likely(nskb)) {
+			if (skb->sk)
+				skb_set_owner_w(nskb, skb->sk);
+			consume_skb(skb);
+		} else {
+			kfree_skb(skb);
+		}
+		skb = nskb;
+	}
+	if (skb &&
+	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
+		kfree_skb(skb);
+		skb = NULL;
+	}
+	return skb;
+}
+EXPORT_SYMBOL(skb_expand_head);
+
+/**
  *	skb_copy_expand	-	copy and expand sk_buff
  *	@skb: buffer to copy
  *	@newheadroom: new free bytes at head
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v2 2/7] ipv6: use skb_expand_head in ip6_finish_output2
       [not found]                   ` <cover.1626206993.git.vvs@virtuozzo.com>
  2021-07-13 20:57                     ` [PATCH NET v2 1/7] skbuff: introduce skb_expand_head() Vasily Averin
@ 2021-07-13 20:58                     ` Vasily Averin
  2021-07-13 20:58                     ` [PATCH NET v2 3/7] ipv6: use skb_expand_head in ip6_xmit Vasily Averin
                                       ` (4 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-13 20:58 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Unlike skb_realloc_headroom, new helper skb_expand_head does not allocate
a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.
---
NB: this patch depends on
"ipv6: allocate enough headroom in ip6_finish_output2()"
https://lkml.org/lkml/2021/7/12/732

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/ip6_output.c | 51 ++++++++++++++++-----------------------------------
 1 file changed, 16 insertions(+), 35 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 25144c7..6c4925e 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -60,46 +60,29 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 {
 	struct dst_entry *dst = skb_dst(skb);
 	struct net_device *dev = dst->dev;
+	struct inet6_dev *idev = ip6_dst_idev(dst);
 	unsigned int hh_len = LL_RESERVED_SPACE(dev);
-	int delta = hh_len - skb_headroom(skb);
-	const struct in6_addr *nexthop;
+	const struct in6_addr *daddr, *nexthop;
+	struct ipv6hdr *hdr;
 	struct neighbour *neigh;
 	int ret;
 
 	/* Be paranoid, rather than too clever. */
-	if (unlikely(delta > 0) && dev->header_ops) {
-		/* pskb_expand_head() might crash, if skb is shared */
-		if (skb_shared(skb)) {
-			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
-
-			if (likely(nskb)) {
-				if (skb->sk)
-					skb_set_owner_w(nskb, skb->sk);
-				consume_skb(skb);
-			} else {
-				kfree_skb(skb);
-			}
-			skb = nskb;
-		}
-		if (skb &&
-		    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
-			kfree_skb(skb);
-			skb = NULL;
-		}
+	if (unlikely(hh_len > skb_headroom(skb)) && dev->header_ops) {
+		skb = skb_expand_head(skb, hh_len);
 		if (!skb) {
-			IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTDISCARDS);
+			IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
 			return -ENOMEM;
 		}
 	}
 
-	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr)) {
-		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
-
+	hdr = ipv6_hdr(skb);
+	daddr = &hdr->daddr;
+	if (ipv6_addr_is_multicast(daddr)) {
 		if (!(dev->flags & IFF_LOOPBACK) && sk_mc_loop(sk) &&
 		    ((mroute6_is_socket(net, skb) &&
 		     !(IP6CB(skb)->flags & IP6SKB_FORWARDED)) ||
-		     ipv6_chk_mcast_addr(dev, &ipv6_hdr(skb)->daddr,
-					 &ipv6_hdr(skb)->saddr))) {
+		     ipv6_chk_mcast_addr(dev, daddr, &hdr->saddr))) {
 			struct sk_buff *newskb = skb_clone(skb, GFP_ATOMIC);
 
 			/* Do not check for IFF_ALLMULTI; multicast routing
@@ -110,7 +93,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 					net, sk, newskb, NULL, newskb->dev,
 					dev_loopback_xmit);
 
-			if (ipv6_hdr(skb)->hop_limit == 0) {
+			if (hdr->hop_limit == 0) {
 				IP6_INC_STATS(net, idev,
 					      IPSTATS_MIB_OUTDISCARDS);
 				kfree_skb(skb);
@@ -119,9 +102,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 		}
 
 		IP6_UPD_PO_STATS(net, idev, IPSTATS_MIB_OUTMCAST, skb->len);
-
-		if (IPV6_ADDR_MC_SCOPE(&ipv6_hdr(skb)->daddr) <=
-		    IPV6_ADDR_SCOPE_NODELOCAL &&
+		if (IPV6_ADDR_MC_SCOPE(daddr) <= IPV6_ADDR_SCOPE_NODELOCAL &&
 		    !(dev->flags & IFF_LOOPBACK)) {
 			kfree_skb(skb);
 			return 0;
@@ -136,10 +117,10 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 	}
 
 	rcu_read_lock_bh();
-	nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
-	neigh = __ipv6_neigh_lookup_noref(dst->dev, nexthop);
+	nexthop = rt6_nexthop((struct rt6_info *)dst, daddr);
+	neigh = __ipv6_neigh_lookup_noref(dev, nexthop);
 	if (unlikely(!neigh))
-		neigh = __neigh_create(&nd_tbl, nexthop, dst->dev, false);
+		neigh = __neigh_create(&nd_tbl, nexthop, dev, false);
 	if (!IS_ERR(neigh)) {
 		sock_confirm_neigh(skb, neigh);
 		ret = neigh_output(neigh, skb, false);
@@ -148,7 +129,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 	}
 	rcu_read_unlock_bh();
 
-	IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
+	IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTNOROUTES);
 	kfree_skb(skb);
 	return -EINVAL;
 }
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v2 3/7] ipv6: use skb_expand_head in ip6_xmit
       [not found]                   ` <cover.1626206993.git.vvs@virtuozzo.com>
  2021-07-13 20:57                     ` [PATCH NET v2 1/7] skbuff: introduce skb_expand_head() Vasily Averin
  2021-07-13 20:58                     ` [PATCH NET v2 2/7] ipv6: use skb_expand_head in ip6_finish_output2 Vasily Averin
@ 2021-07-13 20:58                     ` Vasily Averin
  2021-07-13 20:58                     ` [PATCH NET v2 4/7] ipv4: use skb_expand_head in ip_finish_output2 Vasily Averin
                                       ` (3 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-13 20:58 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/ip6_output.c | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6c4925e..90cd7b6 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -249,6 +249,8 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	const struct ipv6_pinfo *np = inet6_sk(sk);
 	struct in6_addr *first_hop = &fl6->daddr;
 	struct dst_entry *dst = skb_dst(skb);
+	struct net_device *dev = dst->dev;
+	struct inet6_dev *idev = ip6_dst_idev(dst);
 	unsigned int head_room;
 	struct ipv6hdr *hdr;
 	u8  proto = fl6->flowi6_proto;
@@ -256,22 +258,16 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	int hlimit = -1;
 	u32 mtu;
 
-	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dst->dev);
+	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dev);
 	if (opt)
 		head_room += opt->opt_nflen + opt->opt_flen;
 
-	if (unlikely(skb_headroom(skb) < head_room)) {
-		struct sk_buff *skb2 = skb_realloc_headroom(skb, head_room);
-		if (!skb2) {
-			IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-				      IPSTATS_MIB_OUTDISCARDS);
-			kfree_skb(skb);
+	if (unlikely(head_room > skb_headroom(skb))) {
+		skb = skb_expand_head(skb, head_room);
+		if (!skb) {
+			IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
 			return -ENOBUFS;
 		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	if (opt) {
@@ -313,8 +309,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 
 	mtu = dst_mtu(dst);
 	if ((skb->len <= mtu) || skb->ignore_df || skb_is_gso(skb)) {
-		IP6_UPD_PO_STATS(net, ip6_dst_idev(skb_dst(skb)),
-			      IPSTATS_MIB_OUT, skb->len);
+		IP6_UPD_PO_STATS(net, idev, IPSTATS_MIB_OUT, skb->len);
 
 		/* if egress device is enslaved to an L3 master device pass the
 		 * skb to its handler for processing
@@ -327,17 +322,17 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 		 * we promote our socket to non const
 		 */
 		return NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_OUT,
-			       net, (struct sock *)sk, skb, NULL, dst->dev,
+			       net, (struct sock *)sk, skb, NULL, dev,
 			       dst_output);
 	}
 
-	skb->dev = dst->dev;
+	skb->dev = dev;
 	/* ipv6_local_error() does not require socket lock,
 	 * we promote our socket to non const
 	 */
 	ipv6_local_error((struct sock *)sk, EMSGSIZE, fl6, mtu);
 
-	IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_FRAGFAILS);
+	IP6_INC_STATS(net, idev, IPSTATS_MIB_FRAGFAILS);
 	kfree_skb(skb);
 	return -EMSGSIZE;
 }
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v2 4/7] ipv4: use skb_expand_head in ip_finish_output2
       [not found]                   ` <cover.1626206993.git.vvs@virtuozzo.com>
                                       ` (2 preceding siblings ...)
  2021-07-13 20:58                     ` [PATCH NET v2 3/7] ipv6: use skb_expand_head in ip6_xmit Vasily Averin
@ 2021-07-13 20:58                     ` Vasily Averin
  2021-07-13 20:58                     ` [PATCH NET v2 5/7] vrf: use skb_expand_head in vrf_finish_output Vasily Averin
                                       ` (2 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-13 20:58 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv4/ip_output.c | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c3efc7d..5b2f6ea 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -198,19 +198,10 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
 	} else if (rt->rt_type == RTN_BROADCAST)
 		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTBCAST, skb->len);
 
-	/* Be paranoid, rather than too clever. */
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
-
-		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
-		if (!skb2) {
-			kfree_skb(skb);
+		skb = skb_expand_head(skb, hh_len);
+		if (!skb)
 			return -ENOMEM;
-		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v2 5/7] vrf: use skb_expand_head in vrf_finish_output
       [not found]                   ` <cover.1626206993.git.vvs@virtuozzo.com>
                                       ` (3 preceding siblings ...)
  2021-07-13 20:58                     ` [PATCH NET v2 4/7] ipv4: use skb_expand_head in ip_finish_output2 Vasily Averin
@ 2021-07-13 20:58                     ` Vasily Averin
  2021-07-13 20:58                     ` [PATCH NET v2 6/7] ax25: use skb_expand_head Vasily Averin
  2021-07-13 20:58                     ` [PATCH NET v2 7/7] bpf: use skb_expand_head in bpf_out_neigh_v4/6 Vasily Averin
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-13 20:58 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 drivers/net/vrf.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 28a6c4c..82e7696 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -857,30 +857,24 @@ static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *s
 	unsigned int hh_len = LL_RESERVED_SPACE(dev);
 	struct neighbour *neigh;
 	bool is_v6gw = false;
-	int ret = -EINVAL;
 
 	nf_reset_ct(skb);
 
 	/* Be paranoid, rather than too clever. */
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
-
-		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
-		if (!skb2) {
-			ret = -ENOMEM;
-			goto err;
+		skb = skb_expand_head(skb, hh_len);
+		if (!skb) {
+			dev->stats.tx_errors++;
+			return -ENOMEM;
 		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	rcu_read_lock_bh();
 
 	neigh = ip_neigh_for_gw(rt, skb, &is_v6gw);
 	if (!IS_ERR(neigh)) {
+		int ret;
+
 		sock_confirm_neigh(skb, neigh);
 		/* if crossing protocols, can not use the cached header */
 		ret = neigh_output(neigh, skb, is_v6gw);
@@ -889,9 +883,8 @@ static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *s
 	}
 
 	rcu_read_unlock_bh();
-err:
 	vrf_tx_error(skb->dev, skb);
-	return ret;
+	return -EINVAL;
 }
 
 static int vrf_output(struct net *net, struct sock *sk, struct sk_buff *skb)
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v2 6/7] ax25: use skb_expand_head
       [not found]                   ` <cover.1626206993.git.vvs@virtuozzo.com>
                                       ` (4 preceding siblings ...)
  2021-07-13 20:58                     ` [PATCH NET v2 5/7] vrf: use skb_expand_head in vrf_finish_output Vasily Averin
@ 2021-07-13 20:58                     ` Vasily Averin
  2021-07-13 20:58                     ` [PATCH NET v2 7/7] bpf: use skb_expand_head in bpf_out_neigh_v4/6 Vasily Averin
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-13 20:58 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Joerg Reuter, Ralf Baechle, linux-hams, linux-kernel

Use skb_expand_head() in ax25_transmit_buffer and ax25_rt_build_path.
Unlike skb_realloc_headroom, new helper does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
v2: removed kfree_skb() in ax25_rt_build_path caller ax25_ip_xmit

 net/ax25/ax25_ip.c    |  4 +---
 net/ax25/ax25_out.c   | 12 +++---------
 net/ax25/ax25_route.c | 13 +++----------
 3 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/net/ax25/ax25_ip.c b/net/ax25/ax25_ip.c
index e4f63dd..3624977 100644
--- a/net/ax25/ax25_ip.c
+++ b/net/ax25/ax25_ip.c
@@ -193,10 +193,8 @@ netdev_tx_t ax25_ip_xmit(struct sk_buff *skb)
 	skb_pull(skb, AX25_KISS_HEADER_LEN);
 
 	if (digipeat != NULL) {
-		if ((ourskb = ax25_rt_build_path(skb, src, dst, route->digipeat)) == NULL) {
-			kfree_skb(skb);
+		if ((ourskb = ax25_rt_build_path(skb, src, dst, route->digipeat)) == NULL)
 			goto put;
-		}
 
 		skb = ourskb;
 	}
diff --git a/net/ax25/ax25_out.c b/net/ax25/ax25_out.c
index f53751b..af4a10e 100644
--- a/net/ax25/ax25_out.c
+++ b/net/ax25/ax25_out.c
@@ -336,18 +336,12 @@ void ax25_transmit_buffer(ax25_cb *ax25, struct sk_buff *skb, int type)
 
 	headroom = ax25_addr_size(ax25->digipeat);
 
-	if (skb_headroom(skb) < headroom) {
-		if ((skbn = skb_realloc_headroom(skb, headroom)) == NULL) {
+	if (unlikely(skb_headroom(skb) < headroom)) {
+		skb = skb_expand_head(skb, headroom);
+		if (!skb) {
 			printk(KERN_CRIT "AX.25: ax25_transmit_buffer - out of memory\n");
-			kfree_skb(skb);
 			return;
 		}
-
-		if (skb->sk != NULL)
-			skb_set_owner_w(skbn, skb->sk);
-
-		consume_skb(skb);
-		skb = skbn;
 	}
 
 	ptr = skb_push(skb, headroom);
diff --git a/net/ax25/ax25_route.c b/net/ax25/ax25_route.c
index b40e0bc..d0b2e09 100644
--- a/net/ax25/ax25_route.c
+++ b/net/ax25/ax25_route.c
@@ -441,24 +441,17 @@ int ax25_rt_autobind(ax25_cb *ax25, ax25_address *addr)
 struct sk_buff *ax25_rt_build_path(struct sk_buff *skb, ax25_address *src,
 	ax25_address *dest, ax25_digi *digi)
 {
-	struct sk_buff *skbn;
 	unsigned char *bp;
 	int len;
 
 	len = digi->ndigi * AX25_ADDR_LEN;
 
-	if (skb_headroom(skb) < len) {
-		if ((skbn = skb_realloc_headroom(skb, len)) == NULL) {
+	if (unlikely(skb_headroom(skb) < len)) {
+		skb = skb_expand_head(skb, len);
+		if (!skb) {
 			printk(KERN_CRIT "AX.25: ax25_dg_build_path - out of memory\n");
 			return NULL;
 		}
-
-		if (skb->sk != NULL)
-			skb_set_owner_w(skbn, skb->sk);
-
-		consume_skb(skb);
-
-		skb = skbn;
 	}
 
 	bp = skb_push(skb, len);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v2 7/7] bpf: use skb_expand_head in bpf_out_neigh_v4/6
       [not found]                   ` <cover.1626206993.git.vvs@virtuozzo.com>
                                       ` (5 preceding siblings ...)
  2021-07-13 20:58                     ` [PATCH NET v2 6/7] ax25: use skb_expand_head Vasily Averin
@ 2021-07-13 20:58                     ` Vasily Averin
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-13 20:58 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, bpf,
	linux-kernel

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/filter.c | 27 +++++----------------------
 1 file changed, 5 insertions(+), 22 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 65ab4e2..25a6950 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2179,17 +2179,9 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
 	skb->tstamp = 0;
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
-
-		skb2 = skb_realloc_headroom(skb, hh_len);
-		if (unlikely(!skb2)) {
-			kfree_skb(skb);
+		skb = skb_expand_head(skb, hh_len);
+		if (!skb)
 			return -ENOMEM;
-		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	rcu_read_lock_bh();
@@ -2213,8 +2205,7 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
 	}
 	rcu_read_unlock_bh();
 	if (dst)
-		IP6_INC_STATS(dev_net(dst->dev),
-			      ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
+		IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
 out_drop:
 	kfree_skb(skb);
 	return -ENETDOWN;
@@ -2286,17 +2277,9 @@ static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb,
 	skb->tstamp = 0;
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
-
-		skb2 = skb_realloc_headroom(skb, hh_len);
-		if (unlikely(!skb2)) {
-			kfree_skb(skb);
+		skb = skb_expand_head(skb, hh_len);
+		if (!skb)
 			return -ENOMEM;
-		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	rcu_read_lock_bh();
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET v4 1/1] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-13 12:01                   ` [PATCH NET v4 1/1] " Vasily Averin
@ 2021-07-18 10:44                     ` Vasily Averin
  2021-07-18 15:22                       ` David Ahern
  2021-07-18 17:04                       ` David Miller
  0 siblings, 2 replies; 106+ messages in thread
From: Vasily Averin @ 2021-07-18 10:44 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, linux-kernel, Hideaki YOSHIFUJI, Jakub Kicinski,
	David Ahern, Eric Dumazet

Dear David,
I've found that you have added the v3 version of this patch to the netdev net git tree.
This version had one mistake: skb_set_owner_w() should set sk not on the old skb but on the new nskb.
I've fixed it in the v4 version.

Could you please drop the bad v3 version and pick up the fixed one?
Should I perhaps submit a separate fixup instead?
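
For illustration only, the ownership hand-off described above can be modelled in user space. This is a minimal sketch under stated assumptions: every toy_* type and function below is hypothetical and merely imitates the roles of skb_clone(), skb_set_owner_w() and consume_skb(); none of it is kernel code.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy socket: tracks only the bytes of queued buffers charged to it. */
struct toy_sock {
	int wmem;
};

/* Toy buffer: an optional owner plus the size charged to that owner. */
struct toy_skb {
	struct toy_sock *sk;
	int truesize;
};

/* Charge @skb to @sk, loosely imitating skb_set_owner_w(). */
static void toy_set_owner_w(struct toy_skb *skb, struct toy_sock *sk)
{
	skb->sk = sk;
	sk->wmem += skb->truesize;
}

/* A clone starts without an owner, as assumed for skb_clone(). */
static struct toy_skb *toy_clone(const struct toy_skb *skb)
{
	struct toy_skb *n = malloc(sizeof(*n));

	if (n) {
		n->sk = NULL;
		n->truesize = skb->truesize;
	}
	return n;
}

/* Release the buffer and uncharge its owner, if any. */
static void toy_consume(struct toy_skb *skb)
{
	if (skb->sk)
		skb->sk->wmem -= skb->truesize;
	free(skb);
}

/* Replace @skb with a clone and hand the owner over to the clone. */
static struct toy_skb *toy_unshare(struct toy_skb *skb)
{
	struct toy_skb *nskb = toy_clone(skb);

	if (nskb && skb->sk)
		toy_set_owner_w(nskb, skb->sk);	/* nskb, not skb */
	toy_consume(skb);
	return nskb;
}

/* Returns 1 if the clone ends up owned by the socket, -1 on allocation failure. */
static int toy_owner_demo(void)
{
	struct toy_sock sk = { .wmem = 0 };
	struct toy_skb *skb = malloc(sizeof(*skb));
	struct toy_skb *nskb;
	int owned;

	if (!skb)
		return -1;
	skb->sk = NULL;
	skb->truesize = 256;
	toy_set_owner_w(skb, &sk);

	nskb = toy_unshare(skb);
	if (!nskb)
		return -1;
	owned = (nskb->sk == &sk && sk.wmem == 256);

	toy_consume(nskb);
	assert(sk.wmem == 0);	/* every charge was released */
	return owned;
}
```

In this model, passing skb instead of nskb to toy_set_owner_w() inside toy_unshare() would leave the clone without an owner, which is the kind of mistake the v4 version corrects.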

Thank you,
	Vasily Averin

On 7/13/21 3:01 PM, Vasily Averin wrote:
> When TEE target mirrors traffic to another interface, sk_buff may
> not have enough headroom to be processed correctly.
> ip_finish_output2() detects this situation for ipv4 and allocates
> a new skb with enough headroom. However, ipv6 lacks this logic in
> ip6_finish_output2() and it leads to skb_under_panic:
> 
>  skbuff: skb_under_panic: text:ffffffffc0866ad4 len:96 put:24
>  head:ffff97be85e31800 data:ffff97be85e317f8 tail:0x58 end:0xc0 dev:gre0
>  ------------[ cut here ]------------
>  kernel BUG at net/core/skbuff.c:110!
>  invalid opcode: 0000 [#1] SMP PTI
>  CPU: 2 PID: 393 Comm: kworker/2:2 Tainted: G           OE     5.13.0 #13
>  Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
>  Workqueue: ipv6_addrconf addrconf_dad_work
>  RIP: 0010:skb_panic+0x48/0x4a
>  Call Trace:
>   skb_push.cold.111+0x10/0x10
>   ipgre_header+0x24/0xf0 [ip_gre]
>   neigh_connected_output+0xae/0xf0
>   ip6_finish_output2+0x1a8/0x5a0
>   ip6_output+0x5c/0x110
>   nf_dup_ipv6+0x158/0x1000 [nf_dup_ipv6]
>   tee_tg6+0x2e/0x40 [xt_TEE]
>   ip6t_do_table+0x294/0x470 [ip6_tables]
>   nf_hook_slow+0x44/0xc0
>   nf_hook.constprop.34+0x72/0xe0
>   ndisc_send_skb+0x20d/0x2e0
>   ndisc_send_ns+0xd1/0x210
>   addrconf_dad_work+0x3c8/0x540
>   process_one_work+0x1d1/0x370
>   worker_thread+0x30/0x390
>   kthread+0x116/0x130
>   ret_from_fork+0x22/0x30
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  net/ipv6/ip6_output.c | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index ff4f9eb..25144c7 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -60,10 +60,38 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
>  {
>  	struct dst_entry *dst = skb_dst(skb);
>  	struct net_device *dev = dst->dev;
> +	unsigned int hh_len = LL_RESERVED_SPACE(dev);
> +	int delta = hh_len - skb_headroom(skb);
>  	const struct in6_addr *nexthop;
>  	struct neighbour *neigh;
>  	int ret;
>  
> +	/* Be paranoid, rather than too clever. */
> +	if (unlikely(delta > 0) && dev->header_ops) {
> +		/* pskb_expand_head() might crash, if skb is shared */
> +		if (skb_shared(skb)) {
> +			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
> +
> +			if (likely(nskb)) {
> +				if (skb->sk)
> +					skb_set_owner_w(nskb, skb->sk);
> +				consume_skb(skb);
> +			} else {
> +				kfree_skb(skb);
> +			}
> +			skb = nskb;
> +		}
> +		if (skb &&
> +		    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
> +			kfree_skb(skb);
> +			skb = NULL;
> +		}
> +		if (!skb) {
> +			IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTDISCARDS);
> +			return -ENOMEM;
> +		}
> +	}
> +
>  	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr)) {
>  		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
>  
> 


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET v4 1/1] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-18 10:44                     ` Vasily Averin
@ 2021-07-18 15:22                       ` David Ahern
  2021-07-18 17:04                       ` David Miller
  1 sibling, 0 replies; 106+ messages in thread
From: David Ahern @ 2021-07-18 15:22 UTC (permalink / raw)
  To: Vasily Averin, David S. Miller
  Cc: netdev, linux-kernel, Hideaki YOSHIFUJI, Jakub Kicinski,
	David Ahern, Eric Dumazet

On 7/18/21 4:44 AM, Vasily Averin wrote:
> I've found that you have added the v3 version of this patch to the netdev net git tree.
> This version had one mistake: skb_set_owner_w() should set sk not on the old skb but on the new nskb.
> I've fixed it in the v4 version.
> 
> Could you please drop the bad v3 version and pick up the fixed one?
> Should I perhaps submit a separate fixup instead?

Patches are not dropped once pushed; send the diff between v3 and v4
with a Fixes tag.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET v4 1/1] ipv6: allocate enough headroom in ip6_finish_output2()
  2021-07-18 10:44                     ` Vasily Averin
  2021-07-18 15:22                       ` David Ahern
@ 2021-07-18 17:04                       ` David Miller
  2021-07-19  7:55                         ` [PATCH NET] ipv6: ip6_finish_output2: set sk into newly allocated nskb Vasily Averin
  1 sibling, 1 reply; 106+ messages in thread
From: David Miller @ 2021-07-18 17:04 UTC (permalink / raw)
  To: vvs; +Cc: netdev, linux-kernel, yoshfuji, kuba, dsahern, eric.dumazet

From: Vasily Averin <vvs@virtuozzo.com>
Date: Sun, 18 Jul 2021 13:44:33 +0300

> Dear David,
> I've found that you have added the v3 version of this patch to the netdev net git tree.
> This version had one mistake: skb_set_owner_w() should set sk not on the old skb but on the new nskb.
> I've fixed it in the v4 version.
> 
> Could you please drop the bad v3 version and pick up the fixed one?
> Should I perhaps submit a separate fixup instead?

Always submit a fixup in these situations.

Thank you.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET] ipv6: ip6_finish_output2: set sk into newly allocated nskb
  2021-07-18 17:04                       ` David Miller
@ 2021-07-19  7:55                         ` Vasily Averin
  2021-07-20 10:10                           ` patchwork-bot+netdevbpf
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-07-19  7:55 UTC (permalink / raw)
  To: David S. Miller
  Cc: Linux Kernel Network Developers, linux-kernel, David Ahern,
	Jakub Kicinski, Eric Dumazet, Hideaki YOSHIFUJI

skb_set_owner_w() should set sk not on the old skb but on the new nskb.

Fixes: 5796015fa968 ("ipv6: allocate enough headroom in ip6_finish_output2()")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/ip6_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 01bea76..e1b9f7a 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -74,7 +74,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 
 			if (likely(nskb)) {
 				if (skb->sk)
-					skb_set_owner_w(skb, skb->sk);
+					skb_set_owner_w(nskb, skb->sk);
 				consume_skb(skb);
 			} else {
 				kfree_skb(skb);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET] ipv6: ip6_finish_output2: set sk into newly allocated nskb
  2021-07-19  7:55                         ` [PATCH NET] ipv6: ip6_finish_output2: set sk into newly allocated nskb Vasily Averin
@ 2021-07-20 10:10                           ` patchwork-bot+netdevbpf
  0 siblings, 0 replies; 106+ messages in thread
From: patchwork-bot+netdevbpf @ 2021-07-20 10:10 UTC (permalink / raw)
  To: Vasily Averin
  Cc: davem, netdev, linux-kernel, dsahern, kuba, eric.dumazet, yoshfuji

Hello:

This patch was applied to netdev/net.git (refs/heads/master):

On Mon, 19 Jul 2021 10:55:14 +0300 you wrote:
> skb_set_owner_w() should set sk not on the old skb but on the new nskb.
> 
> Fixes: 5796015fa968 ("ipv6: allocate enough headroom in ip6_finish_output2()")
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  net/ipv6/ip6_output.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Here is the summary with links:
  - [NET] ipv6: ip6_finish_output2: set sk into newly allocated nskb
    https://git.kernel.org/netdev/net/c/2d85a1b31dde

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v3 0/7] skbuff: introduce skb_expand_head()
  2021-07-13 20:57                   ` [PATCH NET v2 0/7] skbuff: introduce skb_expand_head() Vasily Averin
@ 2021-08-02  8:52                     ` Vasily Averin
       [not found]                     ` <cover.1627891754.git.vvs@virtuozzo.com>
  1 sibling, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-02  8:52 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Joerg Reuter, Ralf Baechle, linux-hams,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, bpf,
	linux-kernel

Currently, if an skb does not have enough headroom, skb_realloc_headroom() is called.
This is not optimal because it creates a new skb.

This patch set introduces a new helper, skb_expand_head().
Unlike skb_realloc_headroom(), it does not allocate a new skb if possible;
it copies skb->sk to the new skb when needed and frees the original skb in case of failure.

This helps to simplify ip[6]_finish_output2(), ip6_xmit() and few other
functions in vrf, ax25 and bpf.
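
As an illustration of the calling convention described above, here is a minimal user-space sketch. It assumes a toy buffer type; toy_buf, toy_alloc(), toy_free(), toy_expand_head() and toy_expand_demo() are hypothetical names invented for this sketch and are not kernel APIs. It models only one property of the helper: on failure the buffer has already been freed, so the caller simply returns.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an skb: one allocation, payload placed after the headroom. */
struct toy_buf {
	unsigned char *head;	/* start of the allocation */
	size_t headroom;	/* free bytes in front of the payload */
	size_t len;		/* payload length */
};

static struct toy_buf *toy_alloc(size_t headroom, const char *payload)
{
	struct toy_buf *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->len = strlen(payload);
	b->headroom = headroom;
	b->head = malloc(headroom + b->len);
	if (!b->head) {
		free(b);
		return NULL;
	}
	memcpy(b->head + headroom, payload, b->len);
	return b;
}

static void toy_free(struct toy_buf *b)
{
	if (b) {
		free(b->head);
		free(b);
	}
}

/*
 * Grow the headroom to at least @headroom bytes while keeping the same
 * toy_buf object. On failure the buffer is freed and NULL is returned,
 * so the caller must not touch its old pointer afterwards.
 */
static struct toy_buf *toy_expand_head(struct toy_buf *b, size_t headroom)
{
	unsigned char *nhead;

	if (headroom <= b->headroom)
		return b;	/* already large enough */

	nhead = malloc(headroom + b->len);
	if (!nhead) {
		toy_free(b);
		return NULL;
	}
	memcpy(nhead + headroom, b->head + b->headroom, b->len);
	free(b->head);
	b->head = nhead;
	b->headroom = headroom;
	return b;
}

/* Caller pattern: reassign the pointer, bail out on NULL without freeing. */
static int toy_expand_demo(void)
{
	struct toy_buf *b = toy_alloc(8, "payload");
	size_t got;

	if (!b)
		return -1;
	b = toy_expand_head(b, 24);
	if (!b)
		return -1;	/* the helper has already freed the buffer */
	assert(memcmp(b->head + b->headroom, "payload", b->len) == 0);
	got = b->headroom;
	toy_free(b);
	return (int)got;
}
```

The caller side shrinks to a reassignment plus a NULL check, which is the shape the converted call sites in this series take.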

There are a few other cases where this helper can be used,
but they require additional investigation.

v3 changes:
 - ax25 compilation warning fixed
 - v5.14-rc4 rebase
 - now it does not depend on non-committed patches

v2 changes:
 - helper's name was changed to skb_expand_head
 - fixed few mistakes inside skb_expand_head():
    skb_set_owner_w should set sk on nskb
    kfree was replaced by kfree_skb()
    improved warning message
 - added minor refactoring in changed functions in vrf and bpf patches
 - removed kfree_skb() in ax25_rt_build_path caller ax25_ip_xmit


Vasily Averin (7):
  skbuff: introduce skb_expand_head()
  ipv6: use skb_expand_head in ip6_finish_output2
  ipv6: use skb_expand_head in ip6_xmit
  ipv4: use skb_expand_head in ip_finish_output2
  vrf: use skb_expand_head in vrf_finish_output
  ax25: use skb_expand_head
  bpf: use skb_expand_head in bpf_out_neigh_v4/6

 drivers/net/vrf.c      | 21 +++++---------
 include/linux/skbuff.h |  1 +
 net/ax25/ax25_ip.c     |  4 +--
 net/ax25/ax25_out.c    | 13 ++-------
 net/ax25/ax25_route.c  | 13 ++-------
 net/core/filter.c      | 27 ++++-------------
 net/core/skbuff.c      | 42 +++++++++++++++++++++++++++
 net/ipv4/ip_output.c   | 13 ++-------
 net/ipv6/ip6_output.c  | 78 +++++++++++++++++---------------------------------
 9 files changed, 91 insertions(+), 121 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v3 1/7] skbuff: introduce skb_expand_head()
       [not found]                     ` <cover.1627891754.git.vvs@virtuozzo.com>
@ 2021-08-02  8:52                       ` Vasily Averin
  2021-08-02  8:52                       ` [PATCH NET v3 2/7] ipv6: use skb_expand_head in ip6_finish_output2 Vasily Averin
                                         ` (5 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-02  8:52 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Joerg Reuter, Ralf Baechle, linux-hams,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, bpf,
	linux-kernel

Like skb_realloc_headroom(), the new helper increases the headroom of the specified skb.
Unlike skb_realloc_headroom(), it does not allocate a new skb if possible;
it copies skb->sk to the new skb when needed and frees the original skb in case
of failure.

This helps to simplify ip[6]_finish_output2() and a few other similar cases.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c      | 42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b2db9cd..ec8a783 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1179,6 +1179,7 @@ static inline struct sk_buff *__pskb_copy(struct sk_buff *skb, int headroom,
 int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, gfp_t gfp_mask);
 struct sk_buff *skb_realloc_headroom(struct sk_buff *skb,
 				     unsigned int headroom);
+struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom);
 struct sk_buff *skb_copy_expand(const struct sk_buff *skb, int newheadroom,
 				int newtailroom, gfp_t priority);
 int __must_check skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist *sg,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fc7942c..0c70b2b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1786,6 +1786,48 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
 EXPORT_SYMBOL(skb_realloc_headroom);
 
 /**
+ *	skb_expand_head - reallocate header of &sk_buff
+ *	@skb: buffer to reallocate
+ *	@headroom: needed headroom
+ *
+ *	Unlike skb_realloc_headroom, this one does not allocate a new skb
+ *	if possible; copies skb->sk to new skb as needed
+ *	and frees original skb in case of failures.
+ *
+ *	It expects increased headroom and generates a warning otherwise.
+ */
+
+struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
+{
+	int delta = headroom - skb_headroom(skb);
+
+	if (WARN_ONCE(delta <= 0,
+		      "%s is expecting an increase in the headroom", __func__))
+		return skb;
+
+	/* pskb_expand_head() might crash, if skb is shared */
+	if (skb_shared(skb)) {
+		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
+
+		if (likely(nskb)) {
+			if (skb->sk)
+				skb_set_owner_w(nskb, skb->sk);
+			consume_skb(skb);
+		} else {
+			kfree_skb(skb);
+		}
+		skb = nskb;
+	}
+	if (skb &&
+	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
+		kfree_skb(skb);
+		skb = NULL;
+	}
+	return skb;
+}
+EXPORT_SYMBOL(skb_expand_head);
+
+/**
  *	skb_copy_expand	-	copy and expand sk_buff
  *	@skb: buffer to copy
  *	@newheadroom: new free bytes at head
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v3 2/7] ipv6: use skb_expand_head in ip6_finish_output2
       [not found]                     ` <cover.1627891754.git.vvs@virtuozzo.com>
  2021-08-02  8:52                       ` [PATCH NET v3 1/7] " Vasily Averin
@ 2021-08-02  8:52                       ` Vasily Averin
  2021-08-02  8:52                       ` [PATCH NET v3 3/7] ipv6: use skb_expand_head in ip6_xmit Vasily Averin
                                         ` (4 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-02  8:52 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Unlike skb_realloc_headroom(), the new helper skb_expand_head() does not
allocate a new skb if possible.

Additionally, this patch replaces repeated dereferences with local variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/ip6_output.c | 51 ++++++++++++++++-----------------------------------
 1 file changed, 16 insertions(+), 35 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 8e6ca9a..7d2ec25 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -60,46 +60,29 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 {
 	struct dst_entry *dst = skb_dst(skb);
 	struct net_device *dev = dst->dev;
+	struct inet6_dev *idev = ip6_dst_idev(dst);
 	unsigned int hh_len = LL_RESERVED_SPACE(dev);
-	int delta = hh_len - skb_headroom(skb);
-	const struct in6_addr *nexthop;
+	const struct in6_addr *daddr, *nexthop;
+	struct ipv6hdr *hdr;
 	struct neighbour *neigh;
 	int ret;
 
 	/* Be paranoid, rather than too clever. */
-	if (unlikely(delta > 0) && dev->header_ops) {
-		/* pskb_expand_head() might crash, if skb is shared */
-		if (skb_shared(skb)) {
-			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
-
-			if (likely(nskb)) {
-				if (skb->sk)
-					skb_set_owner_w(nskb, skb->sk);
-				consume_skb(skb);
-			} else {
-				kfree_skb(skb);
-			}
-			skb = nskb;
-		}
-		if (skb &&
-		    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
-			kfree_skb(skb);
-			skb = NULL;
-		}
+	if (unlikely(hh_len > skb_headroom(skb)) && dev->header_ops) {
+		skb = skb_expand_head(skb, hh_len);
 		if (!skb) {
-			IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTDISCARDS);
+			IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
 			return -ENOMEM;
 		}
 	}
 
-	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr)) {
-		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
-
+	hdr = ipv6_hdr(skb);
+	daddr = &hdr->daddr;
+	if (ipv6_addr_is_multicast(daddr)) {
 		if (!(dev->flags & IFF_LOOPBACK) && sk_mc_loop(sk) &&
 		    ((mroute6_is_socket(net, skb) &&
 		     !(IP6CB(skb)->flags & IP6SKB_FORWARDED)) ||
-		     ipv6_chk_mcast_addr(dev, &ipv6_hdr(skb)->daddr,
-					 &ipv6_hdr(skb)->saddr))) {
+		     ipv6_chk_mcast_addr(dev, daddr, &hdr->saddr))) {
 			struct sk_buff *newskb = skb_clone(skb, GFP_ATOMIC);
 
 			/* Do not check for IFF_ALLMULTI; multicast routing
@@ -110,7 +93,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 					net, sk, newskb, NULL, newskb->dev,
 					dev_loopback_xmit);
 
-			if (ipv6_hdr(skb)->hop_limit == 0) {
+			if (hdr->hop_limit == 0) {
 				IP6_INC_STATS(net, idev,
 					      IPSTATS_MIB_OUTDISCARDS);
 				kfree_skb(skb);
@@ -119,9 +102,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 		}
 
 		IP6_UPD_PO_STATS(net, idev, IPSTATS_MIB_OUTMCAST, skb->len);
-
-		if (IPV6_ADDR_MC_SCOPE(&ipv6_hdr(skb)->daddr) <=
-		    IPV6_ADDR_SCOPE_NODELOCAL &&
+		if (IPV6_ADDR_MC_SCOPE(daddr) <= IPV6_ADDR_SCOPE_NODELOCAL &&
 		    !(dev->flags & IFF_LOOPBACK)) {
 			kfree_skb(skb);
 			return 0;
@@ -136,10 +117,10 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 	}
 
 	rcu_read_lock_bh();
-	nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
-	neigh = __ipv6_neigh_lookup_noref(dst->dev, nexthop);
+	nexthop = rt6_nexthop((struct rt6_info *)dst, daddr);
+	neigh = __ipv6_neigh_lookup_noref(dev, nexthop);
 	if (unlikely(!neigh))
-		neigh = __neigh_create(&nd_tbl, nexthop, dst->dev, false);
+		neigh = __neigh_create(&nd_tbl, nexthop, dev, false);
 	if (!IS_ERR(neigh)) {
 		sock_confirm_neigh(skb, neigh);
 		ret = neigh_output(neigh, skb, false);
@@ -148,7 +129,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 	}
 	rcu_read_unlock_bh();
 
-	IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
+	IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTNOROUTES);
 	kfree_skb(skb);
 	return -EINVAL;
 }
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v3 3/7] ipv6: use skb_expand_head in ip6_xmit
       [not found]                     ` <cover.1627891754.git.vvs@virtuozzo.com>
  2021-08-02  8:52                       ` [PATCH NET v3 1/7] " Vasily Averin
  2021-08-02  8:52                       ` [PATCH NET v3 2/7] ipv6: use skb_expand_head in ip6_finish_output2 Vasily Averin
@ 2021-08-02  8:52                       ` Vasily Averin
  2021-08-02  8:52                       ` [PATCH NET v3 4/7] ipv4: use skb_expand_head in ip_finish_output2 Vasily Averin
                                         ` (3 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-02  8:52 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Unlike skb_realloc_headroom(), the new helper skb_expand_head()
does not allocate a new skb if possible.

Additionally, this patch replaces repeated dereferences with local variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/ip6_output.c | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7d2ec25..f91d13a 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -249,6 +249,8 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	const struct ipv6_pinfo *np = inet6_sk(sk);
 	struct in6_addr *first_hop = &fl6->daddr;
 	struct dst_entry *dst = skb_dst(skb);
+	struct net_device *dev = dst->dev;
+	struct inet6_dev *idev = ip6_dst_idev(dst);
 	unsigned int head_room;
 	struct ipv6hdr *hdr;
 	u8  proto = fl6->flowi6_proto;
@@ -256,22 +258,16 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	int hlimit = -1;
 	u32 mtu;
 
-	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dst->dev);
+	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dev);
 	if (opt)
 		head_room += opt->opt_nflen + opt->opt_flen;
 
-	if (unlikely(skb_headroom(skb) < head_room)) {
-		struct sk_buff *skb2 = skb_realloc_headroom(skb, head_room);
-		if (!skb2) {
-			IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-				      IPSTATS_MIB_OUTDISCARDS);
-			kfree_skb(skb);
+	if (unlikely(head_room > skb_headroom(skb))) {
+		skb = skb_expand_head(skb, head_room);
+		if (!skb) {
+			IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
 			return -ENOBUFS;
 		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	if (opt) {
@@ -313,8 +309,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 
 	mtu = dst_mtu(dst);
 	if ((skb->len <= mtu) || skb->ignore_df || skb_is_gso(skb)) {
-		IP6_UPD_PO_STATS(net, ip6_dst_idev(skb_dst(skb)),
-			      IPSTATS_MIB_OUT, skb->len);
+		IP6_UPD_PO_STATS(net, idev, IPSTATS_MIB_OUT, skb->len);
 
 		/* if egress device is enslaved to an L3 master device pass the
 		 * skb to its handler for processing
@@ -327,17 +322,17 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 		 * we promote our socket to non const
 		 */
 		return NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_OUT,
-			       net, (struct sock *)sk, skb, NULL, dst->dev,
+			       net, (struct sock *)sk, skb, NULL, dev,
 			       dst_output);
 	}
 
-	skb->dev = dst->dev;
+	skb->dev = dev;
 	/* ipv6_local_error() does not require socket lock,
 	 * we promote our socket to non const
 	 */
 	ipv6_local_error((struct sock *)sk, EMSGSIZE, fl6, mtu);
 
-	IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_FRAGFAILS);
+	IP6_INC_STATS(net, idev, IPSTATS_MIB_FRAGFAILS);
 	kfree_skb(skb);
 	return -EMSGSIZE;
 }
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v3 4/7] ipv4: use skb_expand_head in ip_finish_output2
       [not found]                     ` <cover.1627891754.git.vvs@virtuozzo.com>
                                         ` (2 preceding siblings ...)
  2021-08-02  8:52                       ` [PATCH NET v3 3/7] ipv6: use skb_expand_head in ip6_xmit Vasily Averin
@ 2021-08-02  8:52                       ` Vasily Averin
  2021-08-02  8:52                       ` [PATCH NET v3 5/7] vrf: use skb_expand_head in vrf_finish_output Vasily Averin
                                         ` (2 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-02  8:52 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Unlike skb_realloc_headroom(), the new helper skb_expand_head()
does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv4/ip_output.c | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 8d8a8da..c6b755e 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -198,19 +198,10 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
 	} else if (rt->rt_type == RTN_BROADCAST)
 		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTBCAST, skb->len);
 
-	/* Be paranoid, rather than too clever. */
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
-
-		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
-		if (!skb2) {
-			kfree_skb(skb);
+		skb = skb_expand_head(skb, hh_len);
+		if (!skb)
 			return -ENOMEM;
-		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v3 5/7] vrf: use skb_expand_head in vrf_finish_output
       [not found]                     ` <cover.1627891754.git.vvs@virtuozzo.com>
                                         ` (3 preceding siblings ...)
  2021-08-02  8:52                       ` [PATCH NET v3 4/7] ipv4: use skb_expand_head in ip_finish_output2 Vasily Averin
@ 2021-08-02  8:52                       ` Vasily Averin
  2021-08-05 11:55                         ` Julian Wiedmann
  2021-08-02  8:52                       ` [PATCH NET v3 6/7] ax25: use skb_expand_head Vasily Averin
  2021-08-02  8:52                       ` [PATCH NET v3 7/7] bpf: use skb_expand_head in bpf_out_neigh_v4/6 Vasily Averin
  6 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-08-02  8:52 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel

Unlike skb_realloc_headroom(), the new helper skb_expand_head()
does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 drivers/net/vrf.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 2b1b944..726adf0 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -857,30 +857,24 @@ static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *s
 	unsigned int hh_len = LL_RESERVED_SPACE(dev);
 	struct neighbour *neigh;
 	bool is_v6gw = false;
-	int ret = -EINVAL;
 
 	nf_reset_ct(skb);
 
 	/* Be paranoid, rather than too clever. */
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
-
-		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
-		if (!skb2) {
-			ret = -ENOMEM;
-			goto err;
+		skb = skb_expand_head(skb, hh_len);
+		if (!skb) {
+			skb->dev->stats.tx_errors++;
+			return -ENOMEM;
 		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	rcu_read_lock_bh();
 
 	neigh = ip_neigh_for_gw(rt, skb, &is_v6gw);
 	if (!IS_ERR(neigh)) {
+		int ret;
+
 		sock_confirm_neigh(skb, neigh);
 		/* if crossing protocols, can not use the cached header */
 		ret = neigh_output(neigh, skb, is_v6gw);
@@ -889,9 +883,8 @@ static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *s
 	}
 
 	rcu_read_unlock_bh();
-err:
 	vrf_tx_error(skb->dev, skb);
-	return ret;
+	return -EINVAL;
 }
 
 static int vrf_output(struct net *net, struct sock *sk, struct sk_buff *skb)
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v3 6/7] ax25: use skb_expand_head
       [not found]                     ` <cover.1627891754.git.vvs@virtuozzo.com>
                                         ` (4 preceding siblings ...)
  2021-08-02  8:52                       ` [PATCH NET v3 5/7] vrf: use skb_expand_head in vrf_finish_output Vasily Averin
@ 2021-08-02  8:52                       ` Vasily Averin
  2021-08-02  8:52                       ` [PATCH NET v3 7/7] bpf: use skb_expand_head in bpf_out_neigh_v4/6 Vasily Averin
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-02  8:52 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Joerg Reuter, Ralf Baechle, linux-hams, linux-kernel

Use skb_expand_head() in ax25_transmit_buffer() and ax25_rt_build_path().
Unlike skb_realloc_headroom(), the new helper does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ax25/ax25_ip.c    |  4 +---
 net/ax25/ax25_out.c   | 13 +++----------
 net/ax25/ax25_route.c | 13 +++----------
 3 files changed, 7 insertions(+), 23 deletions(-)

diff --git a/net/ax25/ax25_ip.c b/net/ax25/ax25_ip.c
index e4f63dd..3624977 100644
--- a/net/ax25/ax25_ip.c
+++ b/net/ax25/ax25_ip.c
@@ -193,10 +193,8 @@ netdev_tx_t ax25_ip_xmit(struct sk_buff *skb)
 	skb_pull(skb, AX25_KISS_HEADER_LEN);
 
 	if (digipeat != NULL) {
-		if ((ourskb = ax25_rt_build_path(skb, src, dst, route->digipeat)) == NULL) {
-			kfree_skb(skb);
+		if ((ourskb = ax25_rt_build_path(skb, src, dst, route->digipeat)) == NULL)
 			goto put;
-		}
 
 		skb = ourskb;
 	}
diff --git a/net/ax25/ax25_out.c b/net/ax25/ax25_out.c
index f53751b..22f2f66 100644
--- a/net/ax25/ax25_out.c
+++ b/net/ax25/ax25_out.c
@@ -325,7 +325,6 @@ void ax25_kick(ax25_cb *ax25)
 
 void ax25_transmit_buffer(ax25_cb *ax25, struct sk_buff *skb, int type)
 {
-	struct sk_buff *skbn;
 	unsigned char *ptr;
 	int headroom;
 
@@ -336,18 +335,12 @@ void ax25_transmit_buffer(ax25_cb *ax25, struct sk_buff *skb, int type)
 
 	headroom = ax25_addr_size(ax25->digipeat);
 
-	if (skb_headroom(skb) < headroom) {
-		if ((skbn = skb_realloc_headroom(skb, headroom)) == NULL) {
+	if (unlikely(skb_headroom(skb) < headroom)) {
+		skb = skb_expand_head(skb, headroom);
+		if (!skb) {
 			printk(KERN_CRIT "AX.25: ax25_transmit_buffer - out of memory\n");
-			kfree_skb(skb);
 			return;
 		}
-
-		if (skb->sk != NULL)
-			skb_set_owner_w(skbn, skb->sk);
-
-		consume_skb(skb);
-		skb = skbn;
 	}
 
 	ptr = skb_push(skb, headroom);
diff --git a/net/ax25/ax25_route.c b/net/ax25/ax25_route.c
index b40e0bc..d0b2e09 100644
--- a/net/ax25/ax25_route.c
+++ b/net/ax25/ax25_route.c
@@ -441,24 +441,17 @@ int ax25_rt_autobind(ax25_cb *ax25, ax25_address *addr)
 struct sk_buff *ax25_rt_build_path(struct sk_buff *skb, ax25_address *src,
 	ax25_address *dest, ax25_digi *digi)
 {
-	struct sk_buff *skbn;
 	unsigned char *bp;
 	int len;
 
 	len = digi->ndigi * AX25_ADDR_LEN;
 
-	if (skb_headroom(skb) < len) {
-		if ((skbn = skb_realloc_headroom(skb, len)) == NULL) {
+	if (unlikely(skb_headroom(skb) < len)) {
+		skb = skb_expand_head(skb, len);
+		if (!skb) {
 			printk(KERN_CRIT "AX.25: ax25_dg_build_path - out of memory\n");
 			return NULL;
 		}
-
-		if (skb->sk != NULL)
-			skb_set_owner_w(skbn, skb->sk);
-
-		consume_skb(skb);
-
-		skb = skbn;
 	}
 
 	bp = skb_push(skb, len);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v3 7/7] bpf: use skb_expand_head in bpf_out_neigh_v4/6
       [not found]                     ` <cover.1627891754.git.vvs@virtuozzo.com>
                                         ` (5 preceding siblings ...)
  2021-08-02  8:52                       ` [PATCH NET v3 6/7] ax25: use skb_expand_head Vasily Averin
@ 2021-08-02  8:52                       ` Vasily Averin
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-02  8:52 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, bpf,
	linux-kernel

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/filter.c | 27 +++++----------------------
 1 file changed, 5 insertions(+), 22 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index d70187c..9c2f434 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2179,17 +2179,9 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
 	skb->tstamp = 0;
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
-
-		skb2 = skb_realloc_headroom(skb, hh_len);
-		if (unlikely(!skb2)) {
-			kfree_skb(skb);
+		skb = skb_expand_head(skb, hh_len);
+		if (!skb)
 			return -ENOMEM;
-		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	rcu_read_lock_bh();
@@ -2213,8 +2205,7 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
 	}
 	rcu_read_unlock_bh();
 	if (dst)
-		IP6_INC_STATS(dev_net(dst->dev),
-			      ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
+		IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
 out_drop:
 	kfree_skb(skb);
 	return -ENETDOWN;
@@ -2286,17 +2277,9 @@ static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb,
 	skb->tstamp = 0;
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
-
-		skb2 = skb_realloc_headroom(skb, hh_len);
-		if (unlikely(!skb2)) {
-			kfree_skb(skb);
+		skb = skb_expand_head(skb, hh_len);
+		if (!skb)
 			return -ENOMEM;
-		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	rcu_read_lock_bh();
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET v3 5/7] vrf: use skb_expand_head in vrf_finish_output
  2021-08-02  8:52                       ` [PATCH NET v3 5/7] vrf: use skb_expand_head in vrf_finish_output Vasily Averin
@ 2021-08-05 11:55                         ` Julian Wiedmann
  2021-08-05 12:55                           ` Vasily Averin
                                             ` (2 more replies)
  0 siblings, 3 replies; 106+ messages in thread
From: Julian Wiedmann @ 2021-08-05 11:55 UTC (permalink / raw)
  To: Vasily Averin, David S. Miller, Hideaki YOSHIFUJI, David Ahern,
	Jakub Kicinski, Eric Dumazet
  Cc: netdev, linux-kernel

On 02.08.21 11:52, Vasily Averin wrote:
> Unlike skb_realloc_headroom, new helper skb_expand_head
> does not allocate a new skb if possible.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  drivers/net/vrf.c | 21 +++++++--------------
>  1 file changed, 7 insertions(+), 14 deletions(-)
> 

[...]

>  	/* Be paranoid, rather than too clever. */
>  	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
> -		struct sk_buff *skb2;
> -
> -		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
> -		if (!skb2) {
> -			ret = -ENOMEM;
> -			goto err;
> +		skb = skb_expand_head(skb, hh_len);
> +		if (!skb) {
> +			skb->dev->stats.tx_errors++;
> +			return -ENOMEM;

Hello Vasily,

FYI, Coverity complains that we check skb != NULL here but then
still dereference skb->dev:


*** CID 1506214:  Null pointer dereferences  (FORWARD_NULL)
/drivers/net/vrf.c: 867 in vrf_finish_output()
861     	nf_reset_ct(skb);
862     
863     	/* Be paranoid, rather than too clever. */
864     	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
865     		skb = skb_expand_head(skb, hh_len);
866     		if (!skb) {
>>>     CID 1506214:  Null pointer dereferences  (FORWARD_NULL)
>>>     Dereferencing null pointer "skb".
867     			skb->dev->stats.tx_errors++;
868     			return -ENOMEM;
869     		}
870     	}
871     
872     	rcu_read_lock_bh();

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET v3 5/7] vrf: use skb_expand_head in vrf_finish_output
  2021-08-05 11:55                         ` Julian Wiedmann
@ 2021-08-05 12:55                           ` Vasily Averin
  2021-08-06  7:49                           ` [PATCH NET v4 0/7] skbuff: introduce skb_expand_head() Vasily Averin
       [not found]                           ` <cover.1628235065.git.vvs@virtuozzo.com>
  2 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-05 12:55 UTC (permalink / raw)
  To: Julian Wiedmann, David S. Miller, Hideaki YOSHIFUJI, David Ahern,
	Jakub Kicinski, Eric Dumazet
  Cc: netdev, linux-kernel

On 8/5/21 2:55 PM, Julian Wiedmann wrote:
> On 02.08.21 11:52, Vasily Averin wrote:
>> Unlike skb_realloc_headroom, new helper skb_expand_head
>> does not allocate a new skb if possible.
>>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>>  drivers/net/vrf.c | 21 +++++++--------------
>>  1 file changed, 7 insertions(+), 14 deletions(-)
>>
> 
> [...]
> 
>>  	/* Be paranoid, rather than too clever. */
>>  	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
>> -		struct sk_buff *skb2;
>> -
>> -		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
>> -		if (!skb2) {
>> -			ret = -ENOMEM;
>> -			goto err;
>> +		skb = skb_expand_head(skb, hh_len);
>> +		if (!skb) {
>> +			skb->dev->stats.tx_errors++;
>> +			return -ENOMEM;
> 
> Hello Vasily,
> 
> FYI, Coverity complains that we check skb != NULL here but then
> still dereference skb->dev:
> 
> 
> *** CID 1506214:  Null pointer dereferences  (FORWARD_NULL)
> /drivers/net/vrf.c: 867 in vrf_finish_output()
> 861     	nf_reset_ct(skb);
> 862     
> 863     	/* Be paranoid, rather than too clever. */
> 864     	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
> 865     		skb = skb_expand_head(skb, hh_len);
> 866     		if (!skb) {
>>>>     CID 1506214:  Null pointer dereferences  (FORWARD_NULL)
>>>>     Dereferencing null pointer "skb".
> 867     			skb->dev->stats.tx_errors++;
> 868     			return -ENOMEM;

My fault, I missed it.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v4 0/7] skbuff: introduce skb_expand_head()
  2021-08-05 11:55                         ` Julian Wiedmann
  2021-08-05 12:55                           ` Vasily Averin
@ 2021-08-06  7:49                           ` Vasily Averin
  2021-08-06 10:14                             ` David Miller
       [not found]                           ` <cover.1628235065.git.vvs@virtuozzo.com>
  2 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-08-06  7:49 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Joerg Reuter, Ralf Baechle, linux-hams,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, bpf,
	linux-kernel, kernel, Julian Wiedmann

Currently, if an skb does not have enough headroom, skb_realloc_headroom() is called.
This is not optimal because it creates a new skb.

This patch set introduces a new helper, skb_expand_head().
Unlike skb_realloc_headroom(), it does not allocate a new skb if possible;
it copies skb->sk to the new skb when needed and frees the original skb on failure.

This helps to simplify ip[6]_finish_output2(), ip6_xmit() and a few other
functions in vrf, ax25 and bpf.

There are a few other cases where this helper could be used,
but they require additional investigation.
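
For readers unfamiliar with the calling convention, here is a minimal
user-space sketch of the contract described above. It is not kernel code:
struct toy_buf and the toy_* functions are hypothetical stand-ins invented
for illustration. The sketch does not model shared skbs or skb->sk, and
unlike the real helper it returns silently when no increase is needed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct sk_buff: one heap block with
 * unused space ("headroom") in front of the payload.
 */
struct toy_buf {
	unsigned char *head;	/* start of the allocation */
	unsigned char *data;	/* start of the payload */
	size_t len;		/* payload length */
};

static struct toy_buf *toy_alloc(size_t headroom, const void *payload,
				 size_t len)
{
	struct toy_buf *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->head = malloc(headroom + len);
	if (!b->head) {
		free(b);
		return NULL;
	}
	b->data = b->head + headroom;
	b->len = len;
	memcpy(b->data, payload, len);
	return b;
}

static void toy_free(struct toy_buf *b)
{
	if (b) {
		free(b->head);
		free(b);
	}
}

static size_t toy_headroom(const struct toy_buf *b)
{
	return (size_t)(b->data - b->head);
}

/* Grow the headroom to at least @headroom bytes. The descriptor is
 * reused; only the data block is replaced. On failure the buffer is
 * freed and NULL is returned, so the caller must not touch it again.
 */
static struct toy_buf *toy_expand_head(struct toy_buf *b, size_t headroom)
{
	unsigned char *nhead;

	if (toy_headroom(b) >= headroom)
		return b;	/* nothing to do */

	nhead = malloc(headroom + b->len);
	if (!nhead) {
		toy_free(b);
		return NULL;
	}
	memcpy(nhead + headroom, b->data, b->len);
	free(b->head);
	b->head = nhead;
	b->data = nhead + headroom;
	return b;
}

/* Prepend @n bytes, like skb_push(); the caller guarantees the headroom. */
static unsigned char *toy_push(struct toy_buf *b, size_t n)
{
	assert(toy_headroom(b) >= n);
	b->data -= n;
	b->len += n;
	return b->data;
}
```

The point for callers is that the return value replaces the old pointer,
and a NULL return means the buffer is already gone, so no kfree_skb()-style
cleanup is needed on the error path.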

v4 changes:
 - fixed null pointer dereference in vrf patch reported by Julian Wiedmann

v3 changes:
 - ax25 compilation warning fixed
 - v5.14-rc4 rebase
 - now it does not depend on non-committed patches

v2 changes:
 - helper's name was changed to skb_expand_head
 - fixed few mistakes inside skb_expand_head():
    skb_set_owner_w should set sk on nskb
    kfree was replaced by kfree_skb()
    improved warning message
 - added minor refactoring in changed functions in vrf and bpf patches
 - removed kfree_skb() in ax25_rt_build_path caller ax25_ip_xmit


Vasily Averin (7):
  skbuff: introduce skb_expand_head()
  ipv6: use skb_expand_head in ip6_finish_output2
  ipv6: use skb_expand_head in ip6_xmit
  ipv4: use skb_expand_head in ip_finish_output2
  vrf: use skb_expand_head in vrf_finish_output
  ax25: use skb_expand_head
  bpf: use skb_expand_head in bpf_out_neigh_v4/6

 drivers/net/vrf.c      | 23 ++++++---------
 include/linux/skbuff.h |  1 +
 net/ax25/ax25_ip.c     |  4 +--
 net/ax25/ax25_out.c    | 13 ++-------
 net/ax25/ax25_route.c  | 13 ++-------
 net/core/filter.c      | 27 ++++-------------
 net/core/skbuff.c      | 42 +++++++++++++++++++++++++++
 net/ipv4/ip_output.c   | 13 ++-------
 net/ipv6/ip6_output.c  | 78 +++++++++++++++++---------------------------------
 9 files changed, 92 insertions(+), 122 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v4 1/7] skbuff: introduce skb_expand_head()
       [not found]                           ` <cover.1628235065.git.vvs@virtuozzo.com>
@ 2021-08-06  7:49                             ` Vasily Averin
  2021-08-06  7:50                             ` [PATCH NET v4 2/7] ipv6: use skb_expand_head in ip6_finish_output2 Vasily Averin
                                               ` (5 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-06  7:49 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Joerg Reuter, Ralf Baechle, linux-hams,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, bpf,
	linux-kernel, kernel, Julian Wiedmann

Like skb_realloc_headroom(), the new helper increases the headroom of the
specified skb. Unlike skb_realloc_headroom(), it does not allocate a new skb
if possible; it copies skb->sk to the new skb when needed and frees the
original skb on failure.

This helps to simplify ip[6]_finish_output2() and a few other similar cases.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c      | 42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b2db9cd..ec8a783 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1179,6 +1179,7 @@ static inline struct sk_buff *__pskb_copy(struct sk_buff *skb, int headroom,
 int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, gfp_t gfp_mask);
 struct sk_buff *skb_realloc_headroom(struct sk_buff *skb,
 				     unsigned int headroom);
+struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom);
 struct sk_buff *skb_copy_expand(const struct sk_buff *skb, int newheadroom,
 				int newtailroom, gfp_t priority);
 int __must_check skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist *sg,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fc7942c..0c70b2b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1786,6 +1786,48 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
 EXPORT_SYMBOL(skb_realloc_headroom);
 
 /**
+ *	skb_expand_head - reallocate header of &sk_buff
+ *	@skb: buffer to reallocate
+ *	@headroom: needed headroom
+ *
+ *	Unlike skb_realloc_headroom, this one does not allocate a new skb
+ *	if possible; copies skb->sk to new skb as needed
+ *	and frees original skb in case of failures.
+ *
+ *	It expect increased headroom and generates warning otherwise.
+ */
+
+struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
+{
+	int delta = headroom - skb_headroom(skb);
+
+	if (WARN_ONCE(delta <= 0,
+		      "%s is expecting an increase in the headroom", __func__))
+		return skb;
+
+	/* pskb_expand_head() might crash, if skb is shared */
+	if (skb_shared(skb)) {
+		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
+
+		if (likely(nskb)) {
+			if (skb->sk)
+				skb_set_owner_w(nskb, skb->sk);
+			consume_skb(skb);
+		} else {
+			kfree_skb(skb);
+		}
+		skb = nskb;
+	}
+	if (skb &&
+	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
+		kfree_skb(skb);
+		skb = NULL;
+	}
+	return skb;
+}
+EXPORT_SYMBOL(skb_expand_head);
+
+/**
  *	skb_copy_expand	-	copy and expand sk_buff
  *	@skb: buffer to copy
  *	@newheadroom: new free bytes at head
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v4 2/7] ipv6: use skb_expand_head in ip6_finish_output2
       [not found]                           ` <cover.1628235065.git.vvs@virtuozzo.com>
  2021-08-06  7:49                             ` [PATCH NET v4 1/7] skbuff: introduce skb_expand_head() Vasily Averin
@ 2021-08-06  7:50                             ` Vasily Averin
  2021-08-06  7:50                             ` [PATCH NET v4 3/7] ipv6: use skb_expand_head in ip6_xmit Vasily Averin
                                               ` (4 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-06  7:50 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel, kernel, Julian Wiedmann

Unlike skb_realloc_headroom, new helper skb_expand_head does not allocate
a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/ip6_output.c | 51 ++++++++++++++++-----------------------------------
 1 file changed, 16 insertions(+), 35 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 8e6ca9a..7d2ec25 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -60,46 +60,29 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 {
 	struct dst_entry *dst = skb_dst(skb);
 	struct net_device *dev = dst->dev;
+	struct inet6_dev *idev = ip6_dst_idev(dst);
 	unsigned int hh_len = LL_RESERVED_SPACE(dev);
-	int delta = hh_len - skb_headroom(skb);
-	const struct in6_addr *nexthop;
+	const struct in6_addr *daddr, *nexthop;
+	struct ipv6hdr *hdr;
 	struct neighbour *neigh;
 	int ret;
 
 	/* Be paranoid, rather than too clever. */
-	if (unlikely(delta > 0) && dev->header_ops) {
-		/* pskb_expand_head() might crash, if skb is shared */
-		if (skb_shared(skb)) {
-			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
-
-			if (likely(nskb)) {
-				if (skb->sk)
-					skb_set_owner_w(nskb, skb->sk);
-				consume_skb(skb);
-			} else {
-				kfree_skb(skb);
-			}
-			skb = nskb;
-		}
-		if (skb &&
-		    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
-			kfree_skb(skb);
-			skb = NULL;
-		}
+	if (unlikely(hh_len > skb_headroom(skb)) && dev->header_ops) {
+		skb = skb_expand_head(skb, hh_len);
 		if (!skb) {
-			IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTDISCARDS);
+			IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
 			return -ENOMEM;
 		}
 	}
 
-	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr)) {
-		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
-
+	hdr = ipv6_hdr(skb);
+	daddr = &hdr->daddr;
+	if (ipv6_addr_is_multicast(daddr)) {
 		if (!(dev->flags & IFF_LOOPBACK) && sk_mc_loop(sk) &&
 		    ((mroute6_is_socket(net, skb) &&
 		     !(IP6CB(skb)->flags & IP6SKB_FORWARDED)) ||
-		     ipv6_chk_mcast_addr(dev, &ipv6_hdr(skb)->daddr,
-					 &ipv6_hdr(skb)->saddr))) {
+		     ipv6_chk_mcast_addr(dev, daddr, &hdr->saddr))) {
 			struct sk_buff *newskb = skb_clone(skb, GFP_ATOMIC);
 
 			/* Do not check for IFF_ALLMULTI; multicast routing
@@ -110,7 +93,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 					net, sk, newskb, NULL, newskb->dev,
 					dev_loopback_xmit);
 
-			if (ipv6_hdr(skb)->hop_limit == 0) {
+			if (hdr->hop_limit == 0) {
 				IP6_INC_STATS(net, idev,
 					      IPSTATS_MIB_OUTDISCARDS);
 				kfree_skb(skb);
@@ -119,9 +102,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 		}
 
 		IP6_UPD_PO_STATS(net, idev, IPSTATS_MIB_OUTMCAST, skb->len);
-
-		if (IPV6_ADDR_MC_SCOPE(&ipv6_hdr(skb)->daddr) <=
-		    IPV6_ADDR_SCOPE_NODELOCAL &&
+		if (IPV6_ADDR_MC_SCOPE(daddr) <= IPV6_ADDR_SCOPE_NODELOCAL &&
 		    !(dev->flags & IFF_LOOPBACK)) {
 			kfree_skb(skb);
 			return 0;
@@ -136,10 +117,10 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 	}
 
 	rcu_read_lock_bh();
-	nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
-	neigh = __ipv6_neigh_lookup_noref(dst->dev, nexthop);
+	nexthop = rt6_nexthop((struct rt6_info *)dst, daddr);
+	neigh = __ipv6_neigh_lookup_noref(dev, nexthop);
 	if (unlikely(!neigh))
-		neigh = __neigh_create(&nd_tbl, nexthop, dst->dev, false);
+		neigh = __neigh_create(&nd_tbl, nexthop, dev, false);
 	if (!IS_ERR(neigh)) {
 		sock_confirm_neigh(skb, neigh);
 		ret = neigh_output(neigh, skb, false);
@@ -148,7 +129,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 	}
 	rcu_read_unlock_bh();
 
-	IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
+	IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTNOROUTES);
 	kfree_skb(skb);
 	return -EINVAL;
 }
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v4 3/7] ipv6: use skb_expand_head in ip6_xmit
       [not found]                           ` <cover.1628235065.git.vvs@virtuozzo.com>
  2021-08-06  7:49                             ` [PATCH NET v4 1/7] skbuff: introduce skb_expand_head() Vasily Averin
  2021-08-06  7:50                             ` [PATCH NET v4 2/7] ipv6: use skb_expand_head in ip6_finish_output2 Vasily Averin
@ 2021-08-06  7:50                             ` Vasily Averin
       [not found]                               ` <CALMXkpaay1y=0tkbnskr4gf-HTMjJJsVryh4Prnej_ws-hJvBg@mail.gmail.com>
  2021-08-06  7:50                             ` [PATCH NET v4 4/7] ipv4: use skb_expand_head in ip_finish_output2 Vasily Averin
                                               ` (3 subsequent siblings)
  6 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-08-06  7:50 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel, kernel, Julian Wiedmann

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/ip6_output.c | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7d2ec25..f91d13a 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -249,6 +249,8 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	const struct ipv6_pinfo *np = inet6_sk(sk);
 	struct in6_addr *first_hop = &fl6->daddr;
 	struct dst_entry *dst = skb_dst(skb);
+	struct net_device *dev = dst->dev;
+	struct inet6_dev *idev = ip6_dst_idev(dst);
 	unsigned int head_room;
 	struct ipv6hdr *hdr;
 	u8  proto = fl6->flowi6_proto;
@@ -256,22 +258,16 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	int hlimit = -1;
 	u32 mtu;
 
-	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dst->dev);
+	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dev);
 	if (opt)
 		head_room += opt->opt_nflen + opt->opt_flen;
 
-	if (unlikely(skb_headroom(skb) < head_room)) {
-		struct sk_buff *skb2 = skb_realloc_headroom(skb, head_room);
-		if (!skb2) {
-			IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-				      IPSTATS_MIB_OUTDISCARDS);
-			kfree_skb(skb);
+	if (unlikely(head_room > skb_headroom(skb))) {
+		skb = skb_expand_head(skb, head_room);
+		if (!skb) {
+			IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
 			return -ENOBUFS;
 		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	if (opt) {
@@ -313,8 +309,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 
 	mtu = dst_mtu(dst);
 	if ((skb->len <= mtu) || skb->ignore_df || skb_is_gso(skb)) {
-		IP6_UPD_PO_STATS(net, ip6_dst_idev(skb_dst(skb)),
-			      IPSTATS_MIB_OUT, skb->len);
+		IP6_UPD_PO_STATS(net, idev, IPSTATS_MIB_OUT, skb->len);
 
 		/* if egress device is enslaved to an L3 master device pass the
 		 * skb to its handler for processing
@@ -327,17 +322,17 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 		 * we promote our socket to non const
 		 */
 		return NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_OUT,
-			       net, (struct sock *)sk, skb, NULL, dst->dev,
+			       net, (struct sock *)sk, skb, NULL, dev,
 			       dst_output);
 	}
 
-	skb->dev = dst->dev;
+	skb->dev = dev;
 	/* ipv6_local_error() does not require socket lock,
 	 * we promote our socket to non const
 	 */
 	ipv6_local_error((struct sock *)sk, EMSGSIZE, fl6, mtu);
 
-	IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_FRAGFAILS);
+	IP6_INC_STATS(net, idev, IPSTATS_MIB_FRAGFAILS);
 	kfree_skb(skb);
 	return -EMSGSIZE;
 }
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v4 4/7] ipv4: use skb_expand_head in ip_finish_output2
       [not found]                           ` <cover.1628235065.git.vvs@virtuozzo.com>
                                               ` (2 preceding siblings ...)
  2021-08-06  7:50                             ` [PATCH NET v4 3/7] ipv6: use skb_expand_head in ip6_xmit Vasily Averin
@ 2021-08-06  7:50                             ` Vasily Averin
  2021-08-06  7:50                             ` [PATCH NET v4 5/7] vrf: use skb_expand_head in vrf_finish_output Vasily Averin
                                               ` (2 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-06  7:50 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel, kernel, Julian Wiedmann

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv4/ip_output.c | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 8d8a8da..c6b755e 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -198,19 +198,10 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
 	} else if (rt->rt_type == RTN_BROADCAST)
 		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTBCAST, skb->len);
 
-	/* Be paranoid, rather than too clever. */
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
-
-		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
-		if (!skb2) {
-			kfree_skb(skb);
+		skb = skb_expand_head(skb, hh_len);
+		if (!skb)
 			return -ENOMEM;
-		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v4 5/7] vrf: use skb_expand_head in vrf_finish_output
       [not found]                           ` <cover.1628235065.git.vvs@virtuozzo.com>
                                               ` (3 preceding siblings ...)
  2021-08-06  7:50                             ` [PATCH NET v4 4/7] ipv4: use skb_expand_head in ip_finish_output2 Vasily Averin
@ 2021-08-06  7:50                             ` Vasily Averin
  2021-08-06  7:50                             ` [PATCH NET v4 6/7] ax25: use skb_expand_head Vasily Averin
  2021-08-06  7:50                             ` [PATCH NET v4 7/7] bpf: use skb_expand_head in bpf_out_neigh_v4/6 Vasily Averin
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-06  7:50 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel, kernel, Julian Wiedmann

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
v4: fix the null pointer dereference reported by Julian Wiedmann by
replacing skb->dev with dev = skb_dst(skb)->dev.
vrf_finish_output() is only called from vrf_output(), which sets
skb->dev to skb_dst(skb)->dev and calls the POSTROUTING netfilter
hooks, where the output device should not be changed.
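
The pattern behind this fix can be sketched in user space as follows.
This is an illustration only: struct toy_dev, struct toy_pkt, toy_expand()
and toy_finish_output() are hypothetical names, and the "fail" flag merely
simulates an allocation failure. The device pointer is read while the
packet is still valid and used on the error path instead of pkt->dev.

```c
#include <errno.h>
#include <stdlib.h>

struct toy_dev {
	unsigned long tx_errors;
};

struct toy_pkt {
	struct toy_dev *dev;
};

/* Hypothetical helper with the same failure contract as
 * skb_expand_head(): on failure it frees the packet and returns NULL.
 */
static struct toy_pkt *toy_expand(struct toy_pkt *pkt, int fail)
{
	if (fail) {
		free(pkt);
		return NULL;
	}
	return pkt;
}

static int toy_finish_output(struct toy_pkt *pkt, int fail)
{
	struct toy_dev *dev = pkt->dev;	/* cached while pkt is valid */

	pkt = toy_expand(pkt, fail);
	if (!pkt) {
		/* pkt->dev here would dereference a NULL pointer */
		dev->tx_errors++;
		return -ENOMEM;
	}

	free(pkt);	/* stand-in for passing the packet down the stack */
	return 0;
}
```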
---
 drivers/net/vrf.c | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 2b1b944..168d4ef 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -857,30 +857,24 @@ static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *s
 	unsigned int hh_len = LL_RESERVED_SPACE(dev);
 	struct neighbour *neigh;
 	bool is_v6gw = false;
-	int ret = -EINVAL;
 
 	nf_reset_ct(skb);
 
 	/* Be paranoid, rather than too clever. */
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
-
-		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
-		if (!skb2) {
-			ret = -ENOMEM;
-			goto err;
+		skb = skb_expand_head(skb, hh_len);
+		if (!skb) {
+			dev->stats.tx_errors++;
+			return -ENOMEM;
 		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	rcu_read_lock_bh();
 
 	neigh = ip_neigh_for_gw(rt, skb, &is_v6gw);
 	if (!IS_ERR(neigh)) {
+		int ret;
+
 		sock_confirm_neigh(skb, neigh);
 		/* if crossing protocols, can not use the cached header */
 		ret = neigh_output(neigh, skb, is_v6gw);
@@ -889,9 +883,8 @@ static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *s
 	}
 
 	rcu_read_unlock_bh();
-err:
-	vrf_tx_error(skb->dev, skb);
-	return ret;
+	vrf_tx_error(dev, skb);
+	return -EINVAL;
 }
 
 static int vrf_output(struct net *net, struct sock *sk, struct sk_buff *skb)
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v4 6/7] ax25: use skb_expand_head
       [not found]                           ` <cover.1628235065.git.vvs@virtuozzo.com>
                                               ` (4 preceding siblings ...)
  2021-08-06  7:50                             ` [PATCH NET v4 5/7] vrf: use skb_expand_head in vrf_finish_output Vasily Averin
@ 2021-08-06  7:50                             ` Vasily Averin
  2021-08-06  7:50                             ` [PATCH NET v4 7/7] bpf: use skb_expand_head in bpf_out_neigh_v4/6 Vasily Averin
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-06  7:50 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Joerg Reuter, Ralf Baechle, linux-hams, linux-kernel,
	kernel, Julian Wiedmann

Use skb_expand_head() in ax25_transmit_buffer and ax25_rt_build_path.
Unlike skb_realloc_headroom, new helper does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ax25/ax25_ip.c    |  4 +---
 net/ax25/ax25_out.c   | 13 +++----------
 net/ax25/ax25_route.c | 13 +++----------
 3 files changed, 7 insertions(+), 23 deletions(-)

diff --git a/net/ax25/ax25_ip.c b/net/ax25/ax25_ip.c
index e4f63dd..3624977 100644
--- a/net/ax25/ax25_ip.c
+++ b/net/ax25/ax25_ip.c
@@ -193,10 +193,8 @@ netdev_tx_t ax25_ip_xmit(struct sk_buff *skb)
 	skb_pull(skb, AX25_KISS_HEADER_LEN);
 
 	if (digipeat != NULL) {
-		if ((ourskb = ax25_rt_build_path(skb, src, dst, route->digipeat)) == NULL) {
-			kfree_skb(skb);
+		if ((ourskb = ax25_rt_build_path(skb, src, dst, route->digipeat)) == NULL)
 			goto put;
-		}
 
 		skb = ourskb;
 	}
diff --git a/net/ax25/ax25_out.c b/net/ax25/ax25_out.c
index f53751b..22f2f66 100644
--- a/net/ax25/ax25_out.c
+++ b/net/ax25/ax25_out.c
@@ -325,7 +325,6 @@ void ax25_kick(ax25_cb *ax25)
 
 void ax25_transmit_buffer(ax25_cb *ax25, struct sk_buff *skb, int type)
 {
-	struct sk_buff *skbn;
 	unsigned char *ptr;
 	int headroom;
 
@@ -336,18 +335,12 @@ void ax25_transmit_buffer(ax25_cb *ax25, struct sk_buff *skb, int type)
 
 	headroom = ax25_addr_size(ax25->digipeat);
 
-	if (skb_headroom(skb) < headroom) {
-		if ((skbn = skb_realloc_headroom(skb, headroom)) == NULL) {
+	if (unlikely(skb_headroom(skb) < headroom)) {
+		skb = skb_expand_head(skb, headroom);
+		if (!skb) {
 			printk(KERN_CRIT "AX.25: ax25_transmit_buffer - out of memory\n");
-			kfree_skb(skb);
 			return;
 		}
-
-		if (skb->sk != NULL)
-			skb_set_owner_w(skbn, skb->sk);
-
-		consume_skb(skb);
-		skb = skbn;
 	}
 
 	ptr = skb_push(skb, headroom);
diff --git a/net/ax25/ax25_route.c b/net/ax25/ax25_route.c
index b40e0bc..d0b2e09 100644
--- a/net/ax25/ax25_route.c
+++ b/net/ax25/ax25_route.c
@@ -441,24 +441,17 @@ int ax25_rt_autobind(ax25_cb *ax25, ax25_address *addr)
 struct sk_buff *ax25_rt_build_path(struct sk_buff *skb, ax25_address *src,
 	ax25_address *dest, ax25_digi *digi)
 {
-	struct sk_buff *skbn;
 	unsigned char *bp;
 	int len;
 
 	len = digi->ndigi * AX25_ADDR_LEN;
 
-	if (skb_headroom(skb) < len) {
-		if ((skbn = skb_realloc_headroom(skb, len)) == NULL) {
+	if (unlikely(skb_headroom(skb) < len)) {
+		skb = skb_expand_head(skb, len);
+		if (!skb) {
 			printk(KERN_CRIT "AX.25: ax25_dg_build_path - out of memory\n");
 			return NULL;
 		}
-
-		if (skb->sk != NULL)
-			skb_set_owner_w(skbn, skb->sk);
-
-		consume_skb(skb);
-
-		skb = skbn;
 	}
 
 	bp = skb_push(skb, len);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET v4 7/7] bpf: use skb_expand_head in bpf_out_neigh_v4/6
       [not found]                           ` <cover.1628235065.git.vvs@virtuozzo.com>
                                               ` (5 preceding siblings ...)
  2021-08-06  7:50                             ` [PATCH NET v4 6/7] ax25: use skb_expand_head Vasily Averin
@ 2021-08-06  7:50                             ` Vasily Averin
  6 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-06  7:50 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, bpf,
	linux-kernel, kernel, Julian Wiedmann

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/filter.c | 27 +++++----------------------
 1 file changed, 5 insertions(+), 22 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index d70187c..9c2f434 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2179,17 +2179,9 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
 	skb->tstamp = 0;
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
-
-		skb2 = skb_realloc_headroom(skb, hh_len);
-		if (unlikely(!skb2)) {
-			kfree_skb(skb);
+		skb = skb_expand_head(skb, hh_len);
+		if (!skb)
 			return -ENOMEM;
-		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	rcu_read_lock_bh();
@@ -2213,8 +2205,7 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
 	}
 	rcu_read_unlock_bh();
 	if (dst)
-		IP6_INC_STATS(dev_net(dst->dev),
-			      ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
+		IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
 out_drop:
 	kfree_skb(skb);
 	return -ENETDOWN;
@@ -2286,17 +2277,9 @@ static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb,
 	skb->tstamp = 0;
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-		struct sk_buff *skb2;
-
-		skb2 = skb_realloc_headroom(skb, hh_len);
-		if (unlikely(!skb2)) {
-			kfree_skb(skb);
+		skb = skb_expand_head(skb, hh_len);
+		if (!skb)
 			return -ENOMEM;
-		}
-		if (skb->sk)
-			skb_set_owner_w(skb2, skb->sk);
-		consume_skb(skb);
-		skb = skb2;
 	}
 
 	rcu_read_lock_bh();
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET v4 0/7] skbuff: introduce skb_expand_head()
  2021-08-06  7:49                           ` [PATCH NET v4 0/7] skbuff: introduce skb_expand_head() Vasily Averin
@ 2021-08-06 10:14                             ` David Miller
  2021-08-06 12:53                               ` [PATCH NET] vrf: fix null pointer dereference in vrf_finish_output() Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: David Miller @ 2021-08-06 10:14 UTC (permalink / raw)
  To: vvs
  Cc: yoshfuji, dsahern, kuba, eric.dumazet, netdev, jreuter, ralf,
	linux-hams, ast, daniel, andrii, kafai, songliubraving, yhs,
	kpsingh, bpf, linux-kernel, kernel, jwi



I already applied v3 to net-next; please send a relative fixup if you
want to incorporate the v4 changes too.

Thank you.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET] vrf: fix null pointer dereference in vrf_finish_output()
  2021-08-06 10:14                             ` David Miller
@ 2021-08-06 12:53                               ` Vasily Averin
  2021-08-06 22:42                                 ` Jakub Kicinski
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-08-06 12:53 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet
  Cc: netdev, linux-kernel, kernel, Julian Wiedmann

After 14ee70ca89e6 ("vrf: use skb_expand_head in vrf_finish_output")
skb->dev is accessed after the skb has been freed.
Let's replace skb->dev with dev = skb_dst(skb)->dev:
vrf_finish_output() is only called from vrf_output(),
which sets skb->dev to skb_dst(skb)->dev and calls the POSTROUTING
netfilter hooks, where the output device should not be changed.

Fixes: 14ee70ca89e6 ("vrf: use skb_expand_head in vrf_finish_output")
Reported-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 drivers/net/vrf.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 726adf0..168d4ef 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -864,7 +864,7 @@ static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *s
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
 		skb = skb_expand_head(skb, hh_len);
 		if (!skb) {
-			skb->dev->stats.tx_errors++;
+			dev->stats.tx_errors++;
 			return -ENOMEM;
 		}
 	}
@@ -883,7 +883,7 @@ static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *s
 	}
 
 	rcu_read_unlock_bh();
-	vrf_tx_error(skb->dev, skb);
+	vrf_tx_error(dev, skb);
 	return -EINVAL;
 }
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET] vrf: fix null pointer dereference in vrf_finish_output()
  2021-08-06 12:53                               ` [PATCH NET] vrf: fix null pointer dereference in vrf_finish_output() Vasily Averin
@ 2021-08-06 22:42                                 ` Jakub Kicinski
  2021-08-07  6:41                                   ` Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Jakub Kicinski @ 2021-08-06 22:42 UTC (permalink / raw)
  To: Vasily Averin
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Eric Dumazet,
	netdev, linux-kernel, kernel, Julian Wiedmann

On Fri, 6 Aug 2021 15:53:00 +0300 Vasily Averin wrote:
> After 14ee70ca89e6 ("vrf: use skb_expand_head in vrf_finish_output")
> skb->dev is accessed after the skb has been freed.
> Let's replace skb->dev with dev = skb_dst(skb)->dev:
> vrf_finish_output() is only called from vrf_output(),
> which sets skb->dev to skb_dst(skb)->dev and calls the POSTROUTING
> netfilter hooks, where the output device should not be changed.
> 
> Fixes: 14ee70ca89e6 ("vrf: use skb_expand_head in vrf_finish_output")
> Reported-by: Julian Wiedmann <jwi@linux.ibm.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Thanks for following up! I decided to pick a similar patch from Dan
Carpenter [1] because the chunk quoted below is not really necessary.

[1] https://lore.kernel.org/kernel-janitors/20210806150435.GB15586@kili/

> @@ -883,7 +883,7 @@ static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *s
>  	}
>  
>  	rcu_read_unlock_bh();
> -	vrf_tx_error(skb->dev, skb);
> +	vrf_tx_error(dev, skb);
>  	return -EINVAL;
>  }
>  


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET] vrf: fix null pointer dereference in vrf_finish_output()
  2021-08-06 22:42                                 ` Jakub Kicinski
@ 2021-08-07  6:41                                   ` Vasily Averin
  0 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-07  6:41 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Eric Dumazet,
	netdev, linux-kernel, kernel, Julian Wiedmann

On 8/7/21 1:42 AM, Jakub Kicinski wrote:
> On Fri, 6 Aug 2021 15:53:00 +0300 Vasily Averin wrote:
>> After 14ee70ca89e6 ("vrf: use skb_expand_head in vrf_finish_output")
>> skb->dev is accessed after the skb has been freed.
>> Let's replace skb->dev with dev = skb_dst(skb)->dev:
>> vrf_finish_output() is only called from vrf_output(),
>> which sets skb->dev to skb_dst(skb)->dev and calls the POSTROUTING
>> netfilter hooks, where the output device should not be changed.
>>
>> Fixes: 14ee70ca89e6 ("vrf: use skb_expand_head in vrf_finish_output")
>> Reported-by: Julian Wiedmann <jwi@linux.ibm.com>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> 
> Thanks for following up! I decided to pick a similar patch from Dan
> Carpenter [1] because the chunk quoted below is not really necessary.

I still think that my version of the patch is preferable.
It's better to use vrf_tx_error(dev, skb) because:
a) both rollbacks can use the same net device
b) using 'dev' probably avoids an extra pointer dereference.

Originally, i.e. before commit 14ee70ca89e6 (the patch being fixed), the rollback
after a failed header expansion made the same vrf_tx_error() call. This function
does 2 things:
- increments stats.tx_errors on the specified network device
- frees the provided skb.

Commit 14ee70ca89e6 replaced skb_realloc_headroom() with skb_expand_head(), which frees
the skb internally, so the vrf_tx_error() call on rollback was replaced with a direct
increment of stats.tx_errors.
We can no longer use the original skb->dev there, so our fixup patches replace it with
the dev variable already used in this function.
However, we should use the same net device in both rollbacks. It seems illogical to me
to change one place and not the other.

If we follow your decision, it isn't a problem: skb->dev and dev should be identical.
However, 'skb->dev' does an extra dereference, while dev is already used in the function
and was probably kept in a register.
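
The failure mode behind the fix can be sketched in userspace. This is only
a model under stated assumptions: the struct layouts and the names
model_expand_head_fail() and model_finish_output() are hypothetical
stand-ins, not kernel code. It shows why the error counter has to be
reached through a device pointer cached before the call, since the skb
pointer is NULL once the expansion has failed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins, not the kernel structures. */
struct net_device { unsigned long tx_errors; };
struct sk_buff { struct net_device *dev; };

/* Models a failing skb_expand_head(): the skb is freed, NULL is returned. */
static struct sk_buff *model_expand_head_fail(struct sk_buff *skb)
{
	free(skb);
	return NULL;
}

/* Models the fixed rollback in vrf_finish_output(). */
static int model_finish_output(struct sk_buff *skb)
{
	struct net_device *dev = skb->dev;	/* cached while skb is still valid */

	assert(dev != NULL);
	skb = model_expand_head_fail(skb);
	if (!skb) {
		/* skb->dev->tx_errors++ would dereference NULL here */
		dev->tx_errors++;
		return -ENOMEM;
	}
	return 0;
}
```

In the real function 'dev' comes from skb_dst(skb)->dev rather than from
skb->dev; the sketch caches skb->dev only to stay self-contained.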

Thank you,
	Vasily Averin

> [1] https://lore.kernel.org/kernel-janitors/20210806150435.GB15586@kili/
> 
>> @@ -883,7 +883,7 @@ static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *s
>>  	}
>>  
>>  	rcu_read_unlock_bh();
>> -	vrf_tx_error(skb->dev, skb);
>> +	vrf_tx_error(dev, skb);
>>  	return -EINVAL;
>>  }
>>  
> 


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET v4 3/7] ipv6: use skb_expand_head in ip6_xmit
       [not found]                               ` <CALMXkpaay1y=0tkbnskr4gf-HTMjJJsVryh4Prnej_ws-hJvBg@mail.gmail.com>
@ 2021-08-20 22:44                                 ` Christoph Paasch
  2021-08-21  6:21                                   ` Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Paasch @ 2021-08-20 22:44 UTC (permalink / raw)
  To: Vasily Averin
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet, netdev, LKML, kernel, Julian Wiedmann

(resend without html - thanks gmail web-interface...)

On Fri, Aug 20, 2021 at 3:41 PM Christoph Paasch
<christoph.paasch@gmail.com> wrote:
>
> Hello,
>
> On Fri, Aug 6, 2021 at 1:18 AM Vasily Averin <vvs@virtuozzo.com> wrote:
> >
> > Unlike skb_realloc_headroom, new helper skb_expand_head
> > does not allocate a new skb if possible.
> >
> > Additionally this patch replaces commonly used dereferencing with variables.
> >
> > Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> > ---
> >  net/ipv6/ip6_output.c | 27 +++++++++++----------------
> >  1 file changed, 11 insertions(+), 16 deletions(-)
> >
> > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> > index 7d2ec25..f91d13a 100644
> > --- a/net/ipv6/ip6_output.c
> > +++ b/net/ipv6/ip6_output.c
> > @@ -249,6 +249,8 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
> >         const struct ipv6_pinfo *np = inet6_sk(sk);
> >         struct in6_addr *first_hop = &fl6->daddr;
> >         struct dst_entry *dst = skb_dst(skb);
> > +       struct net_device *dev = dst->dev;
> > +       struct inet6_dev *idev = ip6_dst_idev(dst);
> >         unsigned int head_room;
> >         struct ipv6hdr *hdr;
> >         u8  proto = fl6->flowi6_proto;
> > @@ -256,22 +258,16 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
> >         int hlimit = -1;
> >         u32 mtu;
> >
> > -       head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dst->dev);
> > +       head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dev);
> >         if (opt)
> >                 head_room += opt->opt_nflen + opt->opt_flen;
> >
> > -       if (unlikely(skb_headroom(skb) < head_room)) {
> > -               struct sk_buff *skb2 = skb_realloc_headroom(skb, head_room);
> > -               if (!skb2) {
> > -                       IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
> > -                                     IPSTATS_MIB_OUTDISCARDS);
> > -                       kfree_skb(skb);
> > +       if (unlikely(head_room > skb_headroom(skb))) {
> > +               skb = skb_expand_head(skb, head_room);
> > +               if (!skb) {
> > +                       IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
>
>
> this change introduces a panic on my syzkaller instance:
>
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 1832 at net/core/skbuff.c:5412 skb_try_coalesce+0x1019/0x12c0 net/core/skbuff.c:5412
> Modules linked in:
> CPU: 0 PID: 1832 Comm: syz-executor.0 Not tainted 5.14.0-rc4ab492b0cda378661ae004e2fd66cfd1be474438d #102
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> RIP: 0010:skb_try_coalesce+0x1019/0x12c0 net/core/skbuff.c:5412
> Code: 24 20 bf 01 00 00 00 8b 40 20 44 0f b7 f0 44 89 f6 e8 ab 41 c0 fe 41 83 ee 01 0f 85 01 f3 ff ff e9 42 f6 ff ff e8 07 3c c0 fe <0f> 0b e9 7b f9 ff ff e8 fb 3b c0 fe 48 8b 44 24 40 48 8d 70 ff 4c
> RSP: 0018:ffffc90002d97530 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 0000000000000e00 RCX: 0000000000000000
> RDX: ffff88810a27bc00 RSI: ffffffff8276b6c9 RDI: 0000000000000003
> RBP: ffff88810a17f9e0 R08: 0000000000000e00 R09: 0000000000000000
> R10: ffffffff8276b042 R11: 0000000000000000 R12: ffff88810a17f760
> R13: ffff888108fc6ac0 R14: 0000000000001000 R15: ffff88810a17f7d6
> FS:  00007f6be8546700(0000) GS:ffff88811b400000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 000000010a4f0005 CR4: 0000000000170ef0
> Call Trace:
>  tcp_try_coalesce net/ipv4/tcp_input.c:4642 [inline]
>  tcp_try_coalesce+0x312/0x870 net/ipv4/tcp_input.c:4621
>  tcp_queue_rcv+0x73/0x670 net/ipv4/tcp_input.c:4905
>  tcp_data_queue+0x11e5/0x4af0 net/ipv4/tcp_input.c:5016
>  tcp_rcv_established+0x83a/0x1d30 net/ipv4/tcp_input.c:5928
>  tcp_v6_do_rcv+0x438/0x1380 net/ipv6/tcp_ipv6.c:1517
>  sk_backlog_rcv include/net/sock.h:1024 [inline]
>  __release_sock+0x1ad/0x310 net/core/sock.c:2669
>  release_sock+0x54/0x1a0 net/core/sock.c:3193
>  tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1462
>  inet6_sendmsg+0xb5/0x140 net/ipv6/af_inet6.c:646
>  sock_sendmsg_nosec net/socket.c:704 [inline]
>  sock_sendmsg net/socket.c:724 [inline]
>  ____sys_sendmsg+0x3b5/0x970 net/socket.c:2403
>  ___sys_sendmsg+0xff/0x170 net/socket.c:2457
>  __sys_sendmmsg+0x192/0x440 net/socket.c:2543
>  __do_sys_sendmmsg net/socket.c:2572 [inline]
>  __se_sys_sendmmsg net/socket.c:2569 [inline]
>  __x64_sys_sendmmsg+0x98/0x100 net/socket.c:2569
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x7f6be7e55469
> Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48
> RSP: 002b:00007f6be8545da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
> RAX: ffffffffffffffda RBX: 0000000000000133 RCX: 00007f6be7e55469
> RDX: 0000000000000003 RSI: 00000000200008c0 RDI: 0000000000000003
> RBP: 0000000000000133 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000040044040 R11: 0000000000000246 R12: 000000000069bf8c
> R13: 00007ffe013f968f R14: 00007f6be8526000 R15: 0000000000000003
> ---[ end trace 60453d9d261151ca ]---
>
> (syzkaller-reproducer at the end of this email)
>
> AFAICS, this is because pskb_expand_head (called from skb_expand_head) is not adjusting skb->truesize when skb->sk is set (which I guess is the case in this particular scenario). I'm not sure what the proper fix would be though...
>
>
> Reproducer:
>
> # {Threaded:true Collide:true Repeat:true RepeatTimes:0 Procs:1 Slowdown:1 Sandbox:none Fault:false FaultCall:-1 FaultNth:0 Leak:false NetInjection:true NetDevices:true NetReset:true Cgroups:true BinfmtMisc:true CloseFDs:true KCSAN:false DevlinkPCI:false USB:false VhciInjection:false Wifi:false IEEE802154:false Sysctl:true UseTmpDir:true HandleSegv:true Repro:false Trace:false}
> r0 = socket$inet6_tcp(0xa, 0x1, 0x0)
> bind$inet6(r0, &(0x7f0000000080)={0xa, 0x4e22, 0x0, @loopback}, 0x1c)
> sendmmsg$inet6(r0, &(0x7f0000002940)=[{{&(0x7f00000000c0)={0xa, 0x4e22, 0x0, @empty}, 0x1c, 0x0}}], 0x36, 0x20000145)
> r1 = socket$inet6_icmp_raw(0xa, 0x3, 0x3a)
> r2 = socket$inet_tcp(0x2, 0x1, 0x0)
> ioctl$sock_SIOCGIFINDEX(r2, 0x8933, 0x0)
> setsockopt$inet6_mreq(r1, 0x29, 0x1b, 0x0, 0x0)
> sendmmsg$inet6(r0, &(0x7f00000008c0)=[{{0x0, 0x0, &(0x7f0000000240)=[{&(0x7f0000000040)="11c2e854", 0x4}, {0x0}], 0x2}}, {{0x0, 0x0, &(0x7f0000000580)=[{0x0}, {&(0x7f0000001800)="bcfad31884cb1c6226004dd6ed929a7fa308da79249cdfcf5447732df714b21f9fa725d49453be964002a469aff676404855809cf7d7f8fad8ff26c30a9aa692c5ba1c3bf622d795da6efd5425e52f43e8a01e98fec8c4079d0f711b7b02666cc4046eed001f62377a14142ee1004708bc6b57ed028e4a0f8af459fa89ec7cd3098c32cbe69625cd654a3aca8814c9426a817da1b631828f40c0ba2227e8030aabba4eeaeea5617df6fbc54918905137032bb60cb86a7ebd4b8b1f8428085bfc5749a3a986dfca6f0e40b2cc32121cdd53ab4815cdf32dca5d1ee64d0e11894d07ad78f5ee5e7701f4803c49d6fdf927a0c103cf5ba9293e232ebd5661fb950f1cacef864673b79c7a180010889f0a8c9d69a97e3f6a88a9180fcc61631ffa42332f68a8ae78982b5e232decca2a4259a05e96c7fae309468a364798b7343354bdddfe1d81c3393556ced79b92f7c8f9455c1e4deb7cac7d81fdc3d72f201b711a253c2bf4df9f9bcb0ebe51edd3bb62a9440c0c88dc0b7ed6aaa2fdadd868845865cf6c40ba21222123894323fdd0ae8f8fc896a1c4b77431655d77750578db7e6d01380a7b5f5326cd8dd0091ac0a5526b9d1e9318dbfd2ac7227b1779674167c23fa8c59fcc44ab0c117788aa6a1c2baa0fa314093cfa79474e26b09d305c4b20d8f20150feb861ab750f1dbd7c4081ede2b245c14f7152f46352db5b0e4a3e676d23a76ea6363f97a35b3df2ced972f9daec0a5f6f13d5093250a93b31f4fafb1fa783cd99504ba8ac5fbfb4af9ff108a891e3786964ae5f00bc52496db9cf162b0efaded2459a3c1c1fbc60182b62a430a8eef93c6673db2596256f2b85f41975a31908dfd27eced4b169c99fbe79f59e61c8d3aef4e23f3bb665a0f25a09ce19a15f47734b817b1451f32418b1d9d801c64339b4fa2147e794dec9909569abcabeb9785991e32eb9835da449617c9c050000cf83fc6297e908d3708784a9c0599d3df55055123ef6037c5f8223acbeb0413610af573904069b64057d73a95e76f53939d6a144b411d73b7ae09bddc8c28e2484e03f781cce9df821c727facfd7d158a63f466d183e7be32c921176f574258d252bf2c15f1bcb6498ef252315a360ccd4ce7e5388d66462c51885a04211ed9ab097f66e31f6bb704902b74d6afd96f05b7a5de5eaa9d6ba4c700e7ad2e686b7241cf57358937e1b526f34d8522b761ca596b77e74ebae352ffbc9ee9d98b40ea1b1f7e24b3037d0630cd3ede29d4cd960fe0c17102c06e0a
7f993b99f961384698dfd82c813cb862779d68fd8f6828a7c6e0bc8f9e3798ca7c3d5f287f8b63b65f6db8fa706e8a378166486b8c8fadfa06a4f0a21e8d6dc8ed230c8b5cde27a48742700b43ba4dce5e3da897b2bf1fc5832d07ac3482a9834b30c0371f517b9f0ae0706a12b87dcb3555e08df00cc0a83d7060c06edd7b8b8c6199398ee6bea33a4690e8f9455f54a6ef9d4b69d39c55a69d6fed3d7a376ceb51bd032b9b76a08ce48c5a2af8702922163a57835133e9968f8a1e4401479ec8a83a42f1c425840cdcd52483d8432c897ae75e8f68df55e5a5c8c16b4d071e92be22eb23f35d58cb86cc1b0c494e96723f5c50e441b81a7f1fca08faaefec8003fddb967381c542c5ac6c18cc7e4f4e61c8aae4ae90f157ae283a9fe8d02b9f52dc7bbf80e47cc6e3e2e6ee9008e8b82cb7ee672b11be1a3874aa63586fb242378f59251990cd8d944d8e46d84e76cfe9dbffe2562d29ed94f5a314062fc8f496bdb38a426f2ab0525418625727f329ca1cb8cd30ae9013829136c025892c2614d1fa0edf019c8e7f3a9ef50ec0483f3d70daec362b603a89b6a7049310b4776e1bb884f8d44a74b2a3da056e3ba8f3425f7a4628f3f140aa12fcd779ea01e22eff2c670759971ed8e11570813835401484537c0075e897863ae690c420b1c4d90ea4e90d84b220511bbba8e21869610e4ff8d2130b09a40a0cf49715275a01ecc438d6fcdfcb189d602d964a1a39ee375067d82c63dc409fab9715ee86cbbf5cdc9b906b66b40e4d2180cd1d03229bd42b2ea6359d8a127d6156d19cc638e4811b55871c39b370121bc0da9b29d3be162572f672f3f1cb9aac5dba4cb9f57062a27b95f5db96b7ea16beb7df8d5642ca2695aa0dccec99bc1b4efbdbe4ceb7febca4b1846b1486ee11383ca9c802428a9a8ac308bfa106dd8a5c87cceb4e1a700626a7ceb074d5628a6be3a96ff59851f797e375f42c7c452d66232f076f5c0045d06f5ee1e39f48b1b4285e569e574136de2d3412dc6f24e5e7024739f1fc5aa71c41e8eaa88c0094ca7ae5b96a20beaec44223d2df1cf650ef924b58e71f7ddf82b18a7d13d8eccfaea41e17ba7c65923d6dea3a7834113f446eaac407c371dc80841a85a4aa0ef2f4d49a204f626ed1c4921d61c50f6bd938359e8fec4e7d03dd6edb0b29437e3e121b65852d5cda472a99a29de0a56db2aa6aca87a9d6bcf1bbe6783735b67807c42a399b0757390d872a26e73244df25b5a6870b2da91e45e5168f8993cf7ca209df04c49cfb6eed24cfc343864cfb997490ac5b0b39d2e3bfa453d54d36546cd034d542ba38f66fe87cd4f89d08ca059e721272648c550481d952a0a47fd15d8fbe4a23303978b813011deef566317f71d9b18a7cb2283c44be2ed1a05
ca9c3daf6c28f782f6c3593536d48253243a8995e8b9a30d83b7733bc166d4791bfd1555dcdf44c297ef726aaa47cbebdeae8a7288e20fae87b92fa44c78158501526afc5a7e0041b48c55881f9031d2ae4f482b8f44ce5297113bbd217d081162b26811b4d08f0fb9d76d5f179e5af344f73d62742bf871920121e26796eb67d059b49f5850475e02723c9e84a5578d5610f1cb3b1055b6333d59df391dbf67f6660f2d3aaa9a601d44cec1bb4199b1130d373a5a53ca81e16c374400976b39967bc7310ba801a225e44da319744a1e7e2b2261dedcbb31dda7105de586baf881f88bc7e0069b42fb3ebae8fa63e39609a1f596ba219da2ef61d69976d3175e47769bc1ba2b3cd064db3ff4d7ed0d87ee7054aa7e111aae93592dd6203e11e87bcaf8c43cd803282d0d25ba9b043ea8cbe26f585643d504d6f4f66848c881c066e9f10057e2c773b1e5abf2e56a4f07d35fc29f7fe98dc5fc26507406ca2901ebfae6e68715f6ca4d73555dea5f6ec77661ab9cd7335276eda1f28606e8dbd9b04fe17f53ffc3bedab5d800860f1a2c8bd4909f3b98cc7e7bda7a7e46deeff86c756e3b7d40067ca35f867bb5456ea61599e95916397f72404b2bac726dd5c1a5042eed92bd2988365405d17dde91c214c12263e976356d6131a2a5269cdfb1aefa6dc60b9f522ec2a619b2cc58baf8fadd52f43892d12d27498023e18390db603a3fe5e2f3e14491c1aed2ddaf47e717decfa6877aeed9fea0c575223cb062ceab83a20e99b49a0581e0fb25ba1eae88d5b229eb5320d888652acae4955c3bbe94de43fa99476c9058250847381ee4d85e079c190fadfa0776318ed00f7019a4010f2eb8777c7499456b8748991d4d4c6590cf903a237a81c0af16c4cec87ac49bc7787c2c05ea56501c9cddc2c4ba884ab26a51639e4b77fa5a2488ab842b8ef6d0fb1d3bea710ad1528a1e78a18e37b4a0ed803ba29d9ddf1ca7e8bd56d1588dbb9e3995493959adaa334acd611ac8d266a3199b2ecb9958366d785946722bc9507ab420d447dfadd0274a8e2a25d290c598d65145956c1c8542d07bc0a15efa0a9daed2f2fe71f60b4b657b34a3e897749a49bd85c2a0d915af6a38e8de62df1fcb4e991d04ce4c43eef97c4b3e9de11d528eb9cb923115efc681d0a2a0a714ad923086961c07335e152ec83b65ddb43b8865984c36de76f19bdb6e96566b741fb1c755d012c5377b196bed875e9a08bb7d4fd3160fddc71b9bb55a7199183529cda7a3624077a77b3a2b0367845d23e42dbc4130dcef4cc855556f833b94794b1a4d28d7503113ed4f8ec7b3578b418640d49b01e6a75b456944377a8d892eab38037705e010e32cf7074784dd42f42a92d18c0c26eb4c23a12c1697215aaa41ca92dd59f9
816168bc7c9275c256cc689c03809f40125a1c17063a7696ab30dc32d8dcddf8e4e2983a883f11d64b485310c71e5533399d928da94b4ce9a8178517477345bc26b6ba838dfc0b0d9f8542f93183f68e4660e17a8bc92dea290bb718f1d0e89e385ad8ac2f3cc9ad8dfa08e62501b3bd6597574f98bfa39b901daa1e8c56dcddf87d49cf4b1345ae352b1aa72ca62fdeb39abcad462c4045fa1daaba17047d1b91bec23145d4090a7eca6f1510d88beab5a2625fdf773f16293b77eec703bdc13504550f8d74546b89ac056419fb12ded672a2cd4efee157689a704eb8511fab3c12d29e025184b6f329fe4b6226f187b0c009e1b5cfc249791a6c96bee7766276f7e90163f0c11f6c1d04a32e3128d1e3eb95e167f0bdf4b927b775d7fe01aa2174d43a8fb7b8ae4b5c1e6c7e7b7a6bc7f19a70f7ef107a6406e69cb60047aee0d3d0ed4b1a6605bc1eba9ea2664edf145d2422dcc47e26c6f000071a15cbd62c16983bf8a5fef08aeea48d6a477654aacdada917efee28a79897a7b9280025a370d2ccbadd639c3314a7e3cc12298da71e2ca40833f1b88d44e6693049baee1b0789a619815aee3707703963e296be3171854411f874a01f1c6c9f9bb6e4932e13b791280aed0cca54c3dd77033a8c3836a0306e13dabfdfa96ddcb9fa1de806419c1eaac65d7b601417b011ce50fd6257ae97f883faacfc2ce7a5fec477793f2bfa24f51d542f7d5d4828e7efc455af4f4d4e60f72c6598b72751dc058169e5d13b14b91b5849463f688c10d9d4a8f566ee18620a4541dce68411d3746b8b4ef4f891c87cbc536e8e99b828c23732fb1484fd03e23bc0affde95d7c0c6fc0407d1990b55296c4e32f8c0387b965ee20482d9e666f3caa2167218d6a48305b37430835e839b62960fab96ec61e5b49c345fe8c07f7b71ba3b82bef5701de404461a1aaa01331913dc03e27ed36c7f14f0c82057f477b73e9ee20621a5c70df091a2da470db9602aa17f281a15f644a2a85534040dba67d2a6473503453e133098205fb53a3c20fdafa508a82e9c8d172b2eed24afb1c47890ace2eb48ee9e2bc2488e47bf18e69676b32087e7abf3d1b918bd9ac24338d5ae54c0f2c95b0f969f44edac2b552b611ae2b415cfed467ec989e7a139a72b0d3c68e20dd307500bbd7f4e079f6c1486bdb31f648c50c1c4b58e6be618ac6257ddb2558af9f24be2ae415e8edc50197dda4a1178ea4de53bbd66e819173574fecaaa5540189538762f472bd562f1d87f64fc39008d4a3f85cd1b3f01508a42a3047976f6f1dcd4c1d942349967dfd633a6e56de55a5e0f2ad68847e7f1e3a08813dc7483db90cf417e7388aab6e4806628e6b9ab980a12b9ca6b345845f043d9b31cfbba9a17517cb0d421e2db467e9
91c2584f625dd1884126261e3f455c4f44c15380d43dc7e3e8fc69086395d9535f094432115b3d1b643178aadcf0919d85c3faddbf631ddd50c322a3e489310de21ec2c3026721f9301a34acfe9d65326f7f7ad54b6ad6eaa978b739407105d2d4eb869fff3e2de7585a7fed747493bd65537120cac03e3b48458ddbdbc5ba3382b6040d4863f4d3fd783276e01a3a9a5f05067d1d101ec424e6fb179cf9bcad8f5536b63dd63248ee411ce4b79b70fa7e8a619714646ff2fd557", 0x1000}, {0x0}, {0x0}], 0x4}}, {{0x0, 0x0, 0x0}}], 0x3, 0x40044040)
> setsockopt$inet6_IPV6_HOPOPTS(r0, 0x29, 0x36, &(0x7f0000000640)={0x0, 0x11, '\x00', [@calipso={0x7, 0x40, {0x0, 0xe, 0xb, 0x101, [0x5, 0x4ebd, 0x5d, 0x3, 0x80, 0x7, 0x8]}}, @calipso={0x7, 0x10, {0x0, 0x2, 0x6, 0x0, [0x0]}}, @calipso={0x7, 0x28, {0x2, 0x8, 0x9, 0x6, [0x0, 0x80000001, 0x5d53, 0x9]}}, @calipso={0x7, 0x8, {0x2, 0x0, 0x3, 0x5}}]}, 0x90)
>
>
>
> Christoph
>

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET v4 3/7] ipv6: use skb_expand_head in ip6_xmit
  2021-08-20 22:44                                 ` Christoph Paasch
@ 2021-08-21  6:21                                   ` Vasily Averin
  2021-08-22 17:04                                     ` Christoph Paasch
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-08-21  6:21 UTC (permalink / raw)
  To: Christoph Paasch
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet, netdev, LKML, kernel, Julian Wiedmann

On 8/21/21 1:44 AM, Christoph Paasch wrote:
> (resend without html - thanks gmail web-interface...)
> On Fri, Aug 20, 2021 at 3:41 PM Christoph Paasch
>> AFAICS, this is because pskb_expand_head (called from
>> skb_expand_head) is not adjusting skb->truesize when skb->sk is set
>> (which I guess is the case in this particular scenario). I'm not
>> sure what the proper fix would be though...

Could you please elaborate?
It seems to me that skb_realloc_headroom(), used before my patch, called pskb_expand_head() too
and did not adjust skb->truesize either. Have I perhaps missed something?

The only difference in my patch is that skb_clone() may not be called,
though I do not understand how this can affect skb->truesize.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET v4 3/7] ipv6: use skb_expand_head in ip6_xmit
  2021-08-21  6:21                                   ` Vasily Averin
@ 2021-08-22 17:04                                     ` Christoph Paasch
  2021-08-22 17:13                                       ` Christoph Paasch
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Paasch @ 2021-08-22 17:04 UTC (permalink / raw)
  To: Vasily Averin
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet, netdev, LKML, kernel, Julian Wiedmann

Hello Vasily,

On Fri, Aug 20, 2021 at 11:21 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 8/21/21 1:44 AM, Christoph Paasch wrote:
> > (resend without html - thanks gmail web-interface...)
> > On Fri, Aug 20, 2021 at 3:41 PM Christoph Paasch
> >> AFAICS, this is because pskb_expand_head (called from
> >> skb_expand_head) is not adjusting skb->truesize when skb->sk is set
> >> (which I guess is the case in this particular scenario). I'm not
> >> sure what the proper fix would be though...
>
> Could you please elaborate?
> It seems to me that skb_realloc_headroom(), used before my patch, called pskb_expand_head() too
> and did not adjust skb->truesize either. Have I perhaps missed something?
>
> The only difference in my patch is that skb_clone() may not be called,
> though I do not understand how this can affect skb->truesize.

I *believe* that the difference is that after skb_clone() skb->sk is
NULL and thus truesize will be adjusted.

I will try to confirm that with some more debugging.


Christoph

>
> Thank you,
>         Vasily Averin

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET v4 3/7] ipv6: use skb_expand_head in ip6_xmit
  2021-08-22 17:04                                     ` Christoph Paasch
@ 2021-08-22 17:13                                       ` Christoph Paasch
  2021-08-23  5:44                                         ` Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Paasch @ 2021-08-22 17:13 UTC (permalink / raw)
  To: Vasily Averin
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet, netdev, LKML, kernel, Julian Wiedmann

On Sun, Aug 22, 2021 at 10:04 AM Christoph Paasch
<christoph.paasch@gmail.com> wrote:
>
> Hello Vasily,
>
> On Fri, Aug 20, 2021 at 11:21 PM Vasily Averin <vvs@virtuozzo.com> wrote:
> >
> > On 8/21/21 1:44 AM, Christoph Paasch wrote:
> > > (resend without html - thanks gmail web-interface...)
> > > On Fri, Aug 20, 2021 at 3:41 PM Christoph Paasch
> > >> AFAICS, this is because pskb_expand_head (called from
> > >> skb_expand_head) is not adjusting skb->truesize when skb->sk is set
> > >> (which I guess is the case in this particular scenario). I'm not
> > >> sure what the proper fix would be though...
> >
> > Could you please elaborate?
> > It seems to me that skb_realloc_headroom(), used before my patch, called pskb_expand_head() too
> > and did not adjust skb->truesize either. Have I perhaps missed something?
> >
> > The only difference in my patch is that skb_clone() may not be called,
> > though I do not understand how this can affect skb->truesize.
>
> I *believe* that the difference is that after skb_clone() skb->sk is
> NULL and thus truesize will be adjusted.
>
> I will try to confirm that with some more debugging.

Yes indeed.

Before your patch:
[   19.154039] ip6_xmit before realloc truesize 4864 sk? 000000002ccd6868
[   19.155230] ip6_xmit after realloc truesize 5376 sk? 0000000000000000

skb->sk is not set and thus truesize will be adjusted.


With your change:
[   15.092933] ip6_xmit before realloc truesize 4864 sk? 00000000072930fd
[   15.094131] ip6_xmit after realloc truesize 4864 sk? 00000000072930fd

skb->sk is set and thus truesize is not adjusted.
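
The two log excerpts can be reproduced with a small userspace model. It is
a sketch, assuming that the accounting step at the end of
pskb_expand_head() grows truesize only when the skb has no owning socket
(the real check also accepts a sock_edemux destructor, which is left out
here). The struct layouts and model_expand_accounting() are hypothetical,
and the 512-byte growth is simply the difference between the logged values
(5376 - 4864).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the kernel structures. */
struct sock { int id; };
struct sk_buff { struct sock *sk; unsigned int truesize; };

/*
 * Models only the truesize accounting of a head expansion:
 * an owned skb keeps its old truesize, an orphaned one grows.
 */
static void model_expand_accounting(struct sk_buff *skb, unsigned int grown_by)
{
	assert(skb != NULL);
	if (!skb->sk)
		skb->truesize += grown_by;
}
```

With an orphaned skb of truesize 4864 and grown_by 512 the model ends at
5376, matching the "before your patch" log; with skb->sk set it stays at
4864, matching the "with your change" log.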


Christoph

>
>
> Christoph
>
> >
> > Thank you,
> >         Vasily Averin

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET v4 3/7] ipv6: use skb_expand_head in ip6_xmit
  2021-08-22 17:13                                       ` Christoph Paasch
@ 2021-08-23  5:44                                         ` Vasily Averin
  2021-08-23  5:59                                           ` Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-08-23  5:44 UTC (permalink / raw)
  To: Christoph Paasch
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet, netdev, LKML, kernel, Julian Wiedmann

On 8/22/21 8:13 PM, Christoph Paasch wrote:
> On Sun, Aug 22, 2021 at 10:04 AM Christoph Paasch
> <christoph.paasch@gmail.com> wrote:
>>
>> Hello Vasily,
>>
>> On Fri, Aug 20, 2021 at 11:21 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>>>
>>> On 8/21/21 1:44 AM, Christoph Paasch wrote:
>>>> (resend without html - thanks gmail web-interface...)
>>>> On Fri, Aug 20, 2021 at 3:41 PM Christoph Paasch
>>>>> AFAICS, this is because pskb_expand_head (called from
>>>>> skb_expand_head) is not adjusting skb->truesize when skb->sk is set
>>>>> (which I guess is the case in this particular scenario). I'm not
>>>>> sure what the proper fix would be though...
>>>
>>> Could you please elaborate?
>>> It seems to me that skb_realloc_headroom(), used before my patch, called pskb_expand_head() too
>>> and did not adjust skb->truesize either. Have I perhaps missed something?
>>>
>>> The only difference in my patch is that skb_clone() may not be called,
>>> though I do not understand how this can affect skb->truesize.
>>
>> I *believe* that the difference is that after skb_clone() skb->sk is
>> NULL and thus truesize will be adjusted.
>>
>> I will try to confirm that with some more debugging.
> 
> Yes indeed.
> 
> Before your patch:
> [   19.154039] ip6_xmit before realloc truesize 4864 sk? 000000002ccd6868
> [   19.155230] ip6_xmit after realloc truesize 5376 sk? 0000000000000000
> 
> skb->sk is not set and thus truesize will be adjusted.

This looks strange to me. The skb should not lose its sk reference.

Could you please clarify where exactly you checked it?
sk on the newly allocated skb is set on line 291:

net/ipv6/ip6_output.c::ip6_xmit()
 282         if (unlikely(skb_headroom(skb) < head_room)) {
 283                 struct sk_buff *skb2 = skb_realloc_headroom(skb, head_room);
 284                 if (!skb2) {
 285                         IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
 286                                       IPSTATS_MIB_OUTDISCARDS);
 287                         kfree_skb(skb);
 288                         return -ENOBUFS;
 289                 }
 290                 if (skb->sk)
 291                         skb_set_owner_w(skb2, skb->sk); <<<<< here
 292                 consume_skb(skb);
 293                 skb = skb2;
 294         }

> With your change:
> [   15.092933] ip6_xmit before realloc truesize 4864 sk? 00000000072930fd
> [   15.094131] ip6_xmit after realloc truesize 4864 sk? 00000000072930fd
> 
> skb->sk is set and thus truesize is not adjusted.

In this case skb_set_owner_w() is called inside skb_expand_head()

net/ipv6/ip6_output.c::ip6_xmit()
 265         if (unlikely(head_room > skb_headroom(skb))) {
 266                 skb = skb_expand_head(skb, head_room);
 267                 if (!skb) {
 268                         IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
 269                         return -ENOBUFS;
 270                 }
 271         }

net/core/skbuff.c::skb_expand_head()
1813         if (skb_shared(skb)) {
1814                 struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
1815 
1816                 if (likely(nskb)) {
1817                         if (skb->sk)
1818                                 skb_set_owner_w(nskb, skb->sk);  <<<< here
1819                         consume_skb(skb);
1820                 } else {
1821                         kfree_skb(skb);
1822                 }
1823                 skb = nskb;
1824         }

So I do not understand how this can happen.
With my patch:
a) if the skb is not shared, it should keep the original skb->sk
b) if the skb is shared, the new skb should get sk set if it was set on the original skb.

Your results can be explained if you looked at skb->sk and truesize right after the
skb_realloc_headroom() call but before the following skb_set_owner_w(). Could you please check it?

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET v4 3/7] ipv6: use skb_expand_head in ip6_xmit
  2021-08-23  5:44                                         ` Vasily Averin
@ 2021-08-23  5:59                                           ` Vasily Averin
  2021-08-23  7:56                                             ` [PATCH NET-NEXT] ipv6: skb_expand_head() adjust skb->truesize incorrectly Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-08-23  5:59 UTC (permalink / raw)
  To: Christoph Paasch
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet, netdev, LKML, kernel, Julian Wiedmann

On 8/23/21 8:44 AM, Vasily Averin wrote:
> On 8/22/21 8:13 PM, Christoph Paasch wrote:
>> On Sun, Aug 22, 2021 at 10:04 AM Christoph Paasch
>> <christoph.paasch@gmail.com> wrote:
>>>
>>> Hello Vasily,
>>>
>>> On Fri, Aug 20, 2021 at 11:21 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>>>>
>>>> On 8/21/21 1:44 AM, Christoph Paasch wrote:
>>>>> (resend without html - thanks gmail web-interface...)
>>>>> On Fri, Aug 20, 2021 at 3:41 PM Christoph Paasch
>>>>>> AFAICS, this is because pskb_expand_head (called from
>>>>>> skb_expand_head) is not adjusting skb->truesize when skb->sk is set
>>>>>> (which I guess is the case in this particular scenario). I'm not
>>>>>> sure what the proper fix would be though...
>>>>
>>>> Could you please elaborate?
>>>> it seems to me skb_realloc_headroom used before my patch called pskb_expand_head() too
>>>> and did not adjusted skb->truesize too. Am I missed something perhaps?
>>>>
>>>> The only difference in my patch is that skb_clone can be not called,
>>>> though I do not understand how this can affect skb->truesize.
>>>
>>> I *believe* that the difference is that after skb_clone() skb->sk is
>>> NULL and thus truesize will be adjusted.
>>>
>>> I will try to confirm that with some more debugging.
>>
>> Yes indeed.
>>
>> Before your patch:
>> [   19.154039] ip6_xmit before realloc truesize 4864 sk? 000000002ccd6868
>> [   19.155230] ip6_xmit after realloc truesize 5376 sk? 0000000000000000
>>
>> skb->sk is not set and thus truesize will be adjusted.
> 
> This looks strange for me. skb should not lost sk reference.
> 
> Could you please clarify where exactly you cheked it?
> sk on newly allocated skb is set on line 291
> 
> net/ipv6/ip6_output.c::ip6_xmit()
>  282         if (unlikely(skb_headroom(skb) < head_room)) {
>  283                 struct sk_buff *skb2 = skb_realloc_headroom(skb, head_room);
>  284                 if (!skb2) {
>  285                         IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
>  286                                       IPSTATS_MIB_OUTDISCARDS);
>  287                         kfree_skb(skb);
>  288                         return -ENOBUFS;
>  289                 }
>  290                 if (skb->sk)
>  291                         skb_set_owner_w(skb2, skb->sk); <<<<< here
>  292                 consume_skb(skb);
>  293                 skb = skb2;
>  294         }
> 
>> With your change:
>> [   15.092933] ip6_xmit before realloc truesize 4864 sk? 00000000072930fd
>> [   15.094131] ip6_xmit after realloc truesize 4864 sk? 00000000072930fd
>>
>> skb->sk is set and thus truesize is not adjusted.
> 
> In this case skb_set_owner_w() is called inside skb_expand_head()
> 
> net/ipv6/ip6_output.c::ip6_xmit()
>  265         if (unlikely(head_room > skb_headroom(skb))) {
>  266                 skb = skb_expand_head(skb, head_room);
>  267                 if (!skb) {
>  268                         IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
>  269                         return -ENOBUFS;
>  270                 }
>  271         }
> 
> net/core/skbuff.c::skb_expand_head()
> 1813         if (skb_shared(skb)) {
> 1814                 struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
> 1815 
> 1816                 if (likely(nskb)) {
> 1817                         if (skb->sk)
> 1818                                 skb_set_owner_w(nskb, skb->sk);  <<<< here
> 1819                         consume_skb(skb);
> 1820                 } else {
> 1821                         kfree_skb(skb);
> 1822                 }
> 1823                 skb = nskb;
> 1824         }
> 
> So I do not understand how this can happen.
> With my patch: 
> a) if skb is not shared -- it should keep original skb->sk
> b) if skb is shared -- new skb should set sk if it was set on original skb.
> 
> Your results can be explained if you looked and skb->sk and truesize right after skb_realloc_headroom() call
> but  before following skb_set_owner_w(). Could you please check it?

It seems I've found the reason:
before my change, pskb_expand_head() was called for the newly cloned skb, where sk was not yet set;
after my change, skb->sk is set before the following pskb_expand_head() call.

In turn, pskb_expand_head() adjusts truesize:

net/core/skbuff.c::pskb_expand_head()
1751         /* It is not generally safe to change skb->truesize.
1752          * For the moment, we really care of rx path, or
1753          * when skb is orphaned (not attached to a socket).
1754          */
1755         if (!skb->sk || skb->destructor == sock_edemux)
1756                 skb->truesize += size - osize;
1757 
1758         return 0;

Could you please confirm it?
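
To illustrate the ordering effect, here is a minimal userspace sketch (not
kernel code). struct toy_skb, toy_expand_head() and toy_set_owner() are
hypothetical stand-ins for sk_buff, pskb_expand_head() and skb_set_owner_w();
only the truesize rule quoted above is modelled, and the 512-byte growth is
simply the difference seen in the debug output (5376 - 4864).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff: only the fields that matter here. */
struct toy_skb {
	const void *sk;		/* owning socket, NULL when orphaned */
	unsigned int truesize;	/* memory accounted to this buffer */
};

/* Models the tail of pskb_expand_head(): truesize grows only for orphaned
 * skbs. The sock_edemux destructor case is left out of this sketch. */
static void toy_expand_head(struct toy_skb *skb, unsigned int growth)
{
	if (!skb->sk)
		skb->truesize += growth;
}

/* Models only the ownership assignment done by skb_set_owner_w(). */
static void toy_set_owner(struct toy_skb *skb, const void *sk)
{
	assert(sk != NULL);
	skb->sk = sk;
}

/* Old order: expand the fresh clone (sk == NULL), attach the socket afterwards. */
static unsigned int truesize_expand_then_own(unsigned int truesize,
					      unsigned int growth)
{
	static const int sock = 0;
	struct toy_skb clone = { NULL, truesize };

	toy_expand_head(&clone, growth);
	toy_set_owner(&clone, &sock);
	return clone.truesize;
}

/* New order: the socket is already attached when the head is expanded. */
static unsigned int truesize_own_then_expand(unsigned int truesize,
					      unsigned int growth)
{
	static const int sock = 0;
	struct toy_skb skb = { NULL, truesize };

	toy_set_owner(&skb, &sock);
	toy_expand_head(&skb, growth);
	return skb.truesize;
}
```

With the numbers from the log, truesize_expand_then_own(4864, 512) gives 5376,
while truesize_own_then_expand(4864, 512) stays at 4864.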

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH NET-NEXT] ipv6: skb_expand_head() adjust skb->truesize incorrectly
  2021-08-23  5:59                                           ` Vasily Averin
@ 2021-08-23  7:56                                             ` Vasily Averin
  2021-08-23 17:25                                               ` Christoph Paasch
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-08-23  7:56 UTC (permalink / raw)
  To: Christoph Paasch, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, Eric Dumazet,
	netdev, linux-kernel, kernel, Julian Wiedmann

Christoph Paasch reports [1] an incorrect skb->truesize
after the skb_expand_head() call in ip6_xmit().
This happens because skb_set_owner_w() for the newly cloned skb is called
too early, before pskb_expand_head(), where truesize is adjusted for
the (!skb->sk) case.

[1] https://lkml.org/lkml/2021/8/20/1082

Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/skbuff.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f931176..508d5c4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1803,6 +1803,8 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
 
 struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
 {
+	struct sk_buff *oskb = skb;
+	struct sk_buff *nskb = NULL;
 	int delta = headroom - skb_headroom(skb);
 
 	if (WARN_ONCE(delta <= 0,
@@ -1811,21 +1813,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
 
 	/* pskb_expand_head() might crash, if skb is shared */
 	if (skb_shared(skb)) {
-		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
-
-		if (likely(nskb)) {
-			if (skb->sk)
-				skb_set_owner_w(nskb, skb->sk);
-			consume_skb(skb);
-		} else {
-			kfree_skb(skb);
-		}
+		nskb = skb_clone(skb, GFP_ATOMIC);
 		skb = nskb;
 	}
 	if (skb &&
-	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
-		kfree_skb(skb);
+	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC))
 		skb = NULL;
+
+	if (!skb) {
+		kfree_skb(oskb);
+		if (nskb)
+			kfree_skb(nskb);
+	} else if (nskb) {
+		if (oskb->sk)
+			skb_set_owner_w(nskb, oskb->sk);
+		consume_skb(oskb);
 	}
 	return skb;
 }
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET-NEXT] ipv6: skb_expand_head() adjust skb->truesize incorrectly
  2021-08-23  7:56                                             ` [PATCH NET-NEXT] ipv6: skb_expand_head() adjust skb->truesize incorrectly Vasily Averin
@ 2021-08-23 17:25                                               ` Christoph Paasch
  2021-08-23 21:45                                                 ` Eric Dumazet
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Paasch @ 2021-08-23 17:25 UTC (permalink / raw)
  To: Vasily Averin
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet, netdev, LKML, kernel, Julian Wiedmann

Hello,

On Mon, Aug 23, 2021 at 12:56 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> Christoph Paasch reports [1] about incorrect skb->truesize
> after skb_expand_head() call in ip6_xmit.
> This happen because skb_set_owner_w() for newly clone skb is called
> too early, before pskb_expand_head() where truesize is adjusted for
> (!skb-sk) case.
>
> [1] https://lkml.org/lkml/2021/8/20/1082
>
> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  net/core/skbuff.c | 24 +++++++++++++-----------
>  1 file changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f931176..508d5c4 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1803,6 +1803,8 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>
>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>  {
> +       struct sk_buff *oskb = skb;
> +       struct sk_buff *nskb = NULL;
>         int delta = headroom - skb_headroom(skb);
>
>         if (WARN_ONCE(delta <= 0,
> @@ -1811,21 +1813,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>
>         /* pskb_expand_head() might crash, if skb is shared */
>         if (skb_shared(skb)) {
> -               struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
> -
> -               if (likely(nskb)) {
> -                       if (skb->sk)
> -                               skb_set_owner_w(nskb, skb->sk);
> -                       consume_skb(skb);
> -               } else {
> -                       kfree_skb(skb);
> -               }
> +               nskb = skb_clone(skb, GFP_ATOMIC);
>                 skb = nskb;
>         }
>         if (skb &&
> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
> -               kfree_skb(skb);
> +           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC))
>                 skb = NULL;
> +
> +       if (!skb) {
> +               kfree_skb(oskb);
> +               if (nskb)
> +                       kfree_skb(nskb);
> +       } else if (nskb) {
> +               if (oskb->sk)
> +                       skb_set_owner_w(nskb, oskb->sk);
> +               consume_skb(oskb);

Sorry, this does not fix the problem. The syzkaller repro still
triggers the WARN.

When it happens, the skb in ip6_xmit() is not shared, as it comes from
__tcp_transmit_skb(), where it is skb_clone()'d.
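
A small userspace sketch of the distinction may help here (hypothetical
toy_* names, not kernel code). It assumes the usual meaning of the two
predicates: skb_shared() looks at the reference count of the sk_buff header,
while a clone is a separate header with its own count of 1 that merely shares
the packet data. The real skb_cloned() also checks a per-skb flag, which this
model leaves out.

```c
#include <assert.h>
#include <stdbool.h>

/* Packet data, reference-counted separately from the header. */
struct toy_data {
	int dataref;
};

/* Header: 'users' counts references to this header only. */
struct toy_skb {
	int users;
	struct toy_data *data;
};

/* Simplified skb_shared(): more than one user of the same header. */
static bool toy_skb_shared(const struct toy_skb *skb)
{
	return skb->users != 1;
}

/* Simplified skb_cloned(): the data is referenced by more than one header. */
static bool toy_skb_cloned(const struct toy_skb *skb)
{
	return skb->data->dataref != 1;
}

/* Simplified skb_clone(): new header with users == 1 pointing at the same data. */
static struct toy_skb toy_skb_clone(struct toy_skb *skb)
{
	struct toy_skb n = { 1, skb->data };

	assert(skb->data != 0);
	skb->data->dataref++;
	return n;
}
```

In this model a clone made on the transmit path is "cloned" but not "shared",
so a branch guarded by the shared check is skipped for it.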


Christoph

>         }
>         return skb;
>  }
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET-NEXT] ipv6: skb_expand_head() adjust skb->truesize incorrectly
  2021-08-23 17:25                                               ` Christoph Paasch
@ 2021-08-23 21:45                                                 ` Eric Dumazet
  2021-08-23 21:51                                                   ` Eric Dumazet
  0 siblings, 1 reply; 106+ messages in thread
From: Eric Dumazet @ 2021-08-23 21:45 UTC (permalink / raw)
  To: Christoph Paasch, Vasily Averin
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Eric Dumazet, netdev, LKML, kernel, Julian Wiedmann



On 8/23/21 10:25 AM, Christoph Paasch wrote:
> Hello,
> 
> On Mon, Aug 23, 2021 at 12:56 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> Christoph Paasch reports [1] about incorrect skb->truesize
>> after skb_expand_head() call in ip6_xmit.
>> This happen because skb_set_owner_w() for newly clone skb is called
>> too early, before pskb_expand_head() where truesize is adjusted for
>> (!skb-sk) case.
>>
>> [1] https://lkml.org/lkml/2021/8/20/1082
>>
>> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>>  net/core/skbuff.c | 24 +++++++++++++-----------
>>  1 file changed, 13 insertions(+), 11 deletions(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index f931176..508d5c4 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -1803,6 +1803,8 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>>
>>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>  {
>> +       struct sk_buff *oskb = skb;
>> +       struct sk_buff *nskb = NULL;
>>         int delta = headroom - skb_headroom(skb);
>>
>>         if (WARN_ONCE(delta <= 0,
>> @@ -1811,21 +1813,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>
>>         /* pskb_expand_head() might crash, if skb is shared */
>>         if (skb_shared(skb)) {
>> -               struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>> -
>> -               if (likely(nskb)) {
>> -                       if (skb->sk)
>> -                               skb_set_owner_w(nskb, skb->sk);
>> -                       consume_skb(skb);
>> -               } else {
>> -                       kfree_skb(skb);
>> -               }
>> +               nskb = skb_clone(skb, GFP_ATOMIC);
>>                 skb = nskb;
>>         }
>>         if (skb &&
>> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>> -               kfree_skb(skb);
>> +           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC))
>>                 skb = NULL;
>> +
>> +       if (!skb) {
>> +               kfree_skb(oskb);
>> +               if (nskb)
>> +                       kfree_skb(nskb);
>> +       } else if (nskb) {
>> +               if (oskb->sk)
>> +                       skb_set_owner_w(nskb, oskb->sk);
>> +               consume_skb(oskb);
> 
> sorry, this does not fix the problem. The syzkaller repro still
> triggers the WARN.
> 
> When it happens, the skb in ip6_xmit() is not shared as it comes from
> __tcp_transmit_skb, where it is skb_clone()'d.
> 
> 

The old code (in skb_realloc_headroom())
was first calling skb2 = skb_clone(skb, GFP_ATOMIC);

At this point, skb2->sk was NULL, so
pskb_expand_head(skb2, SKB_DATA_ALIGN(delta), 0, ...) was able to tweak skb2->truesize.

I would try:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f9311762cc475bd38d87c33e988d7c983b902e56..326749a8938637b044a616cc33b6a19ed191ac41 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1804,6 +1804,7 @@ EXPORT_SYMBOL(skb_realloc_headroom);
 struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
 {
        int delta = headroom - skb_headroom(skb);
+       struct sk_buff *oskb = NULL;
 
        if (WARN_ONCE(delta <= 0,
                      "%s is expecting an increase in the headroom", __func__))
@@ -1813,19 +1814,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
        if (skb_shared(skb)) {
                struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
 
-               if (likely(nskb)) {
-                       if (skb->sk)
-                               skb_set_owner_w(nskb, skb->sk);
-                       consume_skb(skb);
-               } else {
+               if (unlikely(!nskb)) {
                        kfree_skb(skb);
+                       return NULL;
                }
+               oskb = skb;
                skb = nskb;
        }
-       if (skb &&
-           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
+       if (pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
                kfree_skb(skb);
-               skb = NULL;
+               kfree_skb(oskb);
+               return NULL;
+       }
+       if (oskb) {
+               skb_set_owner_w(skb, oskb->sk);
+               consume_skb(oskb);
        }
        return skb;
 }




^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET-NEXT] ipv6: skb_expand_head() adjust skb->truesize incorrectly
  2021-08-23 21:45                                                 ` Eric Dumazet
@ 2021-08-23 21:51                                                   ` Eric Dumazet
  2021-08-23 22:23                                                     ` Eric Dumazet
  0 siblings, 1 reply; 106+ messages in thread
From: Eric Dumazet @ 2021-08-23 21:51 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch, Vasily Averin
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	netdev, LKML, kernel, Julian Wiedmann



On 8/23/21 2:45 PM, Eric Dumazet wrote:
> 
> 
> On 8/23/21 10:25 AM, Christoph Paasch wrote:
>> Hello,
>>
>> On Mon, Aug 23, 2021 at 12:56 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>>
>>> Christoph Paasch reports [1] about incorrect skb->truesize
>>> after skb_expand_head() call in ip6_xmit.
>>> This happen because skb_set_owner_w() for newly clone skb is called
>>> too early, before pskb_expand_head() where truesize is adjusted for
>>> (!skb-sk) case.
>>>
>>> [1] https://lkml.org/lkml/2021/8/20/1082
>>>
>>> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
>>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>>> ---
>>>  net/core/skbuff.c | 24 +++++++++++++-----------
>>>  1 file changed, 13 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index f931176..508d5c4 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -1803,6 +1803,8 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>>>
>>>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>  {
>>> +       struct sk_buff *oskb = skb;
>>> +       struct sk_buff *nskb = NULL;
>>>         int delta = headroom - skb_headroom(skb);
>>>
>>>         if (WARN_ONCE(delta <= 0,
>>> @@ -1811,21 +1813,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>
>>>         /* pskb_expand_head() might crash, if skb is shared */
>>>         if (skb_shared(skb)) {
>>> -               struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>>> -
>>> -               if (likely(nskb)) {
>>> -                       if (skb->sk)
>>> -                               skb_set_owner_w(nskb, skb->sk);
>>> -                       consume_skb(skb);
>>> -               } else {
>>> -                       kfree_skb(skb);
>>> -               }
>>> +               nskb = skb_clone(skb, GFP_ATOMIC);
>>>                 skb = nskb;
>>>         }
>>>         if (skb &&
>>> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>>> -               kfree_skb(skb);
>>> +           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC))
>>>                 skb = NULL;
>>> +
>>> +       if (!skb) {
>>> +               kfree_skb(oskb);
>>> +               if (nskb)
>>> +                       kfree_skb(nskb);
>>> +       } else if (nskb) {
>>> +               if (oskb->sk)
>>> +                       skb_set_owner_w(nskb, oskb->sk);
>>> +               consume_skb(oskb);
>>
>> sorry, this does not fix the problem. The syzkaller repro still
>> triggers the WARN.
>>
>> When it happens, the skb in ip6_xmit() is not shared as it comes from
>> __tcp_transmit_skb, where it is skb_clone()'d.
>>
>>
> 
> Old code (in skb_realloc_headroom())
> was first calling skb2 = skb_clone(skb, GFP_ATOMIC); 
> 
> At this point, skb2->sk was NULL
> So pskb_expand_head(skb2, SKB_DATA_ALIGN(delta), 0, ...) was able to tweak skb2->truesize
> 
> I would try :
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f9311762cc475bd38d87c33e988d7c983b902e56..326749a8938637b044a616cc33b6a19ed191ac41 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1804,6 +1804,7 @@ EXPORT_SYMBOL(skb_realloc_headroom);
>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>  {
>         int delta = headroom - skb_headroom(skb);
> +       struct sk_buff *oskb = NULL;
>  
>         if (WARN_ONCE(delta <= 0,
>                       "%s is expecting an increase in the headroom", __func__))
> @@ -1813,19 +1814,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>         if (skb_shared(skb)) {
>                 struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>  
> -               if (likely(nskb)) {
> -                       if (skb->sk)
> -                               skb_set_owner_w(nskb, skb->sk);
> -                       consume_skb(skb);
> -               } else {
> +               if (unlikely(!nskb)) {
>                         kfree_skb(skb);
> +                       return NULL;
>                 }
> +               oskb = skb;
>                 skb = nskb;
>         }
> -       if (skb &&
> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
> +       if (pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>                 kfree_skb(skb);
> -               skb = NULL;
> +               kfree_skb(oskb);
> +               return NULL;
> +       }
> +       if (oskb) {
> +               skb_set_owner_w(skb, oskb->sk);
> +               consume_skb(oskb);
>         }
>         return skb;
>  }


Oh well, probably not going to work.

We have to find a way to properly increase skb->truesize, even if skb_clone() is _not_ called.
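
As a rough userspace illustration of why truesize cannot simply be bumped on
an owned skb, and of one possible direction (growing truesize and the socket
charge together), consider the sketch below. All toy_* names are hypothetical,
and the plain counter is a simplification of the real sk_wmem_alloc accounting.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sock {
	unsigned int wmem_alloc;	/* bytes charged by in-flight skbs */
};

struct toy_skb {
	struct toy_sock *sk;
	unsigned int truesize;
};

/* Charge the skb to the socket, as a write-owner assignment would. */
static void toy_set_owner_w(struct toy_skb *skb, struct toy_sock *sk)
{
	skb->sk = sk;
	sk->wmem_alloc += skb->truesize;
}

/* Destructor: uncharge whatever truesize says at free time. */
static void toy_wfree(struct toy_skb *skb)
{
	assert(skb->sk != NULL);
	skb->sk->wmem_alloc -= skb->truesize;
	skb->sk = NULL;
}

/* Grow an owned skb while keeping the socket charge in step with truesize. */
static void toy_grow_owned(struct toy_skb *skb, unsigned int growth)
{
	skb->truesize += growth;
	if (skb->sk)
		skb->sk->wmem_alloc += growth;
}

/* Returns the socket charge left after the skb is freed; 0 means balanced. */
static unsigned int toy_leftover_charge(unsigned int truesize,
					unsigned int growth, int charge_socket)
{
	struct toy_sock sock = { 0 };
	struct toy_skb skb = { NULL, truesize };

	toy_set_owner_w(&skb, &sock);
	if (charge_socket)
		toy_grow_owned(&skb, growth);
	else
		skb.truesize += growth;	/* truesize changed behind the socket's back */
	toy_wfree(&skb);
	return sock.wmem_alloc;
}
```

toy_leftover_charge(4864, 512, 1) returns 0, whereas with charge_socket == 0
the counter wraps below zero, because more is uncharged than was charged.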




^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET-NEXT] ipv6: skb_expand_head() adjust skb->truesize incorrectly
  2021-08-23 21:51                                                   ` Eric Dumazet
@ 2021-08-23 22:23                                                     ` Eric Dumazet
  2021-08-24  8:50                                                       ` Vasily Averin
  2021-08-27 15:23                                                       ` [PATCH NET-NEXT] ipv6: " Vasily Averin
  0 siblings, 2 replies; 106+ messages in thread
From: Eric Dumazet @ 2021-08-23 22:23 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch, Vasily Averin
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	netdev, LKML, kernel, Julian Wiedmann



On 8/23/21 2:51 PM, Eric Dumazet wrote:
> 
> 
> On 8/23/21 2:45 PM, Eric Dumazet wrote:
>>
>>
>> On 8/23/21 10:25 AM, Christoph Paasch wrote:
>>> Hello,
>>>
>>> On Mon, Aug 23, 2021 at 12:56 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>>>
>>>> Christoph Paasch reports [1] about incorrect skb->truesize
>>>> after skb_expand_head() call in ip6_xmit.
>>>> This happen because skb_set_owner_w() for newly clone skb is called
>>>> too early, before pskb_expand_head() where truesize is adjusted for
>>>> (!skb-sk) case.
>>>>
>>>> [1] https://lkml.org/lkml/2021/8/20/1082
>>>>
>>>> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
>>>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>>>> ---
>>>>  net/core/skbuff.c | 24 +++++++++++++-----------
>>>>  1 file changed, 13 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index f931176..508d5c4 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -1803,6 +1803,8 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>>>>
>>>>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>>  {
>>>> +       struct sk_buff *oskb = skb;
>>>> +       struct sk_buff *nskb = NULL;
>>>>         int delta = headroom - skb_headroom(skb);
>>>>
>>>>         if (WARN_ONCE(delta <= 0,
>>>> @@ -1811,21 +1813,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>>
>>>>         /* pskb_expand_head() might crash, if skb is shared */
>>>>         if (skb_shared(skb)) {
>>>> -               struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>>>> -
>>>> -               if (likely(nskb)) {
>>>> -                       if (skb->sk)
>>>> -                               skb_set_owner_w(nskb, skb->sk);
>>>> -                       consume_skb(skb);
>>>> -               } else {
>>>> -                       kfree_skb(skb);
>>>> -               }
>>>> +               nskb = skb_clone(skb, GFP_ATOMIC);
>>>>                 skb = nskb;
>>>>         }
>>>>         if (skb &&
>>>> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>>>> -               kfree_skb(skb);
>>>> +           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC))
>>>>                 skb = NULL;
>>>> +
>>>> +       if (!skb) {
>>>> +               kfree_skb(oskb);
>>>> +               if (nskb)
>>>> +                       kfree_skb(nskb);
>>>> +       } else if (nskb) {
>>>> +               if (oskb->sk)
>>>> +                       skb_set_owner_w(nskb, oskb->sk);
>>>> +               consume_skb(oskb);
>>>
>>> sorry, this does not fix the problem. The syzkaller repro still
>>> triggers the WARN.
>>>
>>> When it happens, the skb in ip6_xmit() is not shared as it comes from
>>> __tcp_transmit_skb, where it is skb_clone()'d.
>>>
>>>
>>
>> Old code (in skb_realloc_headroom())
>> was first calling skb2 = skb_clone(skb, GFP_ATOMIC); 
>>
>> At this point, skb2->sk was NULL
>> So pskb_expand_head(skb2, SKB_DATA_ALIGN(delta), 0, ...) was able to tweak skb2->truesize
>>
>> I would try :
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index f9311762cc475bd38d87c33e988d7c983b902e56..326749a8938637b044a616cc33b6a19ed191ac41 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -1804,6 +1804,7 @@ EXPORT_SYMBOL(skb_realloc_headroom);
>>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>  {
>>         int delta = headroom - skb_headroom(skb);
>> +       struct sk_buff *oskb = NULL;
>>  
>>         if (WARN_ONCE(delta <= 0,
>>                       "%s is expecting an increase in the headroom", __func__))
>> @@ -1813,19 +1814,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>         if (skb_shared(skb)) {
>>                 struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>>  
>> -               if (likely(nskb)) {
>> -                       if (skb->sk)
>> -                               skb_set_owner_w(nskb, skb->sk);
>> -                       consume_skb(skb);
>> -               } else {
>> +               if (unlikely(!nskb)) {
>>                         kfree_skb(skb);
>> +                       return NULL;
>>                 }
>> +               oskb = skb;
>>                 skb = nskb;
>>         }
>> -       if (skb &&
>> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>> +       if (pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>>                 kfree_skb(skb);
>> -               skb = NULL;
>> +               kfree_skb(oskb);
>> +               return NULL;
>> +       }
>> +       if (oskb) {
>> +               skb_set_owner_w(skb, oskb->sk);
>> +               consume_skb(oskb);
>>         }
>>         return skb;
>>  }
> 
> 
> Oh well, probably not going to work.
> 
> We have to find a way to properly increase skb->truesize, even if skb_clone() is _not_ called.
> 

I also note that the current use of skb_set_owner_w(), forcing skb->destructor to sock_wfree(),
is probably breaking TCP Small Queues, since the original skb->destructor would be tcp_wfree() or __sock_wfree().
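
A toy userspace model of that concern (hypothetical toy_* names, not kernel
code): it assumes that a clone starts without owner and destructor, and that
the owner assignment always installs the generic destructor. The model only
follows the buffer that continues down the transmit path.

```c
#include <assert.h>
#include <stddef.h>

struct toy_skb;
typedef void (*toy_destructor_t)(struct toy_skb *skb);

struct toy_skb {
	const void *sk;
	toy_destructor_t destructor;
};

static int tsq_callbacks;	/* how often the TCP-style destructor ran */

/* Stand-in for tcp_wfree(): the hook the TCP accounting relies on. */
static void toy_tcp_wfree(struct toy_skb *skb)
{
	(void)skb;
	tsq_callbacks++;
}

/* Stand-in for sock_wfree(): generic uncharge, no TCP-specific work. */
static void toy_sock_wfree(struct toy_skb *skb)
{
	(void)skb;
}

/* A clone starts without owner and without destructor. */
static struct toy_skb toy_skb_clone(const struct toy_skb *skb)
{
	struct toy_skb n = { NULL, NULL };

	(void)skb;
	return n;
}

/* Owner assignment always installs the generic destructor. */
static void toy_set_owner_w(struct toy_skb *skb, const void *sk)
{
	skb->sk = sk;
	skb->destructor = toy_sock_wfree;
}

static void toy_free(struct toy_skb *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
}

/* Returns how many times the TCP-style destructor ran for the sent buffer. */
static int toy_tsq_callbacks_after_expand(int shared)
{
	static const int sock = 0;
	struct toy_skb orig = { &sock, toy_tcp_wfree };
	struct toy_skb out = orig;

	tsq_callbacks = 0;
	if (shared) {
		out = toy_skb_clone(&orig);
		toy_set_owner_w(&out, orig.sk);
	}
	toy_free(&out);
	return tsq_callbacks;
}
```

In the unshared case the TCP-style destructor still runs once; after the
clone-and-reassign path it never runs for the transmitted buffer.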





^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET-NEXT] ipv6: skb_expand_head() adjust skb->truesize incorrectly
  2021-08-23 22:23                                                     ` Eric Dumazet
@ 2021-08-24  8:50                                                       ` Vasily Averin
  2021-08-24 17:21                                                         ` Vasily Averin
  2021-08-27 15:23                                                       ` [PATCH NET-NEXT] ipv6: " Vasily Averin
  1 sibling, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-08-24  8:50 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	netdev, LKML, kernel, Julian Wiedmann

On 8/24/21 1:23 AM, Eric Dumazet wrote:
> On 8/23/21 2:51 PM, Eric Dumazet wrote:
>> On 8/23/21 2:45 PM, Eric Dumazet wrote:
>>> On 8/23/21 10:25 AM, Christoph Paasch wrote:
>>>> Hello,
>>>>
>>>> On Mon, Aug 23, 2021 at 12:56 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>>>>
>>>>> Christoph Paasch reports [1] about incorrect skb->truesize
>>>>> after skb_expand_head() call in ip6_xmit.
>>>>> This happen because skb_set_owner_w() for newly clone skb is called
>>>>> too early, before pskb_expand_head() where truesize is adjusted for
>>>>> (!skb-sk) case.
>>>>>
>>>>> [1] https://lkml.org/lkml/2021/8/20/1082
>>>>>
>>>>> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
>>>>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>>>>> ---
>>>>>  net/core/skbuff.c | 24 +++++++++++++-----------
>>>>>  1 file changed, 13 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>>> index f931176..508d5c4 100644
>>>>> --- a/net/core/skbuff.c
>>>>> +++ b/net/core/skbuff.c
>>>>> @@ -1803,6 +1803,8 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>>>>>
>>>>>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>>>  {
>>>>> +       struct sk_buff *oskb = skb;
>>>>> +       struct sk_buff *nskb = NULL;
>>>>>         int delta = headroom - skb_headroom(skb);
>>>>>
>>>>>         if (WARN_ONCE(delta <= 0,
>>>>> @@ -1811,21 +1813,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>>>
>>>>>         /* pskb_expand_head() might crash, if skb is shared */
>>>>>         if (skb_shared(skb)) {
>>>>> -               struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>>>>> -
>>>>> -               if (likely(nskb)) {
>>>>> -                       if (skb->sk)
>>>>> -                               skb_set_owner_w(nskb, skb->sk);
>>>>> -                       consume_skb(skb);
>>>>> -               } else {
>>>>> -                       kfree_skb(skb);
>>>>> -               }
>>>>> +               nskb = skb_clone(skb, GFP_ATOMIC);
>>>>>                 skb = nskb;
>>>>>         }
>>>>>         if (skb &&
>>>>> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>>>>> -               kfree_skb(skb);
>>>>> +           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC))
>>>>>                 skb = NULL;
>>>>> +
>>>>> +       if (!skb) {
>>>>> +               kfree_skb(oskb);
>>>>> +               if (nskb)
>>>>> +                       kfree_skb(nskb);
>>>>> +       } else if (nskb) {
>>>>> +               if (oskb->sk)
>>>>> +                       skb_set_owner_w(nskb, oskb->sk);
>>>>> +               consume_skb(oskb);
>>>>
>>>> sorry, this does not fix the problem. The syzkaller repro still
>>>> triggers the WARN.
>>>>
>>>> When it happens, the skb in ip6_xmit() is not shared as it comes from
>>>> __tcp_transmit_skb, where it is skb_clone()'d.
>>>>
>>>>
>>>
>>> Old code (in skb_realloc_headroom())
>>> was first calling skb2 = skb_clone(skb, GFP_ATOMIC); 
>>>
>>> At this point, skb2->sk was NULL
>>> So pskb_expand_head(skb2, SKB_DATA_ALIGN(delta), 0, ...) was able to tweak skb2->truesize
>>>
>>> I would try :
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index f9311762cc475bd38d87c33e988d7c983b902e56..326749a8938637b044a616cc33b6a19ed191ac41 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -1804,6 +1804,7 @@ EXPORT_SYMBOL(skb_realloc_headroom);
>>>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>  {
>>>         int delta = headroom - skb_headroom(skb);
>>> +       struct sk_buff *oskb = NULL;
>>>  
>>>         if (WARN_ONCE(delta <= 0,
>>>                       "%s is expecting an increase in the headroom", __func__))
>>> @@ -1813,19 +1814,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>         if (skb_shared(skb)) {
>>>                 struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>>>  
>>> -               if (likely(nskb)) {
>>> -                       if (skb->sk)
>>> -                               skb_set_owner_w(nskb, skb->sk);
>>> -                       consume_skb(skb);
>>> -               } else {
>>> +               if (unlikely(!nskb)) {
>>>                         kfree_skb(skb);
>>> +                       return NULL;
>>>                 }
>>> +               oskb = skb;
>>>                 skb = nskb;
>>>         }
>>> -       if (skb &&
>>> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>>> +       if (pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>>>                 kfree_skb(skb);
>>> -               skb = NULL;
>>> +               kfree_skb(oskb);
>>> +               return NULL;
>>> +       }
>>> +       if (oskb) {
>>> +               skb_set_owner_w(skb, oskb->sk);
>>> +               consume_skb(oskb);
>>>         }
>>>         return skb;
>>>  }
>>
>>
>> Oh well, probably not going to work.
>>
>> We have to find a way to properly increase skb->truesize, even if skb_clone() is _not_ called.

Can we adjust truesize outside pskb_expand_head()?
Could you please explain why that might not be safe?

> I also note that current use of skb_set_owner_w(), forcing skb->destructor to sock_wfree()
> is probably breaking TCP Small queues, since original skb->destructor would be tcp_wfree() or __sock_wfree()

I agree; however, as far as I understand, it is a separate and more global problem.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET-NEXT] ipv6: skb_expand_head() adjust skb->truesize incorrectly
  2021-08-24  8:50                                                       ` Vasily Averin
@ 2021-08-24 17:21                                                         ` Vasily Averin
  2021-08-25 17:49                                                           ` Christoph Paasch
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-08-24 17:21 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	netdev, LKML, kernel, Julian Wiedmann

On 8/24/21 11:50 AM, Vasily Averin wrote:
> On 8/24/21 1:23 AM, Eric Dumazet wrote:
>> On 8/23/21 2:51 PM, Eric Dumazet wrote:
>>> On 8/23/21 2:45 PM, Eric Dumazet wrote:
>>>> On 8/23/21 10:25 AM, Christoph Paasch wrote:
>>>>> Hello,
>>>>>
>>>>> On Mon, Aug 23, 2021 at 12:56 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>>>>>
>>>>>> Christoph Paasch reports [1] an incorrect skb->truesize
>>>>>> after the skb_expand_head() call in ip6_xmit.
>>>>>> This happens because skb_set_owner_w() for the newly cloned skb is called
>>>>>> too early, before pskb_expand_head(), where truesize is adjusted for
>>>>>> the (!skb->sk) case.
>>>>>>
>>>>>> [1] https://lkml.org/lkml/2021/8/20/1082
>>>>>>
>>>>>> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
>>>>>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>>>>>> ---
>>>>>>  net/core/skbuff.c | 24 +++++++++++++-----------
>>>>>>  1 file changed, 13 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>>>> index f931176..508d5c4 100644
>>>>>> --- a/net/core/skbuff.c
>>>>>> +++ b/net/core/skbuff.c
>>>>>> @@ -1803,6 +1803,8 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>>>>>>
>>>>>>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>>>>  {
>>>>>> +       struct sk_buff *oskb = skb;
>>>>>> +       struct sk_buff *nskb = NULL;
>>>>>>         int delta = headroom - skb_headroom(skb);
>>>>>>
>>>>>>         if (WARN_ONCE(delta <= 0,
>>>>>> @@ -1811,21 +1813,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>>>>
>>>>>>         /* pskb_expand_head() might crash, if skb is shared */
>>>>>>         if (skb_shared(skb)) {
>>>>>> -               struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>>>>>> -
>>>>>> -               if (likely(nskb)) {
>>>>>> -                       if (skb->sk)
>>>>>> -                               skb_set_owner_w(nskb, skb->sk);
>>>>>> -                       consume_skb(skb);
>>>>>> -               } else {
>>>>>> -                       kfree_skb(skb);
>>>>>> -               }
>>>>>> +               nskb = skb_clone(skb, GFP_ATOMIC);
>>>>>>                 skb = nskb;
>>>>>>         }
>>>>>>         if (skb &&
>>>>>> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>>>>>> -               kfree_skb(skb);
>>>>>> +           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC))
>>>>>>                 skb = NULL;
>>>>>> +
>>>>>> +       if (!skb) {
>>>>>> +               kfree_skb(oskb);
>>>>>> +               if (nskb)
>>>>>> +                       kfree_skb(nskb);
>>>>>> +       } else if (nskb) {
>>>>>> +               if (oskb->sk)
>>>>>> +                       skb_set_owner_w(nskb, oskb->sk);
>>>>>> +               consume_skb(oskb);
>>>>>
>>>>> sorry, this does not fix the problem. The syzkaller repro still
>>>>> triggers the WARN.
>>>>>
>>>>> When it happens, the skb in ip6_xmit() is not shared as it comes from
>>>>> __tcp_transmit_skb, where it is skb_clone()'d.
>>>>>
>>>>>
>>>>
>>>> Old code (in skb_realloc_headroom())
>>>> was first calling skb2 = skb_clone(skb, GFP_ATOMIC); 
>>>>
>>>> At this point, skb2->sk was NULL
>>>> So pskb_expand_head(skb2, SKB_DATA_ALIGN(delta), 0, ...) was able to tweak skb2->truesize
>>>>
>>>> I would try :
>>>>
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index f9311762cc475bd38d87c33e988d7c983b902e56..326749a8938637b044a616cc33b6a19ed191ac41 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -1804,6 +1804,7 @@ EXPORT_SYMBOL(skb_realloc_headroom);
>>>>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>>  {
>>>>         int delta = headroom - skb_headroom(skb);
>>>> +       struct sk_buff *oskb = NULL;
>>>>  
>>>>         if (WARN_ONCE(delta <= 0,
>>>>                       "%s is expecting an increase in the headroom", __func__))
>>>> @@ -1813,19 +1814,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>>         if (skb_shared(skb)) {
>>>>                 struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>>>>  
>>>> -               if (likely(nskb)) {
>>>> -                       if (skb->sk)
>>>> -                               skb_set_owner_w(nskb, skb->sk);
>>>> -                       consume_skb(skb);
>>>> -               } else {
>>>> +               if (unlikely(!nskb)) {
>>>>                         kfree_skb(skb);
>>>> +                       return NULL;
>>>>                 }
>>>> +               oskb = skb;
>>>>                 skb = nskb;
>>>>         }
>>>> -       if (skb &&
>>>> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>>>> +       if (pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>>>>                 kfree_skb(skb);
>>>> -               skb = NULL;
>>>> +               kfree_skb(oskb);
>>>> +               return NULL;
>>>> +       }
>>>> +       if (oskb) {
>>>> +               skb_set_owner_w(skb, oskb->sk);
>>>> +               consume_skb(oskb);
>>>>         }
>>>>         return skb;
>>>>  }
>>>
>>>
>>> Oh well, probably not going to work.
>>>
>>> We have to find a way to properly increase skb->truesize, even if skb_clone() is _not_ called.
> 
> Can we adjust truesize outside pskb_expand_head()?
> Could you please explain why that might not be safe?

Do you mean that a truesize change must not break the balance of sk->sk_wmem_alloc?

>> I also note that current use of skb_set_owner_w(), forcing skb->destructor to sock_wfree()
>> is probably breaking TCP Small queues, since original skb->destructor would be tcp_wfree() or __sock_wfree()
> 
> I agree; however, as far as I understand, it is a separate and more global problem.
> 
> Thank you,
> 	Vasily Averin
> 


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET-NEXT] ipv6: skb_expand_head() adjust skb->truesize incorrectly
  2021-08-24 17:21                                                         ` Vasily Averin
@ 2021-08-25 17:49                                                           ` Christoph Paasch
  2021-08-29 12:59                                                             ` [PATCH v2] " Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Paasch @ 2021-08-25 17:49 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Eric Dumazet, David S. Miller, Hideaki YOSHIFUJI, David Ahern,
	Jakub Kicinski, netdev, LKML, kernel, Julian Wiedmann

On Tue, Aug 24, 2021 at 10:22 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 8/24/21 11:50 AM, Vasily Averin wrote:
> > On 8/24/21 1:23 AM, Eric Dumazet wrote:
> >> On 8/23/21 2:51 PM, Eric Dumazet wrote:
> >>> On 8/23/21 2:45 PM, Eric Dumazet wrote:
> >>>> On 8/23/21 10:25 AM, Christoph Paasch wrote:
> >>>>> Hello,
> >>>>>
> >>>>> On Mon, Aug 23, 2021 at 12:56 AM Vasily Averin <vvs@virtuozzo.com> wrote:
> >>>>>>
> >>>>>> Christoph Paasch reports [1] an incorrect skb->truesize
> >>>>>> after the skb_expand_head() call in ip6_xmit.
> >>>>>> This happens because skb_set_owner_w() for the newly cloned skb is called
> >>>>>> too early, before pskb_expand_head(), where truesize is adjusted for
> >>>>>> the (!skb->sk) case.
> >>>>>>
> >>>>>> [1] https://lkml.org/lkml/2021/8/20/1082
> >>>>>>
> >>>>>> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
> >>>>>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> >>>>>> ---
> >>>>>>  net/core/skbuff.c | 24 +++++++++++++-----------
> >>>>>>  1 file changed, 13 insertions(+), 11 deletions(-)
> >>>>>>
> >>>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >>>>>> index f931176..508d5c4 100644
> >>>>>> --- a/net/core/skbuff.c
> >>>>>> +++ b/net/core/skbuff.c
> >>>>>> @@ -1803,6 +1803,8 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
> >>>>>>
> >>>>>>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
> >>>>>>  {
> >>>>>> +       struct sk_buff *oskb = skb;
> >>>>>> +       struct sk_buff *nskb = NULL;
> >>>>>>         int delta = headroom - skb_headroom(skb);
> >>>>>>
> >>>>>>         if (WARN_ONCE(delta <= 0,
> >>>>>> @@ -1811,21 +1813,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
> >>>>>>
> >>>>>>         /* pskb_expand_head() might crash, if skb is shared */
> >>>>>>         if (skb_shared(skb)) {
> >>>>>> -               struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
> >>>>>> -
> >>>>>> -               if (likely(nskb)) {
> >>>>>> -                       if (skb->sk)
> >>>>>> -                               skb_set_owner_w(nskb, skb->sk);
> >>>>>> -                       consume_skb(skb);
> >>>>>> -               } else {
> >>>>>> -                       kfree_skb(skb);
> >>>>>> -               }
> >>>>>> +               nskb = skb_clone(skb, GFP_ATOMIC);
> >>>>>>                 skb = nskb;
> >>>>>>         }
> >>>>>>         if (skb &&
> >>>>>> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
> >>>>>> -               kfree_skb(skb);
> >>>>>> +           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC))
> >>>>>>                 skb = NULL;
> >>>>>> +
> >>>>>> +       if (!skb) {
> >>>>>> +               kfree_skb(oskb);
> >>>>>> +               if (nskb)
> >>>>>> +                       kfree_skb(nskb);
> >>>>>> +       } else if (nskb) {
> >>>>>> +               if (oskb->sk)
> >>>>>> +                       skb_set_owner_w(nskb, oskb->sk);
> >>>>>> +               consume_skb(oskb);
> >>>>>
> >>>>> sorry, this does not fix the problem. The syzkaller repro still
> >>>>> triggers the WARN.
> >>>>>
> >>>>> When it happens, the skb in ip6_xmit() is not shared as it comes from
> >>>>> __tcp_transmit_skb, where it is skb_clone()'d.
> >>>>>
> >>>>>
> >>>>
> >>>> Old code (in skb_realloc_headroom())
> >>>> was first calling skb2 = skb_clone(skb, GFP_ATOMIC);
> >>>>
> >>>> At this point, skb2->sk was NULL
> >>>> So pskb_expand_head(skb2, SKB_DATA_ALIGN(delta), 0, ...) was able to tweak skb2->truesize
> >>>>
> >>>> I would try :
> >>>>
> >>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >>>> index f9311762cc475bd38d87c33e988d7c983b902e56..326749a8938637b044a616cc33b6a19ed191ac41 100644
> >>>> --- a/net/core/skbuff.c
> >>>> +++ b/net/core/skbuff.c
> >>>> @@ -1804,6 +1804,7 @@ EXPORT_SYMBOL(skb_realloc_headroom);
> >>>>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
> >>>>  {
> >>>>         int delta = headroom - skb_headroom(skb);
> >>>> +       struct sk_buff *oskb = NULL;
> >>>>
> >>>>         if (WARN_ONCE(delta <= 0,
> >>>>                       "%s is expecting an increase in the headroom", __func__))
> >>>> @@ -1813,19 +1814,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
> >>>>         if (skb_shared(skb)) {
> >>>>                 struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
> >>>>
> >>>> -               if (likely(nskb)) {
> >>>> -                       if (skb->sk)
> >>>> -                               skb_set_owner_w(nskb, skb->sk);
> >>>> -                       consume_skb(skb);
> >>>> -               } else {
> >>>> +               if (unlikely(!nskb)) {
> >>>>                         kfree_skb(skb);
> >>>> +                       return NULL;
> >>>>                 }
> >>>> +               oskb = skb;
> >>>>                 skb = nskb;
> >>>>         }
> >>>> -       if (skb &&
> >>>> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
> >>>> +       if (pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
> >>>>                 kfree_skb(skb);
> >>>> -               skb = NULL;
> >>>> +               kfree_skb(oskb);
> >>>> +               return NULL;
> >>>> +       }
> >>>> +       if (oskb) {
> >>>> +               skb_set_owner_w(skb, oskb->sk);
> >>>> +               consume_skb(oskb);
> >>>>         }
> >>>>         return skb;
> >>>>  }
> >>>
> >>>
> >>> Oh well, probably not going to work.
> >>>
> >>> We have to find a way to properly increase skb->truesize, even if skb_clone() is _not_ called.
> >
> > Can we adjust truesize outside pskb_expand_head()?
> > Could you please explain why that might not be safe?
>
> Do you mean that a truesize change must not break the balance of sk->sk_wmem_alloc?

AFAICS, that's the problem with adjusting truesize. So maybe "just"
refcount_add() the truesize increase to sk->sk_wmem_alloc.

The below does fix the syzkaller bug for me and seems to do the right
thing overall. But I honestly think that this is becoming too hacky
and not worth it... and who knows what other corner-cases this now
exposes...

Maybe a revert is a better course of action?

---
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f9311762cc47..9cc18a0fdd1c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -71,6 +71,7 @@
 #include <net/mpls.h>
 #include <net/mptcp.h>
 #include <net/page_pool.h>
+#include <net/tcp.h>

 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
@@ -1756,9 +1757,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
  * For the moment, we really care of rx path, or
  * when skb is orphaned (not attached to a socket).
  */
- if (!skb->sk || skb->destructor == sock_edemux)
+ if (!skb->sk || skb->destructor == sock_edemux || skb->destructor == tcp_wfree) {
  skb->truesize += size - osize;

+ if (skb->sk && skb->destructor == tcp_wfree) {
+ refcount_add(size - osize, &skb->sk->sk_wmem_alloc);
+ }
+ }
+
  return 0;

 nofrags:

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET-NEXT] ipv6: skb_expand_head() adjust skb->truesize incorrectly
  2021-08-23 22:23                                                     ` Eric Dumazet
  2021-08-24  8:50                                                       ` Vasily Averin
@ 2021-08-27 15:23                                                       ` Vasily Averin
  2021-08-27 16:47                                                         ` Eric Dumazet
  1 sibling, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-08-27 15:23 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	netdev, LKML, kernel, Julian Wiedmann, Alexey Kuznetsov

On 8/24/21 1:23 AM, Eric Dumazet wrote:
> On 8/23/21 2:51 PM, Eric Dumazet wrote:
>> On 8/23/21 2:45 PM, Eric Dumazet wrote:
>>> On 8/23/21 10:25 AM, Christoph Paasch wrote:
>>>> Hello,
>>>>
>>>> On Mon, Aug 23, 2021 at 12:56 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>>>>
>>>>> Christoph Paasch reports [1] an incorrect skb->truesize
>>>>> after the skb_expand_head() call in ip6_xmit.
>>>>> This happens because skb_set_owner_w() for the newly cloned skb is called
>>>>> too early, before pskb_expand_head(), where truesize is adjusted for
>>>>> the (!skb->sk) case.
>>>>>
>>>>> [1] https://lkml.org/lkml/2021/8/20/1082
>>>>>
>>>>> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
>>>>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>>>>> ---
>>>>>  net/core/skbuff.c | 24 +++++++++++++-----------
>>>>>  1 file changed, 13 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>>> index f931176..508d5c4 100644
>>>>> --- a/net/core/skbuff.c
>>>>> +++ b/net/core/skbuff.c
>>>>> @@ -1803,6 +1803,8 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>>>>>
>>>>>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>>>  {
>>>>> +       struct sk_buff *oskb = skb;
>>>>> +       struct sk_buff *nskb = NULL;
>>>>>         int delta = headroom - skb_headroom(skb);
>>>>>
>>>>>         if (WARN_ONCE(delta <= 0,
>>>>> @@ -1811,21 +1813,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>>>
>>>>>         /* pskb_expand_head() might crash, if skb is shared */
>>>>>         if (skb_shared(skb)) {
>>>>> -               struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>>>>> -
>>>>> -               if (likely(nskb)) {
>>>>> -                       if (skb->sk)
>>>>> -                               skb_set_owner_w(nskb, skb->sk);
>>>>> -                       consume_skb(skb);
>>>>> -               } else {
>>>>> -                       kfree_skb(skb);
>>>>> -               }
>>>>> +               nskb = skb_clone(skb, GFP_ATOMIC);
>>>>>                 skb = nskb;
>>>>>         }
>>>>>         if (skb &&
>>>>> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>>>>> -               kfree_skb(skb);
>>>>> +           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC))
>>>>>                 skb = NULL;
>>>>> +
>>>>> +       if (!skb) {
>>>>> +               kfree_skb(oskb);
>>>>> +               if (nskb)
>>>>> +                       kfree_skb(nskb);
>>>>> +       } else if (nskb) {
>>>>> +               if (oskb->sk)
>>>>> +                       skb_set_owner_w(nskb, oskb->sk);
>>>>> +               consume_skb(oskb);
>>>>
>>>> sorry, this does not fix the problem. The syzkaller repro still
>>>> triggers the WARN.
>>>>
>>>> When it happens, the skb in ip6_xmit() is not shared as it comes from
>>>> __tcp_transmit_skb, where it is skb_clone()'d.
>>>>
>>>>
>>>
>>> Old code (in skb_realloc_headroom())
>>> was first calling skb2 = skb_clone(skb, GFP_ATOMIC); 
>>>
>>> At this point, skb2->sk was NULL
>>> So pskb_expand_head(skb2, SKB_DATA_ALIGN(delta), 0, ...) was able to tweak skb2->truesize
>>>
>>> I would try :
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index f9311762cc475bd38d87c33e988d7c983b902e56..326749a8938637b044a616cc33b6a19ed191ac41 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -1804,6 +1804,7 @@ EXPORT_SYMBOL(skb_realloc_headroom);
>>>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>  {
>>>         int delta = headroom - skb_headroom(skb);
>>> +       struct sk_buff *oskb = NULL;
>>>  
>>>         if (WARN_ONCE(delta <= 0,
>>>                       "%s is expecting an increase in the headroom", __func__))
>>> @@ -1813,19 +1814,21 @@ struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>         if (skb_shared(skb)) {
>>>                 struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>>>  
>>> -               if (likely(nskb)) {
>>> -                       if (skb->sk)
>>> -                               skb_set_owner_w(nskb, skb->sk);
>>> -                       consume_skb(skb);
>>> -               } else {
>>> +               if (unlikely(!nskb)) {
>>>                         kfree_skb(skb);
>>> +                       return NULL;
>>>                 }
>>> +               oskb = skb;
>>>                 skb = nskb;
>>>         }
>>> -       if (skb &&
>>> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>>> +       if (pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>>>                 kfree_skb(skb);
>>> -               skb = NULL;
>>> +               kfree_skb(oskb);
>>> +               return NULL;
>>> +       }
>>> +       if (oskb) {
>>> +               skb_set_owner_w(skb, oskb->sk);
>>> +               consume_skb(oskb);
>>>         }
>>>         return skb;
>>>  }
>> Oh well, probably not going to work.
>>
>> We have to find a way to properly increase skb->truesize, even if skb_clone() is _not_ called.
> 
> I also note that current use of skb_set_owner_w(), forcing skb->destructor to sock_wfree()
> is probably breaking TCP Small queues, since original skb->destructor would be tcp_wfree() or __sock_wfree()

I asked Alexey Kuznetsov to look at this problem. Below is his answer:
"I think the current scheme is obsolete. It was created
when we had only two kinds of skb accounting (rmem & wmem)
and with more kinds of accounting it just does not work.
Even there we had ignored problems with adjusting accounting.

Logically the best solution would be replacing ->destructor,
set_owner* etc with skb_ops. Something like:

struct skb_ops
{
        void init(struct sk_buff * skb, struct skb_ops * ops, struct sock * owner);
        void fini(struct sk_buff * skb);
        void update(struct sk_buff * skb, int adjust);
        void inherit(struct sk_buff * skb2, struct sk_buff * skb);
};

init - is replacement for skb_set_owner_r|w
fini - is replacement for skb_orphan
update - is new operation to be used in places where skb->truesize changes,
       instead of awful constructions like:

       if (!skb->sk || skb->destructor == sock_edemux)
            skb->truesize += size - osize;

       Now it will look like:

       if (skb->ops)
            skb->ops->update(skb, size - osize);

inherit - is replacement for also awful constructs like:

      if (skb->sk)
            skb_set_owner_w(skb2, skb->sk);

      Now it will be:

      if (skb->ops)
            skb->ops->inherit(skb2, skb);

The implementation looks mostly obvious.
Some troubles can be only with new functionality:
update of accounting was never done before.


More efficient, functionally equivalent, but uglier and less flexible
alternative would be removal of ->destructor, replaced with
a small numeric indicator of ownership:

enum
{
        SKB_OWNER_NONE,  /* aka destructor == NULL */
        SKB_OWNER_WMEM,  /* aka destructor == sock_wfree */
        SKB_OWNER_RMEM,  /* aka destructor == sock_rfree */
        SKB_OWNER_SK,    /* aka destructor == sock_edemux */
        SKB_OWNER_TCP,   /* aka destructor == tcp_wfree */
};

And the same init,fini,inherit,update become functions
w/o any indirect calls. Not sure it is really more efficient though."
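
Editorial sketch of the skb_ops idea quoted above, written as a small
userspace model. Only the four operation names (init, fini, update, inherit)
come from the proposal. The simplified struct sock and struct sk_buff, the
wmem_* functions, wmem_ops and skb_truesize_adjust() are hypothetical
stand-ins made up for illustration; none of this is kernel code, and the
fallback for an skb without ops is an assumption the proposal does not spell
out.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins; these are NOT the kernel definitions. */
struct sock {
	long wmem_alloc;	/* models sk->sk_wmem_alloc */
};

struct sk_buff;

/* Hypothetical ops table using the four names from the quoted proposal. */
struct skb_ops {
	void (*init)(struct sk_buff *skb, const struct skb_ops *ops,
		     struct sock *owner);
	void (*fini)(struct sk_buff *skb);
	void (*update)(struct sk_buff *skb, int adjust);
	void (*inherit)(struct sk_buff *skb2, struct sk_buff *skb);
};

struct sk_buff {
	struct sock *sk;
	const struct skb_ops *ops;
	unsigned int truesize;
};

/* init: attach an owner and charge it (role of skb_set_owner_w). */
static void wmem_init(struct sk_buff *skb, const struct skb_ops *ops,
		      struct sock *owner)
{
	assert(owner != NULL);
	skb->sk = owner;
	skb->ops = ops;
	owner->wmem_alloc += skb->truesize;
}

/* fini: uncharge and detach (role of skb_orphan). */
static void wmem_fini(struct sk_buff *skb)
{
	skb->sk->wmem_alloc -= skb->truesize;
	skb->sk = NULL;
	skb->ops = NULL;
}

/* update: change truesize and the owner's charge together. */
static void wmem_update(struct sk_buff *skb, int adjust)
{
	skb->truesize += adjust;
	skb->sk->wmem_alloc += adjust;
}

/* inherit: give skb2 the same kind of ownership as skb. */
static void wmem_inherit(struct sk_buff *skb2, struct sk_buff *skb)
{
	skb->ops->init(skb2, skb->ops, skb->sk);
}

const struct skb_ops wmem_ops = {
	.init = wmem_init,
	.fini = wmem_fini,
	.update = wmem_update,
	.inherit = wmem_inherit,
};

/*
 * Call-site helper. Assumption: an skb without ops is unowned, so only
 * its truesize changes.
 */
void skb_truesize_adjust(struct sk_buff *skb, int adjust)
{
	if (skb->ops)
		skb->ops->update(skb, adjust);
	else
		skb->truesize += adjust;
}
```

In this model a caller that grows the buffer would call
skb_truesize_adjust(skb, size - osize) instead of testing skb->sk and
skb->destructor by hand.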

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET-NEXT] ipv6: skb_expand_head() adjust skb->truesize incorrectly
  2021-08-27 15:23                                                       ` [PATCH NET-NEXT] ipv6: " Vasily Averin
@ 2021-08-27 16:47                                                         ` Eric Dumazet
  2021-08-28  8:01                                                           ` Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Eric Dumazet @ 2021-08-27 16:47 UTC (permalink / raw)
  To: Vasily Averin, Eric Dumazet, Christoph Paasch
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	netdev, LKML, kernel, Julian Wiedmann, Alexey Kuznetsov



On 8/27/21 8:23 AM, Vasily Averin wrote:

> I asked Alexey Kuznetsov to look at this problem. Below is his answer:
> "I think the current scheme is obsolete. It was created
> when we had only two kinds of skb accounting (rmem & wmem)
> and with more kinds of accounting it just does not work.
> Even there we had ignored problems with adjusting accounting.
> 
> Logically the best solution would be replacing ->destructor,
> set_owner* etc with skb_ops. Something like:
> 
> struct skb_ops
> {
>         void init(struct sk_buff * skb, struct skb_ops * ops, struct sock * owner);
>         void fini(struct sk_buff * skb);
>         void update(struct sk_buff * skb, int adjust);
>         void inherit(struct sk_buff * skb2, struct sk_buff * skb);
> };
> 
> init - is replacement for skb_set_owner_r|w
> fini - is replacement for skb_orphan
> update - is new operation to be used in places where skb->truesize changes,
>        instead of awful constructions like:
> 
>        if (!skb->sk || skb->destructor == sock_edemux)
>             skb->truesize += size - osize;
> 
>        Now it will look like:
> 
>        if (skb->ops)
>             skb->ops->update(skb, size - osize);
> 
> inherit - is replacement for also awful constructs like:
> 
>       if (skb->sk)
>             skb_set_owner_w(skb2, skb->sk);
> 
>       Now it will be:
> 
>       if (skb->ops)
>             skb->ops->inherit(skb2, skb);
> 
> The implementation looks mostly obvious.
> Some troubles can be only with new functionality:
> update of accounting was never done before.
> 
> 
> More efficient, functionally equivalent, but uglier and less flexible
> alternative would be removal of ->destructor, replaced with
> a small numeric indicator of ownership:
> 
> enum
> {
>         SKB_OWNER_NONE,  /* aka destructor == NULL */
>         SKB_OWNER_WMEM,  /* aka destructor == sock_wfree */
>         SKB_OWNER_RMEM,  /* aka destructor == sock_rfree */
>         SKB_OWNER_SK,    /* aka destructor == sock_edemux */
>         SKB_OWNER_TCP,   /* aka destructor == tcp_wfree */
> };
> 
> And the same init,fini,inherit,update become functions
> w/o any indirect calls. Not sure it is really more efficient though."
> 

Well, this does not look like stable material, and it would add a bunch
of indirect calls, which are quite expensive these days (CONFIG_RETPOLINE=y).
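
Editorial sketch, for comparison: the enum-based alternative from the quoted
proposal would dispatch with a switch on an ownership tag instead of a
function pointer. The enum values are taken from the quote. The simplified
structs, the owner field and skb_owner_update() are hypothetical, and which
counter each owner type charges is an assumption of this userspace model,
not a statement about the kernel.

```c
#include <stddef.h>

/* Simplified userspace stand-ins; these are NOT the kernel definitions. */
struct sock {
	long wmem_alloc;	/* models write-side accounting */
	long rmem_alloc;	/* models read-side accounting */
};

/* Ownership tags as listed in the quoted proposal. */
enum skb_owner {
	SKB_OWNER_NONE,
	SKB_OWNER_WMEM,
	SKB_OWNER_RMEM,
	SKB_OWNER_SK,
	SKB_OWNER_TCP,
};

struct sk_buff {
	struct sock *sk;
	enum skb_owner owner;	/* hypothetical field replacing ->destructor */
	unsigned int truesize;
};

/* Adjust truesize and the matching owner counter with direct calls only. */
void skb_owner_update(struct sk_buff *skb, int adjust)
{
	skb->truesize += adjust;

	switch (skb->owner) {
	case SKB_OWNER_WMEM:
	case SKB_OWNER_TCP:
		skb->sk->wmem_alloc += adjust;
		break;
	case SKB_OWNER_RMEM:
		skb->sk->rmem_alloc += adjust;
		break;
	case SKB_OWNER_NONE:
	case SKB_OWNER_SK:
		/* Assumption: no memory is charged for these owner types. */
		break;
	}
}
```

The switch keeps every branch visible to the compiler, which is the property
the proposal points to when it says the operations become functions without
indirect calls.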

I suggest we work on a fix using the existing infra, then later
try to refactor if this really brings improvements.

A fix could simply be a revert of 0c9f227bee119 ("ipv6: use skb_expand_head in ip6_xmit"),
since only IPv6 has the problem (because of arbitrary header sizes).



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH NET-NEXT] ipv6: skb_expand_head() adjust skb->truesize incorrectly
  2021-08-27 16:47                                                         ` Eric Dumazet
@ 2021-08-28  8:01                                                           ` Vasily Averin
  0 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-28  8:01 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	netdev, LKML, kernel, Julian Wiedmann, Alexey Kuznetsov

On 8/27/21 7:47 PM, Eric Dumazet wrote:
> 
> 
> On 8/27/21 8:23 AM, Vasily Averin wrote:
> 
>> I asked Alexey Kuznetsov to look at this problem. Below is his answer:
>> "I think the current scheme is obsolete. It was created
>> when we had only two kinds of skb accounting (rmem & wmem)
>> and with more kinds of accounting it just does not work.
>> Even there we had ignored problems with adjusting accounting.
>>
>> Logically the best solution would be replacing ->destructor,
>> set_owner* etc with skb_ops. Something like:
>>
>> struct skb_ops
>> {
>>         void init(struct sk_buff * skb, struct skb_ops * ops, struct sock * owner);
>>         void fini(struct sk_buff * skb);
>>         void update(struct sk_buff * skb, int adjust);
>>         void inherit(struct sk_buff * skb2, struct sk_buff * skb);
>> };
>>
>> init - is replacement for skb_set_owner_r|w
>> fini - is replacement for skb_orphan
>> update - is new operation to be used in places where skb->truesize changes,
>>        instead of awful constructions like:
>>
>>        if (!skb->sk || skb->destructor == sock_edemux)
>>             skb->truesize += size - osize;
>>
>>        Now it will look like:
>>
>>        if (skb->ops)
>>             skb->ops->update(skb, size - osize);
>>
>> inherit - is replacement for also awful constructs like:
>>
>>       if (skb->sk)
>>             skb_set_owner_w(skb2, skb->sk);
>>
>>       Now it will be:
>>
>>       if (skb->ops)
>>             skb->ops->inherit(skb2, skb);
>>
>> The implementation looks mostly obvious.
>> Some troubles can be only with new functionality:
>> update of accounting was never done before.
>>
>>
>> More efficient, functionally equivalent, but uglier and less flexible
>> alternative would be removal of ->destructor, replaced with
>> a small numeric indicator of ownership:
>>
>> enum
>> {
>>         SKB_OWNER_NONE,  /* aka destructor == NULL */
>>         SKB_OWNER_WMEM,  /* aka destructor == sock_wfree */
>>         SKB_OWNER_RMEM,  /* aka destructor == sock_rfree */
>>         SKB_OWNER_SK,    /* aka destructor == sock_edemux */
>>         SKB_OWNER_TCP,   /* aka destructor == tcp_wfree */
>> };
>>
>> And the same init,fini,inherit,update become functions
>> w/o any indirect calls. Not sure it is really more efficient though."
>>
> 
> Well, this does not look like stable material, and it would add a bunch
> of indirect calls, which are quite expensive these days (CONFIG_RETPOLINE=y).
> 
> I suggest we work on a fix using the existing infra, then later
> try to refactor if this really brings improvements.
> 
> A fix could simply be a revert of 0c9f227bee119 ("ipv6: use skb_expand_head in ip6_xmit"),
> since only IPv6 has the problem (because of arbitrary header sizes).

I think it is not enough.

The root of the problem is that skb_expand_head() works incorrectly with a non-shared skb.
In this case it does not call skb_clone() before pskb_expand_head(),
and as a result pskb_expand_head() does not adjust skb->truesize.

I think a non-shared skb is the more frequent case,
so all skb_expand_head() callers are affected.
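
Editorial sketch of the imbalance described above, as a userspace model.
The condition in model_pskb_expand_head() mirrors the (!skb->sk) check
discussed in this thread (the sock_edemux case is left out). The bufsize
field, the model_* function names and the simplified structs are
hypothetical and exist only to make the accounting visible; real truesize
also covers struct overhead that this model ignores.

```c
#include <stddef.h>

/* Simplified userspace stand-ins; these are NOT the kernel definitions. */
struct sock {
	long wmem_alloc;	/* models sk->sk_wmem_alloc */
};

struct sk_buff {
	struct sock *sk;
	unsigned int truesize;	/* what accounting believes the skb costs */
	unsigned int bufsize;	/* hypothetical: memory actually held */
};

/*
 * Models the behaviour discussed in the thread: the buffer always grows,
 * but truesize is only adjusted when the skb has no owner.
 */
void model_pskb_expand_head(struct sk_buff *skb, unsigned int extra)
{
	skb->bufsize += extra;
	if (!skb->sk)
		skb->truesize += extra;
}

/*
 * Models the direction taken later in the thread: adjust truesize in the
 * owned case too, and charge the owner by the same amount.
 */
void model_expand_and_account(struct sk_buff *skb, unsigned int extra)
{
	skb->bufsize += extra;
	skb->truesize += extra;
	if (skb->sk)
		skb->sk->wmem_alloc += extra;
}
```

With an owned, non-shared skb the first function leaves truesize behind
bufsize, which is the state the thread reports after skb_expand_head(). The
second keeps truesize, bufsize and the owner's counter in step.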

Therefore we need to revert my whole patch set in net-next:
f1260ff skbuff: introduce skb_expand_head()
e415ed3 ipv6: use skb_expand_head in ip6_finish_output2
0c9f227 ipv6: use skb_expand_head in ip6_xmit
5678a59 ipv4: use skb_expand_head in ip_finish_output2
14ee70c vrf: use skb_expand_head in vrf_finish_output
53744a4 ax25: use skb_expand_head
a1e975e bpf: use skb_expand_head in bpf_out_neigh_v4/6
07e1d6b Merge branch 'skb_expand_head'
with fixup
06669e6 vrf: fix NULL dereference in vrf_finish_output()

And then rework ip6_finish_output2() upstream
to call skb_realloc_headroom(), as was done in the first patch version:
https://lkml.org/lkml/2021/7/7/469.

Thank you,
	Vasily Averin



* [PATCH v2] skb_expand_head() adjust skb->truesize incorrectly
  2021-08-25 17:49                                                           ` Christoph Paasch
@ 2021-08-29 12:59                                                             ` Vasily Averin
  2021-08-30  5:52                                                               ` [PATCH net-next " Vasily Averin
  2021-08-30 16:01                                                               ` [PATCH " Eric Dumazet
  0 siblings, 2 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-29 12:59 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Julian Wiedmann

Christoph Paasch reports [1] about incorrect skb->truesize
after skb_expand_head() call in ip6_xmit.
This may happen because of two reasons:
- skb_set_owner_w() for newly cloned skb is called too early,
before pskb_expand_head() where truesize is adjusted for (!skb->sk) case.
- pskb_expand_head() does not adjust truesize in (skb->sk) case.
In this case sk->sk_wmem_alloc should be adjusted too.

[1] https://lkml.org/lkml/2021/8/20/1082

Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
v2: based on patch version from Eric Dumazet,
    added __pskb_expand_head() function, which can be forced
    to adjust skb->truesize and sk->sk_wmem_alloc.
---
 net/core/skbuff.c | 43 +++++++++++++++++++++++++++++--------------
 1 file changed, 29 insertions(+), 14 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f931176..4691023 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1681,10 +1681,10 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
  *	reloaded after call to this function.
  */
 
-int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
-		     gfp_t gfp_mask)
+static int __pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
+			      gfp_t gfp_mask, bool update_truesize)
 {
-	int i, osize = skb_end_offset(skb);
+	int delta, i, osize = skb_end_offset(skb);
 	int size = osize + nhead + ntail;
 	long off;
 	u8 *data;
@@ -1756,9 +1756,13 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	 * For the moment, we really care of rx path, or
 	 * when skb is orphaned (not attached to a socket).
 	 */
-	if (!skb->sk || skb->destructor == sock_edemux)
-		skb->truesize += size - osize;
-
+	delta = size - osize;
+	if (!skb->sk || skb->destructor == sock_edemux) {
+		skb->truesize += delta;
+	} else if (update_truesize) {
+		refcount_add(delta, &skb->sk->sk_wmem_alloc);
+		skb->truesize += delta;
+	}
 	return 0;
 
 nofrags:
@@ -1766,6 +1770,12 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 nodata:
 	return -ENOMEM;
 }
+
+int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
+		     gfp_t gfp_mask)
+{
+	return __pskb_expand_head(skb, nhead, ntail, gfp_mask, false);
+}
 EXPORT_SYMBOL(pskb_expand_head);
 
 /* Make private copy of skb with writable head and some headroom */
@@ -1804,28 +1814,33 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
 struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
 {
 	int delta = headroom - skb_headroom(skb);
+	struct sk_buff *oskb = NULL;
 
 	if (WARN_ONCE(delta <= 0,
 		      "%s is expecting an increase in the headroom", __func__))
 		return skb;
 
+	delta = SKB_DATA_ALIGN(delta);
 	/* pskb_expand_head() might crash, if skb is shared */
 	if (skb_shared(skb)) {
 		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
 
-		if (likely(nskb)) {
-			if (skb->sk)
-				skb_set_owner_w(nskb, skb->sk);
-			consume_skb(skb);
-		} else {
+		if (unlikely(!nskb)) {
 			kfree_skb(skb);
+			return NULL;
 		}
+		oskb = skb;
 		skb = nskb;
 	}
-	if (skb &&
-	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
+	if (__pskb_expand_head(skb, delta, 0, GFP_ATOMIC, true)) {
 		kfree_skb(skb);
-		skb = NULL;
+		kfree_skb(oskb);
+		return NULL;
+	}
+	if (oskb) {
+		if (oskb->sk)
+			skb_set_owner_w(skb, oskb->sk);
+		consume_skb(oskb);
 	}
 	return skb;
 }
-- 
1.8.3.1



* Re: [PATCH net-next v2] skb_expand_head() adjust skb->truesize incorrectly
  2021-08-29 12:59                                                             ` [PATCH v2] " Vasily Averin
@ 2021-08-30  5:52                                                               ` Vasily Averin
  2021-08-30 16:01                                                               ` [PATCH " Eric Dumazet
  1 sibling, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-30  5:52 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Julian Wiedmann, Alexey Kuznetsov

1) I forgot to specify that the patch is intended for the net-next tree.
2) I forgot to add Alexey Kuznetsov to Cc. I resent the patch to him
  in a separate letter and received his consent.
3) I forgot to add the Fixes tag:
Fixes: f1260ff15a71 ("skbuff: introduce skb_expand_head()")

Thank you,
	Vasily Averin

On 8/29/21 3:59 PM, Vasily Averin wrote:
> Christoph Paasch reports [1] about incorrect skb->truesize
> after skb_expand_head() call in ip6_xmit.
> This may happen because of two reasons:
> - skb_set_owner_w() for newly cloned skb is called too early,
> before pskb_expand_head() where truesize is adjusted for (!skb->sk) case.
> - pskb_expand_head() does not adjust truesize in (skb->sk) case.
> In this case sk->sk_wmem_alloc should be adjusted too.
> 
> [1] https://lkml.org/lkml/2021/8/20/1082
> 
> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
> v2: based on patch version from Eric Dumazet,
>     added __pskb_expand_head() function, which can be forced
>     to adjust skb->truesize and sk->sk_wmem_alloc.
> ---
>  net/core/skbuff.c | 43 +++++++++++++++++++++++++++++--------------
>  1 file changed, 29 insertions(+), 14 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f931176..4691023 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1681,10 +1681,10 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
>   *	reloaded after call to this function.
>   */
>  
> -int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> -		     gfp_t gfp_mask)
> +static int __pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> +			      gfp_t gfp_mask, bool update_truesize)
>  {
> -	int i, osize = skb_end_offset(skb);
> +	int delta, i, osize = skb_end_offset(skb);
>  	int size = osize + nhead + ntail;
>  	long off;
>  	u8 *data;
> @@ -1756,9 +1756,13 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  	 * For the moment, we really care of rx path, or
>  	 * when skb is orphaned (not attached to a socket).
>  	 */
> -	if (!skb->sk || skb->destructor == sock_edemux)
> -		skb->truesize += size - osize;
> -
> +	delta = size - osize;
> +	if (!skb->sk || skb->destructor == sock_edemux) {
> +		skb->truesize += delta;
> +	} else if (update_truesize) {
> +		refcount_add(delta, &skb->sk->sk_wmem_alloc);
> +		skb->truesize += delta;
> +	}
>  	return 0;
>  
>  nofrags:
> @@ -1766,6 +1770,12 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  nodata:
>  	return -ENOMEM;
>  }
> +
> +int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> +		     gfp_t gfp_mask)
> +{
> +	return __pskb_expand_head(skb, nhead, ntail, gfp_mask, false);
> +}
>  EXPORT_SYMBOL(pskb_expand_head);
>  
>  /* Make private copy of skb with writable head and some headroom */
> @@ -1804,28 +1814,33 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>  {
>  	int delta = headroom - skb_headroom(skb);
> +	struct sk_buff *oskb = NULL;
>  
>  	if (WARN_ONCE(delta <= 0,
>  		      "%s is expecting an increase in the headroom", __func__))
>  		return skb;
>  
> +	delta = SKB_DATA_ALIGN(delta);
>  	/* pskb_expand_head() might crash, if skb is shared */
>  	if (skb_shared(skb)) {
>  		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>  
> -		if (likely(nskb)) {
> -			if (skb->sk)
> -				skb_set_owner_w(nskb, skb->sk);
> -			consume_skb(skb);
> -		} else {
> +		if (unlikely(!nskb)) {
>  			kfree_skb(skb);
> +			return NULL;
>  		}
> +		oskb = skb;
>  		skb = nskb;
>  	}
> -	if (skb &&
> -	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
> +	if (__pskb_expand_head(skb, delta, 0, GFP_ATOMIC, true)) {
>  		kfree_skb(skb);
> -		skb = NULL;
> +		kfree_skb(oskb);
> +		return NULL;
> +	}
> +	if (oskb) {
> +		if (oskb->sk)
> +			skb_set_owner_w(skb, oskb->sk);
> +		consume_skb(oskb);
>  	}
>  	return skb;
>  }
> 



* Re: [PATCH v2] skb_expand_head() adjust skb->truesize incorrectly
  2021-08-29 12:59                                                             ` [PATCH v2] " Vasily Averin
  2021-08-30  5:52                                                               ` [PATCH net-next " Vasily Averin
@ 2021-08-30 16:01                                                               ` Eric Dumazet
  2021-08-30 18:09                                                                 ` Vasily Averin
  1 sibling, 1 reply; 106+ messages in thread
From: Eric Dumazet @ 2021-08-30 16:01 UTC (permalink / raw)
  To: Vasily Averin, Christoph Paasch, Eric Dumazet, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Julian Wiedmann



On 8/29/21 5:59 AM, Vasily Averin wrote:
> Christoph Paasch reports [1] about incorrect skb->truesize
> after skb_expand_head() call in ip6_xmit.
> This may happen because of two reasons:
> - skb_set_owner_w() for newly cloned skb is called too early,
> before pskb_expand_head() where truesize is adjusted for (!skb->sk) case.
> - pskb_expand_head() does not adjust truesize in (skb->sk) case.
> In this case sk->sk_wmem_alloc should be adjusted too.
> 
> [1] https://lkml.org/lkml/2021/8/20/1082
> 
> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
> v2: based on patch version from Eric Dumazet,
>     added __pskb_expand_head() function, which can be forced
>     to adjust skb->truesize and sk->sk_wmem_alloc.
> ---
>  net/core/skbuff.c | 43 +++++++++++++++++++++++++++++--------------
>  1 file changed, 29 insertions(+), 14 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f931176..4691023 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1681,10 +1681,10 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
>   *	reloaded after call to this function.
>   */
>  
> -int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> -		     gfp_t gfp_mask)
> +static int __pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> +			      gfp_t gfp_mask, bool update_truesize)
>  {
> -	int i, osize = skb_end_offset(skb);
> +	int delta, i, osize = skb_end_offset(skb);
>  	int size = osize + nhead + ntail;
>  	long off;
>  	u8 *data;
> @@ -1756,9 +1756,13 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  	 * For the moment, we really care of rx path, or
>  	 * when skb is orphaned (not attached to a socket).
>  	 */
> -	if (!skb->sk || skb->destructor == sock_edemux)
> -		skb->truesize += size - osize;
> -
> +	delta = size - osize;
> +	if (!skb->sk || skb->destructor == sock_edemux) {
> +		skb->truesize += delta;
> +	} else if (update_truesize) {

Unfortunately we can not always do this sk_wmem_alloc change here.

Some skb have skb->sk set, but the 'reference on socket' is not through sk_wmem_alloc

It seems you need a helper to make sure skb->destructor is one of
the destructors that use skb->truesize and sk->sk_wmem_alloc

For instance, skb_orphan_partial() could have been used.



> +		refcount_add(delta, &skb->sk->sk_wmem_alloc);
> +		skb->truesize += delta;
> +	}
>  	return 0;
>  
>  nofrags:
> @@ -1766,6 +1770,12 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  nodata:
>  	return -ENOMEM;
>  }
> +
> +int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> +		     gfp_t gfp_mask)
> +{
> +	return __pskb_expand_head(skb, nhead, ntail, gfp_mask, false);
> +}
>  EXPORT_SYMBOL(pskb_expand_head);
>  
>  /* Make private copy of skb with writable head and some headroom */
> @@ -1804,28 +1814,33 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>  {
>  	int delta = headroom - skb_headroom(skb);
> +	struct sk_buff *oskb = NULL;
>  
>  	if (WARN_ONCE(delta <= 0,
>  		      "%s is expecting an increase in the headroom", __func__))
>  		return skb;
>  
> +	delta = SKB_DATA_ALIGN(delta);
>  	/* pskb_expand_head() might crash, if skb is shared */
>  	if (skb_shared(skb)) {
>  		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>  
> -		if (likely(nskb)) {
> -			if (skb->sk)
> -				skb_set_owner_w(nskb, skb->sk);
> -			consume_skb(skb);
> -		} else {
> +		if (unlikely(!nskb)) {
>  			kfree_skb(skb);
> +			return NULL;
>  		}
> +		oskb = skb;
>  		skb = nskb;
>  	}
> -	if (skb &&
> -	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
> +	if (__pskb_expand_head(skb, delta, 0, GFP_ATOMIC, true)) {
>  		kfree_skb(skb);
> -		skb = NULL;
> +		kfree_skb(oskb);
> +		return NULL;
> +	}
> +	if (oskb) {
> +		if (oskb->sk)
> +			skb_set_owner_w(skb, oskb->sk);
> +		consume_skb(oskb);
>  	}
>  	return skb;
>  }
> 


* Re: [PATCH v2] skb_expand_head() adjust skb->truesize incorrectly
  2021-08-30 16:01                                                               ` [PATCH " Eric Dumazet
@ 2021-08-30 18:09                                                                 ` Vasily Averin
  2021-08-30 18:37                                                                   ` Vasily Averin
  2021-08-30 19:58                                                                   ` Eric Dumazet
  0 siblings, 2 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-30 18:09 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Julian Wiedmann

On 8/30/21 7:01 PM, Eric Dumazet wrote:
> On 8/29/21 5:59 AM, Vasily Averin wrote:
>> Christoph Paasch reports [1] about incorrect skb->truesize
>> after skb_expand_head() call in ip6_xmit.
>> This may happen because of two reasons:
>> - skb_set_owner_w() for newly cloned skb is called too early,
>> before pskb_expand_head() where truesize is adjusted for (!skb->sk) case.
>> - pskb_expand_head() does not adjust truesize in (skb->sk) case.
>> In this case sk->sk_wmem_alloc should be adjusted too.
>>
>> [1] https://lkml.org/lkml/2021/8/20/1082
>> @@ -1756,9 +1756,13 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>  	 * For the moment, we really care of rx path, or
>>  	 * when skb is orphaned (not attached to a socket).
>>  	 */
>> -	if (!skb->sk || skb->destructor == sock_edemux)
>> -		skb->truesize += size - osize;
>> -
>> +	delta = size - osize;
>> +	if (!skb->sk || skb->destructor == sock_edemux) {
>> +		skb->truesize += delta;
>> +	} else if (update_truesize) {
> 
> Unfortunately we can not always do this sk_wmem_alloc change here.
> 
> Some skb have skb->sk set, but the 'reference on socket' is not through sk_wmem_alloc

Could you please provide an example?
In the past, in all handled cases, we cloned the original skb and then unconditionally assigned the clone the sock_wfree destructor.
Do you want to say that it worked correctly somehow?

I expected that if we set sock_wfree, we have a guarantee that the old skb adjusted sk_wmem_alloc.
Am I wrong?
Could you please point to such a case?

> It seems you need a helper to make sure skb->destructor is one of
> the destructors that use skb->truesize and sk->sk_wmem_alloc
> 
> For instance, skb_orphan_partial() could have been used.

Thank you, will investigate.
	Vasily Averin


* Re: [PATCH v2] skb_expand_head() adjust skb->truesize incorrectly
  2021-08-30 18:09                                                                 ` Vasily Averin
@ 2021-08-30 18:37                                                                   ` Vasily Averin
  2021-08-30 19:58                                                                   ` Eric Dumazet
  1 sibling, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-08-30 18:37 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Julian Wiedmann

On 8/30/21 9:09 PM, Vasily Averin wrote:
> On 8/30/21 7:01 PM, Eric Dumazet wrote:
>> On 8/29/21 5:59 AM, Vasily Averin wrote:
>>> Christoph Paasch reports [1] about incorrect skb->truesize
>>> after skb_expand_head() call in ip6_xmit.
>>> This may happen because of two reasons:
>>> - skb_set_owner_w() for newly cloned skb is called too early,
>>> before pskb_expand_head() where truesize is adjusted for (!skb->sk) case.
>>> - pskb_expand_head() does not adjust truesize in (skb->sk) case.
>>> In this case sk->sk_wmem_alloc should be adjusted too.
>>>
>>> [1] https://lkml.org/lkml/2021/8/20/1082
>>> @@ -1756,9 +1756,13 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>>  	 * For the moment, we really care of rx path, or
>>>  	 * when skb is orphaned (not attached to a socket).
>>>  	 */
>>> -	if (!skb->sk || skb->destructor == sock_edemux)
>>> -		skb->truesize += size - osize;
>>> -
>>> +	delta = size - osize;
>>> +	if (!skb->sk || skb->destructor == sock_edemux) {
>>> +		skb->truesize += delta;
>>> +	} else if (update_truesize) {
>>
>> Unfortunately we can not always do this sk_wmem_alloc change here.
>>
>> Some skb have skb->sk set, but the 'reference on socket' is not through sk_wmem_alloc
> 
> Could you please provide an example?
> In the past, in all handled cases, we cloned the original skb and then unconditionally assigned the clone the sock_wfree destructor.
> Do you want to say that it worked correctly somehow?
> 
> I expected that if we set sock_wfree, we have a guarantee that the old skb adjusted sk_wmem_alloc.
> Am I wrong?
> Could you please point to such a case?

However, if that is true, it is not enough to adjust sk_wmem_alloc for the proper destructors,
because other destructors may require something else to be done.
In that case I can check the destructor first and clone the skb before the pskb_expand_head() call,
as was done before.

>> It seems you need a helper to make sure skb->destructor is one of
>> the destructors that use skb->truesize and sk->sk_wmem_alloc
>>
>> For instance, skb_orphan_partial() could have been used.
> 
> Thank you, will investigate.
> 	Vasily Averin
> 



* Re: [PATCH v2] skb_expand_head() adjust skb->truesize incorrectly
  2021-08-30 18:09                                                                 ` Vasily Averin
  2021-08-30 18:37                                                                   ` Vasily Averin
@ 2021-08-30 19:58                                                                   ` Eric Dumazet
  2021-08-31 14:34                                                                     ` [PATCH net-next v3 RFC] " Vasily Averin
  1 sibling, 1 reply; 106+ messages in thread
From: Eric Dumazet @ 2021-08-30 19:58 UTC (permalink / raw)
  To: Vasily Averin, Eric Dumazet, Christoph Paasch, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Julian Wiedmann



On 8/30/21 11:09 AM, Vasily Averin wrote:
> On 8/30/21 7:01 PM, Eric Dumazet wrote:
>> On 8/29/21 5:59 AM, Vasily Averin wrote:
>>> Christoph Paasch reports [1] about incorrect skb->truesize
>>> after skb_expand_head() call in ip6_xmit.
>>> This may happen because of two reasons:
>>> - skb_set_owner_w() for newly cloned skb is called too early,
>>> before pskb_expand_head() where truesize is adjusted for (!skb->sk) case.
>>> - pskb_expand_head() does not adjust truesize in (skb->sk) case.
>>> In this case sk->sk_wmem_alloc should be adjusted too.
>>>
>>> [1] https://lkml.org/lkml/2021/8/20/1082
>>> @@ -1756,9 +1756,13 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>>  	 * For the moment, we really care of rx path, or
>>>  	 * when skb is orphaned (not attached to a socket).
>>>  	 */
>>> -	if (!skb->sk || skb->destructor == sock_edemux)
>>> -		skb->truesize += size - osize;
>>> -
>>> +	delta = size - osize;
>>> +	if (!skb->sk || skb->destructor == sock_edemux) {
>>> +		skb->truesize += delta;
>>> +	} else if (update_truesize) {
>>
>> Unfortunately we can not always do this sk_wmem_alloc change here.
>>
>> Some skb have skb->sk set, but the 'reference on socket' is not through sk_wmem_alloc
> 
> Could you please provide an example?
> In the past, in all handled cases, we cloned the original skb and then unconditionally assigned the clone the sock_wfree destructor.

In the past we ignored the old value of skb->destructor,
since the clone got a NULL destructor.

Your patch assumes it is sock_wfree, or another destructor that changes sk_wmem_alloc.


You need to make sure skb->destructor is one of the known destructors which 
will basically remove skb->truesize from sk->sk_wmem_alloc.

This will also make sure skb->sk is a 'full socket'

If not, you should not change sk->sk_wmem_alloc
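
As a rough illustration of that check, here is a userspace sketch, not
kernel code. Every toy_* name is hypothetical; toy_wfree() stands in
for a destructor that removes truesize from the socket counter, and
toy_other_destructor() for one that holds its socket reference some
other way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_skb;
typedef void (*toy_destructor_t)(struct toy_skb *skb);

/* Toy stand-ins; not the kernel types. */
struct toy_sock {
	long wmem_alloc;	/* models sk->sk_wmem_alloc */
};

struct toy_skb {
	struct toy_sock *sk;
	long truesize;
	toy_destructor_t destructor;
};

/* Stand-in for a sock_wfree()-style destructor: uncharges truesize. */
void toy_wfree(struct toy_skb *skb)
{
	skb->sk->wmem_alloc -= skb->truesize;
}

/* Stand-in for a destructor that does not touch wmem_alloc. */
void toy_other_destructor(struct toy_skb *skb)
{
	(void)skb;
}

/* True only for destructors known to uncharge truesize from wmem_alloc. */
bool toy_destructor_uses_wmem(const struct toy_skb *skb)
{
	return skb->destructor == toy_wfree;
}

/*
 * Charge delta extra bytes to the owning socket, but only when the
 * destructor will later uncharge them again. Returns whether the
 * accounting was changed.
 */
bool toy_charge_growth(struct toy_skb *skb, long delta)
{
	assert(delta > 0);
	if (!skb->sk || !toy_destructor_uses_wmem(skb))
		return false;
	skb->truesize += delta;
	skb->sk->wmem_alloc += delta;
	return true;
}
```

In this model a buffer with the known destructor stays balanced after
it is freed, and a buffer with any other destructor leaves the socket
counter untouched.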

> Do you want to say that it worked correctly somehow?

I am simply saying your patch adds a wrong assumption.

> 
> I expected if we set sock_wfree, we have guarantee that old skb adjusted sk_wmem_alloc.
> Am I wrong?
> Could you please point on such case?
> 
>> It seems you need a helper to make sure skb->destructor is one of
>> the destructors that use skb->truesize and sk->sk_wmem_alloc
>>
>> For instance, skb_orphan_partial() could have been used.
> 
> Thank you, will investigate.
> 	Vasily Averin
> 


* [PATCH net-next v3 RFC] skb_expand_head() adjust skb->truesize incorrectly
  2021-08-30 19:58                                                                   ` Eric Dumazet
@ 2021-08-31 14:34                                                                     ` Vasily Averin
  2021-08-31 19:38                                                                       ` Eric Dumazet
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-08-31 14:34 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Julian Wiedmann, Alexey Kuznetsov

RFC because it has extra changes:
the new is_skb_wmem() helper can be called
 - either before pskb_expand_head(), to create skb clones
    for skbs with destructors that do not change sk->sk_wmem_alloc
 - or after pskb_expand_head(), to change the owner in skb_set_owner_w()

In the current patch I've added both ways;
we need to keep only one of them.
---
Christoph Paasch reports [1] about incorrect skb->truesize
after skb_expand_head() call in ip6_xmit.
This may happen because of two reasons:
- skb_set_owner_w() for newly cloned skb is called too early,
before pskb_expand_head() where truesize is adjusted for (!skb->sk) case.
- pskb_expand_head() does not adjust truesize in (skb->sk) case.
In this case sk->sk_wmem_alloc should be adjusted too.

[1] https://lkml.org/lkml/2021/8/20/1082

Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
v3: removed __pskb_expand_head(),
    added is_skb_wmem() helper for skb with wmem-compatible destructors
    there are 2 ways to use it:
     - before pskb_expand_head(), to create skb clones
     - after pskb_expand_head(), to change owner on extended skb.
v2: based on patch version from Eric Dumazet,
    added __pskb_expand_head() function, which can be forced
    to adjust skb->truesize and sk->sk_wmem_alloc.
---
 include/net/sock.h |  1 +
 net/core/skbuff.c  | 39 ++++++++++++++++++++++++++++-----------
 net/core/sock.c    |  8 ++++++++
 3 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 95b2577..173d58c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1695,6 +1695,7 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
 			     gfp_t priority);
 void __sock_wfree(struct sk_buff *skb);
 void sock_wfree(struct sk_buff *skb);
+bool is_skb_wmem(const struct sk_buff *skb);
 struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
 			     gfp_t priority);
 void skb_orphan_partial(struct sk_buff *skb);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f931176..3ce33f2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1804,30 +1804,47 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
 struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
 {
 	int delta = headroom - skb_headroom(skb);
+	int osize = skb_end_offset(skb);
+	struct sk_buff *oskb = NULL;
+	struct sock *sk = skb->sk;
 
 	if (WARN_ONCE(delta <= 0,
 		      "%s is expecting an increase in the headroom", __func__))
 		return skb;
 
-	/* pskb_expand_head() might crash, if skb is shared */
-	if (skb_shared(skb)) {
+	delta = SKB_DATA_ALIGN(delta);
+	/* pskb_expand_head() might crash, if skb is shared.
+	 * Also we should clone skb if its destructor does
+	 * not adjust skb->truesize and sk->sk_wmem_alloc
+ 	 */
+	if (skb_shared(skb) ||
+	    (sk && (!sk_fullsock(sk) || !is_skb_wmem(skb)))) {
 		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
 
-		if (likely(nskb)) {
-			if (skb->sk)
-				skb_set_owner_w(nskb, skb->sk);
-			consume_skb(skb);
-		} else {
+		if (unlikely(!nskb)) {
 			kfree_skb(skb);
+			return NULL;
 		}
+		oskb = skb;
 		skb = nskb;
 	}
-	if (skb &&
-	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
+	if (pskb_expand_head(skb, delta, 0, GFP_ATOMIC)) {
 		kfree_skb(skb);
-		skb = NULL;
+		kfree_skb(oskb);
+		return NULL;
 	}
-	return skb;
+	if (oskb) {
+		if (sk)
+			skb_set_owner_w(skb, sk);
+		consume_skb(oskb);
+	} else if (sk) {
+		delta = osize - skb_end_offset(skb);
+		if (!is_skb_wmem(skb))
+			skb_set_owner_w(skb, sk);
+		skb->truesize += delta;
+		if (sk_fullsock(sk))
+			refcount_add(delta, &sk->sk_wmem_alloc);
+	}	return skb;
 }
 EXPORT_SYMBOL(skb_expand_head);
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 950f1e7..0315dcb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2227,6 +2227,14 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 }
 EXPORT_SYMBOL(skb_set_owner_w);
 
+bool is_skb_wmem(const struct sk_buff *skb)
+{
+	return (skb->destructor == sock_wfree ||
+		skb->destructor == __sock_wfree ||
+		(IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree));
+}
+EXPORT_SYMBOL(is_skb_wmem);
+
 static bool can_skb_orphan_partial(const struct sk_buff *skb)
 {
 #ifdef CONFIG_TLS_DEVICE
-- 
1.8.3.1



* Re: [PATCH net-next v3 RFC] skb_expand_head() adjust skb->truesize incorrectly
  2021-08-31 14:34                                                                     ` [PATCH net-next v3 RFC] " Vasily Averin
@ 2021-08-31 19:38                                                                       ` Eric Dumazet
  2021-09-01  6:20                                                                         ` Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Eric Dumazet @ 2021-08-31 19:38 UTC (permalink / raw)
  To: Vasily Averin, Christoph Paasch, Eric Dumazet, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Julian Wiedmann, Alexey Kuznetsov



On 8/31/21 7:34 AM, Vasily Averin wrote:
> RFC because it has extra changes:
> the new is_skb_wmem() helper can be called
>  - either before pskb_expand_head(), to create skb clones
>     for skbs with destructors that do not change sk->sk_wmem_alloc
>  - or after pskb_expand_head(), to change the owner in skb_set_owner_w()
> 
> In the current patch I've added both ways;
> we need to keep only one of them.
> ---
> Christoph Paasch reports [1] about incorrect skb->truesize
> after skb_expand_head() call in ip6_xmit.
> This may happen because of two reasons:
> - skb_set_owner_w() for newly cloned skb is called too early,
> before pskb_expand_head() where truesize is adjusted for (!skb->sk) case.
> - pskb_expand_head() does not adjust truesize in (skb->sk) case.
> In this case sk->sk_wmem_alloc should be adjusted too.
> 
> [1] https://lkml.org/lkml/2021/8/20/1082
> 
> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
> v3: removed __pskb_expand_head(),
>     added is_skb_wmem() helper for skb with wmem-compatible destructors
>     there are 2 ways to use it:
>      - before pskb_expand_head(), to create skb clones
>      - after pskb_expand_head(), to change owner on extended skb.
> v2: based on patch version from Eric Dumazet,
>     added __pskb_expand_head() function, which can be forced
>     to adjust skb->truesize and sk->sk_wmem_alloc.
> ---
>  include/net/sock.h |  1 +
>  net/core/skbuff.c  | 39 ++++++++++++++++++++++++++++-----------
>  net/core/sock.c    |  8 ++++++++
>  3 files changed, 37 insertions(+), 11 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 95b2577..173d58c 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1695,6 +1695,7 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
>  			     gfp_t priority);
>  void __sock_wfree(struct sk_buff *skb);
>  void sock_wfree(struct sk_buff *skb);
> +bool is_skb_wmem(const struct sk_buff *skb);
>  struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
>  			     gfp_t priority);
>  void skb_orphan_partial(struct sk_buff *skb);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f931176..3ce33f2 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1804,30 +1804,47 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>  {
>  	int delta = headroom - skb_headroom(skb);
> +	int osize = skb_end_offset(skb);
> +	struct sk_buff *oskb = NULL;
> +	struct sock *sk = skb->sk;
>  
>  	if (WARN_ONCE(delta <= 0,
>  		      "%s is expecting an increase in the headroom", __func__))
>  		return skb;
>  
> -	/* pskb_expand_head() might crash, if skb is shared */
> -	if (skb_shared(skb)) {
> +	delta = SKB_DATA_ALIGN(delta);
> +	/* pskb_expand_head() might crash, if skb is shared.
> +	 * Also we should clone skb if its destructor does
> +	 * not adjust skb->truesize and sk->sk_wmem_alloc
> + 	 */
> +	if (skb_shared(skb) ||
> +	    (sk && (!sk_fullsock(sk) || !is_skb_wmem(skb)))) {

is_skb_wmem() is only possibly true for full sockets by definition.

So the (sk_fullsock(sk) && is_skb_wmem(skb)) can be reduced to is_skb_wmem(skb)
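
The reduction above can be checked with a small userspace model. This is only a sketch: struct model_skb and its three flags are hypothetical stand-ins for skb->sk, sk_fullsock() and is_skb_wmem(), and the skb_shared() term of the condition is left out. It assumes, as stated above, that a wmem destructor never appears on a non-full socket.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel predicates; not kernel code. */
struct model_skb {
	bool has_sk;	/* models skb->sk != NULL */
	bool fullsock;	/* models sk_fullsock(skb->sk) */
	bool wmem_dtor;	/* models is_skb_wmem(skb) */
};

/* Clone condition as written in the v3 hunk, minus skb_shared(). */
static bool need_clone_v3(const struct model_skb *s)
{
	assert(s);
	return s->has_sk && (!s->fullsock || !s->wmem_dtor);
}

/* Reduced form: the sk_fullsock() test is dropped. */
static bool need_clone_reduced(const struct model_skb *s)
{
	assert(s);
	return s->has_sk && !s->wmem_dtor;
}

/* True when both forms agree on every state allowed by the assumption. */
static bool forms_agree(void)
{
	for (int bits = 0; bits < 8; bits++) {
		struct model_skb s = {
			.has_sk = (bits & 1) != 0,
			.fullsock = (bits & 2) != 0,
			.wmem_dtor = (bits & 4) != 0,
		};

		/* Excluded: wmem destructor on a non-full socket. */
		if (s.wmem_dtor && !s.fullsock)
			continue;
		if (need_clone_v3(&s) != need_clone_reduced(&s))
			return false;
	}
	return true;
}
```

The two forms differ only in the excluded state, so the simplification holds exactly as long as that assumption does.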

>  		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>  
> -		if (likely(nskb)) {
> -			if (skb->sk)
> -				skb_set_owner_w(nskb, skb->sk);
> -			consume_skb(skb);
> -		} else {
> +		if (unlikely(!nskb)) {
>  			kfree_skb(skb);
> +			return NULL;
>  		}
> +		oskb = skb;
>  		skb = nskb;
>  	}
> -	if (skb &&
> -	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
> +	if (pskb_expand_head(skb, delta, 0, GFP_ATOMIC)) {
>  		kfree_skb(skb);
> -		skb = NULL;
> +		kfree_skb(oskb);
> +		return NULL;
>  	}
> -	return skb;
> +	if (oskb) {
> +		if (sk)
> +			skb_set_owner_w(skb, sk);

Broken for non full sockets.
Calling skb_set_owner_w(skb, sk) for them is a bug.
> +		consume_skb(oskb);
> +	} else if (sk) {
> +		delta = osize - skb_end_offset(skb);
> +		if (!is_skb_wmem(skb))
> +			skb_set_owner_w(skb, sk);

This would be broken for non full sockets.
Calling skb_set_owner_w(skb, sk) for them is a bug.

> +		skb->truesize += delta;
> +		if (sk_fullsock(sk))
> +			refcount_add(delta, &sk->sk_wmem_alloc);


> +	}	return skb;
>  }
>  EXPORT_SYMBOL(skb_expand_head);
>  
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 950f1e7..0315dcb 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2227,6 +2227,14 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
>  }
>  EXPORT_SYMBOL(skb_set_owner_w);
>  
> +bool is_skb_wmem(const struct sk_buff *skb)
> +{
> +	return (skb->destructor == sock_wfree ||
> +		skb->destructor == __sock_wfree ||
> +		(IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree));

No need for the parentheses in 'return (EXPRESSION);'.

You can simply write: return EXPRESSION;

i.e.
   return skb->destructor == sock_wfree ||
          skb->destructor == __sock_wfree ||
          (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree);


> +}
> +EXPORT_SYMBOL(is_skb_wmem);
> +
>  static bool can_skb_orphan_partial(const struct sk_buff *skb)
>  {
>  #ifdef CONFIG_TLS_DEVICE
> 


* Re: [PATCH net-next v3 RFC] skb_expand_head() adjust skb->truesize incorrectly
  2021-08-31 19:38                                                                       ` Eric Dumazet
@ 2021-09-01  6:20                                                                         ` Vasily Averin
  2021-09-01  8:11                                                                           ` [PATCH net-next v4] " Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-09-01  6:20 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Julian Wiedmann, Alexey Kuznetsov

On 8/31/21 10:38 PM, Eric Dumazet wrote:
> On 8/31/21 7:34 AM, Vasily Averin wrote:
>> RFC because it has extra changes:
>> new is_skb_wmem() helper can be called
>>  - either before pskb_expand_head(), to create skb clones
>>     for skbs with destructors that do not change sk->sk_wmem_alloc
>>  - or after pskb_expand_head(), to change owner in skb_set_owner_w()
>>
>> In current patch I've added both these ways,
>> we need to keep one of them.

If nobody objects, I vote for the 2nd way:

>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index f931176..3ce33f2 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -1804,30 +1804,47 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>  {
... skipped ...
>> -	return skb;
>> +	if (oskb) {
>> +		if (sk)
>> +			skb_set_owner_w(skb, sk);
> 
> Broken for non full sockets.
> Calling skb_set_owner_w(skb, sk) for them is a bug.

I think you're wrong here.
It is 100% equivalent to the old code:
skb_set_owner_w() checks sk_fullsock(sk) internally and does not adjust
sk->sk_wmem_alloc for non-full sockets.
Please explain if I'm wrong.
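
The behaviour described here can be sketched as a userspace model. This is an illustration under assumptions, not the kernel source: the model_* types and the enum are hypothetical, skb_orphan() and hash setup are omitted, and the branch layout follows my reading of skb_set_owner_w() in net/core/sock.c at the time, which should be checked against the tree.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical destructor tags standing in for function pointers. */
enum model_dtor { DTOR_NONE, DTOR_SOCK_WFREE, DTOR_SOCK_EDEMUX };

struct model_sock {
	bool fullsock;			/* models sk_fullsock(sk) */
	unsigned int wmem_alloc;	/* models sk->sk_wmem_alloc */
	unsigned int refcnt;		/* models sk->sk_refcnt */
};

struct model_skb {
	struct model_sock *sk;
	enum model_dtor dtor;
	unsigned int truesize;
};

/* Model of the ownership hand-over described in the message above. */
static void model_set_owner_w(struct model_skb *skb, struct model_sock *sk)
{
	assert(skb && sk);
	skb->sk = sk;
	if (!sk->fullsock) {
		/* Non-full socket: take a reference, no wmem charge. */
		skb->dtor = DTOR_SOCK_EDEMUX;
		sk->refcnt++;
		return;
	}
	/* Full socket: charge the current truesize to the socket. */
	skb->dtor = DTOR_SOCK_WFREE;
	sk->wmem_alloc += skb->truesize;
}
```

In this model a non-full socket ends up with the edemux-style destructor and an untouched wmem counter, which is the case the two sides of the review disagree about.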

>> +		consume_skb(oskb);
>> +	} else if (sk) {
>> +		delta = osize - skb_end_offset(skb);
>> +		if (!is_skb_wmem(skb))
>> +			skb_set_owner_w(skb, sk);
> 
> This would be broken for non full sockets.
> Calling skb_set_owner_w(skb, sk) for them is a bug.
See my comment above.

>> +		skb->truesize += delta;
>> +		if (sk_fullsock(sk))
>> +			refcount_add(delta, &sk->sk_wmem_alloc);
> 
> 
>> +	}	return skb;
Strange line, will fix it.

>>  }
>>  EXPORT_SYMBOL(skb_expand_head);


* [PATCH net-next v4] skb_expand_head() adjust skb->truesize incorrectly
  2021-09-01  6:20                                                                         ` Vasily Averin
@ 2021-09-01  8:11                                                                           ` Vasily Averin
  2021-09-01 16:58                                                                             ` Christoph Paasch
  2021-09-01 19:17                                                                             ` Eric Dumazet
  0 siblings, 2 replies; 106+ messages in thread
From: Vasily Averin @ 2021-09-01  8:11 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Alexey Kuznetsov, Julian Wiedmann

Christoph Paasch reports [1] an incorrect skb->truesize
after the skb_expand_head() call in ip6_xmit.
This may happen for two reasons:
- skb_set_owner_w() for the newly cloned skb is called too early,
before pskb_expand_head(), where truesize is adjusted for the (!skb->sk) case.
- pskb_expand_head() does not adjust truesize in the (skb->sk) case.
In this case sk->sk_wmem_alloc should be adjusted too.
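
Why truesize and sk->sk_wmem_alloc have to move together can be shown with a toy accounting model. It is a sketch only: the model_* names are hypothetical, plain long counters stand in for refcount_t, and the byte counts in any call are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; names mirror kernel fields only loosely. */
struct model_sock {
	long wmem_alloc;	/* stands in for sk->sk_wmem_alloc */
};

struct model_skb {
	struct model_sock *sk;
	long truesize;		/* stands in for skb->truesize */
};

/* Owner assignment: charge the socket with the current truesize. */
static void model_charge(struct model_skb *skb, struct model_sock *sk)
{
	skb->sk = sk;
	sk->wmem_alloc += skb->truesize;
}

/* Destructor side: give back whatever truesize says at free time. */
static void model_uncharge(struct model_skb *skb)
{
	assert(skb->sk != NULL);
	skb->sk->wmem_alloc -= skb->truesize;
	skb->sk = NULL;
}

/* Grow an owned skb; adjust_sock selects whether the charge follows. */
static void model_expand(struct model_skb *skb, long delta, int adjust_sock)
{
	assert(skb->sk != NULL);	/* only owned skbs are modelled */
	skb->truesize += delta;
	if (adjust_sock)
		skb->sk->wmem_alloc += delta;
}

/* Socket charge left over after charge, expand, free. */
static long model_leftover(long truesize, long delta, int adjust_sock)
{
	struct model_sock sk = { .wmem_alloc = 0 };
	struct model_skb skb = { .sk = NULL, .truesize = truesize };

	model_charge(&skb, &sk);
	model_expand(&skb, delta, adjust_sock);
	model_uncharge(&skb);
	return sk.wmem_alloc;
}
```

When the socket charge follows truesize the leftover is zero. When only truesize grows, the free path gives back more than was charged and the counter goes negative, which in the kernel would surface as a refcount underflow.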

[1] https://lkml.org/lkml/2021/8/20/1082

Fixes: f1260ff15a71 ("skbuff: introduce skb_expand_head()")
Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
v4: decided to use is_skb_wmem() after pskb_expand_head() call
    fixed 'return (EXPRESSION);' in is_skb_wmem() according to Eric Dumazet
v3: removed __pskb_expand_head(),
    added is_skb_wmem() helper for skb with wmem-compatible destructors
    there are 2 ways to use it:
     - before pskb_expand_head(), to create skb clones
     - after successful pskb_expand_head() to change owner on extended skb.
v2: based on patch version from Eric Dumazet,
    added __pskb_expand_head() function, which can be forced
    to adjust skb->truesize and sk->sk_wmem_alloc.
---
 include/net/sock.h |  1 +
 net/core/skbuff.c  | 35 ++++++++++++++++++++++++++---------
 net/core/sock.c    |  8 ++++++++
 3 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 95b2577..173d58c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1695,6 +1695,7 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
 			     gfp_t priority);
 void __sock_wfree(struct sk_buff *skb);
 void sock_wfree(struct sk_buff *skb);
+bool is_skb_wmem(const struct sk_buff *skb);
 struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
 			     gfp_t priority);
 void skb_orphan_partial(struct sk_buff *skb);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f931176..09991cb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1804,28 +1804,45 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
 struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
 {
 	int delta = headroom - skb_headroom(skb);
+	int osize = skb_end_offset(skb);
+	struct sk_buff *oskb = NULL;
+	struct sock *sk = skb->sk;
 
 	if (WARN_ONCE(delta <= 0,
 		      "%s is expecting an increase in the headroom", __func__))
 		return skb;
 
-	/* pskb_expand_head() might crash, if skb is shared */
+	delta = SKB_DATA_ALIGN(delta);
+	/* pskb_expand_head() might crash, if skb is shared.
+	 * Also we should clone skb if its destructor does
+	 * not adjust skb->truesize and sk->sk_wmem_alloc
+ 	 */
 	if (skb_shared(skb)) {
 		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
 
-		if (likely(nskb)) {
-			if (skb->sk)
-				skb_set_owner_w(nskb, skb->sk);
-			consume_skb(skb);
-		} else {
+		if (unlikely(!nskb)) {
 			kfree_skb(skb);
+			return NULL;
 		}
+		oskb = skb;
 		skb = nskb;
 	}
-	if (skb &&
-	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
+	if (pskb_expand_head(skb, delta, 0, GFP_ATOMIC)) {
 		kfree_skb(skb);
-		skb = NULL;
+		kfree_skb(oskb);
+		return NULL;
+	}
+	if (oskb) {
+		if (sk)
+			skb_set_owner_w(skb, sk);
+		consume_skb(oskb);
+	} else if (sk) {
+		delta = osize - skb_end_offset(skb);
+		if (!is_skb_wmem(skb))
+			skb_set_owner_w(skb, sk);
+		skb->truesize += delta;
+		if (sk_fullsock(sk))
+			refcount_add(delta, &sk->sk_wmem_alloc);
 	}
 	return skb;
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 950f1e7..6cbda43 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2227,6 +2227,14 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 }
 EXPORT_SYMBOL(skb_set_owner_w);
 
+bool is_skb_wmem(const struct sk_buff *skb)
+{
+	return skb->destructor == sock_wfree ||
+	       skb->destructor == __sock_wfree ||
+	       (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree);
+}
+EXPORT_SYMBOL(is_skb_wmem);
+
 static bool can_skb_orphan_partial(const struct sk_buff *skb)
 {
 #ifdef CONFIG_TLS_DEVICE
-- 
1.8.3.1



* Re: [PATCH net-next v4] skb_expand_head() adjust skb->truesize incorrectly
  2021-09-01  8:11                                                                           ` [PATCH net-next v4] " Vasily Averin
@ 2021-09-01 16:58                                                                             ` Christoph Paasch
  2021-09-01 19:17                                                                             ` Eric Dumazet
  1 sibling, 0 replies; 106+ messages in thread
From: Christoph Paasch @ 2021-09-01 16:58 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Eric Dumazet, David S. Miller, Hideaki YOSHIFUJI, David Ahern,
	Jakub Kicinski, netdev, LKML, kernel, Alexey Kuznetsov,
	Julian Wiedmann

Hello,

On Wed, Sep 1, 2021 at 1:12 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> Christoph Paasch reports [1] about incorrect skb->truesize
> after skb_expand_head() call in ip6_xmit.
> This may happen because of two reasons:
> - skb_set_owner_w() for newly cloned skb is called too early,
> before pskb_expand_head() where truesize is adjusted for (!skb->sk) case.
> - pskb_expand_head() does not adjust truesize in (skb->sk) case.
> In this case sk->sk_wmem_alloc should be adjusted too.
>
> [1] https://lkml.org/lkml/2021/8/20/1082
>
> Fixes: f1260ff15a71 ("skbuff: introduce skb_expand_head()")
> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
> v4: decided to use is_skb_wmem() after pskb_expand_head() call
>     fixed 'return (EXPRESSION);' in is_skb_wmem() according to Eric Dumazet
> v3: removed __pskb_expand_head(),
>     added is_skb_wmem() helper for skb with wmem-compatible destructors
>     there are 2 ways to use it:
>      - before pskb_expand_head(), to create skb clones
>      - after successful pskb_expand_head() to change owner on extended skb.
> v2: based on patch version from Eric Dumazet,
>     added __pskb_expand_head() function, which can be forced
>     to adjust skb->truesize and sk->sk_wmem_alloc.
> ---
>  include/net/sock.h |  1 +
>  net/core/skbuff.c  | 35 ++++++++++++++++++++++++++---------
>  net/core/sock.c    |  8 ++++++++
>  3 files changed, 35 insertions(+), 9 deletions(-)

This introduces more issues with the syzkaller reproducer that I
shared earlier (see below for the output).

I don't have time at the moment to dig into these, so I'm just
sharing this as an FYI for now.
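
One thing that may be worth checking, offered as a reading of the v4 hunk and not as a diagnosis confirmed in this message: the hunk computes delta = osize - skb_end_offset(skb), where osize is the end offset saved before pskb_expand_head(). If the buffer grows, that difference is negative. The sketch below only shows the arithmetic; the function names and the offsets used with them are hypothetical.

```c
#include <assert.h>

/* Delta as written in the v4 hunk: old end offset minus new end offset. */
static int delta_as_in_v4(int osize, int new_end)
{
	assert(new_end > osize);	/* models a buffer that grew */
	return osize - new_end;
}

/* Same operands swapped: the number of bytes the buffer grew by. */
static int delta_growth(int osize, int new_end)
{
	assert(new_end > osize);
	return new_end - osize;
}
```

A negative value passed to refcount_add(), which takes a signed int, would be consistent with the first warning below, though nothing in this report establishes that this is the cause.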

syzkaller login: [   12.768064] cgroup: Unknown subsys name 'perf_event'
[   12.769831] cgroup: Unknown subsys name 'net_cls'
[   13.587819] ------------[ cut here ]------------
[   13.588943] refcount_t: saturated; leaking memory.
[   13.590166] WARNING: CPU: 1 PID: 1658 at lib/refcount.c:22 refcount_warn_saturate+0xce/0x1f0
[   13.591909] Modules linked in:
[   13.592595] CPU: 1 PID: 1658 Comm: syz-executor Not tainted 5.14.0ea78abdd8ff18baaea3211eabdd6a2a88169cfd6 #134
[   13.594455] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[   13.596640] RIP: 0010:refcount_warn_saturate+0xce/0x1f0
[   13.597651] Code: 1d 32 63 11 02 31 ff 89 de e8 1e 26 79 ff 84 db 75 d8 e8 b5 1e 79 ff 48 c7 c7 80 48 32 83 c6 05 12 63 11 02 01 e8 2f 39
[   13.601049] RSP: 0018:ffffc9000091f2a8 EFLAGS: 00010286
[   13.602077] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[   13.603477] RDX: ffff888100fa2880 RSI: ffffffff8121e533 RDI: fffff52000123e47
[   13.604758] RBP: ffff88810b88013c R08: 0000000000000001 R09: 0000000000000000
[   13.606110] R10: ffffffff814135db R11: 0000000000000000 R12: ffff88810b880000
[   13.607421] R13: 00000000fffffe03 R14: ffff8881094c97c0 R15: ffff88810b88013c
[   13.608874] FS:  00007f8ad457d700(0000) GS:ffff88811b480000(0000) knlGS:0000000000000000
[   13.610515] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.611671] CR2: 0000000000000000 CR3: 00000001045d8000 CR4: 00000000000006e0
[   13.613017] Call Trace:
[   13.613521]  skb_expand_head+0x35a/0x470
[   13.614331]  ip6_xmit+0x105f/0x1560
[   13.615038]  ? ip6_forward+0x22b0/0x22b0
[   13.616011]  ? ip6_dst_check+0x227/0x540
[   13.616773]  ? rt6_check_expired+0x250/0x250
[   13.617657]  ? __sk_dst_check+0xfb/0x200
[   13.618424]  ? inet6_csk_route_socket+0x59e/0x980
[   13.619377]  ? inet6_csk_addr2sockaddr+0x2a0/0x2a0
[   13.620399]  ? stack_trace_consume_entry+0x160/0x160
[   13.621530]  inet6_csk_xmit+0x2b3/0x430
[   13.622290]  ? kasan_save_stack+0x32/0x40
[   13.623133]  ? kasan_save_stack+0x1b/0x40
[   13.623939]  ? inet6_csk_route_socket+0x980/0x980
[   13.624802]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[   13.625786]  ? csum_ipv6_magic+0x26/0x70
[   13.626653]  ? inet6_csk_route_socket+0x980/0x980
[   13.627480]  __tcp_transmit_skb+0x186e/0x35d0
[   13.628358]  ? __tcp_select_window+0xa50/0xa50
[   13.629153]  ? __sanitizer_cov_trace_cmp4+0x1c/0x70
[   13.630130]  ? kasan_unpoison+0x23/0x50
[   13.630872]  ? __build_skb_around+0x241/0x300
[   13.631667]  ? __sanitizer_cov_trace_const_cmp4+0x1c/0x70
[   13.632785]  ? __alloc_skb+0x180/0x360
[   13.633545]  __tcp_send_ack.part.0+0x3da/0x650
[   13.634333]  tcp_send_ack+0x7d/0xa0
[   13.635031]  __tcp_ack_snd_check+0x156/0x8c0
[   13.635957]  tcp_rcv_established+0x1733/0x1d30
[   13.636889]  ? tcp_data_queue+0x4af0/0x4af0
[   13.637753]  tcp_v6_do_rcv+0x438/0x1380
[   13.638523]  __release_sock+0x1ad/0x310
[   13.639306]  release_sock+0x54/0x1a0
[   13.640029]  ? tcp_sendmsg_locked+0x2ee0/0x2ee0
[   13.640953]  tcp_sendmsg+0x36/0x40
[   13.641710]  inet6_sendmsg+0xb5/0x140
[   13.642359]  ? inet6_ioctl+0x2a0/0x2a0
[   13.643092]  ____sys_sendmsg+0x3b5/0x970
[   13.643834]  ? sock_release+0x1b0/0x1b0
[   13.644593]  ? __ia32_sys_recvmmsg+0x290/0x290
[   13.645505]  ? futex_wait_setup+0x2e0/0x2e0
[   13.646350]  ___sys_sendmsg+0xff/0x170
[   13.647084]  ? hash_futex+0x12/0x1f0
[   13.647870]  ? sendmsg_copy_msghdr+0x160/0x160
[   13.648691]  ? asm_exc_page_fault+0x1e/0x30
[   13.649475]  ? __sanitizer_cov_trace_const_cmp1+0x22/0x80
[   13.650523]  ? __fget_files+0x1c2/0x2a0
[   13.651245]  ? __fget_light+0xea/0x270
[   13.652027]  ? sockfd_lookup_light+0xc3/0x170
[   13.652845]  __sys_sendmmsg+0x192/0x440
[   13.653622]  ? __ia32_sys_sendmsg+0xb0/0xb0
[   13.654365]  ? vfs_fileattr_set+0xb80/0xb80
[   13.655085]  ? __sanitizer_cov_trace_const_cmp4+0x1c/0x70
[   13.656175]  ? alloc_file_pseudo+0x1/0x250
[   13.657026]  ? sock_ioctl+0x1bb/0x670
[   13.657861]  ? __do_sys_futex+0xe7/0x3d0
[   13.658697]  ? __do_sys_futex+0xe7/0x3d0
[   13.659379]  ? __do_sys_futex+0xf0/0x3d0
[   13.660090]  ? __restore_fpregs_from_fpstate+0xa9/0xf0
[   13.661212]  ? fpregs_mark_activate+0x130/0x130
[   13.662078]  ? do_futex+0x1be0/0x1be0
[   13.662846]  __x64_sys_sendmmsg+0x98/0x100
[   13.663706]  ? syscall_exit_to_user_mode+0x1d/0x40
[   13.664698]  do_syscall_64+0x3b/0x90
[   13.665450]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   13.666564] RIP: 0033:0x7f8ad3e8c469
[   13.667204] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08
[   13.670776] RSP: 002b:00007f8ad457cda8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
[   13.672208] RAX: ffffffffffffffda RBX: 0000000000000133 RCX: 00007f8ad3e8c469
[   13.673598] RDX: 0000000000000003 RSI: 00000000200008c0 RDI: 0000000000000003
[   13.674946] RBP: 0000000000000133 R08: 0000000000000000 R09: 0000000000000000
[   13.676397] R10: 0000000040044040 R11: 0000000000000246 R12: 000000000069bf8c
[   13.677876] R13: 00007ffe38506fef R14: 00007f8ad455d000 R15: 0000000000000003
[   13.679129] ---[ end trace 55e20198e13af26c ]---
[   13.680043] ------------[ cut here ]------------
[   13.681049] refcount_t: underflow; use-after-free.
[   13.682005] WARNING: CPU: 1 PID: 1658 at lib/refcount.c:28 refcount_warn_saturate+0x103/0x1f0
[   13.683658] Modules linked in:
[   13.684246] CPU: 1 PID: 1658 Comm: syz-executor Tainted: G        W         5.14.0ea78abdd8ff18baaea3211eabdd6a2a88169cfd6 #134
[   13.686321] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[   13.688388] RIP: 0010:refcount_warn_saturate+0x103/0x1f0
[   13.689502] Code: 1d fb 62 11 02 31 ff 89 de e8 e9 25 79 ff 84 db 75 a3 e8 80 1e 79 ff 48 c7 c7 80 49 32 83 c6 05 db 62 11 02 01 e8 fa 34
[   13.692805] RSP: 0018:ffffc9000091eff8 EFLAGS: 00010286
[   13.693756] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[   13.695193] RDX: ffff888100fa2880 RSI: ffffffff8121e533 RDI: fffff52000123df1
[   13.696696] RBP: ffff88810b88013c R08: 0000000000000001 R09: 0000000000000000
[   13.697982] R10: ffffffff814135db R11: 0000000000000000 R12: ffff88810b88013c
[   13.699291] R13: 00000000fffffe02 R14: ffff8881011a4c00 R15: ffff8881094c97c0
[   13.700576] FS:  00007f8ad457d700(0000) GS:ffff88811b480000(0000) knlGS:0000000000000000
[   13.702031] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.703134] CR2: 0000000000000000 CR3: 00000001045d8000 CR4: 00000000000006e0
[   13.704525] Call Trace:
[   13.704973]  __sock_wfree+0xec/0x110
[   13.705737]  ? sock_wfree+0x240/0x240
[   13.706406]  loopback_xmit+0x126/0x4b0
[   13.707278]  ? refcount_warn_saturate+0xce/0x1f0
[   13.708208]  dev_hard_start_xmit+0x16c/0x5c0
[   13.709116]  __dev_queue_xmit+0x1679/0x2970
[   13.709912]  ? netdev_core_pick_tx+0x2d0/0x2d0
[   13.710758]  ? __sanitizer_cov_trace_const_cmp8+0x1d/0x70
[   13.711846]  ? report_bug+0x38/0x210
[   13.712656]  ? handle_bug+0x3c/0x60
[   13.713395]  ? exc_invalid_op+0x14/0x40
[   13.714119]  ip6_finish_output2+0xb52/0x14c0
[   13.715029]  ip6_output+0x572/0x9e0
[   13.715761]  ? ip6_fragment+0x1f40/0x1f40
[   13.716478]  ip6_xmit+0xc6f/0x1560
[   13.717083]  ? ip6_forward+0x22b0/0x22b0
[   13.717895]  ? ip6_dst_check+0x227/0x540
[   13.718689]  ? rt6_check_expired+0x250/0x250
[   13.719620]  ? __sk_dst_check+0xfb/0x200
[   13.720427]  ? inet6_csk_route_socket+0x59e/0x980
[   13.721408]  ? inet6_csk_addr2sockaddr+0x2a0/0x2a0
[   13.722286]  ? stack_trace_consume_entry+0x160/0x160
[   13.723186]  inet6_csk_xmit+0x2b3/0x430
[   13.723873]  ? kasan_save_stack+0x32/0x40
[   13.724682]  ? kasan_save_stack+0x1b/0x40
[   13.725422]  ? inet6_csk_route_socket+0x980/0x980
[   13.726398]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[   13.727478]  ? csum_ipv6_magic+0x26/0x70
[   13.728288]  ? inet6_csk_route_socket+0x980/0x980
[   13.729267]  __tcp_transmit_skb+0x186e/0x35d0
[   13.730048]  ? __tcp_select_window+0xa50/0xa50
[   13.730952]  ? __sanitizer_cov_trace_cmp4+0x1c/0x70
[   13.732007]  ? kasan_unpoison+0x23/0x50
[   13.732740]  ? __build_skb_around+0x241/0x300
[   13.733605]  ? __sanitizer_cov_trace_const_cmp4+0x1c/0x70
[   13.734749]  ? __alloc_skb+0x180/0x360
[   13.735506]  __tcp_send_ack.part.0+0x3da/0x650
[   13.736377]  tcp_send_ack+0x7d/0xa0
[   13.737015]  __tcp_ack_snd_check+0x156/0x8c0
[   13.737758]  tcp_rcv_established+0x1733/0x1d30
[   13.738679]  ? tcp_data_queue+0x4af0/0x4af0
[   13.739417]  tcp_v6_do_rcv+0x438/0x1380
[   13.740166]  __release_sock+0x1ad/0x310
[   13.740874]  release_sock+0x54/0x1a0
[   13.741527]  ? tcp_sendmsg_locked+0x2ee0/0x2ee0
[   13.742394]  tcp_sendmsg+0x36/0x40
[   13.743037]  inet6_sendmsg+0xb5/0x140
[   13.743752]  ? inet6_ioctl+0x2a0/0x2a0
[   13.744511]  ____sys_sendmsg+0x3b5/0x970
[   13.745325]  ? sock_release+0x1b0/0x1b0
[   13.746031]  ? __ia32_sys_recvmmsg+0x290/0x290
[   13.746914]  ? futex_wait_setup+0x2e0/0x2e0
[   13.747749]  ___sys_sendmsg+0xff/0x170
[   13.748393]  ? hash_futex+0x12/0x1f0
[   13.749036]  ? sendmsg_copy_msghdr+0x160/0x160
[   13.749972]  ? asm_exc_page_fault+0x1e/0x30
[   13.750870]  ? __sanitizer_cov_trace_const_cmp1+0x22/0x80
[   13.751974]  ? __fget_files+0x1c2/0x2a0
[   13.752659]  ? __fget_light+0xea/0x270
[   13.753514]  ? sockfd_lookup_light+0xc3/0x170
[   13.754296]  __sys_sendmmsg+0x192/0x440
[   13.755102]  ? __ia32_sys_sendmsg+0xb0/0xb0
[   13.755917]  ? vfs_fileattr_set+0xb80/0xb80
[   13.756692]  ? __sanitizer_cov_trace_const_cmp4+0x1c/0x70
[   13.757790]  ? alloc_file_pseudo+0x1/0x250
[   13.758675]  ? sock_ioctl+0x1bb/0x670
[   13.759341]  ? __do_sys_futex+0xe7/0x3d0
[   13.760040]  ? __do_sys_futex+0xe7/0x3d0
[   13.760762]  ? __do_sys_futex+0xf0/0x3d0
[   13.761585]  ? __restore_fpregs_from_fpstate+0xa9/0xf0
[   13.762511]  ? fpregs_mark_activate+0x130/0x130
[   13.763382]  ? do_futex+0x1be0/0x1be0
[   13.764044]  __x64_sys_sendmmsg+0x98/0x100
[   13.764831]  ? syscall_exit_to_user_mode+0x1d/0x40
[   13.765814]  do_syscall_64+0x3b/0x90
[   13.766607]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   13.767467] RIP: 0033:0x7f8ad3e8c469
[   13.768206] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08
[   13.771618] RSP: 002b:00007f8ad457cda8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
[   13.773054] RAX: ffffffffffffffda RBX: 0000000000000133 RCX: 00007f8ad3e8c469
[   13.774260] RDX: 0000000000000003 RSI: 00000000200008c0 RDI: 0000000000000003
[   13.775586] RBP: 0000000000000133 R08: 0000000000000000 R09: 0000000000000000
[   13.776909] R10: 0000000040044040 R11: 0000000000000246 R12: 000000000069bf8c
[   13.778390] R13: 00007ffe38506fef R14: 00007f8ad455d000 R15: 0000000000000003
[   13.779752] ---[ end trace 55e20198e13af26d ]---
[   13.780935] ------------[ cut here ]------------
[   13.781986] WARNING: CPU: 0 PID: 1658 at net/core/skbuff.c:5429 skb_try_coalesce+0x1019/0x12c0
[   13.783740] Modules linked in:
[   13.784398] CPU: 0 PID: 1658 Comm: syz-executor Tainted: G        W         5.14.0ea78abdd8ff18baaea3211eabdd6a2a88169cfd6 #134
[   13.786692] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[   13.788958] RIP: 0010:skb_try_coalesce+0x1019/0x12c0
[   13.789930] Code: 24 20 bf 01 00 00 00 8b 40 20 44 0f b7 f0 44 89 f6 e8 0b 2b cf fe 41 83 ee 01 0f 85 01 f3 ff ff e9 42 f6 ff ff e8 67 2c
[   13.793371] RSP: 0018:ffffc9000091f530 EFLAGS: 00010293
[   13.794316] RAX: 0000000000000000 RBX: 0000000000000c00 RCX: 0000000000000000
[   13.795688] RDX: ffff888100fa2880 RSI: ffffffff826767a9 RDI: 0000000000000003
[   13.797093] RBP: ffff888109496de0 R08: 0000000000000c00 R09: 0000000000000000
[   13.798381] R10: ffffffff82676122 R11: 0000000000000000 R12: ffff888100efc0e0
[   13.799766] R13: ffff8881046baac0 R14: 0000000000001000 R15: ffff888100efc156
[   13.801052] FS:  00007f8ad457d700(0000) GS:ffff88811b400000(0000) knlGS:0000000000000000
[   13.802463] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.803603] CR2: 00007f83d8028000 CR3: 00000001045d8000 CR4: 00000000000006f0
[   13.805079] Call Trace:
[   13.805622]  tcp_try_coalesce+0x312/0x870
[   13.806488]  ? tcp_ack_update_rtt+0xfc0/0xfc0
[   13.807406]  ? __sanitizer_cov_trace_const_cmp4+0x1c/0x70
[   13.808483]  ? tcp_try_rmem_schedule+0x99b/0x16e0
[   13.809296]  tcp_queue_rcv+0x73/0x670
[   13.810013]  tcp_data_queue+0x11e5/0x4af0
[   13.810844]  ? __sanitizer_cov_trace_const_cmp2+0x22/0x80
[   13.811890]  ? tcp_urg+0x108/0xb60
[   13.812536]  ? tcp_data_ready+0x550/0x550
[   13.813362]  ? tcp_enter_cwr+0x3f0/0x4d0
[   13.814148]  ? __sanitizer_cov_trace_cmp4+0x1c/0x70
[   13.815014]  ? ktime_get+0xf4/0x150
[   13.815637]  ? __sanitizer_cov_trace_const_cmp8+0x1d/0x70
[   13.816721]  tcp_rcv_established+0x83a/0x1d30
[   13.817541]  ? tcp_data_queue+0x4af0/0x4af0
[   13.818375]  tcp_v6_do_rcv+0x438/0x1380
[   13.819160]  __release_sock+0x1ad/0x310
[   13.819975]  release_sock+0x54/0x1a0
[   13.820745]  ? tcp_sendmsg_locked+0x2ee0/0x2ee0
[   13.821662]  tcp_sendmsg+0x36/0x40
[   13.822351]  inet6_sendmsg+0xb5/0x140
[   13.823115]  ? inet6_ioctl+0x2a0/0x2a0
[   13.823909]  ____sys_sendmsg+0x3b5/0x970
[   13.824720]  ? sock_release+0x1b0/0x1b0
[   13.825521]  ? __ia32_sys_recvmmsg+0x290/0x290
[   13.826441]  ? futex_wait_setup+0x2e0/0x2e0
[   13.827308]  ___sys_sendmsg+0xff/0x170
[   13.828112]  ? hash_futex+0x12/0x1f0
[   13.828853]  ? sendmsg_copy_msghdr+0x160/0x160
[   13.829804]  ? asm_exc_page_fault+0x1e/0x30
[   13.830660]  ? __sanitizer_cov_trace_const_cmp1+0x22/0x80
[   13.831760]  ? __fget_files+0x1c2/0x2a0
[   13.832576]  ? __fget_light+0xea/0x270
[   13.833349]  ? sockfd_lookup_light+0xc3/0x170
[   13.834289]  __sys_sendmmsg+0x192/0x440
[   13.835065]  ? __ia32_sys_sendmsg+0xb0/0xb0
[   13.835918]  ? vfs_fileattr_set+0xb80/0xb80
[   13.836823]  ? __sanitizer_cov_trace_const_cmp4+0x1c/0x70
[   13.837941]  ? alloc_file_pseudo+0x1/0x250
[   13.838810]  ? sock_ioctl+0x1bb/0x670
[   13.839550]  ? __do_sys_futex+0xe7/0x3d0
[   13.840369]  ? __do_sys_futex+0xe7/0x3d0
[   13.841205]  ? __do_sys_futex+0xf0/0x3d0
[   13.842022]  ? __restore_fpregs_from_fpstate+0xa9/0xf0
[   13.843115]  ? fpregs_mark_activate+0x130/0x130
[   13.844074]  ? do_futex+0x1be0/0x1be0
[   13.844868]  __x64_sys_sendmmsg+0x98/0x100
[   13.845725]  ? syscall_exit_to_user_mode+0x1d/0x40
[   13.846754]  do_syscall_64+0x3b/0x90
[   13.847472]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   13.848550] RIP: 0033:0x7f8ad3e8c469
[   13.849289] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08
[   13.852935] RSP: 002b:00007f8ad457cda8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
[   13.854476] RAX: ffffffffffffffda RBX: 0000000000000133 RCX: 00007f8ad3e8c469
[   13.855896] RDX: 0000000000000003 RSI: 00000000200008c0 RDI: 0000000000000003
[   13.857304] RBP: 0000000000000133 R08: 0000000000000000 R09: 0000000000000000
[   13.858756] R10: 0000000040044040 R11: 0000000000000246 R12: 000000000069bf8c
[   13.860168] R13: 00007ffe38506fef R14: 00007f8ad455d000 R15: 0000000000000003
[   13.861597] ---[ end trace 55e20198e13af26e ]---


>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 95b2577..173d58c 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1695,6 +1695,7 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
>                              gfp_t priority);
>  void __sock_wfree(struct sk_buff *skb);
>  void sock_wfree(struct sk_buff *skb);
> +bool is_skb_wmem(const struct sk_buff *skb);
>  struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
>                              gfp_t priority);
>  void skb_orphan_partial(struct sk_buff *skb);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f931176..09991cb 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1804,28 +1804,45 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>  {
>         int delta = headroom - skb_headroom(skb);
> +       int osize = skb_end_offset(skb);
> +       struct sk_buff *oskb = NULL;
> +       struct sock *sk = skb->sk;
>
>         if (WARN_ONCE(delta <= 0,
>                       "%s is expecting an increase in the headroom", __func__))
>                 return skb;
>
> -       /* pskb_expand_head() might crash, if skb is shared */
> +       delta = SKB_DATA_ALIGN(delta);
> +       /* pskb_expand_head() might crash, if skb is shared.
> +        * Also we should clone skb if its destructor does
> +        * not adjust skb->truesize and sk->sk_wmem_alloc
> +        */
>         if (skb_shared(skb)) {
>                 struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>
> -               if (likely(nskb)) {
> -                       if (skb->sk)
> -                               skb_set_owner_w(nskb, skb->sk);
> -                       consume_skb(skb);
> -               } else {
> +               if (unlikely(!nskb)) {
>                         kfree_skb(skb);
> +                       return NULL;
>                 }
> +               oskb = skb;
>                 skb = nskb;
>         }
> -       if (skb &&
> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
> +       if (pskb_expand_head(skb, delta, 0, GFP_ATOMIC)) {
>                 kfree_skb(skb);
> -               skb = NULL;
> +               kfree_skb(oskb);
> +               return NULL;
> +       }
> +       if (oskb) {
> +               if (sk)
> +                       skb_set_owner_w(skb, sk);
> +               consume_skb(oskb);
> +       } else if (sk) {
> +               delta = osize - skb_end_offset(skb);
> +               if (!is_skb_wmem(skb))
> +                       skb_set_owner_w(skb, sk);
> +               skb->truesize += delta;
> +               if (sk_fullsock(sk))
> +                       refcount_add(delta, &sk->sk_wmem_alloc);
>         }
>         return skb;
>  }
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 950f1e7..6cbda43 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2227,6 +2227,14 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
>  }
>  EXPORT_SYMBOL(skb_set_owner_w);
>
> +bool is_skb_wmem(const struct sk_buff *skb)
> +{
> +       return skb->destructor == sock_wfree ||
> +              skb->destructor == __sock_wfree ||
> +              (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree);
> +}
> +EXPORT_SYMBOL(is_skb_wmem);
> +
>  static bool can_skb_orphan_partial(const struct sk_buff *skb)
>  {
>  #ifdef CONFIG_TLS_DEVICE
> --
> 1.8.3.1
>


* Re: [PATCH net-next v4] skb_expand_head() adjust skb->truesize incorrectly
  2021-09-01  8:11                                                                           ` [PATCH net-next v4] " Vasily Averin
  2021-09-01 16:58                                                                             ` Christoph Paasch
@ 2021-09-01 19:17                                                                             ` Eric Dumazet
  2021-09-02  3:59                                                                               ` Vasily Averin
  1 sibling, 1 reply; 106+ messages in thread
From: Eric Dumazet @ 2021-09-01 19:17 UTC (permalink / raw)
  To: Vasily Averin, Christoph Paasch, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Alexey Kuznetsov, Julian Wiedmann



On 9/1/21 1:11 AM, Vasily Averin wrote:
> Christoph Paasch reports [1] about incorrect skb->truesize
> after skb_expand_head() call in ip6_xmit.
> This may happen because of two reasons:
> - skb_set_owner_w() for newly cloned skb is called too early,
> before pskb_expand_head() where truesize is adjusted for (!skb->sk) case.
> - pskb_expand_head() does not adjust truesize in (skb->sk) case.
> In this case sk->sk_wmem_alloc should be adjusted too.
> 
> [1] https://lkml.org/lkml/2021/8/20/1082
> 
> Fixes: f1260ff15a71 ("skbuff: introduce skb_expand_head()")
> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
> v4: decided to use is_skb_wmem() after pskb_expand_head() call
>     fixed 'return (EXPRESSION);' in os_skb_wmem according to Eric Dumazet
> v3: removed __pskb_expand_head(),
>     added is_skb_wmem() helper for skb with wmem-compatible destructors
>     there are 2 ways to use it:
>      - before pskb_expand_head(), to create skb clones
>      - after successfull pskb_expand_head() to change owner on extended skb.
> v2: based on patch version from Eric Dumazet,
>     added __pskb_expand_head() function, which can be forced
>     to adjust skb->truesize and sk->sk_wmem_alloc.
> ---
>  include/net/sock.h |  1 +
>  net/core/skbuff.c  | 35 ++++++++++++++++++++++++++---------
>  net/core/sock.c    |  8 ++++++++
>  3 files changed, 35 insertions(+), 9 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 95b2577..173d58c 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1695,6 +1695,7 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
>  			     gfp_t priority);
>  void __sock_wfree(struct sk_buff *skb);
>  void sock_wfree(struct sk_buff *skb);
> +bool is_skb_wmem(const struct sk_buff *skb);
>  struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
>  			     gfp_t priority);
>  void skb_orphan_partial(struct sk_buff *skb);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f931176..09991cb 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1804,28 +1804,45 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>  {
>  	int delta = headroom - skb_headroom(skb);
> +	int osize = skb_end_offset(skb);
> +	struct sk_buff *oskb = NULL;
> +	struct sock *sk = skb->sk;
>  
>  	if (WARN_ONCE(delta <= 0,
>  		      "%s is expecting an increase in the headroom", __func__))
>  		return skb;
>  
> -	/* pskb_expand_head() might crash, if skb is shared */
> +	delta = SKB_DATA_ALIGN(delta);
> +	/* pskb_expand_head() might crash, if skb is shared.
> +	 * Also we should clone skb if its destructor does
> +	 * not adjust skb->truesize and sk->sk_wmem_alloc
> + 	 */
>  	if (skb_shared(skb)) {
>  		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>  
> -		if (likely(nskb)) {
> -			if (skb->sk)
> -				skb_set_owner_w(nskb, skb->sk);
> -			consume_skb(skb);
> -		} else {
> +		if (unlikely(!nskb)) {
>  			kfree_skb(skb);
> +			return NULL;
>  		}
> +		oskb = skb;
>  		skb = nskb;
>  	}
> -	if (skb &&
> -	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
> +	if (pskb_expand_head(skb, delta, 0, GFP_ATOMIC)) {
>  		kfree_skb(skb);
> -		skb = NULL;
> +		kfree_skb(oskb);
> +		return NULL;
> +	}
> +	if (oskb) {
> +		if (sk)

if (is_skb_wmem(oskb))

Again, it is not valid to call skb_set_owner_w(skb, sk) on all kinds of sockets.

> +			skb_set_owner_w(skb, sk);
> +		consume_skb(oskb);
> +	} else if (sk) {

&& (skb->destructor != sock_edemux)

(Because in this case, pskb_expand_head() has already adjusted skb->truesize.)

> +		delta = osize - skb_end_offset(skb);

> +		if (!is_skb_wmem(skb))
> +			skb_set_owner_w(skb, sk);

This is dangerous: even if a socket is there, its sk->sk_wmem_alloc could be zero.

We cannot add skb->truesize to a refcount_t that has already reached 0 (sk_free())


If is_skb_wmem() is false, you probably should do nothing, and leave
current destructor as it is.
(skb->truesize can be adjusted without issue)

> +		skb->truesize += delta;
> +		if (sk_fullsock(sk))

if (is_skb_wmem(skb))

> +			refcount_add(delta, &sk->sk_wmem_alloc);
>  	}
>  	return skb;
>  }
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 950f1e7..6cbda43 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2227,6 +2227,14 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
>  }
>  EXPORT_SYMBOL(skb_set_owner_w);
>  
> +bool is_skb_wmem(const struct sk_buff *skb)
> +{
> +	return skb->destructor == sock_wfree ||
> +	       skb->destructor == __sock_wfree ||
> +	       (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree);
> +}
> +EXPORT_SYMBOL(is_skb_wmem);
> +
>  static bool can_skb_orphan_partial(const struct sk_buff *skb)
>  {
>  #ifdef CONFIG_TLS_DEVICE
> 

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH net-next v4] skb_expand_head() adjust skb->truesize incorrectly
  2021-09-01 19:17                                                                             ` Eric Dumazet
@ 2021-09-02  3:59                                                                               ` Vasily Averin
  2021-09-02  4:32                                                                                 ` Eric Dumazet
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-09-02  3:59 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Alexey Kuznetsov, Julian Wiedmann

On 9/1/21 10:17 PM, Eric Dumazet wrote:
> 
> 
> On 9/1/21 1:11 AM, Vasily Averin wrote:
>> Christoph Paasch reports [1] about incorrect skb->truesize
>> after skb_expand_head() call in ip6_xmit.
>> This may happen because of two reasons:
>> - skb_set_owner_w() for newly cloned skb is called too early,
>> before pskb_expand_head() where truesize is adjusted for (!skb-sk) case.
>> - pskb_expand_head() does not adjust truesize in (skb->sk) case.
>> In this case sk->sk_wmem_alloc should be adjusted too.
>>
>> [1] https://lkml.org/lkml/2021/8/20/1082
>>
>> Fixes: f1260ff15a71 ("skbuff: introduce skb_expand_head()")
>> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>> v4: decided to use is_skb_wmem() after pskb_expand_head() call
>>     fixed 'return (EXPRESSION);' in os_skb_wmem according to Eric Dumazet
>> v3: removed __pskb_expand_head(),
>>     added is_skb_wmem() helper for skb with wmem-compatible destructors
>>     there are 2 ways to use it:
>>      - before pskb_expand_head(), to create skb clones
>>      - after successfull pskb_expand_head() to change owner on extended skb.
>> v2: based on patch version from Eric Dumazet,
>>     added __pskb_expand_head() function, which can be forced
>>     to adjust skb->truesize and sk->sk_wmem_alloc.
>> ---
>>  include/net/sock.h |  1 +
>>  net/core/skbuff.c  | 35 ++++++++++++++++++++++++++---------
>>  net/core/sock.c    |  8 ++++++++
>>  3 files changed, 35 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 95b2577..173d58c 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -1695,6 +1695,7 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
>>  			     gfp_t priority);
>>  void __sock_wfree(struct sk_buff *skb);
>>  void sock_wfree(struct sk_buff *skb);
>> +bool is_skb_wmem(const struct sk_buff *skb);
>>  struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
>>  			     gfp_t priority);
>>  void skb_orphan_partial(struct sk_buff *skb);
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index f931176..09991cb 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -1804,28 +1804,45 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>  {
>>  	int delta = headroom - skb_headroom(skb);
>> +	int osize = skb_end_offset(skb);
>> +	struct sk_buff *oskb = NULL;
>> +	struct sock *sk = skb->sk;
>>  
>>  	if (WARN_ONCE(delta <= 0,
>>  		      "%s is expecting an increase in the headroom", __func__))
>>  		return skb;
>>  
>> -	/* pskb_expand_head() might crash, if skb is shared */
>> +	delta = SKB_DATA_ALIGN(delta);
>> +	/* pskb_expand_head() might crash, if skb is shared.
>> +	 * Also we should clone skb if its destructor does
>> +	 * not adjust skb->truesize and sk->sk_wmem_alloc
>> + 	 */
>>  	if (skb_shared(skb)) {
>>  		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>>  
>> -		if (likely(nskb)) {
>> -			if (skb->sk)
>> -				skb_set_owner_w(nskb, skb->sk);
>> -			consume_skb(skb);
>> -		} else {
>> +		if (unlikely(!nskb)) {
>>  			kfree_skb(skb);
>> +			return NULL;
>>  		}
>> +		oskb = skb;
>>  		skb = nskb;
>>  	}
>> -	if (skb &&
>> -	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>> +	if (pskb_expand_head(skb, delta, 0, GFP_ATOMIC)) {
>>  		kfree_skb(skb);
>> -		skb = NULL;
>> +		kfree_skb(oskb);
>> +		return NULL;
>> +	}
>> +	if (oskb) {
>> +		if (sk)
> 
> if (is_skb_wmem(oskb))
> Again, it is not valid to call skb_set_owner_w(skb, sk) on all kind of sockets.

I disagree.

In this particular case we have a new skb with skb->sk = NULL.
In this case skb_orphan(), called inside skb_set_owner_w(), will do nothing;
we just properly set the destructor to sock_wfree and adjust sk->sk_wmem_alloc.

It is 100% equivalent to the code used with skb_realloc_headroom(),
and there were no complaints about that.
Christoph's reproducer does not use shared skbs and does not exercise this path,
so it cannot be the cause of the trouble in his experiments.

The old destructor (sock_edemux?) can be called a bit later, for the old skb, inside consume_skb().
It can drop the last refcount and trigger sk_free(). However, in this case
the adjusted sk_wmem_alloc does not allow sk to be freed.

So I'm sure it is safe.

>> +			skb_set_owner_w(skb, sk);
>> +		consume_skb(oskb);
>> +	} else if (sk) {
> 
> && (skb->destructor != sock_edemux)
> (Because in this case , pskb_expand_head() already adjusted skb->truesize)

Agreed, thank you; my fault, I missed it.
I think it was the cause of the trouble in Christoph's last experiment.

>> +		delta = osize - skb_end_offset(skb);
> 
>> +		if (!is_skb_wmem(skb))
>> +			skb_set_owner_w(skb, sk);
> 
> This is dangerous, even if a socket is there, its sk->sk_wmem_alloc could be zero.
> We can not add skb->truesize to a refcount_t that already reached 0 (sk_free())
> 
> If is_skb_wmem() is false, you probably should do nothing, and leave
> current destructor as it is.

I'm still not sure, and I think it is tricky too.
I've found a few destructors that call sock_wfree internally; they require sk_wmem_alloc adjustment:
sctp_wfree, unix_destruct_scm and tpacket_destruct_skb.

At the same time, other destructors do not use sk_wmem_alloc, and I do not know how to tell the two kinds apart.
There may also be out-of-tree third-party protocols, and I cannot list all of them here.

However, I think I can use the same trick as the one described above:
I can increase sk_wmem_alloc before skb_orphan(), so that sk_free() called by the old destructor
cannot call __sk_free() and release sk.

I hope this works;
otherwise we'll need to clone the skb for !is_skb_wmem(skb) before the pskb_expand_head() call.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH net-next v4] skb_expand_head() adjust skb->truesize incorrectly
  2021-09-02  3:59                                                                               ` Vasily Averin
@ 2021-09-02  4:32                                                                                 ` Eric Dumazet
  2021-09-02  4:48                                                                                   ` Eric Dumazet
  0 siblings, 1 reply; 106+ messages in thread
From: Eric Dumazet @ 2021-09-02  4:32 UTC (permalink / raw)
  To: Vasily Averin, Eric Dumazet, Christoph Paasch, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Alexey Kuznetsov, Julian Wiedmann



On 9/1/21 8:59 PM, Vasily Averin wrote:
> On 9/1/21 10:17 PM, Eric Dumazet wrote:
>>
>>
>> On 9/1/21 1:11 AM, Vasily Averin wrote:
>>> Christoph Paasch reports [1] about incorrect skb->truesize
>>> after skb_expand_head() call in ip6_xmit.
>>> This may happen because of two reasons:
>>> - skb_set_owner_w() for newly cloned skb is called too early,
>>> before pskb_expand_head() where truesize is adjusted for (!skb-sk) case.
>>> - pskb_expand_head() does not adjust truesize in (skb->sk) case.
>>> In this case sk->sk_wmem_alloc should be adjusted too.
>>>
>>> [1] https://lkml.org/lkml/2021/8/20/1082
>>>
>>> Fixes: f1260ff15a71 ("skbuff: introduce skb_expand_head()")
>>> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
>>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>>> ---
>>> v4: decided to use is_skb_wmem() after pskb_expand_head() call
>>>     fixed 'return (EXPRESSION);' in os_skb_wmem according to Eric Dumazet
>>> v3: removed __pskb_expand_head(),
>>>     added is_skb_wmem() helper for skb with wmem-compatible destructors
>>>     there are 2 ways to use it:
>>>      - before pskb_expand_head(), to create skb clones
>>>      - after successfull pskb_expand_head() to change owner on extended skb.
>>> v2: based on patch version from Eric Dumazet,
>>>     added __pskb_expand_head() function, which can be forced
>>>     to adjust skb->truesize and sk->sk_wmem_alloc.
>>> ---
>>>  include/net/sock.h |  1 +
>>>  net/core/skbuff.c  | 35 ++++++++++++++++++++++++++---------
>>>  net/core/sock.c    |  8 ++++++++
>>>  3 files changed, 35 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>> index 95b2577..173d58c 100644
>>> --- a/include/net/sock.h
>>> +++ b/include/net/sock.h
>>> @@ -1695,6 +1695,7 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
>>>  			     gfp_t priority);
>>>  void __sock_wfree(struct sk_buff *skb);
>>>  void sock_wfree(struct sk_buff *skb);
>>> +bool is_skb_wmem(const struct sk_buff *skb);
>>>  struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
>>>  			     gfp_t priority);
>>>  void skb_orphan_partial(struct sk_buff *skb);
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index f931176..09991cb 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -1804,28 +1804,45 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>>>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>>>  {
>>>  	int delta = headroom - skb_headroom(skb);
>>> +	int osize = skb_end_offset(skb);
>>> +	struct sk_buff *oskb = NULL;
>>> +	struct sock *sk = skb->sk;
>>>  
>>>  	if (WARN_ONCE(delta <= 0,
>>>  		      "%s is expecting an increase in the headroom", __func__))
>>>  		return skb;
>>>  
>>> -	/* pskb_expand_head() might crash, if skb is shared */
>>> +	delta = SKB_DATA_ALIGN(delta);
>>> +	/* pskb_expand_head() might crash, if skb is shared.
>>> +	 * Also we should clone skb if its destructor does
>>> +	 * not adjust skb->truesize and sk->sk_wmem_alloc
>>> + 	 */
>>>  	if (skb_shared(skb)) {
>>>  		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>>>  
>>> -		if (likely(nskb)) {
>>> -			if (skb->sk)
>>> -				skb_set_owner_w(nskb, skb->sk);
>>> -			consume_skb(skb);
>>> -		} else {
>>> +		if (unlikely(!nskb)) {
>>>  			kfree_skb(skb);
>>> +			return NULL;
>>>  		}
>>> +		oskb = skb;
>>>  		skb = nskb;
>>>  	}
>>> -	if (skb &&
>>> -	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
>>> +	if (pskb_expand_head(skb, delta, 0, GFP_ATOMIC)) {
>>>  		kfree_skb(skb);
>>> -		skb = NULL;
>>> +		kfree_skb(oskb);
>>> +		return NULL;
>>> +	}
>>> +	if (oskb) {
>>> +		if (sk)
>>
>> if (is_skb_wmem(oskb))
>> Again, it is not valid to call skb_set_owner_w(skb, sk) on all kind of sockets.
> 
> I'm disagree.

:/ :/ :/

> 
> In this particular case we have new skb with skb->sk = NULL,
> In this case skb_orphan() called inside skb_set_owner_w(() will do nothing,
> we just properly set destructor to sock_wfree and adjust sk->sk_wmem_alloc,
> 

We cannot adjust sk_wmem_alloc if it is already 0.

The only way you can guarantee it is non-zero is to look at is_skb_wmem(oskb),
because then you are certain that at least this skb owns a reference on sk->sk_wmem_alloc.

If oskb holds another kind of destructor, you cannot assume this.

Otherwise we would need a new refcount_add_if_not_zero() function, making skb_set_owner_w()
more expensive for a rare corner case.
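
The rule can be illustrated outside the kernel. What follows is only a minimal
userspace sketch, not kernel code: every model_* name is invented for the
illustration, and the modeled counter simply refuses an add on zero, whereas the
kernel's refcount_add() warns. It shows that the destructor of oskb is the only
proof that the counter is still non-zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for refcount_t: 0 means "already released". */
struct model_refcount { unsigned int val; };

/* Refuse to resurrect a counter that has already reached 0.
 * (Assumption of this model; the kernel's refcount_add() warns instead.)
 */
static bool model_refcount_add(unsigned int n, struct model_refcount *r)
{
	if (r->val == 0)
		return false;
	r->val += n;
	return true;
}

struct model_sock { struct model_refcount wmem_alloc; };

struct model_skb {
	struct model_sock *sk;
	unsigned int truesize;
	void (*destructor)(struct model_skb *skb);
};

/* Destructor that gives back the wmem charge taken for this skb. */
static void model_sock_wfree(struct model_skb *skb)
{
	assert(skb->sk->wmem_alloc.val >= skb->truesize);
	skb->sk->wmem_alloc.val -= skb->truesize;
}

/* Some other kind of destructor: it holds no wmem charge at all. */
static void model_other_destructor(struct model_skb *skb)
{
	(void)skb;
}

static bool model_is_skb_wmem(const struct model_skb *skb)
{
	return skb->destructor == model_sock_wfree;
}

/* Charge nskb to the socket only if oskb proves the counter is non-zero. */
static bool model_transfer_owner(struct model_skb *nskb,
				 const struct model_skb *oskb)
{
	if (!model_is_skb_wmem(oskb))
		return false;	/* no proof that wmem_alloc > 0: do nothing */
	if (!model_refcount_add(nskb->truesize, &oskb->sk->wmem_alloc))
		return false;
	nskb->sk = oskb->sk;
	nskb->destructor = model_sock_wfree;
	return true;
}
```

In this model the transfer succeeds only while oskb still holds its own charge;
with any other destructor the new skb is left unowned and the counter untouched.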

> It is 100% equivalent of code used with skb_realloc_headroom(),
> and there was no claims on this.
> Cristoph's reproducer do not use shared skb and to not check this path,
> so it cannot be the reason of troubles in his experiments.
> 
> Old destructor (sock_edemux?) can be calleda bit later, for old skb, inside consume_skb().
> It can decrement last refcount and can trigger sk_free(). However in this case
> adjusted sk_wmem_alloc did not allow to free sk.
> 
> So I'm sure it is safe.

It is not safe.

> 
>>> +			skb_set_owner_w(skb, sk);
>>> +		consume_skb(oskb);
>>> +	} else if (sk) {
>>
>> && (skb->destructor != sock_edemux)
>> (Because in this case , pskb_expand_head() already adjusted skb->truesize)
> 
> Agree, thank you, my fault, I've missed it.
> I think it was the reason of the troubles in last Cristoph's experiment.
> 
>>> +		delta = osize - skb_end_offset(skb);
>>
>>> +		if (!is_skb_wmem(skb))
>>> +			skb_set_owner_w(skb, sk);
>>
>> This is dangerous, even if a socket is there, its sk->sk_wmem_alloc could be zero.
>> We can not add skb->truesize to a refcount_t that already reached 0 (sk_free())
>>
>> If is_skb_wmem() is false, you probably should do nothing, and leave
>> current destructor as it is.
> 
> I;m still not sure and think it is tricky too.



> I've found few destructors called sock_wfree inside, they require sk_wmem_alloc adjustement.
> sctp_wfree, unix_destruct_scm and tpacket_destruct_skb
> 
> In the same time another ones do not use sk_wmem_alloc and I do not know how to detect proper ones.
> Potentially there are some 3rd party protocols out-of-tree, and I cannot list all of them here.

I think you missed the netem case, in particular
skb_orphan_partial(), which I already pointed out.

You can set up a stack of virtual devices (tunnels),
with a qdisc on them, before ip6_xmit() is finally called...

The socket might have been closed already.

To test your patch, you could force a skb_orphan_partial() at the beginning
of skb_expand_head() (extending code coverage).

> 
> However I think I can use the same trick as one described above:
> I can increase sk_wmem_alloc before skb_orphan(), so sk_free() called by old destuctor 
> cannot call __sk_free() and release sk.


You can not change sk_wmem_alloc if this is already 0.

refcount_add() will trigger a warning (panic under KASAN)

> 
> I hope this should work, 
> otherwise we'll need to clone skb for !is_skb_wmem(skb) before pskb_expand_head() call.
> 
> Thank you,
> 	Vasily Averin
> 


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH net-next v4] skb_expand_head() adjust skb->truesize incorrectly
  2021-09-02  4:32                                                                                 ` Eric Dumazet
@ 2021-09-02  4:48                                                                                   ` Eric Dumazet
  2021-09-02  7:13                                                                                     ` Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Eric Dumazet @ 2021-09-02  4:48 UTC (permalink / raw)
  To: Vasily Averin, Christoph Paasch, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Alexey Kuznetsov, Julian Wiedmann



On 9/1/21 9:32 PM, Eric Dumazet wrote:

> 
> I think you missed netem case, in particular
> skb_orphan_partial() which I already pointed out.
> 
> You can setup a stack of virtual devices (tunnels),
> with a qdisc on them, before ip6_xmit() is finally called...
> 
> Socket might have been closed already.
> 
> To test your patch, you could force a skb_orphan_partial() at the beginning
> of skb_expand_head() (extending code coverage)
> 

To clarify:

It is ok to 'downgrade' an skb->destructor having a ref on sk->sk_wmem_alloc to
something owning a ref on sk->sk_refcnt.

But the opposite operation (ref on sk->sk_refcnt --> ref on sk->sk_wmem_alloc) is not safe.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH net-next v4] skb_expand_head() adjust skb->truesize incorrectly
  2021-09-02  4:48                                                                                   ` Eric Dumazet
@ 2021-09-02  7:13                                                                                     ` Vasily Averin
  2021-09-02  7:33                                                                                       ` Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-09-02  7:13 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Alexey Kuznetsov, Julian Wiedmann

On 9/2/21 7:48 AM, Eric Dumazet wrote:
> On 9/1/21 9:32 PM, Eric Dumazet wrote:
>> I think you missed netem case, in particular
>> skb_orphan_partial() which I already pointed out.
>>
>> You can setup a stack of virtual devices (tunnels),
>> with a qdisc on them, before ip6_xmit() is finally called...
>>
>> Socket might have been closed already.
>>
>> To test your patch, you could force a skb_orphan_partial() at the beginning
>> of skb_expand_head() (extending code coverage)
> 
> To clarify :
> 
> It is ok to 'downgrade' an skb->destructor having a ref on sk->sk_wmem_alloc to
> something owning a ref on sk->refcnt.
> 
> But the opposite operation (ref on sk->sk_refcnt -->  ref on sk->sk_wmem_alloc) is not safe.

Could you please explain in more detail, since I still have a completely opposite point of view?

Every sk referenced by an skb has sk_wmem_alloc > 0.
It is set to 1 in sk_alloc() and decremented right before the final __sk_free(),
inside sk_free(), sock_wfree() and __sock_wfree().

So it is safe to adjust skb->sk->sk_wmem_alloc,
because a live skb keeps a reference to a live sk, and the latter keeps sk_wmem_alloc > 0.

So any destructor that uses sk->sk_refcnt will already have sk_wmem_alloc > 0,
because the last sock_put() calls sk_free().
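
To make the accounting I have in mind concrete, here is a minimal userspace
sketch. It is not kernel code: all model_* names are invented for the
illustration, plain ints stand in for refcount_t, and "freed" marks the point
where the kernel would call __sk_free().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two counters discussed above. */
struct model_sock {
	int refcnt;      /* models sk->sk_refcnt */
	int wmem_alloc;  /* models sk->sk_wmem_alloc, including the bias of 1 */
	bool freed;      /* set where the kernel would call __sk_free() */
};

static void model_sk_alloc(struct model_sock *sk)
{
	sk->refcnt = 1;
	sk->wmem_alloc = 1;	/* bias taken at allocation time */
	sk->freed = false;
}

/* Drop the allocation-time bias; free once no tx charge remains. */
static void model_sk_free(struct model_sock *sk)
{
	assert(sk->wmem_alloc > 0);
	if (--sk->wmem_alloc == 0)
		sk->freed = true;
}

/* The last reference on refcnt hands over to model_sk_free(). */
static void model_sock_put(struct model_sock *sk)
{
	assert(sk->refcnt > 0);
	if (--sk->refcnt == 0)
		model_sk_free(sk);
}

/* Charge an skb of the given truesize to the socket. */
static void model_set_owner_w(struct model_sock *sk, int truesize)
{
	assert(sk->wmem_alloc > 0);
	sk->wmem_alloc += truesize;
}

/* Give back the charge; the last one to leave frees the socket. */
static void model_sock_wfree(struct model_sock *sk, int truesize)
{
	assert(sk->wmem_alloc >= truesize);
	sk->wmem_alloc -= truesize;
	if (sk->wmem_alloc == 0)
		sk->freed = true;
}
```

In this model, a socket with one charged skb survives the last model_sock_put()
and is freed only when model_sock_wfree() returns the remaining charge.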

However, now I'm not sure about the reverse direction.
skb_set_owner_w() checks !sk_fullsock(sk) and calls sock_hold(sk).
If sk->sk_refcnt can be 0 here (i.e. after the old destructor has run inside skb_orphan()),
it can trigger the problem you pointed out:
"refcount_add() will trigger a warning (panic under KASAN)".

Could you please explain where I'm wrong?

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH net-next v4] skb_expand_head() adjust skb->truesize incorrectly
  2021-09-02  7:13                                                                                     ` Vasily Averin
@ 2021-09-02  7:33                                                                                       ` Vasily Averin
  2021-09-02  8:31                                                                                         ` Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-09-02  7:33 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Alexey Kuznetsov, Julian Wiedmann

On 9/2/21 10:13 AM, Vasily Averin wrote:
> On 9/2/21 7:48 AM, Eric Dumazet wrote:
>> On 9/1/21 9:32 PM, Eric Dumazet wrote:
>>> I think you missed netem case, in particular
>>> skb_orphan_partial() which I already pointed out.
>>>
>>> You can setup a stack of virtual devices (tunnels),
>>> with a qdisc on them, before ip6_xmit() is finally called...
>>>
>>> Socket might have been closed already.
>>>
>>> To test your patch, you could force a skb_orphan_partial() at the beginning
>>> of skb_expand_head() (extending code coverage)
>>
>> To clarify :
>>
>> It is ok to 'downgrade' an skb->destructor having a ref on sk->sk_wmem_alloc to
>> something owning a ref on sk->refcnt.
>>
>> But the opposite operation (ref on sk->sk_refcnt -->  ref on sk->sk_wmem_alloc) is not safe.
> 
> Could you please explain in more details, since I stil have a completely opposite point of view?
> 
> Every sk referenced in skb have sk_wmem_alloc > 0
> It is assigned to 1 in sk_alloc and decremented right before last __sk_free(),
> inside  both sk_free() sock_wfree() and __sock_wfree()
> 
> So it is safe to adjust skb->sk->sk_wmem_alloc, 
> because alive skb keeps reference to alive sk and last one keeps sk_wmem_alloc > 0
> 
> So any destructor used sk->sk_refcnt will already have sk_wmem_alloc > 0, 
> because last sock_put() calls sk_free().
> 
> However now I'm not sure in reversed direction.
> skb_set_owner_w() check !sk_fullsock(sk) and call sock_hold(sk);
> If sk->sk_refcnt can be 0 here (i.e. after execution of old destructor inside skb_orphan) 
> -- it can be trigger pointed problem:
> "refcount_add() will trigger a warning (panic under KASAN)".
> 
> Could you please explain where I'm wrong?

To clarify:
I agree it is unsafe to call the following on a live skb:
skb_orphan(skb)
adjust(skb->sk->sk_wmem_alloc)

for 2 reasons:
1) the old destructor can decrease sk_wmem_alloc to zero and free sk
2) if !sk_fullsock(sk), the old destructor can call sock_put() and release the last sk->sk_refcnt reference;
  in this case sock_hold() will trigger a warning.

1) can be handled: we can adjust(sk_wmem_alloc) before skb_orphan(),
but I do not understand how to handle the 2nd case.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH net-next v4] skb_expand_head() adjust skb->truesize incorrectly
  2021-09-02  7:33                                                                                       ` Vasily Averin
@ 2021-09-02  8:31                                                                                         ` Vasily Averin
  2021-09-02 11:12                                                                                           ` [PATCH net-next v5] " Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-09-02  8:31 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Alexey Kuznetsov, Julian Wiedmann

On 9/2/21 10:33 AM, Vasily Averin wrote:
> On 9/2/21 10:13 AM, Vasily Averin wrote:
>> On 9/2/21 7:48 AM, Eric Dumazet wrote:
>>> On 9/1/21 9:32 PM, Eric Dumazet wrote:
>>>> I think you missed netem case, in particular
>>>> skb_orphan_partial() which I already pointed out.
>>>>
>>>> You can setup a stack of virtual devices (tunnels),
>>>> with a qdisc on them, before ip6_xmit() is finally called...
>>>>
>>>> Socket might have been closed already.
>>>>
>>>> To test your patch, you could force a skb_orphan_partial() at the beginning
>>>> of skb_expand_head() (extending code coverage)
>>>
>>> To clarify :
>>>
>>> It is ok to 'downgrade' an skb->destructor having a ref on sk->sk_wmem_alloc to
>>> something owning a ref on sk->refcnt.
>>>
>>> But the opposite operation (ref on sk->sk_refcnt -->  ref on sk->sk_wmem_alloc) is not safe.
>>
>> Could you please explain in more details, since I stil have a completely opposite point of view?
>>
>> Every sk referenced in skb have sk_wmem_alloc > 0
>> It is assigned to 1 in sk_alloc and decremented right before last __sk_free(),
>> inside  both sk_free() sock_wfree() and __sock_wfree()
>>
>> So it is safe to adjust skb->sk->sk_wmem_alloc, 
>> because alive skb keeps reference to alive sk and last one keeps sk_wmem_alloc > 0
>>
>> So any destructor used sk->sk_refcnt will already have sk_wmem_alloc > 0, 
>> because last sock_put() calls sk_free().
>>
>> However now I'm not sure in reversed direction.
>> skb_set_owner_w() check !sk_fullsock(sk) and call sock_hold(sk);
>> If sk->sk_refcnt can be 0 here (i.e. after execution of old destructor inside skb_orphan) 
>> -- it can be trigger pointed problem:
>> "refcount_add() will trigger a warning (panic under KASAN)".
>>
>> Could you please explain where I'm wrong?
> 
> To clarify:
> I'm agree it is unsafe  to call on alive skb:

I explained the problem badly in my previous letter; let me try once again.

I'm talking about this piece of code:
+	} else if (sk && skb->destructor != sock_edemux) {
+		delta = osize - skb_end_offset(skb);
+		if (!is_skb_wmem(skb))
+			skb_set_owner_w(skb, sk);
+		skb->truesize += delta;
+		if (sk_fullsock(sk))
+			refcount_add(delta, &sk->sk_wmem_alloc);
 	}

it is called on a live, expanded skb, and it is incorrect for 2 reasons:

a) if the old destructor uses a ref on sk->sk_wmem_alloc,
   it can decrease it to 0 and release sk.
b) if the old destructor uses a ref on sk->sk_refcnt and !sk_fullsock(sk),
   the old destructor can release the last reference and release sk.

We can work around the release of sk by moving
refcount_add(delta, &sk->sk_wmem_alloc) before skb_set_owner_w():

        } else if (sk && skb->destructor != sock_edemux) {
                delta = osize - skb_end_offset(skb);
                refcount_add(delta, &sk->sk_wmem_alloc);
                if (!is_skb_wmem(skb))
                        skb_set_owner_w(skb, sk);
                skb->truesize += delta;
#ifdef CONFIG_INET
                if (!sk_fullsock(sk))
                        refcount_dec(delta, &sk->sk_wmem_alloc);
#endif
        }

However, it does not resolve b) completely:
 
void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
        skb_orphan(skb); <<< old destructor releases last sk->refcnt ...
        skb->sk = sk;
...
        if (unlikely(!sk_fullsock(sk))) {
                skb->destructor = sock_edemux;
                sock_hold(sk);   <<<< ...and this triggers a warning/panic
                return;
        }       
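
The ordering problem in b) can be reproduced in a small userspace model. This is
only a sketch under stated assumptions: all model_* names are invented, a plain
int stands in for sk->sk_refcnt, the model records a "warned" flag where the
kernel would emit a refcount warning, and it does not free the socket when the
count reaches 0. The second variant, which takes the new reference before the
old owner lets go, is a hypothetical alternative and not something proposed in
the patch above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_sock {
	int refcnt;	/* models sk->sk_refcnt */
	bool warned;	/* set where the kernel would emit a refcount warning */
};

struct model_skb {
	struct model_sock *sk;
	void (*destructor)(struct model_skb *skb);
};

static void model_sock_hold(struct model_sock *sk)
{
	if (sk->refcnt == 0) {
		sk->warned = true;	/* increment on a released object */
		return;
	}
	sk->refcnt++;
}

static void model_sock_put(struct model_sock *sk)
{
	assert(sk->refcnt > 0);
	sk->refcnt--;
}

/* Destructor that owns one reference on the modeled sk_refcnt. */
static void model_sock_edemux(struct model_skb *skb)
{
	model_sock_put(skb->sk);
}

/* Run the current destructor, then detach the skb from its socket. */
static void model_skb_orphan(struct model_skb *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
	skb->destructor = NULL;
	skb->sk = NULL;
}

/* Same order as the excerpt: orphan first, then take the new reference. */
static void model_set_owner_orphan_first(struct model_skb *skb,
					 struct model_sock *sk)
{
	model_skb_orphan(skb);
	skb->sk = sk;
	skb->destructor = model_sock_edemux;
	model_sock_hold(sk);
}

/* Hypothetical alternative: pin the socket before the old owner lets go. */
static void model_set_owner_hold_first(struct model_skb *skb,
				       struct model_sock *sk)
{
	model_sock_hold(sk);
	model_skb_orphan(skb);
	skb->sk = sk;
	skb->destructor = model_sock_edemux;
}
```

When the skb holds the last reference, the orphan-first order drops the count to
0 before the hold and sets the warning flag; the hold-first order never lets the
count reach 0.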

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH net-next v5] skb_expand_head() adjust skb->truesize incorrectly
  2021-09-02  8:31                                                                                         ` Vasily Averin
@ 2021-09-02 11:12                                                                                           ` Vasily Averin
  2021-09-02 15:53                                                                                             ` Christoph Paasch
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-09-02 11:12 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Alexey Kuznetsov, Julian Wiedmann

Christoph Paasch reports [1] an incorrect skb->truesize
after the skb_expand_head() call in ip6_xmit.
This can happen for two reasons:
- skb_set_owner_w() for a newly cloned skb is called too early,
before pskb_expand_head(), where truesize is adjusted for the (!skb->sk) case.
- pskb_expand_head() does not adjust truesize in the (skb->sk) case.
In this case sk->sk_wmem_alloc should be adjusted too.

[1] https://lkml.org/lkml/2021/8/20/1082

Fixes: f1260ff15a71 ("skbuff: introduce skb_expand_head()")
Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
v5: fixed else condition, thanks to Eric
    reworked update of expanded skb,
    added corresponding comments
v4: decided to use is_skb_wmem() after pskb_expand_head() call
    fixed 'return (EXPRESSION);' in is_skb_wmem according to Eric Dumazet
v3: removed __pskb_expand_head(),
    added is_skb_wmem() helper for skb with wmem-compatible destructors
    there are 2 ways to use it:
     - before pskb_expand_head(), to create skb clones
     - after successful pskb_expand_head() to change owner on extended skb.
v2: based on patch version from Eric Dumazet,
    added __pskb_expand_head() function, which can be forced
    to adjust skb->truesize and sk->sk_wmem_alloc.
---
 include/net/sock.h |  1 +
 net/core/skbuff.c  | 63 ++++++++++++++++++++++++++++++++++++++++++++++--------
 net/core/sock.c    |  8 +++++++
 3 files changed, 63 insertions(+), 9 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 95b2577..173d58c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1695,6 +1695,7 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
 			     gfp_t priority);
 void __sock_wfree(struct sk_buff *skb);
 void sock_wfree(struct sk_buff *skb);
+bool is_skb_wmem(const struct sk_buff *skb);
 struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
 			     gfp_t priority);
 void skb_orphan_partial(struct sk_buff *skb);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f931176..29bb92e7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1804,28 +1804,73 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
 struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
 {
 	int delta = headroom - skb_headroom(skb);
+	int osize = skb_end_offset(skb);
+	struct sk_buff *oskb = NULL;
+	struct sock *sk = skb->sk;
 
 	if (WARN_ONCE(delta <= 0,
 		      "%s is expecting an increase in the headroom", __func__))
 		return skb;
 
-	/* pskb_expand_head() might crash, if skb is shared */
+	delta = SKB_DATA_ALIGN(delta);
+	/* pskb_expand_head() might crash, if skb is shared.
+	 * Also we should clone skb if its destructor does
+	 * not adjust skb->truesize and sk->sk_wmem_alloc
+	 */
 	if (skb_shared(skb)) {
 		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
 
-		if (likely(nskb)) {
-			if (skb->sk)
-				skb_set_owner_w(nskb, skb->sk);
-			consume_skb(skb);
-		} else {
+		if (unlikely(!nskb)) {
 			kfree_skb(skb);
+			return NULL;
 		}
+		oskb = skb;
 		skb = nskb;
 	}
-	if (skb &&
-	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
+	if (pskb_expand_head(skb, delta, 0, GFP_ATOMIC)) {
 		kfree_skb(skb);
-		skb = NULL;
+		kfree_skb(oskb);
+		return NULL;
+	}
+	if (oskb) {
+		if (sk)
+			skb_set_owner_w(skb, sk);
+		consume_skb(oskb);
+	} else if (sk && skb->destructor != sock_edemux) {
+		bool ref, set_owner;
+
+		ref = false; set_owner = false;
+		delta = osize - skb_end_offset(skb);
+		/* skb_set_owner_w() calls current skb destructor.
+		 * It can decrease sk_wmem_alloc to 0 and release sk,
+		 * To prevent it we increase sk_wmem_alloc earlier.
+		 * Another kind of destructors can release last sk_refcnt,
+		 * so it will be impossible to call sock_hold for !fullsock
+		 * Take extra sk_refcnt to prevent it.
+		 * Otherwise just increase truesize of expanded skb.
+		 */
+		refcount_add(delta, &sk->sk_wmem_alloc);
+		if (!is_skb_wmem(skb)) {
+			set_owner = true;
+			if (!sk_fullsock(sk) && IS_ENABLED(CONFIG_INET)) {
+				/* skb_set_owner_w can set sock_edemux */
+				ref = refcount_inc_not_zero(&sk->sk_refcnt);
+				if (!ref) {
+					set_owner = false;
+					WARN_ON(refcount_sub_and_test(delta, &sk->sk_wmem_alloc));
+				}
+			}
+		}
+		if (set_owner)
+			skb_set_owner_w(skb, sk);
+#ifdef CONFIG_INET
+		if (skb->destructor == sock_edemux) {
+			WARN_ON(refcount_sub_and_test(delta, &sk->sk_wmem_alloc));
+			if (ref)
+				WARN_ON(refcount_dec_and_test(&sk->sk_refcnt));
+		}
+#endif
+		skb->truesize += delta;
 	}
 	return skb;
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 950f1e7..6cbda43 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2227,6 +2227,14 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 }
 EXPORT_SYMBOL(skb_set_owner_w);
 
+bool is_skb_wmem(const struct sk_buff *skb)
+{
+	return skb->destructor == sock_wfree ||
+	       skb->destructor == __sock_wfree ||
+	       (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree);
+}
+EXPORT_SYMBOL(is_skb_wmem);
+
 static bool can_skb_orphan_partial(const struct sk_buff *skb)
 {
 #ifdef CONFIG_TLS_DEVICE
-- 
1.8.3.1
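
As a rough user-space illustration of the is_skb_wmem() helper in the
patch above: the check is a plain comparison of the skb destructor
pointer against a fixed set of write-memory destructors. This is only a
sketch, not kernel code. All toy_* names are hypothetical stand-ins, the
__sock_wfree case is folded into toy_sock_wfree, and the CONFIG_INET
guard is left out.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_skb;
typedef void (*toy_destructor_t)(struct toy_skb *skb);

/* Hypothetical stand-in: only the destructor pointer matters here. */
struct toy_skb {
	toy_destructor_t destructor;
};

/* Hypothetical destructors; their bodies are irrelevant for the check. */
static void toy_sock_wfree(struct toy_skb *skb) { (void)skb; }
static void toy_tcp_wfree(struct toy_skb *skb) { (void)skb; }
static void toy_sock_edemux(struct toy_skb *skb) { (void)skb; }

/* Same shape as is_skb_wmem(): true only for write-memory destructors. */
static bool toy_is_skb_wmem(const struct toy_skb *skb)
{
	assert(skb);
	return skb->destructor == toy_sock_wfree ||
	       skb->destructor == toy_tcp_wfree;
}
```

In this model an skb whose destructor is toy_sock_edemux is not
wmem-compatible, which corresponds to the case where the patch has to
call skb_set_owner_w() on the expanded skb.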


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH net-next v5] skb_expand_head() adjust skb->truesize incorrectly
  2021-09-02 11:12                                                                                           ` [PATCH net-next v5] " Vasily Averin
@ 2021-09-02 15:53                                                                                             ` Christoph Paasch
  2021-09-02 16:32                                                                                               ` Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Paasch @ 2021-09-02 15:53 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Eric Dumazet, David S. Miller, Hideaki YOSHIFUJI, David Ahern,
	Jakub Kicinski, netdev, LKML, kernel, Alexey Kuznetsov,
	Julian Wiedmann

On Thu, Sep 2, 2021 at 4:12 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> Christoph Paasch reports [1] an incorrect skb->truesize
> after a skb_expand_head() call in ip6_xmit().
> This can happen for two reasons:
> - skb_set_owner_w() for the newly cloned skb is called too early,
>   before pskb_expand_head(), where truesize is adjusted for the (!skb->sk) case.
> - pskb_expand_head() does not adjust truesize in the (skb->sk) case.
>   In this case sk->sk_wmem_alloc should be adjusted too.
>
> [1] https://lkml.org/lkml/2021/8/20/1082
>
> Fixes: f1260ff15a71 ("skbuff: introduce skb_expand_head()")
> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
> v5: fixed else condition, thanks to Eric
>     reworked update of expanded skb,
>     added corresponding comments
> v4: decided to use is_skb_wmem() after the pskb_expand_head() call,
>     fixed 'return (EXPRESSION);' in is_skb_wmem() as suggested by Eric Dumazet
> v3: removed __pskb_expand_head(),
>     added is_skb_wmem() helper for skbs with wmem-compatible destructors;
>     there are 2 ways to use it:
>      - before pskb_expand_head(), to create skb clones
>      - after a successful pskb_expand_head(), to change the owner of the expanded skb.
> v2: based on patch version from Eric Dumazet,
>     added __pskb_expand_head() function, which can be forced
>     to adjust skb->truesize and sk->sk_wmem_alloc.
> ---
>  include/net/sock.h |  1 +
>  net/core/skbuff.c  | 63 ++++++++++++++++++++++++++++++++++++++++++++++--------
>  net/core/sock.c    |  8 +++++++
>  3 files changed, 63 insertions(+), 9 deletions(-)

Still the same issues around refcount as I reported in my other email.

Did you try running the syzkaller reproducer on your setup?


Christoph

>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 95b2577..173d58c 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1695,6 +1695,7 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
>                              gfp_t priority);
>  void __sock_wfree(struct sk_buff *skb);
>  void sock_wfree(struct sk_buff *skb);
> +bool is_skb_wmem(const struct sk_buff *skb);
>  struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
>                              gfp_t priority);
>  void skb_orphan_partial(struct sk_buff *skb);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f931176..29bb92e7 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1804,28 +1804,73 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>  {
>         int delta = headroom - skb_headroom(skb);
> +       int osize = skb_end_offset(skb);
> +       struct sk_buff *oskb = NULL;
> +       struct sock *sk = skb->sk;
>
>         if (WARN_ONCE(delta <= 0,
>                       "%s is expecting an increase in the headroom", __func__))
>                 return skb;
>
> -       /* pskb_expand_head() might crash, if skb is shared */
> +       delta = SKB_DATA_ALIGN(delta);
> +       /* pskb_expand_head() might crash, if skb is shared.
> +        * Also we should clone skb if its destructor does
> +        * not adjust skb->truesize and sk->sk_wmem_alloc
> +        */
>         if (skb_shared(skb)) {
>                 struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>
> -               if (likely(nskb)) {
> -                       if (skb->sk)
> -                               skb_set_owner_w(nskb, skb->sk);
> -                       consume_skb(skb);
> -               } else {
> +               if (unlikely(!nskb)) {
>                         kfree_skb(skb);
> +                       return NULL;
>                 }
> +               oskb = skb;
>                 skb = nskb;
>         }
> -       if (skb &&
> -           pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
> +       if (pskb_expand_head(skb, delta, 0, GFP_ATOMIC)) {
>                 kfree_skb(skb);
> -               skb = NULL;
> +               kfree_skb(oskb);
> +               return NULL;
> +       }
> +       if (oskb) {
> +               if (sk)
> +                       skb_set_owner_w(skb, sk);
> +               consume_skb(oskb);
> +       } else if (sk && skb->destructor != sock_edemux) {
> +               bool ref, set_owner;
> +
> +               ref = false; set_owner = false;
> +               delta = osize - skb_end_offset(skb);
> +               /* skb_set_owner_w() calls current skb destructor.
> +                * It can decrease sk_wmem_alloc to 0 and release sk,
> +                * To prevent it we increase sk_wmem_alloc earlier.
> +                * Another kind of destructors can release last sk_refcnt,
> +                * so it will be impossible to call sock_hold for !fullsock
> +                * Take extra sk_refcnt to prevent it.
> +                * Otherwise just increase truesize of expanded skb.
> +                */
> +               refcount_add(delta, &sk->sk_wmem_alloc);
> +               if (!is_skb_wmem(skb)) {
> +                       set_owner = true;
> +                       if (!sk_fullsock(sk) && IS_ENABLED(CONFIG_INET)) {
> +                               /* skb_set_owner_w can set sock_edemux */
> +                               ref = refcount_inc_not_zero(&sk->sk_refcnt);
> +                               if (!ref) {
> +                                       set_owner = false;
> +                                       WARN_ON(refcount_sub_and_test(delta, &sk->sk_wmem_alloc));
> +                               }
> +                       }
> +               }
> +               if (set_owner)
> +                       skb_set_owner_w(skb, sk);
> +#ifdef CONFIG_INET
> +               if (skb->destructor == sock_edemux) {
> +                       WARN_ON(refcount_sub_and_test(delta, &sk->sk_wmem_alloc));
> +                       if (ref)
> +                               WARN_ON(refcount_dec_and_test(&sk->sk_refcnt));
> +               }
> +#endif
> +               skb->truesize += delta;
>         }
>         return skb;
>  }
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 950f1e7..6cbda43 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2227,6 +2227,14 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
>  }
>  EXPORT_SYMBOL(skb_set_owner_w);
>
> +bool is_skb_wmem(const struct sk_buff *skb)
> +{
> +       return skb->destructor == sock_wfree ||
> +              skb->destructor == __sock_wfree ||
> +              (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree);
> +}
> +EXPORT_SYMBOL(is_skb_wmem);
> +
>  static bool can_skb_orphan_partial(const struct sk_buff *skb)
>  {
>  #ifdef CONFIG_TLS_DEVICE
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH net-next v5] skb_expand_head() adjust skb->truesize incorrectly
  2021-09-02 15:53                                                                                             ` Christoph Paasch
@ 2021-09-02 16:32                                                                                               ` Vasily Averin
  2021-09-06 18:01                                                                                                 ` [PATCH net v6] " Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-09-02 16:32 UTC (permalink / raw)
  To: Christoph Paasch
  Cc: Eric Dumazet, David S. Miller, Hideaki YOSHIFUJI, David Ahern,
	Jakub Kicinski, netdev, LKML, kernel, Alexey Kuznetsov,
	Julian Wiedmann

On 9/2/21 6:53 PM, Christoph Paasch wrote:
> On Thu, Sep 2, 2021 at 4:12 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> Christoph Paasch reports [1] an incorrect skb->truesize
>> after a skb_expand_head() call in ip6_xmit().
>> This can happen for two reasons:
>> - skb_set_owner_w() for the newly cloned skb is called too early,
>>   before pskb_expand_head(), where truesize is adjusted for the (!skb->sk) case.
>> - pskb_expand_head() does not adjust truesize in the (skb->sk) case.
>>   In this case sk->sk_wmem_alloc should be adjusted too.
>>
>> [1] https://lkml.org/lkml/2021/8/20/1082
>>
>> Fixes: f1260ff15a71 ("skbuff: introduce skb_expand_head()")
>> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>> v5: fixed else condition, thanks to Eric
>>     reworked update of expanded skb,
>>     added corresponding comments
>> v4: decided to use is_skb_wmem() after the pskb_expand_head() call,
>>     fixed 'return (EXPRESSION);' in is_skb_wmem() as suggested by Eric Dumazet
>> v3: removed __pskb_expand_head(),
>>     added is_skb_wmem() helper for skbs with wmem-compatible destructors;
>>     there are 2 ways to use it:
>>      - before pskb_expand_head(), to create skb clones
>>      - after a successful pskb_expand_head(), to change the owner of the expanded skb.
>> v2: based on patch version from Eric Dumazet,
>>     added __pskb_expand_head() function, which can be forced
>>     to adjust skb->truesize and sk->sk_wmem_alloc.
>> ---
>>  include/net/sock.h |  1 +
>>  net/core/skbuff.c  | 63 ++++++++++++++++++++++++++++++++++++++++++++++--------
>>  net/core/sock.c    |  8 +++++++
>>  3 files changed, 63 insertions(+), 9 deletions(-)
> 
> Still the same issues around refcount as I reported in my other email.
> 
> Did you try running the syzkaller reproducer on your setup?

no, I do not have 

>> +       } else if (sk && skb->destructor != sock_edemux) {
>> +               bool ref, set_owner;
>> +
>> +               ref = false; set_owner = false;
>> +               delta = osize - skb_end_offset(skb);

The error is here; this line should instead be:
delta = skb_end_offset(skb) - osize;
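
For illustration only, here is a minimal user-space sketch of why the
operand order matters. It is not kernel code: struct toy_skb and the
toy_*/delta_* helpers are hypothetical stand-ins, and any sizes used
with them are arbitrary. After a successful expansion the end offset
grows, so "old minus new" is negative and would shrink the accounting
instead of growing it.

```c
#include <assert.h>

/* Hypothetical stand-in for the two fields that matter here. */
struct toy_skb {
	int end_offset;	/* models skb_end_offset(skb) */
	int truesize;	/* models skb->truesize */
};

/* Models a successful head expansion by 'grow' bytes. */
static void toy_expand(struct toy_skb *skb, int grow)
{
	assert(grow > 0);	/* the real function warns on delta <= 0 */
	skb->end_offset += grow;
}

/* Operand order used in v5: old size minus new size. */
static int delta_v5(int osize, const struct toy_skb *skb)
{
	return osize - skb->end_offset;
}

/* Corrected operand order: new size minus old size. */
static int delta_fixed(int osize, const struct toy_skb *skb)
{
	return skb->end_offset - osize;
}

/* Models 'skb->truesize += delta'. */
static void toy_account(struct toy_skb *skb, int delta)
{
	skb->truesize += delta;
}
```

With an original end offset of 192 and a 64-byte expansion, delta_v5()
returns -64 while delta_fixed() returns 64, so only the corrected order
makes truesize grow together with the buffer.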

>> +               /* skb_set_owner_w() calls current skb destructor.
>> +                * It can decrease sk_wmem_alloc to 0 and release sk,
>> +                * To prevent it we increase sk_wmem_alloc earlier.
>> +                * Another kind of destructors can release last sk_refcnt,
>> +                * so it will be impossible to call sock_hold for !fullsock
>> +                * Take extra sk_refcnt to prevent it.
>> +                * Otherwise just increase truesize of expanded skb.
>> +                */
>> +               refcount_add(delta, &sk->sk_wmem_alloc);
>> +               if (!is_skb_wmem(skb)) {
>> +                       set_owner = true;
>> +                       if (!sk_fullsock(sk) && IS_ENABLED(CONFIG_INET)) {
>> +                               /* skb_set_owner_w can set sock_edemux */
>> +                               ref = refcount_inc_not_zero(&sk->sk_refcnt);
>> +                               if (!ref) {
>> +                                       set_owner = false;
>> +                                       WARN_ON(refcount_sub_and_test(delta, &sk->sk_wmem_alloc));
>> +                               }
>> +                       }
>> +               }
>> +               if (set_owner)
>> +                       skb_set_owner_w(skb, sk);
>> +#ifdef CONFIG_INET
>> +               if (skb->destructor == sock_edemux) {
>> +                       WARN_ON(refcount_sub_and_test(delta, &sk->sk_wmem_alloc));
>> +                       if (ref)
>> +                               WARN_ON(refcount_dec_and_test(&sk->sk_refcnt));
>> +               }
>> +#endif
>> +               skb->truesize += delta;
>>         }
>>         return skb;
>>  }
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 950f1e7..6cbda43 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -2227,6 +2227,14 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
>>  }
>>  EXPORT_SYMBOL(skb_set_owner_w);
>>
>> +bool is_skb_wmem(const struct sk_buff *skb)
>> +{
>> +       return skb->destructor == sock_wfree ||
>> +              skb->destructor == __sock_wfree ||
>> +              (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree);
>> +}
>> +EXPORT_SYMBOL(is_skb_wmem);
>> +
>>  static bool can_skb_orphan_partial(const struct sk_buff *skb)
>>  {
>>  #ifdef CONFIG_TLS_DEVICE
>> --
>> 1.8.3.1
>>


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH net v6] skb_expand_head() adjust skb->truesize incorrectly
  2021-09-02 16:32                                                                                               ` Vasily Averin
@ 2021-09-06 18:01                                                                                                 ` Vasily Averin
  2021-09-06 18:03                                                                                                   ` Vasily Averin
  0 siblings, 1 reply; 106+ messages in thread
From: Vasily Averin @ 2021-09-06 18:01 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Alexey Kuznetsov, Julian Wiedmann

Christoph Paasch reports [1] an incorrect skb->truesize
after a skb_expand_head() call in ip6_xmit().
This can happen for two reasons:
- skb_set_owner_w() for the newly cloned skb is called too early,
  before pskb_expand_head(), where truesize is adjusted for the (!skb->sk) case.
- pskb_expand_head() does not adjust truesize in the (skb->sk) case.
  In this case sk->sk_wmem_alloc should be adjusted too.

[1] https://lkml.org/lkml/2021/8/20/1082

Fixes: f1260ff15a71 ("skbuff: introduce skb_expand_head()")
Fixes: 2d85a1b31dde ("ipv6: ip6_finish_output2: set sk into newly allocated nskb")
Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
v6: fixed delta,
    improved comments
v5: fixed else condition, thanks to Eric
    reworked update of expanded skb,
    added corresponding comments
v4: decided to use is_skb_wmem() after the pskb_expand_head() call,
    fixed 'return (EXPRESSION);' in is_skb_wmem() as suggested by Eric Dumazet
v3: removed __pskb_expand_head(),
    added is_skb_wmem() helper for skbs with wmem-compatible destructors;
    there are 2 ways to use it:
     - before pskb_expand_head(), to create skb clones
     - after a successful pskb_expand_head(), to change the owner of the expanded skb.
v2: based on patch version from Eric Dumazet,
    added __pskb_expand_head() function, which can be forced
    to adjust skb->truesize and sk->sk_wmem_alloc.
---
 include/net/sock.h |  1 +
 net/core/skbuff.c  | 60 ++++++++++++++++++++++++++++++++++++++++++++++--------
 net/core/sock.c    |  8 ++++++++
 3 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 95b2577..173d58c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1695,6 +1695,7 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
 			     gfp_t priority);
 void __sock_wfree(struct sk_buff *skb);
 void sock_wfree(struct sk_buff *skb);
+bool is_skb_wmem(const struct sk_buff *skb);
 struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
 			     gfp_t priority);
 void skb_orphan_partial(struct sk_buff *skb);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f931176..e2a2aa31 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1804,28 +1804,70 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
 struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
 {
 	int delta = headroom - skb_headroom(skb);
+	int osize = skb_end_offset(skb);
+	struct sk_buff *oskb = NULL;
+	struct sock *sk = skb->sk;
 
 	if (WARN_ONCE(delta <= 0,
 		      "%s is expecting an increase in the headroom", __func__))
 		return skb;
 
-	/* pskb_expand_head() might crash, if skb is shared */
+	delta = SKB_DATA_ALIGN(delta);
+	/* pskb_expand_head() might crash, if skb is shared. */
 	if (skb_shared(skb)) {
 		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
 
-		if (likely(nskb)) {
-			if (skb->sk)
-				skb_set_owner_w(nskb, skb->sk);
-			consume_skb(skb);
-		} else {
+		if (unlikely(!nskb)) {
 			kfree_skb(skb);
+			return NULL;
 		}
+		oskb = skb;
 		skb = nskb;
 	}
-	if (skb &&
-	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
+	if (pskb_expand_head(skb, delta, 0, GFP_ATOMIC)) {
 		kfree_skb(skb);
-		skb = NULL;
+		kfree_skb(oskb);
+		return NULL;
+	}
+	if (oskb) {
+		if (sk)
+			skb_set_owner_w(skb, sk);
+		consume_skb(oskb);
+	} else if (sk && skb->destructor != sock_edemux) {
+		bool ref, set_owner;
+
+		ref = false; set_owner = false;
+		delta = skb_end_offset(skb) - osize;
+		/* skb_set_owner_w() calls current skb destructor.
+		 * It can reduce sk_wmem_alloc to 0 and release sk,
+		 * To prevent this, we increase sk_wmem_alloc in advance.
+		 * Some destructors might release the last sk_refcnt,
+		 * so it won't be possible to call sock_hold for !fullsock
+		 * We take an extra sk_refcnt to prevent this.
+		 * In any case we increase truesize of expanded skb.
+		 */
+		refcount_add(delta, &sk->sk_wmem_alloc);
+		if (!is_skb_wmem(skb)) {
+			set_owner = true;
+			if (!sk_fullsock(sk) && IS_ENABLED(CONFIG_INET)) {
+				/* skb_set_owner_w can set sock_edemux */
+				ref = refcount_inc_not_zero(&sk->sk_refcnt);
+				if (!ref) {
+					set_owner = false;
+					WARN_ON(refcount_sub_and_test(delta, &sk->sk_wmem_alloc));
+				}
+			}
+		}
+		if (set_owner)
+			skb_set_owner_w(skb, sk);
+#ifdef CONFIG_INET
+		if (skb->destructor == sock_edemux) {
+			WARN_ON(refcount_sub_and_test(delta, &sk->sk_wmem_alloc));
+			if (ref)
+				WARN_ON(refcount_dec_and_test(&sk->sk_refcnt));
+		}
+#endif
+		skb->truesize += delta;
 	}
 	return skb;
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 950f1e7..6cbda43 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2227,6 +2227,14 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 }
 EXPORT_SYMBOL(skb_set_owner_w);
 
+bool is_skb_wmem(const struct sk_buff *skb)
+{
+	return skb->destructor == sock_wfree ||
+	       skb->destructor == __sock_wfree ||
+	       (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree);
+}
+EXPORT_SYMBOL(is_skb_wmem);
+
 static bool can_skb_orphan_partial(const struct sk_buff *skb)
 {
 #ifdef CONFIG_TLS_DEVICE
-- 
1.8.3.1
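
The first cause named in the commit message (owner set before the
expansion) can be sketched in user space as follows. This is a toy
model under stated assumptions, not kernel code: the toy_* names are
hypothetical, the model keeps only the rule from the commit message
that pskb_expand_head() adjusts truesize when the skb has no owner, and
it ignores the sock_edemux case and all refcount details.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sock {
	int wmem_alloc;		/* models sk->sk_wmem_alloc */
};

struct toy_skb {
	struct toy_sock *sk;	/* owner, or NULL */
	int truesize;		/* models skb->truesize */
};

/* Models skb_set_owner_w(): charge the current truesize to the socket. */
static void toy_set_owner_w(struct toy_skb *skb, struct toy_sock *sk)
{
	skb->sk = sk;
	sk->wmem_alloc += skb->truesize;
}

/* Models the relevant part of pskb_expand_head(): truesize is only
 * adjusted when the skb has no owner.
 */
static void toy_expand_head(struct toy_skb *skb, int grow)
{
	if (!skb->sk)
		skb->truesize += grow;
}

/* Models a sock_wfree()-style destructor: uncharge truesize. */
static void toy_wfree(struct toy_skb *skb)
{
	assert(skb->sk != NULL);
	skb->sk->wmem_alloc -= skb->truesize;
	skb->sk = NULL;
}
```

In this model, calling toy_set_owner_w() before toy_expand_head() leaves
truesize at its old value even though the buffer grew, while expanding
first and setting the owner afterwards charges the socket with the
expanded size. That ordering is what the clone path of the patch uses.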


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH net v6] skb_expand_head() adjust skb->truesize incorrectly
  2021-09-06 18:01                                                                                                 ` [PATCH net v6] " Vasily Averin
@ 2021-09-06 18:03                                                                                                   ` Vasily Averin
  0 siblings, 0 replies; 106+ messages in thread
From: Vasily Averin @ 2021-09-06 18:03 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet, David S. Miller
  Cc: Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev,
	linux-kernel, kernel, Alexey Kuznetsov, Julian Wiedmann

I've finally reproduced the original issue using the reproducer from Christoph Paasch,
and was able to validate this patch locally.

Thank you,
	Vasily Averin

On 9/6/21 9:01 PM, Vasily Averin wrote:
> Christoph Paasch reports [1] an incorrect skb->truesize
> after a skb_expand_head() call in ip6_xmit().
> This can happen for two reasons:
> - skb_set_owner_w() for the newly cloned skb is called too early,
>   before pskb_expand_head(), where truesize is adjusted for the (!skb->sk) case.
> - pskb_expand_head() does not adjust truesize in the (skb->sk) case.
>   In this case sk->sk_wmem_alloc should be adjusted too.
> 
> [1] https://lkml.org/lkml/2021/8/20/1082
> 
> Fixes: f1260ff15a71 ("skbuff: introduce skb_expand_head()")
> Fixes: 2d85a1b31dde ("ipv6: ip6_finish_output2: set sk into newly allocated nskb")
> Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
> v6: fixed delta,
>     improved comments
> v5: fixed else condition, thanks to Eric
>     reworked update of expanded skb,
>     added corresponding comments
> v4: decided to use is_skb_wmem() after the pskb_expand_head() call,
>     fixed 'return (EXPRESSION);' in is_skb_wmem() as suggested by Eric Dumazet
> v3: removed __pskb_expand_head(),
>     added is_skb_wmem() helper for skbs with wmem-compatible destructors;
>     there are 2 ways to use it:
>      - before pskb_expand_head(), to create skb clones
>      - after a successful pskb_expand_head(), to change the owner of the expanded skb.
> v2: based on patch version from Eric Dumazet,
>     added __pskb_expand_head() function, which can be forced
>     to adjust skb->truesize and sk->sk_wmem_alloc.
> ---
>  include/net/sock.h |  1 +
>  net/core/skbuff.c  | 60 ++++++++++++++++++++++++++++++++++++++++++++++--------
>  net/core/sock.c    |  8 ++++++++
>  3 files changed, 60 insertions(+), 9 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 95b2577..173d58c 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1695,6 +1695,7 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
>  			     gfp_t priority);
>  void __sock_wfree(struct sk_buff *skb);
>  void sock_wfree(struct sk_buff *skb);
> +bool is_skb_wmem(const struct sk_buff *skb);
>  struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
>  			     gfp_t priority);
>  void skb_orphan_partial(struct sk_buff *skb);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f931176..e2a2aa31 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1804,28 +1804,70 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
>  struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
>  {
>  	int delta = headroom - skb_headroom(skb);
> +	int osize = skb_end_offset(skb);
> +	struct sk_buff *oskb = NULL;
> +	struct sock *sk = skb->sk;
>  
>  	if (WARN_ONCE(delta <= 0,
>  		      "%s is expecting an increase in the headroom", __func__))
>  		return skb;
>  
> -	/* pskb_expand_head() might crash, if skb is shared */
> +	delta = SKB_DATA_ALIGN(delta);
> +	/* pskb_expand_head() might crash, if skb is shared. */
>  	if (skb_shared(skb)) {
>  		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>  
> -		if (likely(nskb)) {
> -			if (skb->sk)
> -				skb_set_owner_w(nskb, skb->sk);
> -			consume_skb(skb);
> -		} else {
> +		if (unlikely(!nskb)) {
>  			kfree_skb(skb);
> +			return NULL;
>  		}
> +		oskb = skb;
>  		skb = nskb;
>  	}
> -	if (skb &&
> -	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
> +	if (pskb_expand_head(skb, delta, 0, GFP_ATOMIC)) {
>  		kfree_skb(skb);
> -		skb = NULL;
> +		kfree_skb(oskb);
> +		return NULL;
> +	}
> +	if (oskb) {
> +		if (sk)
> +			skb_set_owner_w(skb, sk);
> +		consume_skb(oskb);
> +	} else if (sk && skb->destructor != sock_edemux) {
> +		bool ref, set_owner;
> +
> +		ref = false; set_owner = false;
> +		delta = skb_end_offset(skb) - osize;
> +		/* skb_set_owner_w() calls current skb destructor.
> +		 * It can reduce sk_wmem_alloc to 0 and release sk,
> +		 * To prevent this, we increase sk_wmem_alloc in advance.
> +		 * Some destructors might release the last sk_refcnt,
> +		 * so it won't be possible to call sock_hold for !fullsock
> +		 * We take an extra sk_refcnt to prevent this.
> +		 * In any case we increase truesize of expanded skb.
> +		 */
> +		refcount_add(delta, &sk->sk_wmem_alloc);
> +		if (!is_skb_wmem(skb)) {
> +			set_owner = true;
> +			if (!sk_fullsock(sk) && IS_ENABLED(CONFIG_INET)) {
> +				/* skb_set_owner_w can set sock_edemux */
> +				ref = refcount_inc_not_zero(&sk->sk_refcnt);
> +				if (!ref) {
> +					set_owner = false;
> +					WARN_ON(refcount_sub_and_test(delta, &sk->sk_wmem_alloc));
> +				}
> +			}
> +		}
> +		if (set_owner)
> +			skb_set_owner_w(skb, sk);
> +#ifdef CONFIG_INET
> +		if (skb->destructor == sock_edemux) {
> +			WARN_ON(refcount_sub_and_test(delta, &sk->sk_wmem_alloc));
> +			if (ref)
> +				WARN_ON(refcount_dec_and_test(&sk->sk_refcnt));
> +		}
> +#endif
> +		skb->truesize += delta;
>  	}
>  	return skb;
>  }
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 950f1e7..6cbda43 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2227,6 +2227,14 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
>  }
>  EXPORT_SYMBOL(skb_set_owner_w);
>  
> +bool is_skb_wmem(const struct sk_buff *skb)
> +{
> +	return skb->destructor == sock_wfree ||
> +	       skb->destructor == __sock_wfree ||
> +	       (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree);
> +}
> +EXPORT_SYMBOL(is_skb_wmem);
> +
>  static bool can_skb_orphan_partial(const struct sk_buff *skb)
>  {
>  #ifdef CONFIG_TLS_DEVICE
> 


^ permalink raw reply	[flat|nested] 106+ messages in thread

end of thread, other threads:[~2021-09-06 18:03 UTC | newest]

Thread overview: 106+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <cover.1625665132.git.vvs@virtuozzo.com>
2021-07-07 14:04 ` [PATCH IPV6 1/1] ipv6: allocate enough headroom in ip6_finish_output2() Vasily Averin
2021-07-07 14:45   ` David Ahern
2021-07-07 16:42     ` Jakub Kicinski
2021-07-07 17:41       ` Eric Dumazet
2021-07-07 17:53         ` Vasily Averin
2021-07-07 18:30         ` Jakub Kicinski
2021-07-07 18:50           ` Eric Dumazet
2021-07-09  9:04         ` [PATCH IPV6 v2 0/4] " Vasily Averin
2021-07-12  6:44           ` [PATCH IPV6 v3 0/1] " Vasily Averin
     [not found]           ` <cover.1626069562.git.vvs@virtuozzo.com>
2021-07-12  6:45             ` [PATCH IPV6 v3 1/1] " Vasily Averin
2021-07-12 18:30               ` patchwork-bot+netdevbpf
2021-07-13  7:46               ` Vasily Averin
2021-07-13 12:01                 ` [PATCH NET v4 0/1] " Vasily Averin
     [not found]                 ` <cover.1626177047.git.vvs@virtuozzo.com>
2021-07-13 12:01                   ` [PATCH NET v4 1/1] " Vasily Averin
2021-07-18 10:44                     ` Vasily Averin
2021-07-18 15:22                       ` David Ahern
2021-07-18 17:04                       ` David Miller
2021-07-19  7:55                         ` [PATCH NET] ipv6: ip6_finish_output2: set sk into newly allocated nskb Vasily Averin
2021-07-20 10:10                           ` patchwork-bot+netdevbpf
2021-07-13 12:31                 ` [PATCH IPV6 v3 1/1] ipv6: allocate enough headroom in ip6_finish_output2() Vasily Averin
2021-07-12 13:26           ` [PATCH NET 0/7] skbuff: introduce pskb_realloc_headroom() Vasily Averin
     [not found]           ` <cover.1626093470.git.vvs@virtuozzo.com>
2021-07-12 13:26             ` [PATCH NET 1/7] " Vasily Averin
2021-07-12 17:53               ` Jakub Kicinski
2021-07-12 18:45                 ` Vasily Averin
2021-07-13 20:57                   ` [PATCH NET v2 0/7] skbuff: introduce skb_expand_head() Vasily Averin
2021-08-02  8:52                     ` [PATCH NET v3 " Vasily Averin
     [not found]                     ` <cover.1627891754.git.vvs@virtuozzo.com>
2021-08-02  8:52                       ` [PATCH NET v3 1/7] " Vasily Averin
2021-08-02  8:52                       ` [PATCH NET v3 2/7] ipv6: use skb_expand_head in ip6_finish_output2 Vasily Averin
2021-08-02  8:52                       ` [PATCH NET v3 3/7] ipv6: use skb_expand_head in ip6_xmit Vasily Averin
2021-08-02  8:52                       ` [PATCH NET v3 4/7] ipv4: use skb_expand_head in ip_finish_output2 Vasily Averin
2021-08-02  8:52                       ` [PATCH NET v3 5/7] vrf: use skb_expand_head in vrf_finish_output Vasily Averin
2021-08-05 11:55                         ` Julian Wiedmann
2021-08-05 12:55                           ` Vasily Averin
2021-08-06  7:49                           ` [PATCH NET v4 0/7] skbuff: introduce skb_expand_head() Vasily Averin
2021-08-06 10:14                             ` David Miller
2021-08-06 12:53                               ` [PATCH NET] vrf: fix null pointer dereference in vrf_finish_output() Vasily Averin
2021-08-06 22:42                                 ` Jakub Kicinski
2021-08-07  6:41                                   ` Vasily Averin
     [not found]                           ` <cover.1628235065.git.vvs@virtuozzo.com>
2021-08-06  7:49                             ` [PATCH NET v4 1/7] skbuff: introduce skb_expand_head() Vasily Averin
2021-08-06  7:50                             ` [PATCH NET v4 2/7] ipv6: use skb_expand_head in ip6_finish_output2 Vasily Averin
2021-08-06  7:50                             ` [PATCH NET v4 3/7] ipv6: use skb_expand_head in ip6_xmit Vasily Averin
     [not found]                               ` <CALMXkpaay1y=0tkbnskr4gf-HTMjJJsVryh4Prnej_ws-hJvBg@mail.gmail.com>
2021-08-20 22:44                                 ` Christoph Paasch
2021-08-21  6:21                                   ` Vasily Averin
2021-08-22 17:04                                     ` Christoph Paasch
2021-08-22 17:13                                       ` Christoph Paasch
2021-08-23  5:44                                         ` Vasily Averin
2021-08-23  5:59                                           ` Vasily Averin
2021-08-23  7:56                                             ` [PATCH NET-NEXT] ipv6: skb_expand_head() adjust skb->truesize incorrectly Vasily Averin
2021-08-23 17:25                                               ` Christoph Paasch
2021-08-23 21:45                                                 ` Eric Dumazet
2021-08-23 21:51                                                   ` Eric Dumazet
2021-08-23 22:23                                                     ` Eric Dumazet
2021-08-24  8:50                                                       ` Vasily Averin
2021-08-24 17:21                                                         ` Vasily Averin
2021-08-25 17:49                                                           ` Christoph Paasch
2021-08-29 12:59                                                             ` [PATCH v2] " Vasily Averin
2021-08-30  5:52                                                               ` [PATCH net-next " Vasily Averin
2021-08-30 16:01                                                               ` [PATCH " Eric Dumazet
2021-08-30 18:09                                                                 ` Vasily Averin
2021-08-30 18:37                                                                   ` Vasily Averin
2021-08-30 19:58                                                                   ` Eric Dumazet
2021-08-31 14:34                                                                     ` [PATCH net-next v3 RFC] " Vasily Averin
2021-08-31 19:38                                                                       ` Eric Dumazet
2021-09-01  6:20                                                                         ` Vasily Averin
2021-09-01  8:11                                                                           ` [PATCH net-next v4] " Vasily Averin
2021-09-01 16:58                                                                             ` Christoph Paasch
2021-09-01 19:17                                                                             ` Eric Dumazet
2021-09-02  3:59                                                                               ` Vasily Averin
2021-09-02  4:32                                                                                 ` Eric Dumazet
2021-09-02  4:48                                                                                   ` Eric Dumazet
2021-09-02  7:13                                                                                     ` Vasily Averin
2021-09-02  7:33                                                                                       ` Vasily Averin
2021-09-02  8:31                                                                                         ` Vasily Averin
2021-09-02 11:12                                                                                           ` [PATCH net-next v5] " Vasily Averin
2021-09-02 15:53                                                                                             ` Christoph Paasch
2021-09-02 16:32                                                                                               ` Vasily Averin
2021-09-06 18:01                                                                                                 ` [PATCH net v6] " Vasily Averin
2021-09-06 18:03                                                                                                   ` Vasily Averin
2021-08-27 15:23                                                       ` [PATCH NET-NEXT] ipv6: " Vasily Averin
2021-08-27 16:47                                                         ` Eric Dumazet
2021-08-28  8:01                                                           ` Vasily Averin
2021-08-06  7:50                             ` [PATCH NET v4 4/7] ipv4: use skb_expand_head in ip_finish_output2 Vasily Averin
2021-08-06  7:50                             ` [PATCH NET v4 5/7] vrf: use skb_expand_head in vrf_finish_output Vasily Averin
2021-08-06  7:50                             ` [PATCH NET v4 6/7] ax25: use skb_expand_head Vasily Averin
2021-08-06  7:50                             ` [PATCH NET v4 7/7] bpf: use skb_expand_head in bpf_out_neigh_v4/6 Vasily Averin
2021-08-02  8:52                       ` [PATCH NET v3 6/7] ax25: use skb_expand_head Vasily Averin
2021-08-02  8:52                       ` [PATCH NET v3 7/7] bpf: use skb_expand_head in bpf_out_neigh_v4/6 Vasily Averin
     [not found]                   ` <cover.1626206993.git.vvs@virtuozzo.com>
2021-07-13 20:57                     ` [PATCH NET v2 1/7] skbuff: introduce skb_expand_head() Vasily Averin
2021-07-13 20:58                     ` [PATCH NET v2 2/7] ipv6: use skb_expand_head in ip6_finish_output2 Vasily Averin
2021-07-13 20:58                     ` [PATCH NET v2 3/7] ipv6: use skb_expand_head in ip6_xmit Vasily Averin
2021-07-13 20:58                     ` [PATCH NET v2 4/7] ipv4: use skb_expand_head in ip_finish_output2 Vasily Averin
2021-07-13 20:58                     ` [PATCH NET v2 5/7] vrf: use skb_expand_head in vrf_finish_output Vasily Averin
2021-07-13 20:58                     ` [PATCH NET v2 6/7] ax25: use skb_expand_head Vasily Averin
2021-07-13 20:58                     ` [PATCH NET v2 7/7] bpf: use skb_expand_head in bpf_out_neigh_v4/6 Vasily Averin
2021-07-12 13:26             ` [PATCH NET 2/7] ipv6: use pskb_realloc_headroom in ip6_finish_output2 Vasily Averin
2021-07-12 13:26             ` [PATCH NET 3/7] ipv6: use pskb_realloc_headroom in ip6_xmit refactoring Vasily Averin
2021-07-12 13:27             ` [PATCH NET 4/7] ipv4: use pskb_realloc_headroom in ip_finish_output2 Vasily Averin
2021-07-12 13:27             ` [PATCH NET 5/7] vrf: use pskb_realloc_headroom in vrf_finish_output Vasily Averin
2021-07-12 13:27             ` [PATCH NET 6/7] ax25: use pskb_realloc_headroom Vasily Averin
2021-07-12 13:27             ` [PATCH NET 7/7] bpf: use pskb_realloc_headroom in bpf_out_neigh_v4/6 Vasily Averin
     [not found]         ` <cover.1625818825.git.vvs@virtuozzo.com>
2021-07-09  9:04           ` [PATCH IPV6 v2 1/4] ipv6: allocate enough headroom in ip6_finish_output2() Vasily Averin
2021-07-09 17:58             ` David Miller
2021-07-10  2:53               ` Vasily Averin
2021-07-09  9:04           ` [PATCH IPV6 v2 2/4] ipv6: use new helper skb_expand_head() in ip6_xmit() Vasily Averin
2021-07-09  9:05           ` [PATCH IPV6 v2 3/4] ipv6: ip6_finish_output2 refactoring Vasily Averin
2021-07-09  9:05           ` [PATCH IPV6 v2 4/4] ipv6: ip6_xmit refactoring Vasily Averin
