All of lore.kernel.org
 help / color / mirror / Atom feed
* [BUG] skb corruption and kernel panic at forwarding with fragmentation
@ 2016-01-06 19:15 Konstantin Khlebnikov
  2016-01-06 19:59 ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Khlebnikov @ 2016-01-06 19:15 UTC (permalink / raw)
  To: netdev, David Miller, Eric Dumazet, Linux Kernel Mailing List

I've got some of these:

[84408.314676] BUG: unable to handle kernel NULL pointer dereference
at           (null)
[84408.317324] IP: [<ffffffff81166e15>] put_page+0x5/0x50
[84408.319985] PGD 0
[84408.322583] Oops: 0000 [#1] SMP
[84408.325156] Modules linked in: ppp_mppe ppp_async ppp_generic slhc
8021q fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace
sunrpc bridge stp llc xt_HL xt_TCPMSS xt_state w83627ehf hwmon_vid
snd_hda_codec_realtek snd_hda_codec_generic radeon snd_hda_codec_hdmi
snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm
snd_hda_core edac_core k10temp snd_timer snd drm_kms_helper soundcore
ath9k ttm ath9k_common ath9k_hw ath r8169 mii
[84408.336804] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.1.15-zurg #1
[84408.339839] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./RS880D, BIOS 080015  04/12/2011
[84408.342964] task: ffff880216d56f50 ti: ffff880216e04000 task.ti:
ffff880216e04000
[84408.346136] RIP: 0010:[<ffffffff81166e15>]  [<ffffffff81166e15>]
put_page+0x5/0x50
[84408.349301] RSP: 0018:ffff88021fcc37c0  EFLAGS: 00010216
[84408.352433] RAX: 0000000000000030 RBX: 0000000000000001 RCX: 0000000000000077
[84408.355602] RDX: ffff880213d8818e RSI: 0000000000000200 RDI: 0000000000000000
[84408.358765] RBP: ffff88021fcc37e8 R08: 0000000000000076 R09: ffff880216c01900
[84408.361885] R10: ffffea000859a840 R11: 0000000000000001 R12: ffff8802166a1300
[84408.364988] R13: ffff88021280d8c0 R14: ffff8802166a1300 R15: ffff88021280d410
[84408.368059] FS:  00007f9ada2de700(0000) GS:ffff88021fcc0000(0000)
knlGS:0000000000000000
[84408.371211] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[84408.374336] CR2: 0000000000000000 CR3: 0000000216575000 CR4: 00000000000006e0
[84408.377484] Stack:
[84408.380623]  ffffffff81576ac8 ffff88021fcc37e8 ffff8802166a1300
ffff8802166a1300
[84408.383843]  0000000000000000 ffff88021fcc3808 ffffffff81576b48
ffff88021fcc3808
[84408.387022]  0000000000006e00 ffff88021fcc3828 ffffffff81576cdd
0000000000006e00
[84408.390187] Call Trace:
[84408.393293]  <IRQ>
[84408.393323]  [<ffffffff81576ac8>] ? skb_release_data+0x78/0xd0
[84408.399488]  [<ffffffff81576b48>] skb_release_all+0x28/0x30
[84408.402553]  [<ffffffff81576cdd>] consume_skb+0x5d/0x80
[84408.405630]  [<ffffffff815d0d64>] ip_fragment+0x5c4/0x970
[84408.408676]  [<ffffffff815cf740>] ? ip_copy_metadata+0x160/0x160
[84408.411733]  [<ffffffff815d1711>] ip_finish_output+0x601/0x900
[84408.414788]  [<ffffffff815b6ed9>] ? nf_hook_slow+0x99/0x100
[84408.417828]  [<ffffffff815d2366>] ip_output+0x66/0xc0
[84408.420847]  [<ffffffff815d1110>] ? ip_fragment+0x970/0x970
[84408.423864]  [<ffffffff815cd683>] ip_forward_finish+0x73/0xa0
[84408.426864]  [<ffffffff815cda5f>] ip_forward+0x3af/0x490
[84408.429833]  [<ffffffff815cd610>] ? ip_frag_mem+0x50/0x50
[84408.432782]  [<ffffffff815cb701>] ip_rcv_finish+0x81/0x370
[84408.435778]  [<ffffffff815cc0b2>] ip_rcv+0x2a2/0x3c0
[84408.438780]  [<ffffffff815cb680>] ? inet_del_offload+0x40/0x40
[84408.441780]  [<ffffffff8158a623>] __netif_receive_skb_core+0x673/0x810
[84408.444785]  [<ffffffff8158a7d8>] __netif_receive_skb+0x18/0x60
[84408.447766]  [<ffffffff8158a843>] netif_receive_skb_internal+0x23/0x90
[84408.450739]  [<ffffffff8158a8cc>] netif_receive_skb_sk+0x1c/0x70
[84408.453726]  [<ffffffffa04a9e5c>] br_handle_frame_finish+0x27c/0x520 [bridge]
[84408.456774]  [<ffffffff8161dcc8>] ? ipv4_confirm+0xb8/0xe0
[84408.459787]  [<ffffffffa04aa261>] br_handle_frame+0x161/0x290 [bridge]
[84408.462803]  [<ffffffff815cbdb6>] ? ip_local_deliver+0x46/0xa0
[84408.465796]  [<ffffffff8158a2de>] __netif_receive_skb_core+0x32e/0x810
[84408.468822]  [<ffffffff8158a7d8>] __netif_receive_skb+0x18/0x60
[84408.471748]  [<ffffffff8158a843>] netif_receive_skb_internal+0x23/0x90
[84408.474615]  [<ffffffff815f6483>] ? tcp4_gro_complete+0x73/0x80
[84408.477378]  [<ffffffff8158a9bc>] napi_gro_complete+0x9c/0xe0
[84408.480045]  [<ffffffff8158b0a0>] dev_gro_receive+0x230/0x360
[84408.482675]  [<ffffffff8158b400>] napi_gro_receive+0x30/0x100
[84408.485240]  [<ffffffffa000e8d6>] rtl8169_poll+0x2c6/0x6b0 [r8169]
[84408.487766]  [<ffffffff8158ad4a>] net_rx_action+0x1fa/0x320
[84408.490241]  [<ffffffff81090a1b>] __do_softirq+0x10b/0x2d0
[84408.492672]  [<ffffffff81090db5>] irq_exit+0xd5/0xe0
[84408.495072]  [<ffffffff817452d8>] do_IRQ+0x58/0xf0
[84408.497463]  [<ffffffff8174356e>] common_interrupt+0x6e/0x6e
[84408.499879]  <EOI>
[84408.499909]  [<ffffffff8104c726>] ? native_safe_halt+0x6/0x10
[84408.504697]  [<ffffffff810f01be>] ? tick_broadcast_oneshot_control+0xbe/0x200
[84408.507126]  [<ffffffff8100e98e>] default_idle+0x1e/0xc0
[84408.509516]  [<ffffffff8100ea9e>] amd_e400_idle+0x6e/0xf0
[84408.511879]  [<ffffffff8100f51f>] arch_cpu_idle+0xf/0x20
[84408.514181]  [<ffffffff810c4c37>] cpu_startup_entry+0x327/0x3a0
[84408.516456]  [<ffffffff810eea3c>] ? clockevents_register_device+0xec/0x1d0
[84408.518760]  [<ffffffff8103ba08>] start_secondary+0x138/0x160
[84408.521066] Code: 48 89 d7 e8 2e f7 ff ff e9 a1 fe ff ff 48 89 d7
e8 51 f7 ff ff e9 94 fe ff ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 66
66 66 66 90 <48> f7 07 00 c0 00 00 55 48 89 e5 75 1e 8b 47 1c 85 c0 74
27 f0
[84408.526216] RIP  [<ffffffff81166e15>] put_page+0x5/0x50
[84408.528705]  RSP <ffff88021fcc37c0>
[84408.531178] CR2: 0000000000000000

Looks like this happens because ip_options_fragment() relies on
correct ip options length in ip control block in skb. But in
ip_finish_output_gso() control block in segments is reused by
skb_gso_segment(). following ip_fragment() sees some garbage.

In my case there was no ip options but length becomes non-zero and
ip_options_fragment() picked some bytes from payload and decides to
fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
in skb_shared_info at the end of data =)

Here is quick hack: just make room for ip control block in gso control block.

--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3316,6 +3316,7 @@ static inline struct sec_path
*skb_sec_path(struct sk_buff *skb)
  * Keeps track of level of encapsulation of network headers.
  */
 struct skb_gso_cb {
+ char pad[32]; /* inet_skb_parm lives here */
  int mac_offset;
  int encap_level;
  __u16 csum_start;

And debug which prevents kernel crash too.

--- a/net/ipv4/ip_options.c
+++ b/net/ipv4/ip_options.c
@@ -215,6 +215,10 @@ void ip_options_fragment(struct sk_buff *skb)
  int  l = opt->optlen;
  int  optlen;

+ const struct iphdr *iph = ip_hdr(skb);
+ l = iph->ihl * 4 - sizeof(struct iphdr);
+ WARN(opt->optlen != l, "%s %d != %d\n", __func__, opt->optlen, l);
+
  while (l > 0) {
  switch (*optptr) {
  case IPOPT_END:

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-01-06 19:15 [BUG] skb corruption and kernel panic at forwarding with fragmentation Konstantin Khlebnikov
@ 2016-01-06 19:59 ` Cong Wang
  2016-01-06 20:11   ` Konstantin Khlebnikov
  0 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2016-01-06 19:59 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Linux Kernel Network Developers, David Miller, Eric Dumazet,
	Linux Kernel Mailing List

On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> Looks like this happens because ip_options_fragment() relies on
> correct ip options length in ip control block in skb. But in
> ip_finish_output_gso() control block in segments is reused by
> skb_gso_segment(). following ip_fragment() sees some garbage.
>
> In my case there was no ip options but length becomes non-zero and
> ip_options_fragment() picked some bytes from payload and decides to
> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
> in skb_shared_info at the end of data =)
>

Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment()
since all the gso information should be saved in shared_info after it finishes.

Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-01-06 19:59 ` Cong Wang
@ 2016-01-06 20:11   ` Konstantin Khlebnikov
  2016-01-06 21:05     ` Thadeu Lima de Souza Cascardo
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Khlebnikov @ 2016-01-06 20:11 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, David Miller, Eric Dumazet,
	Linux Kernel Mailing List

On Wed, Jan 6, 2016 at 10:59 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>> Looks like this happens because ip_options_fragment() relies on
>> correct ip options length in ip control block in skb. But in
>> ip_finish_output_gso() control block in segments is reused by
>> skb_gso_segment(). following ip_fragment() sees some garbage.
>>
>> In my case there was no ip options but length becomes non-zero and
>> ip_options_fragment() picked some bytes from payload and decides to
>> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
>> in skb_shared_info at the end of data =)
>>
>
> Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment()
> since all the gso information should be saved in shared_info after it finishes.
>
> Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well?

This will break present logic around ip_options_fragment() - it clears
options from
second and following fragments. With zeroed cb it will do nothing.

ip_options_fragment() can get required information directly from ip header but
it also resets fields in IPCB -- probably it should stay valid here
and somebody else will use it later.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-01-06 20:11   ` Konstantin Khlebnikov
@ 2016-01-06 21:05     ` Thadeu Lima de Souza Cascardo
  2016-01-06 22:03       ` Florian Westphal
  0 siblings, 1 reply; 17+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2016-01-06 21:05 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Cong Wang, Linux Kernel Network Developers, David Miller,
	Eric Dumazet, Linux Kernel Mailing List

On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote:
> On Wed, Jan 6, 2016 at 10:59 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> >> Looks like this happens because ip_options_fragment() relies on
> >> correct ip options length in ip control block in skb. But in
> >> ip_finish_output_gso() control block in segments is reused by
> >> skb_gso_segment(). following ip_fragment() sees some garbage.
> >>
> >> In my case there was no ip options but length becomes non-zero and
> >> ip_options_fragment() picked some bytes from payload and decides to
> >> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
> >> in skb_shared_info at the end of data =)
> >>
> >
> > Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment()
> > since all the gso information should be saved in shared_info after it finishes.
> >
> > Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well?
> 
> This will break present logic around ip_options_fragment() - it clears
> options from
> second and following fragments. With zeroed cb it will do nothing.
> 
> ip_options_fragment() can get required information directly from ip header but
> it also resets fields in IPCB -- probably it should stay valid here
> and somebody else will use it later.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

I have hit this as well, this fixes it for me on an older kernel. Can you try it
on latest kernel?

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d8a1745..f44bc91 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
 	netdev_features_t features;
 	struct sk_buff *segs;
 	int ret = 0;
+	struct inet_skb_parm ipcb;
 
 	if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
 		return ip_finish_output2(skb);
@@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
 	 * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
 	 * from host network stack.
 	 */
+	/* We need to save IPCB here because skb_gso_segment will use
+	 * SKB_GSO_CB.
+	 */
+	ipcb = *IPCB(skb);
 	features = netif_skb_features(skb);
 	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
 	if (IS_ERR_OR_NULL(segs)) {
@@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
 		int err;
 
 		segs->next = NULL;
+		*IPCB(segs) = ipcb;
 		err = ip_fragment(segs, ip_finish_output2);
 
 		if (err && ret == 0)

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-01-06 21:05     ` Thadeu Lima de Souza Cascardo
@ 2016-01-06 22:03       ` Florian Westphal
  2016-01-06 23:49         ` Florian Westphal
  0 siblings, 1 reply; 17+ messages in thread
From: Florian Westphal @ 2016-01-06 22:03 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo
  Cc: Konstantin Khlebnikov, Cong Wang,
	Linux Kernel Network Developers, David Miller, Eric Dumazet,
	Linux Kernel Mailing List

Thadeu Lima de Souza Cascardo <cascardo@redhat.com> wrote:
> On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote:
> > On Wed, Jan 6, 2016 at 10:59 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> > >> Looks like this happens because ip_options_fragment() relies on
> > >> correct ip options length in ip control block in skb. But in
> > >> ip_finish_output_gso() control block in segments is reused by
> > >> skb_gso_segment(). following ip_fragment() sees some garbage.
> > >>
> > >> In my case there was no ip options but length becomes non-zero and
> > >> ip_options_fragment() picked some bytes from payload and decides to
> > >> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
> > >> in skb_shared_info at the end of data =)
> > >>
> > >
> > > Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment()
> > > since all the gso information should be saved in shared_info after it finishes.
> > >
> > > Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well?
> > 
> > This will break present logic around ip_options_fragment() - it clears
> > options from
> > second and following fragments. With zeroed cb it will do nothing.
> > 
> > ip_options_fragment() can get required information directly from ip header but
> > it also resets fields in IPCB -- probably it should stay valid here
> > and somebody else will use it later.
[..]
> I have hit this as well, this fixes it for me on an older kernel. Can you try it
> on latest kernel?

> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index d8a1745..f44bc91 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>  	netdev_features_t features;
>  	struct sk_buff *segs;
>  	int ret = 0;
> +	struct inet_skb_parm ipcb;
>  
>  	if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>  		return ip_finish_output2(skb);
> @@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>  	 * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
>  	 * from host network stack.
>  	 */
> +	/* We need to save IPCB here because skb_gso_segment will use
> +	 * SKB_GSO_CB.
> +	 */
> +	ipcb = *IPCB(skb);
>  	features = netif_skb_features(skb);
>  	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
>  	if (IS_ERR_OR_NULL(segs)) {
> @@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>  		int err;
>  
>  		segs->next = NULL;
> +		*IPCB(segs) = ipcb;
>  		err = ip_fragment(segs, ip_finish_output2);
>  
>  		if (err && ret == 0)

I'm worried that this doesn't solve all cases. f.e. xfrm may also
call skb_gso_segment(), and it will call into ipv4/ipv6 netfilter
postrouting + ipv4 output functions...

nfqnl_enqueue_packet() is also affected.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-01-06 22:03       ` Florian Westphal
@ 2016-01-06 23:49         ` Florian Westphal
  2016-01-07 11:00           ` Konstantin Khlebnikov
  2016-01-07 18:43           ` [PATCH] net: prevent corruption of skb when using skb_gso_segment Thadeu Lima de Souza Cascardo
  0 siblings, 2 replies; 17+ messages in thread
From: Florian Westphal @ 2016-01-06 23:49 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Thadeu Lima de Souza Cascardo, Konstantin Khlebnikov, Cong Wang,
	Linux Kernel Network Developers, David Miller, Eric Dumazet,
	Linux Kernel Mailing List

Florian Westphal <fw@strlen.de> wrote:
> Thadeu Lima de Souza Cascardo <cascardo@redhat.com> wrote:
> > On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote:

[ skb_gso_segment uses skb->cb[], causes oops if ip_fragment is invoked
  on segmented skbs ]

> > I have hit this as well, this fixes it for me on an older kernel. Can you try it
> > on latest kernel?
> 
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index d8a1745..f44bc91 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
> >  	netdev_features_t features;
> >  	struct sk_buff *segs;
> >  	int ret = 0;
> > +	struct inet_skb_parm ipcb;
> >  
> >  	if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
> >  		return ip_finish_output2(skb);
> > @@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
> >  	 * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
> >  	 * from host network stack.
> >  	 */
> > +	/* We need to save IPCB here because skb_gso_segment will use
> > +	 * SKB_GSO_CB.
> > +	 */
> > +	ipcb = *IPCB(skb);
> >  	features = netif_skb_features(skb);
> >  	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
> >  	if (IS_ERR_OR_NULL(segs)) {
> > @@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
> >  		int err;
> >  
> >  		segs->next = NULL;
> > +		*IPCB(segs) = ipcb;
> >  		err = ip_fragment(segs, ip_finish_output2);
> >  
> >  		if (err && ret == 0)
> 
> I'm worried that this doesn't solve all cases. f.e. xfrm may also
> call skb_gso_segment(), and it will call into ipv4/ipv6 netfilter
> postrouting + ipv4 output functions...
> 
> nfqnl_enqueue_packet() is also affected.

... but it seems that those three are the only affected callers
of skb_gso_segment (tbf is ok since skb isn't owned by anyone,
ovs does save/restore already).

I think this patch is the right way, we just need similar
save/restore in nfqnl_enqueue_packet and xfrm_output_gso().

The latter two can be used by either ipv4 or ipv6 so it might
be preferable to just save/restore sizeof(struct skb_gso_cb);
or a union of inet_skb_parm+inet6_skb_parm.

Cascardo, can you cook a patch?

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-01-06 23:49         ` Florian Westphal
@ 2016-01-07 11:00           ` Konstantin Khlebnikov
  2016-01-07 11:38             ` Konstantin Khlebnikov
  2016-01-07 12:03             ` Florian Westphal
  2016-01-07 18:43           ` [PATCH] net: prevent corruption of skb when using skb_gso_segment Thadeu Lima de Souza Cascardo
  1 sibling, 2 replies; 17+ messages in thread
From: Konstantin Khlebnikov @ 2016-01-07 11:00 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller, Eric Dumazet,
	Linux Kernel Mailing List

On Thu, Jan 7, 2016 at 2:49 AM, Florian Westphal <fw@strlen.de> wrote:
> Florian Westphal <fw@strlen.de> wrote:
>> Thadeu Lima de Souza Cascardo <cascardo@redhat.com> wrote:
>> > On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote:
>
> [ skb_gso_segment uses skb->cb[], causes oops if ip_fragment is invoked
>   on segmented skbs ]
>
>> > I have hit this as well, this fixes it for me on an older kernel. Can you try it
>> > on latest kernel?
>>
>> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>> > index d8a1745..f44bc91 100644
>> > --- a/net/ipv4/ip_output.c
>> > +++ b/net/ipv4/ip_output.c
>> > @@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>> >     netdev_features_t features;
>> >     struct sk_buff *segs;
>> >     int ret = 0;
>> > +   struct inet_skb_parm ipcb;
>> >
>> >     if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >             return ip_finish_output2(skb);
>> > @@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>> >      * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
>> >      * from host network stack.
>> >      */
>> > +   /* We need to save IPCB here because skb_gso_segment will use
>> > +    * SKB_GSO_CB.
>> > +    */
>> > +   ipcb = *IPCB(skb);
>> >     features = netif_skb_features(skb);
>> >     segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
>> >     if (IS_ERR_OR_NULL(segs)) {
>> > @@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>> >             int err;
>> >
>> >             segs->next = NULL;
>> > +           *IPCB(segs) = ipcb;
>> >             err = ip_fragment(segs, ip_finish_output2);
>> >
>> >             if (err && ret == 0)
>>
>> I'm worried that this doesn't solve all cases. f.e. xfrm may also
>> call skb_gso_segment(), and it will call into ipv4/ipv6 netfilter
>> postrouting + ipv4 output functions...
>>
>> nfqnl_enqueue_packet() is also affected.
>
> ... but it seems that those three are the only affected callers
> of skb_gso_segment (tbf is ok since skb isn't owned by anyone,
> ovs does save/restore already).
>
> I think this patch is the right way, we just need similar
> save/restore in nfqnl_enqueue_packet and xfrm_output_gso().

Which CB could be here? at this point skb isn't owned by netlink yet.

>
> The latter two can be used by either ipv4 or ipv6 so it might
> be preferable to just save/restore sizeof(struct skb_gso_cb);
> or a union of inet_skb_parm+inet6_skb_parm.

Or just shift GSO CB and add couple checks like
BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->room) < sizeof(*IPCB(skb)));

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-01-07 11:00           ` Konstantin Khlebnikov
@ 2016-01-07 11:38             ` Konstantin Khlebnikov
  2016-01-07 11:59               ` Eric Dumazet
  2016-01-07 12:03             ` Florian Westphal
  1 sibling, 1 reply; 17+ messages in thread
From: Konstantin Khlebnikov @ 2016-01-07 11:38 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller, Eric Dumazet,
	Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 3043 bytes --]

On Thu, Jan 7, 2016 at 2:00 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Thu, Jan 7, 2016 at 2:49 AM, Florian Westphal <fw@strlen.de> wrote:
>> Florian Westphal <fw@strlen.de> wrote:
>>> Thadeu Lima de Souza Cascardo <cascardo@redhat.com> wrote:
>>> > On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote:
>>
>> [ skb_gso_segment uses skb->cb[], causes oops if ip_fragment is invoked
>>   on segmented skbs ]
>>
>>> > I have hit this as well, this fixes it for me on an older kernel. Can you try it
>>> > on latest kernel?
>>>
>>> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>>> > index d8a1745..f44bc91 100644
>>> > --- a/net/ipv4/ip_output.c
>>> > +++ b/net/ipv4/ip_output.c
>>> > @@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>>> >     netdev_features_t features;
>>> >     struct sk_buff *segs;
>>> >     int ret = 0;
>>> > +   struct inet_skb_parm ipcb;
>>> >
>>> >     if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>>> >             return ip_finish_output2(skb);
>>> > @@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>>> >      * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
>>> >      * from host network stack.
>>> >      */
>>> > +   /* We need to save IPCB here because skb_gso_segment will use
>>> > +    * SKB_GSO_CB.
>>> > +    */
>>> > +   ipcb = *IPCB(skb);
>>> >     features = netif_skb_features(skb);
>>> >     segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
>>> >     if (IS_ERR_OR_NULL(segs)) {
>>> > @@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>>> >             int err;
>>> >
>>> >             segs->next = NULL;
>>> > +           *IPCB(segs) = ipcb;
>>> >             err = ip_fragment(segs, ip_finish_output2);
>>> >
>>> >             if (err && ret == 0)
>>>
>>> I'm worried that this doesn't solve all cases. f.e. xfrm may also
>>> call skb_gso_segment(), and it will call into ipv4/ipv6 netfilter
>>> postrouting + ipv4 output functions...
>>>
>>> nfqnl_enqueue_packet() is also affected.
>>
>> ... but it seems that those three are the only affected callers
>> of skb_gso_segment (tbf is ok since skb isn't owned by anyone,
>> ovs does save/restore already).
>>
>> I think this patch is the right way, we just need similar
>> save/restore in nfqnl_enqueue_packet and xfrm_output_gso().
>
> Which CB could be here? at this point skb isn't owned by netlink yet.
>
>>
>> The latter two can be used by either ipv4 or ipv6 so it might
>> be preferable to just save/restore sizeof(struct skb_gso_cb);
>> or a union of inet_skb_parm+inet6_skb_parm.
>
> Or just shift GSO CB and add couple checks like
> BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->room) < sizeof(*IPCB(skb)));

Somethin like this (in attachment)

Also I've found strange thing: reason of expanding skb->cb from 40 to
48 bypes in 2006
3e3850e989c5d2eb1aab6f0fd9257759f0f4cbc6 was that struct inet6_skb_parm does
not fit. But it's is only 24 bytes. Does some arches add pad after
each _u16 field?

[-- Attachment #2: net-preserve-lower-bytes-of-skb-cb-during-gso-segmentation --]
[-- Type: application/octet-stream, Size: 3490 bytes --]

net: preserve lower bytes of skb CB during GSO segmentation

From: Konstantin Khlebnikov <koct9i@gmail.com>

Skb_gso_segment() uses skb control block during segmentation.
This patch adds 32-bytes room for previous control block which
will be copied into all resulting segments.

This patch fixes kernel crash during fragmenting forwarded packets
because fragmentation relies on valid IPCB state but GSO segmentation.
Also it removes custom save/restore in ovs code, now this is redundant.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
---
 include/linux/skbuff.h     |    1 +
 net/core/dev.c             |    2 ++
 net/ipv4/ip_output.c       |    1 +
 net/openvswitch/datapath.c |    4 +---
 net/xfrm/xfrm_output.c     |    2 ++
 5 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4307e20a4a4a..aebf1501db25 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3316,6 +3316,7 @@ static inline struct sec_path *skb_sec_path(struct sk_buff *skb)
  * Keeps track of level of encapsulation of network headers.
  */
 struct skb_gso_cb {
+	char	prev_cb[32];
 	int	mac_offset;
 	int	encap_level;
 	__u16	csum_start;
diff --git a/net/core/dev.c b/net/core/dev.c
index a42b232805a5..e64eefa31a2d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2479,6 +2479,8 @@ static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path)
  *
  *	It may return NULL if the skb requires no segmentation.  This is
  *	only possible when GSO is used for verifying header integrity.
+ *
+ *	Segmentation preserves lower bytes of original control block.
  */
 struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
 				  netdev_features_t features, bool tx_path)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c65b93a7b711..bd94ba4d9937 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -235,6 +235,7 @@ static int ip_finish_output_gso(struct sock *sk, struct sk_buff *skb)
 	 * from host network stack.
 	 */
 	features = netif_skb_features(skb);
+	BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->prev_cb) < sizeof(*IPCB(skb)));
 	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
 	if (IS_ERR_OR_NULL(segs)) {
 		kfree_skb(skb);
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 27e14962b504..5d049efdb015 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -337,12 +337,10 @@ static int queue_gso_packets(struct datapath *dp, struct sk_buff *skb,
 	unsigned short gso_type = skb_shinfo(skb)->gso_type;
 	struct sw_flow_key later_key;
 	struct sk_buff *segs, *nskb;
-	struct ovs_skb_cb ovs_cb;
 	int err;
 
-	ovs_cb = *OVS_CB(skb);
+	BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->prev_cb) < sizeof(*OVS_CB(skb)));
 	segs = __skb_gso_segment(skb, NETIF_F_SG, false);
-	*OVS_CB(skb) = ovs_cb;
 	if (IS_ERR(segs))
 		return PTR_ERR(segs);
 	if (segs == NULL)
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index fbcedbe33190..00401dc1c0e6 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -153,6 +153,8 @@ static int xfrm_output_gso(struct sock *sk, struct sk_buff *skb)
 {
 	struct sk_buff *segs;
 
+	BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->prev_cb) < sizeof(*IPCB(skb)));
+	BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->prev_cb) < sizeof(*IP6CB(skb)));
 	segs = skb_gso_segment(skb, 0);
 	kfree_skb(skb);
 	if (IS_ERR(segs))

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-01-07 11:38             ` Konstantin Khlebnikov
@ 2016-01-07 11:59               ` Eric Dumazet
  2016-01-07 12:04                 ` Konstantin Khlebnikov
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2016-01-07 11:59 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Florian Westphal, Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller,
	Linux Kernel Mailing List

On Thu, Jan 7, 2016 at 6:38 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>
> Also I've found strange thing: reason of expanding skb->cb from 40 to
> 48 bypes in 2006
> 3e3850e989c5d2eb1aab6f0fd9257759f0f4cbc6 was that struct inet6_skb_parm does
> not fit. But it's is only 24 bytes. Does some arches add pad after
> each _u16 field?

"struct inet6_skb_parm" is part of struct tcp_skb_cb

This is why Patrick had to increase skb->cb[]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-01-07 11:00           ` Konstantin Khlebnikov
  2016-01-07 11:38             ` Konstantin Khlebnikov
@ 2016-01-07 12:03             ` Florian Westphal
  1 sibling, 0 replies; 17+ messages in thread
From: Florian Westphal @ 2016-01-07 12:03 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Florian Westphal, Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller, Eric Dumazet,
	Linux Kernel Mailing List

Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Thu, Jan 7, 2016 at 2:49 AM, Florian Westphal <fw@strlen.de> wrote:
> > ... but it seems that those three are the only affected callers
> > of skb_gso_segment (tbf is ok since skb isn't owned by anyone,
> > ovs does save/restore already).
> >
> > I think this patch is the right way, we just need similar
> > save/restore in nfqnl_enqueue_packet and xfrm_output_gso().
> 
> Which CB could be here? at this point skb isn't owned by netlink yet.

inet(6)_skb_parm, nfqnl_enqueue_packet is called via netfilter hooks, skb
is owned by ipv4 or ipv6 stack.

> > The latter two can be used by either ipv4 or ipv6 so it might
> > be preferable to just save/restore sizeof(struct skb_gso_cb);
> > or a union of inet_skb_parm+inet6_skb_parm.
> 
> Or just shift GSO CB and add couple checks like
> BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->room) < sizeof(*IPCB(skb)));

Right, that works too.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-01-07 11:59               ` Eric Dumazet
@ 2016-01-07 12:04                 ` Konstantin Khlebnikov
  2016-01-07 12:54                   ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Khlebnikov @ 2016-01-07 12:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Florian Westphal, Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller,
	Linux Kernel Mailing List

On Thu, Jan 7, 2016 at 2:59 PM, Eric Dumazet <edumazet@google.com> wrote:
> On Thu, Jan 7, 2016 at 6:38 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>
>> Also I've found strange thing: reason of expanding skb->cb from 40 to
>> 48 bypes in 2006
>> 3e3850e989c5d2eb1aab6f0fd9257759f0f4cbc6 was that struct inet6_skb_parm does
>> not fit. But it's is only 24 bytes. Does some arches add pad after
>> each _u16 field?
>
> "struct inet6_skb_parm" is part of struct tcp_skb_cb
>
> This is why Patrick had to increase skb->cb[]

Whoa. Funny. TCP moves that chunk back and forward instead of just
putting it at the first place in struct.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-01-07 12:04                 ` Konstantin Khlebnikov
@ 2016-01-07 12:54                   ` Eric Dumazet
  2016-01-07 19:35                     ` Konstantin Khlebnikov
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2016-01-07 12:54 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Florian Westphal, Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller,
	Linux Kernel Mailing List

On Thu, Jan 7, 2016 at 7:04 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Thu, Jan 7, 2016 at 2:59 PM, Eric Dumazet <edumazet@google.com> wrote:
>> On Thu, Jan 7, 2016 at 6:38 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>>
>>> Also I've found strange thing: reason of expanding skb->cb from 40 to
>>> 48 bypes in 2006
>>> 3e3850e989c5d2eb1aab6f0fd9257759f0f4cbc6 was that struct inet6_skb_parm does
>>> not fit. But it's is only 24 bytes. Does some arches add pad after
>>> each _u16 field?
>>
>> "struct inet6_skb_parm" is part of struct tcp_skb_cb
>>
>> This is why Patrick had to increase skb->cb[]
>
> Whoa. Funny. TCP moves that chunk back and forward instead of just
> putting it at the first place in struct.

You probably want to look at git history to find out why it is done this way.

TCP performance is critical for some of us, and doing such trick avoid
one cache miss per skb in some critical list traversals.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] net: prevent corruption of skb when using skb_gso_segment
  2016-01-06 23:49         ` Florian Westphal
  2016-01-07 11:00           ` Konstantin Khlebnikov
@ 2016-01-07 18:43           ` Thadeu Lima de Souza Cascardo
  2016-01-07 19:31             ` Florian Westphal
  1 sibling, 1 reply; 17+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2016-01-07 18:43 UTC (permalink / raw)
  To: netdev
  Cc: fw, cascardo, koct9i, xiyou.wangcong, davem, edumazet, linux-kernel

skb_gso_segment uses skb->cb, which may be owned by the caller. This may
cause IPCB(skb)->opt.optlen to be overwritten, which will make
ip_fragment overwrite skb data and possibly skb_shinfo with IPOPT_NOOP,
thus causing a crash.

This patch saves skb->cb before calling skb_gso_segment for those users
that have anything to save, then restore it for each GSO segment.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
---
 net/ipv4/ip_output.c            | 3 +++
 net/netfilter/nfnetlink_queue.c | 7 +++++++
 net/xfrm/xfrm_output.c          | 6 ++++++
 3 files changed, 16 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4233cbe..37b41f6 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -226,6 +226,7 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
 	netdev_features_t features;
 	struct sk_buff *segs;
 	int ret = 0;
+	struct inet_skb_parm ipcb;
 
 	/* common case: locally created skb or seglen is <= mtu */
 	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
@@ -239,6 +240,7 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
 	 * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
 	 * from host network stack.
 	 */
+	ipcb = *IPCB(skb);
 	features = netif_skb_features(skb);
 	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
 	if (IS_ERR_OR_NULL(segs)) {
@@ -253,6 +255,7 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
 		int err;
 
 		segs->next = NULL;
+		*IPCB(segs) = ipcb;
 		err = ip_fragment(net, sk, segs, mtu, ip_finish_output2);
 
 		if (err && ret == 0)
diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index 861c661..ad3ebf0 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -34,6 +34,7 @@
 #include <net/tcp_states.h>
 #include <net/netfilter/nf_queue.h>
 #include <net/netns/generic.h>
+#include <net/ip.h>
 
 #include <linux/atomic.h>
 
@@ -678,6 +679,10 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
 	int err = -ENOBUFS;
 	struct net *net = entry->state.net;
 	struct nfnl_queue_net *q = nfnl_queue_pernet(net);
+	union {
+		struct inet_skb_parm h4;
+		struct inet6_skb_parm h6;
+	} header;
 
 	/* rcu_read_lock()ed by nf_hook_slow() */
 	queue = instance_lookup(q, queuenum);
@@ -702,6 +707,7 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
 		return __nfqnl_enqueue_packet(net, queue, entry);
 
 	nf_bridge_adjust_skb_data(skb);
+	memcpy(&header, skb->cb, sizeof(header));
 	segs = skb_gso_segment(skb, 0);
 	/* Does not use PTR_ERR to limit the number of error codes that can be
 	 * returned by nf_queue.  For instance, callers rely on -ESRCH to
@@ -713,6 +719,7 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
 	err = 0;
 	do {
 		struct sk_buff *nskb = segs->next;
+		memcpy(skb->cb, &header, sizeof(header));
 		if (err == 0)
 			err = __nfqnl_enqueue_packet_gso(net, queue,
 							segs, entry);
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index cc3676e..5ccd632 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -166,7 +166,12 @@ static int xfrm_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
 static int xfrm_output_gso(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct sk_buff *segs;
+	union {
+		struct inet_skb_parm h4;
+		struct inet6_skb_parm h6;
+	} header;
 
+	memcpy(&header, skb->cb, sizeof(header));
 	segs = skb_gso_segment(skb, 0);
 	kfree_skb(skb);
 	if (IS_ERR(segs))
@@ -179,6 +184,7 @@ static int xfrm_output_gso(struct net *net, struct sock *sk, struct sk_buff *skb
 		int err;
 
 		segs->next = NULL;
+		memcpy(skb->cb, &header, sizeof(header));
 		err = xfrm_output2(net, sk, segs);
 
 		if (unlikely(err)) {
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] net: prevent corruption of skb when using skb_gso_segment
  2016-01-07 18:43           ` [PATCH] net: prevent corruption of skb when using skb_gso_segment Thadeu Lima de Souza Cascardo
@ 2016-01-07 19:31             ` Florian Westphal
  2016-01-07 21:16               ` [PATCH v2] " Thadeu Lima de Souza Cascardo
  0 siblings, 1 reply; 17+ messages in thread
From: Florian Westphal @ 2016-01-07 19:31 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo
  Cc: netdev, fw, koct9i, xiyou.wangcong, davem, edumazet, linux-kernel

Thadeu Lima de Souza Cascardo <cascardo@redhat.com> wrote:
> skb_gso_segment uses skb->cb, which may be owned by the caller. This may
> cause IPCB(skb)->opt.optlen to be overwritten, which will make
> ip_fragment overwrite skb data and possibly skb_shinfo with IPOPT_NOOP,
> thus causing a crash.
> 
> This patch saves skb->cb before calling skb_gso_segment for those users
> that have anything to save, then restore it for each GSO segment.

Thanks.

> --- a/net/netfilter/nfnetlink_queue.c
> +++ b/net/netfilter/nfnetlink_queue.c
> @@ -34,6 +34,7 @@
>  #include <net/tcp_states.h>
>  #include <net/netfilter/nf_queue.h>
>  #include <net/netns/generic.h>
> +#include <net/ip.h>
>  
>  #include <linux/atomic.h>
>  
> @@ -678,6 +679,10 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
>  	int err = -ENOBUFS;
>  	struct net *net = entry->state.net;
>  	struct nfnl_queue_net *q = nfnl_queue_pernet(net);
> +	union {
> +		struct inet_skb_parm h4;
> +		struct inet6_skb_parm h6;
> +	} header;
>  
>  	/* rcu_read_lock()ed by nf_hook_slow() */
>  	queue = instance_lookup(q, queuenum);
> @@ -702,6 +707,7 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
>  		return __nfqnl_enqueue_packet(net, queue, entry);
>  
>  	nf_bridge_adjust_skb_data(skb);
> +	memcpy(&header, skb->cb, sizeof(header));
>  	segs = skb_gso_segment(skb, 0);
>  	/* Does not use PTR_ERR to limit the number of error codes that can be
>  	 * returned by nf_queue.  For instance, callers rely on -ESRCH to
> @@ -713,6 +719,7 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
>  	err = 0;
>  	do {
>  		struct sk_buff *nskb = segs->next;
> +		memcpy(skb->cb, &header, sizeof(header));

I think this should be 'segs->cb'.

Other than that, this looks good to me.

> diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
> --- a/net/xfrm/xfrm_output.c
> +++ b/net/xfrm/xfrm_output.c
> @@ -166,7 +166,12 @@ static int xfrm_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
> @@ -179,6 +184,7 @@ static int xfrm_output_gso(struct net *net, struct sock *sk, struct sk_buff *skb
>  		int err;
>  
>  		segs->next = NULL;
> +		memcpy(skb->cb, &header, sizeof(header));
>  		err = xfrm_output2(net, sk, segs);

likewise.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-01-07 12:54                   ` Eric Dumazet
@ 2016-01-07 19:35                     ` Konstantin Khlebnikov
  2016-01-07 19:47                       ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Khlebnikov @ 2016-01-07 19:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Florian Westphal, Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller,
	Linux Kernel Mailing List

On Thu, Jan 7, 2016 at 3:54 PM, Eric Dumazet <edumazet@google.com> wrote:
> On Thu, Jan 7, 2016 at 7:04 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>> On Thu, Jan 7, 2016 at 2:59 PM, Eric Dumazet <edumazet@google.com> wrote:
>>> On Thu, Jan 7, 2016 at 6:38 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>>>
>>>> Also I've found strange thing: reason of expanding skb->cb from 40 to
>>>> 48 bypes in 2006
>>>> 3e3850e989c5d2eb1aab6f0fd9257759f0f4cbc6 was that struct inet6_skb_parm does
>>>> not fit. But it's is only 24 bytes. Does some arches add pad after
>>>> each _u16 field?
>>>
>>> "struct inet6_skb_parm" is part of struct tcp_skb_cb
>>>
>>> This is why Patrick had to increase skb->cb[]
>>
>> Whoa. Funny. TCP moves that chunk back and forward instead of just
>> putting it at the first place in struct.
>
> You probably want to look at git history to find out why it is done this way.
>
> TCP performance is critical for some of us, and doing such trick avoid
> one cache miss per skb in some critical list traversals.

Right. This way tcp stuff perfectly fits into leftovers of first cache line.
Then probably it's better to put ipv4/ipv6 cb into second line from
the beginning.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
  2016-01-07 19:35                     ` Konstantin Khlebnikov
@ 2016-01-07 19:47                       ` Eric Dumazet
  0 siblings, 0 replies; 17+ messages in thread
From: Eric Dumazet @ 2016-01-07 19:47 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Florian Westphal, Thadeu Lima de Souza Cascardo, Cong Wang,
	Linux Kernel Network Developers, David Miller,
	Linux Kernel Mailing List

On Thu, Jan 7, 2016 at 2:35 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Thu, Jan 7, 2016 at 3:54 PM, Eric Dumazet <edumazet@google.com> wrote:
 want to look at git history to find out why it is done this way.
>>
>> TCP performance is critical for some of us, and doing such trick avoid
>> one cache miss per skb in some critical list traversals.
>
> Right. This way tcp stuff perfectly fits into leftovers of first cache line.
> Then probably it's better to put ipv4/ipv6 cb into second line from
> the beginning.

Then IP forwarding might be slower.

Look, each layer (TCP  , IP, ....) can organize its skb->cb[] as it wants.
Nobody tries to 'make universal room' for IPCB, since only IP layer wants it.

TCP could even find a way in the future to no longer hold a copy of
IPCB in the input skb,
if code is reorganized a bit.

Note that skbs for output path in TCP do not need IPCB at all.

Only when skb leaves TCP and enter IP, skb->cb[] content is scratched.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2] net: prevent corruption of skb when using skb_gso_segment
  2016-01-07 19:31             ` Florian Westphal
@ 2016-01-07 21:16               ` Thadeu Lima de Souza Cascardo
  0 siblings, 0 replies; 17+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2016-01-07 21:16 UTC (permalink / raw)
  To: netdev
  Cc: fw, cascardo, koct9i, xiyou.wangcong, davem, edumazet, linux-kernel

skb_gso_segment uses skb->cb, which may be owned by the caller. This may
cause IPCB(skb)->opt.optlen to be overwritten, which will make
ip_fragment overwrite skb data and possibly skb_shinfo with IPOPT_NOOP,
thus causing a crash.

This patch saves skb->cb before calling skb_gso_segment for those users
that have anything to save, then restore it for each GSO segment.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
---
 net/ipv4/ip_output.c            | 3 +++
 net/netfilter/nfnetlink_queue.c | 7 +++++++
 net/xfrm/xfrm_output.c          | 6 ++++++
 3 files changed, 16 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4233cbe..37b41f6 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -226,6 +226,7 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
 	netdev_features_t features;
 	struct sk_buff *segs;
 	int ret = 0;
+	struct inet_skb_parm ipcb;
 
 	/* common case: locally created skb or seglen is <= mtu */
 	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
@@ -239,6 +240,7 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
 	 * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
 	 * from host network stack.
 	 */
+	ipcb = *IPCB(skb);
 	features = netif_skb_features(skb);
 	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
 	if (IS_ERR_OR_NULL(segs)) {
@@ -253,6 +255,7 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
 		int err;
 
 		segs->next = NULL;
+		*IPCB(segs) = ipcb;
 		err = ip_fragment(net, sk, segs, mtu, ip_finish_output2);
 
 		if (err && ret == 0)
diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index 861c661..426f61d 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -34,6 +34,7 @@
 #include <net/tcp_states.h>
 #include <net/netfilter/nf_queue.h>
 #include <net/netns/generic.h>
+#include <net/ip.h>
 
 #include <linux/atomic.h>
 
@@ -678,6 +679,10 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
 	int err = -ENOBUFS;
 	struct net *net = entry->state.net;
 	struct nfnl_queue_net *q = nfnl_queue_pernet(net);
+	union {
+		struct inet_skb_parm h4;
+		struct inet6_skb_parm h6;
+	} header;
 
 	/* rcu_read_lock()ed by nf_hook_slow() */
 	queue = instance_lookup(q, queuenum);
@@ -702,6 +707,7 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
 		return __nfqnl_enqueue_packet(net, queue, entry);
 
 	nf_bridge_adjust_skb_data(skb);
+	memcpy(&header, skb->cb, sizeof(header));
 	segs = skb_gso_segment(skb, 0);
 	/* Does not use PTR_ERR to limit the number of error codes that can be
 	 * returned by nf_queue.  For instance, callers rely on -ESRCH to
@@ -713,6 +719,7 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
 	err = 0;
 	do {
 		struct sk_buff *nskb = segs->next;
+		memcpy(segs->cb, &header, sizeof(header));
 		if (err == 0)
 			err = __nfqnl_enqueue_packet_gso(net, queue,
 							segs, entry);
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index cc3676e..27384b2 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -166,7 +166,12 @@ static int xfrm_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
 static int xfrm_output_gso(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct sk_buff *segs;
+	union {
+		struct inet_skb_parm h4;
+		struct inet6_skb_parm h6;
+	} header;
 
+	memcpy(&header, skb->cb, sizeof(header));
 	segs = skb_gso_segment(skb, 0);
 	kfree_skb(skb);
 	if (IS_ERR(segs))
@@ -179,6 +184,7 @@ static int xfrm_output_gso(struct net *net, struct sock *sk, struct sk_buff *skb
 		int err;
 
 		segs->next = NULL;
+		memcpy(segs->cb, &header, sizeof(header));
 		err = xfrm_output2(net, sk, segs);
 
 		if (unlikely(err)) {
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-01-07 21:16 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-06 19:15 [BUG] skb corruption and kernel panic at forwarding with fragmentation Konstantin Khlebnikov
2016-01-06 19:59 ` Cong Wang
2016-01-06 20:11   ` Konstantin Khlebnikov
2016-01-06 21:05     ` Thadeu Lima de Souza Cascardo
2016-01-06 22:03       ` Florian Westphal
2016-01-06 23:49         ` Florian Westphal
2016-01-07 11:00           ` Konstantin Khlebnikov
2016-01-07 11:38             ` Konstantin Khlebnikov
2016-01-07 11:59               ` Eric Dumazet
2016-01-07 12:04                 ` Konstantin Khlebnikov
2016-01-07 12:54                   ` Eric Dumazet
2016-01-07 19:35                     ` Konstantin Khlebnikov
2016-01-07 19:47                       ` Eric Dumazet
2016-01-07 12:03             ` Florian Westphal
2016-01-07 18:43           ` [PATCH] net: prevent corruption of skb when using skb_gso_segment Thadeu Lima de Souza Cascardo
2016-01-07 19:31             ` Florian Westphal
2016-01-07 21:16               ` [PATCH v2] " Thadeu Lima de Souza Cascardo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.