All of lore.kernel.org
 help / color / mirror / Atom feed
* Potential race in ip4_datagram_release_cb
@ 2014-06-06 11:29 Alexey Preobrazhensky
  2014-06-06 12:56 ` Eric Dumazet
  2014-06-10 13:43 ` [PATCH] ipv4: fix a race in ip4_datagram_release_cb() Eric Dumazet
  0 siblings, 2 replies; 41+ messages in thread
From: Alexey Preobrazhensky @ 2014-06-06 11:29 UTC (permalink / raw)
  To: netdev
  Cc: Kostya Serebryany, Dmitry Vyukov, Lars Bull, Eric Dumazet,
	Bruce Curtis, Maciej Żenczykowski

Hello,

I’m working on AddressSanitizer[1] -- a tool that detects
use-after-free and out-of-bounds bugs in kernel.

We’ve encountered a heap-use-after-free in ip4_datagram_release_cb()
in linux kernel 3.15-rc5 (revision
60b5f90d0fac7585f1a43ccdad06787b97eda0ab).

It seems to be a race between dst_release() and
ip4_datagram_release_cb() on an object from ip_dst_cache slab, all
during the ip4_datagram_connect() call.

This heap-use-after-free was triggered under trinity syscall fuzzer,
so there is no reproducer.

It would be great if someone familiar with the code took time to look
into this report.

Thanks,
 Alexey

[1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel


AddressSanitizer: heap-use-after-free in ipv4_dst_check
Read of size 2 by thread T15453:
 [<ffffffff817daa3a>] ipv4_dst_check+0x1a/0x90 ./net/ipv4/route.c:1116
 [<ffffffff8175b789>] __sk_dst_check+0x89/0xe0 ./net/core/sock.c:531
 [<ffffffff81830a36>] ip4_datagram_release_cb+0x46/0x390 ??:0
 [<ffffffff8175eaea>] release_sock+0x17a/0x230 ./net/core/sock.c:2413
 [<ffffffff81830882>] ip4_datagram_connect+0x462/0x5d0 ??:0
 [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
 [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
 [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
 [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
./arch/x86/kernel/entry_64.S:629

Freed by thread T15455:
 [<ffffffff8178d9b8>] dst_destroy+0xa8/0x160 ./net/core/dst.c:251
 [<ffffffff8178de25>] dst_release+0x45/0x80 ./net/core/dst.c:280
 [<ffffffff818304c1>] ip4_datagram_connect+0xa1/0x5d0 ??:0
 [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
 [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
 [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
 [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
./arch/x86/kernel/entry_64.S:629

Allocated by thread T15453:
 [<ffffffff8178d291>] dst_alloc+0x81/0x2b0 ./net/core/dst.c:171
 [<ffffffff817db3b7>] rt_dst_alloc+0x47/0x50 ./net/ipv4/route.c:1406
 [<     inlined    >] __ip_route_output_key+0x3e8/0xf70
__mkroute_output ./net/ipv4/route.c:1939
 [<ffffffff817dde08>] __ip_route_output_key+0x3e8/0xf70 ./net/ipv4/route.c:2161
 [<ffffffff817deb34>] ip_route_output_flow+0x14/0x30 ./net/ipv4/route.c:2249
 [<ffffffff81830737>] ip4_datagram_connect+0x317/0x5d0 ??:0
 [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
 [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
 [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
 [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
./arch/x86/kernel/entry_64.S:629

The buggy address ffff880024ff2266 is located 102 bytes inside
 of 192-byte region [ffff880024ff2200, ffff880024ff22c0)

Memory state around the buggy address:
 ffff880024ff1d00: ffffffff fffrrrrr rrrrrrrr rrrrrrrr
 ffff880024ff1e00: ffffffff ffffffff ffffffff fffrrrrr
 ffff880024ff1f00: rrrrrrrr rrrrrrrr rrrrrrrr rrrrrrrr
 ffff880024ff2000: rrrrrrrr rrrrrrrr rrrrrrrr rrrrrrrr
 ffff880024ff2100: rrrrrrrr rrrrrrrr rrrrrrrr rrrrrrrr
>ffff880024ff2200: ffffffff ffffffff ffffffff rrrrrrrr
                                ^
 ffff880024ff2300: rrrrrrrr rrrrrrrr ........ ........
 ffff880024ff2400: ........ rrrrrrrr rrrrrrrr rrrrrrrr
 ffff880024ff2500: ffffffff ffffffff ffffffff rrrrrrrr
 ffff880024ff2600: rrrrrrrr rrrrrrrr ffffffff ffffffff
 ffff880024ff2700: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
Legend:
 f - 8 freed bytes
 r - 8 redzone bytes
 . - 8 allocated bytes
 x=1..7 - x allocated bytes + (8-x) redzone bytes

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Potential race in ip4_datagram_release_cb
  2014-06-06 11:29 Potential race in ip4_datagram_release_cb Alexey Preobrazhensky
@ 2014-06-06 12:56 ` Eric Dumazet
  2014-06-06 15:59   ` Alexei Starovoitov
  2014-06-10 13:43 ` [PATCH] ipv4: fix a race in ip4_datagram_release_cb() Eric Dumazet
  1 sibling, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2014-06-06 12:56 UTC (permalink / raw)
  To: Alexey Preobrazhensky
  Cc: netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski

On Fri, 2014-06-06 at 15:29 +0400, Alexey Preobrazhensky wrote:
> Hello,
> 
> I’m working on AddressSanitizer[1] -- a tool that detects
> use-after-free and out-of-bounds bugs in kernel.
> 
> We’ve encountered a heap-use-after-free in ip4_datagram_release_cb()
> in linux kernel 3.15-rc5 (revision
> 60b5f90d0fac7585f1a43ccdad06787b97eda0ab).
> 
> It seems to be a race between dst_release() and
> ip4_datagram_release_cb() on an object from ip_dst_cache slab, all
> during the ip4_datagram_connect() call.
> 
> This heap-use-after-free was triggered under trinity syscall fuzzer,
> so there is no reproducer.
> 
> It would be great if someone familiar with the code took time to look
> into this report.
> 
> Thanks,
>  Alexey
> 
> [1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
> 
> 
> AddressSanitizer: heap-use-after-free in ipv4_dst_check
> Read of size 2 by thread T15453:
>  [<ffffffff817daa3a>] ipv4_dst_check+0x1a/0x90 ./net/ipv4/route.c:1116
>  [<ffffffff8175b789>] __sk_dst_check+0x89/0xe0 ./net/core/sock.c:531
>  [<ffffffff81830a36>] ip4_datagram_release_cb+0x46/0x390 ??:0
>  [<ffffffff8175eaea>] release_sock+0x17a/0x230 ./net/core/sock.c:2413
>  [<ffffffff81830882>] ip4_datagram_connect+0x462/0x5d0 ??:0
>  [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
>  [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
>  [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
>  [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
> ./arch/x86/kernel/entry_64.S:629
> 
> Freed by thread T15455:
>  [<ffffffff8178d9b8>] dst_destroy+0xa8/0x160 ./net/core/dst.c:251
>  [<ffffffff8178de25>] dst_release+0x45/0x80 ./net/core/dst.c:280
>  [<ffffffff818304c1>] ip4_datagram_connect+0xa1/0x5d0 ??:0
>  [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
>  [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
>  [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
>  [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
> ./arch/x86/kernel/entry_64.S:629
> 
> Allocated by thread T15453:
>  [<ffffffff8178d291>] dst_alloc+0x81/0x2b0 ./net/core/dst.c:171
>  [<ffffffff817db3b7>] rt_dst_alloc+0x47/0x50 ./net/ipv4/route.c:1406
>  [<     inlined    >] __ip_route_output_key+0x3e8/0xf70
> __mkroute_output ./net/ipv4/route.c:1939
>  [<ffffffff817dde08>] __ip_route_output_key+0x3e8/0xf70 ./net/ipv4/route.c:2161
>  [<ffffffff817deb34>] ip_route_output_flow+0x14/0x30 ./net/ipv4/route.c:2249
>  [<ffffffff81830737>] ip4_datagram_connect+0x317/0x5d0 ??:0
>  [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
>  [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
>  [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
>  [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
> ./arch/x86/kernel/entry_64.S:629
> 
> The buggy address ffff880024ff2266 is located 102 bytes inside
>  of 192-byte region [ffff880024ff2200, ffff880024ff22c0)
> 
> Memory state around the buggy address:
>  ffff880024ff1d00: ffffffff fffrrrrr rrrrrrrr rrrrrrrr
>  ffff880024ff1e00: ffffffff ffffffff ffffffff fffrrrrr
>  ffff880024ff1f00: rrrrrrrr rrrrrrrr rrrrrrrr rrrrrrrr
>  ffff880024ff2000: rrrrrrrr rrrrrrrr rrrrrrrr rrrrrrrr
>  ffff880024ff2100: rrrrrrrr rrrrrrrr rrrrrrrr rrrrrrrr
> >ffff880024ff2200: ffffffff ffffffff ffffffff rrrrrrrr
>                                 ^
>  ffff880024ff2300: rrrrrrrr rrrrrrrr ........ ........
>  ffff880024ff2400: ........ rrrrrrrr rrrrrrrr rrrrrrrr
>  ffff880024ff2500: ffffffff ffffffff ffffffff rrrrrrrr
>  ffff880024ff2600: rrrrrrrr rrrrrrrr ffffffff ffffffff
>  ffff880024ff2700: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
> Legend:
>  f - 8 freed bytes
>  r - 8 redzone bytes
>  . - 8 allocated bytes
>  x=1..7 - x allocated bytes + (8-x) redzone bytes
> --


Yeah, we had many reports in the past that something was wrong ...

Your nice report made me take a look, finally :(

Problem comes from

net/ipv4/udp.c:1008:                    sk_dst_set(sk, dst_clone(&rt->dst));

Could you try following patch ?

Thanks !

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 4468e1adc094..d5e68ee63b8f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1004,8 +1004,11 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		if ((rt->rt_flags & RTCF_BROADCAST) &&
 		    !sock_flag(sk, SOCK_BROADCAST))
 			goto out;
-		if (connected)
-			sk_dst_set(sk, dst_clone(&rt->dst));
+		if (connected) {
+			spin_lock_bh(&sk->sk_lock.slock);
+			__sk_dst_set(sk, dst_clone(&rt->dst));
+			spin_unlock_bh(&sk->sk_lock.slock);
+		}
 	}
 
 	if (msg->msg_flags&MSG_CONFIRM)

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: Potential race in ip4_datagram_release_cb
  2014-06-06 12:56 ` Eric Dumazet
@ 2014-06-06 15:59   ` Alexei Starovoitov
  2014-06-06 16:16     ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2014-06-06 15:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, netdev, Kostya Serebryany, Dmitry Vyukov,
	Lars Bull, Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	dormando

On Fri, Jun 6, 2014 at 5:56 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2014-06-06 at 15:29 +0400, Alexey Preobrazhensky wrote:
>> Hello,
>>
>> I’m working on AddressSanitizer[1] -- a tool that detects
>> use-after-free and out-of-bounds bugs in kernel.
>>
>> We’ve encountered a heap-use-after-free in ip4_datagram_release_cb()
>> in linux kernel 3.15-rc5 (revision
>> 60b5f90d0fac7585f1a43ccdad06787b97eda0ab).
>>
>> It seems to be a race between dst_release() and
>> ip4_datagram_release_cb() on an object from ip_dst_cache slab, all
>> during the ip4_datagram_connect() call.
>>
>> This heap-use-after-free was triggered under trinity syscall fuzzer,
>> so there is no reproducer.
>>
>> It would be great if someone familiar with the code took time to look
>> into this report.
>>
>> Thanks,
>>  Alexey
>>
>> [1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
>>
>>
>> AddressSanitizer: heap-use-after-free in ipv4_dst_check
>> Read of size 2 by thread T15453:
>>  [<ffffffff817daa3a>] ipv4_dst_check+0x1a/0x90 ./net/ipv4/route.c:1116
>>  [<ffffffff8175b789>] __sk_dst_check+0x89/0xe0 ./net/core/sock.c:531
>>  [<ffffffff81830a36>] ip4_datagram_release_cb+0x46/0x390 ??:0
>>  [<ffffffff8175eaea>] release_sock+0x17a/0x230 ./net/core/sock.c:2413
>>  [<ffffffff81830882>] ip4_datagram_connect+0x462/0x5d0 ??:0
>>  [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
>>  [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
>>  [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
>>  [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
>> ./arch/x86/kernel/entry_64.S:629
>>
>> Freed by thread T15455:
>>  [<ffffffff8178d9b8>] dst_destroy+0xa8/0x160 ./net/core/dst.c:251
>>  [<ffffffff8178de25>] dst_release+0x45/0x80 ./net/core/dst.c:280
>>  [<ffffffff818304c1>] ip4_datagram_connect+0xa1/0x5d0 ??:0
>>  [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
>>  [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
>>  [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
>>  [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
>> ./arch/x86/kernel/entry_64.S:629
>>
>> Allocated by thread T15453:
>>  [<ffffffff8178d291>] dst_alloc+0x81/0x2b0 ./net/core/dst.c:171
>>  [<ffffffff817db3b7>] rt_dst_alloc+0x47/0x50 ./net/ipv4/route.c:1406
>>  [<     inlined    >] __ip_route_output_key+0x3e8/0xf70
>> __mkroute_output ./net/ipv4/route.c:1939
>>  [<ffffffff817dde08>] __ip_route_output_key+0x3e8/0xf70 ./net/ipv4/route.c:2161
>>  [<ffffffff817deb34>] ip_route_output_flow+0x14/0x30 ./net/ipv4/route.c:2249
>>  [<ffffffff81830737>] ip4_datagram_connect+0x317/0x5d0 ??:0
>>  [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
>>  [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
>>  [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
>>  [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
>> ./arch/x86/kernel/entry_64.S:629
>>
>> The buggy address ffff880024ff2266 is located 102 bytes inside
>>  of 192-byte region [ffff880024ff2200, ffff880024ff22c0)
>>
>> Memory state around the buggy address:
>>  ffff880024ff1d00: ffffffff fffrrrrr rrrrrrrr rrrrrrrr
>>  ffff880024ff1e00: ffffffff ffffffff ffffffff fffrrrrr
>>  ffff880024ff1f00: rrrrrrrr rrrrrrrr rrrrrrrr rrrrrrrr
>>  ffff880024ff2000: rrrrrrrr rrrrrrrr rrrrrrrr rrrrrrrr
>>  ffff880024ff2100: rrrrrrrr rrrrrrrr rrrrrrrr rrrrrrrr
>> >ffff880024ff2200: ffffffff ffffffff ffffffff rrrrrrrr
>>                                 ^
>>  ffff880024ff2300: rrrrrrrr rrrrrrrr ........ ........
>>  ffff880024ff2400: ........ rrrrrrrr rrrrrrrr rrrrrrrr
>>  ffff880024ff2500: ffffffff ffffffff ffffffff rrrrrrrr
>>  ffff880024ff2600: rrrrrrrr rrrrrrrr ffffffff ffffffff
>>  ffff880024ff2700: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
>> Legend:
>>  f - 8 freed bytes
>>  r - 8 redzone bytes
>>  . - 8 allocated bytes
>>  x=1..7 - x allocated bytes + (8-x) redzone bytes
>> --
>
>
> Yeah, we had many reports in the past that something was wrong ...
>
> Your nice report made me take a look, finally :(
>
> Problem comes from
>
> net/ipv4/udp.c:1008:                    sk_dst_set(sk, dst_clone(&rt->dst));
>
> Could you try following patch ?
>
> Thanks !
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 4468e1adc094..d5e68ee63b8f 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1004,8 +1004,11 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
>                 if ((rt->rt_flags & RTCF_BROADCAST) &&
>                     !sock_flag(sk, SOCK_BROADCAST))
>                         goto out;
> -               if (connected)
> -                       sk_dst_set(sk, dst_clone(&rt->dst));
> +               if (connected) {
> +                       spin_lock_bh(&sk->sk_lock.slock);
> +                       __sk_dst_set(sk, dst_clone(&rt->dst));
> +                       spin_unlock_bh(&sk->sk_lock.slock);
> +               }

Nice catch.
Should then we change sk_dst_set() itself to do spin_lock_bh unconditionally?
Seems overhead is smaller, than checking all possible callsites manually.

cc-ing Dormando as well.

>         }
>
>         if (msg->msg_flags&MSG_CONFIRM)
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Potential race in ip4_datagram_release_cb
  2014-06-06 15:59   ` Alexei Starovoitov
@ 2014-06-06 16:16     ` Eric Dumazet
  2014-06-06 17:44       ` Alexei Starovoitov
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2014-06-06 16:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexey Preobrazhensky, netdev, Kostya Serebryany, Dmitry Vyukov,
	Lars Bull, Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	dormando

On Fri, 2014-06-06 at 08:59 -0700, Alexei Starovoitov wrote:
> On Fri, Jun 6, 2014 at 5:56 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Fri, 2014-06-06 at 15:29 +0400, Alexey Preobrazhensky wrote:
> >> Hello,
> >>
> >> I’m working on AddressSanitizer[1] -- a tool that detects
> >> use-after-free and out-of-bounds bugs in kernel.
> >>
> >> We’ve encountered a heap-use-after-free in ip4_datagram_release_cb()
> >> in linux kernel 3.15-rc5 (revision
> >> 60b5f90d0fac7585f1a43ccdad06787b97eda0ab).
> >>
> >> It seems to be a race between dst_release() and
> >> ip4_datagram_release_cb() on an object from ip_dst_cache slab, all
> >> during the ip4_datagram_connect() call.
> >>
> >> This heap-use-after-free was triggered under trinity syscall fuzzer,
> >> so there is no reproducer.
> >>
> >> It would be great if someone familiar with the code took time to look
> >> into this report.
> >>
> >> Thanks,
> >>  Alexey
> >>
> >> [1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
> >>
> >>
> >> AddressSanitizer: heap-use-after-free in ipv4_dst_check
> >> Read of size 2 by thread T15453:
> >>  [<ffffffff817daa3a>] ipv4_dst_check+0x1a/0x90 ./net/ipv4/route.c:1116
> >>  [<ffffffff8175b789>] __sk_dst_check+0x89/0xe0 ./net/core/sock.c:531
> >>  [<ffffffff81830a36>] ip4_datagram_release_cb+0x46/0x390 ??:0
> >>  [<ffffffff8175eaea>] release_sock+0x17a/0x230 ./net/core/sock.c:2413
> >>  [<ffffffff81830882>] ip4_datagram_connect+0x462/0x5d0 ??:0
> >>  [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
> >>  [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
> >>  [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
> >>  [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
> >> ./arch/x86/kernel/entry_64.S:629
> >>
> >> Freed by thread T15455:
> >>  [<ffffffff8178d9b8>] dst_destroy+0xa8/0x160 ./net/core/dst.c:251
> >>  [<ffffffff8178de25>] dst_release+0x45/0x80 ./net/core/dst.c:280
> >>  [<ffffffff818304c1>] ip4_datagram_connect+0xa1/0x5d0 ??:0
> >>  [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
> >>  [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
> >>  [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
> >>  [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
> >> ./arch/x86/kernel/entry_64.S:629
> >>
> >> Allocated by thread T15453:
> >>  [<ffffffff8178d291>] dst_alloc+0x81/0x2b0 ./net/core/dst.c:171
> >>  [<ffffffff817db3b7>] rt_dst_alloc+0x47/0x50 ./net/ipv4/route.c:1406
> >>  [<     inlined    >] __ip_route_output_key+0x3e8/0xf70
> >> __mkroute_output ./net/ipv4/route.c:1939
> >>  [<ffffffff817dde08>] __ip_route_output_key+0x3e8/0xf70 ./net/ipv4/route.c:2161
> >>  [<ffffffff817deb34>] ip_route_output_flow+0x14/0x30 ./net/ipv4/route.c:2249
> >>  [<ffffffff81830737>] ip4_datagram_connect+0x317/0x5d0 ??:0
> >>  [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
> >>  [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
> >>  [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
> >>  [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
> >> ./arch/x86/kernel/entry_64.S:629
> >>
> >> The buggy address ffff880024ff2266 is located 102 bytes inside
> >>  of 192-byte region [ffff880024ff2200, ffff880024ff22c0)
> >>
> >> Memory state around the buggy address:
> >>  ffff880024ff1d00: ffffffff fffrrrrr rrrrrrrr rrrrrrrr
> >>  ffff880024ff1e00: ffffffff ffffffff ffffffff fffrrrrr
> >>  ffff880024ff1f00: rrrrrrrr rrrrrrrr rrrrrrrr rrrrrrrr
> >>  ffff880024ff2000: rrrrrrrr rrrrrrrr rrrrrrrr rrrrrrrr
> >>  ffff880024ff2100: rrrrrrrr rrrrrrrr rrrrrrrr rrrrrrrr
> >> >ffff880024ff2200: ffffffff ffffffff ffffffff rrrrrrrr
> >>                                 ^
> >>  ffff880024ff2300: rrrrrrrr rrrrrrrr ........ ........
> >>  ffff880024ff2400: ........ rrrrrrrr rrrrrrrr rrrrrrrr
> >>  ffff880024ff2500: ffffffff ffffffff ffffffff rrrrrrrr
> >>  ffff880024ff2600: rrrrrrrr rrrrrrrr ffffffff ffffffff
> >>  ffff880024ff2700: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
> >> Legend:
> >>  f - 8 freed bytes
> >>  r - 8 redzone bytes
> >>  . - 8 allocated bytes
> >>  x=1..7 - x allocated bytes + (8-x) redzone bytes
> >> --
> >
> >
> > Yeah, we had many reports in the past that something was wrong ...
> >
> > Your nice report made me take a look, finally :(
> >
> > Problem comes from
> >
> > net/ipv4/udp.c:1008:                    sk_dst_set(sk, dst_clone(&rt->dst));
> >
> > Could you try following patch ?
> >
> > Thanks !
> >
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 4468e1adc094..d5e68ee63b8f 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -1004,8 +1004,11 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
> >                 if ((rt->rt_flags & RTCF_BROADCAST) &&
> >                     !sock_flag(sk, SOCK_BROADCAST))
> >                         goto out;
> > -               if (connected)
> > -                       sk_dst_set(sk, dst_clone(&rt->dst));
> > +               if (connected) {
> > +                       spin_lock_bh(&sk->sk_lock.slock);
> > +                       __sk_dst_set(sk, dst_clone(&rt->dst));
> > +                       spin_unlock_bh(&sk->sk_lock.slock);
> > +               }
> 
> Nice catch.
> Should then we change sk_dst_set() itself to do spin_lock_bh unconditionally?
> Seems overhead is smaller, than checking all possible callsites manually.
> 
> cc-ing Dormando as well.


Real problem is that sk_dst_set() uses a different lock. I never
understood how this was supposed to work.

We should either :

1) use xchg() and no lock at all to change sk_dst_cache, as we did for
sk_rx_dst ( cf udp_sk_rx_dst_set() )

2) No longer use sk_dst_lock, and always use the socket lock
(sk->sk_lock.slock) instead.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Potential race in ip4_datagram_release_cb
  2014-06-06 16:16     ` Eric Dumazet
@ 2014-06-06 17:44       ` Alexei Starovoitov
  2014-06-06 17:56         ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2014-06-06 17:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, netdev, Kostya Serebryany, Dmitry Vyukov,
	Lars Bull, Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	dormando

On Fri, Jun 6, 2014 at 9:16 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2014-06-06 at 08:59 -0700, Alexei Starovoitov wrote:
>> On Fri, Jun 6, 2014 at 5:56 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Fri, 2014-06-06 at 15:29 +0400, Alexey Preobrazhensky wrote:
>> >> Hello,
>> >>
>> >> I’m working on AddressSanitizer[1] -- a tool that detects
>> >> use-after-free and out-of-bounds bugs in kernel.
>> >>
>> >> We’ve encountered a heap-use-after-free in ip4_datagram_release_cb()
>> >> in linux kernel 3.15-rc5 (revision
>> >> 60b5f90d0fac7585f1a43ccdad06787b97eda0ab).
>> >>
>> >> It seems to be a race between dst_release() and
>> >> ip4_datagram_release_cb() on an object from ip_dst_cache slab, all
>> >> during the ip4_datagram_connect() call.
>> >>
>> >> This heap-use-after-free was triggered under trinity syscall fuzzer,
>> >> so there is no reproducer.
>> >>
>> >> It would be great if someone familiar with the code took time to look
>> >> into this report.
>> >>
>> >> Thanks,
>> >>  Alexey
>> >>
>> >> [1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
>> >>
>> >>
>> >> AddressSanitizer: heap-use-after-free in ipv4_dst_check
>> >> Read of size 2 by thread T15453:
>> >>  [<ffffffff817daa3a>] ipv4_dst_check+0x1a/0x90 ./net/ipv4/route.c:1116
>> >>  [<ffffffff8175b789>] __sk_dst_check+0x89/0xe0 ./net/core/sock.c:531
>> >>  [<ffffffff81830a36>] ip4_datagram_release_cb+0x46/0x390 ??:0
>> >>  [<ffffffff8175eaea>] release_sock+0x17a/0x230 ./net/core/sock.c:2413
>> >>  [<ffffffff81830882>] ip4_datagram_connect+0x462/0x5d0 ??:0
>> >>  [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
>> >>  [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
>> >>  [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
>> >>  [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
>> >> ./arch/x86/kernel/entry_64.S:629
>> >>
>> >> Freed by thread T15455:
>> >>  [<ffffffff8178d9b8>] dst_destroy+0xa8/0x160 ./net/core/dst.c:251
>> >>  [<ffffffff8178de25>] dst_release+0x45/0x80 ./net/core/dst.c:280
>> >>  [<ffffffff818304c1>] ip4_datagram_connect+0xa1/0x5d0 ??:0
>> >>  [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
>> >>  [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
>> >>  [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
>> >>  [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
>> >> ./arch/x86/kernel/entry_64.S:629
>> >>
>> >> Allocated by thread T15453:
>> >>  [<ffffffff8178d291>] dst_alloc+0x81/0x2b0 ./net/core/dst.c:171
>> >>  [<ffffffff817db3b7>] rt_dst_alloc+0x47/0x50 ./net/ipv4/route.c:1406
>> >>  [<     inlined    >] __ip_route_output_key+0x3e8/0xf70
>> >> __mkroute_output ./net/ipv4/route.c:1939
>> >>  [<ffffffff817dde08>] __ip_route_output_key+0x3e8/0xf70 ./net/ipv4/route.c:2161
>> >>  [<ffffffff817deb34>] ip_route_output_flow+0x14/0x30 ./net/ipv4/route.c:2249
>> >>  [<ffffffff81830737>] ip4_datagram_connect+0x317/0x5d0 ??:0
>> >>  [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
>> >>  [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
>> >>  [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
>> >>  [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
>> >> ./arch/x86/kernel/entry_64.S:629
>> >>
>> >> The buggy address ffff880024ff2266 is located 102 bytes inside
>> >>  of 192-byte region [ffff880024ff2200, ffff880024ff22c0)
>> >>
>> >> Memory state around the buggy address:
>> >>  ffff880024ff1d00: ffffffff fffrrrrr rrrrrrrr rrrrrrrr
>> >>  ffff880024ff1e00: ffffffff ffffffff ffffffff fffrrrrr
>> >>  ffff880024ff1f00: rrrrrrrr rrrrrrrr rrrrrrrr rrrrrrrr
>> >>  ffff880024ff2000: rrrrrrrr rrrrrrrr rrrrrrrr rrrrrrrr
>> >>  ffff880024ff2100: rrrrrrrr rrrrrrrr rrrrrrrr rrrrrrrr
>> >> >ffff880024ff2200: ffffffff ffffffff ffffffff rrrrrrrr
>> >>                                 ^
>> >>  ffff880024ff2300: rrrrrrrr rrrrrrrr ........ ........
>> >>  ffff880024ff2400: ........ rrrrrrrr rrrrrrrr rrrrrrrr
>> >>  ffff880024ff2500: ffffffff ffffffff ffffffff rrrrrrrr
>> >>  ffff880024ff2600: rrrrrrrr rrrrrrrr ffffffff ffffffff
>> >>  ffff880024ff2700: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
>> >> Legend:
>> >>  f - 8 freed bytes
>> >>  r - 8 redzone bytes
>> >>  . - 8 allocated bytes
>> >>  x=1..7 - x allocated bytes + (8-x) redzone bytes
>> >> --
>> >
>> >
>> > Yeah, we had many reports in the past that something was wrong ...
>> >
>> > Your nice report made me take a look, finally :(
>> >
>> > Problem comes from
>> >
>> > net/ipv4/udp.c:1008:                    sk_dst_set(sk, dst_clone(&rt->dst));
>> >
>> > Could you try following patch ?
>> >
>> > Thanks !
>> >
>> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> > index 4468e1adc094..d5e68ee63b8f 100644
>> > --- a/net/ipv4/udp.c
>> > +++ b/net/ipv4/udp.c
>> > @@ -1004,8 +1004,11 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
>> >                 if ((rt->rt_flags & RTCF_BROADCAST) &&
>> >                     !sock_flag(sk, SOCK_BROADCAST))
>> >                         goto out;
>> > -               if (connected)
>> > -                       sk_dst_set(sk, dst_clone(&rt->dst));
>> > +               if (connected) {
>> > +                       spin_lock_bh(&sk->sk_lock.slock);
>> > +                       __sk_dst_set(sk, dst_clone(&rt->dst));
>> > +                       spin_unlock_bh(&sk->sk_lock.slock);
>> > +               }
>>
>> Nice catch.
>> Should then we change sk_dst_set() itself to do spin_lock_bh unconditionally?
>> Seems overhead is smaller, than checking all possible callsites manually.
>>
>> cc-ing Dormando as well.
>
>
> Real problem is that sk_dst_set() uses a different lock. I never
> understood how this was supposed to work.

we probably need to test the assumption in the sk_dst_set() comment:
/* This can be called while sk is owned by the caller only */
I don't understand why this piece of code doesn't do:
old_dst = rcu_dereference_check(sk->sk_dst_cache,
sock_owned_by_user(sk) || lockdep_is_held(&sk->sk_lock.slock));
like sk_dst_get does...

If I'm reading the code correctly, at the time of sk_dst_set() call
from udp_sendmsg()
the sk_lock.slock is not held and sock_owned_by_user(sk) is false as well.
Same not held condition is in the first sk_dst_reset() call from
ip4_datagram_connect().

Though the lock is properly held from release_sock()->ip4_datagram_release_cb()
Cannot agree more with Eric "how is it supposed to work?"

> We should either :
>
> 1) use xchg() and no lock at all to change sk_dst_cache, as we did for
> sk_rx_dst ( cf udp_sk_rx_dst_set() )
>
> 2) No longer use sk_dst_lock, and always use the socket lock
> (sk->sk_lock.slock) instead.

Probably needs some combination of both.
sk_dst_lock seems useless for ipv4.
What the following suppose to do in ipv6_update_options() ?!
                spin_lock(&sk->sk_dst_lock);
                opt = xchg(&inet6_sk(sk)->opt, opt);
                spin_unlock(&sk->sk_dst_lock);
seems ipv6 is reusing the same lock for completely different reason.

>
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Potential race in ip4_datagram_release_cb
  2014-06-06 17:44       ` Alexei Starovoitov
@ 2014-06-06 17:56         ` Eric Dumazet
  2014-06-06 18:13           ` Alexei Starovoitov
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2014-06-06 17:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexey Preobrazhensky, netdev, Kostya Serebryany, Dmitry Vyukov,
	Lars Bull, Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	dormando

On Fri, 2014-06-06 at 10:44 -0700, Alexei Starovoitov wrote:
> On Fri, Jun 6, 2014 at 9:16 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > We should either :
> >
> > 1) use xchg() and no lock at all to change sk_dst_cache, as we did for
> > sk_rx_dst ( cf udp_sk_rx_dst_set() )
> >
> > 2) No longer use sk_dst_lock, and always use the socket lock
> > (sk->sk_lock.slock) instead.
> 
> Probably needs some combination of both.

I do not think so.

If we use xchg(), then we do not need anything else.

If we use a spinlock, then xchg() seems useless.

Any combination is buggy.

In the past, UDP transmit was holding socket lock,
but we added a lockless path, and I suppose more people
use a UDP socket from different threads.

Then, Steffen added the ip4_datagram_release_cb() thing,
increasing the chance of mixing both 'protections'.

Setting sk_dst_cache should hardly be a hot path.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Potential race in ip4_datagram_release_cb
  2014-06-06 17:56         ` Eric Dumazet
@ 2014-06-06 18:13           ` Alexei Starovoitov
  0 siblings, 0 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2014-06-06 18:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, netdev, Kostya Serebryany, Dmitry Vyukov,
	Lars Bull, Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	dormando

On Fri, Jun 6, 2014 at 10:56 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2014-06-06 at 10:44 -0700, Alexei Starovoitov wrote:
>> On Fri, Jun 6, 2014 at 9:16 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> > We should either :
>> >
>> > 1) use xchg() and no lock at all to change sk_dst_cache, as we did for
>> > sk_rx_dst ( cf udp_sk_rx_dst_set() )
>> >
>> > 2) No longer use sk_dst_lock, and always use the socket lock
>> > (sk->sk_lock.slock) instead.
>>
>> Probably needs some combination of both.
>
> I do not think so.
>
> If we use xchg(), then we do not need anything else.
>
> If we use a spinlock, then xchg() seems useless.
>
> Any combination is buggy.
>
> In the past, UDP transmit was holding socket lock,
> but we added a lockless path, and I suppose more people
> use a UDP socket from different threads.
>
> Then, Steffen added the ip4_datagram_release_cb() thing,
> increasing the chance of mixing both 'protections'.
>
> Setting sk_dst_cache should hardly be a hot path.

yeah. if we use sk_lock.slock then xchg is obviously not needed.
By 'both' I meant to do xchg() and get rid of sk_dst_lock to speed it up.
Though not worth going too fancy if speed is not needed.

>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-06 11:29 Potential race in ip4_datagram_release_cb Alexey Preobrazhensky
  2014-06-06 12:56 ` Eric Dumazet
@ 2014-06-10 13:43 ` Eric Dumazet
  2014-06-11  0:32   ` dormando
  2014-06-11 22:39   ` [PATCH] ipv4: fix a race in ip4_datagram_release_cb() David Miller
  1 sibling, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2014-06-10 13:43 UTC (permalink / raw)
  To: Alexey Preobrazhensky, dormando, Steffen Klassert, David Miller, paulmck
  Cc: netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

From: Eric Dumazet <edumazet@google.com>

Alexey gave a AddressSanitizer[1] report that finally gave a good hint
at where was the origin of various problems already reported by Dormando
in the past [2]

Problem comes from the fact that UDP can have a lockless TX path, and
concurrent threads can manipulate sk_dst_cache, while another thread,
is holding socket lock and calls __sk_dst_set() in
ip4_datagram_release_cb() (this was added in linux-3.8)

It seems that all we need to do is to use sk_dst_check() and
sk_dst_set() so that all the writers hold same spinlock
(sk->sk_dst_lock) to prevent corruptions.

TCP stack do not need this protection, as all sk_dst_cache writers hold
the socket lock.

[1]
https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel

AddressSanitizer: heap-use-after-free in ipv4_dst_check
Read of size 2 by thread T15453:
 [<ffffffff817daa3a>] ipv4_dst_check+0x1a/0x90 ./net/ipv4/route.c:1116
 [<ffffffff8175b789>] __sk_dst_check+0x89/0xe0 ./net/core/sock.c:531
 [<ffffffff81830a36>] ip4_datagram_release_cb+0x46/0x390 ??:0
 [<ffffffff8175eaea>] release_sock+0x17a/0x230 ./net/core/sock.c:2413
 [<ffffffff81830882>] ip4_datagram_connect+0x462/0x5d0 ??:0
 [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
 [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
 [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
 [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
./arch/x86/kernel/entry_64.S:629

Freed by thread T15455:
 [<ffffffff8178d9b8>] dst_destroy+0xa8/0x160 ./net/core/dst.c:251
 [<ffffffff8178de25>] dst_release+0x45/0x80 ./net/core/dst.c:280
 [<ffffffff818304c1>] ip4_datagram_connect+0xa1/0x5d0 ??:0
 [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
 [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
 [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
 [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
./arch/x86/kernel/entry_64.S:629

Allocated by thread T15453:
 [<ffffffff8178d291>] dst_alloc+0x81/0x2b0 ./net/core/dst.c:171
 [<ffffffff817db3b7>] rt_dst_alloc+0x47/0x50 ./net/ipv4/route.c:1406
 [<     inlined    >] __ip_route_output_key+0x3e8/0xf70
__mkroute_output ./net/ipv4/route.c:1939
 [<ffffffff817dde08>] __ip_route_output_key+0x3e8/0xf70 ./net/ipv4/route.c:2161
 [<ffffffff817deb34>] ip_route_output_flow+0x14/0x30 ./net/ipv4/route.c:2249
 [<ffffffff81830737>] ip4_datagram_connect+0x317/0x5d0 ??:0
 [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
 [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
 [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
 [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
./arch/x86/kernel/entry_64.S:629

[2]
<4>[196727.311203] general protection fault: 0000 [#1] SMP
<4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
<4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
<4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
<4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
<4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
<4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
<4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
<4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
<4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
<4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
<4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
<4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
<4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[196727.311713] Stack:
<4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
<4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
<4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
<4>[196727.311885] Call Trace:
<4>[196727.311907]  <IRQ>
<4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
<4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
<4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
<4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
<4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
<4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
<4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
<4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
<4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
<4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
<4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
<4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
<4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
<4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
<4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
<4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
<4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
<4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
<4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
<4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
<4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
<4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
<4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
<4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
<4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
<4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
<4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
<4>[196727.312722]  <EOI>
<1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
<4>[196727.313100]  RSP <ffff885effd23a70>
<4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
<0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt

Reported-by: Alexey Preobrazhensky <preobr@google.com>
Reported-by: dormando <dormando@rydia.ne>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 8141ed9fcedb2 ("ipv4: Add a socket release callback for datagram sockets")
Cc: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/ipv4/datagram.c |   20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index 8b5134c582f1..a3095fdefbed 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -86,18 +86,26 @@ out:
 }
 EXPORT_SYMBOL(ip4_datagram_connect);
 
+/* Because UDP xmit path can manipulate sk_dst_cache without holding
+ * socket lock, we need to use sk_dst_set() here,
+ * even if we own the socket lock.
+ */
 void ip4_datagram_release_cb(struct sock *sk)
 {
 	const struct inet_sock *inet = inet_sk(sk);
 	const struct ip_options_rcu *inet_opt;
 	__be32 daddr = inet->inet_daddr;
+	struct dst_entry *dst;
 	struct flowi4 fl4;
 	struct rtable *rt;
 
-	if (! __sk_dst_get(sk) || __sk_dst_check(sk, 0))
-		return;
-
 	rcu_read_lock();
+
+	dst = __sk_dst_get(sk);
+	if (!dst || !dst->obsolete || dst->ops->check(dst, 0)) {
+		rcu_read_unlock();
+		return;
+	}
 	inet_opt = rcu_dereference(inet->inet_opt);
 	if (inet_opt && inet_opt->opt.srr)
 		daddr = inet_opt->opt.faddr;
@@ -105,8 +113,10 @@ void ip4_datagram_release_cb(struct sock *sk)
 				   inet->inet_saddr, inet->inet_dport,
 				   inet->inet_sport, sk->sk_protocol,
 				   RT_CONN_FLAGS(sk), sk->sk_bound_dev_if);
-	if (!IS_ERR(rt))
-		__sk_dst_set(sk, &rt->dst);
+
+	dst = !IS_ERR(rt) ? &rt->dst : NULL;
+	sk_dst_set(sk, dst);
+
 	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(ip4_datagram_release_cb);

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-10 13:43 ` [PATCH] ipv4: fix a race in ip4_datagram_release_cb() Eric Dumazet
@ 2014-06-11  0:32   ` dormando
  2014-06-11  0:55     ` Eric Dumazet
  2014-06-11 22:39   ` [PATCH] ipv4: fix a race in ip4_datagram_release_cb() David Miller
  1 sibling, 1 reply; 41+ messages in thread
From: dormando @ 2014-06-11  0:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Tue, 10 Jun 2014, Eric Dumazet wrote:

> From: Eric Dumazet <edumazet@google.com>
>
> Alexey gave a AddressSanitizer[1] report that finally gave a good hint
> at where was the origin of various problems already reported by Dormando
> in the past [2]
>
> Problem comes from the fact that UDP can have a lockless TX path, and
> concurrent threads can manipulate sk_dst_cache, while another thread,
> is holding socket lock and calls __sk_dst_set() in
> ip4_datagram_release_cb() (this was added in linux-3.8)
>
> It seems that all we need to do is to use sk_dst_check() and
> sk_dst_set() so that all the writers hold same spinlock
> (sk->sk_dst_lock) to prevent corruptions.
>
> TCP stack do not need this protection, as all sk_dst_cache writers hold
> the socket lock.
>
> [1]
> https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
>
> AddressSanitizer: heap-use-after-free in ipv4_dst_check
> Read of size 2 by thread T15453:
>  [<ffffffff817daa3a>] ipv4_dst_check+0x1a/0x90 ./net/ipv4/route.c:1116
>  [<ffffffff8175b789>] __sk_dst_check+0x89/0xe0 ./net/core/sock.c:531
>  [<ffffffff81830a36>] ip4_datagram_release_cb+0x46/0x390 ??:0
>  [<ffffffff8175eaea>] release_sock+0x17a/0x230 ./net/core/sock.c:2413
>  [<ffffffff81830882>] ip4_datagram_connect+0x462/0x5d0 ??:0
>  [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
>  [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
>  [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
>  [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
> ./arch/x86/kernel/entry_64.S:629
>
> Freed by thread T15455:
>  [<ffffffff8178d9b8>] dst_destroy+0xa8/0x160 ./net/core/dst.c:251
>  [<ffffffff8178de25>] dst_release+0x45/0x80 ./net/core/dst.c:280
>  [<ffffffff818304c1>] ip4_datagram_connect+0xa1/0x5d0 ??:0
>  [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
>  [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
>  [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
>  [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
> ./arch/x86/kernel/entry_64.S:629
>
> Allocated by thread T15453:
>  [<ffffffff8178d291>] dst_alloc+0x81/0x2b0 ./net/core/dst.c:171
>  [<ffffffff817db3b7>] rt_dst_alloc+0x47/0x50 ./net/ipv4/route.c:1406
>  [<     inlined    >] __ip_route_output_key+0x3e8/0xf70
> __mkroute_output ./net/ipv4/route.c:1939
>  [<ffffffff817dde08>] __ip_route_output_key+0x3e8/0xf70 ./net/ipv4/route.c:2161
>  [<ffffffff817deb34>] ip_route_output_flow+0x14/0x30 ./net/ipv4/route.c:2249
>  [<ffffffff81830737>] ip4_datagram_connect+0x317/0x5d0 ??:0
>  [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
>  [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
>  [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
>  [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
> ./arch/x86/kernel/entry_64.S:629
>
> [2]
> <4>[196727.311203] general protection fault: 0000 [#1] SMP
> <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
> <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
> <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
> <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
> <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
> <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
> <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
> <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
> <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
> <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
> <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> <4>[196727.311713] Stack:
> <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
> <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
> <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
> <4>[196727.311885] Call Trace:
> <4>[196727.311907]  <IRQ>
> <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
> <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
> <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
> <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
> <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
> <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
> <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
> <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
> <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
> <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
> <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
> <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
> <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
> <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
> <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
> <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
> <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
> <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
> <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
> <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
> <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
> <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
> <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
> <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
> <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
> <4>[196727.312722]  <EOI>
> <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> <4>[196727.313100]  RSP <ffff885effd23a70>
> <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
> <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
>
> Reported-by: Alexey Preobrazhensky <preobr@google.com>
> Reported-by: dormando <dormando@rydia.ne>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Fixes: 8141ed9fcedb2 ("ipv4: Add a socket release callback for datagram sockets")
> Cc: Steffen Klassert <steffen.klassert@secunet.com>
> ---
>  net/ipv4/datagram.c |   20 +++++++++++++++-----
>  1 file changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
> index 8b5134c582f1..a3095fdefbed 100644
> --- a/net/ipv4/datagram.c
> +++ b/net/ipv4/datagram.c
> @@ -86,18 +86,26 @@ out:
>  }
>  EXPORT_SYMBOL(ip4_datagram_connect);
>
> +/* Because UDP xmit path can manipulate sk_dst_cache without holding
> + * socket lock, we need to use sk_dst_set() here,
> + * even if we own the socket lock.
> + */
>  void ip4_datagram_release_cb(struct sock *sk)
>  {
>  	const struct inet_sock *inet = inet_sk(sk);
>  	const struct ip_options_rcu *inet_opt;
>  	__be32 daddr = inet->inet_daddr;
> +	struct dst_entry *dst;
>  	struct flowi4 fl4;
>  	struct rtable *rt;
>
> -	if (! __sk_dst_get(sk) || __sk_dst_check(sk, 0))
> -		return;
> -
>  	rcu_read_lock();
> +
> +	dst = __sk_dst_get(sk);
> +	if (!dst || !dst->obsolete || dst->ops->check(dst, 0)) {
> +		rcu_read_unlock();
> +		return;
> +	}
>  	inet_opt = rcu_dereference(inet->inet_opt);
>  	if (inet_opt && inet_opt->opt.srr)
>  		daddr = inet_opt->opt.faddr;
> @@ -105,8 +113,10 @@ void ip4_datagram_release_cb(struct sock *sk)
>  				   inet->inet_saddr, inet->inet_dport,
>  				   inet->inet_sport, sk->sk_protocol,
>  				   RT_CONN_FLAGS(sk), sk->sk_bound_dev_if);
> -	if (!IS_ERR(rt))
> -		__sk_dst_set(sk, &rt->dst);
> +
> +	dst = !IS_ERR(rt) ? &rt->dst : NULL;
> +	sk_dst_set(sk, dst);
> +
>  	rcu_read_unlock();
>  }
>  EXPORT_SYMBOL_GPL(ip4_datagram_release_cb);
>
>
>

This is neat. So just confirming that I understand (thanks for re-cc'ing
me btw <3):

Because of the UDP glitch, it can cause either the kernel panic in the UDP
code, but also elsewhere in the TCP stack based on if it's fiddling with
the same DST being hosed by a UDP packet?

Unfortunately we're observing a pretty severe sintr cpu increase under
3.14, so I'm going to try this on 3.10.latest. Any potential issues there?

Thankfully it seems like our udpkill utility will reproduce it quickly.
Lets see how it goes!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-11  0:32   ` dormando
@ 2014-06-11  0:55     ` Eric Dumazet
  2014-06-11  1:12       ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2014-06-11  0:55 UTC (permalink / raw)
  To: dormando
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Tue, 2014-06-10 at 17:32 -0700, dormando wrote:

> >
> 
> This is neat. So just confirming that I understand (thanks for re-cc'ing
> me btw <3):
> 
> Because of the UDP glitch, it can cause either the kernel panic in the UDP
> code, but also elsewhere in the TCP stack based on if it's fiddling with
> the same DST being hosed by a UDP packet?
> 
> Unfortunately we're observing a pretty severe sintr cpu increase under
> 3.14, so I'm going to try this on 3.10.latest. Any potential issues there?
> 
> Thankfully it seems like our udpkill utility will reproduce it quickly.
> Lets see how it goes!

This is a start, because we have other bugs of the same kind.

I was working on follow up

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-11  0:55     ` Eric Dumazet
@ 2014-06-11  1:12       ` Eric Dumazet
  2014-06-11  1:26         ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2014-06-11  1:12 UTC (permalink / raw)
  To: dormando
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Tue, 2014-06-10 at 17:55 -0700, Eric Dumazet wrote:
> On Tue, 2014-06-10 at 17:32 -0700, dormando wrote:
> 
> > >
> > 
> > This is neat. So just confirming that I understand (thanks for re-cc'ing
> > me btw <3):
> > 
> > Because of the UDP glitch, it can cause either the kernel panic in the UDP
> > code, but also elsewhere in the TCP stack based on if it's fiddling with
> > the same DST being hosed by a UDP packet?
> > 
> > Unfortunately we're observing a pretty severe sintr cpu increase under
> > 3.14, so I'm going to try this on 3.10.latest. Any potential issues there?
> > 
> > Thankfully it seems like our udpkill utility will reproduce it quickly.
> > Lets see how it goes!
> 
> This is a start, because we have other bugs of the same kind.
> 
> I was working on follow up
> 
> 

For the curious, another problem is in ipv4_sk_update_pmtu()

This can be called on UDP sockets, but from softirq context.

We cannot use sk_dst_lock because this lock is not softirq safe.

I guess we should use xchg() for sk_dst_set() and sk_dst_reset()

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-11  1:12       ` Eric Dumazet
@ 2014-06-11  1:26         ` Eric Dumazet
  2014-06-11  4:16           ` dormando
                             ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Eric Dumazet @ 2014-06-11  1:26 UTC (permalink / raw)
  To: dormando
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Tue, 2014-06-10 at 18:12 -0700, Eric Dumazet wrote:

> 
> For the curious, another problem is in ipv4_sk_update_pmtu()
> 
> This can be called on UDP sockets, but from softirq context.
> 
> We cannot use sk_dst_lock because this lock is not softirq safe.
> 
> I guess we should use xchg() for sk_dst_set() and sk_dst_reset()

This would be something like this untested patch :

 include/net/sock.h |   12 ++++++------
 net/ipv4/route.c   |   15 ++++++++-------
 2 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 21569cf456ed..04400978ceb5 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1766,9 +1766,11 @@ __sk_dst_set(struct sock *sk, struct dst_entry *dst)
 static inline void
 sk_dst_set(struct sock *sk, struct dst_entry *dst)
 {
-	spin_lock(&sk->sk_dst_lock);
-	__sk_dst_set(sk, dst);
-	spin_unlock(&sk->sk_dst_lock);
+	struct dst_entry *old_dst;
+
+	sk_tx_queue_clear(sk);
+	old_dst = xchg(&sk->sk_dst_cache, dst);
+	dst_release(old_dst);
 }
 
 static inline void
@@ -1780,9 +1782,7 @@ __sk_dst_reset(struct sock *sk)
 static inline void
 sk_dst_reset(struct sock *sk)
 {
-	spin_lock(&sk->sk_dst_lock);
-	__sk_dst_reset(sk);
-	spin_unlock(&sk->sk_dst_lock);
+	sk_dst_set(sk, NULL);
 }
 
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 5e676be3daeb..be9f2b1ac3ab 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1022,7 +1022,7 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 	const struct iphdr *iph = (const struct iphdr *) skb->data;
 	struct flowi4 fl4;
 	struct rtable *rt;
-	struct dst_entry *dst;
+	struct dst_entry *odst = NULL;
 	bool new = false;
 
 	bh_lock_sock(sk);
@@ -1030,16 +1030,17 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 	if (!ip_sk_accept_pmtu(sk))
 		goto out;
 
-	rt = (struct rtable *) __sk_dst_get(sk);
+	odst = sk_dst_get(sk);
 
-	if (sock_owned_by_user(sk) || !rt) {
+	if (sock_owned_by_user(sk) || !odst) {
 		__ipv4_sk_update_pmtu(skb, sk, mtu);
 		goto out;
 	}
 
 	__build_flow_key(&fl4, sk, iph, 0, 0, 0, 0, 0);
 
-	if (!__sk_dst_check(sk, 0)) {
+	rt = (struct rtable *)odst;
+	if (odst->obsolete && odst->ops->check(odst, 0) == NULL) {
 		rt = ip_route_output_flow(sock_net(sk), &fl4, sk);
 		if (IS_ERR(rt))
 			goto out;
@@ -1049,8 +1050,7 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 
 	__ip_rt_update_pmtu((struct rtable *) rt->dst.path, &fl4, mtu);
 
-	dst = dst_check(&rt->dst, 0);
-	if (!dst) {
+	if (!dst_check(&rt->dst, 0)) {
 		if (new)
 			dst_release(&rt->dst);
 
@@ -1062,10 +1062,11 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 	}
 
 	if (new)
-		__sk_dst_set(sk, &rt->dst);
+		sk_dst_set(sk, &rt->dst);
 
 out:
 	bh_unlock_sock(sk);
+	dst_release(odst);
 }
 EXPORT_SYMBOL_GPL(ipv4_sk_update_pmtu);
 

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-11  1:26         ` Eric Dumazet
@ 2014-06-11  4:16           ` dormando
  2014-06-11  5:54             ` Eric Dumazet
  2014-06-11 13:38             ` [PATCH] ipv4: fix a race in ip4_datagram_release_cb() Kostya Serebryany
  2014-06-29  0:25           ` dormando
  2014-06-30  8:26           ` [PATCH] ipv4: irq safe sk_dst_[re]set() and ipv4_sk_update_pmtu() fix Eric Dumazet
  2 siblings, 2 replies; 41+ messages in thread
From: dormando @ 2014-06-11  4:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Tue, 10 Jun 2014, Eric Dumazet wrote:

> On Tue, 2014-06-10 at 18:12 -0700, Eric Dumazet wrote:
>
> >
> > For the curious, another problem is in ipv4_sk_update_pmtu()
> >
> > This can be called on UDP sockets, but from softirq context.
> >
> > We cannot use sk_dst_lock because this lock is not softirq safe.
> >
> > I guess we should use xchg() for sk_dst_set() and sk_dst_reset()
>
> This would be something like this untested patch :
>
>  include/net/sock.h |   12 ++++++------
>  net/ipv4/route.c   |   15 ++++++++-------
>  2 files changed, 14 insertions(+), 13 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 21569cf456ed..04400978ceb5 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1766,9 +1766,11 @@ __sk_dst_set(struct sock *sk, struct dst_entry *dst)
>  static inline void
>  sk_dst_set(struct sock *sk, struct dst_entry *dst)
>  {
> -	spin_lock(&sk->sk_dst_lock);
> -	__sk_dst_set(sk, dst);
> -	spin_unlock(&sk->sk_dst_lock);
> +	struct dst_entry *old_dst;
> +
> +	sk_tx_queue_clear(sk);
> +	old_dst = xchg(&sk->sk_dst_cache, dst);
> +	dst_release(old_dst);
>  }
>
>  static inline void
> @@ -1780,9 +1782,7 @@ __sk_dst_reset(struct sock *sk)
>  static inline void
>  sk_dst_reset(struct sock *sk)
>  {
> -	spin_lock(&sk->sk_dst_lock);
> -	__sk_dst_reset(sk);
> -	spin_unlock(&sk->sk_dst_lock);
> +	sk_dst_set(sk, NULL);
>  }
>
>  struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie);
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 5e676be3daeb..be9f2b1ac3ab 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1022,7 +1022,7 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
>  	const struct iphdr *iph = (const struct iphdr *) skb->data;
>  	struct flowi4 fl4;
>  	struct rtable *rt;
> -	struct dst_entry *dst;
> +	struct dst_entry *odst = NULL;
>  	bool new = false;
>
>  	bh_lock_sock(sk);
> @@ -1030,16 +1030,17 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
>  	if (!ip_sk_accept_pmtu(sk))
>  		goto out;
>
> -	rt = (struct rtable *) __sk_dst_get(sk);
> +	odst = sk_dst_get(sk);
>
> -	if (sock_owned_by_user(sk) || !rt) {
> +	if (sock_owned_by_user(sk) || !odst) {
>  		__ipv4_sk_update_pmtu(skb, sk, mtu);
>  		goto out;
>  	}
>
>  	__build_flow_key(&fl4, sk, iph, 0, 0, 0, 0, 0);
>
> -	if (!__sk_dst_check(sk, 0)) {
> +	rt = (struct rtable *)odst;
> +	if (odst->obsolete && odst->ops->check(odst, 0) == NULL) {
>  		rt = ip_route_output_flow(sock_net(sk), &fl4, sk);
>  		if (IS_ERR(rt))
>  			goto out;
> @@ -1049,8 +1050,7 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
>
>  	__ip_rt_update_pmtu((struct rtable *) rt->dst.path, &fl4, mtu);
>
> -	dst = dst_check(&rt->dst, 0);
> -	if (!dst) {
> +	if (!dst_check(&rt->dst, 0)) {
>  		if (new)
>  			dst_release(&rt->dst);
>
> @@ -1062,10 +1062,11 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
>  	}
>
>  	if (new)
> -		__sk_dst_set(sk, &rt->dst);
> +		sk_dst_set(sk, &rt->dst);
>
>  out:
>  	bh_unlock_sock(sk);
> +	dst_release(odst);
>  }
>  EXPORT_SYMBOL_GPL(ipv4_sk_update_pmtu);
>
>
>
>

Ran our udpkill util against 3.10.42 with both of your patches applied...
seems like it ran a bit longer than normally would with this test (15-20
minutes), then died:

<4>[ 2572.249626] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge
coretemp crc32_pclmul ghash_clmulni_intel gpio_ich ipmi_watchdog
ipmi_devintf microcode sb_edac edac_core lpc_ich mfd_core ipmi_si
ipmi_msghandler igb i2c_algo_bit ixgbe ptp pps_core mdio
<4>[ 2572.249733] CPU: 15 PID: 65425 Comm: udpkill Not tainted
3.10.42 #1
<4>[ 2572.249748] Hardware name: Supermicro
X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
<4>[ 2572.249768] task: ffff885e71fe5c00 ti: ffff885e80bc4000 task.ti:
ffff885e80bc4000
<4>[ 2572.249783] RIP: 0010:[<ffffffff81602c95>]  [<ffffffff81602c95>]
ipv4_dst_destroy+0x45/0x80
<4>[ 2572.249809] RSP: 0018:ffff885e80bc5b58  EFLAGS: 00010292
<4>[ 2572.249820] RAX: dead000000200200 RBX: ffff882f2a5f26c0 RCX:
dead000000100100
<4>[ 2572.249835] RDX: dead000000100100 RSI: ffffffffffffffff RDI:
ffffffff81e95302
<4>[ 2572.249849] RBP: ffff885e80bc5b68 R08: 0000000000000000 R09:
ffff885e80bc5c78
<4>[ 2572.249863] R10: 0000000000000001 R11: 0000000000000293 R12:
0000000000000000
<4>[ 2572.249878] R13: ffffffff81c8c840 R14: ffff885e80bc5da0 R15:
ffffffff8160c9f0
<4>[ 2572.249893] FS:  00007f6a2d2af700(0000) GS:ffff88607fce0000(0000)
knlGS:0000000000000000
<4>[ 2572.249909] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 2572.249921] CR2: 000079686b6e0000 CR3: 0000005f65094000 CR4:
00000000000407e0
<4>[ 2572.249952] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
<4>[ 2572.249982] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
<4>[ 2572.250011] Stack:
<4>[ 2572.250032]  ffff882f2a5f26c0 ffff882f2a5f26c0 ffff885e80bc5b98
ffffffff815c12b2
<4>[ 2572.250083]  0000000000000000 ffff882f2a5f26c0 0000000000000000
0000000000000140
<4>[ 2572.250135]  ffff885e80bc5bb8 ffffffff815c15a6 0000000000000000
ffff885f6ec29f80
<4>[ 2572.250187] Call Trace:
<4>[ 2572.250216]  [<ffffffff815c12b2>] dst_destroy+0x32/0xd0
<4>[ 2572.250246]  [<ffffffff815c15a6>] dst_release+0x56/0x80
<4>[ 2572.250282]  [<ffffffff815a7272>] sk_dst_check+0x82/0x90
<4>[ 2572.250311]  [<ffffffff81634f72>] udp_sendmsg+0x622/0x8d0
<4>[ 2572.250343]  [<ffffffff8163ec95>] inet_sendmsg+0x45/0xb0
<4>[ 2572.250374]  [<ffffffff815a0dae>] sock_aio_write+0x11e/0x130
<4>[ 2572.250408]  [<ffffffff811563af>] do_sync_write+0x7f/0xb0
<4>[ 2572.250437]  [<ffffffff81156254>] ? fsnotify_modify+0x64/0x80
<4>[ 2572.250465]  [<ffffffff81156806>] vfs_write+0x166/0x1a0
<4>[ 2572.250493]  [<ffffffff81156cef>] SyS_write+0x5f/0xa0
<4>[ 2572.250526]  [<ffffffff816d2bc2>] system_call_fastpath+0x16/0x1b
<4>[ 2572.250553] Code: 87 b0 00 00 00 74 4f 48 c7 c7 02 53 e9 81 e8 a3 79
0c 00 48 8b 93 b0 00 00 00 48 8b 83 b8 00 00 00 48 b9 00 01 10 00 00 00 ad
de <48> 89 42 08 48 c7 c7 02 53 e9 81 48 89 10 48 ba 00 02 20 00 00
<1>[ 2572.250783] RIP  [<ffffffff81602c95>] ipv4_dst_destroy+0x45/0x80
<4>[ 2572.250813]  RSP <ffff885e80bc5b58>
<4>[ 2572.251127] ---[ end trace 62611fc9c3ed8cf0 ]---
<0>[ 2573.377286] Kernel panic - not syncing: Fatal exception in interrupt

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-11  4:16           ` dormando
@ 2014-06-11  5:54             ` Eric Dumazet
  2014-06-11  7:20               ` dormando
  2014-06-11 13:38             ` [PATCH] ipv4: fix a race in ip4_datagram_release_cb() Kostya Serebryany
  1 sibling, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2014-06-11  5:54 UTC (permalink / raw)
  To: dormando
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Tue, 2014-06-10 at 21:16 -0700, dormando wrote:

> Ran our udpkill util against 3.10.42 with both of your patches applied...
> seems like it ran a bit longer than normally would with this test (15-20
> minutes), then died:

Well, could you try a recent kernel instead ?

I can see some races and fixes are probably worth it.

$ git log --oneline v3.10.42..v3.15 net/ipv4/route.c
fbdc0ad ipv4: initialise the itag variable in __mkroute_input
0d5edc6 ipv4, route: pass 0 instead of LOOPBACK_IFINDEX to fib_validate_source()
aad8872 ipv4: add a sock pointer to dst->output() path.
9114615 ipv4: return valid RTA_IIF on ip route get
3ed66e9 net: replace __this_cpu_inc in route.c with raw_cpu_inc
0b8c7f6 ipv4: remove ip_rt_dump from route.c
4a4eb21 ipv4: remove ipv4_ifdown_dst from route.c
1e8d642 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
a625486 ipv4: fix counter in_slow_tot
cd0f0b9 ipv4: distinguish EHOSTUNREACH from the ENETUNREACH
2045cea net: remove unnecessary return's
f87c10a ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing
dcdfdf5 ipv4: fix race in concurrent ip_route_input_slow()
482fc60 ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE
0baf2b3 ipv4: shrink rt_cache_stat
0a7e226 ipv4: fix ineffective source address selection
734d272 ipv4: raise IP_MAX_MTU to theoretical limit
ca4c3fc net: split rt_genid for ipv4 and ipv6
2ffae99 ipv4: use next hop exceptions also for input routes
fe2c633 net: Convert uses of typedef ctl_table to struct ctl_table
6bc19fb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
5aad1de ipv4: use separate genid for next hop exceptions
f016229 ipv4: rate limit updating of next hop exceptions with same pmtu
387aa65 ipv4: properly refresh rtable entries on pmtu/redirect events

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-11  5:54             ` Eric Dumazet
@ 2014-06-11  7:20               ` dormando
  2014-06-11  7:26                 ` dormando
  0 siblings, 1 reply; 41+ messages in thread
From: dormando @ 2014-06-11  7:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Tue, 10 Jun 2014, Eric Dumazet wrote:

> On Tue, 2014-06-10 at 21:16 -0700, dormando wrote:
>
> > Ran our udpkill util against 3.10.42 with both of your patches applied...
> > seems like it ran a bit longer than normally would with this test (15-20
> > minutes), then died:
>
> Well, could you try a recent kernel instead ?
>
> I can see some races and fixes are probably worth it.
>
> $ git log --oneline v3.10.42..v3.15 net/ipv4/route.c
> fbdc0ad ipv4: initialise the itag variable in __mkroute_input
> 0d5edc6 ipv4, route: pass 0 instead of LOOPBACK_IFINDEX to fib_validate_source()
> aad8872 ipv4: add a sock pointer to dst->output() path.
> 9114615 ipv4: return valid RTA_IIF on ip route get
> 3ed66e9 net: replace __this_cpu_inc in route.c with raw_cpu_inc
> 0b8c7f6 ipv4: remove ip_rt_dump from route.c
> 4a4eb21 ipv4: remove ipv4_ifdown_dst from route.c
> 1e8d642 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> a625486 ipv4: fix counter in_slow_tot
> cd0f0b9 ipv4: distinguish EHOSTUNREACH from the ENETUNREACH
> 2045cea net: remove unnecessary return's
> f87c10a ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing
> dcdfdf5 ipv4: fix race in concurrent ip_route_input_slow()
> 482fc60 ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE
> 0baf2b3 ipv4: shrink rt_cache_stat
> 0a7e226 ipv4: fix ineffective source address selection
> 734d272 ipv4: raise IP_MAX_MTU to theoretical limit
> ca4c3fc net: split rt_genid for ipv4 and ipv6
> 2ffae99 ipv4: use next hop exceptions also for input routes
> fe2c633 net: Convert uses of typedef ctl_table to struct ctl_table
> 6bc19fb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> 5aad1de ipv4: use separate genid for next hop exceptions
> f016229 ipv4: rate limit updating of next hop exceptions with same pmtu
> 387aa65 ipv4: properly refresh rtable entries on pmtu/redirect events
>
>

Newest I can realistically roll would be 3.14.6, so I just tried
that... Without your two patches, it still dies from the UDP bug.

Unfortunately 3.14 has a few regressions.. one is some bad CPU usage i'll
have to track down, and two something about pstore is broken, so I can't
get the trace from the crash. It's compressing now and has more of the
kernel log, but it's missing the actual panic part.

$ git log --oneline v3.14..v3.15 net/ipv4/route.c
fbdc0ad ipv4: initialise the itag variable in __mkroute_input
0d5edc6 ipv4, route: pass 0 instead of LOOPBACK_IFINDEX to fib_validate_source()
aad8872 ipv4: add a sock pointer to dst->output() path.
9114615 ipv4: return valid RTA_IIF on ip route get
3ed66e9 net: replace __this_cpu_inc in route.c with raw_cpu_inc
0b8c7f6 ipv4: remove ip_rt_dump from route.c
4a4eb21 ipv4: remove ipv4_ifdown_dst from route.c
1e8d642 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
2045cea net: remove unnecessary return's

No more obvious race fixes. I can try 3.15 fully vanilla but I'm having
doubts?

We have a few patches on top of this, but none of them are active at the
time of my test. I've tried removing them in the past and it did nothing
as well.

Sorry :(

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-11  7:20               ` dormando
@ 2014-06-11  7:26                 ` dormando
  2014-06-11  7:38                   ` dormando
  0 siblings, 1 reply; 41+ messages in thread
From: dormando @ 2014-06-11  7:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Wed, 11 Jun 2014, dormando wrote:

> On Tue, 10 Jun 2014, Eric Dumazet wrote:
>
> > On Tue, 2014-06-10 at 21:16 -0700, dormando wrote:
> >
> > > Ran our udpkill util against 3.10.42 with both of your patches applied...
> > > seems like it ran a bit longer than normally would with this test (15-20
> > > minutes), then died:
> >
> > Well, could you try a recent kernel instead ?
> >
> > I can see some races and fixes are probably worth it.
> >
> > $ git log --oneline v3.10.42..v3.15 net/ipv4/route.c
> > fbdc0ad ipv4: initialise the itag variable in __mkroute_input
> > 0d5edc6 ipv4, route: pass 0 instead of LOOPBACK_IFINDEX to fib_validate_source()
> > aad8872 ipv4: add a sock pointer to dst->output() path.
> > 9114615 ipv4: return valid RTA_IIF on ip route get
> > 3ed66e9 net: replace __this_cpu_inc in route.c with raw_cpu_inc
> > 0b8c7f6 ipv4: remove ip_rt_dump from route.c
> > 4a4eb21 ipv4: remove ipv4_ifdown_dst from route.c
> > 1e8d642 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> > a625486 ipv4: fix counter in_slow_tot
> > cd0f0b9 ipv4: distinguish EHOSTUNREACH from the ENETUNREACH
> > 2045cea net: remove unnecessary return's
> > f87c10a ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing
> > dcdfdf5 ipv4: fix race in concurrent ip_route_input_slow()
> > 482fc60 ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE
> > 0baf2b3 ipv4: shrink rt_cache_stat
> > 0a7e226 ipv4: fix ineffective source address selection
> > 734d272 ipv4: raise IP_MAX_MTU to theoretical limit
> > ca4c3fc net: split rt_genid for ipv4 and ipv6
> > 2ffae99 ipv4: use next hop exceptions also for input routes
> > fe2c633 net: Convert uses of typedef ctl_table to struct ctl_table
> > 6bc19fb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> > 5aad1de ipv4: use separate genid for next hop exceptions
> > f016229 ipv4: rate limit updating of next hop exceptions with same pmtu
> > 387aa65 ipv4: properly refresh rtable entries on pmtu/redirect events
> >
> >
>
> Newest I can realistically roll would be 3.14.6, so I just tried
> that... Without your two patches, it still dies from the UDP bug.

--> Meant to say here that both *with* and *without* your two new patches
it still crashes.

> Unfortunately 3.14 has a few regressions.. one is some bad CPU usage i'll
> have to track down, and two something about pstore is broken, so I can't
> get the trace from the crash. It's compressing now and has more of the
> kernel log, but it's missing the actual panic part.
>
> $ git log --oneline v3.14..v3.15 net/ipv4/route.c
> fbdc0ad ipv4: initialise the itag variable in __mkroute_input
> 0d5edc6 ipv4, route: pass 0 instead of LOOPBACK_IFINDEX to fib_validate_source()
> aad8872 ipv4: add a sock pointer to dst->output() path.
> 9114615 ipv4: return valid RTA_IIF on ip route get
> 3ed66e9 net: replace __this_cpu_inc in route.c with raw_cpu_inc
> 0b8c7f6 ipv4: remove ip_rt_dump from route.c
> 4a4eb21 ipv4: remove ipv4_ifdown_dst from route.c
> 1e8d642 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> 2045cea net: remove unnecessary return's
>
> No more obvious race fixes. I can try 3.15 fully vanilla but I'm having
> doubts?
>
> We have a few patches on top of this, but none of them are active at the
> time of my test. I've tried removing them in the past and it did nothing
> as well.
>
> Sorry :(
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-11  7:26                 ` dormando
@ 2014-06-11  7:38                   ` dormando
  2014-06-11 12:41                     ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: dormando @ 2014-06-11  7:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Wed, 11 Jun 2014, dormando wrote:

> On Wed, 11 Jun 2014, dormando wrote:
>
> > On Tue, 10 Jun 2014, Eric Dumazet wrote:
> >
> > > On Tue, 2014-06-10 at 21:16 -0700, dormando wrote:
> > >
> > > > Ran our udpkill util against 3.10.42 with both of your patches applied...
> > > > seems like it ran a bit longer than normally would with this test (15-20
> > > > minutes), then died:
> > >
> > > Well, could you try a recent kernel instead ?
> > >
> > > I can see some races and fixes are probably worth it.
> > >
> > > $ git log --oneline v3.10.42..v3.15 net/ipv4/route.c
> > > fbdc0ad ipv4: initialise the itag variable in __mkroute_input
> > > 0d5edc6 ipv4, route: pass 0 instead of LOOPBACK_IFINDEX to fib_validate_source()
> > > aad8872 ipv4: add a sock pointer to dst->output() path.
> > > 9114615 ipv4: return valid RTA_IIF on ip route get
> > > 3ed66e9 net: replace __this_cpu_inc in route.c with raw_cpu_inc
> > > 0b8c7f6 ipv4: remove ip_rt_dump from route.c
> > > 4a4eb21 ipv4: remove ipv4_ifdown_dst from route.c
> > > 1e8d642 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> > > a625486 ipv4: fix counter in_slow_tot
> > > cd0f0b9 ipv4: distinguish EHOSTUNREACH from the ENETUNREACH
> > > 2045cea net: remove unnecessary return's
> > > f87c10a ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing
> > > dcdfdf5 ipv4: fix race in concurrent ip_route_input_slow()
> > > 482fc60 ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE
> > > 0baf2b3 ipv4: shrink rt_cache_stat
> > > 0a7e226 ipv4: fix ineffective source address selection
> > > 734d272 ipv4: raise IP_MAX_MTU to theoretical limit
> > > ca4c3fc net: split rt_genid for ipv4 and ipv6
> > > 2ffae99 ipv4: use next hop exceptions also for input routes
> > > fe2c633 net: Convert uses of typedef ctl_table to struct ctl_table
> > > 6bc19fb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> > > 5aad1de ipv4: use separate genid for next hop exceptions
> > > f016229 ipv4: rate limit updating of next hop exceptions with same pmtu
> > > 387aa65 ipv4: properly refresh rtable entries on pmtu/redirect events
> > >
> > >
> >
> > Newest I can realistically roll would be 3.14.6, so I just tried
> > that... Without your two patches, it still dies from the UDP bug.
>
> --> Meant to say here that both *with* and *without* your two new patches
> it still crashes.
>
> > Unfortunately 3.14 has a few regressions.. one is some bad CPU usage i'll
> > have to track down, and two something about pstore is broken, so I can't
> > get the trace from the crash. It's compressing now and has more of the
> > kernel log, but it's missing the actual panic part.
> >
> > $ git log --oneline v3.14..v3.15 net/ipv4/route.c
> > fbdc0ad ipv4: initialise the itag variable in __mkroute_input
> > 0d5edc6 ipv4, route: pass 0 instead of LOOPBACK_IFINDEX to fib_validate_source()
> > aad8872 ipv4: add a sock pointer to dst->output() path.
> > 9114615 ipv4: return valid RTA_IIF on ip route get
> > 3ed66e9 net: replace __this_cpu_inc in route.c with raw_cpu_inc
> > 0b8c7f6 ipv4: remove ip_rt_dump from route.c
> > 4a4eb21 ipv4: remove ipv4_ifdown_dst from route.c
> > 1e8d642 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> > 2045cea net: remove unnecessary return's
> >
> > No more obvious race fixes. I can try 3.15 fully vanilla but I'm having
> > doubts?
> >
> > We have a few patches on top of this, but none of them are active at the
> > time of my test. I've tried removing them in the past and it did nothing
> > as well.
> >
> > Sorry :(
> >

Spamming now! The pstore'd dmesg looked suspiciously like the boot before
I booted the crashed kernel.. checked pstore again and the crash is there
after a second reboot (wtf.. will test tomorrow).

<4>[  203.161414] general protection fault: 0000 [#1] SMP
<4>[  203.161531] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge
gpio_ich ipmi_watchdog ipmi_devintf x86_pkg_temp_thermal coretemp
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel microcode sb_edac
edac_core igb ixgbe i2c_algo_bit lpc_ich mfd_core ptp pps_core mdio
tpm_tis tpm ipmi_si ipmi_msghandler
<4>[  203.162626] CPU: 3 PID: 28456 Comm: udpkill Not tainted
3.14.6 #1
<4>[  203.162674] Hardware name: Supermicro
X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
<4>[  203.162726] task: ffff885e5f080000 ti: ffff885e5406c000 task.ti:
ffff885e5406c000
<4>[  203.162777] RIP: 0010:[<ffffffff816608c5>]  [<ffffffff816608c5>]
ipv4_dst_destroy+0x45/0x80
<4>[  203.162867] RSP: 0018:ffff885e5406dbd8  EFLAGS: 00010246
<4>[  203.162912] RAX: dead000000200200 RBX: ffff885e4ee03440 RCX:
dead000000100100
<4>[  203.162959] RDX: dead000000100100 RSI: 0000000000000200 RDI:
ffffffff81ead102
<4>[  203.163007] RBP: ffff885e5406dbe8 R08: 0000000000000000 R09:
ffff885e5406dd38
<4>[  203.163054] R10: 0000000000000001 R11: 0000000000000001 R12:
0000000000000000
<4>[  203.163102] R13: 0000000000000140 R14: ffff885e5406de10 R15:
ffffffff8166a9b0
<4>[  203.163150] FS:  00007f24d1af9700(0000) GS:ffff882fbfc60000(0000)
knlGS:0000000000000000
<4>[  203.163200] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  203.163246] CR2: 00007f7650df30f8 CR3: 0000005e46fdd000 CR4:
00000000000407e0
<4>[  203.163307] Stack:
<4>[  203.163360]  0000000000000000 ffff885e4ee03440 ffff885e5406dc08
ffffffff8161c47a
<4>[  203.163574]  ffff885e4ee03440 0000000000000000 ffff885e5406dc28
ffffffff8161c786
<4>[  203.163786]  0000000000000000 ffff885f70f51f80 ffff885e5406dc48
ffffffff815ffa92
<4>[  203.163999] Call Trace:
<4>[  203.164058]  [<ffffffff8161c47a>] dst_destroy+0x2a/0xe0
<4>[  203.164118]  [<ffffffff8161c786>] dst_release+0x56/0x80
<4>[  203.164183]  [<ffffffff815ffa92>] sk_dst_check+0x82/0x90
<4>[  203.164247]  [<ffffffff81692b35>] udp_sendmsg+0x585/0x830
<4>[  203.164314]  [<ffffffff8169dbe5>] inet_sendmsg+0x45/0xb0
<4>[  203.164375]  [<ffffffff815f9248>] sock_aio_write+0xc8/0xd0
<4>[  203.164439]  [<ffffffff8118073f>] do_sync_write+0x5f/0x90
<4>[  203.164499]  [<ffffffff81182681>] vfs_write+0x1d1/0x1e0
<4>[  203.164559]  [<ffffffff8118277a>] SyS_write+0x5a/0xd0
<4>[  203.164622]  [<ffffffff8172b9d2>] system_call_fastpath+0x16/0x1b
<4>[  203.164681] Code: 87 b0 00 00 00 74 4f 48 c7 c7 02 d1 ea 81 e8 a3 25
0c 00 48 8b 93 b0 00 00 00 48 8b 83 b8 00 00 00 48 b9 00 01 10 00 00 00 ad
de <48> 89 42 08 48 c7 c7 02 d1 ea 81 48 89 10 48 ba 00 02 20 00 00
<1>[  203.167034] RIP  [<ffffffff816608c5>] ipv4_dst_destroy+0x45/0x80
<4>[  203.167129]  RSP <ffff885e5406dbd8>
<4>[  203.167193] ---[ end trace 0201f2e2310d79bd ]---
<0>[  204.422742] Kernel panic - not syncing: Fatal exception in interrupt
<0>[  204.427379] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffff9fffffff)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-11  7:38                   ` dormando
@ 2014-06-11 12:41                     ` Eric Dumazet
  2014-06-11 13:12                       ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2014-06-11 12:41 UTC (permalink / raw)
  To: dormando
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Wed, 2014-06-11 at 00:38 -0700, dormando wrote:
> On Wed, 11 Jun 2014, dormando wrote:
> >
> > --> Meant to say here that both *with* and *without* your two new patches
> > it still crashes.
> >
> > > Unfortunately 3.14 has a few regressions.. one is some bad CPU usage i'll
> > > have to track down, and two something about pstore is broken, so I can't
> > > get the trace from the crash. It's compressing now and has more of the
> > > kernel log, but it's missing the actual panic part.
> > >
> > > $ git log --oneline v3.14..v3.15 net/ipv4/route.c
> > > fbdc0ad ipv4: initialise the itag variable in __mkroute_input
> > > 0d5edc6 ipv4, route: pass 0 instead of LOOPBACK_IFINDEX to fib_validate_source()
> > > aad8872 ipv4: add a sock pointer to dst->output() path.
> > > 9114615 ipv4: return valid RTA_IIF on ip route get
> > > 3ed66e9 net: replace __this_cpu_inc in route.c with raw_cpu_inc
> > > 0b8c7f6 ipv4: remove ip_rt_dump from route.c
> > > 4a4eb21 ipv4: remove ipv4_ifdown_dst from route.c
> > > 1e8d642 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> > > 2045cea net: remove unnecessary return's
> > >
> > > No more obvious race fixes. I can try 3.15 fully vanilla but I'm having
> > > doubts?
> > >
> > > We have a few patches on top of this, but none of them are active at the
> > > time of my test. I've tried removing them in the past and it did nothing
> > > as well.
> > >
> > > Sorry :(
> > >
> 
> Spamming now! The pstore'd dmesg looked suspiciously like the boot before
> I booted the crashed kernel.. checked pstore again and the crash is there
> after a second reboot (wtf.. will test tomorrow).


OK then we probably have another bug in UDP, which is that we call
sk_dst_set(sk, dst_clone(&rt->dst)); with a dst having DST_NOCACHE set

Its a problem, because sk_dst_get() cannot deal safely with such dst.

Care to share your program triggering the bug ?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-11 12:41                     ` Eric Dumazet
@ 2014-06-11 13:12                       ` Eric Dumazet
  2014-06-12  1:55                         ` dormando
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2014-06-11 13:12 UTC (permalink / raw)
  To: dormando
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Wed, 2014-06-11 at 05:41 -0700, Eric Dumazet wrote:

> 
> OK then we probably have another bug in UDP, which is that we call
> sk_dst_set(sk, dst_clone(&rt->dst)); with a dst having DST_NOCACHE set
> 
> Its a problem, because sk_dst_get() cannot deal safely with such dst.

You could try this in top of other patches.

diff --git a/include/net/sock.h b/include/net/sock.h
index 21569cf456ed..427ac7cc50fc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1728,8 +1728,8 @@ sk_dst_get(struct sock *sk)
 
 	rcu_read_lock();
 	dst = rcu_dereference(sk->sk_dst_cache);
-	if (dst)
-		dst_hold(dst);
+	if (dst && !atomic_inc_not_zero(&dst->__refcnt))
+		dst = NULL;
 	rcu_read_unlock();
 	return dst;
 }

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-11  4:16           ` dormando
  2014-06-11  5:54             ` Eric Dumazet
@ 2014-06-11 13:38             ` Kostya Serebryany
  1 sibling, 0 replies; 41+ messages in thread
From: Kostya Serebryany @ 2014-06-11 13:38 UTC (permalink / raw)
  To: dormando
  Cc: Eric Dumazet, Alexey Preobrazhensky, Steffen Klassert,
	David Miller, paulmck, netdev, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

> Ran our udpkill util against 3.10.42 with both of your patches applied...

[plain text]
Could you please tell us more about udpkill?
Will it makes sense to run address-sanitized-kernel with udpkill?

--kcc

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-10 13:43 ` [PATCH] ipv4: fix a race in ip4_datagram_release_cb() Eric Dumazet
  2014-06-11  0:32   ` dormando
@ 2014-06-11 22:39   ` David Miller
  1 sibling, 0 replies; 41+ messages in thread
From: David Miller @ 2014-06-11 22:39 UTC (permalink / raw)
  To: eric.dumazet
  Cc: preobr, dormando, steffen.klassert, paulmck, netdev, kcc,
	dvyukov, larsbull, edumazet, brutus, maze, alexei.starovoitov

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 10 Jun 2014 06:43:01 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> Alexey gave a AddressSanitizer[1] report that finally gave a good hint
> at where was the origin of various problems already reported by Dormando
> in the past [2]
> 
> Problem comes from the fact that UDP can have a lockless TX path, and
> concurrent threads can manipulate sk_dst_cache, while another thread,
> is holding socket lock and calls __sk_dst_set() in
> ip4_datagram_release_cb() (this was added in linux-3.8)
> 
> It seems that all we need to do is to use sk_dst_check() and
> sk_dst_set() so that all the writers hold same spinlock
> (sk->sk_dst_lock) to prevent corruptions.
> 
> TCP stack do not need this protection, as all sk_dst_cache writers hold
> the socket lock.
> 
> [1]
> https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
 ...
> Reported-by: Alexey Preobrazhensky <preobr@google.com>
> Reported-by: dormando <dormando@rydia.ne>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Fixes: 8141ed9fcedb2 ("ipv4: Add a socket release callback for datagram sockets")

Applied and queued up for -stable.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-11 13:12                       ` Eric Dumazet
@ 2014-06-12  1:55                         ` dormando
  2014-06-12  3:43                           ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: dormando @ 2014-06-12  1:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Wed, 11 Jun 2014, Eric Dumazet wrote:

> On Wed, 2014-06-11 at 05:41 -0700, Eric Dumazet wrote:
>
> >
> > OK then we probably have another bug in UDP, which is that we call
> > sk_dst_set(sk, dst_clone(&rt->dst)); with a dst having DST_NOCACHE set
> >
> > Its a problem, because sk_dst_get() cannot deal safely with such dst.
>
> You could try this in top of other patches.
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 21569cf456ed..427ac7cc50fc 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1728,8 +1728,8 @@ sk_dst_get(struct sock *sk)
>
>  	rcu_read_lock();
>  	dst = rcu_dereference(sk->sk_dst_cache);
> -	if (dst)
> -		dst_hold(dst);
> +	if (dst && !atomic_inc_not_zero(&dst->__refcnt))
> +		dst = NULL;
>  	rcu_read_unlock();
>  	return dst;
>  }
>
>
>

I sent the udpkill utility in an off-list mail (in case that got binned by
anyone).

Just threw this patch on top of the other two, on 3.10.42. udpkill's been
running for an hour without fault. I've just put traffic back onto the
machine am leaving udpkill enabled for a while longer.

So, this is an improvement :)

I have exactly one machine which (for whatever lucky reason) is really
prone to hitting this problem without needing udpkill. It'll take a few
days to get it going there though. I've not been able to reproduce the
crash from other angles yet.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-12  1:55                         ` dormando
@ 2014-06-12  3:43                           ` Eric Dumazet
  2014-06-12  4:05                             ` dormando
  2014-06-22 19:07                             ` dormando
  0 siblings, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2014-06-12  3:43 UTC (permalink / raw)
  To: dormando
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Wed, 2014-06-11 at 18:55 -0700, dormando wrote:

> I sent the udpkill utility in an off-list mail (in case that got binned by
> anyone).
> 
> Just threw this patch on top of the other two, on 3.10.42. udpkill's been
> running for an hour without fault. I've just put traffic back onto the
> machine am leaving udpkill enabled for a while longer.
> 
> So, this is an improvement :)

Nice. I suspect regression came with 3.6 ip route cache removal, but I
am lazy to point the exact commit.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-12  3:43                           ` Eric Dumazet
@ 2014-06-12  4:05                             ` dormando
  2014-06-22 19:07                             ` dormando
  1 sibling, 0 replies; 41+ messages in thread
From: dormando @ 2014-06-12  4:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Wed, 11 Jun 2014, Eric Dumazet wrote:

> On Wed, 2014-06-11 at 18:55 -0700, dormando wrote:
>
> > I sent the udpkill utility in an off-list mail (in case that got binned by
> > anyone).
> >
> > Just threw this patch on top of the other two, on 3.10.42. udpkill's been
> > running for an hour without fault. I've just put traffic back onto the
> > machine am leaving udpkill enabled for a while longer.
> >
> > So, this is an improvement :)
>
> Nice. I suspect regression came with 3.6 ip route cache removal, but I
> am lazy to point the exact commit.
>

Think I owe you a keg at this point. I'll let this bake and we'll see if
the other crashes come back... it'll take a week or two.

I'll have to circle back around to figure out what's gone wrong with 3.14.

Please let me know if there's anything else to test.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-12  3:43                           ` Eric Dumazet
  2014-06-12  4:05                             ` dormando
@ 2014-06-22 19:07                             ` dormando
  2014-06-23  8:33                               ` Eric Dumazet
  1 sibling, 1 reply; 41+ messages in thread
From: dormando @ 2014-06-22 19:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Wed, 11 Jun 2014, Eric Dumazet wrote:

> On Wed, 2014-06-11 at 18:55 -0700, dormando wrote:
>
> > I sent the udpkill utility in an off-list mail (in case that got binned by
> > anyone).
> >
> > Just threw this patch on top of the other two, on 3.10.42. udpkill's been
> > running for an hour without fault. I've just put traffic back onto the
> > machine am leaving udpkill enabled for a while longer.
> >
> > So, this is an improvement :)
>
> Nice. I suspect regression came with 3.6 ip route cache removal, but I
> am lazy to point the exact commit.
>
>

Update on testing:

I only have two machines that crash on their own frequently (more like
one, even). Unfortunately something happened to the datacenter it's in and
it was offline for a week. The machine normally crashes after 1.5-4d,
averaging 2d.

It's done about three days total time without a new crash. I also have the
kernel running in another datacenter for ~10 days.. but it takes 30-150
days to crash in that one.

So, inconclusive, but still promising. If the machine survives the week it
probably means it's fixed, or at least greatly reduced.

I saw that one of your patches got queued for stable, but all three were
necessary to fix udpkill. What's your plan for cleanup/upstreaming?

Did you folks end up running udpkill under the tester thing?

thanks,
-Dormando

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-22 19:07                             ` dormando
@ 2014-06-23  8:33                               ` Eric Dumazet
  2014-06-23  8:55                                 ` dormando
  2014-06-24 17:05                                 ` [PATCH net] ipv4: fix dst race in sk_dst_get() Eric Dumazet
  0 siblings, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2014-06-23  8:33 UTC (permalink / raw)
  To: dormando
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Sun, 2014-06-22 at 12:07 -0700, dormando wrote:

> Update on testing:
> 
> I only have two machines that crash on their own frequently (more like
> one, even). Unfortunately something happened to the datacenter it's in and
> it was offline for a week. The machine normally crashes after 1.5-4d,
> averaging 2d.
> 
> It's done about three days total time without a new crash. I also have the
> kernel running in another datacenter for ~10 days.. but it takes 30-150
> days to crash in that one.
> 
> So, inconclusive, but still promising. If the machine survives the week it
> probably means it's fixed, or at least greatly reduced.
> 
> I saw that one of your patches got queued for stable, but all three were
> necessary to fix udpkill. What's your plan for cleanup/upstreaming?
> 
> Did you folks end up running udpkill under the tester thing?

I did not test udpkill, as the known problem is the DST_NOCACHE flag.

We end up calling sk_dst_set(sk, dst) with a dst having this flag set.

So maybe DST_NOCACHE should be renamed, if we _can _ cache a dst like
this. Its meaning is really that dst_release() has to track when
refcount reaches 0 so that last owner fress dst, but we need to respect
rcu grace period.

Fixing sk_dst_set() as I did is not enough, as it is only reducing race
window.

Something like following :

 include/net/sock.h   |    4 ++--
 net/core/dst.c       |   16 +++++++++++-----
 net/ipv4/ip_tunnel.c |    7 +------
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 07b7fcd60d80..173cae485de1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1730,8 +1730,8 @@ sk_dst_get(struct sock *sk)
 
 	rcu_read_lock();
 	dst = rcu_dereference(sk->sk_dst_cache);
-	if (dst)
-		dst_hold(dst);
+	if (dst && !atomic_inc_not_zero(&dst->__refcnt))
+		dst = NULL;
 	rcu_read_unlock();
 	return dst;
 }
diff --git a/net/core/dst.c b/net/core/dst.c
index 80d6286c8b62..a028409ee438 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -269,6 +269,15 @@ again:
 }
 EXPORT_SYMBOL(dst_destroy);
 
+static void dst_destroy_rcu(struct rcu_head *head)
+{
+	struct dst_entry *dst = container_of(head, struct dst_entry, rcu_head);
+
+	dst = dst_destroy(dst);
+	if (dst)
+		__dst_free(dst);
+}
+
 void dst_release(struct dst_entry *dst)
 {
 	if (dst) {
@@ -276,11 +285,8 @@ void dst_release(struct dst_entry *dst)
 
 		newrefcnt = atomic_dec_return(&dst->__refcnt);
 		WARN_ON(newrefcnt < 0);
-		if (unlikely(dst->flags & DST_NOCACHE) && !newrefcnt) {
-			dst = dst_destroy(dst);
-			if (dst)
-				__dst_free(dst);
-		}
+		if (unlikely(dst->flags & DST_NOCACHE) && !newrefcnt)
+			call_rcu(&dst->rcu_head, dst_destroy_rcu);
 	}
 }
 EXPORT_SYMBOL(dst_release);
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 097b3e7c1e8f..cc3b7fd34555 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -73,12 +73,7 @@ static void __tunnel_dst_set(struct ip_tunnel_dst *idst,
 {
 	struct dst_entry *old_dst;
 
-	if (dst) {
-		if (dst->flags & DST_NOCACHE)
-			dst = NULL;
-		else
-			dst_clone(dst);
-	}
+	dst_clone(dst);
 	old_dst = xchg((__force struct dst_entry **)&idst->dst, dst);
 	dst_release(old_dst);
 }

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-23  8:33                               ` Eric Dumazet
@ 2014-06-23  8:55                                 ` dormando
  2014-06-23 16:57                                   ` Dmitry Vyukov
  2014-06-24 17:05                                 ` [PATCH net] ipv4: fix dst race in sk_dst_get() Eric Dumazet
  1 sibling, 1 reply; 41+ messages in thread
From: dormando @ 2014-06-23  8:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Mon, 23 Jun 2014, Eric Dumazet wrote:

> On Sun, 2014-06-22 at 12:07 -0700, dormando wrote:
>
> > Update on testing:
> >
> > I only have two machines that crash on their own frequently (more like
> > one, even). Unfortunately something happened to the datacenter it's in and
> > it was offline for a week. The machine normally crashes after 1.5-4d,
> > averaging 2d.
> >
> > It's done about three days total time without a new crash. I also have the
> > kernel running in another datacenter for ~10 days.. but it takes 30-150
> > days to crash in that one.
> >
> > So, inconclusive, but still promising. If the machine survives the week it
> > probably means it's fixed, or at least greatly reduced.
> >
> > I saw that one of your patches got queued for stable, but all three were
> > necessary to fix udpkill. What's your plan for cleanup/upstreaming?
> >
> > Did you folks end up running udpkill under the tester thing?
>
> I did not test udpkill, as the known problem is the DST_NOCACHE flag.
>
> We end up calling sk_dst_set(sk, dst) with a dst having this flag set.
>
> So maybe DST_NOCACHE should be renamed, if we _can _ cache a dst like
> this. Its meaning is really that dst_release() has to track when
> refcount reaches 0 so that last owner fress dst, but we need to respect
> rcu grace period.
>
> Fixing sk_dst_set() as I did is not enough, as it is only reducing race
> window.

Hrm. I'll have to spend more time trying to understand how to test this
(beyond just putting a kernel into production and seeing if it crashes).
Outside of udpkill it's been slightly hard to reproduce, and udpkill ran
for over five hours with your previous patches.

Do you or other folks have any methods for testing this?

thanks!

> Something like following :
>
>  include/net/sock.h   |    4 ++--
>  net/core/dst.c       |   16 +++++++++++-----
>  net/ipv4/ip_tunnel.c |    7 +------
>  3 files changed, 14 insertions(+), 13 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 07b7fcd60d80..173cae485de1 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1730,8 +1730,8 @@ sk_dst_get(struct sock *sk)
>
>  	rcu_read_lock();
>  	dst = rcu_dereference(sk->sk_dst_cache);
> -	if (dst)
> -		dst_hold(dst);
> +	if (dst && !atomic_inc_not_zero(&dst->__refcnt))
> +		dst = NULL;
>  	rcu_read_unlock();
>  	return dst;
>  }
> diff --git a/net/core/dst.c b/net/core/dst.c
> index 80d6286c8b62..a028409ee438 100644
> --- a/net/core/dst.c
> +++ b/net/core/dst.c
> @@ -269,6 +269,15 @@ again:
>  }
>  EXPORT_SYMBOL(dst_destroy);
>
> +static void dst_destroy_rcu(struct rcu_head *head)
> +{
> +	struct dst_entry *dst = container_of(head, struct dst_entry, rcu_head);
> +
> +	dst = dst_destroy(dst);
> +	if (dst)
> +		__dst_free(dst);
> +}
> +
>  void dst_release(struct dst_entry *dst)
>  {
>  	if (dst) {
> @@ -276,11 +285,8 @@ void dst_release(struct dst_entry *dst)
>
>  		newrefcnt = atomic_dec_return(&dst->__refcnt);
>  		WARN_ON(newrefcnt < 0);
> -		if (unlikely(dst->flags & DST_NOCACHE) && !newrefcnt) {
> -			dst = dst_destroy(dst);
> -			if (dst)
> -				__dst_free(dst);
> -		}
> +		if (unlikely(dst->flags & DST_NOCACHE) && !newrefcnt)
> +			call_rcu(&dst->rcu_head, dst_destroy_rcu);
>  	}
>  }
>  EXPORT_SYMBOL(dst_release);
> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
> index 097b3e7c1e8f..cc3b7fd34555 100644
> --- a/net/ipv4/ip_tunnel.c
> +++ b/net/ipv4/ip_tunnel.c
> @@ -73,12 +73,7 @@ static void __tunnel_dst_set(struct ip_tunnel_dst *idst,
>  {
>  	struct dst_entry *old_dst;
>
> -	if (dst) {
> -		if (dst->flags & DST_NOCACHE)
> -			dst = NULL;
> -		else
> -			dst_clone(dst);
> -	}
> +	dst_clone(dst);
>  	old_dst = xchg((__force struct dst_entry **)&idst->dst, dst);
>  	dst_release(old_dst);
>  }
>
>
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-23  8:55                                 ` dormando
@ 2014-06-23 16:57                                   ` Dmitry Vyukov
  0 siblings, 0 replies; 41+ messages in thread
From: Dmitry Vyukov @ 2014-06-23 16:57 UTC (permalink / raw)
  To: dormando
  Cc: Eric Dumazet, Alexey Preobrazhensky, Steffen Klassert,
	David Miller, paulmck, netdev, Kostya Serebryany, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Mon, Jun 23, 2014 at 1:55 AM, dormando <dormando@rydia.net> wrote:
> On Mon, 23 Jun 2014, Eric Dumazet wrote:
>
>> On Sun, 2014-06-22 at 12:07 -0700, dormando wrote:
>>
>> > Update on testing:
>> >
>> > I only have two machines that crash on their own frequently (more like
>> > one, even). Unfortunately something happened to the datacenter it's in and
>> > it was offline for a week. The machine normally crashes after 1.5-4d,
>> > averaging 2d.
>> >
>> > It's done about three days total time without a new crash. I also have the
>> > kernel running in another datacenter for ~10 days.. but it takes 30-150
>> > days to crash in that one.
>> >
>> > So, inconclusive, but still promising. If the machine survives the week it
>> > probably means it's fixed, or at least greatly reduced.
>> >
>> > I saw that one of your patches got queued for stable, but all three were
>> > necessary to fix udpkill. What's your plan for cleanup/upstreaming?
>> >
>> > Did you folks end up running udpkill under the tester thing?
>>
>> I did not test udpkill, as the known problem is the DST_NOCACHE flag.
>>
>> We end up calling sk_dst_set(sk, dst) with a dst having this flag set.
>>
>> So maybe DST_NOCACHE should be renamed, if we _can _ cache a dst like
>> this. Its meaning is really that dst_release() has to track when
>> refcount reaches 0 so that last owner fress dst, but we need to respect
>> rcu grace period.
>>
>> Fixing sk_dst_set() as I did is not enough, as it is only reducing race
>> window.
>
> Hrm. I'll have to spend more time trying to understand how to test this
> (beyond just putting a kernel into production and seeing if it crashes).
> Outside of udpkill it's been slightly hard to reproduce, and udpkill ran
> for over five hours with your previous patches.
>
> Do you or other folks have any methods for testing this?


Well, running with kasan should reduce time to crash by an order of magnitude.
Alexey, have we tried running udpkill with kasan?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH net] ipv4: fix dst race in sk_dst_get()
  2014-06-23  8:33                               ` Eric Dumazet
  2014-06-23  8:55                                 ` dormando
@ 2014-06-24 17:05                                 ` Eric Dumazet
  2014-06-26  0:42                                   ` David Miller
  1 sibling, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2014-06-24 17:05 UTC (permalink / raw)
  To: dormando, David Miller
  Cc: Alexey Preobrazhensky, Steffen Klassert, paulmck, netdev,
	Kostya Serebryany, Dmitry Vyukov, Lars Bull, Eric Dumazet,
	Bruce Curtis, Maciej Żenczykowski, Alexei Starovoitov

From: Eric Dumazet <edumazet@google.com>

When IP route cache had been removed in linux-3.6, we broke assumption
that dst entries were all freed after rcu grace period. DST_NOCACHE
dst were supposed to be freed from dst_release(). But it appears
we want to keep such dst around, either in UDP sockets or tunnels.

In sk_dst_get() we need to make sure dst refcount is not 0
before incrementing it, or else we might end up freeing a dst
twice.

DST_NOCACHE set on a dst does not mean this dst can not be attached
to a socket or a tunnel.

Then, before actual freeing, we need to observe a rcu grace period
to make sure all other cpus can catch the fact the dst is no longer
usable.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dormando <dormando@rydia.net>
---
 include/net/sock.h   |    4 ++--
 net/core/dst.c       |   16 +++++++++++-----
 net/ipv4/ip_tunnel.c |   14 +++++---------
 3 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 07b7fcd60d80..173cae485de1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1730,8 +1730,8 @@ sk_dst_get(struct sock *sk)
 
 	rcu_read_lock();
 	dst = rcu_dereference(sk->sk_dst_cache);
-	if (dst)
-		dst_hold(dst);
+	if (dst && !atomic_inc_not_zero(&dst->__refcnt))
+		dst = NULL;
 	rcu_read_unlock();
 	return dst;
 }
diff --git a/net/core/dst.c b/net/core/dst.c
index 80d6286c8b62..a028409ee438 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -269,6 +269,15 @@ again:
 }
 EXPORT_SYMBOL(dst_destroy);
 
+static void dst_destroy_rcu(struct rcu_head *head)
+{
+	struct dst_entry *dst = container_of(head, struct dst_entry, rcu_head);
+
+	dst = dst_destroy(dst);
+	if (dst)
+		__dst_free(dst);
+}
+
 void dst_release(struct dst_entry *dst)
 {
 	if (dst) {
@@ -276,11 +285,8 @@ void dst_release(struct dst_entry *dst)
 
 		newrefcnt = atomic_dec_return(&dst->__refcnt);
 		WARN_ON(newrefcnt < 0);
-		if (unlikely(dst->flags & DST_NOCACHE) && !newrefcnt) {
-			dst = dst_destroy(dst);
-			if (dst)
-				__dst_free(dst);
-		}
+		if (unlikely(dst->flags & DST_NOCACHE) && !newrefcnt)
+			call_rcu(&dst->rcu_head, dst_destroy_rcu);
 	}
 }
 EXPORT_SYMBOL(dst_release);
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 097b3e7c1e8f..54b6731dab55 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -73,12 +73,7 @@ static void __tunnel_dst_set(struct ip_tunnel_dst *idst,
 {
 	struct dst_entry *old_dst;
 
-	if (dst) {
-		if (dst->flags & DST_NOCACHE)
-			dst = NULL;
-		else
-			dst_clone(dst);
-	}
+	dst_clone(dst);
 	old_dst = xchg((__force struct dst_entry **)&idst->dst, dst);
 	dst_release(old_dst);
 }
@@ -108,13 +103,14 @@ static struct rtable *tunnel_rtable_get(struct ip_tunnel *t, u32 cookie)
 
 	rcu_read_lock();
 	dst = rcu_dereference(this_cpu_ptr(t->dst_cache)->dst);
+	if (dst && !atomic_inc_not_zero(&dst->__refcnt))
+		dst = NULL;
 	if (dst) {
 		if (dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
-			rcu_read_unlock();
 			tunnel_dst_reset(t);
-			return NULL;
+			dst_release(dst);
+			dst = NULL;
 		}
-		dst_hold(dst);
 	}
 	rcu_read_unlock();
 	return (struct rtable *)dst;

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH net] ipv4: fix dst race in sk_dst_get()
  2014-06-24 17:05                                 ` [PATCH net] ipv4: fix dst race in sk_dst_get() Eric Dumazet
@ 2014-06-26  0:42                                   ` David Miller
  0 siblings, 0 replies; 41+ messages in thread
From: David Miller @ 2014-06-26  0:42 UTC (permalink / raw)
  To: eric.dumazet
  Cc: dormando, preobr, steffen.klassert, paulmck, netdev, kcc,
	dvyukov, larsbull, edumazet, brutus, maze, alexei.starovoitov

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 24 Jun 2014 10:05:11 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> When IP route cache had been removed in linux-3.6, we broke assumption
> that dst entries were all freed after rcu grace period. DST_NOCACHE
> dst were supposed to be freed from dst_release(). But it appears
> we want to keep such dst around, either in UDP sockets or tunnels.
> 
> In sk_dst_get() we need to make sure dst refcount is not 0
> before incrementing it, or else we might end up freeing a dst
> twice.
> 
> DST_NOCACHE set on a dst does not mean this dst can not be attached
> to a socket or a tunnel.
> 
> Then, before actual freeing, we need to observe a rcu grace period
> to make sure all other cpus can catch the fact the dst is no longer
> usable.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Dormando <dormando@rydia.net>

Applied and queued up for -stable, thanks Eric.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-11  1:26         ` Eric Dumazet
  2014-06-11  4:16           ` dormando
@ 2014-06-29  0:25           ` dormando
  2014-06-30  6:38             ` Eric Dumazet
  2014-06-30  8:26           ` [PATCH] ipv4: irq safe sk_dst_[re]set() and ipv4_sk_update_pmtu() fix Eric Dumazet
  2 siblings, 1 reply; 41+ messages in thread
From: dormando @ 2014-06-29  0:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Tue, 10 Jun 2014, Eric Dumazet wrote:

> On Tue, 2014-06-10 at 18:12 -0700, Eric Dumazet wrote:
>
> >
> > For the curious, another problem is in ipv4_sk_update_pmtu()
> >
> > This can be called on UDP sockets, but from softirq context.
> >
> > We cannot use sk_dst_lock because this lock is not softirq safe.
> >
> > I guess we should use xchg() for sk_dst_set() and sk_dst_reset()
>
> This would be something like this untested patch :
>

I see this didn't get queued for stable, should it be dropped due to the
later patch? (which did get queued for stable).

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-29  0:25           ` dormando
@ 2014-06-30  6:38             ` Eric Dumazet
  2014-06-30  8:15               ` dormando
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2014-06-30  6:38 UTC (permalink / raw)
  To: dormando
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Sat, 2014-06-28 at 17:25 -0700, dormando wrote:
> On Tue, 10 Jun 2014, Eric Dumazet wrote:
> 
> > On Tue, 2014-06-10 at 18:12 -0700, Eric Dumazet wrote:
> >
> > >
> > > For the curious, another problem is in ipv4_sk_update_pmtu()
> > >
> > > This can be called on UDP sockets, but from softirq context.
> > >
> > > We cannot use sk_dst_lock because this lock is not softirq safe.
> > >
> > > I guess we should use xchg() for sk_dst_set() and sk_dst_reset()
> >
> > This would be something like this untested patch :
> >
> 
> I see this didn't get queued for stable, should it be dropped due to the
> later patch? (which did get queued for stable).

I have this patch in my queue.

I have to submit it officially before it can be merged...

David seems quite busy these days anyway.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-30  6:38             ` Eric Dumazet
@ 2014-06-30  8:15               ` dormando
  2014-06-30  8:30                 ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: dormando @ 2014-06-30  8:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Sun, 29 Jun 2014, Eric Dumazet wrote:

> On Sat, 2014-06-28 at 17:25 -0700, dormando wrote:
> > On Tue, 10 Jun 2014, Eric Dumazet wrote:
> >
> > > On Tue, 2014-06-10 at 18:12 -0700, Eric Dumazet wrote:
> > >
> > > >
> > > > For the curious, another problem is in ipv4_sk_update_pmtu()
> > > >
> > > > This can be called on UDP sockets, but from softirq context.
> > > >
> > > > We cannot use sk_dst_lock because this lock is not softirq safe.
> > > >
> > > > I guess we should use xchg() for sk_dst_set() and sk_dst_reset()
> > >
> > > This would be something like this untested patch :
> > >
> >
> > I see this didn't get queued for stable, should it be dropped due to the
> > later patch? (which did get queued for stable).
>
> I have this patch in my queue.
>
> I have to submit it officially before it can be merged...
>
> David seems quite busy these days anyway.
>

Heh. You folks have quite insane amounts of output.

I'm now testing a kernel which has the one upstreamed patch + the two not
yet in -stable, under 3.10.45. A kernel with ~60% of those patches has
been running for 17.5 days so far. Much longer than it normally runs.

So, looking pretty decent, still. Thanks for your patience for all of my
dumb questions. I'll probably only pipe up again on this issue if I see
another crash.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH] ipv4: irq safe sk_dst_[re]set()  and ipv4_sk_update_pmtu() fix
  2014-06-11  1:26         ` Eric Dumazet
  2014-06-11  4:16           ` dormando
  2014-06-29  0:25           ` dormando
@ 2014-06-30  8:26           ` Eric Dumazet
  2014-07-01  6:43             ` David Miller
  2 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2014-06-30  8:26 UTC (permalink / raw)
  To: dormando, David Miller; +Cc: netdev, Steffen Klassert

From: Eric Dumazet <edumazet@google.com>

We have two different ways to handle changes to sk->sk_dst

First way (used by TCP) assumes socket lock is owned by caller, and use
no extra lock : __sk_dst_set() & __sk_dst_reset()

Another way (used by UDP) uses sk_dst_lock because socket lock is not
always taken. Note that sk_dst_lock is not softirq safe.

These ways are not inter changeable for a given socket type.

ipv4_sk_update_pmtu(), added in linux-3.8, added a race, as it used
the socket lock as synchronization, but users might be UDP sockets.

Instead of converting sk_dst_lock to a softirq safe version, use xchg()
as we did for sk_rx_dst in commit e47eb5dfb296b ("udp: ipv4: do not use
sk_dst_lock from softirq context")

In a follow up patch, we probably can remove sk_dst_lock, as it is
only used in IPv6.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Fixes: 9cb3a50c5f63e ("ipv4: Invalidate the socket cached route on pmtu events if possible")
---
 include/net/sock.h |   12 ++++++------
 net/ipv4/route.c   |   15 ++++++++-------
 2 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 173cae485de1..c556fd9b05ac 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1768,9 +1768,11 @@ __sk_dst_set(struct sock *sk, struct dst_entry *dst)
 static inline void
 sk_dst_set(struct sock *sk, struct dst_entry *dst)
 {
-	spin_lock(&sk->sk_dst_lock);
-	__sk_dst_set(sk, dst);
-	spin_unlock(&sk->sk_dst_lock);
+	struct dst_entry *old_dst;
+
+	sk_tx_queue_clear(sk);
+	old_dst = xchg(&sk->sk_dst_cache, dst);
+	dst_release(old_dst);
 }
 
 static inline void
@@ -1782,9 +1784,7 @@ __sk_dst_reset(struct sock *sk)
 static inline void
 sk_dst_reset(struct sock *sk)
 {
-	spin_lock(&sk->sk_dst_lock);
-	__sk_dst_reset(sk);
-	spin_unlock(&sk->sk_dst_lock);
+	sk_dst_set(sk, NULL);
 }
 
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 082239ffe34a..3162ea923ded 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1010,7 +1010,7 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 	const struct iphdr *iph = (const struct iphdr *) skb->data;
 	struct flowi4 fl4;
 	struct rtable *rt;
-	struct dst_entry *dst;
+	struct dst_entry *odst = NULL;
 	bool new = false;
 
 	bh_lock_sock(sk);
@@ -1018,16 +1018,17 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 	if (!ip_sk_accept_pmtu(sk))
 		goto out;
 
-	rt = (struct rtable *) __sk_dst_get(sk);
+	odst = sk_dst_get(sk);
 
-	if (sock_owned_by_user(sk) || !rt) {
+	if (sock_owned_by_user(sk) || !odst) {
 		__ipv4_sk_update_pmtu(skb, sk, mtu);
 		goto out;
 	}
 
 	__build_flow_key(&fl4, sk, iph, 0, 0, 0, 0, 0);
 
-	if (!__sk_dst_check(sk, 0)) {
+	rt = (struct rtable *)odst;
+	if (odst->obsolete && odst->ops->check(odst, 0) == NULL) {
 		rt = ip_route_output_flow(sock_net(sk), &fl4, sk);
 		if (IS_ERR(rt))
 			goto out;
@@ -1037,8 +1038,7 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 
 	__ip_rt_update_pmtu((struct rtable *) rt->dst.path, &fl4, mtu);
 
-	dst = dst_check(&rt->dst, 0);
-	if (!dst) {
+	if (!dst_check(&rt->dst, 0)) {
 		if (new)
 			dst_release(&rt->dst);
 
@@ -1050,10 +1050,11 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 	}
 
 	if (new)
-		__sk_dst_set(sk, &rt->dst);
+		sk_dst_set(sk, &rt->dst);
 
 out:
 	bh_unlock_sock(sk);
+	dst_release(odst);
 }
 EXPORT_SYMBOL_GPL(ipv4_sk_update_pmtu);
 

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-30  8:15               ` dormando
@ 2014-06-30  8:30                 ` Eric Dumazet
  2014-07-08  1:41                   ` dormando
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2014-06-30  8:30 UTC (permalink / raw)
  To: dormando
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Mon, 2014-06-30 at 01:15 -0700, dormando wrote:

> Heh. You folks have quite insane amounts of output.
> 
> I'm now testing a kernel which has the one upstreamed patch + the two not
> yet in -stable, under 3.10.45. A kernel with ~60% of those patches has
> been running for 17.5 days so far. Much longer than it normally runs.
> 
> So, looking pretty decent, still. Thanks for your patience for all of my
> dumb questions. I'll probably only pipe up again on this issue if I see
> another crash.

Thanks _you_ for your patience, reports, and tests !

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: irq safe sk_dst_[re]set() and ipv4_sk_update_pmtu() fix
  2014-06-30  8:26           ` [PATCH] ipv4: irq safe sk_dst_[re]set() and ipv4_sk_update_pmtu() fix Eric Dumazet
@ 2014-07-01  6:43             ` David Miller
  0 siblings, 0 replies; 41+ messages in thread
From: David Miller @ 2014-07-01  6:43 UTC (permalink / raw)
  To: eric.dumazet; +Cc: dormando, netdev, steffen.klassert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 30 Jun 2014 01:26:23 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> We have two different ways to handle changes to sk->sk_dst
> 
> First way (used by TCP) assumes socket lock is owned by caller, and use
> no extra lock : __sk_dst_set() & __sk_dst_reset()
> 
> Another way (used by UDP) uses sk_dst_lock because socket lock is not
> always taken. Note that sk_dst_lock is not softirq safe.
> 
> These ways are not inter changeable for a given socket type.
> 
> ipv4_sk_update_pmtu(), added in linux-3.8, added a race, as it used
> the socket lock as synchronization, but users might be UDP sockets.
> 
> Instead of converting sk_dst_lock to a softirq safe version, use xchg()
> as we did for sk_rx_dst in commit e47eb5dfb296b ("udp: ipv4: do not use
> sk_dst_lock from softirq context")
> 
> In a follow up patch, we probably can remove sk_dst_lock, as it is
> only used in IPv6.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Steffen Klassert <steffen.klassert@secunet.com>
> Fixes: 9cb3a50c5f63e ("ipv4: Invalidate the socket cached route on pmtu events if possible")

Applied, and queued up for -stable, thanks Eric.

That dst_check() sequence in ipv4_sk_update_pmtu() is superfluous if "!new",
in fact how can it trigger?

If new is true, we just got the route from ip_route_output_flow() and we
presume it to be ok.

If new is not true, we just did the odst->ops->check() and it returned
non-NULL.

Hmmm?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-06-30  8:30                 ` Eric Dumazet
@ 2014-07-08  1:41                   ` dormando
  2014-07-08  6:47                     ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: dormando @ 2014-07-08  1:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Mon, 30 Jun 2014, Eric Dumazet wrote:

> On Mon, 2014-06-30 at 01:15 -0700, dormando wrote:
>
> > Heh. You folks have quite insane amounts of output.
> >
> > I'm now testing a kernel which has the one upstreamed patch + the two not
> > yet in -stable, under 3.10.45. A kernel with ~60% of those patches has
> > been running for 17.5 days so far. Much longer than it normally runs.
> >
> > So, looking pretty decent, still. Thanks for your patience for all of my
> > dumb questions. I'll probably only pipe up again on this issue if I see
> > another crash.
>
> Thanks _you_ for your patience, reports, and tests !
>

Mostly there, but I think we hit what might be a new bug.. The machines
which crashed every few days previously have been stable for weeks.

however I had one machine running the new kernel in a larger cluster
elsewhere; we had a network event and the one machine on the new kernel
panic'ed in ipv4_dst_destroy, but what looks like a new path. Sadly I've
had to halt the rollout :( All of the older unfixed kernels survived this
particular network event.

Unfortunately this is still on 3.10, due to a bad softirq regression in
3.14 I've not had time to track down. I applied all of your patches for
what wasn't already in 3.10. The only other change I made was to un-revert
62713c4b6bc10c2d082ee1540e11b01a2b2162ab - which I'd been keeping reverted
as it was making crashes much more frequent.

We tried for hours to get it to crash again and could not: full prod
traffic, udpkill running across four interfaces, and flapping the BGP
sessions around causing mass route flushing. No crash. There was something
specific about the upstream network event but we have not figured out what
yet. There were some InHdrErrors recorded and maybe some small amount of
forwarded traffic (we have some forwarding rules that are only used during
traffic bleeds). Nothing else stood out.

<4>[410769.205739] general protection fault: 0000 [#1] SMP
<4>[410769.205758] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel igb i2c_algo_bit gpio_ich isci ipmi_watchdog ixgbe ipmi_devintf microcode sb_edac libsas edac_core lpc_ich ptp mfd_core pps_core tpm_tis mdio tpm tpm_bios ipmi_si ipmi_msghandler
<4>[410769.205860] CPU: 2 PID: 251071 Comm: cache-main Tainted: G        W    3.10.45 #1
<4>[410769.205874] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
<4>[410769.205892] task: ffff88b882bddc00 ti: ffff88be06058000 task.ti: ffff88be06058000
<4>[410769.205906] RIP: 0010:[<ffffffff815fa8ef>]  [<ffffffff815fa8ef>] ipv4_dst_destroy+0x4f/0x80
<4>[410769.205926] RSP: 0000:ffff885effc43e00  EFLAGS: 00010286
<4>[410769.205936] RAX: dead000000200200 RBX: ffff88be3a5a1380 RCX: 0000000000000040
<4>[410769.205950] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
<4>[410769.205964] RBP: ffff885effc43e10 R08: ffffffff81cb0b00 R09: ffffea01791b9480
<4>[410769.205978] R10: ffffffff815b98f5 R11: 000000000003d4bf R12: 0000000000000000
<4>[410769.206009] R13: ffffffff81c8c300 R14: ffff885effc4d748 R15: ffff88b882bddc00
<4>[410769.206052] FS:  00007faa7b33c880(0000) GS:ffff885effc40000(0000) knlGS:0000000000000000
<4>[410769.206096] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[410769.206123] CR2: 00007f09c903c420 CR3: 000000bb6d074000 CR4: 00000000000407e0
<4>[410769.206166] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[410769.206208] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[410769.206251] Stack:
<4>[410769.206271]  ffff88be3a5a1380 ffff88be3a5a1380 ffff885effc43e40 ffffffff815b98d2
<4>[410769.206321]  ffff885effc43e40 ffff885effc4d720 ffffffff81c39d80 ffff88be06059fd8
<4>[410769.206370]  ffff885effc43e50 ffffffff815b9c6e ffff885effc43ec0 ffffffff810c91e2
<4>[410769.206420] Call Trace:
<4>[410769.206440]  <IRQ>
<4>[410769.206445]  [<ffffffff815b98d2>] dst_destroy+0x32/0xe0
<4>[410769.206491]  [<ffffffff815b9c6e>] dst_destroy_rcu+0xe/0x20
<4>[410769.206520]  [<ffffffff810c91e2>] rcu_process_callbacks+0x202/0x560
<4>[410769.206550]  [<ffffffff8108e9f4>] ? ktime_get+0x54/0xe0
<4>[410769.206578]  [<ffffffff81051a00>] __do_softirq+0xd0/0x270
<4>[410769.206607]  [<ffffffff810969b4>] ? tick_program_event+0x24/0x30
<4>[410769.206637]  [<ffffffff81071bb0>] ? hrtimer_interrupt+0x140/0x240
<4>[410769.206667]  [<ffffffff816d12fc>] call_softirq+0x1c/0x30
<4>[410769.206696]  [<ffffffff81004215>] do_softirq+0x55/0x90
<4>[410769.206722]  [<ffffffff81051cb5>] irq_exit+0x55/0x60
<4>[410769.206748]  [<ffffffff816d1a6e>] smp_apic_timer_interrupt+0x6e/0x99
<4>[410769.206776]  [<ffffffff816d0c8a>] apic_timer_interrupt+0x6a/0x70
<4>[410769.206803]  <EOI>
<4>[410769.206807]  [<ffffffff816d00c2>] ? system_call_fastpath+0x16/0x1b
<4>[410769.206854] Code: 4a 8f e9 81 e8 33 d2 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 8f e9 81
<1>[410769.207051] RIP  [<ffffffff815fa8ef>] ipv4_dst_destroy+0x4f/0x80
<4>[410769.207080]  RSP <ffff885effc43e00>
<4>[410769.207358] ---[ end trace 81d52c76acd843b0 ]---
<0>[410769.276065] Kernel panic - not syncing: Fatal exception in interrupt

Thanks,
-Dormando

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-07-08  1:41                   ` dormando
@ 2014-07-08  6:47                     ` Eric Dumazet
  2014-07-08  7:01                       ` dormando
  2014-07-16 21:03                       ` dormando
  0 siblings, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2014-07-08  6:47 UTC (permalink / raw)
  To: dormando
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Mon, 2014-07-07 at 18:41 -0700, dormando wrote:

> Mostly there, but I think we hit what might be a new bug.. The machines
> which crashed every few days previously have been stable for weeks.
> 
> however I had one machine running the new kernel in a larger cluster
> elsewhere; we had a network event and the one machine on the new kernel
> panic'ed in ipv4_dst_destroy, but what looks like a new path. Sadly I've
> had to halt the rollout :( All of the older unfixed kernels survived this
> particular network event.
> 
> Unfortunately this is still on 3.10, due to a bad softirq regression in
> 3.14 I've not had time to track down. I applied all of your patches for
> what wasn't already in 3.10. The only other change I made was to un-revert
> 62713c4b6bc10c2d082ee1540e11b01a2b2162ab - which I'd been keeping reverted
> as it was making crashes much more frequent.

Hmm, always give patch title or a valid sha1 commit, this one is not in
David trees, so its hard to tell.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-07-08  6:47                     ` Eric Dumazet
@ 2014-07-08  7:01                       ` dormando
  2014-07-16 21:03                       ` dormando
  1 sibling, 0 replies; 41+ messages in thread
From: dormando @ 2014-07-08  7:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Tue, 8 Jul 2014, Eric Dumazet wrote:

> On Mon, 2014-07-07 at 18:41 -0700, dormando wrote:
>
> > Mostly there, but I think we hit what might be a new bug.. The machines
> > which crashed every few days previously have been stable for weeks.
> >
> > however I had one machine running the new kernel in a larger cluster
> > elsewhere; we had a network event and the one machine on the new kernel
> > panic'ed in ipv4_dst_destroy, but what looks like a new path. Sadly I've
> > had to halt the rollout :( All of the older unfixed kernels survived this
> > particular network event.
> >
> > Unfortunately this is still on 3.10, due to a bad softirq regression in
> > 3.14 I've not had time to track down. I applied all of your patches for
> > what wasn't already in 3.10. The only other change I made was to un-revert
> > 62713c4b6bc10c2d082ee1540e11b01a2b2162ab - which I'd been keeping reverted
> > as it was making crashes much more frequent.
>
> Hmm, always give patch title or a valid sha1 commit, this one is not in
> David trees, so its hard to tell.
>

Damn, sorry. I thought it was valid:
Author: Alexei Starovoitov <ast@plumgrid.com>
Date:   Tue Nov 19 19:12:34 2013 -0800

    ipv4: fix race in concurrent ip_route_input_slow()

    [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]

It's a thing that uses a DST_NOCACHE flag. I can re-add the reversion to
my own tree, but it should probably be reviewed again I guess?

We had another thread about it a while ago. I'd upgraded between stable
revisions of 3.10 (when this patch was added) and machines in one
datacenter started crashing every few hours. Thread never went anywhere.

Tried removing the reversion since your recent patches should've fixed the
underlying problem.

I have no idea if this patch is the problem or not though, just adding the
information for completeness. We had no luck at all reproducing this
latest crash.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-07-08  6:47                     ` Eric Dumazet
  2014-07-08  7:01                       ` dormando
@ 2014-07-16 21:03                       ` dormando
  2014-07-25  8:11                         ` dormando
  1 sibling, 1 reply; 41+ messages in thread
From: dormando @ 2014-07-16 21:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

On Tue, 8 Jul 2014, Eric Dumazet wrote:

> On Mon, 2014-07-07 at 18:41 -0700, dormando wrote:
>
> > Mostly there, but I think we hit what might be a new bug.. The machines
> > which crashed every few days previously have been stable for weeks.
> >
> > however I had one machine running the new kernel in a larger cluster
> > elsewhere; we had a network event and the one machine on the new kernel
> > panic'ed in ipv4_dst_destroy, but what looks like a new path. Sadly I've
> > had to halt the rollout :( All of the older unfixed kernels survived this
> > particular network event.
> >
> > Unfortunately this is still on 3.10, due to a bad softirq regression in
> > 3.14 I've not had time to track down. I applied all of your patches for
> > what wasn't already in 3.10. The only other change I made was to un-revert
> > 62713c4b6bc10c2d082ee1540e11b01a2b2162ab - which I'd been keeping reverted
> > as it was making crashes much more frequent.
>
> Hmm, always give patch title or a valid sha1 commit, this one is not in
> David trees, so its hard to tell.
>

Happened again, about two minutes after causing a large route churn.
Doing the same action again after it's been rebooted isn't causing it to
crash... it last went down a week ago. Either we're still not reproducing
it correctly, or it requires some amount of uptime inbetween crashes.

Trace is slightly different this time, but same function.

Any thoughts on how to instrument? :( kernels without your latest patches
aren't crashing during these changes. We've fixed the UDP issue but traded
it for something else.

<4>[774493.032809] general protection fault: 0000 [#1] SMP
<4>[774493.032830] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
<4>[774493.032948] CPU: 10 PID: 49 Comm: ksoftirqd/10 Tainted: G        W    3.10.45 #1
<4>[774493.032964] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
<4>[774493.032983] task: ffff88be6f3e0000 ti: ffff88be6f3de000 task.ti: ffff88be6f3de000
<4>[774493.032997] RIP: 0010:[<ffffffff815fa8ef>]  [<ffffffff815fa8ef>] ipv4_dst_destroy+0x4f/0x80
<4>[774493.033022] RSP: 0018:ffff88be6f3dfd18  EFLAGS: 00010296
<4>[774493.033033] RAX: dead000000200200 RBX: ffff88b94f5d1380 RCX: 0000000000000040
<4>[774493.033046] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
<4>[774493.033060] RBP: ffff88be6f3dfd28 R08: ffffffff81cb0b00 R09: ffffea02f9458400
<4>[774493.033090] R10: ffffffff815b98f5 R11: 0000000000000031 R12: 0000000000000000
<4>[774493.033133] R13: ffffffff81c8c300 R14: ffff88c07fc4d748 R15: ffff88be6f3e0000
<4>[774493.033177] FS:  0000000000000000(0000) GS:ffff88c07fc40000(0000) knlGS:0000000000000000
<4>[774493.033221] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[774493.033248] CR2: 00007f805c06f000 CR3: 0000005769ed2000 CR4: 00000000000407e0
<4>[774493.033291] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[774493.033334] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[774493.033377] Stack:
<4>[774493.033397]  ffff88b94f5d1380 ffff88b94f5d1380 ffff88be6f3dfd58 ffffffff815b98d2
<4>[774493.033448]  ffff88be6f3dfd58 ffff88c07fc4d720 ffffffff81c39d80 ffff88be6f3dffd8
<4>[774493.033499]  ffff88be6f3dfd68 ffffffff815b9c6e ffff88be6f3dfdd8 ffffffff810c91e2
<4>[774493.033551] Call Trace:
<4>[774493.033579]  [<ffffffff815b98d2>] dst_destroy+0x32/0xe0
<4>[774493.033607]  [<ffffffff815b9c6e>] dst_destroy_rcu+0xe/0x20
<4>[774493.033638]  [<ffffffff810c91e2>] rcu_process_callbacks+0x202/0x560
<4>[774493.033671]  [<ffffffff81051a00>] __do_softirq+0xd0/0x270
<4>[774493.033699]  [<ffffffff81051bc8>] run_ksoftirqd+0x28/0x40
<4>[774493.033730]  [<ffffffff8107576d>] smpboot_thread_fn+0xfd/0x180
<4>[774493.033758]  [<ffffffff81075670>] ? lg_global_lock+0x80/0x80
<4>[774493.033788]  [<ffffffff8106e040>] kthread+0xc0/0xd0
<4>[774493.033814]  [<ffffffff8106df80>] ? flush_kthread_worker+0xb0/0xb0
<4>[774493.033845]  [<ffffffff816d001c>] ret_from_fork+0x7c/0xb0
<4>[774493.033872]  [<ffffffff8106df80>] ? flush_kthread_worker+0xb0/0xb0
<4>[774493.033900] Code: 4a 8f e9 81 e8 33 d2 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 8f e9 81
<1>[774493.034115] RIP  [<ffffffff815fa8ef>] ipv4_dst_destroy+0x4f/0x80
<4>[774493.034145]  RSP <ffff88be6f3dfd18>
<4>[774493.034439] ---[ end trace 10b9e107c9a58917 ]---
<0>[774493.096332] Kernel panic - not syncing: Fatal exception in interrupt

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
  2014-07-16 21:03                       ` dormando
@ 2014-07-25  8:11                         ` dormando
  0 siblings, 0 replies; 41+ messages in thread
From: dormando @ 2014-07-25  8:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Preobrazhensky, Steffen Klassert, David Miller, paulmck,
	netdev, Kostya Serebryany, Dmitry Vyukov, Lars Bull,
	Eric Dumazet, Bruce Curtis, Maciej Żenczykowski,
	Alexei Starovoitov

> On Tue, 8 Jul 2014, Eric Dumazet wrote:
>
> > On Mon, 2014-07-07 at 18:41 -0700, dormando wrote:
> >
> > > Mostly there, but I think we hit what might be a new bug.. The machines
> > > which crashed every few days previously have been stable for weeks.
> > >
> > > however I had one machine running the new kernel in a larger cluster
> > > elsewhere; we had a network event and the one machine on the new kernel
> > > panic'ed in ipv4_dst_destroy, but what looks like a new path. Sadly I've
> > > had to halt the rollout :( All of the older unfixed kernels survived this
> > > particular network event.
> > >
> > > Unfortunately this is still on 3.10, due to a bad softirq regression in
> > > 3.14 I've not had time to track down. I applied all of your patches for
> > > what wasn't already in 3.10. The only other change I made was to un-revert
> > > 62713c4b6bc10c2d082ee1540e11b01a2b2162ab - which I'd been keeping reverted
> > > as it was making crashes much more frequent.
> >
> > Hmm, always give patch title or a valid sha1 commit, this one is not in
> > David trees, so its hard to tell.
> >
>
> Happened again, about two minutes after causing a large route churn.
> Doing the same action again after it's been rebooted isn't causing it to
> crash... it last went down a week ago. Either we're still not reproducing
> it correctly, or it requires some amount of uptime inbetween crashes.
>
> Trace is slightly different this time, but same function.
>
> Any thoughts on how to instrument? :( kernels without your latest patches
> aren't crashing during these changes. We've fixed the UDP issue but traded
> it for something else.
>
> <4>[774493.032809] general protection fault: 0000 [#1] SMP
> <4>[774493.032830] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
> <4>[774493.032948] CPU: 10 PID: 49 Comm: ksoftirqd/10 Tainted: G        W    3.10.45 #1
> <4>[774493.032964] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> <4>[774493.032983] task: ffff88be6f3e0000 ti: ffff88be6f3de000 task.ti: ffff88be6f3de000
> <4>[774493.032997] RIP: 0010:[<ffffffff815fa8ef>]  [<ffffffff815fa8ef>] ipv4_dst_destroy+0x4f/0x80

Had our third panic (with very few machines running the kernel). Same
general spot; RCU callback, ipv4_dst_destroy, LIST_POISON1/POISON2 showing
it being a double-free.

The crash is always in the RCU callback, which only happens from
dst_release() when the dst has the DST_NOCACHE flag set.

void dst_release(struct dst_entry *dst)
{
    if (dst) {
        int newrefcnt;

        newrefcnt = atomic_dec_return(&dst->__refcnt);
        WARN_ON(newrefcnt < 0);
        if (unlikely(dst->flags & DST_NOCACHE) && !newrefcnt)
            call_rcu(&dst->rcu_head, dst_destroy_rcu);
    }
}

The WARN_ON() isn't firing so far as I can tell. the pstores we get from
the crashes are a bunch of normal sparse network noise, then immediately
the panic.

That tells me that it's some other part of the code causing the
double-free somehow? I've been working on understanding it but it's slow
going. I dont know if it's possible for a child of a dst to be shared in
more than one place, or if any of the callers of dst_release are wrong
(but if they were, the WARN_ON would fire, I'm pretty sure?)

Reproducing the problem has gone from a curiosity to a frustration:
sometimes when we do network maintenance the new kernel will immediately
die, but then we do the same thing again and it won't. We've done half a
dozen torture tests where we swap hundreds of thousands of routes on the
machine while running vairous mixes of traffic and can't get it to blow on
command.

Did you folks ever run the thing through ksan? I'm trying to come up with
ideas to instrument this better but dst fiddling is done in different ways
all over the kernel, and I've no idea which part is the trigger.

thanks, sorry for the noise :( I'm spinning my wheels on this last bug.

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2014-07-25  8:11 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-06 11:29 Potential race in ip4_datagram_release_cb Alexey Preobrazhensky
2014-06-06 12:56 ` Eric Dumazet
2014-06-06 15:59   ` Alexei Starovoitov
2014-06-06 16:16     ` Eric Dumazet
2014-06-06 17:44       ` Alexei Starovoitov
2014-06-06 17:56         ` Eric Dumazet
2014-06-06 18:13           ` Alexei Starovoitov
2014-06-10 13:43 ` [PATCH] ipv4: fix a race in ip4_datagram_release_cb() Eric Dumazet
2014-06-11  0:32   ` dormando
2014-06-11  0:55     ` Eric Dumazet
2014-06-11  1:12       ` Eric Dumazet
2014-06-11  1:26         ` Eric Dumazet
2014-06-11  4:16           ` dormando
2014-06-11  5:54             ` Eric Dumazet
2014-06-11  7:20               ` dormando
2014-06-11  7:26                 ` dormando
2014-06-11  7:38                   ` dormando
2014-06-11 12:41                     ` Eric Dumazet
2014-06-11 13:12                       ` Eric Dumazet
2014-06-12  1:55                         ` dormando
2014-06-12  3:43                           ` Eric Dumazet
2014-06-12  4:05                             ` dormando
2014-06-22 19:07                             ` dormando
2014-06-23  8:33                               ` Eric Dumazet
2014-06-23  8:55                                 ` dormando
2014-06-23 16:57                                   ` Dmitry Vyukov
2014-06-24 17:05                                 ` [PATCH net] ipv4: fix dst race in sk_dst_get() Eric Dumazet
2014-06-26  0:42                                   ` David Miller
2014-06-11 13:38             ` [PATCH] ipv4: fix a race in ip4_datagram_release_cb() Kostya Serebryany
2014-06-29  0:25           ` dormando
2014-06-30  6:38             ` Eric Dumazet
2014-06-30  8:15               ` dormando
2014-06-30  8:30                 ` Eric Dumazet
2014-07-08  1:41                   ` dormando
2014-07-08  6:47                     ` Eric Dumazet
2014-07-08  7:01                       ` dormando
2014-07-16 21:03                       ` dormando
2014-07-25  8:11                         ` dormando
2014-06-30  8:26           ` [PATCH] ipv4: irq safe sk_dst_[re]set() and ipv4_sk_update_pmtu() fix Eric Dumazet
2014-07-01  6:43             ` David Miller
2014-06-11 22:39   ` [PATCH] ipv4: fix a race in ip4_datagram_release_cb() David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.