linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Asterisk deadlocks since Kernel 4.1
@ 2015-11-17 14:44 Stefan Priebe - Profihost AG
  2015-11-17 19:15 ` Thomas Gleixner
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Priebe - Profihost AG @ 2015-11-17 14:44 UTC (permalink / raw)
  To: linux-netdev; +Cc: linux-fsdevel, tglx, linux-kernel

Hello,

since Upgrading our Asterisk System from Kernel 3.18.17 to 4.1.13 it
deadlocks every few hours (kill -9 is the only thing working). Booting
with 3.18 again let it run smooth again.

An strace shows asterisk is looping like this:

[pid  6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0

fd 8 is:

lrwx------ 1 root root 64 Nov 17 15:27 /proc/6025/fd/8 ->
anon_inode:[timerfd]


# cat /proc/6025/stack
[<ffffffff941d60e9>] poll_schedule_timeout+0x49/0x70
[<ffffffff941d76c7>] do_sys_poll+0x3d7/0x590
[<ffffffff941d78bc>] do_restart_poll+0x3c/0x70
[<ffffffff9408e7ff>] sys_restart_syscall+0x1f/0x30
[<ffffffff9463f1ee>] system_call_fastpath+0x12/0x71
[<ffffffffffffffff>] 0xffffffffffffffff

Any ideas how to debug this?

Greets,
Stefan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-17 14:44 Asterisk deadlocks since Kernel 4.1 Stefan Priebe - Profihost AG
@ 2015-11-17 19:15 ` Thomas Gleixner
  2015-11-17 19:27   ` Stefan Priebe
  0 siblings, 1 reply; 45+ messages in thread
From: Thomas Gleixner @ 2015-11-17 19:15 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: linux-netdev, linux-fsdevel, linux-kernel

On Tue, 17 Nov 2015, Stefan Priebe - Profihost AG wrote:
> since Upgrading our Asterisk System from Kernel 3.18.17 to 4.1.13 it
> deadlocks every few hours (kill -9 is the only thing working). Booting
> with 3.18 again let it run smooth again.
> 
> An strace shows asterisk is looping like this:
> 
> [pid  6068] timerfd_gettime(8, , {it_interval={0, 20000000},
> it_value={0, 140592906050976}}) = 0
> [pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
> 2, 1000) = 1 ([{fd=8, revents=POLLIN}])
> [pid  6068] timerfd_gettime(8, , {it_interval={0, 20000000},
> it_value={0, 140592906050976}}) = 0
> [pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
> 
> fd 8 is:
> 
> lrwx------ 1 root root 64 Nov 17 15:27 /proc/6025/fd/8 ->
> anon_inode:[timerfd]
> 
> 
> # cat /proc/6025/stack
> [<ffffffff941d60e9>] poll_schedule_timeout+0x49/0x70
> [<ffffffff941d76c7>] do_sys_poll+0x3d7/0x590
> [<ffffffff941d78bc>] do_restart_poll+0x3c/0x70
> [<ffffffff9408e7ff>] sys_restart_syscall+0x1f/0x30
> [<ffffffff9463f1ee>] system_call_fastpath+0x12/0x71
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Any ideas how to debug this?

fd 8 is probably not really interesting. That looks like a interval
timer firing periodically. So it probably waits for fd 9 ...

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-17 19:15 ` Thomas Gleixner
@ 2015-11-17 19:27   ` Stefan Priebe
  2015-11-17 19:43     ` Thomas Gleixner
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Priebe @ 2015-11-17 19:27 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: netdev, linux-fsdevel, linux-kernel

Am 17.11.2015 um 20:15 schrieb Thomas Gleixner:
> On Tue, 17 Nov 2015, Stefan Priebe - Profihost AG wrote:
>> since Upgrading our Asterisk System from Kernel 3.18.17 to 4.1.13 it
>> deadlocks every few hours (kill -9 is the only thing working). Booting
>> with 3.18 again let it run smooth again.
>>
>> An strace shows asterisk is looping like this:
>>
>> [pid  6068] timerfd_gettime(8, , {it_interval={0, 20000000},
>> it_value={0, 140592906050976}}) = 0
>> [pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
>> [pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
>> 2, 1000) = 1 ([{fd=8, revents=POLLIN}])
>> [pid  6068] timerfd_gettime(8, , {it_interval={0, 20000000},
>> it_value={0, 140592906050976}}) = 0
>> [pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
>>
>> fd 8 is:
>>
>> lrwx------ 1 root root 64 Nov 17 15:27 /proc/6025/fd/8 ->
>> anon_inode:[timerfd]
>>
>>
>> # cat /proc/6025/stack
>> [<ffffffff941d60e9>] poll_schedule_timeout+0x49/0x70
>> [<ffffffff941d76c7>] do_sys_poll+0x3d7/0x590
>> [<ffffffff941d78bc>] do_restart_poll+0x3c/0x70
>> [<ffffffff9408e7ff>] sys_restart_syscall+0x1f/0x30
>> [<ffffffff9463f1ee>] system_call_fastpath+0x12/0x71
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> Any ideas how to debug this?
>
> fd 8 is probably not really interesting. That looks like a interval
> timer firing periodically. So it probably waits for fd 9 ...

I've now also two gdb backtraces from two crashes:
http://pastebin.com/raw.php?i=yih5jNt8

http://pastebin.com/raw.php?i=kGEcvH4T

Stefan

>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-17 19:27   ` Stefan Priebe
@ 2015-11-17 19:43     ` Thomas Gleixner
  2015-11-18 20:23       ` Stefan Priebe
  0 siblings, 1 reply; 45+ messages in thread
From: Thomas Gleixner @ 2015-11-17 19:43 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: netdev, linux-fsdevel, linux-kernel

On Tue, 17 Nov 2015, Stefan Priebe wrote:
> I've now also two gdb backtraces from two crashes:
> http://pastebin.com/raw.php?i=yih5jNt8
> 
> http://pastebin.com/raw.php?i=kGEcvH4T

They don't tell me anything as I have no idea of the inner workings of
asterisk. You might be better of to talk to the asterisk folks to help
you track down what that thing is waiting for, so we can actually look
at a well defined area.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-17 19:43     ` Thomas Gleixner
@ 2015-11-18 20:23       ` Stefan Priebe
  2015-11-18 21:00         ` Hannes Frederic Sowa
  2015-11-18 21:18         ` Florian Weimer
  0 siblings, 2 replies; 45+ messages in thread
From: Stefan Priebe @ 2015-11-18 20:23 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: netdev, linux-fsdevel, linux-kernel


Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
> On Tue, 17 Nov 2015, Stefan Priebe wrote:
>> I've now also two gdb backtraces from two crashes:
>> http://pastebin.com/raw.php?i=yih5jNt8
>>
>> http://pastebin.com/raw.php?i=kGEcvH4T
>
> They don't tell me anything as I have no idea of the inner workings of
> asterisk. You might be better of to talk to the asterisk folks to help
> you track down what that thing is waiting for, so we can actually look
> at a well defined area.

The asterisk guys told me it's a livelock asterisk is waiting for 
getaddrinfo / recvmsg.

Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
#0  0x00007fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fbeb9e92007 in getaddrinfo () from 
/lib/x86_64-linux-gnu/libc.so.6
#4  0x00000000004f2e81 in ast_sockaddr_resolve 
(addrs=addrs@entry=0x7fbe989c48d8, str=<optimized out>,
     str@entry=0x7fbe989c4a80 "10.12.12.92:2052", flags=flags@entry=0, 
family=2) at netsock2.c:268
#5  0x00007fbeb095fb8b in ast_sockaddr_resolve_first_af 
(family=<optimized out>, flag=0, name=0x7fbe989c4a80 "10.12.12.92:2052",
     addr=0x7fbeb4114948) at chan_sip.c:30797
#6  ast_sockaddr_resolve_first_transport (transport=<optimized out>, 
flag=0, name=0x7fbe989c4a80 "10.12.12.92:2052",
     addr=0x7fbeb4114948) at chan_sip.c:30828
#7  set_destination (uri=<optimized out>, p=0x7fbeb4114438) at 
chan_sip.c:10602
#8  reqprep (req=req@entry=0x7fbe989c4fc0, p=p@entry=0x7fbeb4114438, 
sipmethod=sipmethod@entry=8, seqno=<optimized out>,
     seqno@entry=0, newbranch=newbranch@entry=1) at chan_sip.c:10925
---Type <return> to continue, or q <return> to quit---
#9  0x00007fbeb0969968 in transmit_request_with_auth 
(p=p@entry=0x7fbeb4114438, newbranch=1, reliable=XMIT_RELIABLE, seqno=0,
     sipmethod=8) at chan_sip.c:14296
#10 0x00007fbeb098b704 in sip_hangup (ast=<optimized out>) at 
chan_sip.c:6822
#11 0x0000000000477b55 in ast_hangup (chan=chan@entry=0x7fbeb40ef018) at 
channel.c:2887
#12 0x000000000050ce1b in __ast_pbx_run (c=c@entry=0x7fbeb40ef018, 
args=args@entry=0x0) at pbx.c:5729
#13 0x000000000050e426 in pbx_thread (data=data@entry=0x7fbeb40ef018) at 
pbx.c:5821
#14 0x0000000000548aaa in dummy_start (data=<optimized out>) at utils.c:1151
#15 0x00007fbeb965eb50 in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007fbeb9eb395d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x0000000000000000 in ?? ()

In both cases, the thread is waiting in recvmsg.

Stefan

>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-18 20:23       ` Stefan Priebe
@ 2015-11-18 21:00         ` Hannes Frederic Sowa
  2015-11-18 21:20           ` Stefan Priebe
  2015-11-18 21:18         ` Florian Weimer
  1 sibling, 1 reply; 45+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-18 21:00 UTC (permalink / raw)
  To: Stefan Priebe, Thomas Gleixner, Florian Weimer
  Cc: netdev, linux-fsdevel, linux-kernel



On Wed, Nov 18, 2015, at 21:23, Stefan Priebe wrote:
> 
> Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
> > On Tue, 17 Nov 2015, Stefan Priebe wrote:
> >> I've now also two gdb backtraces from two crashes:
> >> http://pastebin.com/raw.php?i=yih5jNt8
> >>
> >> http://pastebin.com/raw.php?i=kGEcvH4T
> >
> > They don't tell me anything as I have no idea of the inner workings of
> > asterisk. You might be better of to talk to the asterisk folks to help
> > you track down what that thing is waiting for, so we can actually look
> > at a well defined area.
> 
> The asterisk guys told me it's a livelock asterisk is waiting for 
> getaddrinfo / recvmsg.
> 
> Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
> #0  0x00007fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x00007fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x00007fbeb9e92007 in getaddrinfo () from 
> /lib/x86_64-linux-gnu/libc.so.6

Florian Weimer recently patched glibc to detect filedescriptors which
have been reused incorrectly in the glibc name resolving routines:
<https://sourceware.org/ml/libc-alpha/2015-10/msg00865.html>

Could you determine the type of filedescriptor in /proc/pid/fd and
/proc/net/netlink by matching inodes?

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-18 20:23       ` Stefan Priebe
  2015-11-18 21:00         ` Hannes Frederic Sowa
@ 2015-11-18 21:18         ` Florian Weimer
  2015-11-18 21:23           ` Stefan Priebe
  2015-11-18 21:36           ` Stefan Priebe
  1 sibling, 2 replies; 45+ messages in thread
From: Florian Weimer @ 2015-11-18 21:18 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel

On 11/18/2015 09:23 PM, Stefan Priebe wrote:
> 
> Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
>> On Tue, 17 Nov 2015, Stefan Priebe wrote:
>>> I've now also two gdb backtraces from two crashes:
>>> http://pastebin.com/raw.php?i=yih5jNt8
>>>
>>> http://pastebin.com/raw.php?i=kGEcvH4T
>>
>> They don't tell me anything as I have no idea of the inner workings of
>> asterisk. You might be better of to talk to the asterisk folks to help
>> you track down what that thing is waiting for, so we can actually look
>> at a well defined area.
> 
> The asterisk guys told me it's a livelock asterisk is waiting for
> getaddrinfo / recvmsg.
> 
> Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
> #0  0x00007fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x00007fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x00007fbeb9e92007 in getaddrinfo () from
> /lib/x86_64-linux-gnu/libc.so.6

Stefan,

please try to get a backtrace with debugging information.  It is likely
that this is the make_request/__check_pf functionality in glibc, but it
would be nice to get some certainty.

Which glibc version do you use?  Has it got a fix for CVE-2013-7423?

So far, the only known cause for a hang in this place (that is, lack of
return from recvmsg) is incorrect file descriptor use.  (CVE-2013-7423
is such an issue in glibc itself.)  The kernel upgrade could change
scheduling behavior, and the actual bug might have been latent before.

Theoretically, recvmsg could also hang if the Netlink query was dropped
by the kernel, or the final packet in the response was dropped.  We
never saw that happen, even under extreme load, but I didn't test with
recent kernels.

The glibc change Hannes mentioned won't detect the hang, but if there is
incorrect file descriptor reuse going on, it is possible that the new
assert catches it.

Florian


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-18 21:00         ` Hannes Frederic Sowa
@ 2015-11-18 21:20           ` Stefan Priebe
  2015-11-18 21:22             ` Hannes Frederic Sowa
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Priebe @ 2015-11-18 21:20 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Thomas Gleixner, Florian Weimer
  Cc: netdev, linux-fsdevel, linux-kernel

Am 18.11.2015 um 22:00 schrieb Hannes Frederic Sowa:
>
>
> On Wed, Nov 18, 2015, at 21:23, Stefan Priebe wrote:
>>
>> Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
>>> On Tue, 17 Nov 2015, Stefan Priebe wrote:
>>>> I've now also two gdb backtraces from two crashes:
>>>> http://pastebin.com/raw.php?i=yih5jNt8
>>>>
>>>> http://pastebin.com/raw.php?i=kGEcvH4T
>>>
>>> They don't tell me anything as I have no idea of the inner workings of
>>> asterisk. You might be better of to talk to the asterisk folks to help
>>> you track down what that thing is waiting for, so we can actually look
>>> at a well defined area.
>>
>> The asterisk guys told me it's a livelock asterisk is waiting for
>> getaddrinfo / recvmsg.
>>
>> Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
>> #0  0x00007fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
>> #1  0x00007fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
>> #2  0x00007fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
>> #3  0x00007fbeb9e92007 in getaddrinfo () from
>> /lib/x86_64-linux-gnu/libc.so.6
>
> Florian Weimer recently patched glibc to detect filedescriptors which
> have been reused incorrectly in the glibc name resolving routines:
> <https://sourceware.org/ml/libc-alpha/2015-10/msg00865.html>
>
> Could you determine the type of filedescriptor in /proc/pid/fd and
> /proc/net/netlink by matching inodes?

you mean just:
la /proc/$pid/fd

and

cat /proc/net/netlink

Stefan

> Thanks,
> Hannes
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-18 21:20           ` Stefan Priebe
@ 2015-11-18 21:22             ` Hannes Frederic Sowa
  2015-11-19  9:35               ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 45+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-18 21:22 UTC (permalink / raw)
  To: Stefan Priebe, Thomas Gleixner, Florian Weimer
  Cc: netdev, linux-fsdevel, linux-kernel

On Wed, Nov 18, 2015, at 22:20, Stefan Priebe wrote:
> you mean just:
> la /proc/$pid/fd

ls -l /proc/pid/fd/

the numbers in brackets in return from readlink are the inode numbers.

> and
> 
> cat /proc/net/netlink

Exactly, last row is the inode number.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-18 21:18         ` Florian Weimer
@ 2015-11-18 21:23           ` Stefan Priebe
  2015-11-19  9:39             ` Florian Weimer
  2015-11-18 21:36           ` Stefan Priebe
  1 sibling, 1 reply; 45+ messages in thread
From: Stefan Priebe @ 2015-11-18 21:23 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel


Am 18.11.2015 um 22:18 schrieb Florian Weimer:
> On 11/18/2015 09:23 PM, Stefan Priebe wrote:
>>
>> Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
>>> On Tue, 17 Nov 2015, Stefan Priebe wrote:
>>>> I've now also two gdb backtraces from two crashes:
>>>> http://pastebin.com/raw.php?i=yih5jNt8
>>>>
>>>> http://pastebin.com/raw.php?i=kGEcvH4T
>>>
>>> They don't tell me anything as I have no idea of the inner workings of
>>> asterisk. You might be better of to talk to the asterisk folks to help
>>> you track down what that thing is waiting for, so we can actually look
>>> at a well defined area.
>>
>> The asterisk guys told me it's a livelock asterisk is waiting for
>> getaddrinfo / recvmsg.
>>
>> Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
>> #0  0x00007fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
>> #1  0x00007fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
>> #2  0x00007fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
>> #3  0x00007fbeb9e92007 in getaddrinfo () from
>> /lib/x86_64-linux-gnu/libc.so.6
>
> Stefan,
>
> please try to get a backtrace with debugging information.  It is likely
> that this is the make_request/__check_pf functionality in glibc, but it
> would be nice to get some certainty.
>
> Which glibc version do you use?  Has it got a fix for CVE-2013-7423?

It's Debians 2.13-38+deb7u8 Debians issue tracker says it is fixed: 
https://security-tracker.debian.org/tracker/CVE-2013-7423

> So far, the only known cause for a hang in this place (that is, lack of
> return from recvmsg) is incorrect file descriptor use.  (CVE-2013-7423
> is such an issue in glibc itself.)  The kernel upgrade could change
> scheduling behavior, and the actual bug might have been latent before.
>
> Theoretically, recvmsg could also hang if the Netlink query was dropped
> by the kernel, or the final packet in the response was dropped.  We
> never saw that happen, even under extreme load, but I didn't test with
> recent kernels.

The load is very low in this system. Just 30 phones and only 1-6 calling.

> The glibc change Hannes mentioned won't detect the hang, but if there is
> incorrect file descriptor reuse going on, it is possible that the new
> assert catches it.
>
> Florian
>


Stefan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-18 21:18         ` Florian Weimer
  2015-11-18 21:23           ` Stefan Priebe
@ 2015-11-18 21:36           ` Stefan Priebe
  2015-11-18 21:40             ` Hannes Frederic Sowa
  2015-11-19  9:44             ` Florian Weimer
  1 sibling, 2 replies; 45+ messages in thread
From: Stefan Priebe @ 2015-11-18 21:36 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel

Am 18.11.2015 um 22:18 schrieb Florian Weimer:
> On 11/18/2015 09:23 PM, Stefan Priebe wrote:
>>
>> Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
>>> On Tue, 17 Nov 2015, Stefan Priebe wrote:
>>>> I've now also two gdb backtraces from two crashes:
>>>> http://pastebin.com/raw.php?i=yih5jNt8
>>>>
>>>> http://pastebin.com/raw.php?i=kGEcvH4T
>>>
>>> They don't tell me anything as I have no idea of the inner workings of
>>> asterisk. You might be better of to talk to the asterisk folks to help
>>> you track down what that thing is waiting for, so we can actually look
>>> at a well defined area.
>>
>> The asterisk guys told me it's a livelock asterisk is waiting for
>> getaddrinfo / recvmsg.
>>
>> Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
>> #0  0x00007fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
>> #1  0x00007fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
>> #2  0x00007fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
>> #3  0x00007fbeb9e92007 in getaddrinfo () from
>> /lib/x86_64-linux-gnu/libc.so.6
>
> Stefan,
>
> please try to get a backtrace with debugging information.  It is likely
> that this is the make_request/__check_pf functionality in glibc, but it
> would be nice to get some certainty.

sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't 
have ipv6 except for link local. Could it be this one?

https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79

Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
#0  0x00007f295de3287d in recvmsg () at 
../sysdeps/unix/syscall-template.S:82
#1  0x00007f295de52fcc in make_request (fd=35, pid=26631, 
seen_ipv4=<optimized out>, seen_ipv6=<optimized out>,
     in6ai=<optimized out>, in6ailen=<optimized out>) at 
../sysdeps/unix/sysv/linux/check_pf.c:119
#2  0x00007f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f, 
seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,
     in6ailen=0x7f295c00e838) at ../sysdeps/unix/sysv/linux/check_pf.c:271
#3  0x00007f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0 
"10.12.12.55", service=0x7f295c00e8bc "2135",
     hints=0x7f295c00e910, pai=0x7f295c00e908) at 
../sysdeps/posix/getaddrinfo.c:2389
#4  0x000000000050287e in ast_sockaddr_resolve (addrs=0x7f295c00e9d0, 
str=0x7f295c00ea30 "10.12.12.55:2135", flags=0, family=2)
     at netsock2.c:268
#5  0x00007f2958963ba2 in ast_sockaddr_resolve_first_af 
(addr=0x7f29300591d8, name=0x7f295c00ea30 "10.12.12.55:2135", flag=0,
     family=2) at chan_sip.c:30689
#6  0x00007f2958963cb5 in ast_sockaddr_resolve_first_transport 
(addr=0x7f29300591d8, name=0x7f295c00ea30 "10.12.12.55:2135",
     flag=0, transport=1) at chan_sip.c:30720
#7  0x00007f29588fd3cc in set_destination (p=0x7f2930058cc8, 
uri=0x7f29300576e8 "sip:9052@10.12.12.55:2135;line=to7a729l")
     at chan_sip.c:10455
#8  0x00007f29588fe6e0 in reqprep (req=0x7f295c00fee0, p=0x7f2930058cc8, 
sipmethod=4, seqno=287, newbranch=1) at chan_sip.c:10778
#9  0x00007f295890a201 in transmit_state_notify (p=0x7f2930058cc8, 
state=1, full=1, timeout=0) at chan_sip.c:13259
#10 0x00007f29589141bb in cb_extensionstate (context=0x7f295c010cd0 
"hints", exten=0x7f295c010c80 "9052QS", state=1,
     data=0x7f2930058cc8) at chan_sip.c:15117
#11 0x000000000050ebf6 in handle_statechange (datap=0x7f293acef830) at 
pbx.c:4972
#12 0x0000000000555f8e in tps_processing_function (data=0x1f24f28) at 
taskprocessor.c:327
#13 0x0000000000569280 in dummy_start (data=0x1ed76f0) at utils.c:1173
#14 0x00007f295d5dcb50 in start_thread (arg=<optimized out>) at 
pthread_create.c:304
#15 0x00007f295de3195d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#16 0x0000000000000000 in ?? ()

>
> Which glibc version do you use?  Has it got a fix for CVE-2013-7423?
>
> So far, the only known cause for a hang in this place (that is, lack of
> return from recvmsg) is incorrect file descriptor use.  (CVE-2013-7423
> is such an issue in glibc itself.)  The kernel upgrade could change
> scheduling behavior, and the actual bug might have been latent before.
>
> Theoretically, recvmsg could also hang if the Netlink query was dropped
> by the kernel, or the final packet in the response was dropped.  We
> never saw that happen, even under extreme load, but I didn't test with
> recent kernels.
>
> The glibc change Hannes mentioned won't detect the hang, but if there is
> incorrect file descriptor reuse going on, it is possible that the new
> assert catches it.
>
> Florian
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-18 21:36           ` Stefan Priebe
@ 2015-11-18 21:40             ` Hannes Frederic Sowa
  2015-11-18 21:42               ` Stefan Priebe
  2015-11-19  9:44             ` Florian Weimer
  1 sibling, 1 reply; 45+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-18 21:40 UTC (permalink / raw)
  To: Stefan Priebe, Florian Weimer
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel

On Wed, Nov 18, 2015, at 22:36, Stefan Priebe wrote:
> sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't 
> have ipv6 except for link local. Could it be this one?
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
> 
> Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
> #0  0x00007f295de3287d in recvmsg () at 
> ../sysdeps/unix/syscall-template.S:82
> #1  0x00007f295de52fcc in make_request (fd=35, pid=26631, 
> seen_ipv4=<optimized out>, seen_ipv6=<optimized out>,
>      in6ai=<optimized out>, in6ailen=<optimized out>) at 
> ../sysdeps/unix/sysv/linux/check_pf.c:119
> #2  0x00007f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f, 
> seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,
>      in6ailen=0x7f295c00e838) at
>      ../sysdeps/unix/sysv/linux/check_pf.c:271
> #3  0x00007f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0 
> "10.12.12.55", service=0x7f295c00e8bc "2135",
>      hints=0x7f295c00e910, pai=0x7f295c00e908) at 
> ../sysdeps/posix/getaddrinfo.c:2389

Can you also get a /proc/pid/stack stacktrace?

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-18 21:40             ` Hannes Frederic Sowa
@ 2015-11-18 21:42               ` Stefan Priebe
  2015-11-18 21:58                 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Priebe @ 2015-11-18 21:42 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Florian Weimer
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel


Am 18.11.2015 um 22:40 schrieb Hannes Frederic Sowa:
> On Wed, Nov 18, 2015, at 22:36, Stefan Priebe wrote:
>> sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't
>> have ipv6 except for link local. Could it be this one?
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
>>
>> Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
>> #0  0x00007f295de3287d in recvmsg () at
>> ../sysdeps/unix/syscall-template.S:82
>> #1  0x00007f295de52fcc in make_request (fd=35, pid=26631,
>> seen_ipv4=<optimized out>, seen_ipv6=<optimized out>,
>>       in6ai=<optimized out>, in6ailen=<optimized out>) at
>> ../sysdeps/unix/sysv/linux/check_pf.c:119
>> #2  0x00007f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f,
>> seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,
>>       in6ailen=0x7f295c00e838) at
>>       ../sysdeps/unix/sysv/linux/check_pf.c:271
>> #3  0x00007f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0
>> "10.12.12.55", service=0x7f295c00e8bc "2135",
>>       hints=0x7f295c00e910, pai=0x7f295c00e908) at
>> ../sysdeps/posix/getaddrinfo.c:2389
>
> Can you also get a /proc/pid/stack stacktrace?

Sure:

For the thread:
# cat /proc/26631/task/26654/stack
[<ffffffffb34e5cdb>] __skb_recv_datagram+0x52b/0x5d0
[<ffffffffb34e5db2>] skb_recv_datagram+0x32/0x40
[<ffffffffb35227bc>] netlink_recvmsg+0x4c/0x300
[<ffffffffb34d3863>] sock_recvmsg+0x13/0x20
[<ffffffffb34d679e>] ___sys_recvmsg+0xee/0x230
[<ffffffffb34d7739>] __sys_recvmsg+0x49/0x90
[<ffffffffb34d7792>] SyS_recvmsg+0x12/0x20
[<ffffffffb363f1ee>] system_call_fastpath+0x12/0x71
[<ffffffffffffffff>] 0xffffffffffffffff

Stefan

>
> Thanks,
> Hannes
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-18 21:42               ` Stefan Priebe
@ 2015-11-18 21:58                 ` Hannes Frederic Sowa
  0 siblings, 0 replies; 45+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-18 21:58 UTC (permalink / raw)
  To: Stefan Priebe, Florian Weimer
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel

On Wed, Nov 18, 2015, at 22:42, Stefan Priebe wrote:
> 
> Am 18.11.2015 um 22:40 schrieb Hannes Frederic Sowa:
> > On Wed, Nov 18, 2015, at 22:36, Stefan Priebe wrote:
> >> sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't
> >> have ipv6 except for link local. Could it be this one?
> >>
> >> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
> >>
> >> Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
> >> #0  0x00007f295de3287d in recvmsg () at
> >> ../sysdeps/unix/syscall-template.S:82
> >> #1  0x00007f295de52fcc in make_request (fd=35, pid=26631,
> >> seen_ipv4=<optimized out>, seen_ipv6=<optimized out>,
> >>       in6ai=<optimized out>, in6ailen=<optimized out>) at
> >> ../sysdeps/unix/sysv/linux/check_pf.c:119
> >> #2  0x00007f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f,
> >> seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,
> >>       in6ailen=0x7f295c00e838) at
> >>       ../sysdeps/unix/sysv/linux/check_pf.c:271
> >> #3  0x00007f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0
> >> "10.12.12.55", service=0x7f295c00e8bc "2135",
> >>       hints=0x7f295c00e910, pai=0x7f295c00e908) at
> >> ../sysdeps/posix/getaddrinfo.c:2389
> >
> > Can you also get a /proc/pid/stack stacktrace?
> 
> Sure:
> 
> For the thread:
> # cat /proc/26631/task/26654/stack
> [<ffffffffb34e5cdb>] __skb_recv_datagram+0x52b/0x5d0
> [<ffffffffb34e5db2>] skb_recv_datagram+0x32/0x40
> [<ffffffffb35227bc>] netlink_recvmsg+0x4c/0x300
> [<ffffffffb34d3863>] sock_recvmsg+0x13/0x20
> [<ffffffffb34d679e>] ___sys_recvmsg+0xee/0x230
> [<ffffffffb34d7739>] __sys_recvmsg+0x49/0x90
> [<ffffffffb34d7792>] SyS_recvmsg+0x12/0x20
> [<ffffffffb363f1ee>] system_call_fastpath+0x12/0x71
> [<ffffffffffffffff>] 0xffffffffffffffff

Ok, we are definitely in netlink code and thus have a correct(? at least
a netlink) netlink fd which is blocking and waiting for an answer. I
research the glibc code a little bit but I think that the occurrences of
ipv6 parameters are just generic glibc code.

Let me try to reproduce that.

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-18 21:22             ` Hannes Frederic Sowa
@ 2015-11-19  9:35               ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 45+ messages in thread
From: Stefan Priebe - Profihost AG @ 2015-11-19  9:35 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Thomas Gleixner, Florian Weimer
  Cc: netdev, linux-fsdevel, linux-kernel


Am 18.11.2015 um 22:22 schrieb Hannes Frederic Sowa:
> On Wed, Nov 18, 2015, at 22:20, Stefan Priebe wrote:
>> you mean just:
>> la /proc/$pid/fd
> 
> ls -l /proc/pid/fd/
> 
> the numbers in brackets in return from readlink are the inode numbers.
> 
>> and
>>
>> cat /proc/net/netlink
> 
> Exactly, last row is the inode number.

It has stopped locking since i've set:

net.ipv6.conf.all.disable_ipv6=1

Stefan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-18 21:23           ` Stefan Priebe
@ 2015-11-19  9:39             ` Florian Weimer
  0 siblings, 0 replies; 45+ messages in thread
From: Florian Weimer @ 2015-11-19  9:39 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel

On 11/18/2015 10:23 PM, Stefan Priebe wrote:

>> please try to get a backtrace with debugging information.  It is likely
>> that this is the make_request/__check_pf functionality in glibc, but it
>> would be nice to get some certainty.
>>
>> Which glibc version do you use?  Has it got a fix for CVE-2013-7423?
> 
> It's Debians 2.13-38+deb7u8 Debians issue tracker says it is fixed:
> https://security-tracker.debian.org/tracker/CVE-2013-7423

I checked, and the patch is there and applied.

>> Theoretically, recvmsg could also hang if the Netlink query was dropped
>> by the kernel, or the final packet in the response was dropped.  We
>> never saw that happen, even under extreme load, but I didn't test with
>> recent kernels.
> 
> The load is very low in this system. Just 30 phones and only 1-6 calling.

This is rather odd.

I'm not sure if we can do anything simple on the glibc to help to debug
this.  If something in the process consumes the Netlink message from the
kernel (incorrectly reading from the internal glibc file descriptor),
this is fairly difficult detect inside glibc (it would require rather
large changes which do not currently exist at all).

Florian


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-18 21:36           ` Stefan Priebe
  2015-11-18 21:40             ` Hannes Frederic Sowa
@ 2015-11-19  9:44             ` Florian Weimer
  2015-11-19  9:49               ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 45+ messages in thread
From: Florian Weimer @ 2015-11-19  9:44 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel

On 11/18/2015 10:36 PM, Stefan Priebe wrote:

>> please try to get a backtrace with debugging information.  It is likely
>> that this is the make_request/__check_pf functionality in glibc, but it
>> would be nice to get some certainty.
> 
> sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't
> have ipv6 except for link local.

glibc needs to know if the system has global unicast addresses if it
receives AAAA records.

It's curious that net.ipv6.conf.all.disable_ipv6=1 makes the problem go
away.  Even with that setting, the kernel seems to send two Netlink
responses.  So either this is enough to narrow the window for the race
so that no longer triggers, or there is a genuine kernel issue with
supplying the requested IPv6 Netlink response.

> Could it be this one?
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79

No, that's on the DNS/UDP side, not in the Netlink code.

Florian

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-19  9:44             ` Florian Weimer
@ 2015-11-19  9:49               ` Stefan Priebe - Profihost AG
  2015-11-19  9:56                 ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Priebe - Profihost AG @ 2015-11-19  9:49 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel


Am 19.11.2015 um 10:44 schrieb Florian Weimer:
> On 11/18/2015 10:36 PM, Stefan Priebe wrote:
> 
>>> please try to get a backtrace with debugging information.  It is likely
>>> that this is the make_request/__check_pf functionality in glibc, but it
>>> would be nice to get some certainty.
>>
>> sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't
>> have ipv6 except for link local.
> 
> glibc needs to know if the system has global unicast addresses if it
> receives AAAA records.
> 
> It's curious that net.ipv6.conf.all.disable_ipv6=1 makes the problem go
> away.  Even with that setting, the kernel seems to send two Netlink
> responses.  So either this is enough to narrow the window for the race
> so that no longer triggers, or there is a genuine kernel issue with
> supplying the requested IPv6 Netlink response.

No idea it also goes away by downgrading to 3.18 again.

Stefan

>> Could it be this one?
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
> 
> No, that's on the DNS/UDP side, not in the Netlink code.
> 
> Florian
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-19  9:49               ` Stefan Priebe - Profihost AG
@ 2015-11-19  9:56                 ` Stefan Priebe - Profihost AG
  2015-11-19 11:41                   ` Hannes Frederic Sowa
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Priebe - Profihost AG @ 2015-11-19  9:56 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel

OK it had a livelock again. It just took more time.

So here is the data:

# la /proc/2598/fd
total 0
dr-x------ 2 root       root        0 Nov 19 06:53 .
dr-xr-xr-x 7 callweaver callweaver  0 Nov 18 22:38 ..
lrwx------ 1 root       root       64 Nov 19 06:54 0 -> /dev/null
lrwx------ 1 root       root       64 Nov 19 06:54 1 -> /dev/null
lrwx------ 1 root       root       64 Nov 19 06:54 10 -> socket:[12066]
lr-x------ 1 root       root       64 Nov 19 06:54 11 -> anon_inode:inotify
lr-x------ 1 root       root       64 Nov 19 06:54 12 -> pipe:[12181]
l-wx------ 1 root       root       64 Nov 19 06:54 13 -> pipe:[12181]
lrwx------ 1 root       root       64 Nov 19 10:56 14 -> socket:[510853]
lrwx------ 1 root       root       64 Nov 19 10:56 15 -> socket:[510854]
lrwx------ 1 root       root       64 Nov 19 10:56 16 ->
anon_inode:[timerfd]
lr-x------ 1 root       root       64 Nov 19 10:56 17 -> pipe:[510856]
l-wx------ 1 root       root       64 Nov 19 10:56 18 -> pipe:[510856]
lrwx------ 1 root       root       64 Nov 19 10:56 19 -> socket:[208723]
lrwx------ 1 root       root       64 Nov 19 06:54 2 -> /dev/null
l-wx------ 1 root       root       64 Nov 19 10:56 20 ->
/var/log/asterisk/queue_log
lrwx------ 1 root       root       64 Nov 19 10:56 21 -> socket:[199595]
lrwx------ 1 root       root       64 Nov 19 10:56 22 -> socket:[510873]
lr-x------ 1 root       root       64 Nov 19 10:56 23 -> anon_inode:inotify
lrwx------ 1 root       root       64 Nov 19 10:56 24 -> socket:[525349]
lrwx------ 1 root       root       64 Nov 19 10:56 25 -> socket:[525350]
lrwx------ 1 root       root       64 Nov 19 10:56 26 -> socket:[510874]
lrwx------ 1 root       root       64 Nov 19 10:56 27 ->
anon_inode:[timerfd]
lr-x------ 1 root       root       64 Nov 19 10:56 28 -> pipe:[510876]
l-wx------ 1 root       root       64 Nov 19 10:56 29 -> pipe:[510876]
lr-x------ 1 root       root       64 Nov 19 06:54 3 -> /dev/urandom
lrwx------ 1 root       root       64 Nov 19 10:56 30 -> socket:[527569]
lrwx------ 1 root       root       64 Nov 19 10:56 31 -> socket:[527570]
lrwx------ 1 root       root       64 Nov 19 10:56 32 -> socket:[528123]
lrwx------ 1 root       root       64 Nov 19 10:56 33 -> socket:[528124]
lrwx------ 1 root       root       64 Nov 19 10:56 34 -> socket:[530711]
lrwx------ 1 root       root       64 Nov 19 10:56 35 -> socket:[530712]
lrwx------ 1 root       root       64 Nov 19 10:56 36 -> socket:[533366]
lrwx------ 1 root       root       64 Nov 19 10:56 37 -> socket:[533367]
lrwx------ 1 root       root       64 Nov 19 10:56 38 -> socket:[535390]
lrwx------ 1 root       root       64 Nov 19 10:56 39 -> socket:[531056]
lrwx------ 1 root       root       64 Nov 19 06:54 4 -> socket:[11726]
lrwx------ 1 root       root       64 Nov 19 10:56 40 -> socket:[531057]
lrwx------ 1 root       root       64 Nov 19 10:56 41 -> socket:[535391]
lrwx------ 1 root       root       64 Nov 19 10:56 42 -> socket:[537751]
lrwx------ 1 root       root       64 Nov 19 10:56 43 -> socket:[533468]
lrwx------ 1 root       root       64 Nov 19 10:56 44 -> socket:[531154]
lrwx------ 1 root       root       64 Nov 19 10:56 45 -> socket:[531155]
lrwx------ 1 root       root       64 Nov 19 10:56 46 -> socket:[533469]
lrwx------ 1 root       root       64 Nov 19 10:56 47 -> socket:[536172]
lrwx------ 1 root       root       64 Nov 19 10:56 48 -> socket:[536173]
lrwx------ 1 root       root       64 Nov 19 10:56 49 -> socket:[537877]
l-wx------ 1 root       root       64 Nov 19 06:54 5 ->
/var/log/asterisk/messages
lrwx------ 1 root       root       64 Nov 19 10:56 50 -> socket:[537752]
lrwx------ 1 root       root       64 Nov 19 10:56 51 -> socket:[539817]
lrwx------ 1 root       root       64 Nov 19 10:56 52 -> socket:[537878]
lrwx------ 1 root       root       64 Nov 19 10:56 53 -> socket:[539818]
lrwx------ 1 root       root       64 Nov 19 10:56 54 -> socket:[541781]
lrwx------ 1 root       root       64 Nov 19 10:56 55 -> socket:[541782]
lrwx------ 1 root       root       64 Nov 19 10:56 56 -> socket:[543462]
lrwx------ 1 root       root       64 Nov 19 10:56 57 -> socket:[545171]
lrwx------ 1 root       root       64 Nov 19 10:56 58 -> socket:[537432]
lrwx------ 1 root       root       64 Nov 19 10:56 59 -> socket:[537433]
l-wx------ 1 root       root       64 Nov 19 06:54 6 ->
/var/log/asterisk/debug.log
lrwx------ 1 root       root       64 Nov 19 10:56 60 -> socket:[545172]
lrwx------ 1 root       root       64 Nov 19 10:56 61 ->
anon_inode:[timerfd]
lrwx------ 1 root       root       64 Nov 19 10:56 62 -> socket:[541196]
lrwx------ 1 root       root       64 Nov 19 10:56 63 -> socket:[538319]
lrwx------ 1 root       root       64 Nov 19 10:56 64 -> socket:[538320]
lrwx------ 1 root       root       64 Nov 19 10:56 65 -> socket:[474586]
lrwx------ 1 root       root       64 Nov 19 10:56 66 -> socket:[541197]
lrwx------ 1 root       root       64 Nov 19 10:56 67 -> socket:[542437]
lrwx------ 1 root       root       64 Nov 19 10:56 68 -> socket:[542438]
lr-x------ 1 root       root       64 Nov 19 10:56 69 -> pipe:[545174]
lrwx------ 1 root       root       64 Nov 19 06:54 7 ->
/var/lib/asterisk/astdb
lrwx------ 1 root       root       64 Nov 19 10:56 70 -> socket:[543463]
l-wx------ 1 root       root       64 Nov 19 10:56 71 -> pipe:[545174]
lrwx------ 1 root       root       64 Nov 19 10:56 76 -> socket:[543659]
lrwx------ 1 root       root       64 Nov 19 10:56 77 -> socket:[543660]
lrwx------ 1 root       root       64 Nov 19 10:56 78 ->
anon_inode:[timerfd]
lr-x------ 1 root       root       64 Nov 19 10:56 79 -> pipe:[543662]
lrwx------ 1 root       root       64 Nov 19 06:54 8 -> anon_inode:[timerfd]
l-wx------ 1 root       root       64 Nov 19 10:56 80 -> pipe:[543662]
lrwx------ 1 root       root       64 Nov 19 06:54 9 -> socket:[12052]

[asterisksnom: ~]# cat /proc/net/netlink
sk       Eth Pid    Groups   Rmem     Wmem     Dump     Locks     Drops
    Inode
ffff8800bac17000 0   0      00000000 0        0        0 2        0
   3
ffff8800b5ef8000 4   0      00000000 0        0        0 2        0
   6201
ffff8800b71cf000 10  0      00000000 0        0        0 2        0
   5455
ffff8800b7176000 11  0      00000000 0        0        0 2        0
   12
ffff8800b1169000 15  4294962899 00000000 0        0        0 2        0
       7979
ffff8800b16cf000 15  441    00000001 0        0        0 2        0
   1542
ffff8800b1168800 15  4294962900 00000000 0        0        0 2        0
       7978
ffff8800b7088800 15  0      00000000 0        0        0 2        0
   5
ffff8800b71c9800 16  0      00000000 0        0        0 2        0
   15
ffff8800b16ca000 16  2362   00000002 0        0        0 2        0
   12313

Stefan

Am 19.11.2015 um 10:49 schrieb Stefan Priebe - Profihost AG:
> 
> Am 19.11.2015 um 10:44 schrieb Florian Weimer:
>> On 11/18/2015 10:36 PM, Stefan Priebe wrote:
>>
>>>> please try to get a backtrace with debugging information.  It is likely
>>>> that this is the make_request/__check_pf functionality in glibc, but it
>>>> would be nice to get some certainty.
>>>
>>> sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't
>>> have ipv6 except for link local.
>>
>> glibc needs to know if the system has global unicast addresses if it
>> receives AAAA records.
>>
>> It's curious that net.ipv6.conf.all.disable_ipv6=1 makes the problem go
>> away.  Even with that setting, the kernel seems to send two Netlink
>> responses.  So either this is enough to narrow the window for the race
>> so that no longer triggers, or there is a genuine kernel issue with
>> supplying the requested IPv6 Netlink response.
> 
> No idea it also goes away by downgrading to 3.18 again.
> 
> Stefan
> 
>>> Could it be this one?
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
>>
>> No, that's on the DNS/UDP side, not in the Netlink code.
>>
>> Florian
>>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-19  9:56                 ` Stefan Priebe - Profihost AG
@ 2015-11-19 11:41                   ` Hannes Frederic Sowa
  2015-11-19 11:43                     ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 45+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-19 11:41 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Florian Weimer
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel

On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
> OK it had a livelock again. It just took more time.
> 
> So here is the data:

Thanks, I couldn't reproduce it so far with simple threaded resolver
loop on your kernel. :/

Your data is useless if you don't also provide the file descriptor which
you are blocking on right now. ;)

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-19 11:41                   ` Hannes Frederic Sowa
@ 2015-11-19 11:43                     ` Stefan Priebe - Profihost AG
  2015-11-19 12:41                       ` Hannes Frederic Sowa
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Priebe - Profihost AG @ 2015-11-19 11:43 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Florian Weimer
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel


Am 19.11.2015 um 12:41 schrieb Hannes Frederic Sowa:
> On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
>> OK it had a livelock again. It just took more time.
>>
>> So here is the data:
> 
> Thanks, I couldn't reproduce it so far with simple threaded resolver
> loop on your kernel. :/
> 
> Your data is useless if you don't also provide the file descriptor which
> you are blocking on right now. ;)
> 
> Thanks,
> Hannes

ah sorry. So we need the gdb backtrace with the rcvmsg and then the fd
list + netlink list?

Stefan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-19 11:43                     ` Stefan Priebe - Profihost AG
@ 2015-11-19 12:41                       ` Hannes Frederic Sowa
  2015-11-19 12:46                         ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 45+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-19 12:41 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Florian Weimer
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel

On Thu, Nov 19, 2015, at 12:43, Stefan Priebe - Profihost AG wrote:
> 
> Am 19.11.2015 um 12:41 schrieb Hannes Frederic Sowa:
> > On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
> >> OK it had a livelock again. It just took more time.
> >>
> >> So here is the data:
> > 
> > Thanks, I couldn't reproduce it so far with simple threaded resolver
> > loop on your kernel. :/
> > 
> > Your data is useless if you don't also provide the file descriptor which
> > you are blocking on right now. ;)
> > 
> > Thanks,
> > Hannes
> 
> ah sorry. So we need the gdb backtrace with the rcvmsg and then the fd
> list + netlink list?

Yes, albeit the probability we find something new is minimal. We already
know it blocks deep down in netlink code. I tried to reproduce it in
multithreaded environment with no results so far. Can you even try a
newer kernel?

Bye,
Hannes

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-19 12:41                       ` Hannes Frederic Sowa
@ 2015-11-19 12:46                         ` Stefan Priebe - Profihost AG
  2015-11-19 13:19                           ` Florian Weimer
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Priebe - Profihost AG @ 2015-11-19 12:46 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Florian Weimer
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel


Am 19.11.2015 um 13:41 schrieb Hannes Frederic Sowa:
> On Thu, Nov 19, 2015, at 12:43, Stefan Priebe - Profihost AG wrote:
>>
>> Am 19.11.2015 um 12:41 schrieb Hannes Frederic Sowa:
>>> On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
>>>> OK it had a livelock again. It just took more time.
>>>>
>>>> So here is the data:
>>>
>>> Thanks, I couldn't reproduce it so far with simple threaded resolver
>>> loop on your kernel. :/
>>>
>>> Your data is useless if you don't also provide the file descriptor which
>>> you are blocking on right now. ;)
>>>
>>> Thanks,
>>> Hannes
>>
>> ah sorry. So we need the gdb backtrace with the rcvmsg and then the fd
>> list + netlink list?
> 
> Yes, albeit the probability we find something new is minimal. We already
> know it blocks deep down in netlink code. I tried to reproduce it in
> multithreaded environment with no results so far. Can you even try a
> newer kernel?

I can try Kernel 4.4-rc1 next week. Or something else?

Stefan

> 
> Bye,
> Hannes
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-19 12:46                         ` Stefan Priebe - Profihost AG
@ 2015-11-19 13:19                           ` Florian Weimer
  2015-11-19 19:51                             ` Stefan Priebe
  0 siblings, 1 reply; 45+ messages in thread
From: Florian Weimer @ 2015-11-19 13:19 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Hannes Frederic Sowa
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel

On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:

> I can try Kernel 4.4-rc1 next week. Or something else?

I found this bug report which indicates that 4.1.10 works:

  <https://issues.asterisk.org/jira/browse/ASTERISK-25251>

But in your original report, you said that 4.1.13 is broken.

This backtrace:

  <https://issues.asterisk.org/jira/secure/attachment/52702/bt.txt>

shows a lot of waiting on quite different netlink sockets.  So if this
is due to a race in Asterisk, it must have happened several times in a row.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-19 13:19                           ` Florian Weimer
@ 2015-11-19 19:51                             ` Stefan Priebe
  2015-11-23 12:44                               ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Priebe @ 2015-11-19 19:51 UTC (permalink / raw)
  To: Florian Weimer, Hannes Frederic Sowa
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel


Am 19.11.2015 um 14:19 schrieb Florian Weimer:
> On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:
>
>> I can try Kernel 4.4-rc1 next week. Or something else?
>
> I found this bug report which indicates that 4.1.10 works:
>
>    <https://issues.asterisk.org/jira/browse/ASTERISK-25251>
>
> But in your original report, you said that 4.1.13 is broken.

That's correct i'm running 4.1.13.

> This backtrace:
>
>    <https://issues.asterisk.org/jira/secure/attachment/52702/bt.txt>
>
> shows a lot of waiting on quite different netlink sockets.  So if this
> is due to a race in Asterisk, it must have happened several times in a row.

Might be I've always only one waiting.

Stefan


> Thanks,
> Florian
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-19 19:51                             ` Stefan Priebe
@ 2015-11-23 12:44                               ` Stefan Priebe - Profihost AG
  2015-11-23 12:57                                 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Priebe - Profihost AG @ 2015-11-23 12:44 UTC (permalink / raw)
  To: Florian Weimer, Hannes Frederic Sowa
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel

Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
> 
> Am 19.11.2015 um 14:19 schrieb Florian Weimer:
>> On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:
>>
>>> I can try Kernel 4.4-rc1 next week. Or something else?
>>
>> I found this bug report which indicates that 4.1.10 works:
>>
>>    <https://issues.asterisk.org/jira/browse/ASTERISK-25251>
>>
>> But in your original report, you said that 4.1.13 is broken.
> 
> That's correct i'm running 4.1.13.
> 
>> This backtrace:
>>
>>    <https://issues.asterisk.org/jira/secure/attachment/52702/bt.txt>
>>
>> shows a lot of waiting on quite different netlink sockets.  So if this
>> is due to a race in Asterisk, it must have happened several times in a
>> row.

Kernel 4.4-rc2 works fine. How can we grab / get an idea which is
causing the isse in 4.1? It's an LTE kernel so it should be fixed!

Stefan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-23 12:44                               ` Stefan Priebe - Profihost AG
@ 2015-11-23 12:57                                 ` Hannes Frederic Sowa
  2015-11-24 13:35                                   ` Stefan Priebe - Profihost AG
  2015-12-02  9:45                                   ` Stefan Priebe - Profihost AG
  0 siblings, 2 replies; 45+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-23 12:57 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Florian Weimer
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel

On Mon, Nov 23, 2015, at 13:44, Stefan Priebe - Profihost AG wrote:
> Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
> > 
> > Am 19.11.2015 um 14:19 schrieb Florian Weimer:
> >> On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:
> >>
> >>> I can try Kernel 4.4-rc1 next week. Or something else?
> >>
> >> I found this bug report which indicates that 4.1.10 works:
> >>
> >>    <https://issues.asterisk.org/jira/browse/ASTERISK-25251>
> >>
> >> But in your original report, you said that 4.1.13 is broken.
> > 
> > That's correct i'm running 4.1.13.
> > 
> >> This backtrace:
> >>
> >>    <https://issues.asterisk.org/jira/secure/attachment/52702/bt.txt>
> >>
> >> shows a lot of waiting on quite different netlink sockets.  So if this
> >> is due to a race in Asterisk, it must have happened several times in a
> >> row.
> 
> Kernel 4.4-rc2 works fine. How can we grab / get an idea which is
> causing the isse in 4.1? It's an LTE kernel so it should be fixed!

Thanks for testing. I was not able to reproduce it at all, with as much
parallelism and threads as possible on any kernel. Could you try to do a
git bisect?

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-23 12:57                                 ` Hannes Frederic Sowa
@ 2015-11-24 13:35                                   ` Stefan Priebe - Profihost AG
  2015-12-02  9:45                                   ` Stefan Priebe - Profihost AG
  1 sibling, 0 replies; 45+ messages in thread
From: Stefan Priebe - Profihost AG @ 2015-11-24 13:35 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Florian Weimer
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel


Am 23.11.2015 um 13:57 schrieb Hannes Frederic Sowa:
> On Mon, Nov 23, 2015, at 13:44, Stefan Priebe - Profihost AG wrote:
>> Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
>>>
>>> Am 19.11.2015 um 14:19 schrieb Florian Weimer:
>>>> On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:
>>>>
>>>>> I can try Kernel 4.4-rc1 next week. Or something else?
>>>>
>>>> I found this bug report which indicates that 4.1.10 works:
>>>>
>>>>    <https://issues.asterisk.org/jira/browse/ASTERISK-25251>
>>>>
>>>> But in your original report, you said that 4.1.13 is broken.
>>>
>>> That's correct i'm running 4.1.13.
>>>
>>>> This backtrace:
>>>>
>>>>    <https://issues.asterisk.org/jira/secure/attachment/52702/bt.txt>
>>>>
>>>> shows a lot of waiting on quite different netlink sockets.  So if this
>>>> is due to a race in Asterisk, it must have happened several times in a
>>>> row.
>>
>> Kernel 4.4-rc2 works fine. How can we grab / get an idea which is
>> causing the isse in 4.1? It's an LTE kernel so it should be fixed!
> 
> Thanks for testing. I was not able to reproduce it at all, with as much
> parallelism and threads as possible on any kernel. Could you try to do a
> git bisect?

OK i'll start digging around with major versions first to have a working
and a non working version.

I'll try 4.3 and 4.2 first to get an idea where it got fixed. Sadly i
can't bisect from 4.1.13 to 4.X so i have to verify first it also
happens with 4.1.

OK?

Stefan

> 
> Thanks,
> Hannes
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-11-23 12:57                                 ` Hannes Frederic Sowa
  2015-11-24 13:35                                   ` Stefan Priebe - Profihost AG
@ 2015-12-02  9:45                                   ` Stefan Priebe - Profihost AG
  2015-12-02 11:40                                     ` Hannes Frederic Sowa
  2015-12-02 17:15                                     ` Philipp Hahn
  1 sibling, 2 replies; 45+ messages in thread
From: Stefan Priebe - Profihost AG @ 2015-12-02  9:45 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Florian Weimer
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel, herbert

Hi,

here are the results.

It works with 4.1.
It works with 4.2.
It does not work with 4.1.13.

git bisect tells me it stopped working after those two commits were applied:

commit d48623677191e0f035d7afd344f92cf880b01f8e
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Tue Sep 22 11:38:56 2015 +0800

    netlink: Replace rhash_portid with bound

commit 4e27762417669cb459971635be550eb7b5598286
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Fri Sep 18 19:16:50 2015 +0800

    netlink: Fix autobind race condition that leads to zero port ID

Stefan
Am 23.11.2015 um 13:57 schrieb Hannes Frederic Sowa:
> On Mon, Nov 23, 2015, at 13:44, Stefan Priebe - Profihost AG wrote:
>> Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
>>>
>>> Am 19.11.2015 um 14:19 schrieb Florian Weimer:
>>>> On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:
>>>>
>>>>> I can try Kernel 4.4-rc1 next week. Or something else?
>>>>
>>>> I found this bug report which indicates that 4.1.10 works:
>>>>
>>>>    <https://issues.asterisk.org/jira/browse/ASTERISK-25251>
>>>>
>>>> But in your original report, you said that 4.1.13 is broken.
>>>
>>> That's correct i'm running 4.1.13.
>>>
>>>> This backtrace:
>>>>
>>>>    <https://issues.asterisk.org/jira/secure/attachment/52702/bt.txt>
>>>>
>>>> shows a lot of waiting on quite different netlink sockets.  So if this
>>>> is due to a race in Asterisk, it must have happened several times in a
>>>> row.
>>
>> Kernel 4.4-rc2 works fine. How can we grab / get an idea which is
>> causing the isse in 4.1? It's an LTE kernel so it should be fixed!
> 
> Thanks for testing. I was not able to reproduce it at all, with as much
> parallelism and threads as possible on any kernel. Could you try to do a
> git bisect?
> 
> Thanks,
> Hannes
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-12-02  9:45                                   ` Stefan Priebe - Profihost AG
@ 2015-12-02 11:40                                     ` Hannes Frederic Sowa
  2015-12-02 17:51                                       ` Philipp Hahn
                                                         ` (3 more replies)
  2015-12-02 17:15                                     ` Philipp Hahn
  1 sibling, 4 replies; 45+ messages in thread
From: Hannes Frederic Sowa @ 2015-12-02 11:40 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Florian Weimer
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel, herbert

Hello Stefan,

Stefan Priebe - Profihost AG <s.priebe@profihost.ag> writes:


> here are the results.
>
> It works with 4.1.
> It works with 4.2.
> It does not work with 4.1.13.
>
> git bisect tells me it stopped working after those two commits were applied:
>
> commit d48623677191e0f035d7afd344f92cf880b01f8e
> Author: Herbert Xu <herbert@gondor.apana.org.au>
> Date:   Tue Sep 22 11:38:56 2015 +0800
>
>     netlink: Replace rhash_portid with bound
>
> commit 4e27762417669cb459971635be550eb7b5598286
> Author: Herbert Xu <herbert@gondor.apana.org.au>
> Date:   Fri Sep 18 19:16:50 2015 +0800
>
>     netlink: Fix autobind race condition that leads to zero port ID

Cool, thanks a lot. Does this patch make a difference?

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 59651af..278e94c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
 
 	/* We need to ensure that the socket is hashed and visible. */
 	smp_wmb();
-	nlk_sk(sk)->bound = portid;
+	nlk_sk(sk)->bound = true;
 
 err:
 	release_sock(sk);

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-12-02  9:45                                   ` Stefan Priebe - Profihost AG
  2015-12-02 11:40                                     ` Hannes Frederic Sowa
@ 2015-12-02 17:15                                     ` Philipp Hahn
  2015-12-02 18:23                                       ` Hannes Frederic Sowa
  1 sibling, 1 reply; 45+ messages in thread
From: Philipp Hahn @ 2015-12-02 17:15 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Hannes Frederic Sowa, Florian Weimer
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel, herbert

Hi,

Am 02.12.2015 um 10:45 schrieb Stefan Priebe - Profihost AG:
> here are the results.
> 
> It works with 4.1.
> It works with 4.2.
> It does not work with 4.1.13.

the patches were first commitet in v4.3-rc3 and appear as backports only
since v4.2.3 and v4.1.10

> git bisect tells me it stopped working after those two commits were applied:
> 
> commit d48623677191e0f035d7afd344f92cf880b01f8e
> Author: Herbert Xu <herbert@gondor.apana.org.au>
> Date:   Tue Sep 22 11:38:56 2015 +0800
> 
>     netlink: Replace rhash_portid with bound
> 
> commit 4e27762417669cb459971635be550eb7b5598286
> Author: Herbert Xu <herbert@gondor.apana.org.au>
> Date:   Fri Sep 18 19:16:50 2015 +0800
> 
>     netlink: Fix autobind race condition that leads to zero port ID

I identified the same two patches for our prpblem; see mail from
2015-11-17 16:53 with subject "Strange PF_NETLINK NETLINK_ROUTE stall:
netlink: Fix autobind race condition that leads to zero port ID"

<http://marc.info/?l=linux-kernel&m=144863837825887&w=2>

Philipp

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-12-02 11:40                                     ` Hannes Frederic Sowa
@ 2015-12-02 17:51                                       ` Philipp Hahn
  2015-12-03  8:23                                       ` Stefan Priebe - Profihost AG
                                                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: Philipp Hahn @ 2015-12-02 17:51 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Stefan Priebe - Profihost AG, Florian Weimer
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel, herbert

Hi,

Am 02.12.2015 um 12:40 schrieb Hannes Frederic Sowa:
> Cool, thanks a lot. Does this patch make a difference?
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 59651af..278e94c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
>  
>  	/* We need to ensure that the socket is hashed and visible. */
>  	smp_wmb();
> -	nlk_sk(sk)->bound = portid;
> +	nlk_sk(sk)->bound = true;
>  
>  err:
>  	release_sock(sk);

Will test that tomorrow.

What confuss me is that it works with 4.3. I compared
net/netlink/af_netlink.[ch] for differences between v4.1.10 and v4.3 and
didn't find anything which would explain only 4,1 getting stuck.

Philipp

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-12-02 17:15                                     ` Philipp Hahn
@ 2015-12-02 18:23                                       ` Hannes Frederic Sowa
  0 siblings, 0 replies; 45+ messages in thread
From: Hannes Frederic Sowa @ 2015-12-02 18:23 UTC (permalink / raw)
  To: Philipp Hahn, Stefan Priebe - Profihost AG, Florian Weimer
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel, herbert

Hello,

On Wed, Dec 2, 2015, at 18:15, Philipp Hahn wrote:
> Hi,
> 
> Am 02.12.2015 um 10:45 schrieb Stefan Priebe - Profihost AG:
> > here are the results.
> > 
> > It works with 4.1.
> > It works with 4.2.
> > It does not work with 4.1.13.
> 
> the patches were first commitet in v4.3-rc3 and appear as backports only
> since v4.2.3 and v4.1.10
> 
> > git bisect tells me it stopped working after those two commits were applied:
> > 
> > commit d48623677191e0f035d7afd344f92cf880b01f8e
> > Author: Herbert Xu <herbert@gondor.apana.org.au>
> > Date:   Tue Sep 22 11:38:56 2015 +0800
> > 
> >     netlink: Replace rhash_portid with bound
> > 
> > commit 4e27762417669cb459971635be550eb7b5598286
> > Author: Herbert Xu <herbert@gondor.apana.org.au>
> > Date:   Fri Sep 18 19:16:50 2015 +0800
> > 
> >     netlink: Fix autobind race condition that leads to zero port ID
> 
> I identified the same two patches for our prpblem; see mail from
> 2015-11-17 16:53 with subject "Strange PF_NETLINK NETLINK_ROUTE stall:
> netlink: Fix autobind race condition that leads to zero port ID"
> 
> <http://marc.info/?l=linux-kernel&m=144863837825887&w=2>

Thanks, I missed this mail. I have a further look. I couldn't reproduce
it myself but will check if your test program does it.

Thanks a lot,
Hannes

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-12-02 11:40                                     ` Hannes Frederic Sowa
  2015-12-02 17:51                                       ` Philipp Hahn
@ 2015-12-03  8:23                                       ` Stefan Priebe - Profihost AG
  2015-12-04 18:26                                       ` Stefan Priebe
  2015-12-05 14:19                                       ` Philipp Matthias Hahn
  3 siblings, 0 replies; 45+ messages in thread
From: Stefan Priebe - Profihost AG @ 2015-12-03  8:23 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Florian Weimer, Thomas Gleixner, netdev, linux-fsdevel,
	linux-kernel, herbert

> Am 02.12.2015 um 12:40 schrieb Hannes Frederic Sowa <hannes@stressinduktion.org>:
> 
> Hello Stefan,
> 
> Stefan Priebe - Profihost AG <s.priebe@profihost.ag> writes:
> 
> 
>> here are the results.
>> 
>> It works with 4.1.
>> It works with 4.2.
>> It does not work with 4.1.13.
>> 
>> git bisect tells me it stopped working after those two commits were applied:
>> 
>> commit d48623677191e0f035d7afd344f92cf880b01f8e
>> Author: Herbert Xu <herbert@gondor.apana.org.au>
>> Date:   Tue Sep 22 11:38:56 2015 +0800
>> 
>>    netlink: Replace rhash_portid with bound
>> 
>> commit 4e27762417669cb459971635be550eb7b5598286
>> Author: Herbert Xu <herbert@gondor.apana.org.au>
>> Date:   Fri Sep 18 19:16:50 2015 +0800
>> 
>>    netlink: Fix autobind race condition that leads to zero port ID
> 
> Cool, thanks a lot. Does this patch make a difference?
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 59651af..278e94c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
> 
>    /* We need to ensure that the socket is hashed and visible. */
>    smp_wmb();
> -    nlk_sk(sk)->bound = portid;
> +    nlk_sk(sk)->bound = true;
> 
> err:
>    release_sock(sk);

Will try that soon but already started another test. I reverted all netlink patches from 4.1.13 back to 4.1 and applied all netlink patches from 4.1 up to 4.3.

Greets,
Stefan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-12-02 11:40                                     ` Hannes Frederic Sowa
  2015-12-02 17:51                                       ` Philipp Hahn
  2015-12-03  8:23                                       ` Stefan Priebe - Profihost AG
@ 2015-12-04 18:26                                       ` Stefan Priebe
  2015-12-05  1:08                                         ` Herbert Xu
  2015-12-05 14:19                                       ` Philipp Matthias Hahn
  3 siblings, 1 reply; 45+ messages in thread
From: Stefan Priebe @ 2015-12-04 18:26 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Florian Weimer
  Cc: Thomas Gleixner, netdev, linux-fsdevel, linux-kernel, herbert

Hi,

I got it fixed / at least not live / deadlocking by doing applying the 
following patch - which is the diff of the commits below on top of 4.1.13.

patch:
http://pastebin.com/raw.php?i=hiuq4bsW

all commits / changes in reverse order:

* 0ceb380 - (6 weeks ago) netlink: fix locking around 
NETLINK_LIST_MEMBERSHIPS - David Herrmann (HEAD)
* c3f272b - (7 weeks ago) netlink: Trim skb to alloc size to avoid 
MSG_TRUNC - Arad, Ronen
* 9f87e0c - (2 months ago) netlink: Replace rhash_portid with bound - 
Herbert Xu
* 35e9890 - (3 months ago) netlink: Fix autobind race condition that 
leads to zero port ID - Herbert Xu
* f1d1215 - (3 months ago) netlink, mmap: transform mmap skb into full 
skb on taps - Daniel Borkmann
* faad871 - (3 months ago) netlink, mmap: fix edge-case leakages in nf 
queue zero-copy - Daniel Borkmann
* fb18c94 - (3 months ago) netlink, mmap: don't walk rx ring on poll if 
receive queue non-empty - Daniel Borkmann
* da13789 - (3 months ago) netlink: rx mmap: fix POLLIN condition - 
Ken-ichirou MATSUZAWA
* 808071f - (3 months ago) netlink: mmap: fix lookup frame position - 
Ken-ichirou MATSUZAWA
* 589bfd5 - (3 months ago) netlink: add NETLINK_CAP_ACK socket option - 
Christophe Ricard
* d23c4eb - (4 months ago) netlink: mmap: fix tx type check - 
Ken-ichirou MATSUZAWA
* 5dcc50a - (4 months ago) netlink: make sure -EBUSY won't escape from 
netlink_insert - Daniel Borkmann
* ada2b3e - (5 months ago) netlink: don't hold mutex in rcu callback 
when releasing mmapd ring - Florian Westphal
* e0f54a3 - (5 months ago) netlink: Delete an unnecessary check before 
the function call "module_put" - Markus Elfring
* 0a5bdaf - (6 months ago) netlink: add API to retrieve all group 
memberships - David Herrmann
* 30c6472 - (7 months ago) netlink: Use random autobind rover - Herbert Xu
* 021a670 - (7 months ago) netlink: Create kernel netlink sockets in the 
proper network namespace - Eric W. Biederman
* e1b01b4 - (7 months ago) net: Pass kern from net_proto_family.create 
to sk_alloc - Eric W. Biederman
* dd4b3c9 - (7 months ago) netlink: rename private flags and states - 
Nicolas Dichtel
* 0356126 - (2 days ago) Revert "netlink: don't hold mutex in rcu 
callback when releasing mmapd ring" - Stefan Priebe
* 231d0da - (2 days ago) Revert "netlink: make sure -EBUSY won't escape 
from netlink_insert" - Stefan Priebe
* e0f56af1 - (2 days ago) Revert "netlink, mmap: transform mmap skb into 
full skb on taps" - Stefan Priebe
* 23a0326 - (2 days ago) Revert "netlink: Fix autobind race condition 
that leads to zero port ID" - Stefan Priebe
* 97f4677 - (2 days ago) Revert "netlink: Replace rhash_portid with 
bound" - Stefan Priebe
* 40c851fe - (2 days ago) Revert "netlink: Trim skb to alloc size to 
avoid MSG_TRUNC" - Stefan Priebe
* 1f2ce4a - (4 weeks ago) Linux 4.1.13 - Greg Kroah-Hartman (v4.1.13, 
origin/linux-4.1.y)

So the netlink code is in line with 4.3.

Stefan

Am 02.12.2015 um 12:40 schrieb Hannes Frederic Sowa:
> Hello Stefan,
>
> Stefan Priebe - Profihost AG <s.priebe@profihost.ag> writes:
>
>
>> here are the results.
>>
>> It works with 4.1.
>> It works with 4.2.
>> It does not work with 4.1.13.
>>
>> git bisect tells me it stopped working after those two commits were applied:
>>
>> commit d48623677191e0f035d7afd344f92cf880b01f8e
>> Author: Herbert Xu <herbert@gondor.apana.org.au>
>> Date:   Tue Sep 22 11:38:56 2015 +0800
>>
>>      netlink: Replace rhash_portid with bound
>>
>> commit 4e27762417669cb459971635be550eb7b5598286
>> Author: Herbert Xu <herbert@gondor.apana.org.au>
>> Date:   Fri Sep 18 19:16:50 2015 +0800
>>
>>      netlink: Fix autobind race condition that leads to zero port ID
>
> Cool, thanks a lot. Does this patch make a difference?
>
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 59651af..278e94c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
>
>   	/* We need to ensure that the socket is hashed and visible. */
>   	smp_wmb();
> -	nlk_sk(sk)->bound = portid;
> +	nlk_sk(sk)->bound = true;
>
>   err:
>   	release_sock(sk);
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-12-04 18:26                                       ` Stefan Priebe
@ 2015-12-05  1:08                                         ` Herbert Xu
  2015-12-06 20:56                                           ` Stefan Priebe
  0 siblings, 1 reply; 45+ messages in thread
From: Herbert Xu @ 2015-12-05  1:08 UTC (permalink / raw)
  To: Stefan Priebe
  Cc: Hannes Frederic Sowa, Florian Weimer, Thomas Gleixner, netdev,
	linux-fsdevel, linux-kernel

On Fri, Dec 04, 2015 at 07:26:12PM +0100, Stefan Priebe wrote:
>
> * 9f87e0c - (2 months ago) netlink: Replace rhash_portid with bound
> - Herbert Xu
> * 35e9890 - (3 months ago) netlink: Fix autobind race condition that
> leads to zero port ID - Herbert Xu
> * 30c6472 - (7 months ago) netlink: Use random autobind rover - Herbert Xu

These three patches are absolutely required in any kernel where the
netlink insertion is lockless.  So yes they should be applied to
4.1.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-12-02 11:40                                     ` Hannes Frederic Sowa
                                                         ` (2 preceding siblings ...)
  2015-12-04 18:26                                       ` Stefan Priebe
@ 2015-12-05 14:19                                       ` Philipp Matthias Hahn
  2015-12-05 15:34                                         ` Stefan Priebe
  3 siblings, 1 reply; 45+ messages in thread
From: Philipp Matthias Hahn @ 2015-12-05 14:19 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Stefan Priebe - Profihost AG, Florian Weimer, Thomas Gleixner,
	netdev, linux-fsdevel, linux-kernel, herbert

Hello Hannes,

On Wed, Dec 02, 2015 at 12:40:32PM +0100, Hannes Frederic Sowa wrote:
> > git bisect tells me it stopped working after those two commits were applied:
> >
> > commit d48623677191e0f035d7afd344f92cf880b01f8e
> > Author: Herbert Xu <herbert@gondor.apana.org.au>
> > Date:   Tue Sep 22 11:38:56 2015 +0800
> >
> >     netlink: Replace rhash_portid with bound
> >
> > commit 4e27762417669cb459971635be550eb7b5598286
> > Author: Herbert Xu <herbert@gondor.apana.org.au>
> > Date:   Fri Sep 18 19:16:50 2015 +0800
> >
> >     netlink: Fix autobind race condition that leads to zero port ID
> 
> Cool, thanks a lot. Does this patch make a difference?
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 59651af..278e94c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
>  
>  	/* We need to ensure that the socket is hashed and visible. */
>  	smp_wmb();
> -	nlk_sk(sk)->bound = portid;
> +	nlk_sk(sk)->bound = true;
>  
>  err:
>  	release_sock(sk);

Didn't help, test program still gets stuck; sorry :-(

Philipp

-- 
  / /  (_)__  __ ____  __ Philipp Hahn
 / /__/ / _ \/ // /\ \/ /
/____/_/_//_/\_,_/ /_/\_\ pmhahn@pmhahn.de

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-12-05 14:19                                       ` Philipp Matthias Hahn
@ 2015-12-05 15:34                                         ` Stefan Priebe
  0 siblings, 0 replies; 45+ messages in thread
From: Stefan Priebe @ 2015-12-05 15:34 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Florian Weimer, Thomas Gleixner, netdev,
	linux-fsdevel, linux-kernel, herbert

Hello Philipp,

Am 05.12.2015 um 15:19 schrieb Philipp Matthias Hahn:
> Hello Hannes,
>
> On Wed, Dec 02, 2015 at 12:40:32PM +0100, Hannes Frederic Sowa wrote:
>>> git bisect tells me it stopped working after those two commits were applied:
>>>
>>> commit d48623677191e0f035d7afd344f92cf880b01f8e
>>> Author: Herbert Xu <herbert@gondor.apana.org.au>
>>> Date:   Tue Sep 22 11:38:56 2015 +0800
>>>
>>>      netlink: Replace rhash_portid with bound
>>>
>>> commit 4e27762417669cb459971635be550eb7b5598286
>>> Author: Herbert Xu <herbert@gondor.apana.org.au>
>>> Date:   Fri Sep 18 19:16:50 2015 +0800
>>>
>>>      netlink: Fix autobind race condition that leads to zero port ID
>>
>> Cool, thanks a lot. Does this patch make a difference?
>>
>> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
>> index 59651af..278e94c 100644
>> --- a/net/netlink/af_netlink.c
>> +++ b/net/netlink/af_netlink.c
>> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
>>
>>   	/* We need to ensure that the socket is hashed and visible. */
>>   	smp_wmb();
>> -	nlk_sk(sk)->bound = portid;
>> +	nlk_sk(sk)->bound = true;
>>
>>   err:
>>   	release_sock(sk);
>
> Didn't help, test program still gets stuck; sorry :-(

Can you please try this patch on top of 4.1.13 (which helps me):
http://pastebin.com/raw.php?i=hiuq4bsW

Stefan

>
> Philipp
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-12-05  1:08                                         ` Herbert Xu
@ 2015-12-06 20:56                                           ` Stefan Priebe
  2015-12-07  1:20                                             ` Herbert Xu
  2015-12-07  7:41                                             ` Asterisk deadlocks since Kernel 4.1 Philipp Hahn
  0 siblings, 2 replies; 45+ messages in thread
From: Stefan Priebe @ 2015-12-06 20:56 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Hannes Frederic Sowa, Florian Weimer, Thomas Gleixner, netdev,
	linux-fsdevel, linux-kernel, jannik Winkel, Marco Schinkel

Hi Herbert,

i think i found the issue in 4.1 with netlink. Somebody made a mistake 
while backporting or cherry-picking your patch "netlink: Fix autobind 
race condition that leads to zero port ID" to 4.1.

It misses a goto in 4.1.

This goto is missing in 4.1:

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 4017e12..f15c001 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
                 if (err == -EEXIST)
                         err = -EADDRINUSE;
                 sock_put(sk);
-               goto err;
         }

         /* We need to ensure that the socket is hashed and visible. */

Can you please confirm, that this is not correct and might cause those 
issues.

Stefan

Am 05.12.2015 um 02:08 schrieb Herbert Xu:
> On Fri, Dec 04, 2015 at 07:26:12PM +0100, Stefan Priebe wrote:
>>
>> * 9f87e0c - (2 months ago) netlink: Replace rhash_portid with bound
>> - Herbert Xu
>> * 35e9890 - (3 months ago) netlink: Fix autobind race condition that
>> leads to zero port ID - Herbert Xu
>> * 30c6472 - (7 months ago) netlink: Use random autobind rover - Herbert Xu
>
> These three patches are absolutely required in any kernel where the
> netlink insertion is lockless.  So yes they should be applied to
> 4.1.
>
> Thanks,
>

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-12-06 20:56                                           ` Stefan Priebe
@ 2015-12-07  1:20                                             ` Herbert Xu
  2015-12-07  6:58                                               ` Stefan Priebe - Profihost AG
  2015-12-07  7:41                                             ` Asterisk deadlocks since Kernel 4.1 Philipp Hahn
  1 sibling, 1 reply; 45+ messages in thread
From: Herbert Xu @ 2015-12-07  1:20 UTC (permalink / raw)
  To: Stefan Priebe
  Cc: Hannes Frederic Sowa, Florian Weimer, Thomas Gleixner, netdev,
	linux-fsdevel, linux-kernel, jannik Winkel, Marco Schinkel,
	David S. Miller, stable

On Sun, Dec 06, 2015 at 09:56:34PM +0100, Stefan Priebe wrote:
> Hi Herbert,
> 
> i think i found the issue in 4.1 with netlink. Somebody made a
> mistake while backporting or cherry-picking your patch "netlink: Fix
> autobind race condition that leads to zero port ID" to 4.1.
> 
> It misses a goto in 4.1.
> 
> This goto is missing in 4.1:
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 4017e12..f15c001 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
>                 if (err == -EEXIST)
>                         err = -EADDRINUSE;
>                 sock_put(sk);
> -               goto err;
>         }
> 
>         /* We need to ensure that the socket is hashed and visible. */
> 
> Can you please confirm, that this is not correct and might cause
> those issues.

Well spotted! Yes this would be a fatal error and can cause the
problems you guys are seeing.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-12-07  1:20                                             ` Herbert Xu
@ 2015-12-07  6:58                                               ` Stefan Priebe - Profihost AG
  2015-12-08  6:13                                                 ` netlink: Add missing goto statement to netlink_insert Herbert Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Priebe - Profihost AG @ 2015-12-07  6:58 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Hannes Frederic Sowa, Florian Weimer, Thomas Gleixner, netdev,
	linux-fsdevel, linux-kernel, jannik Winkel, Marco Schinkel,
	David S. Miller, stable

Hi Herbert,

Am 07.12.2015 um 02:20 schrieb Herbert Xu:
> On Sun, Dec 06, 2015 at 09:56:34PM +0100, Stefan Priebe wrote:
>> Hi Herbert,
>>
>> i think i found the issue in 4.1 with netlink. Somebody made a
>> mistake while backporting or cherry-picking your patch "netlink: Fix
>> autobind race condition that leads to zero port ID" to 4.1.
>>
>> It misses a goto in 4.1.
>>
>> This goto is missing in 4.1:
>>
>> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
>> index 4017e12..f15c001 100644
>> --- a/net/netlink/af_netlink.c
>> +++ b/net/netlink/af_netlink.c
>> @@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
>>                 if (err == -EEXIST)
>>                         err = -EADDRINUSE;
>>                 sock_put(sk);
>> -               goto err;
>>         }
>>
>>         /* We need to ensure that the socket is hashed and visible. */
>>
>> Can you please confirm, that this is not correct and might cause
>> those issues.
> 
> Well spotted! Yes this would be a fatal error and can cause the
> problems you guys are seeing.

Thanks, good. Can you help me to get this fix upstream into the stable
lines?

Stefan

> 
> Thanks,
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Asterisk deadlocks since Kernel 4.1
  2015-12-06 20:56                                           ` Stefan Priebe
  2015-12-07  1:20                                             ` Herbert Xu
@ 2015-12-07  7:41                                             ` Philipp Hahn
  1 sibling, 0 replies; 45+ messages in thread
From: Philipp Hahn @ 2015-12-07  7:41 UTC (permalink / raw)
  To: Stefan Priebe, Herbert Xu
  Cc: Hannes Frederic Sowa, Florian Weimer, Thomas Gleixner, netdev,
	linux-fsdevel, linux-kernel, jannik Winkel, Marco Schinkel

Hello Stefan,

Am 06.12.2015 um 21:56 schrieb Stefan Priebe:
> i think i found the issue in 4.1 with netlink. Somebody made a mistake
> while backporting or cherry-picking your patch "netlink: Fix autobind
> race condition that leads to zero port ID" to 4.1.
> 
> It misses a goto in 4.1.
> 
> This goto is missing in 4.1:
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 4017e12..f15c001 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32
> portid)
>                 if (err == -EEXIST)
>                         err = -EADDRINUSE;
>                 sock_put(sk);
> -               goto err;
>         }
> 
>         /* We need to ensure that the socket is hashed and visible. */
> 
> Can you please confirm, that this is not correct and might cause those
> issues.

I just tested that patch and it seems to fix our hang. Thank you for
your good work.

Philipp

PS: I guess I can skip testing your other test request as this simple
patch is part of your other hiuq4bsW patch. If I should still test it,
just send a note.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* netlink: Add missing goto statement to netlink_insert
  2015-12-07  6:58                                               ` Stefan Priebe - Profihost AG
@ 2015-12-08  6:13                                                 ` Herbert Xu
  2015-12-08 16:21                                                   ` David Miller
  2015-12-09  3:29                                                   ` Greg KH
  0 siblings, 2 replies; 45+ messages in thread
From: Herbert Xu @ 2015-12-08  6:13 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Hannes Frederic Sowa, Florian Weimer, Thomas Gleixner, netdev,
	linux-fsdevel, linux-kernel, jannik Winkel, Marco Schinkel,
	David S. Miller, stable

On Mon, Dec 07, 2015 at 07:58:25AM +0100, Stefan Priebe - Profihost AG wrote:
>
> Thanks, good. Can you help me to get this fix upstream into the stable
> lines?

Sure.  Greg, please apply this patch to fix up the backport for 4.1.

---8<---
The backport of 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
Fix autobind race condition that leads to zero port ID") missed a
goto statement, which causes netlink to break subtly.

This was discovered by Stefan Priebe <s.priebe@profihost.ag>.

Fixes: 4e2776241766 ("netlink: Fix autobind race condition that...")
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Reported-by: Philipp Hahn <pmhahn@pmhahn.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index d139c43..0d6038c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1118,6 +1118,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
 		if (err == -EEXIST)
 			err = -EADDRINUSE;
 		sock_put(sk);
+		goto err;
 	}
 
 	/* We need to ensure that the socket is hashed and visible. */

-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: netlink: Add missing goto statement to netlink_insert
  2015-12-08  6:13                                                 ` netlink: Add missing goto statement to netlink_insert Herbert Xu
@ 2015-12-08 16:21                                                   ` David Miller
  2015-12-09  3:29                                                   ` Greg KH
  1 sibling, 0 replies; 45+ messages in thread
From: David Miller @ 2015-12-08 16:21 UTC (permalink / raw)
  To: herbert
  Cc: s.priebe, hannes, fweimer, tglx, netdev, linux-fsdevel,
	linux-kernel, j.winkel, m.schinkel, stable

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 8 Dec 2015 14:13:19 +0800

> On Mon, Dec 07, 2015 at 07:58:25AM +0100, Stefan Priebe - Profihost AG wrote:
>>
>> Thanks, good. Can you help me to get this fix upstream into the stable
>> lines?
> 
> Sure.  Greg, please apply this patch to fix up the backport for 4.1.

Herbert, thanks so much for taking care of this.

> ---8<---
> The backport of 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
> Fix autobind race condition that leads to zero port ID") missed a
> goto statement, which causes netlink to break subtly.
> 
> This was discovered by Stefan Priebe <s.priebe@profihost.ag>.
> 
> Fixes: 4e2776241766 ("netlink: Fix autobind race condition that...")
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Reported-by: Philipp Hahn <pmhahn@pmhahn.de>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Acked-by: David S. Miller <davem@davemloft.net>

> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index d139c43..0d6038c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1118,6 +1118,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
>  		if (err == -EEXIST)
>  			err = -EADDRINUSE;
>  		sock_put(sk);
> +		goto err;
>  	}
>  
>  	/* We need to ensure that the socket is hashed and visible. */
> 
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: netlink: Add missing goto statement to netlink_insert
  2015-12-08  6:13                                                 ` netlink: Add missing goto statement to netlink_insert Herbert Xu
  2015-12-08 16:21                                                   ` David Miller
@ 2015-12-09  3:29                                                   ` Greg KH
  1 sibling, 0 replies; 45+ messages in thread
From: Greg KH @ 2015-12-09  3:29 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Stefan Priebe - Profihost AG, Hannes Frederic Sowa,
	Florian Weimer, Thomas Gleixner, netdev, linux-fsdevel,
	linux-kernel, jannik Winkel, Marco Schinkel, David S. Miller,
	stable

On Tue, Dec 08, 2015 at 02:13:19PM +0800, Herbert Xu wrote:
> On Mon, Dec 07, 2015 at 07:58:25AM +0100, Stefan Priebe - Profihost AG wrote:
> >
> > Thanks, good. Can you help me to get this fix upstream into the stable
> > lines?
> 
> Sure.  Greg, please apply this patch to fix up the backport for 4.1.

Now queued up, thanks for the fix.

greg k-h

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2015-12-09  3:29 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-17 14:44 Asterisk deadlocks since Kernel 4.1 Stefan Priebe - Profihost AG
2015-11-17 19:15 ` Thomas Gleixner
2015-11-17 19:27   ` Stefan Priebe
2015-11-17 19:43     ` Thomas Gleixner
2015-11-18 20:23       ` Stefan Priebe
2015-11-18 21:00         ` Hannes Frederic Sowa
2015-11-18 21:20           ` Stefan Priebe
2015-11-18 21:22             ` Hannes Frederic Sowa
2015-11-19  9:35               ` Stefan Priebe - Profihost AG
2015-11-18 21:18         ` Florian Weimer
2015-11-18 21:23           ` Stefan Priebe
2015-11-19  9:39             ` Florian Weimer
2015-11-18 21:36           ` Stefan Priebe
2015-11-18 21:40             ` Hannes Frederic Sowa
2015-11-18 21:42               ` Stefan Priebe
2015-11-18 21:58                 ` Hannes Frederic Sowa
2015-11-19  9:44             ` Florian Weimer
2015-11-19  9:49               ` Stefan Priebe - Profihost AG
2015-11-19  9:56                 ` Stefan Priebe - Profihost AG
2015-11-19 11:41                   ` Hannes Frederic Sowa
2015-11-19 11:43                     ` Stefan Priebe - Profihost AG
2015-11-19 12:41                       ` Hannes Frederic Sowa
2015-11-19 12:46                         ` Stefan Priebe - Profihost AG
2015-11-19 13:19                           ` Florian Weimer
2015-11-19 19:51                             ` Stefan Priebe
2015-11-23 12:44                               ` Stefan Priebe - Profihost AG
2015-11-23 12:57                                 ` Hannes Frederic Sowa
2015-11-24 13:35                                   ` Stefan Priebe - Profihost AG
2015-12-02  9:45                                   ` Stefan Priebe - Profihost AG
2015-12-02 11:40                                     ` Hannes Frederic Sowa
2015-12-02 17:51                                       ` Philipp Hahn
2015-12-03  8:23                                       ` Stefan Priebe - Profihost AG
2015-12-04 18:26                                       ` Stefan Priebe
2015-12-05  1:08                                         ` Herbert Xu
2015-12-06 20:56                                           ` Stefan Priebe
2015-12-07  1:20                                             ` Herbert Xu
2015-12-07  6:58                                               ` Stefan Priebe - Profihost AG
2015-12-08  6:13                                                 ` netlink: Add missing goto statement to netlink_insert Herbert Xu
2015-12-08 16:21                                                   ` David Miller
2015-12-09  3:29                                                   ` Greg KH
2015-12-07  7:41                                             ` Asterisk deadlocks since Kernel 4.1 Philipp Hahn
2015-12-05 14:19                                       ` Philipp Matthias Hahn
2015-12-05 15:34                                         ` Stefan Priebe
2015-12-02 17:15                                     ` Philipp Hahn
2015-12-02 18:23                                       ` Hannes Frederic Sowa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).