Kernel 4.6.7-rt14 kernel workqueue lockup - rtnl deadlock plus syscall endless loop

* Kernel 4.6.7-rt14 kernel workqueue lockup - rtnl deadlock plus syscall endless loop
@ 2017-01-17 17:39 Elad Nachman
  2017-01-17 17:53 ` Stephen Hemminger
  2017-01-17 17:57 ` David Miller
  0 siblings, 2 replies; 8+ messages in thread
From: Elad Nachman @ 2017-01-17 17:39 UTC (permalink / raw)
  To: netdev

Hi,

I am experiencing sporadic work queue lockups on kernel 4.6.7-rt14 (mach-socfpga).

Using a HW debugger I got the following information:

A process containing a network namespace is terminating itself (SIGKILL), which causes cleanup_net() to be scheduled to kworker/u4:2 to clean up the network namespace running on the process.

Kworker/u4:2 got preempted (plus there are a lot of other work queue items, like vmstat_shepherd, wakeup_dirtytime_writeback, phy_state_machine, neigh_periodic_work, check_lifetime plus another one by a LKM) while holding the rtnl lock.

A process running waitpid() on the terminated process starts a new process, which forks busybox to run sysctl -w net.ipv6.conf.all.forwarding=1.
This in turn starts making a write syscall, calling in turn vfs_write, proc_sys_call_handler, addrconf_sysctl_forward, and finally addrconf_fixup_forwarding().

addrconf_fixup_forwarding() runs the following code:

        if (!rtnl_trylock())
                return restart_syscall();

This fails and restart_syscall() does the following:

        set_tsk_thread_flag(current, TIF_SIGPENDING);
        return -ERESTARTNOINTR;

Now the system call goes back to ret_fast_syscall (arch/arm/kernel/entry-common.S).
Testing the thread flags (which contain TIF_SIGPENDING), the code branches to fast_work_pending, then falls through to slow_work_pending, which calls do_work_pending(). That in turn calls do_signal(), get_signal() and dequeue_signal(), which finds no signals and clears the TIF_SIGPENDING bit when recalc_sigpending() is called, then returns zero.

This causes do_signal() to examine r0 (-ERESTARTNOINTR) and return 1, which is propagated to the assembly code by do_work_pending().
Having r0 nonzero causes a branch to local_restart, which restarts the very same write system call in an endless loop.
No scheduling is possible, so the cleanup_net() cannot finish and release rtnl, which in turn causes the endless restarting of the write system call.
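
The loop described above can be illustrated with a toy, user-space model. This is only a sketch and not kernel code: struct model, run_holder_slice(), model_trylock() and run_writer() are invented names, and the model assumes a single CPU on which the retrying writer is never preempted, as in the report above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one CPU shared by a lock holder and a retrying writer. */
struct model {
    bool rtnl_held;        /* lock held by the preempted worker */
    int  holder_work_left; /* CPU slices the worker needs before unlocking */
};

/* Give the worker one CPU slice; it unlocks once its work is done. */
static void run_holder_slice(struct model *m)
{
    if (m->rtnl_held && --m->holder_work_left <= 0)
        m->rtnl_held = false;
}

/* Non-blocking lock attempt, standing in for rtnl_trylock(). */
static bool model_trylock(struct model *m)
{
    if (m->rtnl_held)
        return false;
    m->rtnl_held = true;
    return true;
}

/*
 * The writer restarts its "syscall" until the trylock succeeds.
 * If writer_yields is false, the worker never gets the CPU.
 * Returns the number of restarts needed, or -1 if the budget ran out.
 */
static int run_writer(struct model *m, bool writer_yields, int max_restarts)
{
    assert(max_restarts >= 0);
    for (int restarts = 0; restarts < max_restarts; restarts++) {
        if (model_trylock(m))
            return restarts;
        if (writer_yields)
            run_holder_slice(m);
    }
    return -1;
}
```

With a worker that still needs three slices, run_writer() with writer_yields set to false exhausts any restart budget, while with writer_yields set to true it gets the lock after three restarts.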

I have sent this to linux-arm-kernel and got a response from Russell King saying (regarding addrconf_fixup_forwarding, net/ipv6/addrconf.c):

"
I think the problem is that:

        if (!rtnl_trylock())
                return restart_syscall();

which, if it didn't do a trylock, it would put this thread to sleep
and allow other threads to run (potentially allowing the holder of
the lock to release it.)

What's more odd about this is that it's very unusual and strange for
a kernel function to invoke the restart mechanism because a lock is
being held - the point of the restart mechanism is to allow userspace
signal handlers to run, so it should only be used when there's a
signal pending. I think this is a hack in the IPv6 code to work
around some other issue.
"

Any reason we cannot change the above two lines to rtnl_lock() ?

Thanks,

Elad.

IMPORTANT - This email and any attachments is intended for the above named addressee(s), and may contain information which is confidential or privileged. If you are not the intended recipient, please inform the sender immediately and delete this email: you should not copy or use this e-mail for any purpose nor disclose its contents to any person.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Kernel 4.6.7-rt14 kernel workqueue lockup - rtnl deadlock plus syscall endless loop
  2017-01-17 17:39 Kernel 4.6.7-rt14 kernel workqueue lockup - rtnl deadlock plus syscall endless loop Elad Nachman
@ 2017-01-17 17:53 ` Stephen Hemminger
  2017-01-17 17:57 ` David Miller
  1 sibling, 0 replies; 8+ messages in thread
From: Stephen Hemminger @ 2017-01-17 17:53 UTC (permalink / raw)
  To: Elad Nachman; +Cc: netdev

On Tue, 17 Jan 2017 17:39:03 +0000
Elad Nachman <EladN@gilat.com> wrote:

> I have sent this to linux-arm-kernel and got a response from Russell King saying (regarding addrconf_fixup_forwarding, net/ipv6/addrconf.c):
> 
> "
> I think the problem is that:
> 
>         if (!rtnl_trylock())
>                 return restart_syscall();
> 
> 
> 
> which, if it didn't do a trylock, it would put this thread to sleep
> and allow other threads to run (potentially allowing the holder of
> the lock to release it.)
> 
> What's more odd about this is that it's very unusual and strange for
> a kernel function to invoke the restart mechanism because a lock is
> being held - the point of the restart mechanism is to allow userspace
> signal handlers to run, so it should only be used when there's a
> signal pending. I think this is a hack in the IPv6 code to work
> around some other issue.

The trylock was added intentionally to handle a different deadlock.
Going back to a blocking lock would cause that problem.

There was a deadlock between device unregistration and sysfs access.
Unregistration wants to remove the sysfs entry while holding RTNL.
Sysfs access grabs the sysfs file entry lock, then acquires RTNL.

The patch back in 2.6.30, followed by multiple revisions, was to
restart the sysfs write syscall.
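
The lock ordering described above can be sketched in user space as follows. This is a simplified illustration, not the kernel implementation: struct toy_lock, toy_rtnl, toy_entry, TOY_ERESTART and the toy_* functions are invented stand-ins for the RTNL mutex, the sysfs entry lock and -ERESTARTNOINTR.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ERESTART (-1)   /* hypothetical stand-in for -ERESTARTNOINTR */

struct toy_lock { bool held; };

static bool toy_trylock(struct toy_lock *l)
{
    if (l->held)
        return false;
    l->held = true;
    return true;
}

static void toy_unlock(struct toy_lock *l)
{
    assert(l->held);        /* unlocking a free lock would be a bug */
    l->held = false;
}

static struct toy_lock toy_rtnl;   /* plays the RTNL mutex */
static struct toy_lock toy_entry;  /* plays the sysfs file entry lock */

/* Write path: entry lock first, then RTNL; back off fully if RTNL is busy. */
static int toy_sysfs_write(int *value, int new_value)
{
    if (!toy_trylock(&toy_entry))
        return TOY_ERESTART;
    if (!toy_trylock(&toy_rtnl)) {
        toy_unlock(&toy_entry);    /* let the RTNL holder remove the entry */
        return TOY_ERESTART;
    }
    *value = new_value;
    toy_unlock(&toy_rtnl);
    toy_unlock(&toy_entry);
    return 0;
}

/* Unregister path: the caller holds RTNL and needs the entry to be idle. */
static bool toy_unregister_can_proceed(void)
{
    if (!toy_trylock(&toy_entry))
        return false;
    toy_unlock(&toy_entry);
    return true;
}
```

If the write path blocked on RTNL while keeping the entry lock, the unregister path could never proceed; backing off and restarting is what breaks that cycle in this model.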

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Kernel 4.6.7-rt14 kernel workqueue lockup - rtnl deadlock plus syscall endless loop
  2017-01-17 17:39 Kernel 4.6.7-rt14 kernel workqueue lockup - rtnl deadlock plus syscall endless loop Elad Nachman
  2017-01-17 17:53 ` Stephen Hemminger
@ 2017-01-17 17:57 ` David Miller
  2017-01-17 18:15   ` Elad Nachman
  1 sibling, 1 reply; 8+ messages in thread
From: David Miller @ 2017-01-17 17:57 UTC (permalink / raw)
  To: EladN; +Cc: netdev

From: Elad Nachman <EladN@gilat.com>
Date: Tue, 17 Jan 2017 17:39:03 +0000

> What's more odd about this is that it's very unusual and strange for
> a kernel function to invoke the restart mechanism because a lock is
> being held - the point of the restart mechanism is to allow userspace
> signal handlers to run, so it should only be used when there's a
> signal pending. I think this is a hack in the IPv6 code to work
> around some other issue.

It's not unusual at all; if you actually grep for this under net/ you will
see that it is in fact a common code pattern.

It prevents deadlocks because the sysfs and other nodes that we are
operating with can be unregistered by other threads of control holding
the RTNL mutex.  If we don't break out, we won't release our reference
and therefore the RTNL mutex holding entity cannot make forward
progress.

This behavior is therefore very much intentional.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: Kernel 4.6.7-rt14 kernel workqueue lockup - rtnl deadlock plus syscall endless loop
  2017-01-17 17:57 ` David Miller
@ 2017-01-17 18:15   ` Elad Nachman
  2017-01-17 19:05     ` David Miller
  0 siblings, 1 reply; 8+ messages in thread
From: Elad Nachman @ 2017-01-17 18:15 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Any thought about limiting the amount of busy polling?
Say if more than X polls are done within a jiffy, then at least for preemptable kernels you can sleep for a jiffy inside the syscall to yield the CPU for a while?

Thanks,

Elad.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Kernel 4.6.7-rt14 kernel workqueue lockup - rtnl deadlock plus syscall endless loop
  2017-01-17 18:15   ` Elad Nachman
@ 2017-01-17 19:05     ` David Miller
  2017-01-18  9:57       ` Elad Nachman
  0 siblings, 1 reply; 8+ messages in thread
From: David Miller @ 2017-01-17 19:05 UTC (permalink / raw)
  To: EladN; +Cc: netdev

From: Elad Nachman <EladN@gilat.com>
Date: Tue, 17 Jan 2017 18:15:19 +0000

> Any thought about limiting the amount of busy polling?  Say if more
> than X polls are done within a jiffy, then at least for preemptable
> kernels you can sleep for a jiffy inside the syscall to yield the
> CPU for a while?

We cannot yield there, because we must return immediately from this
context in order to drop the sysctl locks and references.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: Kernel 4.6.7-rt14 kernel workqueue lockup - rtnl deadlock plus syscall endless loop
  2017-01-17 19:05     ` David Miller
@ 2017-01-18  9:57       ` Elad Nachman
  0 siblings, 0 replies; 8+ messages in thread
From: Elad Nachman @ 2017-01-18  9:57 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

OK, how about reflecting the state of the rtnl lock to user space via the /proc file system?

This way I can test it before using sysctl on the relevant proc files to avoid live-lock.

Thanks,

Elad.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Kernel 4.6.7-rt14 kernel workqueue lockup - rtnl deadlock plus syscall endless loop
  2017-01-17 16:20 Elad Nachman
@ 2017-01-17 16:40 ` Russell King - ARM Linux
  0 siblings, 0 replies; 8+ messages in thread
From: Russell King - ARM Linux @ 2017-01-17 16:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jan 17, 2017 at 04:20:00PM +0000, Elad Nachman wrote:
> Going over the x86 assembly code, it does not look like system calls are restarted within the assembly syscall handler without returning to user-space.
> 
> There could be several remedies:
> 
> 1. Adopt the x86 handling (avoid restarting system calls within the handler, but rather return to user-space).

We used to do that, but it became infeasible.

commit 81783786d5cf4aa0d3e15bb0fac856aa8ebf1a76
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Thu Jul 19 17:48:21 2012 +0100

    ARM: 7473/1: deal with handlerless restarts without leaving the kernel

However, I think your analysis is slightly off.  Yes, we call into
do_work_pending().  As long as _TIF_NEED_RESCHED is not set, then
you are correct.

However, _TIF_NEED_RESCHED will be set at the end of the thread's
quantum, or when a higher priority thread needs to run on the current
CPU.

Now, that's the exact same path which gets used when a thread needs to
be preempted, so returning back to userspace and re-entering to the
restart syscall doesn't achieve anything as long as _TIF_NEED_RESCHED
is clear.  We just end up executing more instructions uselessly.
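
A simplified user-space model of the work-pending loop may make this concrete. It is only a sketch under stated assumptions: the TOY_* flags, struct toy_task and toy_do_work_pending() are invented for illustration and merely mirror the shape of the ARM code, with a counter standing in for schedule().

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_TIF_SIGPENDING   (1u << 0)
#define TOY_TIF_NEED_RESCHED (1u << 1)

struct toy_task {
    unsigned int flags;      /* pending-work flags */
    bool restart_requested;  /* syscall returned a restart code, no handler */
    int  schedule_calls;     /* how often the model "called schedule()" */
};

/*
 * Returns 1 when the syscall is restarted without leaving the kernel,
 * 0 when the task would return to user space.
 */
static int toy_do_work_pending(struct toy_task *t)
{
    assert(t);
    while (t->flags & (TOY_TIF_NEED_RESCHED | TOY_TIF_SIGPENDING)) {
        if (t->flags & TOY_TIF_NEED_RESCHED) {
            /* Only this branch gives other threads the CPU. */
            t->flags &= ~TOY_TIF_NEED_RESCHED;
            t->schedule_calls++;
        } else {
            /* No signal is actually queued, so the flag is just cleared. */
            t->flags &= ~TOY_TIF_SIGPENDING;
            if (t->restart_requested)
                return 1;
        }
    }
    return 0;
}
```

In this model a task with only TOY_TIF_SIGPENDING set restarts without ever scheduling, whereas one that also has TOY_TIF_NEED_RESCHED set schedules once before restarting.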

I think the problem is that:

        if (!rtnl_trylock())
                return restart_syscall();

which, if it didn't do a trylock, it would put this thread to sleep
and allow other threads to run (potentially allowing the holder of
the lock to release it.)

What's more odd about this is that it's very unusual and strange for
a kernel function to invoke the restart mechanism because a lock is
being held - the point of the restart mechanism is to allow userspace
signal handlers to run, so it should only be used when there's a
signal pending.  I think this is a hack in the IPv6 code to work
around some other issue.

This isn't really -rt kernel specific, I'd expect exactly the same
behaviour from a non-rt kernel.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Kernel 4.6.7-rt14 kernel workqueue lockup - rtnl deadlock plus syscall endless loop
@ 2017-01-17 16:20 Elad Nachman
  2017-01-17 16:40 ` Russell King - ARM Linux
  0 siblings, 1 reply; 8+ messages in thread
From: Elad Nachman @ 2017-01-17 16:20 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

I am experiencing sporadic work queue lockups on kernel 4.6.7-rt14 (mach-socfpga).

Using a HW debugger I got the following information:

A process containing a network namespace is terminating itself (SIGKILL), which causes cleanup_net() to be scheduled to kworker/u4:2 to clean up the network namespace running on the process.

Kworker/u4:2 got preempted (plus there are a lot of other work queue items, like vmstat_shepherd, wakeup_dirtytime_writeback, phy_state_machine, neigh_periodic_work, check_lifetime plus another one by a LKM) while holding the rtnl lock.

A process running waitpid() on the terminated process starts a new process, which forks busybox to run sysctl -w net.ipv6.conf.all.forwarding=1.
This in turn starts making a write syscall, calling in turn vfs_write, proc_sys_call_handler, addrconf_sysctl_forward, and finally addrconf_fixup_forwarding().

addrconf_fixup_forwarding() runs the following code:

        if (!rtnl_trylock())
                return restart_syscall();

This fails and restart_syscall() does the following:

        set_tsk_thread_flag(current, TIF_SIGPENDING);
        return -ERESTARTNOINTR;

Now the system call goes back to ret_fast_syscall (arch/arm/kernel/entry-common.S).
Testing the thread flags (which contain TIF_SIGPENDING), the code branches to fast_work_pending, then falls through to slow_work_pending, which calls do_work_pending(). That in turn calls do_signal(), get_signal() and dequeue_signal(), which finds no signals and clears the TIF_SIGPENDING bit when recalc_sigpending() is called, then returns zero.

This causes do_signal() to examine r0 (-ERESTARTNOINTR) and return 1, which is propagated to the assembly code by do_work_pending().
Having r0 nonzero causes a branch to local_restart, which restarts the very same write system call in an endless loop.
No scheduling is possible, so the cleanup_net() cannot finish and release rtnl, which in turn causes the endless restarting of the write system call.

Going over the x86 assembly code, it does not look like system calls are restarted within the assembly syscall handler without returning to user-space.

There could be several remedies:

1. Adopt the x86 handling (avoid restarting system calls within the handler, but rather return to user-space).
2. Count the number of retries. Above a set threshold (1? 2? 3? retries), force a return to user-space.
3. Count the number of retries. Above a set threshold (1? 2? 3? retries), force a reschedule in do_work_pending() (as if _TIF_NEED_RESCHED were set).
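
Remedies 2 and 3 could be sketched roughly as below. This is a user-space illustration only: RESTART_LIMIT, struct restart_state and restart_should_yield() are hypothetical names, and where such a counter would actually live in the kernel is left open.

```c
#include <assert.h>
#include <stdbool.h>

#define RESTART_LIMIT 3   /* hypothetical threshold */

/* Per-task bookkeeping for back-to-back in-kernel restarts. */
struct restart_state {
    int consecutive_restarts;
};

/*
 * Called once per pass through the work-pending path.
 * Returns true when the caller should force a reschedule (or a return
 * to user space) instead of restarting the syscall in place again.
 */
static bool restart_should_yield(struct restart_state *s, bool syscall_restarted)
{
    assert(s);
    if (!syscall_restarted) {
        s->consecutive_restarts = 0;   /* progress was made, reset */
        return false;
    }
    if (++s->consecutive_restarts > RESTART_LIMIT) {
        s->consecutive_restarts = 0;   /* start counting afresh */
        return true;
    }
    return false;
}
```

With RESTART_LIMIT set to 3, the first three consecutive restarts proceed in place and the fourth asks the caller to yield.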

What do you think is the best solution for this issue?

Thanks,

Elad.

^ permalink raw reply	[flat|nested] 8+ messages in thread
