From mboxrd@z Thu Jan  1 00:00:00 1970
From: Stephen Hemminger <stephen@networkplumber.org>
Subject: Re: Deadlock with restart_syscall()
Date: Fri, 27 Jul 2018 08:53:51 -0700
Message-ID: <20180727085351.36210a12@xeon-e3>
References: <681500CE65202E47A192754B01DAB4673BE3D87D8D@SDE12.beckipc.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Cc: "netdev@vger.kernel.org" <netdev@vger.kernel.org>
To: =?UTF-8?B?QW5kcsOp?= Pribil <a.pribil@beck-ipc.com>
Return-path: <netdev-owner@vger.kernel.org>
Received: from mail-pf1-f196.google.com ([209.85.210.196]:44900 "EHLO
        mail-pf1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1732752AbeG0RQX (ORCPT
        <rfc822;netdev@vger.kernel.org>); Fri, 27 Jul 2018 13:16:23 -0400
Received: by mail-pf1-f196.google.com with SMTP id k21-v6so1866768pff.11
        for <netdev@vger.kernel.org>; Fri, 27 Jul 2018 08:53:53 -0700 (PDT)
In-Reply-To: <681500CE65202E47A192754B01DAB4673BE3D87D8D@SDE12.beckipc.net>
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

On Mon, 16 Jul 2018 09:31:06 +0200
André Pribil <a.pribil@beck-ipc.com> wrote:

> Hello,
>
> I'm using kernel 4.14.52-rt34 on a single core ARM system and I'm seeing a
> deadlock inside the kernel when two RT processes make calls in the right
> temporal distance. The first process is trying to bring the Ethernet interface
> up, with the SIOCGIFFLAGS ioctl(). The second process is checking the Ethernet
> carrier, speed and duplex status, by reading e.g. "/sys/class/net/eth1/speed".
>
> The first process finally gets to phy_poll_reset() in
> drivers/net/phy/phy_device.c, where it calls msleep(50).
> It never returns from the sleep.
>
> The second process gets to speed_show() in net/core/net-sysfs.c. It tries to get
> the RTNL lock with rtnl_trylock(), but fails and calls restart_syscall().
> This happens over and over again.
>
> It seems like the first process is no longer scheduled and cannot release the
> RTNL lock, while the second process is busy restarting the syscall. The first
> process has a higher RT priority than the second process.
>
> Just for testing I've added the TIF_NEED_RESCHED flag to the restart_syscall()
> function and I did not see the deadlock again with this change.
>
> static inline int restart_syscall(void)
> {
> 	set_tsk_thread_flag(current, TIF_SIGPENDING | TIF_NEED_RESCHED);
> 	return -ERESTARTNOINTR;
> }
>
> As a second test I released the RTNL lock while calling msleep() in
> phy_poll_reset(). This also made the problem disappear.
>
> I've found this thread, where a similar issue with restart_syscall() has been
> reported:
> https://www.spinics.net/lists/netdev/msg415144.html
>
> Any ideas how to fix this issue?
>
> Andre

Don't do control operations from RT processes!
There can be cases of priority inversion where an RT process is waiting for
something that requires a kthread to complete the operation.
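
As a rough user-space illustration of that advice (a sketch only, assuming a
pthreads application): the RT thread hands the control operation to a helper
thread that is created explicitly with SCHED_OTHER, so any retry loop inside
the kernel runs at normal priority and cannot starve kernel threads. The names
control_job, control_worker, run_control_op_non_rt and read_int_file are
hypothetical and do not come from the kernel or from this thread.

```c
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Hypothetical job descriptor for one control operation. */
struct control_job {
	int (*fn)(void *arg);	/* the control operation, e.g. an ioctl() wrapper */
	void *arg;		/* argument passed to fn */
	int result;		/* return value of fn */
	int policy;		/* scheduling policy the worker actually ran under */
};

/* Example control operation: read one integer from a sysfs-style file. */
static int read_int_file(void *arg)
{
	const char *path = arg;
	int value = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

/* Thread body: record the policy we run under, then do the operation. */
static void *control_worker(void *p)
{
	struct control_job *job = p;
	struct sched_param sp;

	pthread_getschedparam(pthread_self(), &job->policy, &sp);
	job->result = job->fn(job->arg);
	return NULL;
}

/*
 * Run job->fn in a SCHED_OTHER thread and wait for it to finish.
 * Returns 0 on success or an errno-style value from the pthread calls.
 */
static int run_control_op_non_rt(struct control_job *job)
{
	pthread_attr_t attr;
	struct sched_param sp = { .sched_priority = 0 };
	pthread_t tid;
	int err;

	err = pthread_attr_init(&attr);
	if (err)
		return err;
	/* Without EXPLICIT_SCHED the new thread inherits the caller's RT policy. */
	err = pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
	if (!err)
		err = pthread_attr_setschedpolicy(&attr, SCHED_OTHER);
	if (!err)
		err = pthread_attr_setschedparam(&attr, &sp);
	if (!err)
		err = pthread_create(&tid, &attr, control_worker, job);
	pthread_attr_destroy(&attr);
	if (err)
		return err;
	return pthread_join(tid, NULL);
}
```

A caller would fill in a struct control_job with, for example, read_int_file
and "/sys/class/net/eth1/speed", then call run_control_op_non_rt(). Blocking in
pthread_join() keeps the sketch short; a real design would more likely pass
requests to a long-lived non-RT thread so the RT loop never waits on it.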