From: Stephen Hemminger
Subject: Re: Deadlock with restart_syscall()
Date: Fri, 27 Jul 2018 08:53:51 -0700
Message-ID: <20180727085351.36210a12@xeon-e3>
References: <681500CE65202E47A192754B01DAB4673BE3D87D8D@SDE12.beckipc.net>
In-Reply-To: <681500CE65202E47A192754B01DAB4673BE3D87D8D@SDE12.beckipc.net>
To: André Pribil
Cc: "netdev@vger.kernel.org"

On Mon, 16 Jul 2018 09:31:06 +0200
André Pribil wrote:

> Hello,
>
> I'm using kernel 4.14.52-rt34 on a single core ARM system and I'm seeing a
> deadlock inside the kernel when two RT processes make calls in the right
> temporal distance. The first process is trying to bring the Ethernet interface
> up, with the SIOCGIFFLAGS ioctl(). The second process is checking the Ethernet
> carrier, speed and duplex status, by reading e.g. "/sys/class/net/eth1/speed".
>
> The first process finally gets to phy_poll_reset() in
> drivers/net/phy/phy_device.c, where it calls msleep(50).
> It never returns from the sleep.
>
> The second process gets to speed_show() in net/core/net-sysfs.c. It tries to get
> the RTNL lock with rtnl_trylock(), but fails and calls restart_syscall().
> This happens over and over again.
>
> It seems like the first process is no longer scheduled and cannot release the
> RTNL lock, while the second process is busy restarting the syscall. The first
> process has a higher RT priority than the second process.
>
> Just for testing I've added the TIF_NEED_RESCHED flag to the restart_syscall()
> function and I did not see the deadlock again with this change.
>
> static inline int restart_syscall(void)
> {
>         set_tsk_thread_flag(current, TIF_SIGPENDING | TIF_NEED_RESCHED);
>         return -ERESTARTNOINTR;
> }
>
> As a second test I released the RTNL lock while calling msleep() in
> phy_poll_reset(). This also made the problem disappear.
>
> I've found this thread, where a similar issue with restart_syscall() has been
> reported:
> https://www.spinics.net/lists/netdev/msg415144.html
>
> Any ideas how to fix this issue?
>
> Andre

Don't do control operations from RT processes!
There can be cases of priority inversion where an RT process is waiting for
something that requires a kthread to complete the operation.
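
One possible way to follow that advice in userspace is to hand the sysfs read to a helper thread that explicitly runs under SCHED_OTHER, so a failed rtnl_trylock() retry loop cannot starve kernel threads on a single core. The sketch below is only an illustration under that assumption; the names link_query, read_sysfs_worker and query_from_normal_thread are hypothetical and do not come from this thread or from the kernel.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical request/result record shared with the helper thread. */
struct link_query {
    const char *path;   /* e.g. "/sys/class/net/eth1/speed" */
    char value[32];     /* first line of the file, newline stripped */
    int ok;             /* 1 if the read succeeded, 0 otherwise */
    int policy;         /* scheduling policy the helper actually ran under */
};

/* Runs in the helper thread: record its policy, then read the sysfs file. */
static void *read_sysfs_worker(void *arg)
{
    struct link_query *q = arg;
    struct sched_param sp;
    FILE *f;

    pthread_getschedparam(pthread_self(), &q->policy, &sp);
    q->ok = 0;
    f = fopen(q->path, "r");
    if (f) {
        if (fgets(q->value, sizeof(q->value), f)) {
            q->value[strcspn(q->value, "\n")] = '\0';
            q->ok = 1;
        }
        fclose(f);
    }
    return NULL;
}

/*
 * Start a SCHED_OTHER helper even when the caller is SCHED_FIFO/SCHED_RR.
 * PTHREAD_EXPLICIT_SCHED stops the new thread from inheriting the
 * caller's RT policy. Returns 0 on success or a pthread error number.
 */
int query_from_normal_thread(struct link_query *q)
{
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 0 };
    pthread_t tid;
    int err;

    assert(q && q->path);
    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_OTHER);
    pthread_attr_setschedparam(&attr, &sp);
    err = pthread_create(&tid, &attr, read_sysfs_worker, q);
    pthread_attr_destroy(&attr);
    if (err)
        return err;
    /* The caller sleeps here rather than spinning, so the helper can run. */
    return pthread_join(tid, NULL);
}
```

A caller would fill in path, call query_from_normal_thread(), and check ok before using value. Joining keeps the sketch short; a real RT loop would more likely pick up the result asynchronously so that it never blocks on the non-RT helper.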