From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752780AbdK2WiN (ORCPT ); Wed, 29 Nov 2017 17:38:13 -0500 Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:57714 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752121AbdK2WiL (ORCPT ); Wed, 29 Nov 2017 17:38:11 -0500 Date: Wed, 29 Nov 2017 14:38:06 -0800 From: "Paul E. McKenney" To: Thomas Gleixner Cc: kernel test robot , LKML , lkp@01.org, frederic@kernel.org Subject: Re: [lkp-robot] [torture] b151f93a71: INFO:rcu_preempt_detected_stalls_on_CPUs/tasks Reply-To: paulmck@linux.vnet.ibm.com References: <20171128164144.GS3624@linux.vnet.ibm.com> <20171128170759.GU3624@linux.vnet.ibm.com> <20171128173554.GV3624@linux.vnet.ibm.com> <20171128204619.GA16740@linux.vnet.ibm.com> <20171128210810.GA22726@linux.vnet.ibm.com> <20171129190819.GA18159@linux.vnet.ibm.com> <20171129220703.GA12908@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20171129220703.GA12908@linux.vnet.ibm.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-GCONF: 00 x-cbid: 17112922-0044-0000-0000-000003B729A8 X-IBM-SpamModules-Scores: X-IBM-SpamModules-Versions: BY=3.00008124; HX=3.00000241; KW=3.00000007; PH=3.00000004; SC=3.00000242; SDB=6.00953095; UDB=6.00481543; IPR=6.00733174; BA=6.00005721; NDR=6.00000001; ZLA=6.00000005; ZF=6.00000009; ZB=6.00000000; ZP=6.00000000; ZH=6.00000000; ZU=6.00000002; MB=3.00018258; XFM=3.00000015; UTC=2017-11-29 22:38:07 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 17112922-0045-0000-0000-000007E656E3 Message-Id: <20171129223806.GA15615@linux.vnet.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2017-11-29_08:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1709140000 definitions=main-1711290292 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 29, 2017 at 02:07:03PM -0800, Paul E. McKenney wrote: > On Wed, Nov 29, 2017 at 11:08:19AM -0800, Paul E. McKenney wrote: > > On Tue, Nov 28, 2017 at 01:08:10PM -0800, Paul E. McKenney wrote: > > > On Tue, Nov 28, 2017 at 12:46:19PM -0800, Paul E. McKenney wrote: > > > > On Tue, Nov 28, 2017 at 09:35:54AM -0800, Paul E. McKenney wrote: > > > > > On Tue, Nov 28, 2017 at 06:10:08PM +0100, Thomas Gleixner wrote: > > > > > > On Tue, 28 Nov 2017, Paul E. McKenney wrote: > > > > > > > On Tue, Nov 28, 2017 at 05:47:35PM +0100, Thomas Gleixner wrote: > > > > > > > diff --git a/kernel/time/timer.c b/kernel/time/timer.c > > > > > > > index db774b0f217e..a3321bb565db 100644 > > > > > > > --- a/kernel/time/timer.c > > > > > > > +++ b/kernel/time/timer.c > > > > > > > @@ -1803,7 +1803,7 @@ signed long __sched schedule_timeout(signed long timeout) > > > > > > > idx = timer_get_idx(&timer.timer); > > > > > > > idx_now = calc_wheel_index(j, base->clk); > > > > > > > raw_spin_unlock_irqrestore(&base->lock, flags); > > > > > > > - pr_info("%s: Waylayed timer idx: %u idx_now: %u\n", __func__, idx, idx_now); > > > > > > > + pr_info("%s: Waylayed timer base->clk: %#lx jiffies: %#lx base->next_expiry: %#lx timer->flags: %#x timer->expires %#lx idx: %u idx_now: %u\n", __func__, base->clk, j, base->next_expiry, timer.timer.flags, timer.timer.expires, idx, idx_now); > > > > > > > > > > > > Please print idx and idx_now as hex values. It's simpler to decode that way. > > > > > > > > > > Here you go! Starting tests at this end, focusing on TREE01 and TREE04. > > > > > BTW, TREE04 doesn't do any CPU hotplug, providing a counterexample to > > > > > my long-held assumption that this only happened in the presence of CPU > > > > > hotplug operations. > > > > > > > > And here is output with changes discussed on IRC. TREE04 managed to > > > > have not one but two overlapping RCU CPU stall warnings, one for RCU-bh > > > > and the second for RCU-sched. TREE04 and TREE04. HZ=1000. > > > > > > And here is the full patch, in all its lack of aesthetic appeal. > > > > And here is the list of waylaid timers from last night's testing. The big > > pile of them from TREE01 at the end is due to wakeups from kthread_stop(), > > I am guessing. The TREE04 run only had two of them, but they seem reliable > > enough that I just might be able to bisect. I will try that. > > And it converged to 5c4991e24c69 ("sched/isolation: Split out new > CONFIG_CPU_ISOLATION=y config from CONFIG_NO_HZ_FULL"), which is a bit > hard to believe. Please see below for the log. I will be retesting > some of the allegedly good commits, just in case. And the bisection really did converge here. It appears that splitting CONFIG_CPU_ISOLATION from CONFIG_NO_HZ_FULL isn't fully cooked. Adding CONFIG_CPU_ISOLATION=y to my tests causes them to pass, but I get the eternal wait for a three-jiffy timeout given CONFIG_NO_HZ_FULL=y and CONFIG_CPU_ISOLATION=n. Both cases use CONFIG_NO_HZ_FULL_ALL=y. Adding Frederic for his perspective. Thanx, Paul