From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751700AbeAZWJd (ORCPT ); Fri, 26 Jan 2018 17:09:33 -0500
Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:50654 "EHLO
	mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org
	with ESMTP id S1751257AbeAZWJb (ORCPT ); Fri, 26 Jan 2018 17:09:31 -0500
Date: Fri, 26 Jan 2018 14:09:17 -0800
From: "Paul E. McKenney"
To: Thomas Gleixner
Cc: LKML , Sebastian Sewior , Anna-Maria Gleixner , Peter Zijlstra ,
	Ingo Molnar
Subject: Re: [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug
Reply-To: paulmck@linux.vnet.ibm.com
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.21 (2010-09-15)
Message-Id: <20180126220917.GI3741@linux.vnet.ibm.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jan 26, 2018 at 02:54:32PM +0100, Thomas Gleixner wrote:
> The hrtimer interrupt code contains a hang detection and mitigation
> mechanism, which prevents that a long delayed hrtimer interrupt causes a
> continuous retriggering of interrupts which prevent the system from making
> progress. If a hang is detected then the timer hardware is programmed with
> a certain delay into the future and a flag is set in the hrtimer cpu base
> which prevents newly enqueued timers from reprogramming the timer hardware
> prior to the chosen delay. The subsequent hrtimer interrupt after the delay
> clears the flag and resumes normal operation.
>
> If such a hang happens in the last hrtimer interrupt before a CPU is
> unplugged then the hang_detected flag is set and stays that way when the
> CPU is plugged in again. At that point the timer hardware is not armed and
> it cannot be armed because the hang_detected flag is still active, so
> nothing clears that flag. As a consequence the CPU does not receive hrtimer
> interrupts and no timers expire on that CPU which results in RCU stalls and
> other malfunctions.
>
> Clear the flag along with some other less critical members of the hrtimer
> cpu base to ensure starting from a clean state when a CPU is plugged in.
>
> Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
> root cause of that hard to reproduce heisenbug. Once understood it's
> trivial and certainly justifies a brown paperbag.

Thank you very much, and I do know that feeling!

After reading the commit log, I feel significantly less incompetent for
having failed to find this one.  ;-)

But it did pass rcutorture testing for a great many years, didn't it?  :-/

I have started an eight-hour seven-way test on the dreaded rcutorture
TREE01 scenario.  In the meantime, off to the train!

							Thanx, Paul

> Fixes: 41d2e4949377 ("hrtimer: Tune hrtimer_interrupt hang logic")
> Reported-by: Paul E. McKenney
> Signed-off-by: Thomas Gleixner
> Cc: stable@vger.kernel.org
> ---
>  kernel/time/hrtimer.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -655,7 +655,9 @@ static void hrtimer_reprogram(struct hrt
>  static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base)
>  {
>  	base->expires_next = KTIME_MAX;
> +	base->hang_detected = 0;
>  	base->hres_active = 0;
> +	base->next_timer = NULL;
>  }
>  
>  /*
> @@ -1589,6 +1591,7 @@ int hrtimers_prepare_cpu(unsigned int cp
>  		timerqueue_init_head(&cpu_base->clock_base[i].active);
>  	}
>  
> +	cpu_base->active_bases = 0;
>  	cpu_base->cpu = cpu;
>  	hrtimer_init_hres(cpu_base);
>  	return 0;
>
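
For anyone following along, the failure described in the commit log can
be modeled in a few lines of userspace C.  This is only a rough sketch
under stated assumptions, not the kernel code: struct toy_cpu_base and
every toy_*() function below are hypothetical stand-ins, and the only
behavior taken from the commit log is that a set hang_detected flag
blocks reprogramming and that nothing cleared it across unplug/replug
before the fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_KTIME_MAX INT64_MAX

/* Toy stand-in for the per-CPU hrtimer base; not the kernel structure. */
struct toy_cpu_base {
	int64_t expires_next;	/* next expiry programmed into the "hardware" */
	bool hang_detected;	/* set while hang mitigation is active */
	bool hw_armed;		/* true if the toy timer hardware will fire */
};

/* Enqueue path: arm the hardware unless hang mitigation blocks it. */
static void toy_reprogram(struct toy_cpu_base *base, int64_t expires)
{
	if (base->hang_detected)
		return;		/* wait for the delayed interrupt instead */
	if (expires < base->expires_next) {
		base->expires_next = expires;
		base->hw_armed = true;
	}
}

/* Interrupt that detected a hang: arm a delayed event and set the flag. */
static void toy_interrupt_hang(struct toy_cpu_base *base, int64_t delayed)
{
	base->expires_next = delayed;
	base->hang_detected = true;
	base->hw_armed = true;
}

/* CPU goes offline: the toy hardware is shut down, the flag is left alone. */
static void toy_cpu_offline(struct toy_cpu_base *base)
{
	base->hw_armed = false;
}

/* CPU comes back; apply_fix selects whether the stale flag is reset. */
static void toy_prepare_cpu(struct toy_cpu_base *base, bool apply_fix)
{
	base->expires_next = TOY_KTIME_MAX;
	if (apply_fix)
		base->hang_detected = false;
}

/* Returns true if a timer enqueued after replug arms the toy hardware. */
bool toy_replug_arms_timer(bool apply_fix)
{
	struct toy_cpu_base base = { .expires_next = TOY_KTIME_MAX };

	toy_interrupt_hang(&base, 1000);	/* hang in the last interrupt */
	assert(base.hang_detected);
	toy_cpu_offline(&base);			/* CPU is unplugged */
	toy_prepare_cpu(&base, apply_fix);	/* CPU is plugged in again */
	toy_reprogram(&base, 2000);		/* first timer after replug */
	return base.hw_armed;
}
```

In this model, toy_replug_arms_timer(false) leaves the hardware unarmed
because the stale flag makes toy_reprogram() return early, while
toy_replug_arms_timer(true) arms it, which is the effect of resetting
hang_detected when the CPU is prepared.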