From: Don Zickus
Date: Mon, 17 Jul 2017 10:46:37 -0400
To: "Liang, Kan"
Cc: Thomas Gleixner, linux-kernel@vger.kernel.org, mingo@kernel.org, akpm@linux-foundation.org, babu.moger@oracle.com, atomlin@redhat.com, prarit@redhat.com, torvalds@linux-foundation.org, peterz@infradead.org, eranian@google.com, acme@redhat.com, ak@linux.intel.com, stable@vger.kernel.org
Subject: Re: [PATCH V2] kernel/watchdog: fix spurious hard lockups
Message-ID: <20170717144637.34umykrccvjma3fl@redhat.com>
In-Reply-To: <37D7C6CF3E00A74B8858931C1DB2F0775371D43E@SHSMSX103.ccr.corp.intel.com>

On Mon, Jul 17, 2017 at 01:24:23AM +0000, Liang, Kan wrote:
> Hi Don & Thomas,
>
> Sorry for the late response. We just finished the tests for all proposed patches.
>
> There are three proposed patches so far.
> Patch 1: The patch as above which speed up the hrtimer.
> Patch 2: Thomas's first proposal.
> https://patchwork.kernel.org/patch/9803033/
> https://patchwork.kernel.org/patch/9805903/
> Patch 3: my original proposal which increase the NMI watchdog timeout by 3X
> https://patchwork.kernel.org/patch/9802053/
>
> According to our test, only patch 3 works well.
> The other two patches will hang the system eventually.
> For patch 1, the system hang after running our test case for ~1 hour.
> For patch 2, the system hang in running the overnight test.
> There is no error message shown when the system hang. So I don't know the
> root cause yet.

Hi Kan,

Thanks for the feedback. Odd that the different patches had different
results. What is more odd to me is the hang. I thought these were all
false lockups that prematurely panicked and rebooted the box.

Is the machine configured to panic on hardlockup and reboot? Perhaps kdump
is enabled to store the console log for review upon reboot?

It almost implies that a hardlockup did happen but isn't being detected
until later??

>
> BTW: We set 1 to watchdog_thresh when we did the test.
> It's believed that can speed up the failure.

Sure, you/they look for 1 second hangs instead of 10 second ones. But with
patch 3 it is more like 3 seconds-ish vs. 30 seconds-ish.

As Thomas asked, I would also be interested in the way the test works. The
hang doesn't make sense.

Cheers,
Don
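
P.S. To make the timing comparison concrete, here is a rough sketch of the arithmetic. It assumes the hard lockup detection window simply scales linearly with watchdog_thresh and with the 3X factor from patch 3; the function and constant names are made up for illustration and are not kernel identifiers.

```python
# Illustrative arithmetic only; these names are hypothetical, not kernel symbols.

DEFAULT_WATCHDOG_THRESH = 10  # seconds, the "10 second" case mentioned above
TEST_WATCHDOG_THRESH = 1      # seconds, the value Kan used in the test
PATCH3_MULTIPLIER = 3         # patch 3 increases the NMI watchdog timeout by 3X


def hardlockup_window(watchdog_thresh, multiplier=1):
    """Approximate hard lockup detection window in seconds.

    Assumes the window is watchdog_thresh scaled by a constant factor.
    """
    if watchdog_thresh <= 0:
        raise ValueError("watchdog_thresh must be positive")
    return watchdog_thresh * multiplier


if __name__ == "__main__":
    for thresh in (TEST_WATCHDOG_THRESH, DEFAULT_WATCHDOG_THRESH):
        unpatched = hardlockup_window(thresh)
        patched = hardlockup_window(thresh, PATCH3_MULTIPLIER)
        print(f"watchdog_thresh={thresh}: {unpatched}s unpatched, "
              f"~{patched}s with patch 3")
```

So a test run with watchdog_thresh=1 and patch 3 applied is really looking for roughly 3 second hangs, not 1 second ones.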