From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1756830AbcLNVCr (ORCPT ); Wed, 14 Dec 2016 16:02:47 -0500
Received: from Galois.linutronix.de ([146.0.238.70]:45619 "EHLO
    Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
    with ESMTP id S1752851AbcLNVCp (ORCPT );
    Wed, 14 Dec 2016 16:02:45 -0500
Date: Wed, 14 Dec 2016 21:59:37 +0100 (CET)
From: Thomas Gleixner
To: Roland Scheidegger
cc: LKML, x86@kernel.org, Peter Zijlstra, Borislav Petkov,
    Bruce Schlobohm, Kevin Stanton, Allen Hung
Subject: Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent
    TSC deadline timer irq storm
In-Reply-To:
Message-ID:
References: <20161213131115.764824574@linutronix.de>
    <33d4286c-3f77-1274-34b7-bc62d2c146a4@hispeed.ch>
    <357e0a0f-af6b-2a8e-2af0-b05652ccbb30@hispeed.ch>
User-Agent: Alpine 2.20 (DEB 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 14 Dec 2016, Thomas Gleixner wrote:
> On Wed, 14 Dec 2016, Roland Scheidegger wrote:
> > On 13.12.2016 at 17:46, Thomas Gleixner wrote:
> > > What are the adjust values after a warm boot?
> > >
> > So, after cold boot with a kernel which doesn't adjust TSCs, then warm
> > boot I got:
> > [ 0.000000] TSC ADJUST: CPU0: -602358264300 176072418728
> > [ 0.000000] TSC ADJUST: Boot CPU0: -602358264300
> > [ 0.172245] TSC ADJUST: CPU1: -602360207584 176587932558
> > [ 0.172245] TSC ADJUST differs: Reference CPU0: -602358264300 CPU1:
> > -602360207584
> > [ 0.172246] TSC ADJUST synchronize: Reference CPU0: -602358264300
> > CPU1: -602360207584
> > [ 0.252663] TSC ADJUST: CPU2: -602359000822 176828627154
> > [ 0.252663] TSC ADJUST differs: Reference CPU0: -602358264300 CPU2:
> > -602359000822
> > [ 0.252664] TSC ADJUST synchronize: Reference CPU0: -602358264300
> > CPU2: -602359000822
> > [ 0.337014] TSC ADJUST: CPU3: -602360177680 177081093132
> > [ 0.337014] TSC ADJUST differs: Reference CPU0: -602358264300 CPU3:
> > -602360177680
> > [ 0.337015] TSC ADJUST synchronize: Reference CPU0: -602358264300
> > CPU3: -602360177680
> >
> > and so on.
> >
> > Albeit after another reboot (some minutes later), it actually straight
> > locked up again:
> >
> > TSC ADJUST: CPU1: -8257481427958 165112676430
> > TSC ADJUST differs: Reference CPU0: -8257479484330 CPU1: -8257481427958
> > TSC ADJUST synchronize: Reference CPU0: -8257479484330 CPU1: -8254781427958
> > TSC target sync skip
> > ...
> > smpboot: Target CPU is online
> >
> > So, actually I thought the TSC would get reset too on warm boot, but
> > clearly looks like that isn't the case...
> > But I don't know what's the difference between first and second reboot -
> > the adjust values have just more magnitude, but otherwise even the
> > direction of the adjustments and everything looks all the same (just
> > like cold boot, which also looks all the same to me).
>
> I haven't found a pattern for the lockups yet and we have to wait for Intel
> to provide useful information about that issue. All we know so far is that
> negative adjust values are dangerous.

Did some further investigation.
The values which cause the interrupt storms have very clearly identifiable
points which reliably reproduce:

Positive space, results in timer not firing anymore - at least not in a
time frame you are willing to wait for.

   0x0000 0000 8000 0000

Negative space, results in an interrupt storm.

   0xffff ffff 0000 0000
   0xffff fffe 0000 0000
   0xffff fffd 0000 0000
   0xffff fffc 0000 0000
   0xffff fffb 0000 0000
   ....

These points are independent of the underlying counter value (cold boot,
warm boot) and reproduce reliably even after hours of power on.

And looking at the values makes me wonder about 32bit vs. 64bit wreckage
combined with sign extension done wrong. I'm really impressed!

In the negative space there is something else going on which is dependent
on the counter value. Right after cold boot the space is closer to zero
than after hours of power on.

So the approach of forbidding negative values is definitely not wrong.

Thanks,

	tglx
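
For readers who want to see why the values listed in this mail smell like 32bit vs. 64bit trouble, here is a minimal C sketch of the arithmetic only. It is not a model of what the hardware does, which this thread leaves open. The helper names (widen_zero_extend, widen_sign_extend, negative_storm_point) are made up for illustration. The sketch assumes the usual two's complement behaviour when converting an out-of-range value to int32_t, which mainstream compilers provide.

```c
#include <assert.h>
#include <stdint.h>

/* Correct widening: the 32-bit quantity is zero-extended to 64 bits. */
static int64_t widen_zero_extend(uint32_t v)
{
	return (int64_t)(uint64_t)v;
}

/*
 * Hypothetical faulty widening: the 32-bit quantity is treated as
 * signed first, so bit 31 gets copied into bits 32..63.
 * Assumption: the uint32_t -> int32_t conversion wraps (two's
 * complement), as it does on mainstream compilers.
 */
static int64_t widen_sign_extend(uint32_t v)
{
	return (int64_t)(int32_t)v;
}

/*
 * n-th value of the negative list in the mail, i.e. -n * 2^32:
 * n = 1 -> 0xffffffff00000000, n = 2 -> 0xfffffffe00000000, ...
 */
static int64_t negative_storm_point(unsigned int n)
{
	assert(n >= 1);
	return -((int64_t)n << 32);
}
```

With these helpers, widen_zero_extend(0x80000000) stays at the positive point 0x0000000080000000, while widen_sign_extend(0x80000000) turns into 0xffffffff80000000. Every negative point from the list is an exact multiple of 2^32, so its low 32 bits are all zero.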