From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756105AbcGGCuX (ORCPT ); Wed, 6 Jul 2016 22:50:23 -0400 Received: from mga01.intel.com ([192.55.52.88]:12887 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751862AbcGGCuV (ORCPT ); Wed, 6 Jul 2016 22:50:21 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.28,322,1464678000"; d="scan'208";a="730703231" From: "Chen, Yu C" To: "Rafael J. Wysocki" , James Morse CC: "linux-pm@vger.kernel.org" , Thomas Gleixner , "H. Peter Anvin" , Pavel Machek , Borislav Petkov , Peter Zijlstra , Ingo Molnar , Len Brown , "x86@kernel.org" , "linux-kernel@vger.kernel.org" Subject: RE: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus Thread-Topic: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus Thread-Index: AQHR0RzTWcx5tST5T0eZO3HV+7Kwt6ALpf2AgACqsaA= Date: Thu, 7 Jul 2016 02:50:09 +0000 Message-ID: <36DF59CE26D8EE47B0655C516E9CE6402877EBB4@shsmsx102.ccr.corp.intel.com> References: <1467105403-5085-1-git-send-email-yu.c.chen@intel.com> <209704957.TWElLMTLfP@vostro.rjw.lan> In-Reply-To: <209704957.TWElLMTLfP@vostro.rjw.lan> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiMDUyZjY5OTItNDRjOC00ZjdiLTg1YmItM2RkNTM0ZGUyM2IzIiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE1LjkuNi42IiwiVHJ1c3RlZExhYmVsSGFzaCI6Ino4dUpRTGVMWHErMGpjZm9mMmxYSzY2UkpWdzVtOXVmUVMzVjNCcHhJRU09In0= x-ctpclassification: CTP_IC x-originating-ip: [10.239.127.40] Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from base64 to 8bit by mail.home.local id u672oTxQ028777 > -----Original Message----- > From: Rafael J. Wysocki [mailto:rjw@rjwysocki.net] > Sent: Thursday, July 07, 2016 8:33 AM > To: Chen, Yu C; James Morse > Cc: linux-pm@vger.kernel.org; Thomas Gleixner; H. Peter Anvin; Pavel Machek; > Borislav Petkov; Peter Zijlstra; Ingo Molnar; Len Brown; x86@kernel.org; linux- > kernel@vger.kernel.org > Subject: Re: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked > from disable_nonboot_cpus > > On Tuesday, June 28, 2016 05:16:43 PM Chen Yu wrote: > > Stress test from Varun Koyyalagunta reports that, the nonboot CPU > > would hang occasionally, when resuming from hibernation. Further > > investigation shows that, the precise stage when nonboot CPU hangs, is > > the time when the nonboot CPU been woken up incorrectly, and tries to > > monitor the mwait_ptr for the second time, then an exception is > > triggered due to illegal vaddr access, say, something like, 'Unable to > > handler kernel address of 0xffff8800ba800010...' > > > > Further investigation shows that, this exception is caused by > > accessing a page without PRESENT flag, because the pte entry for this > > vaddr is zero. Here's the scenario how this problem > > happens: Page table for direct mapping is allocated dynamically by > > kernel_physical_mapping_init, it is possible that in the resume > > process, when the boot CPU is trying to write back pages to their > > original address, and just right to writes to the monitor mwait_ptr > > then wakes up one of the nonboot CPUs, since the page table currently > > used by the nonboot CPU might not the same as it is before the > > hibernation, an exception might occur due to inconsistent page table. > > > > First try is to get rid of this problem by changing the monitor > > address from task.flag to zero page, because no one would write data > > to zero page. But there is still problem because of a ping-pong wake > > up scenario in mwait_play_dead: > > > > One possible implementation of a clflush is a read-invalidate snoop, > > which is what a store might look like, so cflush might break the mwait. > > > > 1. CPU1 wait at zero page > > 2. CPU2 cflush zero page, wake CPU1 up, then CPU2 waits at zero page > > 3. CPU1 is woken up, and invoke cflush zero page, thus wake up CPU2 again. > > then the nonboot CPUs never sleep for long. > > > > So it's better to monitor different address for each nonboot CPUs, > > however since there is only one zero page, at most: > > PAGE_SIZE/L1_CACHE_LINE CPUs are satisfied, which is usually 64 on a > > x86_64, apparently it's not enough for servers, maybe more zero pages > > are required. > > > > So choose a new solution as Brian suggested, to put the nonboot CPUs > > into hlt before resume, without touching any memory during s/r. > > Theoretically there might still be some problems if some of the CPUs > > have already been put offline, but since the case is very rare and > > users can work around it, we do not deal with this special case in > > kernel for now. > > > > BTW, as James mentioned, he might want to encapsulate > > disable_nonboot_cpus into arch-specific, so this patch might need small > change after that. > > > > Comments and suggestions would be appreciated. > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371 > > Reported-and-tested-by: Varun Koyyalagunta > > Signed-off-by: Chen Yu > > Below is my sort of version of this (untested) and I did it this way, because the > issue is specific to resume from hibernation (the workaround need not be > applied anywhere else) and the hibernate_resume_nonboot_cpu_disable() > thing may be useful to arm64 too if I'm not mistaken (James?). James might want a flag to distinguish whether it is from suspend or resume, in his arch-specific disabled_nonboot_cpus? and this patch works on my xeon. Tested-by: Chen Yu > > Actually, if arm64 uses it too, the __weak implementation can be dropped, > because it will be possible to make it depend on ARCH_HIBERNATION_HEADER > (x86 and arm64 are the only users of that). > > Thanks, > Rafael >