Message-ID: <74d2302f-88fc-c75c-6d2d-4aece1a515bb@molgen.mpg.de>
Date: Mon, 14 Feb 2022 14:45:49 +0100
Subject: Re: [PATCH v3 0/9] Parallel CPU bringup for x86_64
To: David Woodhouse
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
 x86@kernel.org, "H . Peter Anvin", Paolo Bonzini, "Paul E .
 McKenney", linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
 rcu@vger.kernel.org, mimoja@mimoja.de, hewenliang4@huawei.com,
 hushiyuan@huawei.com, luolongjun@huawei.com, hejingxian@huawei.com
References: <20211215145633.5238-1-dwmw2@infradead.org>
 <9a47b5ec-f2d1-94d9-3a48-9b326c88cfcb@molgen.mpg.de>
 <3bfacf45d2d0f3dfa3789ff5a2dcb46744aacff7.camel@infradead.org>
From: Paul Menzel
X-Mailing-List: linux-kernel@vger.kernel.org

Dear David,


On 29.12.21 at 14:54, David Woodhouse wrote:
> On Wed, 2021-12-29 at 14:18 +0100, Paul Menzel wrote:
>>> Or the one in
>>> https://lore.kernel.org/lkml/d4cde50b4aab24612823714dfcbe69bc4bb63b60.camel@infradead.org
>>>
>>> which makes it do nothing except prepare all the CPUs before bringing
>>> them up one at a time?
>>
>> I applied it on top the other one, and it made no difference either.
>
> It's possible I missed something else in the prepare stage that doesn't
> cope with all CPUs being prepared first.
>
> My next attempt might be to change the loop in bringup_nonboot_cpus()
> to bring all the CPUs not to the CPUHP_BP_PARALLEL_DYN state(s) but
> instead just bring them to somewhere like CPUHP_RCUTREE_PREP, which is
> somewhere in the middle between CPUHP_OFFLINE and CPUHP_BRINGUP_CPU.
>
> Then a binary chop search - if that one boots, try maybe
> CPUHP_TOPOLOGY_PREPARE. And if not, try CPUHP_PROFILE_PREPARE. Etc.
>
>>> My current theory (not that I've spent that much time thinking about it
>>> in the last week) is that there's something about the existing CPU
>>> bringup, possibly a CPU bug or something special about the AMD CPUs,
>>> which is triggered by just making it a little bit *faster*, which is
>>> why bringing them up from kexec (especially in qemu) can cause it too?
>>
>> Would having the serial console enabled make a difference?
>
> Yes.
> I couldn't make this fail in my EC2 m6a instance (for clean boots;
> I have never managed to kexec it) until I turned off the serial console
> to make things go faster.
>
>>> Tom seemed to find that it was in load_TR_desc(), so if you could try
>>> this hack on a machine that doesn't magically wink out of existence on
>>> a triplefault before even flushing its serial output, that would be
>>> much appreciated...
>
>> Unfortunately, no more messages were printed on the serial console.
>
> I suppose we need to litter those outputs somewhere earlier in the
> trampoline then, perhaps it *isn't* getting to load_TR_desc() in your
> case?
>
> Will be back online properly next week and can actually provide some of
> the above suggestions in patch form if you're willing to keep testing.

Sorry for replying so late. I saw your v4 patches, and tried commit
5e3524d21d2a () from your branch `parallel-5.17-part1`. Unfortunately,
the boot problem still persists on the AMD Ryzen 3 2200G system I tested
with. Please tell me where I should report these results (here or in
reply to the posted v4 patches).

Also, do you have (physical) access to a system with an AMD CPU? If not,
maybe we can get you one, so it’s more convenient for you to test.


Kind regards,

Paul