Date: Fri, 20 Jan 2023 18:35:48 +0000
From: Sean Christopherson
To: Igor Mammedov
Cc: "Srivatsa S. Bhat" , Thomas Gleixner , linux-kernel@vger.kernel.org,
	amakhalov@vmware.com, ganb@vmware.com, ankitja@vmware.com,
	bordoloih@vmware.com, keerthanak@vmware.com, blamoreaux@vmware.com,
	namit@vmware.com, Peter Zijlstra , Ingo Molnar , Borislav Petkov ,
	Dave Hansen , "H. Peter Anvin" , "Rafael J. Wysocki" ,
	"Paul E. McKenney" , Wyes Karny , Lewis Caroll , Tom Lendacky ,
	Juergen Gross , x86@kernel.org, VMware PV-Drivers Reviewers ,
	virtualization@lists.linux-foundation.org, kvm@vger.kernel.org,
	xen-devel@lists.xenproject.org
Subject: Re: [PATCH v2] x86/hotplug: Do not put offline vCPUs in mwait idle state
Message-ID:
References: <20230116060134.80259-1-srivatsa@csail.mit.edu>
	<20230116155526.05d37ff9@imammedo.users.ipa.redhat.com>
	<87bkmui5z4.ffs@tglx>
	<20230120163734.63e62444@imammedo.users.ipa.redhat.com>
In-Reply-To: <20230120163734.63e62444@imammedo.users.ipa.redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jan 20, 2023, Igor Mammedov wrote:
> On Fri, 20 Jan 2023 05:55:11 -0800
> "Srivatsa S. Bhat" wrote:
>
> > Hi Igor and Thomas,
> >
> > Thank you for your review!
> >
> > On 1/19/23 1:12 PM, Thomas Gleixner wrote:
> > > On Mon, Jan 16 2023 at 15:55, Igor Mammedov wrote:
> > >> "Srivatsa S. Bhat" wrote:
> > >>> Fix this by preventing the use of mwait idle state in the vCPU offline
> > >>> play_dead() path for any hypervisor, even if mwait support is
> > >>> available.
> > >>
> > >> if mwait is enabled, it's very likely the guest has cpuidle
> > >> enabled and using the same mwait as well. So exiting early from
> > >> mwait_play_dead() might just punt the workflow down:
> > >>   native_play_dead()
> > >>     ...
> > >>     mwait_play_dead();
> > >>     if (cpuidle_play_dead())   <- possible mwait here
> > >>         hlt_play_dead();
> > >>
> > >> and it will end up in mwait again, and only if that fails
> > >> will it go the HLT route and maybe transition to the VMM.
> > >
> > > Good point.
> > >
> > >> Instead of a workaround on the guest side,
> > >> shouldn't the hypervisor force a VMEXIT on the being-unplugged vCPU when it's
> > >> actually hot-unplugging the vCPU?
> > >> (ex: QEMU kicks the vCPU out of guest
> > >> context when it is removing the vCPU, among other things)
> > >
> > > For a pure guest side CPU unplug operation:
> > >
> > >    guest$ echo 0 >/sys/devices/system/cpu/cpu$N/online
> > >
> > > the hypervisor is not involved at all. The vCPU is not removed in that
> > > case.
> > >
> >
> > Agreed, and this is indeed the scenario I was targeting with this patch,
> > as opposed to vCPU removal from the host side. I'll add this clarification
> > to the commit message.

Forcing HLT doesn't solve anything; it's perfectly legal to passthrough HLT.
I guarantee there are use cases that passthrough HLT but _not_ MONITOR/MWAIT,
and use cases that passthrough all of them.

> commit message explicitly said:
> "which prevents the hypervisor from running other vCPUs or workloads on the
> corresponding pCPU."
>
> and that implies unplug on the hypervisor side as well.
> Why? That's because when the hypervisor exposes mwait to a guest, it has to
> reserve/pin a pCPU for each of the present vCPUs. And you can safely run
> other VMs/workloads on that pCPU only after it's no longer possible for it
> to be reused by the VM where it was used originally.

Pinning isn't strictly required from a safety perspective.  The latency of
context switching may suffer due to wake times, but preempting a vCPU that's
in C1 (or deeper) won't cause functional problems.  Passing through an entire
socket (or whatever scope triggers extra fun) might be a different story, but
pinning isn't strictly required.

That said, I 100% agree that this is expected behavior and not a bug.
Letting the guest execute MWAIT or HLT means the host won't have perfect
visibility into guest activity state.

Oversubscribing a pCPU and exposing MWAIT and/or HLT to vCPUs is generally
not done precisely because the guest will always appear busy without extra
effort on the host.  E.g. KVM requires an explicit opt-in from userspace to
expose MWAIT and/or HLT.
If someone really wants to efficiently oversubscribe pCPUs and passthrough
MWAIT, then their best option is probably to have a paravirt interface so
that the guest can tell the host it's offlining a vCPU.  Barring that, the
host could inspect the guest when preempting a vCPU to try and guesstimate
how much work the vCPU is actually doing in order to make better scheduling
decisions.

> Now consider the following worst (and most likely) case without unplug
> on the hypervisor side:
>
> 1. vm1mwait: pin pCPU2 to vCPU2
> 2. vm1mwait: guest$ echo 0 >/sys/devices/system/cpu/cpu2/online
>    -> HLT -> VMEXIT
> --
> 3. vm2mwait: pin pCPU2 to vCPUx and start the VM
> 4. vm2mwait: guest OS onlines vCPUx and starts using it, incl.
>    going into idle => mwait state
> --
> 5. vm1mwait: it still thinks that vCPU2 is present, so it can rightfully do:
>    guest$ echo 1 >/sys/devices/system/cpu/cpu2/online
> --
> 6.1 best case: vm1mwait's online fails after a timeout
> 6.2 worst case: vm2mwait does a VMEXIT on vCPUx around the time-frame when
>     vm1mwait onlines vCPU2; the online may succeed and then vm2mwait's
>     vCPUx will be stuck (possibly indefinitely) until for some reason a
>     VMEXIT happens on vm1mwait's vCPU2 _and_ the host decides to schedule
>     vCPUx on pCPU2, which would make vm1mwait stuck on vCPU2.
> So either way it's expected behavior.
>
> And if there is no intention to unplug the vCPU on the hypervisor side,
> then a VMEXIT on play_dead is not really necessary (mwait is better
> than HLT), since the hypervisor can't safely reuse the pCPU elsewhere and
> the vCPU goes into deep sleep within guest context.
>
> PS:
> The only case where taking the HLT/VMEXIT path on play_dead might work out
> would be if the new workload weren't pinned to the same pCPU and didn't
> use mwait (i.e. the host could migrate it elsewhere and schedule
> vCPU2 back on pCPU2).