From: Wanpeng Li
Date: Thu, 16 May 2019 09:07:32 +0800
Subject: Re: [PATCH] sched: introduce configurable delay before entering idle
To: Ankur Arora
Cc: Marcelo Tosatti, kvm-devel, LKML, Thomas Gleixner, Ingo Molnar,
    Andrea Arcangeli, Bandan Das, Paolo Bonzini
References: <20190507185647.GA29409@amt.cnet> <20190514135022.GD4392@amt.cnet>
    <7e390fef-e0df-963f-4e18-e44ac2766be3@oracle.com>
In-Reply-To: <7e390fef-e0df-963f-4e18-e44ac2766be3@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 16 May 2019 at 02:42, Ankur Arora wrote:
>
> On 5/14/19 6:50 AM, Marcelo Tosatti wrote:
> > On Mon, May 13, 2019 at 05:20:37PM +0800, Wanpeng Li wrote:
> >> On Wed, 8 May 2019 at 02:57, Marcelo Tosatti wrote:
> >>>
> >>>
> >>> Certain workloads perform poorly on KVM compared to baremetal
> >>> due to baremetal's ability to perform mwait on NEED_RESCHED
> >>> bit of task flags (therefore skipping the IPI).
> >>
> >> KVM supports exposing mwait to the guest; could that solve this?
> >>
> >> Regards,
> >> Wanpeng Li
> >
> > Unfortunately mwait in guest is not feasible (incompatible with multiple
> > guests). Checking whether a paravirt solution is possible.
>
> Hi Marcelo,
>
> I was also looking at making MWAIT available to guests in a safe manner:
> whether through emulation or a PV-MWAIT. My (unsolicited) thoughts

MWAIT emulation is not simple; here is some research on it:
https://www.contrib.andrew.cmu.edu/~somlo/OSXKVM/mwait.html

Regards,
Wanpeng Li

> follow.
>
> We basically want to handle this sequence:
>
>     monitor(monitor_address);
>     if (*monitor_address == base_value)
>          mwaitx(max_delay);
>
> Emulation seems problematic because, AFAICS this would happen:
>
>     guest                                   hypervisor
>     =====                                   ====
>
>     monitor(monitor_address);
>         vmexit  ===>                        monitor(monitor_address)
>     if (*monitor_address == base_value)
>          mwait();
>               vmexit    ====>               mwait()
>
> There's a context switch back to the guest in this sequence which seems
> problematic. Both the AMD and Intel specs list system calls and
> far calls as events which would lead to the MWAIT being woken up:
> "Voluntary transitions due to fast system call and far calls (occurring
> prior to issuing MWAIT but after setting the monitor)".
>
>
> We could do this instead:
>
>     guest                                   hypervisor
>     =====                                   ====
>
>     monitor(monitor_address);
>         vmexit  ===>                        cache monitor_address
>     if (*monitor_address == base_value)
>          mwait();
>               vmexit    ====>               monitor(monitor_address)
>                                             mwait()
>
> But, this would miss the "if (*monitor_address == base_value)" check in
> the host which is problematic if *monitor_address changed simultaneously
> when monitor was executed.
> (Similar problem if we cache both the monitor_address and
> *monitor_address.)
>
>
> So, AFAICS, the only thing that would work is the guest offloading the
> whole PV-MWAIT operation.
>
> AFAICS, that could be a paravirt operation which needs three parameters:
> (monitor_address, base_value, max_delay.)
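
To make the ordering problem above concrete, here is a toy userspace model,
for illustration only. Everything in it is an assumption made for this
sketch: toy_monitor, toy_store and the store_point values are invented names,
and the model reduces a monitor to "an armed address plus a latch set by a
later store", which is far simpler than real MONITOR/MWAIT hardware or
anything in KVM. It only shows that a store landing between the guest's
compare and the host's deferred monitor() goes unnoticed, while arming before
comparing on the host side catches the store wherever it lands.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: a monitor is an armed address plus a latch that a
 * later store to that address sets. Real hardware is more subtle. */
struct toy_monitor {
    const volatile uint64_t *armed; /* NULL while not armed */
    bool triggered;                 /* set by a store after arming */
};

static void toy_monitor_arm(struct toy_monitor *m, const volatile uint64_t *addr)
{
    assert(m != NULL && addr != NULL);
    m->armed = addr;
    m->triggered = false;
}

/* A store from "another CPU": trips the latch only if already armed. */
static void toy_store(struct toy_monitor *m, volatile uint64_t *addr, uint64_t val)
{
    *addr = val;
    if (m->armed == addr)
        m->triggered = true;
}

/* Deferred arming: the guest's monitor() is only cached, and the host arms
 * the monitor at mwait() time without re-checking the value.
 * Returns true if the waiter would notice the store. */
bool deferred_arming_notices_store(void)
{
    struct toy_monitor m = { NULL, false };
    volatile uint64_t flag = 0;
    const uint64_t base_value = 0;

    /* guest: monitor() -> vmexit; host caches &flag, nothing is armed */
    if (flag != base_value)         /* guest: compare, value still unchanged */
        return true;
    toy_store(&m, &flag, 1);        /* store lands before the host arms */
    toy_monitor_arm(&m, &flag);     /* guest: mwait() -> vmexit; host arms */
    return m.triggered;             /* latch never set: waiter would sleep */
}

enum store_point { BEFORE_ARM, BEFORE_COMPARE, AFTER_COMPARE };

/* Offloaded operation: the host arms, compares, then waits, in one call. */
bool offloaded_notices_store(enum store_point when)
{
    struct toy_monitor m = { NULL, false };
    volatile uint64_t flag = 0;
    const uint64_t base_value = 0;

    if (when == BEFORE_ARM)
        toy_store(&m, &flag, 1);
    toy_monitor_arm(&m, &flag);
    if (when == BEFORE_COMPARE)
        toy_store(&m, &flag, 1);
    if (flag != base_value)         /* compare happens after arming */
        return true;
    if (when == AFTER_COMPARE)
        toy_store(&m, &flag, 1);
    return m.triggered;             /* latch catches the late store */
}
```

In this model deferred_arming_notices_store() reports that the store is
missed, while offloaded_notices_store() reports it as noticed for all three
store positions.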
>
> This would allow the guest to offload this whole operation to
> the host:
>     monitor(monitor_address);
>     if (*monitor_address == base_value)
>          mwaitx(max_delay);
>
> I'm guessing you are thinking along similar lines?
>
>
> High level semantics: If the CPU doesn't have any runnable threads, then
> we actually do this version of PV-MWAIT -- arming a timer if necessary
> so we only sleep until the time-slice expires or the MWAIT max_delay does.
>
> If the CPU has any runnable threads then this could still finish its
> time quantum or we could just do a schedule-out.
>
>
> So the semantics guaranteed to the host would be that PV-MWAIT returns
> after >= max_delay OR with the *monitor_address changed.
>
>
>
> Ankur
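
The return contract quoted above (return once max_delay has elapsed, or once
*monitor_address no longer equals base_value) can be sketched as a polling
loop in userspace. This is only a model under stated assumptions:
pv_mwait_model, tick_fn and toy_writer are hypothetical names, time is
counted in abstract "ticks" rather than real time, and a real implementation
would sleep on the monitor and a timer instead of polling.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

enum pv_mwait_result {
    PV_MWAIT_CHANGED,   /* *addr no longer equals base_value */
    PV_MWAIT_TIMEOUT    /* max_delay ticks elapsed with no change */
};

/* Hypothetical hook standing in for "other CPUs run for one tick". */
typedef void (*tick_fn)(void *ctx, uint64_t tick);

/* Model of the three-parameter operation (address, base value, max delay).
 * Stores the number of ticks spent waiting in *ticks_waited. */
enum pv_mwait_result pv_mwait_model(_Atomic uint64_t *addr, uint64_t base_value,
                                    uint64_t max_delay, tick_fn tick, void *ctx,
                                    uint64_t *ticks_waited)
{
    uint64_t t;

    assert(addr != NULL && ticks_waited != NULL);
    for (t = 0; t < max_delay; t++) {
        if (atomic_load(addr) != base_value) {
            *ticks_waited = t;      /* early return: value changed */
            return PV_MWAIT_CHANGED;
        }
        if (tick)
            tick(ctx, t);           /* let the "other CPU" make progress */
    }
    *ticks_waited = t;              /* waited the full max_delay */
    return atomic_load(addr) != base_value ? PV_MWAIT_CHANGED
                                           : PV_MWAIT_TIMEOUT;
}

/* Hypothetical writer that stores `value` to `addr` at tick `at_tick`. */
struct toy_writer {
    _Atomic uint64_t *addr;
    uint64_t at_tick;
    uint64_t value;
};

void toy_writer_tick(void *ctx, uint64_t tick)
{
    struct toy_writer *w = ctx;

    if (tick == w->at_tick)
        atomic_store(w->addr, w->value);
}
```

With no writer the model waits all max_delay ticks and reports a timeout;
with a toy_writer that fires partway through, it returns early with
PV_MWAIT_CHANGED.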