Subject: Re: [PATCH 00/22] Introduce the TDP MMU
To: Ben Gardon, linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
 Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
 Vitaly Kuznetsov, Xiao Guangrong
References: <20200925212302.3979661-1-bgardon@google.com>
From: Paolo Bonzini
Message-ID: <34785fca-4d6d-88d7-f90a-2e17815d02e6@redhat.com>
Date: Sat, 26 Sep 2020 03:14:01 +0200
In-Reply-To: <20200925212302.3979661-1-bgardon@google.com>

On 25/09/20 23:22, Ben Gardon wrote:
> Over the years, the needs for KVM's x86 MMU have grown from running small
> guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
> we previously depended on shadow paging to run all guests, we now have
> two dimensional paging (TDP). This patch set introduces a new
> implementation of much of the KVM MMU, optimized for running guests with
> TDP. We have re-implemented many of the MMU functions to take advantage of
> the relative simplicity of TDP and eliminate the need for an rmap.
> Building on this simplified implementation, a future patch set will change
> the synchronization model for this "TDP MMU" to enable more parallelism
> than the monolithic MMU lock. A TDP MMU is currently in use at Google
> and has given us the performance necessary to live migrate our 416 vCPU,
> 12TiB m2-ultramem-416 VMs.
>
> This work was motivated by the need to handle page faults in parallel for
> very large VMs. When VMs have hundreds of vCPUs and terabytes of memory,
> KVM's MMU lock suffers extreme contention, resulting in soft-lockups and
> long latency on guest page faults. This contention can be easily seen
> running the KVM selftests demand_paging_test with a couple hundred vCPUs.
> Over a 1 second profile of the demand_paging_test, with 416 vCPUs and 4G
> per vCPU, 98% of the time was spent waiting for the MMU lock. At Google,
> the TDP MMU reduced the test duration by 89% and the execution was
> dominated by get_user_pages and the user fault FD ioctl instead of the
> MMU lock.
>
> This series is the first of two. In this series we add a basic
> implementation of the TDP MMU. In the next series we will improve the
> performance of the TDP MMU and allow it to execute MMU operations
> in parallel.
>
> The overall purpose of the KVM MMU is to program paging structures
> (CR3/EPT/NPT) to encode the mapping of guest addresses to host physical
> addresses (HPA), and to provide utilities for other KVM features, for
> example dirty logging. The definition of the L1 guest physical address
> (GPA) to HPA mapping comes in two parts: KVM's memslots map GPA to HVA,
> and the kernel MM/x86 host page tables map HVA -> HPA. Without TDP, the
> MMU must program the x86 page tables to encode the full translation of
> guest virtual addresses (GVA) to HPA. This requires "shadowing" the
> guest's page tables to create a composite x86 paging structure. This
> solution is complicated, requires separate paging structures for each
> guest CR3, and requires emulating guest page table changes. The TDP case
> is much simpler. In this case, KVM lets the guest control CR3 and programs
> the EPT/NPT paging structures with the GPA -> HPA mapping. The guest has
> no way to change this mapping and only one version of the paging structure
> is needed per L1 paging mode. In this case the paging mode is some
> combination of the number of levels in the paging structure, the address
> space (normal execution or system management mode, on x86), and other
> attributes. Most VMs only ever use 1 paging mode and so only ever need one
> TDP structure.
>
> This series implements a "TDP MMU" through alternative implementations of
> MMU functions for running L1 guests with TDP. The TDP MMU falls back to
> the existing shadow paging implementation when TDP is not available, and
> interoperates with the existing shadow paging implementation for nesting.
> The use of the TDP MMU can be controlled by a module parameter which is
> snapshot on VM creation and follows the life of the VM. This snapshot
> is used in many functions to decide whether or not to use TDP MMU handlers
> for a given operation.
>
> This series can also be viewed in Gerrit here:
> https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
> (Thanks to Dmitry Vyukov for setting up the Gerrit instance)
>
> Ben Gardon (22):
>   kvm: mmu: Separate making SPTEs from set_spte
>   kvm: mmu: Introduce tdp_iter
>   kvm: mmu: Init / Uninit the TDP MMU
>   kvm: mmu: Allocate and free TDP MMU roots
>   kvm: mmu: Add functions to handle changed TDP SPTEs
>   kvm: mmu: Make address space ID a property of memslots
>   kvm: mmu: Support zapping SPTEs in the TDP MMU
>   kvm: mmu: Separate making non-leaf sptes from link_shadow_page
>   kvm: mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
>   kvm: mmu: Add TDP MMU PF handler
>   kvm: mmu: Factor out allocating a new tdp_mmu_page
>   kvm: mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
>   kvm: mmu: Support invalidate range MMU notifier for TDP MMU
>   kvm: mmu: Add access tracking for tdp_mmu
>   kvm: mmu: Support changed pte notifier in tdp MMU
>   kvm: mmu: Add dirty logging handler for changed sptes
>   kvm: mmu: Support dirty logging for the TDP MMU
>   kvm: mmu: Support disabling dirty logging for the tdp MMU
>   kvm: mmu: Support write protection for nesting in tdp MMU
>   kvm: mmu: NX largepage recovery for TDP MMU
>   kvm: mmu: Support MMIO in the TDP MMU
>   kvm: mmu: Don't clear write flooding count for direct roots
>
>  arch/x86/include/asm/kvm_host.h |   17 +
>  arch/x86/kvm/Makefile           |    3 +-
>  arch/x86/kvm/mmu/mmu.c          |  437 ++++++----
>  arch/x86/kvm/mmu/mmu_internal.h |   98 +++
>  arch/x86/kvm/mmu/paging_tmpl.h  |    3 +-
>  arch/x86/kvm/mmu/tdp_iter.c     |  198 +++++
>  arch/x86/kvm/mmu/tdp_iter.h     |   55 ++
>  arch/x86/kvm/mmu/tdp_mmu.c      | 1315 +++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.h      |   52 ++
>  include/linux/kvm_host.h        |    2 +
>  virt/kvm/kvm_main.c             |    7 +-
>  11 files changed, 2022 insertions(+), 165 deletions(-)
>  create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
>  create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
>  create mode 100644 arch/x86/kvm/mmu/tdp_mmu.c
>  create mode 100644 arch/x86/kvm/mmu/tdp_mmu.h
>

Ok, I've not finished reading the code but I already have an idea of
what it's like. I really think we should fast-track this as the basis
for more 5.11 work. I'll finish reviewing it and, if you don't mind, I
might make some of the changes myself so I get a chance to play with the
code and get accustomed to it; speak up if you disagree with them though!

Another thing I'd like to add is a few tracepoints.

Paolo
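
For readers following the thread: the two-part GPA -> HPA definition in
the quoted cover letter (memslots map GPA to HVA, host page tables map
HVA to HPA) can be pictured with a minimal userspace sketch. Every name
below (toy_memslot, toy_host_pte, toy_gpa_to_hpa, the fixed 4 KiB page
size) is hypothetical and is not KVM's actual data structure or API; a
flat array stands in for the host page tables purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  ((1ULL << TOY_PAGE_SHIFT) - 1)
#define TOY_INVALID    UINT64_MAX

/* Hypothetical stand-in for a memslot: a contiguous GPA range backed by HVAs. */
struct toy_memslot {
	uint64_t base_gpa;
	uint64_t npages;
	uint64_t base_hva;
};

/* Hypothetical stand-in for one host mapping: HVA page -> HPA page. */
struct toy_host_pte {
	uint64_t hva_page;
	uint64_t hpa_page;
};

/* Part 1: memslots map GPA -> HVA. */
uint64_t toy_gpa_to_hva(const struct toy_memslot *slots, size_t nslots,
			uint64_t gpa)
{
	for (size_t i = 0; i < nslots; i++) {
		uint64_t size = slots[i].npages << TOY_PAGE_SHIFT;

		if (gpa >= slots[i].base_gpa && gpa - slots[i].base_gpa < size)
			return slots[i].base_hva + (gpa - slots[i].base_gpa);
	}
	return TOY_INVALID; /* no memslot covers this GPA */
}

/* Part 2: host page tables map HVA -> HPA (modelled as a linear lookup). */
uint64_t toy_hva_to_hpa(const struct toy_host_pte *ptes, size_t nptes,
			uint64_t hva)
{
	uint64_t page = hva & ~TOY_PAGE_MASK;

	for (size_t i = 0; i < nptes; i++) {
		if (ptes[i].hva_page == page)
			return ptes[i].hpa_page | (hva & TOY_PAGE_MASK);
	}
	return TOY_INVALID; /* HVA not currently backed by a host page */
}

/* Composition: the GPA -> HPA mapping that would be encoded for a TDP guest. */
uint64_t toy_gpa_to_hpa(const struct toy_memslot *slots, size_t nslots,
			const struct toy_host_pte *ptes, size_t nptes,
			uint64_t gpa)
{
	uint64_t hva = toy_gpa_to_hva(slots, nslots, gpa);

	if (hva == TOY_INVALID)
		return TOY_INVALID;
	return toy_hva_to_hpa(ptes, nptes, hva);
}
```

Calling toy_gpa_to_hpa() with one slot and one host mapping resolves a
GPA only when both stages succeed, which is the sense in which the guest
cannot change the mapping: neither table is under guest control.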