From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id DF9C8C19F2D for ; Tue, 9 Aug 2022 16:52:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S245080AbiHIQwn (ORCPT ); Tue, 9 Aug 2022 12:52:43 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35390 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S245072AbiHIQwl (ORCPT ); Tue, 9 Aug 2022 12:52:41 -0400 Received: from mail-oi1-x22f.google.com (mail-oi1-x22f.google.com [IPv6:2607:f8b0:4864:20::22f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DC99622BC2 for ; Tue, 9 Aug 2022 09:52:37 -0700 (PDT) Received: by mail-oi1-x22f.google.com with SMTP id c185so14530203oia.7 for ; Tue, 09 Aug 2022 09:52:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc; bh=U/D22wA89KCZ8FoM5lxlmalnb8zbCkLG8x+Wao6IKHs=; b=gri5en4j1Nf3AVExb/xtD1i/SndOYTeJjQtSC4usRK/O5d48jIap8p682Zeazqp1Pe ipEbBmYcoxs2HYZIFr0omCnqtPLNcjZBirFyZdioPg2ZHJTdjjl3rybGv5A8NN8sEr2D j/46Th9HahRaqHle8jwPoK5mftTBUmii9vmBapkxypM1FxDjqyr8XLKNjh7zeXnt2/do wB/lJlW0htmLpjlsTEnwrlPtWZEKxAXiszhwYqKQSPvWhm0V6psnPZtA7vYtrguTQ8qm P0tnIJh4/g7qzJlNnNSX9XeWQuoDh5KaFvMiDWpq29FKzA3WM7kDA7XCjhYsgnVyiCoB uyiA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc; bh=U/D22wA89KCZ8FoM5lxlmalnb8zbCkLG8x+Wao6IKHs=; b=1S83R3nUtuEHYAa+cjtNV8UiYtJBeapXAvrQnUMASPj8Wxz+aOjQhCI4veOUzyd0U0 iwyDCHSQYS+XFBPLBiO6wiRPqvDEA5kFAy1MTFQtmRt0Y2T9Y9y/qpWQuBDCl9xmdlB9 ujc4O59HJva7gfRxCq6XZxnTanLlcBE317HhCzu7AK6wM50MIwfaLGDrTpXj0Z9vpnQn A0T2YcrIQ/by5gA0sBSziairNVNOIZEDTBpa+o+Z0KmM1sm5Uo00W8UxKBot90dSH3DH UerBfAOqxmDR8xNYgpH02q8PSERhixYpmAhYKPU3nFfIjrQZNOcK3La6IA2QtIqxVyOP 6B2g== X-Gm-Message-State: ACgBeo2q4zfyxi9UTG1dVrBsoUsqrbT7mn6Zr9D8NpQ88nzuTW6iwbMa poUnNdwaOCz3mnNeyIs8hWbYqWa2gu4AYCWM48L/Pw== X-Google-Smtp-Source: AA6agR4EANbHbB06m6UK0Z8LDAx9jyRV3DCFcB8+aJ4dmS8ZHdO7iP7HvmWEfgbIWxIpnUmtw3QmaRRsASv33ufBVR4= X-Received: by 2002:a05:6808:16a3:b0:326:a585:95b8 with SMTP id bb35-20020a05680816a300b00326a58595b8mr13853758oib.281.1660063957038; Tue, 09 Aug 2022 09:52:37 -0700 (PDT) MIME-Version: 1.0 References: <20220801151928.270380-1-vipinsh@google.com> <4ccbafb5-9157-ec73-c751-ec71164f8688@redhat.com> In-Reply-To: From: David Matlack Date: Tue, 9 Aug 2022 09:52:10 -0700 Message-ID: Subject: Re: [PATCH] KVM: x86/mmu: Make page tables for eager page splitting NUMA aware To: Vipin Sharma Cc: Paolo Bonzini , Sean Christopherson , kvm list , LKML Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org On Fri, Aug 5, 2022 at 4:30 PM Vipin Sharma wrote: [...] > > Here are the two approaches, please provide feedback on which one > looks more appropriate before I start spamming your inbox with my > patches > > Approach A: > Have per numa node cache for page table pages allocation > > Instead of having only one mmu_shadow_page_cache per vcpu, we provide > multiple caches for each node > > either: > mmu_shadow_page_cache[MAX_NUMNODES] > or > mmu_shadow_page_cache->objects[MAX_NUMNODES * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE] I think the former approach will work better. The objects[] array is allocated dynamically during top-up now, so if a vCPU never allocates a page table to map memory on a given node, KVM will never have to allocate an objects[] array for that node. Whereas with the latter approach KVM would have to allocate the entire objects[] array up-front. > > We can decrease KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE to some lower value > instead of 40 to control memory consumption. I'm not sure we are getting any performance benefit from the cache size being so high. It doesn't fundamentally change the number of times a vCPU thread will have to call __get_free_page(), it just batches more of those calls together. Assuming reducing the cache size doesn't impact performance, I think it's a good idea to reduce it as part of this feature. KVM needs at most PT64_ROOT_MAX_LEVEL (5) page tables to handle a fault. So we could decrease the mmu_shadow_page_cache.objects[] capacity to PT64_ROOT_MAX_LEVEL (5) and support up to 8 NUMA nodes without increasing memory usage. If a user wants to run a VM on an even larger machine, I think it's safe to consume a few extra bytes for the vCPU shadow page caches at that point (the machine probably has 10s of TiB of RAM). > > When a fault happens, use the pfn to find which node the page should > belong to and use the corresponding cache to get page table pages. > > struct *page = kvm_pfn_to_refcounted_page(pfn); > int nid; > if(page) { > nid = page_to_nid(page); > } else { > nid = numa_node_id(); > } > > ... > tdp_mmu_alloc_sp(nid, vcpu); > ... > > static struct kvm_mmu_page *tdp_mmu_alloc_sp(int nid, struct kvm_vcpu *vcpu) { > ... > sp->spt = kvm_mmu_memory_cache_alloc(nid, > &vcpu->arch.mmu_shadow_page_cache); > ... > } > > > Since we are changing cache allocation for page table pages, should we > also make similar changes for other caches like mmu_page_header_cache, > mmu_gfn_array_cache, and mmu_pte_list_desc_cache? I am not sure how > good this idea is. We don't currently have a reason to make these objects NUMA-aware, so I would only recommend it if it somehow makes the code a lot simpler. > > Approach B: > Ask page from the specific node on fault path with option to fallback > to the original cache and default task policy. > > This is what Sean's rough patch looks like. This would definitely be a simpler approach but could increase the amount of time a vCPU thread holds the MMU lock when handling a fault, since KVM would start performing GFP_NOWAIT allocations under the lock. So my preference would be to try the cache approach first and see how complex it turns out to be.