From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E779DC00140 for ; Fri, 5 Aug 2022 23:30:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S241618AbiHEXaw (ORCPT ); Fri, 5 Aug 2022 19:30:52 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46270 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238141AbiHEXav (ORCPT ); Fri, 5 Aug 2022 19:30:51 -0400 Received: from mail-lf1-x12b.google.com (mail-lf1-x12b.google.com [IPv6:2a00:1450:4864:20::12b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F08881A817 for ; Fri, 5 Aug 2022 16:30:49 -0700 (PDT) Received: by mail-lf1-x12b.google.com with SMTP id z25so5395447lfr.2 for ; Fri, 05 Aug 2022 16:30:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc; bh=5+e88HfQH2lIucsXxwkZxdyfP+2njFIpGKdmQ0s6vaQ=; b=ZK8Dljk4TO46z8nvkk219ZxDRYc85V4rtZtlROQhonB3PPjOO2z0gTucP8WSJdOsKe qbGdeJOABn0xfboUvrMhnjnYC7+mQvxuFGsvcfVL2IdEgEOgYpgBzgQhuhgKcjZ6l80q 7DLaTUu2PpD0ClnNPaU1NAQvEGM+Voi6VXTVzJuNxhyP4GKlyTId4Gw1eRPCfScJhoxh MrKfzPxGa+JXf/Skg/YJcDDyECArSk5+EuGP33YpIYG99HBZuAaCX8lzPH8g/9vmeRA1 7lCOCLqML6TfLAOLeNOxWiaesSJor/7yCdZmoJaO1uBlXWzv0DZYj/gFyJ4jR5nUISGa nf7Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc; bh=5+e88HfQH2lIucsXxwkZxdyfP+2njFIpGKdmQ0s6vaQ=; b=7no/mwuTvXBtSVGt/7vhJOsnXN7C/O67+vSmar/zyso7sz20ntw7y4KdJb0CpBMFVU oL7Kgr6Sxx29emMU3pxB1VhBCg2U1GOO6KyKVj/DuZU9zmjnvaCYMVXcVmyn9J+RoH8V pVY0YmJDx2ApaU8MAk7V2aUBzRH7yuk4p+3i3hlJlUt+5EhI8R//fi1EfySN8a6rNJq5 2SvpxJ3DfLstRq9DcX/x7rHEi3CJFB5thDv3TcNGIQ0JkpCFAYlDfzS0TmP3hZIizmbs kAI7tzeuIEAre65rjqqnzXH88qXS4P3qoWKk/X/arzQePgYC3AOLlJjsLW7+99+b3n06 Kl8Q== X-Gm-Message-State: ACgBeo2qnY0xOoPErM1W078YqU/bXUtn7i2Mz33YHxfI58vF4cifvz82 r+eaAJrocUgdKNTv6pC1ujUUlLJFJMEea82sDfckyg== X-Google-Smtp-Source: AA6agR71hMou1+TenheE26zSVkyrFcJSLp/xxnA0rMCiPfBjfVp6MbcTrdFToBJoN9OoHXbHFye0UY5qBwCaNzGDYu8= X-Received: by 2002:a05:6512:b03:b0:489:e00a:b32 with SMTP id w3-20020a0565120b0300b00489e00a0b32mr2883849lfu.368.1659742248023; Fri, 05 Aug 2022 16:30:48 -0700 (PDT) MIME-Version: 1.0 References: <20220801151928.270380-1-vipinsh@google.com> <4ccbafb5-9157-ec73-c751-ec71164f8688@redhat.com> In-Reply-To: From: Vipin Sharma Date: Fri, 5 Aug 2022 16:30:11 -0700 Message-ID: Subject: Re: [PATCH] KVM: x86/mmu: Make page tables for eager page splitting NUMA aware To: Paolo Bonzini , Sean Christopherson , David Matlack Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org On Wed, Aug 3, 2022 at 8:08 AM Paolo Bonzini wrote: > > On 8/2/22 21:12, Sean Christopherson wrote: > > Ah crud, I misread the patch. I was thinking tdp_mmu_spte_to_nid() was getting > > the node for the existing shadow page, but it's actually getting the node for the > > underlying physical huge page. > > > > That means this patch is broken now that KVM doesn't require a "struct page" to > > create huge SPTEs (commit a8ac499bb6ab ("KVM: x86/mmu: Don't require refcounted > > "struct page" to create huge SPTEs"). I.e. this will explode as pfn_to_page() > > may return garbage. > > > > return page_to_nid(pfn_to_page(spte_to_pfn(spte))); > > I was about to say that yesterday. However my knowledge of struct page > things has been proved to be spotty enough, that I wasn't 100% sure of > that. But yeah, with a fresh mind it's quite obvious that anything that > goes through hva_to_pfn_remap and similar paths will fail. > > > That said, I still don't like this patch, at least not as a one-off thing. I do > > like the idea of having page table allocations follow the node requested for the > > target pfn, what I don't like is having one-off behavior that silently overrides > > userspace policy. > > Yes, I totally agree with making it all or nothing. > > > I would much rather we solve the problem for all page table allocations, maybe > > with a KVM_CAP or module param? Unfortunately, that's a very non-trivial change > > because it will probably require having a per-node cache in each of the per-vCPU > > caches. > > A module parameter is fine. If you care about performance to this > level, your userspace is probably homogeneous enough. > Thank you all for the feedback and suggestions. Regarding dirty_log_perf_test, I will send out a patch to add an option which will tag vcpus to physical cpus using sched_setaffinity() calls. This should increase accuracy for the test. It seems like we are agreeing on two things: 1. Keep the same behavior for the page table pages allocation for both during the page fault and during eager page splitting. 2. Page table pages should be allocated on the same node where the underlying guest memory page is residing. Here are the two approaches, please provide feedback on which one looks more appropriate before I start spamming your inbox with my patches Approach A: Have per numa node cache for page table pages allocation Instead of having only one mmu_shadow_page_cache per vcpu, we provide multiple caches for each node either: mmu_shadow_page_cache[MAX_NUMNODES] or mmu_shadow_page_cache->objects[MAX_NUMNODES * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE] We can decrease KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE to some lower value instead of 40 to control memory consumption. When a fault happens, use the pfn to find which node the page should belong to and use the corresponding cache to get page table pages. struct *page = kvm_pfn_to_refcounted_page(pfn); int nid; if(page) { nid = page_to_nid(page); } else { nid = numa_node_id(); } ... tdp_mmu_alloc_sp(nid, vcpu); ... static struct kvm_mmu_page *tdp_mmu_alloc_sp(int nid, struct kvm_vcpu *vcpu) { ... sp->spt = kvm_mmu_memory_cache_alloc(nid, &vcpu->arch.mmu_shadow_page_cache); ... } Since we are changing cache allocation for page table pages, should we also make similar changes for other caches like mmu_page_header_cache, mmu_gfn_array_cache, and mmu_pte_list_desc_cache? I am not sure how good this idea is. Approach B: Ask page from the specific node on fault path with option to fallback to the original cache and default task policy. This is what Sean's rough patch looks like.