Date: Thu, 21 Jan 2021 13:32:55 -0800
From: Sean Christopherson
To: Ben Gardon
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Paolo Bonzini,
 Peter Xu, Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
 Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong
Subject: Re: [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock
References: <20210112181041.356734-1-bgardon@google.com>
 <20210112181041.356734-20-bgardon@google.com>

On Thu, Jan 21, 2021, Sean Christopherson wrote:
> On Tue, Jan 12, 2021, Ben Gardon wrote:
> > +static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> > +				bool atomic)
> > +{
> > +	if (atomic)
>
> Summarizing an off-list discussion with Ben:
>
> This path isn't reachable in this series, which means all the RCU stuff is
> more or less untestable.  Only the page fault path modifies the MMU while
> holding a read lock, and it can't zap non-leaf shadow pages (it only zaps
> large SPTEs and installs new SPs).

Aha!  I was wrong.  This path will be hit when KVM zaps a 4k SPTE and
installs a large SPTE on top of an SP, e.g. if the host migrates a page for
compaction and creates a new THP:

  tdp_mmu_map_handle_target_level()
    tdp_mmu_set_spte_atomic()
      handle_changed_spte()
        __handle_changed_spte()
          handle_disconnected_tdp_mmu_page()
            tdp_mmu_unlink_page()

> The intent is to convert other zap-happy paths to a read lock, notably
> kvm_mmu_zap_collapsible_sptes() and kvm_recover_nx_lpages().  Ben will
> include patches to convert at least one of those in the next version of
> this series so that there is justification and coverage for the
> RCU-deferred freeing.

Somewhat off-topic, zap_collapsible_spte_range() looks wrong.  It zaps
non-leaf SPs, and it has several comments that make it quite clear that
that's its intent, but the logic is messed up.  For a non-leaf SP, the PFN
points at the next page table, not at the final PFN that is mapped into the
guest.  That should absolutely never be a reserved PFN, and whether or not
it's part of a huge page is irrelevant.  (A possible fix is sketched at the
end of this mail.)

My analysis is more or less confirmed by looking at Ben's internal code,
which does the exact opposite in that it explicitly zaps leaf SPTEs:

	tdp_root_for_each_pte(iter, root, start, end) {
		/* Ensure forward progress has been made before yielding. */
		if (iter.goal_gfn != last_goal_gfn &&
		    tdp_mmu_iter_flush_cond_resched(kvm, &iter)) {
			last_goal_gfn = iter.goal_gfn;
			spte_set = false;

			/*
			 * Yielding caused the paging structure walk to be
			 * reset so skip to the next iteration to continue
			 * the walk from the root.
			 */
			continue;
		}

		if (!is_shadow_present_pte(iter.old_spte) ||
		    is_last_spte(iter.old_spte, iter.level))	<--- inverted?
			continue;

		pfn = spte_to_pfn(iter.old_spte);	<-- this would be the page table?
		if (kvm_is_reserved_pfn(pfn) ||
		    !PageTransCompoundMap(pfn_to_page(pfn)))
			continue;

		tdp_mmu_set_spte(kvm, &iter, 0);
		spte_set = true;
	}

Coming back to this series, I wonder if the RCU approach is truly necessary
to get the desired scalability.  If both zap_collapsible_sptes() and NX huge
page recovery zap _only_ leaf SPTEs, then the only path that can actually
unlink a shadow page while holding the lock for read is the page fault path
that installs a huge page over an existing shadow page.

Assuming the above analysis is correct, I think it's worth exploring
alternatives to using RCU to defer freeing the SP memory, e.g. promoting to
a write lock in the specific case of overwriting an SP (though that kind of
promotion may not exist for rwlocks), or maybe something entirely different?
I actually do like the deferred-free concept, but I find it difficult to
reason about exactly what protections are provided by RCU, and what even
_needs_ to be protected.  Maybe we just need to add some __rcu annotations?
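
Going back to zap_collapsible_spte_range(): assuming the check really is
inverted, the fix would presumably be to flip the is_last_spte() check so
the loop skips non-leaf SPTEs and only inspects leaf SPTEs, at which point
spte_to_pfn() yields the PFN that is actually mapped into the guest.  A
rough, untested sketch, not taken from this series:

	/*
	 * Sketch only: process leaf SPTEs and skip everything else, so
	 * that the PFN checks below operate on the guest-mapped page
	 * rather than on a page table.
	 */
	if (!is_shadow_present_pte(iter.old_spte) ||
	    !is_last_spte(iter.old_spte, iter.level))
		continue;

	pfn = spte_to_pfn(iter.old_spte);
	if (kvm_is_reserved_pfn(pfn) ||
	    !PageTransCompoundMap(pfn_to_page(pfn)))
		continue;

	tdp_mmu_set_spte(kvm, &iter, 0);
	spte_set = true;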
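
On the deferred free itself, what I'd expect RCU to buy us is that a
lockless walker that has already read the link to a disconnected table can
keep dereferencing it until it leaves its read-side critical section.
Roughly the below, where the rcu_head field and the callback are
hypothetical, i.e. not lifted from this series:

	/*
	 * Hypothetical sketch: assumes kvm_mmu_page grows a
	 * 'struct rcu_head rcu_head' member for TDP MMU pages.
	 */
	static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
	{
		struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page,
						       rcu_head);

		/* No walker under rcu_read_lock() can still see this SP. */
		free_page((unsigned long)sp->spt);
		kmem_cache_free(mmu_page_header_cache, sp);
	}

	static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
					bool atomic)
	{
		...

		/*
		 * Page fault handlers running under the read lock may still
		 * hold a pointer to sp->spt, so defer the actual free until
		 * a grace period has elapsed.
		 */
		call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
	}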
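
And for the __rcu annotations, I'm thinking something along these lines, so
that sparse can actually check the rules for us (again hypothetical, the
accessor name is made up):

	/* Annotate the page table pointer so sparse enforces RCU rules. */
	u64 __rcu *spt;

	/* Hypothetical accessor: lockless walkers read SPTEs only via this. */
	static u64 kvm_tdp_mmu_read_spte(u64 __rcu *sptep)
	{
		return READ_ONCE(*rcu_dereference(sptep));
	}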