Date: Tue, 20 Jul 2021 20:33:46 +0000
From: Sean Christopherson <seanjc@google.com>
To: Alexandru Elisei
Cc: Marc Zyngier, linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
	kvmarm@lists.cs.columbia.edu, linux-mm@kvack.org, Matthew Wilcox,
	Paolo Bonzini, Will Deacon, Quentin Perret, James Morse,
	Suzuki K Poulose, kernel-team@android.com
Subject: Re: [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute
	the THP mapping size
References: <20210717095541.1486210-1-maz@kernel.org>
	<20210717095541.1486210-2-maz@kernel.org>

On Tue, Jul 20, 2021, Alexandru Elisei wrote:
> Hi Marc,
>
> I just can't figure out why having the mmap lock is not needed to walk the
> userspace page tables. Any hints? Or am I not seeing where it's taken?

Disclaimer: I'm not super familiar with arm64's page tables, but the
relevant KVM functionality is common across x86 and arm64.

KVM arm64 (and x86) unconditionally registers a mmu_notifier for the
mm_struct associated with the VM, and disallows calling ioctls from a
different process, i.e. walking the page tables during KVM_RUN is
guaranteed to use the mm for which KVM registered the mmu_notifier.  As
part of registration, the mmu_notifier does mmgrab() and doesn't do
mmdrop() until it's unregistered.  That ensures the mm_struct itself is
live.

For the page tables' liveness, KVM implements mmu_notifier_ops.release,
which is invoked at the beginning of exit_mmap(), before the page tables
are freed.  In its implementation, KVM takes mmu_lock and zaps all its
shadow page tables, a.k.a. the stage2 tables in KVM arm64.  The flow in
question, get_user_mapping_size(), also runs under mmu_lock, and so
effectively blocks exit_mmap() and thus is guaranteed to run with live
userspace tables.

Lastly, KVM also implements mmu_notifier_ops.invalidate_range_{start,end}.
KVM's invalidate_range implementations also take mmu_lock, and also update
a sequence counter and a flag stating that there's an invalidation in
progress.  When installing a stage2 entry, KVM snapshots the sequence
counter before taking mmu_lock, and then checks it again after acquiring
mmu_lock.  If the counter mismatches, or an invalidation is in progress,
then KVM bails and resumes the guest without fixing the fault.  E.g. if
the host zaps userspace page tables and KVM "wins" the race, the
subsequent kvm_mmu_notifier_invalidate_range_start() will zap the
recently installed stage2 entries.
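Condensed from virt/kvm/kvm_main.c and include/linux/kvm_host.h of this
era (heavily simplified, memory barrier details and error handling
elided), the mechanism looks roughly like:

static int kvm_mmu_notifier_invalidate_range_start(...)
{
	...
	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_count++;	/* flag invalidation in progress */
	/* ... zap affected stage2 entries ... */
	spin_unlock(&kvm->mmu_lock);
	...
}

static void kvm_mmu_notifier_invalidate_range_end(...)
{
	...
	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_seq++;	/* bump seq before dropping count */
	kvm->mmu_notifier_count--;
	spin_unlock(&kvm->mmu_lock);
	...
}

and on the fault side, in the stage2 fault handler:

	/* Snapshot the sequence counter outside mmu_lock. */
	mmu_seq = vcpu->kvm->mmu_notifier_seq;
	smp_rmb();

	/* ... fault in the PFN, walk userspace page tables, etc. ... */

	spin_lock(&kvm->mmu_lock);
	/* Retry if count != 0 or seq changed, i.e. an invalidation is
	 * in progress or occurred since the snapshot was taken.
	 */
	if (mmu_notifier_retry(kvm, mmu_seq))
		goto out_unlock;	/* bail, resume the guest */
	/* ... install the stage2 entry ... */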
And if the host zap "wins" the race, KVM will resume the guest, which in
normal operation will hit the exception again and go back through the
entire process of installing stage2 entries.

Looking at the arm64 code, one thing I'm not clear on is whether arm64
correctly handles the case where exit_mmap() wins the race.  The
invalidate_range hooks will still be called, so the userspace page tables
aren't a problem, but kvm_arch_flush_shadow_all() -> kvm_free_stage2_pgd()
nullifies mmu->pgt without any additional notifications that I see.  x86
deals with this by ensuring its top-level TDP entry (stage2 equivalent) is
valid while the page fault handler is running.

void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
{
	struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
	struct kvm_pgtable *pgt = NULL;

	spin_lock(&kvm->mmu_lock);
	pgt = mmu->pgt;
	if (pgt) {
		mmu->pgd_phys = 0;
		mmu->pgt = NULL;
		free_percpu(mmu->last_vcpu_ran);
	}
	spin_unlock(&kvm->mmu_lock);
	...
}

AFAICT, nothing in user_mem_abort() would prevent consuming that NULL
mmu->pgt if exit_mmap() collided with user_mem_abort().

static int user_mem_abort(...)
{

	...

	spin_lock(&kvm->mmu_lock);
	pgt = vcpu->arch.hw_mmu->pgt;		<-- hw_mmu->pgt may be NULL (hw_mmu points at vcpu->kvm->arch.mmu)
	if (mmu_notifier_retry(kvm, mmu_seq))	<-- mmu_seq not guaranteed to change
		goto out_unlock;

	...

	if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
	} else {
		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
					     __pfn_to_phys(pfn), prot,
					     memcache);
	}
}
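Purely for illustration (hypothetical, not a tested fix), a NULL check on
the stage2 root under mmu_lock would at least avoid dereferencing the
freed pgt, though it doesn't address mmu_seq not being guaranteed to
change:

	spin_lock(&kvm->mmu_lock);
	pgt = vcpu->arch.hw_mmu->pgt;
	/* Hypothetical guard: treat a freed stage2 root like a pending
	 * invalidation and resume the guest without fixing the fault.
	 */
	if (!pgt || mmu_notifier_retry(kvm, mmu_seq))
		goto out_unlock;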