From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.5 required=3.0 tests=DKIM_INVALID,DKIM_SIGNED, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E82A2C5DF60 for ; Wed, 6 Nov 2019 00:09:08 +0000 (UTC) Received: from silver.osuosl.org (smtp3.osuosl.org [140.211.166.136]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 8DDB821A49 for ; Wed, 6 Nov 2019 00:09:08 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=intel-com.20150623.gappssmtp.com header.i=@intel-com.20150623.gappssmtp.com header.b="e+Di3mlc" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 8DDB821A49 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=driverdev-devel-bounces@linuxdriverproject.org Received: from localhost (localhost [127.0.0.1]) by silver.osuosl.org (Postfix) with ESMTP id 1BB2F2267B; Wed, 6 Nov 2019 00:09:08 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Received: from silver.osuosl.org ([127.0.0.1]) by localhost (.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id OxLZXdCe2I7J; Wed, 6 Nov 2019 00:09:07 +0000 (UTC) Received: from ash.osuosl.org (ash.osuosl.org [140.211.166.34]) by silver.osuosl.org (Postfix) with ESMTP id 6FB7422668; Wed, 6 Nov 2019 00:09:06 +0000 (UTC) Received: from silver.osuosl.org (smtp3.osuosl.org [140.211.166.136]) by ash.osuosl.org (Postfix) with ESMTP id D46171BF329 for ; Wed, 6 Nov 2019 00:09:01 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by silver.osuosl.org (Postfix) with ESMTP id 9791722668 for ; Wed, 6 Nov 2019 00:08:55 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Received: from silver.osuosl.org ([127.0.0.1]) by localhost (.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Qp3cgdRarI44 for ; Wed, 6 Nov 2019 00:08:47 +0000 (UTC) X-Greylist: from auto-whitelisted by SQLgrey-1.7.6 Received: from mail-ot1-f65.google.com (mail-ot1-f65.google.com [209.85.210.65]) by silver.osuosl.org (Postfix) with ESMTPS id 0905722650 for ; Wed, 6 Nov 2019 00:08:44 +0000 (UTC) Received: by mail-ot1-f65.google.com with SMTP id b16so19287555otk.9 for ; Tue, 05 Nov 2019 16:08:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=intel-com.20150623.gappssmtp.com; s=20150623; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=NBnxU3butVFZM7AnAXfHT8dkMjtUFYfyCvUDW61SXV0=; b=e+Di3mlc5nguC0AjBP91KU4QLtXyLv60VHL8bcA2yTFqSx8GzUT3JPnO2SUvw9dJ3e x892tG+naCnJ7xx1/XyJqFBd5BAvIkbzvc8Hm+iANPYt1TTQlO54TnNTT2/Vr/wnnHwU cWH/uHsxgeOfe757EoaxLRwaf5ARnH5aT0zdOmRUm2qEb9I99F92KkPxOUH4REMi83wk kEyAkGq3/W1xKyxTMdH0KRUk/WXIRl4YbdL49pTFBgS4yeLctZvOTpoL9B1muQPVqJmY p3ipEL1tac4Bnfpc8lK1KzxSavaLkdb1VqqzcOJA3HvnvhhzQSrRj7AaXeANHjyy47rG tLMg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=NBnxU3butVFZM7AnAXfHT8dkMjtUFYfyCvUDW61SXV0=; b=g1ftzc/jpGigzKc9vjAUdTpfEXA34Xz1lo5OJBEOSso/hAv4KxGqy09oBacpCYc89F xJFHLULi1BbsdcvgO5IAhGAAxALt8o6wgbixiSjCNf45PsoD5/cugZwoz52UbUOre0YE 2ilubJYud4RdT5ieGvAkhyh2jKcAauLzpC20XTGSkgw9NqRYmnNmLD1rS9GUHvBfDlo4 UHopIwwYsJrTCXWnEYvFJKcG/FRhLuq8L0ePHEc3gIhAG4OsFJtcBWzgqLKJYzEwYBqT 03c5sjDTuV8I7sl1cjoYeSjJrzLab0kyWt0opaK2Kl81SpK9LBKNm+YTBVJ+gYKenKbu BOLg== X-Gm-Message-State: APjAAAXbWXO5RNoY5ZkRvZnNCIYypvxVejsBWNzkuwFtSHnmRZ75ODkb ctRcAGxrViNDm9rujLFQ4NsjdoL2t3v77yyBhTFGpIhWDg0= X-Google-Smtp-Source: APXvYqy6XKydDtiqLti7hhnEPqaloEXe8khYaaQC8o7B44In3kaSUL1URuHERGnTYAOJPrixGMDMzxSCyGGCNLsXEME= X-Received: by 2002:a9d:5f11:: with SMTP id f17mr24252119oti.207.1572998924064; Tue, 05 Nov 2019 16:08:44 -0800 (PST) MIME-Version: 1.0 References: <01adb4cb-6092-638c-0bab-e61322be7cf5@redhat.com> <613f3606-748b-0e56-a3ad-1efaffa1a67b@redhat.com> <20191105160000.GC8128@linux.intel.com> <20191105231316.GE23297@linux.intel.com> <20191106000315.GI23297@linux.intel.com> In-Reply-To: <20191106000315.GI23297@linux.intel.com> From: Dan Williams Date: Tue, 5 Nov 2019 16:08:32 -0800 Message-ID: Subject: Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes To: Sean Christopherson X-BeenThere: driverdev-devel@linuxdriverproject.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Linux Driver Project Developer List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: linux-hyperv@vger.kernel.org, Michal Hocko , =?UTF-8?B?UmFkaW0gS3LEjW3DocWZ?= , KVM list , David Hildenbrand , KarimAllah Ahmed , Benjamin Herrenschmidt , Dave Hansen , Alexander Duyck , Michal Hocko , Paul Mackerras , Linux MM , Pavel Tatashin , Paul Mackerras , Michael Ellerman , "H. Peter Anvin" , Wanpeng Li , Alexander Duyck , Thomas Gleixner , Kees Cook , devel@driverdev.osuosl.org, Stefano Stabellini , Stephen Hemminger , "Aneesh Kumar K.V" , Joerg Roedel , X86 ML , YueHaibing , "Matthew Wilcox \(Oracle\)" , Mike Rapoport , Peter Zijlstra , Ingo Molnar , Vlastimil Babka , Anthony Yznaga , Oscar Salvador , "Isaac J. Manjarres" , Juergen Gross , Anshuman Khandual , Haiyang Zhang , Sasha Levin , kvm-ppc@vger.kernel.org, Qian Cai , Alex Williamson , Mike Rapoport , Borislav Petkov , Nicholas Piggin , Andy Lutomirski , xen-devel , Boris Ostrovsky , Vitaly Kuznetsov , Allison Randal , Jim Mattson , Christophe Leroy , Mel Gorman , Adam Borowski , Cornelia Huck , Pavel Tatashin , Linux Kernel Mailing List , Johannes Weiner , Paolo Bonzini , Andrew Morton , linuxppc-dev Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: driverdev-devel-bounces@linuxdriverproject.org Sender: "devel" On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson wrote: > > On Tue, Nov 05, 2019 at 03:43:29PM -0800, Dan Williams wrote: > > On Tue, Nov 5, 2019 at 3:30 PM Dan Williams wrote: > > > > > > On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson > > > wrote: > > > > > > > > On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote: > > > > > On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand wrote: > > > > > > > The scarier code (for me) is transparent_hugepage_adjust() and > > > > > > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the > > > > > > > interaction between THP and _PAGE_DEVMAP. > > > > > > > > > > > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it > > > > > > had to be said :/ ). Luckily, this should be independent of the > > > > > > PG_reserved thingy AFAIKs. > > > > > > > > > > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte() > > > > > are honoring kvm_is_reserved_pfn(), so again I'm missing where the > > > > > page count gets mismanaged and leads to the reported hang. > > > > > > > > When mapping pages into the guest, KVM gets the page via gup(), which > > > > increments the page count for ZONE_DEVICE pages. But KVM puts the page > > > > using kvm_release_pfn_clean(), which skips put_page() if PageReserved() > > > > and so never puts its reference to ZONE_DEVICE pages. > > > > > > Oh, yeah, that's busted. > > > > Ugh, it's extra busted because every other gup user in the kernel > > tracks the pages resulting from gup and puts them (put_page()) when > > they are done. KVM wants to forget about whether it did a gup to get > > the page and optionally trigger put_page() based purely on the pfn. > > Outside of VFIO device assignment that needs pages pinned for DMA, why > > does KVM itself need to pin pages? If pages are pinned over a return > > to userspace that needs to be a FOLL_LONGTERM gup. > > Short answer, KVM pins the page to ensure correctness with respect to the > primary MMU invalidating the associated host virtual address, e.g. when > the page is being migrated or unmapped from host userspace. > > The main use of gup() is to handle guest page faults and map pages into > the guest, i.e. into KVM's secondary MMU. KVM uses gup() to both get the > PFN and to temporarily pin the page. The pin is held just long enough to > guaranteed that any invalidation via the mmu_notifier will be stalled > until after KVM finishes installing the page into the secondary MMU, i.e. > the pin is short-term and not held across a return to userspace or entry > into the guest. When a subsequent mmu_notifier invalidation occurs, KVM > pulls the PFN from the secondary MMU and uses that to update accessed > and dirty bits in the host. > > There are a few other KVM flows that eventually call into gup(), but those > are "traditional" short-term pins and use put_page() directly. Ok, I was misinterpreting the effect of the bug with what KVM is using the reference to do. To your other point: > But David's proposed fix for the above refcount bug is to omit the patch > so that KVM no longer treats ZONE_DEVICE pages as reserved. That seems > like the right thing to do, including for thp_adjust(), e.g. it would > naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is > mapped with a huge page (2mb or above) in the host. The only hiccup is > figuring out how to correctly transfer the reference. That might not be the only hiccup. There's currently no such thing as huge pages for ZONE_DEVICE, there are huge *mappings* (pmd and pud), but the result of pfn_to_page() on such a mapping does not yield a huge 'struct page'. It seems there are other paths in KVM that assume that more typical page machinery is active like SetPageDirty() based on kvm_is_reserved_pfn(). While I told David that I did not want to see more usage of is_zone_device_page(), this patch below (untested) seems a cleaner path with less surprises: diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4df0aa6b8e5c..fbea17c1810c 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean); void kvm_release_pfn_clean(kvm_pfn_t pfn) { - if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) + if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) || + (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn)))) put_page(pfn_to_page(pfn)); } EXPORT_SYMBOL_GPL(kvm_release_pfn_clean); This is safe because the reference that KVM took earlier protects the is_zone_device_page() lookup from racing device teardown. Otherwise, if KVM does not have a reference it's unsafe, but that's already even more broken because KVM would be releasing a page that it never referenced. Every other KVM operation that assumes page allocator pages would continue to honor kvm_is_reserved_pfn(). _______________________________________________ devel mailing list devel@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel