From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.3 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5D5C4C5DF62 for ; Wed, 6 Nov 2019 06:57:06 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id DD8C5217F5 for ; Wed, 6 Nov 2019 06:57:05 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="K8UBhX1J" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org DD8C5217F5 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 4F4426B0003; Wed, 6 Nov 2019 01:57:05 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 47E666B0006; Wed, 6 Nov 2019 01:57:05 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 31F5B6B0007; Wed, 6 Nov 2019 01:57:05 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0101.hostedemail.com [216.40.44.101]) by kanga.kvack.org (Postfix) with ESMTP id 15DE46B0003 for ; Wed, 6 Nov 2019 01:57:05 -0500 (EST) Received: from smtpin08.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with SMTP id AA3F045A6 for ; Wed, 6 Nov 2019 06:57:04 +0000 (UTC) X-FDA: 76124945568.08.store62_3dbcb5075322f X-HE-Tag: store62_3dbcb5075322f X-Filterd-Recvd-Size: 11414 Received: from us-smtp-1.mimecast.com (us-smtp-delivery-1.mimecast.com [207.211.31.120]) by imf36.hostedemail.com (Postfix) with ESMTP for ; Wed, 6 Nov 2019 06:57:04 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1573023423; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=gHhiVRGjTg2z5h3dRfpoqGRkrXa+G8taPvqq9+5Ox4o=; b=K8UBhX1JtFZMs594Ypt3noWNFA7EDgGOzgTZeoLrecZiYBc0z9aQ1ftUT60ADHOM/4fKea S4RheFJRLPHeZXXsdxGjj1lj/DBLQ6j6EczxwIps2sdVeLlU8ExvD/QiqqB572Rk55bS6J wibfYYo3txGWC+O//e5qvLzZuGhvknc= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-43-W8IUVLyUMSC0rC0dumrlPQ-1; Wed, 06 Nov 2019 01:57:02 -0500 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 7029E8017E0; Wed, 6 Nov 2019 06:56:55 +0000 (UTC) Received: from [10.36.116.143] (ovpn-116-143.ams2.redhat.com [10.36.116.143]) by smtp.corp.redhat.com (Postfix) with ESMTP id DB0E15D70E; Wed, 6 Nov 2019 06:56:35 +0000 (UTC) Subject: Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes To: Dan Williams , Sean Christopherson Cc: Linux Kernel Mailing List , Linux MM , Michal Hocko , Andrew Morton , kvm-ppc@vger.kernel.org, linuxppc-dev , KVM list , linux-hyperv@vger.kernel.org, devel@driverdev.osuosl.org, xen-devel , X86 ML , Alexander Duyck , Alexander Duyck , Alex Williamson , Allison Randal , Andy Lutomirski , "Aneesh Kumar K.V" , Anshuman Khandual , Anthony Yznaga , Benjamin Herrenschmidt , Borislav Petkov , Boris Ostrovsky , Christophe Leroy , Cornelia Huck , Dave Hansen , Haiyang Zhang , "H. Peter Anvin" , Ingo Molnar , "Isaac J. Manjarres" , Jim Mattson , Joerg Roedel , Johannes Weiner , Juergen Gross , KarimAllah Ahmed , Kees Cook , "K. Y. Srinivasan" , "Matthew Wilcox (Oracle)" , Matt Sickler , Mel Gorman , Michael Ellerman , Michal Hocko , Mike Rapoport , Mike Rapoport , Nicholas Piggin , Oscar Salvador , Paolo Bonzini , Paul Mackerras , Paul Mackerras , Pavel Tatashin , Pavel Tatashin , Peter Zijlstra , Qian Cai , =?UTF-8?B?UmFkaW0gS3LEjW3DocWZ?= , Sasha Levin , Stefano Stabellini , Stephen Hemminger , Thomas Gleixner , Vitaly Kuznetsov , Vlastimil Babka , Wanpeng Li , YueHaibing , Adam Borowski References: <01adb4cb-6092-638c-0bab-e61322be7cf5@redhat.com> <613f3606-748b-0e56-a3ad-1efaffa1a67b@redhat.com> <20191105160000.GC8128@linux.intel.com> <20191105231316.GE23297@linux.intel.com> <20191106000315.GI23297@linux.intel.com> From: David Hildenbrand Organization: Red Hat GmbH Message-ID: <694202e7-d8e6-6ac8-6e47-3553b298bbcc@redhat.com> Date: Wed, 6 Nov 2019 07:56:34 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.1.1 MIME-Version: 1.0 In-Reply-To: Content-Language: en-US X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 X-MC-Unique: W8IUVLyUMSC0rC0dumrlPQ-1 X-Mimecast-Spam-Score: 0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 06.11.19 01:08, Dan Williams wrote: > On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson > wrote: >> >> On Tue, Nov 05, 2019 at 03:43:29PM -0800, Dan Williams wrote: >>> On Tue, Nov 5, 2019 at 3:30 PM Dan Williams = wrote: >>>> >>>> On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson >>>> wrote: >>>>> >>>>> On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote: >>>>>> On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand = wrote: >>>>>>>> The scarier code (for me) is transparent_hugepage_adjust() and >>>>>>>> kvm_mmu_zap_collapsible_spte(), as I don't at all understand the >>>>>>>> interaction between THP and _PAGE_DEVMAP. >>>>>>> >>>>>>> The x86 KVM MMU code is one of the ugliest code I know (sorry, but = it >>>>>>> had to be said :/ ). Luckily, this should be independent of the >>>>>>> PG_reserved thingy AFAIKs. >>>>>> >>>>>> Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte(= ) >>>>>> are honoring kvm_is_reserved_pfn(), so again I'm missing where the >>>>>> page count gets mismanaged and leads to the reported hang. >>>>> >>>>> When mapping pages into the guest, KVM gets the page via gup(), which >>>>> increments the page count for ZONE_DEVICE pages. But KVM puts the pa= ge >>>>> using kvm_release_pfn_clean(), which skips put_page() if PageReserved= () >>>>> and so never puts its reference to ZONE_DEVICE pages. >>>> >>>> Oh, yeah, that's busted. >>> >>> Ugh, it's extra busted because every other gup user in the kernel >>> tracks the pages resulting from gup and puts them (put_page()) when >>> they are done. KVM wants to forget about whether it did a gup to get >>> the page and optionally trigger put_page() based purely on the pfn. >>> Outside of VFIO device assignment that needs pages pinned for DMA, why >>> does KVM itself need to pin pages? If pages are pinned over a return >>> to userspace that needs to be a FOLL_LONGTERM gup. >> >> Short answer, KVM pins the page to ensure correctness with respect to th= e >> primary MMU invalidating the associated host virtual address, e.g. when >> the page is being migrated or unmapped from host userspace. >> >> The main use of gup() is to handle guest page faults and map pages into >> the guest, i.e. into KVM's secondary MMU. KVM uses gup() to both get th= e >> PFN and to temporarily pin the page. The pin is held just long enough t= o >> guaranteed that any invalidation via the mmu_notifier will be stalled >> until after KVM finishes installing the page into the secondary MMU, i.e= . >> the pin is short-term and not held across a return to userspace or entry >> into the guest. When a subsequent mmu_notifier invalidation occurs, KVM >> pulls the PFN from the secondary MMU and uses that to update accessed >> and dirty bits in the host. >> >> There are a few other KVM flows that eventually call into gup(), but tho= se >> are "traditional" short-term pins and use put_page() directly. >=20 > Ok, I was misinterpreting the effect of the bug with what KVM is using > the reference to do. >=20 > To your other point: >=20 >> But David's proposed fix for the above refcount bug is to omit the patch >> so that KVM no longer treats ZONE_DEVICE pages as reserved. That seems >> like the right thing to do, including for thp_adjust(), e.g. it would >> naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is >> mapped with a huge page (2mb or above) in the host. The only hiccup is >> figuring out how to correctly transfer the reference. >=20 > That might not be the only hiccup. There's currently no such thing as > huge pages for ZONE_DEVICE, there are huge *mappings* (pmd and pud), > but the result of pfn_to_page() on such a mapping does not yield a > huge 'struct page'. It seems there are other paths in KVM that assume > that more typical page machinery is active like SetPageDirty() based > on kvm_is_reserved_pfn(). While I told David that I did not want to > see more usage of is_zone_device_page(), this patch below (untested) > seems a cleaner path with less surprises: >=20 > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 4df0aa6b8e5c..fbea17c1810c 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean); >=20 > void kvm_release_pfn_clean(kvm_pfn_t pfn) > { > - if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) > + if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) || > + (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn)))) > put_page(pfn_to_page(pfn)); > } > EXPORT_SYMBOL_GPL(kvm_release_pfn_clean); I had the same thought, but I do wonder about the kvm_get_pfn() users,=20 e.g.,: hva_to_pfn_remapped(): =09r =3D follow_pfn(vma, addr, &pfn); =09... =09kvm_get_pfn(pfn); =09... We would not take a reference for ZONE_DEVICE, but later drop one=20 reference via kvm_release_pfn_clean(). IOW, kvm_get_pfn() gets *really*=20 dangerous to use. I can't tell if this can happen right now. We do have 3 users of kvm_get_pfn() that we have to audit before this=20 change. Also, we should add a comment to kvm_get_pfn() that it should=20 never be used with possible ZONE_DEVICE pages. Also, we should add a comment to kvm_release_pfn_clean(), describing why=20 we treat ZONE_DEVICE in a special way here. We can then progress like this 1. Get this fix upstream, it's somewhat unrelated to this series. 2. This patch here remains as is in this series (+/- documentation update) 3. Long term, rework KVM to not have to not treat ZONE_DEVICE like=20 reserved pages. E.g., get rid of kvm_get_pfn(). Then, this special zone=20 check can go. --=20 Thanks, David / dhildenb