From: David Hildenbrand <david@redhat.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: "Kate Stewart" <kstewart@linuxfoundation.org>,
linux-hyperv@vger.kernel.org, "Michal Hocko" <mhocko@suse.com>,
"Radim Krčmář" <rkrcmar@redhat.com>,
"KVM list" <kvm@vger.kernel.org>,
"Pavel Tatashin" <pavel.tatashin@microsoft.com>,
"KarimAllah Ahmed" <karahmed@amazon.de>,
"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
"Dave Hansen" <dave.hansen@linux.intel.com>,
"Alexander Duyck" <alexander.duyck@gmail.com>,
"Michal Hocko" <mhocko@kernel.org>,
"Paul Mackerras" <paulus@ozlabs.org>,
"Linux MM" <linux-mm@kvack.org>,
"Paul Mackerras" <paulus@samba.org>,
"Michael Ellerman" <mpe@ellerman.id.au>,
"H. Peter Anvin" <hpa@zytor.com>,
"Wanpeng Li" <wanpengli@tencent.com>,
"Pavel Tatashin" <pasha.tatashin@soleen.com>,
devel@driverdev.osuosl.org,
"Stefano Stabellini" <sstabellini@kernel.org>,
"Stephen Hemminger" <sthemmin@microsoft.com>,
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
"Joerg Roedel" <joro@8bytes.org>, "X86 ML" <x86@kernel.org>,
YueHaibing <yuehaibing@huawei.com>,
"Mike Rapoport" <rppt@linux.ibm.com>,
"Madhumitha Prabakaran" <madhumithabiw@gmail.com>,
"Peter Zijlstra" <peterz@infradead.org>,
"Ingo Molnar" <mingo@redhat.com>,
"Vlastimil Babka" <vbabka@suse.cz>,
"Nishka Dasgupta" <nishkadg.linux@gmail.com>,
"Anthony Yznaga" <anthony.yznaga@oracle.com>,
"Oscar Salvador" <osalvador@suse.de>,
"Dan Carpenter" <dan.carpenter@oracle.com>,
"Isaac J. Manjarres" <isaacm@codeaurora.org>,
"Kees Cook" <keescook@chromium.org>,
"Anshuman Khandual" <anshuman.khandual@arm.com>,
"Haiyang Zhang" <haiyangz@microsoft.com>,
"Simon Sandström" <simon@nikanor.nu>,
"Sasha Levin" <sashal@kernel.org>,
"Juergen Gross" <jgross@suse.com>,
kvm-ppc@vger.kernel.org, "Qian Cai" <cai@lca.pw>,
"Alex Williamson" <alex.williamson@redhat.com>,
"Mike Rapoport" <rppt@linux.vnet.ibm.com>,
"Borislav Petkov" <bp@alien8.de>,
"Nicholas Piggin" <npiggin@gmail.com>,
"Andy Lutomirski" <luto@kernel.org>,
xen-devel <xen-devel@lists.xenproject.org>,
"Boris Ostrovsky" <boris.ostrovsky@oracle.com>,
"Todd Poynor" <toddpoynor@google.com>,
"Vitaly Kuznetsov" <vkuznets@redhat.com>,
"Allison Randal" <allison@lohutok.net>,
"Jim Mattson" <jmattson@google.com>,
"Christophe Leroy" <christophe.leroy@c-s.fr>,
"Vandana BN" <bnvandana@gmail.com>,
"Mel Gorman" <mgorman@techsingularity.net>,
"Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
"Cornelia Huck" <cohuck@redhat.com>,
"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
"Sean Christopherson" <sean.j.christopherson@intel.com>,
"Rob Springer" <rspringer@google.com>,
"Thomas Gleixner" <tglx@linutronix.de>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Subject: Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
Date: Wed, 23 Oct 2019 19:27:57 +0200 [thread overview]
Message-ID: <55640861-bbcb-95f0-766a-95bc961f1b0e@redhat.com> (raw)
In-Reply-To: <CAPcyv4hHTqWWWREX2AtpEpfLHdDHvT-Kp_2YBW1As0y2Mx+6Dg@mail.gmail.com>
>> I dislike this for three reasons
>>
>> a) It does not protect against any races, really, it does not improve things.
>> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>> don't hold the memory hotplug lock, memory can get offlined and remove any time. Racy.
>
> True, we need to solve that problem too. That seems to want something
> lighter weight than the hotplug lock that can be held over pfn lookups
> + use rather than requiring a page lookup in paths where it's not
> clear that a page reference would prevent unplug.
>
>> c) We mix in ZONE specific stuff into the core. It should be "just another zone"
>
> Not sure I grok this when the RFC is sprinkling zone-specific
> is_zone_device_page() throughout the core?
Most users should not care about the zone. pfn_active() would be enough
in most situations, especially most PFN walkers - "this memmap is valid
and e.g., contains a valid zone ...".
>
>>
>> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
>
> Sorry I missed this earlier...
>
>>
>> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
>> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
>> 3. Introduce pfn_active() that checks against the subsection bitmap
>> 4. Once the memmap was initialized / prepared, set the subsection active
>> (similar to SECTION_IS_ONLINE in the buddy right now)
>> 5. Before the memmap gets invalidated, set the subsection inactive
>> (similar to SECTION_IS_ONLINE in the buddy right now)
>> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
>> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
>
> This does not seem to reduce any complexity because it still requires
> a pfn to zone lookup at the end of the process.
>
> I.e. converting pfn_to_online_page() to use a new pfn_active()
> subsection map plus looking up the zone from pfn_to_page() is more
> steps than just doing a direct pfn to zone lookup. What am I missing?
That a real "pfn to zone" lookup without going via the struct page will
require to have more than just a single bitmap. IMHO, keeping the
information at a single place (memmap) is the clean thing to do (not
replicating it somewhere else). Going via the memmap might not be as
fast as a direct lookup, but do we actually care? We are already looking
at "random PFNs we are not sure if there is a valid memmap".
>>
>> Especially, driver-reserved device memory will not get set active in
>> the subsection bitmap.
>>
>> Now to the race. Taking the memory hotplug lock at random places is ugly. I do
>> wonder if we can use RCU:
>
> Ah, yes, exactly what I was thinking above.
>
>>
>> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
>>
>> /* the memmap is guaranteed to remain active under RCU */
>> rcu_read_lock();
>> if (pfn_active(random_pfn)) {
>> page = pfn_to_page(random_pfn);
>> ... use the page, stays valid
>> }
>> rcu_unread_lock();
>>
>> Memory offlining/memremap code:
>>
>> set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
>> synchronize_rcu();
>> /* all users saw the bitmap update, we can invalide the memmap */
>> remove_pfn_range_from_zone(zone, pfn, nr_pages);
>
> Looks good to me.
>
>>
>>>
>>>>
>>>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
>>>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
>>>> on x86-64 and PPC.
>>>
>>> I'll give it a spin, but I don't think the kernel wants to grow more
>>> is_zone_device_page() users.
>>
>> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
>> The other parts can rely on pfn_to_online_page() only.
>>
>> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
>> - Basically never used with ZONE_DEVICE.
>> - We hold a reference!
>> - All it protects is a SetPageDirty(page);
>>
>> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
>> - Same as 1.
>>
>> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
>> - We come via virt_to_head_page() / virt_to_head_page(), not sure about
>> references (I assume this should be fine as we don't come via random
>> PFNs)
>> - We check that we don't mix Reserved (including device memory) and CMA
>> pages when crossing compound pages.
>>
>> I think we can drop 1. and 2., resulting in a total of 2 new users in
>> the same context. I think that is totally tolerable to finally clean
>> this up.
>
> ...but more is_zone_device_page() doesn't "finally clean this up".
> Like we discussed above it's the missing locking that's the real
> cleanup, the pfn_to_online_page() internals are secondary.
It's a different cleanup IMHO. We can't do everything in one shot. But
maybe I can drop the is_zone_device_page() parts from this patch and
completely rely on pfn_to_online_page(). Yes, that needs fixing to, but
it's a different story.
The important part of this patch:
While pfn_to_online_page() will always exclude ZONE_DEVICE pages,
checking PG_reserved on ZONE_DEVICE pages (what we do right now!) is
racy as hell (especially when concurrently initializing the memmap).
This does improve the situation.
>>
>> However, I think we also have to clarify if we need the change in 3 at all.
>> It comes from
>>
>> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
>> Author: Kees Cook <keescook@chromium.org>
>> Date: Tue Jun 7 11:05:33 2016 -0700
>>
>> mm: Hardened usercopy
>>
>> This is the start of porting PAX_USERCOPY into the mainline kernel. This
>> is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>> work is based on code by PaX Team and Brad Spengler, and an earlier port
>> from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
>> [...]
>> - otherwise, object must not span page allocations (excepting Reserved
>> and CMA ranges)
>>
>> Not sure if we really have to care about ZONE_DEVICE at this point.
>
> That check needs to be careful to ignore ZONE_DEVICE pages. There's
> nothing wrong with a copy spanning ZONE_DEVICE and typical pages.
Please note that the current check would *forbid* this (AFAIKs for a
single heap object). As discussed in the relevant patch, we might be
able to just stop doing that and limit it to real PG_reserved pages
(without ZONE_DEVICE). I'd be happy to not introduce new
is_zone_device_page() users.
--
Thanks,
David / dhildenb
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
next prev parent reply other threads:[~2019-10-23 17:28 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-10-22 17:12 [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 01/12] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes David Hildenbrand
2019-10-24 3:53 ` Anshuman Khandual
2019-10-24 7:55 ` David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes David Hildenbrand
2019-10-23 8:20 ` David Hildenbrand
2019-10-23 16:25 ` Kees Cook
2019-10-23 16:32 ` David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 03/12] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 04/12] KVM: Prepare kvm_is_reserved_pfn() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 05/12] vfio/type1: Prepare is_invalid_reserved_pfn() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 06/12] staging/gasket: Prepare gasket_release_page() " David Hildenbrand
2019-10-23 8:17 ` David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() " David Hildenbrand
2019-10-22 17:55 ` Matt Sickler
2019-10-22 21:01 ` David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 08/12] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 09/12] powerpc/64s: Prepare hash_page_do_lazy_icache() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 10/12] powerpc/mm: Prepare maybe_pte_to_page() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 11/12] x86/mm: Prepare __ioremap_check_ram() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 12/12] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap David Hildenbrand
2019-10-22 21:54 ` [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) Dan Williams
2019-10-23 7:26 ` David Hildenbrand
2019-10-23 17:09 ` Dan Williams
2019-10-23 17:27 ` David Hildenbrand [this message]
2019-10-23 19:39 ` Dan Williams
2019-10-23 21:22 ` David Hildenbrand
2019-10-24 12:50 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=55640861-bbcb-95f0-766a-95bc961f1b0e@redhat.com \
--to=david@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=alex.williamson@redhat.com \
--cc=alexander.duyck@gmail.com \
--cc=allison@lohutok.net \
--cc=aneesh.kumar@linux.ibm.com \
--cc=anshuman.khandual@arm.com \
--cc=anthony.yznaga@oracle.com \
--cc=benh@kernel.crashing.org \
--cc=bnvandana@gmail.com \
--cc=boris.ostrovsky@oracle.com \
--cc=bp@alien8.de \
--cc=cai@lca.pw \
--cc=christophe.leroy@c-s.fr \
--cc=cohuck@redhat.com \
--cc=dan.carpenter@oracle.com \
--cc=dan.j.williams@intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=devel@driverdev.osuosl.org \
--cc=gregkh@linuxfoundation.org \
--cc=haiyangz@microsoft.com \
--cc=hannes@cmpxchg.org \
--cc=hpa@zytor.com \
--cc=isaacm@codeaurora.org \
--cc=jgross@suse.com \
--cc=jmattson@google.com \
--cc=joro@8bytes.org \
--cc=karahmed@amazon.de \
--cc=keescook@chromium.org \
--cc=kstewart@linuxfoundation.org \
--cc=kvm-ppc@vger.kernel.org \
--cc=kvm@vger.kernel.org \
--cc=linux-hyperv@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=luto@kernel.org \
--cc=madhumithabiw@gmail.com \
--cc=mgorman@techsingularity.net \
--cc=mhocko@kernel.org \
--cc=mhocko@suse.com \
--cc=mingo@redhat.com \
--cc=mpe@ellerman.id.au \
--cc=nishkadg.linux@gmail.com \
--cc=npiggin@gmail.com \
--cc=osalvador@suse.de \
--cc=pasha.tatashin@soleen.com \
--cc=paulus@ozlabs.org \
--cc=paulus@samba.org \
--cc=pavel.tatashin@microsoft.com \
--cc=pbonzini@redhat.com \
--cc=peterz@infradead.org \
--cc=rkrcmar@redhat.com \
--cc=rppt@linux.ibm.com \
--cc=rppt@linux.vnet.ibm.com \
--cc=rspringer@google.com \
--cc=sashal@kernel.org \
--cc=sean.j.christopherson@intel.com \
--cc=simon@nikanor.nu \
--cc=sstabellini@kernel.org \
--cc=sthemmin@microsoft.com \
--cc=tglx@linutronix.de \
--cc=toddpoynor@google.com \
--cc=vbabka@suse.cz \
--cc=vkuznets@redhat.com \
--cc=wanpengli@tencent.com \
--cc=x86@kernel.org \
--cc=xen-devel@lists.xenproject.org \
--cc=yuehaibing@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).