driverdev-devel.linuxdriverproject.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: "Kate Stewart" <kstewart@linuxfoundation.org>,
	linux-hyperv@vger.kernel.org, "Michal Hocko" <mhocko@suse.com>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	"KVM list" <kvm@vger.kernel.org>,
	"Pavel Tatashin" <pavel.tatashin@microsoft.com>,
	"KarimAllah Ahmed" <karahmed@amazon.de>,
	"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"Alexander Duyck" <alexander.duyck@gmail.com>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Paul Mackerras" <paulus@ozlabs.org>,
	"Linux MM" <linux-mm@kvack.org>,
	"Paul Mackerras" <paulus@samba.org>,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"Wanpeng Li" <wanpengli@tencent.com>,
	"Pavel Tatashin" <pasha.tatashin@soleen.com>,
	devel@driverdev.osuosl.org,
	"Stefano Stabellini" <sstabellini@kernel.org>,
	"Stephen Hemminger" <sthemmin@microsoft.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	"Joerg Roedel" <joro@8bytes.org>, "X86 ML" <x86@kernel.org>,
	YueHaibing <yuehaibing@huawei.com>,
	"Mike Rapoport" <rppt@linux.ibm.com>,
	"Madhumitha Prabakaran" <madhumithabiw@gmail.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Nishka Dasgupta" <nishkadg.linux@gmail.com>,
	"Anthony Yznaga" <anthony.yznaga@oracle.com>,
	"Oscar Salvador" <osalvador@suse.de>,
	"Dan Carpenter" <dan.carpenter@oracle.com>,
	"Isaac J. Manjarres" <isaacm@codeaurora.org>,
	"Kees Cook" <keescook@chromium.org>,
	"Anshuman Khandual" <anshuman.khandual@arm.com>,
	"Haiyang Zhang" <haiyangz@microsoft.com>,
	"Simon Sandström" <simon@nikanor.nu>,
	"Sasha Levin" <sashal@kernel.org>,
	"Juergen Gross" <jgross@suse.com>,
	kvm-ppc@vger.kernel.org, "Qian Cai" <cai@lca.pw>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Mike Rapoport" <rppt@linux.vnet.ibm.com>,
	"Borislav Petkov" <bp@alien8.de>,
	"Nicholas Piggin" <npiggin@gmail.com>,
	"Andy Lutomirski" <luto@kernel.org>,
	xen-devel <xen-devel@lists.xenproject.org>,
	"Boris Ostrovsky" <boris.ostrovsky@oracle.com>,
	"Todd Poynor" <toddpoynor@google.com>,
	"Vitaly Kuznetsov" <vkuznets@redhat.com>,
	"Allison Randal" <allison@lohutok.net>,
	"Jim Mattson" <jmattson@google.com>,
	"Christophe Leroy" <christophe.leroy@c-s.fr>,
	"Vandana BN" <bnvandana@gmail.com>,
	"Mel Gorman" <mgorman@techsingularity.net>,
	"Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
	"Cornelia Huck" <cohuck@redhat.com>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	"Sean Christopherson" <sean.j.christopherson@intel.com>,
	"Rob Springer" <rspringer@google.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Subject: Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
Date: Wed, 23 Oct 2019 19:27:57 +0200	[thread overview]
Message-ID: <55640861-bbcb-95f0-766a-95bc961f1b0e@redhat.com> (raw)
In-Reply-To: <CAPcyv4hHTqWWWREX2AtpEpfLHdDHvT-Kp_2YBW1As0y2Mx+6Dg@mail.gmail.com>

>> I dislike this for three reasons
>>
>> a) It does not protect against any races, really, it does not improve things.
>> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>>    don't hold the memory hotplug lock, memory can get offlined and remove any time. Racy.
> 
> True, we need to solve that problem too. That seems to want something
> lighter weight than the hotplug lock that can be held over pfn lookups
> +  use rather than requiring a page lookup in paths where it's not
> clear that a page reference would prevent unplug.
> 
>> c) We mix in ZONE specific stuff into the core. It should be "just another zone"
> 
> Not sure I grok this when the RFC is sprinkling zone-specific
> is_zone_device_page() throughout the core?

Most users should not care about the zone. pfn_active() would be enough
in most situations, especially most PFN walkers - "this memmap is valid
and e.g., contains a valid zone ...".

> 
>>
>> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
> 
> Sorry I missed this earlier...
> 
>>
>> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
>> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
>> 3. Introduce pfn_active() that checks against the subsection bitmap
>> 4. Once the memmap was initialized / prepared, set the subsection active
>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>> 5. Before the memmap gets invalidated, set the subsection inactive
>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
>> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
> 
> This does not seem to reduce any complexity because it still requires
> a pfn to zone lookup at the end of the process.
> 
> I.e. converting pfn_to_online_page() to use a new pfn_active()
> subsection map plus looking up the zone from pfn_to_page() is more
> steps than just doing a direct pfn to zone lookup. What am I missing?

That a real "pfn to zone" lookup without going via the struct page will
require to have more than just a single bitmap. IMHO, keeping the
information at a single place (memmap) is the clean thing to do (not
replicating it somewhere else). Going via the memmap might not be as
fast as a direct lookup, but do we actually care? We are already looking
at "random PFNs we are not sure if there is a valid memmap".

>>
>> Especially, driver-reserved device memory will not get set active in
>> the subsection bitmap.
>>
>> Now to the race. Taking the memory hotplug lock at random places is ugly. I do
>> wonder if we can use RCU:
> 
> Ah, yes, exactly what I was thinking above.
> 
>>
>> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
>>
>>         /* the memmap is guaranteed to remain active under RCU */
>>         rcu_read_lock();
>>         if (pfn_active(random_pfn)) {
>>                 page = pfn_to_page(random_pfn);
>>                 ... use the page, stays valid
>>         }
>>         rcu_unread_lock();
>>
>> Memory offlining/memremap code:
>>
>>         set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
>>         synchronize_rcu();
>>         /* all users saw the bitmap update, we can invalide the memmap */
>>         remove_pfn_range_from_zone(zone, pfn, nr_pages);
> 
> Looks good to me.
> 
>>
>>>
>>>>
>>>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
>>>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
>>>> on x86-64 and PPC.
>>>
>>> I'll give it a spin, but I don't think the kernel wants to grow more
>>> is_zone_device_page() users.
>>
>> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
>> The other parts can rely on pfn_to_online_page() only.
>>
>> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
>> - Basically never used with ZONE_DEVICE.
>> - We hold a reference!
>> - All it protects is a SetPageDirty(page);
>>
>> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
>> - Same as 1.
>>
>> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
>> - We come via virt_to_head_page() / virt_to_head_page(), not sure about
>>   references (I assume this should be fine as we don't come via random
>>   PFNs)
>> - We check that we don't mix Reserved (including device memory) and CMA
>>   pages when crossing compound pages.
>>
>> I think we can drop 1. and 2., resulting in a total of 2 new users in
>> the same context. I think that is totally tolerable to finally clean
>> this up.
> 
> ...but more is_zone_device_page() doesn't "finally clean this up".
> Like we discussed above it's the missing locking that's the real
> cleanup, the pfn_to_online_page() internals are secondary.

It's a different cleanup IMHO. We can't do everything in one shot. But
maybe I can drop the is_zone_device_page() parts from this patch and
completely rely on pfn_to_online_page(). Yes, that needs fixing to, but
it's a different story.

The important part of this patch:

While pfn_to_online_page() will always exclude ZONE_DEVICE pages,
checking PG_reserved on ZONE_DEVICE pages (what we do right now!) is
racy as hell (especially when concurrently initializing the memmap).

This does improve the situation.

>>
>> However, I think we also have to clarify if we need the change in 3 at all.
>> It comes from
>>
>> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
>> Author: Kees Cook <keescook@chromium.org>
>> Date:   Tue Jun 7 11:05:33 2016 -0700
>>
>>     mm: Hardened usercopy
>>
>>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
>>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>>     work is based on code by PaX Team and Brad Spengler, and an earlier port
>>     from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
>> [...]
>>     - otherwise, object must not span page allocations (excepting Reserved
>>       and CMA ranges)
>>
>> Not sure if we really have to care about ZONE_DEVICE at this point.
> 
> That check needs to be careful to ignore ZONE_DEVICE pages. There's
> nothing wrong with a copy spanning ZONE_DEVICE and typical pages.

Please note that the current check would *forbid* this (AFAIKs for a
single heap object). As discussed in the relevant patch, we might be
able to just stop doing that and limit it to real PG_reserved pages
(without ZONE_DEVICE). I'd be happy to not introduce new
is_zone_device_page() users.

-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

  reply	other threads:[~2019-10-23 17:28 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-22 17:12 [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 01/12] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes David Hildenbrand
2019-10-24  3:53   ` Anshuman Khandual
2019-10-24  7:55     ` David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes David Hildenbrand
2019-10-23  8:20   ` David Hildenbrand
2019-10-23 16:25     ` Kees Cook
2019-10-23 16:32       ` David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 03/12] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 04/12] KVM: Prepare kvm_is_reserved_pfn() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 05/12] vfio/type1: Prepare is_invalid_reserved_pfn() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 06/12] staging/gasket: Prepare gasket_release_page() " David Hildenbrand
2019-10-23  8:17   ` David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() " David Hildenbrand
2019-10-22 17:55   ` Matt Sickler
2019-10-22 21:01     ` David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 08/12] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 09/12] powerpc/64s: Prepare hash_page_do_lazy_icache() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 10/12] powerpc/mm: Prepare maybe_pte_to_page() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 11/12] x86/mm: Prepare __ioremap_check_ram() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 12/12] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap David Hildenbrand
2019-10-22 21:54 ` [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) Dan Williams
2019-10-23  7:26   ` David Hildenbrand
2019-10-23 17:09     ` Dan Williams
2019-10-23 17:27       ` David Hildenbrand [this message]
2019-10-23 19:39         ` Dan Williams
2019-10-23 21:22           ` David Hildenbrand
2019-10-24 12:50     ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=55640861-bbcb-95f0-766a-95bc961f1b0e@redhat.com \
    --to=david@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.williamson@redhat.com \
    --cc=alexander.duyck@gmail.com \
    --cc=allison@lohutok.net \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=anshuman.khandual@arm.com \
    --cc=anthony.yznaga@oracle.com \
    --cc=benh@kernel.crashing.org \
    --cc=bnvandana@gmail.com \
    --cc=boris.ostrovsky@oracle.com \
    --cc=bp@alien8.de \
    --cc=cai@lca.pw \
    --cc=christophe.leroy@c-s.fr \
    --cc=cohuck@redhat.com \
    --cc=dan.carpenter@oracle.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=devel@driverdev.osuosl.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=haiyangz@microsoft.com \
    --cc=hannes@cmpxchg.org \
    --cc=hpa@zytor.com \
    --cc=isaacm@codeaurora.org \
    --cc=jgross@suse.com \
    --cc=jmattson@google.com \
    --cc=joro@8bytes.org \
    --cc=karahmed@amazon.de \
    --cc=keescook@chromium.org \
    --cc=kstewart@linuxfoundation.org \
    --cc=kvm-ppc@vger.kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-hyperv@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=luto@kernel.org \
    --cc=madhumithabiw@gmail.com \
    --cc=mgorman@techsingularity.net \
    --cc=mhocko@kernel.org \
    --cc=mhocko@suse.com \
    --cc=mingo@redhat.com \
    --cc=mpe@ellerman.id.au \
    --cc=nishkadg.linux@gmail.com \
    --cc=npiggin@gmail.com \
    --cc=osalvador@suse.de \
    --cc=pasha.tatashin@soleen.com \
    --cc=paulus@ozlabs.org \
    --cc=paulus@samba.org \
    --cc=pavel.tatashin@microsoft.com \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rkrcmar@redhat.com \
    --cc=rppt@linux.ibm.com \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=rspringer@google.com \
    --cc=sashal@kernel.org \
    --cc=sean.j.christopherson@intel.com \
    --cc=simon@nikanor.nu \
    --cc=sstabellini@kernel.org \
    --cc=sthemmin@microsoft.com \
    --cc=tglx@linutronix.de \
    --cc=toddpoynor@google.com \
    --cc=vbabka@suse.cz \
    --cc=vkuznets@redhat.com \
    --cc=wanpengli@tencent.com \
    --cc=x86@kernel.org \
    --cc=xen-devel@lists.xenproject.org \
    --cc=yuehaibing@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).