Subject: Re: + mm-introduce-reported-pages.patch added to -mm tree
From: David Hildenbrand <david@redhat.com>
To: Alexander Duyck, Michal Hocko
Cc: akpm@linux-foundation.org, aarcange@redhat.com, dan.j.williams@intel.com,
 dave.hansen@intel.com, konrad.wilk@oracle.com, lcapitulino@redhat.com,
 mgorman@techsingularity.net, mm-commits@vger.kernel.org, mst@redhat.com,
 osalvador@suse.de, pagupta@redhat.com, pbonzini@redhat.com, riel@surriel.com,
 vbabka@suse.cz, wei.w.wang@intel.com, willy@infradead.org,
 yang.zhang.wz@gmail.com, linux-mm@kvack.org
References: <20191106121605.GH8314@dhcp22.suse.cz>
 <20191106165416.GO8314@dhcp22.suse.cz>
 <4cf64ff9-b099-d50a-5c08-9a8f3a2f52bf@redhat.com>
 <131f72aa-c4e6-572d-f616-624316b62842@redhat.com>
 <1d881e86ed58511b20883fd0031623fe6cade480.camel@linux.intel.com>
 <8a407188-5dd2-648b-fc26-f03a826bfee3@redhat.com>
 <4be6114f57934eb1478f84fd1358a7fcc547b248.camel@linux.intel.com>
Organization: Red Hat GmbH
Message-ID: <2177e6e7-a02a-a75f-3d60-93f1a18f96a5@redhat.com>
Date: Tue, 12 Nov 2019 23:17:17 +0100
On 12.11.19 22:05, David Hildenbrand wrote:
> On 12.11.19 19:34, Alexander Duyck wrote:
>> On Tue, 2019-11-12 at 14:04 +0100, David Hildenbrand wrote:
>>>>>> fact is it is still invasive, just to different parts of the mm subsystem.
>>>>>
>>>>> I'd love to see how it uses the page isolation framework, and only has a
>>>>> single hook to queue pages. I don't like the way pages are pulled out of
>>>>> the buddy in Nitesh's approach currently. What you have is cleaner.
>>>>
>>>> I don't see how you could use the page isolation framework to pull out
>>>> free pages. Is there a thread somewhere on the topic that I missed?
>>>
>>> It's basically only isolating pages while reporting them, and not
>>> pulling them out of the buddy (IOW, you move the pages to the isolate
>>> queues where nobody is allowed to touch them, and set the
>>> migratetype properly). This e.g. makes other users of page isolation
>>> (e.g., memory offlining, alloc_contig_range()) play nicely with these
>>> isolated pages: "somebody else just isolated them, please try again."
>>
>> How so? If I understand correctly there isn't anything that prevents you
>> from isolating an already isolated page, is there? Last I knew isolated
>
> mm/page_isolation.c:set_migratetype_isolate()
> ...
> if (is_migrate_isolate_page(page))
> 	goto out;
> ...
> -> Currently -EBUSY
>
>> pages are still considered "movable" since they are still buddy pages,
>> aren't they?
>
> They are neither movable nor unmovable AFAIK. They are temporarily
> blocked. E.g., memory offlining currently returns -EBUSY if it cannot
> isolate the page range. alloc_contig_range() does the same. Imagine
> somebody allocating a gigantic page: you certainly cannot move the pages
> that are isolated while allocating the page, but you can signal to the
> caller to try again later.
>
>> Also this seems like it would have other implications, since isolating a
>> page kicks off the memory notifier, so as a result a balloon driver would
>> then free the pages back out so that they could be isolated with the
>> assumption the region is going offline.
>
> Memory notifier? Balloon pages getting freed? No.
>
> The memory notifier is used for onlining/offlining; it is not involved here.
>
> I think what you mean is the "isolate notifier", which is only used by
> CMM on PPC.
>
> See https://lkml.org/lkml/2019/10/31/487, where I rip that notifier out.
>
>>
>>> start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated()
>>> along with a lockless check if the page is free.
>>
>> Okay, that part I think I get. However, doesn't all that logic more or less
>> ignore the watermarks? It seems like you could cause an OOM if you don't
>> have the necessary checks in place for that.
>
> Any approach that temporarily blocks some free pages from getting
> allocated will essentially have this issue, no? I think one main design
> point to minimize false OOMs was to limit the number of pages we report
> at a time. Or what do you propose here in addition to that?
>
>>
>>> I think it should be something like this (ignoring different
>>> migratetypes and such for now)
>>>
>>> 1. Test locklessly if the page is free: Not free? Done.
>>
>> So this should help to reduce the likelihood of races in the steps below.
>> However, it might also be useful if the code had some other check to see if
>> it was done, other than just making a pass through the bitmap.
>
> Yes.
>
>>
>> One thing I had brought up with Nitesh was the idea of maybe doing some
>> sort of RCU bitmap type approach. Basically while we hold the zone lock we
>> could swap out the old bitmap for a new one. We could probably even keep a
>> counter at the start of the structure so that we could track how many bits
>> are actually set there.
>> Then it becomes less likely to have a race where
>> you free a page and set the bit, and the hinting thread tests and clears
>> the bit but doesn't see the freed page since it is not synchronized.
>> Otherwise your notification setup and reporting thread may need a few smp
>> barriers added where necessary.
>
> Yes, swapping out the bitmap via RCU would also be a way to make memory
> hotplug work.
>
> I was also thinking about a different bitmap approach: store a bitmap for
> each section, and use a meta bitmap with a bit for each section that
> contains pages to report. Sparse zones and memory hot(un)plug would not
> be a real issue anymore.
>
> One could go one step further and only have a bitmap with a bit for each
> section: only remember that some (large) page was not reported in that
> section (e.g., after buddy merging). In the reporting thread, report all
> free pages within that section. You could end up reporting the same page
> a couple of times, but the question would be if this is relevant at all.
> One would have to prototype and measure that.
>
> Long story short, I am not 100% a fan of the current "bitmap per zone"
> approach, but it is fairly simple to start with :)
>
>>
>>> 2. start_isolate_page_range(): Busy? Rare race (with other isolation users
>>
>> Doesn't this have the side effect of draining all the percpu caches in
>> order to make certain to flush the pages we isolated from there?
>
> While alloc_contig_range() e.g. calls lru_add_drain_all(), I don't
> think isolation will. Where did you spot something like this in
> mm/page_isolation.c?
>
>>
>>> or with an allocation). Done.
>>> 3. test_pages_isolated()
>>
>> So I have reviewed the code and I don't see how this could conflict with
>> other callers isolating the pages. If anything, it seems like if another
>> thread has already isolated the pages you would end up getting a false
>> positive, reporting the pages, and pulling them back out of isolation.
>
> Isolated pages cannot be isolated again. This is tracked via the migratetype.
>
>>
>>> 3a. no? Rare race, page not free anymore. undo_isolate_page_range()
>>
>> I would hope it is rare. However, for something like a max order page I
>> could easily see a piece of it having been pulled out. I would think this
>> case would be exceedingly expensive, since you would have to put back any
>> pages you had previously moved into isolation.
>
> I guess it is rare; there is a tiny slot between checking if the page is
> free and isolating it. Would have to see that in action.
>
>>
>>> 3b. yes? Report, then undo_isolate_page_range()
>>>
>>> If we would run into performance issues with the current page isolation
>>> implementation (esp. locking), I think there are some nice
>>> cleanups/reworks possible of which all current users could benefit
>>> (especially across pageblocks).
>>
>> To me this feels a lot like what you had for this solution near the start.
>> Only now instead of placing the pages into an array you are tracking a
>> bitmap and then using that bitmap to populate the MIGRATE_ISOLATE lists.
>
> Now we have a clean MM interface to do that :) And yes, which data
> structure we're using becomes irrelevant.
>
>>
>> This sounds far more complex to me than it probably needs to be, since just
>> holding the pages with the buddy type cleared should be enough to make
>> them temporarily unusable for other threads, and even in your case you are
>
> If you have a page that is not PageBuddy() and not movable within
> ZONE_MOVABLE, has_unmovable_pages() will WARN_ON_ONCE(zone_idx(zone) ==
> ZONE_MOVABLE). This can be triggered via memory offlining, when
> isolating the page range.
>
> If your approach does exactly that (clears PageBuddy() on a
> ZONE_MOVABLE page), it would be a bug. The only safe way is to have the
> pageblock(s) isolated.

Minor correction: Only if your refcount is > 0.

-- 
Thanks,

David / dhildenb