Subject: Re: + mm-introduce-reported-pages.patch added to -mm tree
To: Alexander Duyck, Michal Hocko
Cc: akpm@linux-foundation.org, aarcange@redhat.com, dan.j.williams@intel.com,
 dave.hansen@intel.com, konrad.wilk@oracle.com, lcapitulino@redhat.com,
 mgorman@techsingularity.net, mm-commits@vger.kernel.org, mst@redhat.com,
 osalvador@suse.de, pagupta@redhat.com, pbonzini@redhat.com, riel@surriel.com,
 vbabka@suse.cz, wei.w.wang@intel.com, willy@infradead.org,
 yang.zhang.wz@gmail.com, linux-mm@kvack.org
References: <20191106121605.GH8314@dhcp22.suse.cz> <20191106165416.GO8314@dhcp22.suse.cz>
 <4cf64ff9-b099-d50a-5c08-9a8f3a2f52bf@redhat.com>
 <131f72aa-c4e6-572d-f616-624316b62842@redhat.com>
 <1d881e86ed58511b20883fd0031623fe6cade480.camel@linux.intel.com>
 <8a407188-5dd2-648b-fc26-f03a826bfee3@redhat.com>
 <4be6114f57934eb1478f84fd1358a7fcc547b248.camel@linux.intel.com>
From: David Hildenbrand
Organization: Red Hat GmbH
Date: Wed, 13 Nov 2019 00:10:21 +0100

>>>> start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated()
>>>> along with a lockless check if the page is free.
>>>
>>> Okay, that part I think I get. However doesn't all that logic more or less
>>> ignore the watermarks? It seems like you could cause an OOM if you don't
>>> have the necessary checks in place for that.
>>
>> Any approach that temporarily blocks some free pages from getting
>> allocated will essentially have this issue, no? I think one main design
>> point to minimize false OOMs was to limit the number of pages we report
>> at a time. Or what do you propose here in addition to that?
>
> If you take a look at __isolate_free_page it was performing a check to see
> if pulling the page would place us below the minimum watermark for pages.
> Odds are you should probably look at somehow incorporating that into the
> solution before you pull the page. I have updated my approach to check for

Ah, now I see what you mean. Makes sense!
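The watermark logic being discussed can be sketched in plain C. This is only an illustrative toy, not kernel code: `low_wmark`, `min_wmark`, and `report_capacity` are stand-ins for the zone watermarks and the MAX_ORDER - 1 reporting capacity, not the real `struct zone` fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch of the two checks discussed in the thread:
 * an up-front low-watermark check with the full reporting capacity,
 * and a per-page minimum-watermark check (as __isolate_free_page()
 * performs) before each page is pulled. All names are stand-ins. */
static bool can_start_reporting(long free_pages, long low_wmark,
                                long report_capacity)
{
    /* Would reporting a full batch drop us below the low watermark? */
    return free_pages - report_capacity >= low_wmark;
}

static bool can_isolate_page(long free_pages, long min_wmark,
                             long page_count)
{
    /* Would pulling these pages drop us below the minimum watermark? */
    return free_pages - page_count >= min_wmark;
}
```

The point is simply that the batch-level check bounds how much can be held back at once, while the per-page check prevents any single isolation from crossing the minimum watermark.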
> the low watermark with the full capacity of MAX_ORDER - 1 pages before I
> start reporting, and then I am using __isolate_free_page which will check
> the minimum watermark to make sure I don't cross that.

Yeah, you probably want to check the watermark before doing any
reporting - I assume.

>
>>>> I think it should be something like this (ignoring different
>>>> migratetypes and such for now)
>>>>
>>>> 1. Test lockless if page is free: Not free? Done.
>>>
>>> So this should help to reduce the likelihood of races in the steps below.
>>> However it might also be useful if the code had some other check to see if
>>> it was done other than just making a pass through the bitmap.
>>
>> Yes.
>>
>>> One thing I had brought up with Nitesh was the idea of maybe doing some
>>> sort of RCU bitmap type approach. Basically while we hold the zone lock we
>>> could swap out the old bitmap for a new one. We could probably even keep a
>>> counter at the start of the structure so that we could track how many bits
>>> are actually set there. Then it becomes less likely of having a race where
>>> you free a page and set the bit and the hinting thread tests and clears
>>> the bit but doesn't see the freed page since it is not synchronized.
>>> Otherwise your notification setup and reporting thread may need a few smp
>>> barriers added where necessary.
>>
>> Yes, swapping out the bitmap via RCU is also a way to make memory
>> hotplug work.
>>
>> I was also thinking about a different bitmap approach. Store for each
>> section a bitmap. Use a meta bitmap with a bit for each section that
>> contains pages to report. Sparse zones and memory hot(un)plug would not
>> be a real issue anymore.
>
> I had thought about that too. The only problem is that the section has to
> be power of 2 sized and I don't know if we want to be increasing the size

... are there sections that are not a power of 2? x86_64: 128MB, s390x:
256MB, ...
It does not really make sense to have sections that are not a power of
two, thinking about page tables ... I would really be interested where
something like that is possible.

> by 100% in the base case, although I guess there is an 8 byte pad on the
> structure if page extensions are enabled.
>
>> One could go one step further and only have a bitmap with a bit for each
>> section. Only remember that some (large) page was not reported in that
>> section (e.g., after buddy merging). In the reporting thread, report all
>> free pages within that section. You could end up reporting the same page
>> a couple of times, but the question would be if this is relevant at all.
>> One would have to prototype and measure that.
>>
>> Long story short, I am not 100% a fan of the current "bitmap per zone"
>> approach but it is fairly simple to start with :)
>
> Agreed. Although I worry that a bitmap per section may be even more
> complex.

Slightly, yes.

>
>>>> 2. start_isolate_page_range(): Busy? Rare race (with other isolate users
>>>
>>> Doesn't this have the side effect of draining all the percpu caches in
>>> order to make certain to flush the pages we isolated from there?
>>
>> While alloc_contig_range() e.g., calls lru_add_drain_all(), I don't
>> think isolation will. Where did you spot something like this in
>> mm/page_isolation.c?
>
> On the end of set_migratetype_isolate(). The last thing it does is call
> drain_all_pages.

Ahh, missed that, thanks. Yeah, one could probably make that
configurable, because for that use case, where we already expect a free
page, we don't need that.

>
>>>> or with an allocation). Done.
>>>> 3. test_pages_isolated()
>>>
>>> So I have reviewed the code and I don't see how this could conflict with
>>> other callers isolating the pages. If anything it seems like if another
>>> thread has already isolated the pages you would end up getting a false
>>> positive, reporting the pages, and pulling them back out of isolation.
>>
>> Isolated pages cannot be isolated. This is tracked via the migratetype.
>
> Thanks. I see that now that you pointed it out up above.
>
>>>> 3a. no? Rare race, page not free anymore. undo_isolate_page_range()
>>>
>>> I would hope it is rare. However for something like a max order page I
>>> could easily see a piece of it having been pulled out. I would think this
>>> case would be exceedingly expensive since you would have to put back any
>>> pages you had previously moved into isolation.
>>
>> I guess it is rare, there is a tiny slot between checking if the page is
>> free and isolating it. Would have to see that in action.
>
> Yeah, probably depends on the number of cores in play as well since the
> likelihood of a collision is probably pretty low.
>
>>>> 3b. yes? Report, then undo_isolate_page_range()
>>>>
>>>> If we would run into performance issues with the current page isolation
>>>> implementation (esp. locking), I think there are some nice
>>>> cleanups/reworks possible of which all current users could benefit
>>>> (especially across pageblocks).
>>>
>>> To me this feels a lot like what you had for this solution near the start.
>>> Only now instead of placing the pages into an array you are tracking a
>>> bitmap and then using that bitmap to populate the MIGRATE_ISOLATE lists.
>>
>> Now we have a clean MM interface to do that :) And yes, which data
>> structure we're using becomes irrelevant.
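The 1/2/3a/3b flow sketched earlier in the thread can be modeled as a toy state machine in userspace C. To be clear, this is a single-threaded illustration under assumed semantics: the `toy_*` names are invented stand-ins for the kernel's migratetype-based isolation (start_isolate_page_range()/test_pages_isolated()/undo_isolate_page_range()), not the real API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposed reporting flow; states and helpers are
 * illustrative stand-ins, not kernel code. */
enum toy_state { TOY_FREE, TOY_ALLOCATED, TOY_ISOLATED };

struct toy_page {
    enum toy_state state;
    bool reported;
};

static bool toy_try_report(struct toy_page *p)
{
    /* 1. Lockless check: not free (allocated, or already isolated by
     *    another user, mirroring the -EBUSY case)? Done. */
    if (p->state != TOY_FREE)
        return false;

    /* 2. start_isolate_page_range(): take the page out of circulation. */
    p->state = TOY_ISOLATED;

    /* 3. test_pages_isolated() succeeded (3b): report the page, then
     *    undo_isolate_page_range(). The 3a race (page grabbed between
     *    steps 1 and 2) cannot occur in this single-threaded toy. */
    p->reported = true;
    p->state = TOY_FREE;
    return true;
}
```

The useful property the model shows is that a page already isolated by another user is simply skipped, rather than double-isolated.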
>>
>>> This sounds far more complex to me than it probably needs to be since just
>>> holding the pages with the buddy type cleared should be enough to make
>>> them temporarily unusable for other threads, and even in your case you are
>>
>> If you have a page that is not PageBuddy() and not movable within
>> ZONE_MOVABLE, has_unmovable_pages() will WARN_ON_ONCE(zone_idx(zone) ==
>> ZONE_MOVABLE). This can be triggered via memory offlining, when
>> isolating the page range.
>>
>> If your approach does exactly that (clear PageBuddy() on a
>> ZONE_MOVABLE), it would be a bug. The only safe way is to have the
>> pageblock(s) isolated.
>
> From what I can tell it looks like if the page is in ZONE_MOVABLE the
> buddy flag doesn't even matter since the only thing checked is
> PageReserved. There is a check early on in the main loop that will
> "continue" if zone_idx(zone) == ZONE_MOVABLE.

Very right, missed that :) reserved pages are a different story.

>
> The refcount is 0 so that will cause us to "continue" and not be counted
> as an unmovable page. The downside is the scan cannot take advantage of
> the "PageBuddy" value to skip over us so it just has to skip over the
> section one page at a time.
>
> The advantage here is that we can still offline a region that contains
> pages that are being reported. I would think that it would fail if the

Yes, you can isolate + offline, while the isolation approach would
require to actually try again (right now manually).

> pages in the region are isolated since as you pointed out you get an EBUSY
> when you attempt to isolate a page that is already isolated and as such
> removal will fail, won't it?

Right now, yes.

(we should rework that code either way to return -EAGAIN in that case
and let memory offlining try again automatically. But we have to rework
the -EAGAIN vs. -EBUSY handling in memory offlining code at one point
either way, I discussed that partially with Michal recently.
There is a lot of cleaning up to do.)

>
>>> still having to use the scatterlist in order to hold the pages and track
>>> what you will need to undo the isolation later.
>>
>> I think it is very neat and not complex at all. Page isolation is a nice
>> feature we have in the kernel. :) It deserves some cleanups, though.
>
> We can agree to disagree. At this point you are talking about adding bits
> for sections and pages, and in the meantime I am working with zones and
> pages. I believe finding free space in the section may be much more tricky
> than finding it in the zone or page has been. Now that I am rid of the
> list manipulators my approach may soon surpass the bitmap one in terms of
> being less intrusive/complex.. :-)

I am definitely interested to see that approach :) Good to see that the
whole discussion in this big thread turned out to be productive.

-- 
Thanks,

David / dhildenb