Subject: Re: + mm-introduce-reported-pages.patch added to -mm tree
From: Alexander Duyck
To: David Hildenbrand, Michal Hocko
Cc: akpm@linux-foundation.org, aarcange@redhat.com, dan.j.williams@intel.com,
    dave.hansen@intel.com, konrad.wilk@oracle.com, lcapitulino@redhat.com,
    mgorman@techsingularity.net, mm-commits@vger.kernel.org, mst@redhat.com,
    osalvador@suse.de, pagupta@redhat.com, pbonzini@redhat.com,
    riel@surriel.com, vbabka@suse.cz, wei.w.wang@intel.com,
    willy@infradead.org, yang.zhang.wz@gmail.com, linux-mm@kvack.org
Date: Tue, 12 Nov 2019 14:19:16 -0800

On Tue, 2019-11-12 at 22:05 +0100, David Hildenbrand wrote:
> On 12.11.19 19:34, Alexander Duyck wrote:
> > On Tue, 2019-11-12 at 14:04 +0100, David Hildenbrand wrote:
> > > > > > fact is it is still invasive, just to different parts of the mm subsystem.
> > > > > I'd love to see how it uses the page isolation framework, and only has a
> > > > > single hook to queue pages. I don't like the way pages are pulled out of
> > > > > the buddy in Nitesh's approach currently. What you have is cleaner.
> > > >
> > > > I don't see how you could use the page isolation framework to pull out
> > > > free pages. Is there a thread somewhere on the topic that I missed?
> > >
> > > It's basically only isolating pages while reporting them, and not
> > > pulling them out of the buddy (IOW, you move the pages to the isolate
> > > queues where nobody is allowed to touch them, and set the
> > > migratetype properly). This e.g. makes other users of page isolation
> > > (e.g. memory offlining, alloc_contig_range()) play nicely with these
> > > isolated pages: "somebody else just isolated them, please try again."
> >
> > How so? If I understand correctly there isn't anything that prevents you
> > from isolating an already isolated page, is there? Last I knew isolated
>
> mm/page_isolation.c:set_migratetype_isolate()
> ...
> if (is_migrate_isolate_page(page))
>         goto out;
> ...
> -> Currently -EBUSY
>
> > pages are still considered "movable" since they are still buddy pages,
> > aren't they?
>
> They are neither movable nor unmovable AFAIK. They are temporarily
> blocked. E.g., memory offlining currently returns -EBUSY if it cannot
> isolate the page range. alloc_contig_range() does the same. Imagine
> somebody allocating a gigantic page. You certainly cannot move the pages
> that are isolated while allocating the page. But you can signal to the
> caller to try again later.
>
> > Also this seems like it would have other implications, since isolating a
> > page kicks off the memory notifier, so as a result a balloon driver would
> > then free the pages back out so that they could be isolated with the
> > assumption the region is going offline.
>
> Memory notifier? Balloon pages getting freed? No.
>
> The memory notifier is used for onlining/offlining; it is not involved here.

Sorry, I misread the comment in set_migratetype_isolate().

> I think what you mean is the "isolate notifier", which is only used by
> CMM on PPC.
>
> See https://lkml.org/lkml/2019/10/31/487, where I rip that notifier out.

Okay.

> > > start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated()
> > > along with a lockless check if the page is free.
> >
> > Okay, that part I think I get. However, doesn't all that logic more or less
> > ignore the watermarks? It seems like you could cause an OOM if you don't
> > have the necessary checks in place for that.
>
> Any approach that temporarily blocks some free pages from getting
> allocated will essentially have this issue, no? I think one main design
> point to minimize false OOMs was to limit the number of pages we report
> at a time. Or what do you propose here in addition to that?

If you take a look at __isolate_free_page(), it performs a check to see if
pulling the page would place us below the minimum watermark. Odds are you
should probably look at somehow incorporating that into the solution before
you pull the page. I have updated my approach to check for the low watermark
with the full capacity of MAX_ORDER - 1 pages before I start reporting, and
then I am using __isolate_free_page(), which will check the minimum watermark
to make sure I don't cross that.

> > > I think it should be something like this (ignoring different
> > > migratetypes and such for now)
> > >
> > > 1. Test lockless if page is free: Not free? Done.
> >
> > So this should help to reduce the likelihood of races in the steps below.
> > However it might also be useful if the code had some other check to see if
> > it was done other than just making a pass through the bitmap.
>
> Yes.
>
> > One thing I had brought up with Nitesh was the idea of maybe doing some
> > sort of RCU bitmap type approach.
> > Basically, while we hold the zone lock we could swap out the old bitmap
> > for a new one. We could probably even keep a counter at the start of the
> > structure so that we could track how many bits are actually set there.
> > Then it becomes less likely to have a race where you free a page and set
> > the bit, and the hinting thread tests and clears the bit but doesn't see
> > the freed page since it is not synchronized. Otherwise your notification
> > setup and reporting thread may need a few smp barriers added where
> > necessary.
>
> Yes, swapping out the bitmap via RCU is also a way to make memory
> hotplug work.
>
> I was also thinking about a different bitmap approach. Store a bitmap for
> each section. Use a meta bitmap with a bit for each section that
> contains pages to report. Sparse zones and memory hot(un)plug would not
> be a real issue anymore.

I had thought about that too. The only problem is that the section has to be
a power-of-2 size, and I don't know if we want to be increasing the size by
100% in the base case, although I guess there is an 8-byte pad on the
structure if page extensions are enabled.

> One could go one step further and only have a bitmap with a bit for each
> section. Only remember that some (large) page was not reported in that
> section (e.g., after buddy merging). In the reporting thread, report all
> free pages within that section. You could end up reporting the same page
> a couple of times, but the question would be if this is relevant at all.
> One would have to prototype and measure that.
>
> Long story short, I am not 100% a fan of the current "bitmap per zone"
> approach, but it is fairly simple to start with :)

Agreed, although I worry that a bitmap per section may be even more complex.

> > > 2. start_isolate_page_range(): Busy? Rare race (with other isolate users
> >
> > Doesn't this have the side effect of draining all the percpu caches in
> > order to make certain to flush the pages we isolated from there?
>
> While alloc_contig_range() e.g. calls lru_add_drain_all(), I don't
> think isolation will. Where did you spot something like this in
> mm/page_isolation.c?

At the end of set_migratetype_isolate(): the last thing it does is call
drain_all_pages().

> > > or with an allocation). Done.
> > > 3. test_pages_isolated()
> >
> > So I have reviewed the code and I don't see how this could conflict with
> > other callers isolating the pages. If anything it seems like if another
> > thread has already isolated the pages you would end up getting a false
> > positive, reporting the pages, and pulling them back out of isolation.
>
> Isolated pages cannot be isolated again. This is tracked via the migratetype.

Thanks, I see that now that you pointed it out above.

> > > 3a. no? Rare race, page not free anymore. undo_isolate_page_range()
> >
> > I would hope it is rare. However, for something like a max-order page I
> > could easily see a piece of it having been pulled out. I would think this
> > case would be exceedingly expensive, since you would have to put back any
> > pages you had previously moved into isolation.
>
> I guess it is rare; there is a tiny slot between checking if the page is
> free and isolating it. Would have to see that in action.

Yeah, it probably depends on the number of cores in play as well, since the
likelihood of a collision is probably pretty low.

> > > 3b. yes? Report, then undo_isolate_page_range()
> > >
> > > If we would run into performance issues with the current page isolation
> > > implementation (esp. locking), I think there are some nice
> > > cleanups/reworks possible of which all current users could benefit
> > > (especially across pageblocks).
> >
> > To me this feels a lot like what you had for this solution near the start.
> > Only now instead of placing the pages into an array you are tracking a
> > bitmap and then using that bitmap to populate the MIGRATE_ISOLATE lists.
>
> Now we have a clean MM interface to do that :) And yes, which data
> structure we're using becomes irrelevant.
>
> > This sounds far more complex to me than it probably needs to be, since
> > just holding the pages with the buddy type cleared should be enough to
> > make them temporarily unusable for other threads, and even in your case
> > you are
>
> If you have a page that is not PageBuddy() and not movable within
> ZONE_MOVABLE, has_unmovable_pages() will WARN_ON_ONCE(zone_idx(zone) ==
> ZONE_MOVABLE). This can be triggered via memory offlining, when
> isolating the page range.
>
> If your approach does exactly that (clear PageBuddy() on a
> ZONE_MOVABLE), it would be a bug. The only safe way is to have the
> pageblock(s) isolated.

From what I can tell, it looks like if the page is in ZONE_MOVABLE the buddy
flag doesn't even matter, since the only thing checked is PageReserved. There
is a check early on in the main loop that will "continue" if zone_idx(zone)
== ZONE_MOVABLE. The refcount is 0, so that will cause us to "continue" and
not be counted as an unmovable page. The downside is that the scan cannot
take advantage of the "PageBuddy" value to skip over us, so it just has to
skip over the section one page at a time. The advantage here is that we can
still offline a region that contains pages that are being reported. I would
think that it would fail if the pages in the region are isolated, since as
you pointed out you get -EBUSY when you attempt to isolate a page that is
already isolated, and as such removal will fail, won't it?

> > still having to use the scatterlist in order to hold the pages and track
> > what you will need to undo the isolation later.
>
> I think it is very neat and not complex at all. Page isolation is a nice
> feature we have in the kernel. :) It deserves some cleanups, though.

We can agree to disagree. At this point you are talking about adding bits for
sections and pages, and in the meantime I am working with zones and pages.
I believe finding free space in the section may be much more tricky than
finding it in the zone or page has been. Now that I am rid of the list
manipulators, my approach may soon surpass the bitmap one in terms of being
less intrusive/complex. :-)