Subject: Re: + mm-introduce-reported-pages.patch added to -mm tree
To: Alexander Duyck <alexander.h.duyck@linux.intel.com>,
 Michal Hocko <mhocko@kernel.org>
Cc: akpm@linux-foundation.org, aarcange@redhat.com, dan.j.williams@intel.com,
 dave.hansen@intel.com, konrad.wilk@oracle.com, lcapitulino@redhat.com,
 mgorman@techsingularity.net, mm-commits@vger.kernel.org, mst@redhat.com,
 osalvador@suse.de, pagupta@redhat.com, pbonzini@redhat.com,
 riel@surriel.com, vbabka@suse.cz, wei.w.wang@intel.com, willy@infradead.org,
 yang.zhang.wz@gmail.com, linux-mm@kvack.org
References: <20191106121605.GH8314@dhcp22.suse.cz>
 <CD4A882A-91A7-43F2-B31C-3FFD85289907@redhat.com>
 <dc1b2b3f4db8591303932351971f55688e0e240e.camel@linux.intel.com>
 <20191106165416.GO8314@dhcp22.suse.cz>
 <e90877ab21cdb7bb6a7a71a035f1b03e5544f384.camel@linux.intel.com>
 <f84f53b8-221e-02bd-2e7a-c0040ca03a38@redhat.com>
 <f1f84779123b7d3f0613d20f6bd9b05c806b39f7.camel@linux.intel.com>
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat GmbH
Message-ID: <4cf64ff9-b099-d50a-5c08-9a8f3a2f52bf@redhat.com>
Date: Thu, 7 Nov 2019 23:43:47 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.1.1
MIME-Version: 1.0
In-Reply-To: <f1f84779123b7d3f0613d20f6bd9b05c806b39f7.camel@linux.intel.com>
Content-Language: en-US

[...]
>>> The alternative approach doesn't touch the page allocator, however it
>>> still has essentially the same changes to __free_one_page. I suspect the
>>
>> Nitesh is working on Michal's suggestion to use page isolation instead
>> AFAIK - which avoids this.
>
> Okay. However it makes it much harder to discuss when we are comparing
> against code that isn't public. If the design is being redone do we have
> any ETA for when we will have something to actually compare to?

Maybe Nitesh got a little bit more careful with sending RFCs because he
was getting negative vibes due to the prototype quality. I might be
wrong and he really is only looking into some performance aspects.

>
>>> performance issue seen is mostly due to the fact that because it doesn't
>>> touch the page allocator it is taking the zone lock and probing the page
>>> for each set bit to see if the page is still free. As such the performance
>>> regression seen gets worse the lower the order used for reporting.
>>>
>>> Also I suspect Nitesh's patches are also in need of further review. I have
>>> provided feedback however my focus ended up being more on the kernel
>>> panics and 30% performance regression rather than debating architecture.
>>
>> Please don't take this personally, but I really dislike you talking about
>> Nitesh's RFCs (!) and pushing for your approach (although it was you that
>> was late to the party!) in that way. If there are problems then please
>> collaborate and fix instead of using the same wrong arguments over and
>> over again.
>
> Since Nitesh is in the middle of doing a full rewrite anyway I don't have
> much to compare against except for the previous set, which still needs
> fixes.  It is why I mentioned in the cover of the last patch set that I
> would prefer to not discuss it since I have no visibility into the patch
> set he is now working on.

Me too :) I'd love to see how Michal's idea with page isolation worked
out. But I can understand that Nitesh wants to explore some details first.

>
>> a) hotplug/sparse zones: I explained a couple of times why we can ignore
>> that. There was never a reply from you, yet you keep coming up with
>> that. I don't enjoy talking to a wall.
>
> This gets to the heart of how Nitesh's patch set works. It is assuming
> that every zone is linear, that there will be no overlap between zones,
> and that the zones don't really change. These are key architectural
> assumptions that should really be discussed instead of simply dismissed.

IMHO, using bitmaps for each zone is an implementation detail right now.
Maybe there is a better data structure for tracking this sparse data
(e.g., sparse bitmaps), or a better way to handle bitmaps. I think this
is a good start to get something relatively simple implemented (yeah,
there were some pitfalls in previous versions, maybe page isolation will
make that less error prone in an RFC).
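
To make the dense vs. sparse trade-off a bit more concrete, here is a toy
userspace sketch in Python. It is not kernel code and not taken from any of
the patch sets; the class names, CHUNK_BITS and the PFN ranges are all made
up for illustration. It only shows that a dense per-zone bitmap pays for the
whole zone span (holes included), while a lazily allocated one pays only for
the chunks that actually get a bit set.

```python
# Toy userspace model of the two tracking schemes; not kernel code.
# Class names, CHUNK_BITS and the PFN ranges are hypothetical.

class DenseZoneBitmap:
    """One bit per 2**order pages across the whole zone span, holes included."""

    def __init__(self, start_pfn, end_pfn, order):
        self.start_pfn = start_pfn
        self.order = order
        nbits = (end_pfn - start_pfn) >> order
        self.bits = bytearray((nbits + 7) // 8)

    def set(self, pfn):
        i = (pfn - self.start_pfn) >> self.order
        self.bits[i // 8] |= 1 << (i % 8)

    def test(self, pfn):
        i = (pfn - self.start_pfn) >> self.order
        return bool(self.bits[i // 8] & (1 << (i % 8)))

    def size_bytes(self):
        return len(self.bits)


class SparseZoneBitmap:
    """Same interface, but chunks are allocated only where a bit gets set."""

    CHUNK_BITS = 512

    def __init__(self, start_pfn, order):
        self.start_pfn = start_pfn
        self.order = order
        self.chunks = {}  # chunk number -> bytearray holding CHUNK_BITS bits

    def set(self, pfn):
        i = (pfn - self.start_pfn) >> self.order
        chunk = self.chunks.setdefault(i // self.CHUNK_BITS,
                                       bytearray(self.CHUNK_BITS // 8))
        off = i % self.CHUNK_BITS
        chunk[off // 8] |= 1 << (off % 8)

    def test(self, pfn):
        i = (pfn - self.start_pfn) >> self.order
        chunk = self.chunks.get(i // self.CHUNK_BITS)
        if chunk is None:
            return False
        off = i % self.CHUNK_BITS
        return bool(chunk[off // 8] & (1 << (off % 8)))

    def size_bytes(self):
        return len(self.chunks) * (self.CHUNK_BITS // 8)
```

Neither variant says anything about overlapping zones or zones that change
at runtime; those questions stay open regardless of the data structure.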

>
> I guess part of the difference between us is that I am looking for
> something that is production ready and not a proof of concept. It sounds
> like you would prefer this work stays in a proof of concept stage for some
> time longer.

Certainly not, but I don't think we have to rush. As I said, let's come
to a conclusion if we want this in the allocator or not. For me, other
things (e.g., maintainability) are more important. And AFAICT, also for
Michal and Mel.

>
>> b) Locking optimizations: Come on, these are premature optimizations and
>> nothing to dictate your design. *nobody* but you cares about that in an
>> initial approach we get upstream. We can always optimize that.
>
> My concern isn't so much the locking as the fact that it is the hunt and
> peck approach through a bitmap that will become increasingly more stale as
> you are processing the data. Every bit you have to test for requires
> taking a zone lock and then probing to see if the page is still free and
> the right size. My concern is how much time is going to be spent with the
> zone lock held while other CPUs are waiting on access.

Valid concerns, really. But I don't think these are roadblocks.
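
The cost pattern under discussion can be sketched with a toy userspace model
in Python. This is only an illustration, not the code from either patch set;
scan_bitmap(), bits_for_freed() and their parameters are hypothetical names.
It shows that each set bit costs one zone lock round-trip plus a re-check,
because the bit may have gone stale since the page was freed, and that the
number of bits to visit grows as the reporting order shrinks.

```python
import threading

# Toy userspace model; not kernel code. All names are hypothetical.

def scan_bitmap(set_bits, free_blocks, zone_lock):
    """Return (blocks_to_report, lock_acquisitions) for one scan pass.

    set_bits:    block indices whose bit was set when the block was freed.
    free_blocks: block indices that are still free at scan time.
    """
    to_report = []
    acquisitions = 0
    for block in sorted(set_bits):
        with zone_lock:                  # one lock round-trip per set bit
            acquisitions += 1
            if block in free_blocks:     # bit may be stale: re-check under lock
                to_report.append(block)
    return to_report, acquisitions


def bits_for_freed(freed_pages, order):
    """Number of bitmap bits needed to cover freed_pages at a given order."""
    return freed_pages >> order
```

For example, covering 1 GiB of freed 4 KiB pages takes
bits_for_freed(1 << 18, 9) bits at order 9 but bits_for_freed(1 << 18, 0)
bits at order 0, so the lower order means far more lock acquisitions for the
same amount of memory.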

>
>> c) Kernel panics: Come on, we are talking about complicated RFCs here
>> with moving design decisions. We want to discuss *design* and
>> *architecture* here, not *implementation details*.
>
> Then why ask me to compare performance against it? You were the one
> pushing for me to test it, not me. If you and Nitesh knew the design
> wasn't complete enough to run it why ask me to test it?

The design changed with Michal's comment about page isolation, that was
afterwards, no?

Your performance comparison was very helpful. I think I said back then
that I am interested in fundamental performance differences. You
reported differences, AFAIK Nitesh was able to resolve one (MAX_ORDER -
1 if I'm not wrong) using implementation changes. I *think* he is still
looking into another comparison.

>
> Many of the kernel panics for the patch sets in the past have been related
> to fundamental architectural issues. For example ignoring things like
> NUMA, mangling the free_list by accessing it with the wrong locks held,
> etc.

Yeah, I think Nitesh was still fairly new to the kernel when he started
working on Rik's ideas. I assume he learned a lot during the last
months/years :)

>
>> d) Performance: We want to see a design that fits into the whole
>> architecture cleanly, is maintainable, and provides a benefit. Of
>> course, performance is relevant, but it certainly should not dictate our
>> design of a *virtualization specific optimization feature*. Performance
>> is not everything, otherwise please feel free and rewrite the kernel in
>> ASM and claim it is better because it is faster.
>
> I agree performance is not everything. But when a system grinds down to
> 60% of what it was originally I find that significant.

I totally agree, that's why I asked for a fundamental performance
comparison, which helps to make a decision: "Is this gain in performance
worth moving it into the core?"

>
>> Again, I do value your review and feedback, but I absolutely do not
>> enjoy the way you are trying to push your series here, sorry.
>
> Well I am a bit frustrated as I have had to provide a significant amount
> of feedback on Nitesh's patches, and in spite of that I feel like I am
> getting nothing in return. I have pointed out the various issues and

I can understand the frustration. I reviewed all the parts I feel
comfortable with (e.g., page flag vs. page type, cleanup patches), and
left the core buddy review to experts (Mel), because that's not my area
of expertise (yet, lol). Yeah, MM people are busy.

> opportunities to address the issues. At this point there are sections of
> his code that are directly copied from mine[1]. I have done everything I

Bad: he's not crediting you. Good: Both implementations came to the same
conclusion virtio-wise.

> can to help the patches along but it seems like they aren't getting out of
> RFC or proof-of-concept state any time soon. So with that being the case

My gut feeling is that with page isolation the RFC stage could be over
soon. It heavily simplifies locking/blocking pages from getting
allocated. I might be wrong. But that's what it is when you explore new
ideas.

> why not consider his patch set as something that could end up being a
> follow-on/refactor instead of an alternative to mine?

I guess MM people prefer to start simple and only add core functionality
when really needed / it can be shown that there is a serious performance
impact.

>
>> Yes, if we end up finding out that there is real value in your approach,
>> nothing speaks against considering it. But please don't try to hurry and
>> push your series in that way. Please give everybody time to evaluate.
>
> I would love to argue this patch set on the merits. However I really don't
> feel like I am getting a fair comparison here, at least from you. Every
> other reply on the thread seems to be from you trying to reinforce any
> criticism and taking the opportunity to mention that there is another
> solution out there. It is fine to fight for your own idea, but at least

"for your own idea" - are you saying Nitesh's approach is my idea? I
hope not, otherwise I would get credit for Rik's and Nitesh's work by
simply providing review comments.

Of course it is okay to fight for your own idea.

> let me reply to the criticisms of my own patchset before you pile on. I

Me (+ Michal): Are these core buddy changes really wanted and required?
Can we evaluate the alternatives properly? (Michal even proposed
something very similar to Nitesh's approach before even looking into it.)

You: Please take my patch set, it is better than the alternatives
because of X, for X in {RFC quality, sparse zones, locking internals,
current performance differences}

And all I am requesting is that we do the evaluation, discuss if there
are really no alternatives, and sort out fundamental issues with
external tracking.

Michal asked the very same question again at the beginning of this
thread: "Is there really a consensus"

Reading the replies, "no".

-- 

Thanks,

David / dhildenb