linux-mm.kvack.org archive mirror
* Re: + mm-introduce-reported-pages.patch added to -mm tree
       [not found] <20191106000547.juQRi83gi%akpm@linux-foundation.org>
@ 2019-11-06 12:16 ` Michal Hocko
  2019-11-06 14:09   ` David Hildenbrand
                     ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Michal Hocko @ 2019-11-06 12:16 UTC (permalink / raw)
  To: akpm
  Cc: aarcange, alexander.h.duyck, dan.j.williams, dave.hansen, david,
	konrad.wilk, lcapitulino, mgorman, mm-commits, mst, osalvador,
	pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm

I didn't have time to read through newer versions of this patch series
but I remember there were concerns about this functionality being pulled
into the page allocator previously both by me and Mel [1][2]. Have those been 
addressed? I do not see an ack from Mel or any other MM people. Is there
really a consensus that we want something like that living in the
allocator?

There has also been a different approach discussed and from [3]
(referenced by the cover letter) I can only see

: Then Nitesh's solution had changed to the bitmap approach[7]. However it
: has been pointed out that this solution doesn't deal with sparse memory,
: hotplug, and various other issues.

which looks more like something to be done than a fundamental
roadblock.

[1] http://lkml.kernel.org/r/20190912163525.GV2739@techsingularity.net
[2] http://lkml.kernel.org/r/20190912091925.GM4023@dhcp22.suse.cz
[3] http://lkml.kernel.org/r/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com

On Tue 05-11-19 16:05:47, Andrew Morton wrote:
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Subject: mm: introduce Reported pages
> 
> In order to pave the way for free page reporting in virtualized
> environments we will need a way to get pages out of the free lists and
> identify those pages after they have been returned.  To accomplish this,
> this patch adds the concept of a Reported Buddy, which is essentially
> meant to just be the Uptodate flag used in conjunction with the Buddy page
> type.
> 
> It adds a set of pointers we shall call "reported_boundary" which
> represent the upper boundary between the unreported and reported pages. 
> The general idea is that in order for a page to cross from one side of the
> boundary to the other it will need to verify that it went through the
> reporting process.  Ultimately a free list has been fully processed when
> the boundary has been moved from the tail all the way up to occupying the
> first entry in the list.  Without this we would have to manually walk the
> entire page list until we find a page that hasn't been reported.  In
> my testing this adds as much as 18% additional overhead which would make
> this unattractive as a solution.
> 
> One limitation to this approach is that it is essentially a linear search
> and in the case of the free lists we can have pages added to either the
> head or the tail of the list.  In order to place limits on this we only
> allow pages to be added before the reported_boundary instead of adding to
> the tail itself.  An added advantage to this approach is that we should be
> reducing the overall memory footprint of the guest as it will be more
> likely to recycle warm pages versus trying to allocate the reported pages
> that were likely evicted from the guest memory.
> 
> Since we will only be reporting one zone at a time we keep the boundary
> limited to being defined for just the zone we are currently reporting
> pages from.  Doing this we can keep the number of additional pointers
> needed quite small.  To flag that the boundaries are in place we use a
> single bit in the zone to indicate that reporting and the boundaries are
> active.
> 
> We store the index of the boundary pointer used to track the reported page
> in the page->index value.  Doing this we can avoid unnecessary computation
> to determine the index value again.  There should be no issues with this
> as the value is unused when the page is in the buddy allocator, and is
> reset as soon as the page is removed from the free list.
> 
> Link: http://lkml.kernel.org/r/20191105220219.15144.69031.stgit@localhost.localdomain
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: <konrad.wilk@oracle.com>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Pankaj Gupta <pagupta@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Wei Wang <wei.w.wang@intel.com>
> Cc: Yang Zhang <yang.zhang.wz@gmail.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-- 
Michal Hocko
SUSE Labs
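
To make the changelog above easier to follow, here is a minimal, self-contained
userspace model of the boundary idea (all names are illustrative, not the
patch's identifiers, and the real patch additionally caches the boundary index
in page->index): freed pages are inserted in front of the boundary, and a
reporting pass walks the unreported tail, pulling the boundary toward the head
until the list is fully processed.

#include <stdio.h>

struct page_stub {
	struct page_stub *prev, *next;
	int reported;
};

struct free_list {
	struct page_stub head;		/* list_head-style sentinel */
	struct page_stub *boundary;	/* first reported entry, or &head */
};

static void list_init(struct free_list *fl)
{
	fl->head.prev = fl->head.next = &fl->head;
	fl->boundary = &fl->head;	/* nothing reported yet */
}

/* Insert @p immediately in front of @pos. */
static void insert_before(struct page_stub *pos, struct page_stub *p)
{
	p->next = pos;
	p->prev = pos->prev;
	pos->prev->next = p;
	pos->prev = p;
}

/* Freed pages must land on the unreported side of the boundary. */
static void free_page_stub(struct free_list *fl, struct page_stub *p)
{
	p->reported = 0;
	insert_before(fl->boundary, p);
}

/* One reporting pass: mark entries and pull the boundary toward the head. */
static void report_pass(struct free_list *fl)
{
	while (fl->boundary->prev != &fl->head) {
		struct page_stub *p = fl->boundary->prev;

		p->reported = 1;	/* "tell the hypervisor" */
		fl->boundary = p;
	}
}

int main(void)
{
	struct free_list fl;
	struct page_stub pages[4];
	int i;

	list_init(&fl);
	for (i = 0; i < 4; i++)
		free_page_stub(&fl, &pages[i]);
	report_pass(&fl);

	/* Fully processed once the boundary occupies the first entry. */
	printf("fully reported: %s\n",
	       fl.boundary == fl.head.next ? "yes" : "no");
	return 0;
}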



* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-06 12:16 ` + mm-introduce-reported-pages.patch added to -mm tree Michal Hocko
@ 2019-11-06 14:09   ` David Hildenbrand
  2019-11-06 16:35     ` Alexander Duyck
  2019-11-06 16:49   ` Nitesh Narayan Lal
  2019-11-11 18:52   ` Nitesh Narayan Lal
  2 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2019-11-06 14:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, aarcange, alexander.h.duyck, dan.j.williams, dave.hansen,
	konrad.wilk, lcapitulino, mgorman, mm-commits, mst, osalvador,
	pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm



> Am 06.11.2019 um 13:16 schrieb Michal Hocko <mhocko@kernel.org>:
> 
> I didn't have time to read through newer versions of this patch series
> but I remember there were concerns about this functionality being pulled
> into the page allocator previously both by me and Mel [1][2]. Have those been 
> addressed? I do not see an ack from Mel or any other MM people. Is there
> really a consensus that we want something like that living in the
> allocator?

I don't think there is. The discussion is still ongoing (although quiet, Nitesh is working on a new version AFAIK). I think we should not rush this.

> 
> There has also been a different approach discussed and from [3]
> (referenced by the cover letter) I can only see
> 
> : Then Nitesh's solution had changed to the bitmap approach[7]. However it
> : has been pointed out that this solution doesn't deal with sparse memory,
> : hotplug, and various other issues.
> 
> which looks more like something to be done than a fundamental
> roadblock.

I second that. As I stated a couple of times already, it is totally fine to not support all environments initially (hotunplug, sparse zones). The major difference I am interested in is performance comparison. Then we have to decide if the gain in performance is worth core buddy modifications.

> 
> [1] http://lkml.kernel.org/r/20190912163525.GV2739@techsingularity.net
> [2] http://lkml.kernel.org/r/20190912091925.GM4023@dhcp22.suse.cz
> [3] http://lkml.kernel.org/r/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com
> 
>> On Tue 05-11-19 16:05:47, Andrew Morton wrote:
>> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>> Subject: mm: introduce Reported pages
>> 
>> In order to pave the way for free page reporting in virtualized
>> environments we will need a way to get pages out of the free lists and
>> identify those pages after they have been returned.  To accomplish this,
>> this patch adds the concept of a Reported Buddy, which is essentially
>> meant to just be the Uptodate flag used in conjunction with the Buddy page
>> type.
>> 
>> It adds a set of pointers we shall call "reported_boundary" which
>> represent the upper boundary between the unreported and reported pages. 
>> The general idea is that in order for a page to cross from one side of the
>> boundary to the other it will need to verify that it went through the
>> reporting process.  Ultimately a free list has been fully processed when
>> the boundary has been moved from the tail all the way up to occupying the
>> first entry in the list.  Without this we would have to manually walk the
>> entire page list until we find a page that hasn't been reported.  In
>> my testing this adds as much as 18% additional overhead which would make
>> this unattractive as a solution.
>> 
>> One limitation to this approach is that it is essentially a linear search
>> and in the case of the free lists we can have pages added to either the
>> head or the tail of the list.  In order to place limits on this we only
>> allow pages to be added before the reported_boundary instead of adding to
>> the tail itself.  An added advantage to this approach is that we should be
>> reducing the overall memory footprint of the guest as it will be more
>> likely to recycle warm pages versus trying to allocate the reported pages
>> that were likely evicted from the guest memory.
>> 
>> Since we will only be reporting one zone at a time we keep the boundary
>> limited to being defined for just the zone we are currently reporting
>> pages from.  Doing this we can keep the number of additional pointers
>> needed quite small.  To flag that the boundaries are in place we use a
>> single bit in the zone to indicate that reporting and the boundaries are
>> active.
>> 
>> We store the index of the boundary pointer used to track the reported page
>> in the page->index value.  Doing this we can avoid unnecessary computation
>> to determine the index value again.  There should be no issues with this
>> as the value is unused when the page is in the buddy allocator, and is
>> reset as soon as the page is removed from the free list.
>> 
>> Link: http://lkml.kernel.org/r/20191105220219.15144.69031.stgit@localhost.localdomain
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Hansen <dave.hansen@intel.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: <konrad.wilk@oracle.com>
>> Cc: Luiz Capitulino <lcapitulino@redhat.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Pankaj Gupta <pagupta@redhat.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Rik van Riel <riel@surriel.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Wei Wang <wei.w.wang@intel.com>
>> Cc: Yang Zhang <yang.zhang.wz@gmail.com>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> -- 
> Michal Hocko
> SUSE Labs



* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-06 14:09   ` David Hildenbrand
@ 2019-11-06 16:35     ` Alexander Duyck
  2019-11-06 16:54       ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2019-11-06 16:35 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

On Wed, 2019-11-06 at 15:09 +0100, David Hildenbrand wrote:
> > Am 06.11.2019 um 13:16 schrieb Michal Hocko <mhocko@kernel.org>:
> > 
> > I didn't have time to read through newer versions of this patch series
> > but I remember there were concerns about this functionality being pulled
> > into the page allocator previously both by me and Mel [1][2]. Have those been 
> > addressed? I do not see an ack from Mel or any other MM people. Is there
> > really a consensus that we want something like that living in the
> > allocator?
> 
> I don't think there is. The discussion is still ongoing (although quiet,
> Nitesh is working on a new version AFAIK). I think we should not rush
> this.

How much time is needed to get a review? I waited 2 weeks since posting
v12 and the only comments I got on the code were from Andrew. Most of this
hasn't changed much since v10 and that was posted back in mid September. I
have been down to making small tweaks here and there and haven't had any
real critiques on the approach since Mel had the comments about conflicts
with compaction which I addressed by allowing compaction to punt the
reporter out so that it could split and splice the lists as it walked
through them.

The fact is we have been discussing this since 2011. I think it makes
sense at this point to push a solution if we can do so without impacting
performance negatively.

It isn't as though this exposes a userspace interface so we could always
go back and change the internal implementation if Nitesh comes up with
something that works out better than this. The only piece that will be
locked down is the virtio interface and both Nitesh's solution and mine
are using the same one at this point.

> > There has also been a different approach discussed and from [3]
> > (referenced by the cover letter) I can only see
> > 
> > : Then Nitesh's solution had changed to the bitmap approach[7]. However it
> > : has been pointed out that this solution doesn't deal with sparse memory,
> > : hotplug, and various other issues.
> > 
> > which looks more like something to be done than a fundamental
> > roadblock.
> 
> I second that. As I stated a couple of times already, it is totally fine
> to not support all environments initially (hotunplug, sparse zones). The
> major difference I am interested in is performance comparison. Then we
> have to decide if the gain in performance is worth core buddy
> modifications.

I only modified three spots in the buddy allocator really. I forced pages
added to the tail of the list to be placed before the iterator, I added
logic so that if the page pointed to by the iterator is pulled from the
list we can wind the pointer back, and I added a call in __free_one_page
that triggers the page reporting.
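
A tiny self-contained sketch of that second change, with invented names,
showing the iterator being wound back when the entry it points at is pulled
off the list:

#include <assert.h>

struct node {
	struct node *prev, *next;
};

struct free_list {
	struct node head;	/* sentinel */
	struct node *boundary;	/* reporting iterator */
};

static void remove_node(struct free_list *fl, struct node *n)
{
	/* Wind the iterator back before the entry disappears. */
	if (fl->boundary == n)
		fl->boundary = n->prev;

	n->prev->next = n->next;
	n->next->prev = n->prev;
}

int main(void)
{
	struct free_list fl = { .head = { &fl.head, &fl.head } };
	struct node a = { &fl.head, &fl.head };

	/* Single-entry list: head <-> a, iterator parked on a. */
	fl.head.next = fl.head.prev = &a;
	fl.boundary = &a;

	remove_node(&fl, &a);		/* e.g. the page gets allocated */
	assert(fl.boundary == &fl.head);
	return 0;
}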

If nothing else we could consider my solution the baseline for now. If
Nitesh comes up with something better we could then look at replacing my
code with his. Since he is now using the same virtio-balloon interface as I
am, there isn't really much point in holding things up since we could
always refactor/replace the implementation later if Nitesh comes up with a
better approach.





* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-06 12:16 ` + mm-introduce-reported-pages.patch added to -mm tree Michal Hocko
  2019-11-06 14:09   ` David Hildenbrand
@ 2019-11-06 16:49   ` Nitesh Narayan Lal
  2019-11-11 18:52   ` Nitesh Narayan Lal
  2 siblings, 0 replies; 47+ messages in thread
From: Nitesh Narayan Lal @ 2019-11-06 16:49 UTC (permalink / raw)
  To: Michal Hocko, akpm
  Cc: aarcange, alexander.h.duyck, dan.j.williams, dave.hansen, david,
	konrad.wilk, lcapitulino, mgorman, mm-commits, mst, osalvador,
	pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm


On 11/6/19 7:16 AM, Michal Hocko wrote:
> I didn't have time to read through newer versions of this patch series
> but I remember there were concerns about this functionality being pulled
> into the page allocator previously both by me and Mel [1][2]. Have those been 
> addressed? I do not see an ack from Mel or any other MM people. Is there
> really a consensus that we want something like that living in the
> allocator?
>
> There has also been a different approach discussed and from [3]
> (referenced by the cover letter) I can only see
>
> : Then Nitesh's solution had changed to the bitmap approach[7]. However it
> : has been pointed out that this solution doesn't deal with sparse memory,
> : hotplug, and various other issues.
>
> which looks more like something to be done than a fundamental
> roadblock.

I agree. I think the major issue with my series would be the performance
drop which we are observing specifically with (MAX_ORDER - 2) as the minimum
reporting order.
That is something I am trying to investigate and see if I can fix it.

With (MAX_ORDER - 1), I was able to get a performance similar to what Alexander
has reported.

>
> [1] http://lkml.kernel.org/r/20190912163525.GV2739@techsingularity.net
> [2] http://lkml.kernel.org/r/20190912091925.GM4023@dhcp22.suse.cz
> [3] http://lkml.kernel.org/r/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com
>
> On Tue 05-11-19 16:05:47, Andrew Morton wrote:
>> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>> Subject: mm: introduce Reported pages
>>
>> In order to pave the way for free page reporting in virtualized
>> environments we will need a way to get pages out of the free lists and
>> identify those pages after they have been returned.  To accomplish this,
>> this patch adds the concept of a Reported Buddy, which is essentially
>> meant to just be the Uptodate flag used in conjunction with the Buddy page
>> type.
>>
>> It adds a set of pointers we shall call "reported_boundary" which
>> represent the upper boundary between the unreported and reported pages. 
>> The general idea is that in order for a page to cross from one side of the
>> boundary to the other it will need to verify that it went through the
>> reporting process.  Ultimately a free list has been fully processed when
>> the boundary has been moved from the tail all the way up to occupying the
>> first entry in the list.  Without this we would have to manually walk the
>> entire page list until we find a page that hasn't been reported.  In
>> my testing this adds as much as 18% additional overhead which would make
>> this unattractive as a solution.
>>
>> One limitation to this approach is that it is essentially a linear search
>> and in the case of the free lists we can have pages added to either the
>> head or the tail of the list.  In order to place limits on this we only
>> allow pages to be added before the reported_boundary instead of adding to
>> the tail itself.  An added advantage to this approach is that we should be
>> reducing the overall memory footprint of the guest as it will be more
>> likely to recycle warm pages versus trying to allocate the reported pages
>> that were likely evicted from the guest memory.
>>
>> Since we will only be reporting one zone at a time we keep the boundary
>> limited to being defined for just the zone we are currently reporting
>> pages from.  Doing this we can keep the number of additional pointers
>> needed quite small.  To flag that the boundaries are in place we use a
>> single bit in the zone to indicate that reporting and the boundaries are
>> active.
>>
>> We store the index of the boundary pointer used to track the reported page
>> in the page->index value.  Doing this we can avoid unnecessary computation
>> to determine the index value again.  There should be no issues with this
>> as the value is unused when the page is in the buddy allocator, and is
>> reset as soon as the page is removed from the free list.
>>
>> Link: http://lkml.kernel.org/r/20191105220219.15144.69031.stgit@localhost.localdomain
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Hansen <dave.hansen@intel.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: <konrad.wilk@oracle.com>
>> Cc: Luiz Capitulino <lcapitulino@redhat.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Pankaj Gupta <pagupta@redhat.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Rik van Riel <riel@surriel.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Wei Wang <wei.w.wang@intel.com>
>> Cc: Yang Zhang <yang.zhang.wz@gmail.com>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-- 
Thanks
Nitesh




* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-06 16:35     ` Alexander Duyck
@ 2019-11-06 16:54       ` Michal Hocko
  2019-11-06 17:48         ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2019-11-06 16:54 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Hildenbrand, akpm, aarcange, dan.j.williams, dave.hansen,
	konrad.wilk, lcapitulino, mgorman, mm-commits, mst, osalvador,
	pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm

On Wed 06-11-19 08:35:43, Alexander Duyck wrote:
> On Wed, 2019-11-06 at 15:09 +0100, David Hildenbrand wrote:
> > > Am 06.11.2019 um 13:16 schrieb Michal Hocko <mhocko@kernel.org>:
> > > 
> > > I didn't have time to read through newer versions of this patch series
> > > but I remember there were concerns about this functionality being pulled
> > > into the page allocator previously both by me and Mel [1][2]. Have those been 
> > > addressed? I do not see an ack from Mel or any other MM people. Is there
> > > really a consensus that we want something like that living in the
> > > allocator?
> > 
> > I don't think there is. The discussion is still ongoing (although quiet,
> > Nitesh is working on a new version AFAIK). I think we should not rush
> > this.
> 
> How much time is needed to get a review? I waited 2 weeks since posting
> v12 and the only comments I got on the code were from Andrew. Most of this
> hasn't changed much since v10 and that was posted back in mid September. I
> have been down to making small tweaks here and there and haven't had any
> real critiques on the approach since Mel had the comments about conflicts
> with compaction which I addressed by allowing compaction to punt the
> reporter out so that it could split and splice the lists as it walked
> through them.

Well, people are busy and MM community is not a large one. I cannot
really help you much other than keep poking those people and give
reasonable arguments so they decide to ack your patch.

I definitely do not intend to nack this work, I just have maintainability
concerns and considering there is an alternative approach that does not
require to touch page allocator internals and which we need to compare
against then I do not really think there is any need to push something
in right away. Or is there any pressing reason to have this merged right
now?
-- 
Michal Hocko
SUSE Labs



* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-06 16:54       ` Michal Hocko
@ 2019-11-06 17:48         ` Alexander Duyck
  2019-11-06 22:11           ` Mel Gorman
  2019-11-06 23:33           ` David Hildenbrand
  0 siblings, 2 replies; 47+ messages in thread
From: Alexander Duyck @ 2019-11-06 17:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, akpm, aarcange, dan.j.williams, dave.hansen,
	konrad.wilk, lcapitulino, mgorman, mm-commits, mst, osalvador,
	pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm

On Wed, 2019-11-06 at 17:54 +0100, Michal Hocko wrote:
> On Wed 06-11-19 08:35:43, Alexander Duyck wrote:
> > On Wed, 2019-11-06 at 15:09 +0100, David Hildenbrand wrote:
> > > > Am 06.11.2019 um 13:16 schrieb Michal Hocko <mhocko@kernel.org>:
> > > > 
> > > > I didn't have time to read through newer versions of this patch series
> > > > but I remember there were concerns about this functionality being pulled
> > > > into the page allocator previously both by me and Mel [1][2]. Have those been 
> > > > addressed? I do not see an ack from Mel or any other MM people. Is there
> > > > really a consensus that we want something like that living in the
> > > > allocator?
> > > 
> > > I don't think there is. The discussion is still ongoing (although quiet,
> > > Nitesh is working on a new version AFAIK). I think we should not rush
> > > this.
> > 
> > How much time is needed to get a review? I waited 2 weeks since posting
> > v12 and the only comments I got on the code were from Andrew. Most of this
> > hasn't changed much since v10 and that was posted back in mid September. I
> > have been down to making small tweaks here and there and haven't had any
> > real critiques on the approach since Mel had the comments about conflicts
> > with compaction which I addressed by allowing compaction to punt the
> > reporter out so that it could split and splice the lists as it walked
> > through them.
> 
> Well, people are busy and MM community is not a large one. I cannot
> really help you much other than keep poking those people and give
> reasonable arguments so they decide to ack your patch.

I get that. But v10 was posted in mid September. Back then we had a
discussion about addressing what Mel had mentioned and I had mentioned
then that I had addressed it by allowing compaction to essentially reset
the reporter to get it out of the list so compaction could do this split
and splice tumbling logic.

> I definitely do not intend to nack this work, I just have maintainability
> concerns and considering there is an alternative approach that does not
> require to touch page allocator internals and which we need to compare
> against then I do not really think there is any need to push something
> in right away. Or is there any pressing reason to have this merged right
> now?

The alternative approach doesn't touch the page allocator, however it
still has essentially the same changes to __free_one_page. I suspect the
performance issue seen is mostly due to the fact that because it doesn't
touch the page allocator it is taking the zone lock and probing the page
for each set bit to see if the page is still free. As such the performance
regression seen gets worse the lower the order used for reporting.

Also I suspect Nitesh's patches are in need of further review. I have
provided feedback, however my focus ended up being more on the kernel
panics and 30% performance regression than on debating architecture.




* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-06 17:48         ` Alexander Duyck
@ 2019-11-06 22:11           ` Mel Gorman
  2019-11-06 23:38             ` David Hildenbrand
  2019-11-07  0:20             ` Alexander Duyck
  2019-11-06 23:33           ` David Hildenbrand
  1 sibling, 2 replies; 47+ messages in thread
From: Mel Gorman @ 2019-11-06 22:11 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michal Hocko, David Hildenbrand, akpm, aarcange, dan.j.williams,
	dave.hansen, konrad.wilk, lcapitulino, mm-commits, mst,
	osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm

On Wed, Nov 06, 2019 at 09:48:14AM -0800, Alexander Duyck wrote:
> On Wed, 2019-11-06 at 17:54 +0100, Michal Hocko wrote:
> > On Wed 06-11-19 08:35:43, Alexander Duyck wrote:
> > > On Wed, 2019-11-06 at 15:09 +0100, David Hildenbrand wrote:
> > > > > Am 06.11.2019 um 13:16 schrieb Michal Hocko <mhocko@kernel.org>:
> > > > > 
> > > > > I didn't have time to read through newer versions of this patch series
> > > > > but I remember there were concerns about this functionality being pulled
> > > > > into the page allocator previously both by me and Mel [1][2]. Have those been 
> > > > > addressed? I do not see an ack from Mel or any other MM people. Is there
> > > > > really a consensus that we want something like that living in the
> > > > > allocator?
> > > > 
> > > > I don't think there is. The discussion is still ongoing (although quiet,
> > > > Nitesh is working on a new version AFAIK). I think we should not rush
> > > > this.
> > > 
> > > How much time is needed to get a review? I waited 2 weeks since posting
> > > v12 and the only comments I got on the code were from Andrew. Most of this
> > > hasn't changed much since v10 and that was posted back in mid September. I
> > > have been down to making small tweaks here and there and haven't had any
> > > real critiques on the approach since Mel had the comments about conflicts
> > > with compaction which I addressed by allowing compaction to punt the
> > > reporter out so that it could split and splice the lists as it walked
> > > through them.
> > 
> > Well, people are busy and MM community is not a large one. I cannot
> > really help you much other than keep poking those people and give
> > reasonable arguments so they decide to ack your patch.
> 
> I get that. But v10 was posted in mid September. Back then we had a
> discussion about addressing what Mel had mentioned and I had mentioned
> then that I had addressed it by allowing compaction to essentially reset
> the reporter to get it out of the list so compaction could do this split
> and splice tumbling logic.
> 

At the time I said "While in theory that could be addressed by always
going through an interface maintained by the page allocator, it would be
tricky to test the virtio case in particular."

Now, I get that you added an interface for that *but* if compaction was
ever updated or yet another approach was taken to deal with it, virtio
could get broken. If the page allocator itself has a bug or compaction
has a bug, the effect is very visible. While stuff does slip through, it
tends to be an obscure corner case. However, when virtio gets broken,
it'll be a long time before we get it.

Then we get onto the locking, the API appears to require the zone lock
be held which is not particularly obvious but it gets better. The zone
lock is a *heavy* lock now. By rights, it should be split into a
freelist and zone metadata lock as appropriate and then split the
freelist lock. Compaction could be updated and checked trivially, but
not so much virtio.

> > > I definitely do not intend to nack this work, I just have maintainability
> > concerns and considering there is an alternative approach that does not
> > require to touch page allocator internals and which we need to compare
> > against then I do not really think there is any need to push something
> > in right away. Or is there any pressing reason to have this merged right
> > now?
> 
> The alternative approach doesn't touch the page allocator, however it
> still has essentially the same changes to __free_one_page. I suspect the
> performance issue seen is mostly due to the fact that because it doesn't
> touch the page allocator it is taking the zone lock and probing the page
> for each set bit to see if the page is still free. As such the performance
> regression seen gets worse the lower the order used for reporting.
> 

What confused me quite a lot is that this is enabled at compile time
and then incurs a performance hit whether or not a hypervisor that
even cares is involved. So I don't think the performance angle
justifying this approach is a good one because this implementation has
issues of its own. Previously I said

	I worry that poking too much into the internal state of the
	allocator will be fragile long-term. There are the arch alloc/free
	hooks but they are typically about protections only and do not
	interfere with the internal state of the allocator. Compaction
	pokes in as well but once the page is off the free list, the page
	allocator no longer cares so again there is no interference with
	the internal state. If the state is interfered with externally,
	it becomes unclear what happens if things like page merging are
	deferred in a way the allocator cannot control as high-order
	allocation requests may fail for example.

Adding an API for compaction does not get away from the problem that
it'll be fragile to depend on the internal state of the allocator for
correctness. Existing users that poke into the state do so as an
optimistic shortcut but if it fails, nothing actually breaks. The free
list reporting stuff might break, and it will not be routinely tested.

Take compaction as an example, the initial implementation of it was dumb
as rocks and only started maintaining additional state and later poking
into the page allocator when there was empirical evidence it was necessary.

The initial implementation of page reporting should be the same, it
should do no harm at all to users that don't care (hiding behind
kconfig is not good enough, use static branches) and it should not
depend on the internal state of the page allocator and ordering of free
lists for correctness until it's shown it's absolutely necessary.
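
For illustration, the kind of gating being suggested here can be sketched with
the kernel's jump-label API; the key and helper names below are made up, not
taken from any posted patch:

#include <linux/jump_label.h>

DEFINE_STATIC_KEY_FALSE(page_reporting_enabled_key);

/* Called when a backend (e.g. virtio-balloon) actually registers. */
void page_reporting_enable(void)
{
	static_branch_enable(&page_reporting_enabled_key);
}

/* Allocator hot path: compiles to a patched-out jump while disabled. */
static inline void page_reporting_notify_free(unsigned int order)
{
	if (static_branch_unlikely(&page_reporting_enabled_key)) {
		/* ...queue deferred work to report newly freed pages... */
	}
}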

You say that the zone lock has to be taken in the alternative
implementation to check if it's still free and sure, that would cost
but unless you are isolating that page immediately then you are racing
once the lock is released. If you are isolating immediately, then isolate
pages in batches to amortise the lock costs.  The details of this could
be really hard but this approach is essentially saying "everything,
everywhere should take a small hit so the overhead is not noticeable for
virtio users" which is a curious choice for a new feature.
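
A rough sketch of that batching idea, assuming it lives in mm/ so that the
internal __isolate_free_page() helper is visible; the batch size and function
name are placeholders, not a tested implementation:

#include <linux/mm.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include "internal.h"		/* for __isolate_free_page() */

#define REPORT_BATCH	16	/* arbitrary batch size for illustration */

static void report_one_batch(struct zone *zone, unsigned int order, int mt)
{
	struct list_head *list = &zone->free_area[order].free_list[mt];
	struct page *page, *next;
	unsigned long flags;
	LIST_HEAD(batch);
	int count = 0;

	/* Take the zone lock once and pull off a whole batch... */
	spin_lock_irqsave(&zone->lock, flags);
	list_for_each_entry_safe(page, next, list, lru) {
		if (!__isolate_free_page(page, order))
			continue;
		list_add(&page->lru, &batch);
		if (++count >= REPORT_BATCH)
			break;
	}
	spin_unlock_irqrestore(&zone->lock, flags);

	/*
	 * ...then report/madvise the isolated pages with the lock dropped
	 * and hand them back to the allocator afterwards.
	 */
}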

Regardless of the details of any implementation, the first one should be
basic, do no harm and be relatively simple giving just a bare interface
to virtio/qemu/etc. Then optimise it until such point as there is no
chance but to poke into the core.

-- 
Mel Gorman
SUSE Labs



* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-06 17:48         ` Alexander Duyck
  2019-11-06 22:11           ` Mel Gorman
@ 2019-11-06 23:33           ` David Hildenbrand
  2019-11-07  0:20             ` Dave Hansen
  2019-11-07 18:02             ` Alexander Duyck
  1 sibling, 2 replies; 47+ messages in thread
From: David Hildenbrand @ 2019-11-06 23:33 UTC (permalink / raw)
  To: Alexander Duyck, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

On 06.11.19 18:48, Alexander Duyck wrote:
> On Wed, 2019-11-06 at 17:54 +0100, Michal Hocko wrote:
>> On Wed 06-11-19 08:35:43, Alexander Duyck wrote:
>>> On Wed, 2019-11-06 at 15:09 +0100, David Hildenbrand wrote:
>>>>> Am 06.11.2019 um 13:16 schrieb Michal Hocko <mhocko@kernel.org>:
>>>>>
>>>>> I didn't have time to read through newer versions of this patch series
>>>>> but I remember there were concerns about this functionality being pulled
>>>>> into the page allocator previously both by me and Mel [1][2]. Have those been
>>>>> addressed? I do not see an ack from Mel or any other MM people. Is there
>>>>> really a consensus that we want something like that living in the
>>>>> allocator?
>>>>
>>>> I don't think there is. The discussion is still ongoing (although quiet,
>>>> Nitesh is working on a new version AFAIK). I think we should not rush
>>>> this.
>>>
>>> How much time is needed to get a review? I waited 2 weeks since posting
>>> v12 and the only comments I got on the code were from Andrew. Most of this
>>> hasn't changed much since v10 and that was posted back in mid September. I
>>> have been down to making small tweaks here and there and haven't had any
>>> real critiques on the approach since Mel had the comments about conflicts
>>> with compaction which I addressed by allowing compaction to punt the
>>> reporter out so that it could split and splice the lists as it walked
>>> through them.
>>
>> Well, people are busy and MM community is not a large one. I cannot
>> really help you much other than keep poking those people and give
>> reasonable arguments so they decide to ack your patch.
> 
> I get that. But v10 was posted in mid September. Back then we had a
> discussion about addressing what Mel had mentioned and I had mentioned
> then that I had addressed it by allowing compaction to essentially reset
> the reporter to get it out of the list so compaction could do this split
> and splice tumbling logic.
> 
>> I definitely do not intend to nack this work, I just have maintainability
>> concerns and considering there is an alternative approach that does not
>> require to touch page allocator internals and which we need to compare
>> against then I do not really think there is any need to push something
>> in right away. Or is there any pressing reason to have this merged right
>> now?
> 
> The alternative approach doesn't touch the page allocator, however it
> still has essentially the same changes to __free_one_page. I suspect the

Nitesh is working on Michals suggestion to use page isolation instead 
AFAIK - which avoids this.

> performance issue seen is mostly due to the fact that because it doesn't
> touch the page allocator it is taking the zone lock and probing the page
> for each set bit to see if the page is still free. As such the performance
> regression seen gets worse the lower the order used for reporting.
> 
> Also I suspect Nitesh's patches are in need of further review. I have
> provided feedback, however my focus ended up being more on the kernel
> panics and 30% performance regression than on debating architecture.

Please don't take this personally, but I really dislike you talking about 
Nitesh's RFCs (!) and pushing for your approach (although it was you that 
was late to the party!) in that way. If there are problems then please 
collaborate and fix instead of using the same wrong arguments over and 
over again.

a) hotplug/sparse zones: I explained a couple of times why we can ignore 
that. There was never a reply from you, yet you keep coming up with 
that. I don't enjoy talking to a wall.

b) Locking optimizations: Come on, these are premature optimizations and 
should not dictate your design. *nobody* but you cares about that in an 
initial approach we get upstream. We can always optimize that.

c) Kernel panics: Come on, we are talking about complicated RFCs here 
with moving design decisions. We want to discuss *design* and 
*architecture* here, not *implementation details*.

d) Performance: We want to see a design that fits into the whole 
architecture cleanly, is maintainable, and provides a benefit. Of 
course, performance is relevant, but it certainly should not dictate our 
design of a *virtualization specific optimization feature*. Performance 
is not everything, otherwise please feel free to rewrite the kernel in 
ASM and claim it is better because it is faster.

Again, I do value your review and feedback, but I absolutely do not 
enjoy the way you are trying to push your series here, sorry.

Yes, if we end up finding out that there is real value in your approach, 
nothing speaks against considering it. But please don't try to hurry and 
push your series in that way. Please give everybody time to evaluate.

-- 

Thanks,

David / dhildenb




* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-06 22:11           ` Mel Gorman
@ 2019-11-06 23:38             ` David Hildenbrand
  2019-11-07  0:20             ` Alexander Duyck
  1 sibling, 0 replies; 47+ messages in thread
From: David Hildenbrand @ 2019-11-06 23:38 UTC (permalink / raw)
  To: Mel Gorman, Alexander Duyck
  Cc: Michal Hocko, akpm, aarcange, dan.j.williams, dave.hansen,
	konrad.wilk, lcapitulino, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

>>> I definitely do not intend to nack this work, I just have maintainability
>>> concerns and considering there is an alternative approach that does not
>>> require to touch page allocator internals and which we need to compare
>>> against then I do not really think there is any need to push something
>>> in right away. Or is there any pressing reason to have this merged right
>>> now?
>>
>> The alternative approach doesn't touch the page allocator, however it
>> still has essentially the same changes to __free_one_page. I suspect the
>> performance issue seen is mostly due to the fact that because it doesn't
>> touch the page allocator it is taking the zone lock and probing the page
>> for each set bit to see if the page is still free. As such the performance
>> regression seen gets worse the lower the order used for reporting.
>>
> 
> What confused me quite a lot is that this is enabled at compile time
> and then incurs a performance hit whether or not a hypervisor that
> even cares is involved. So I don't think the performance angle
> justifying this approach is a good one because this implementation has
> issues of its own. Previously I said
> 
> 	I worry that poking too much into the internal state of the
> 	allocator will be fragile long-term. There are the arch alloc/free
> 	hooks but they are typically about protections only and do not
> 	interfere with the internal state of the allocator. Compaction
> 	pokes in as well but once the page is off the free list, the page
> 	allocator no longer cares so again there is no interference with
> 	the internal state. If the state is interfered with externally,
> 	it becomes unclear what happens if things like page merging are
> 	deferred in a way the allocator cannot control as high-order
> 	allocation requests may fail for example.
> 
> Adding an API for compaction does not get away from the problem that
> it'll be fragile to depend on the internal state of the allocator for
> correctness. Existing users that poke into the state do so as an
> optimistic shortcut but if it fails, nothing actually breaks. The free
> list reporting stuff might break, and it will not be routinely tested.
> 
> Take compaction as an example, the initial implementation of it was dumb
> as rocks and only started maintaining additional state and later poking
> into the page allocator when there was empirical evidence it was necessary.
> 
> The initial implementation of page reporting should be the same, it
> should do no harm at all to users that don't care (hiding behind
> kconfig is not good enough, use static branches) and it should not
> depend on the internal state of the page allocator and ordering of free
> lists for correctness until it's shown it's absolutely necessary.
> 
> You say that the zone lock has to be taken in the alternative
> implementation to check if it's still free and sure, that would cost
> but unless you are isolating that page immediately then you are racing
> once the lock is released. If you are isolating immediately, then isolate
> pages in batches to amortise the lock costs.  The details of this could
> be really hard but this approach is essentially saying "everything,
> everywhere should take a small hit so the overhead is not noticeable for
> virtio users" which is a curious choice for a new feature.
> 
> Regardless of the details of any implementation, the first one should be
> basic, do no harm and be relatively simple giving just a bare interface
> to virtio/qemu/etc. Then optimise it until such point as there is no
> chance but to poke into the core.

I second that. If somebody would ask me, I'd want to see a simple, 
maintainable design that provides a net benefit, does not harm !virt and 
possibly reuses existing core functionality (e.g., page isolation). We 
can work from there to optimize.

-- 

Thanks,

David / dhildenb




* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-06 23:33           ` David Hildenbrand
@ 2019-11-07  0:20             ` Dave Hansen
  2019-11-07  0:52               ` David Hildenbrand
  2019-11-07 18:02             ` Alexander Duyck
  1 sibling, 1 reply; 47+ messages in thread
From: Dave Hansen @ 2019-11-07  0:20 UTC (permalink / raw)
  To: David Hildenbrand, Alexander Duyck, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, konrad.wilk, lcapitulino,
	mgorman, mm-commits, mst, osalvador, pagupta, pbonzini, riel,
	vbabka, wei.w.wang, willy, yang.zhang.wz, linux-mm

On 11/6/19 3:33 PM, David Hildenbrand wrote:
> We want to discuss *design* and *architecture* here, not
> *implementation details*.

Could folks remind me what they see as the high-level design delta
between these two approaches?

It seems to me like Alex's approach tries to minimize new metadata, but
imposes some rules on the internals of the allocator.  Nitesh's is more
disconnected from the allocator, but has the increased complexity of
managing some new disjoint metadata.

Right?

Neither one of those seems particularly good or bad on face value.
They've just staked out two reasonable but different approaches.  Is
there anything intractably bad about either approach that I'm missing?




* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-06 22:11           ` Mel Gorman
  2019-11-06 23:38             ` David Hildenbrand
@ 2019-11-07  0:20             ` Alexander Duyck
  2019-11-07 10:20               ` Mel Gorman
  1 sibling, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2019-11-07  0:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, David Hildenbrand, akpm, aarcange, dan.j.williams,
	dave.hansen, konrad.wilk, lcapitulino, mm-commits, mst,
	osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm

On Wed, 2019-11-06 at 22:11 +0000, Mel Gorman wrote:
> On Wed, Nov 06, 2019 at 09:48:14AM -0800, Alexander Duyck wrote:
> > On Wed, 2019-11-06 at 17:54 +0100, Michal Hocko wrote:
> > > On Wed 06-11-19 08:35:43, Alexander Duyck wrote:
> > > > On Wed, 2019-11-06 at 15:09 +0100, David Hildenbrand wrote:
> > > > > > Am 06.11.2019 um 13:16 schrieb Michal Hocko <mhocko@kernel.org>:
> > > > > > 
> > > > > > I didn't have time to read through newer versions of this patch series
> > > > > > but I remember there were concerns about this functionality being pulled
> > > > > > into the page allocator previously both by me and Mel [1][2]. Have those been 
> > > > > > addressed? I do not see an ack from Mel or any other MM people. Is there
> > > > > > really a consensus that we want something like that living in the
> > > > > > allocator?
> > > > > 
> > > > > I don't think there is. The discussion is still ongoing (although quiet,
> > > > > Nitesh is working on a new version AFAIK). I think we should not rush
> > > > > this.
> > > > 
> > > > How much time is needed to get a review? I waited 2 weeks since posting
> > > > v12 and the only comments I got on the code were from Andrew. Most of this
> > > > hasn't changed much since v10 and that was posted back in mid September. I
> > > > have been down to making small tweaks here and there and haven't had any
> > > > real critiques on the approach since Mel had the comments about conflicts
> > > > with compaction which I addressed by allowing compaction to punt the
> > > > reporter out so that it could split and splice the lists as it walked
> > > > through them.
> > > 
> > > Well, people are busy and MM community is not a large one. I cannot
> > > really help you much other than keep poking those people and give
> > > reasonable arguments so they decide to ack your patch.
> > 
> > I get that. But v10 was posted in mid September. Back then we had a
> > discussion about addressing what Mel had mentioned and I had mentioned
> > then that I had addressed it by allowing compaction to essentially reset
> > the reporter to get it out of the list so compaction could do this split
> > and splice tumbling logic.
> > 
> 
> At the time I said "While in theory that could be addressed by always
> going through an interface maintained by the page allocator, it would be
> tricky to test the virtio case in particular."
> 
> Now, I get that you added an interface for that *but* if compaction was
> ever updated or yet another approach was taken to deal with it, virtio
> could get broken. If the page allocator itself has a bug or compaction
> has a bug, the effect is very visible. While stuff does slip through, it
> tends to be an obscure corner case. However, when virtio gets broken,
> it'll be a long time before we get it.

Specifically all we are doing is walking through a list using a pointer as
an iterator.  I would think we would likely see the page reporting blow up
pretty quickly if we were to somehow mess up the list so bad that it still
had access to a page that was no longer on the list. Other than that if
the list is just shuffled without resetting the pointer then worst case
would be that we end up with the reporting being rearmed as soon as we
were complete due to a batch of unreported pages being shuffled past the
iterator.

> Then we get onto the locking, the API appears to require the zone lock
> be held which is not particularly obvious but it gets better. The zone
> lock is a *heavy* lock now. By rights, it should be split into a
> freelist and zone metadata lock as appropriate and then split the
> freelist lock. Compaction could be updated and checked trivially, but
> not so much virtio.

The zone metadata I added is all stuff I would have preferred to put in
the freelist anyway. However doing that would cause a performance
regression since it would cause our highest order zone to share a cache
line with the zone lock. Instead I am maintaining it via the zone flags
and a separately allocated block of reported pages statistics.

If you are talking about the ph_dev_info it already has a separate mutex
to protect it when we are adding/removing it.

> > > I definitely do not intend to nack this work, I just have maintainability
> > > concerns and considering there is an alternative approach that does not
> > > require to touch page allocator internals and which we need to compare
> > > against then I do not really think there is any need to push something
> > > in right away. Or is there any pressing reason to have this merged right
> > > now?
> > 
> > The alternative approach doesn't touch the page allocator, however it
> > still has essentially the same changes to __free_one_page. I suspect the
> > performance issue seen is mostly due to the fact that because it doesn't
> > touch the page allocator it is taking the zone lock and probing the page
> > for each set bit to see if the page is still free. As such the performance
> > regression seen gets worse the lower the order used for reporting.
> > 
> 
> What confused me quite a lot is that this is enabled at compile time
> and then incurs a performance hit whether or not a hypervisor that
> even cares is involved. So I don't think the performance angle
> justifying this approach is a good one because this implementation has
> issues of its own. Previously I said

We are adding code. There is always going to be some performance impact.
Even if we had it implemented in the arch alloc/free code there would be
an impact as there is code there that would have to be jumped over. One
reason why I have been focused on the possible regression is that patch 2
in the series, which only replaced free_area with zone and order, was
supposedly having an impact due to code being shifted around.

> 	I worry that poking too much into the internal state of the
> 	allocator will be fragile long-term. There are the arch alloc/free
> 	hooks but they are typically about protections only and do not
> 	interfere with the internal state of the allocator. Compaction
> 	pokes in as well but once the page is off the free list, the page
> 	allocator no longer cares so again there is no interference with
> 	the internal state. If the state is interfered with externally,
> 	it becomes unclear what happens if things like page merging are
> 	deferred in a way the allocator cannot control as high-order
> 	allocation requests may fail for example.
> 
> Adding an API for compaction does not get away from the problem that
> it'll be fragile to depend on the internal state of the allocator for
> correctness. Existing users that poke into the state do so as an
> optimistic shortcut but if it fails, nothing actually breaks. The free
> list reporting stuff might break, and it will not be routinely tested.

I view what I am doing as not being too different from that. I am only
maintaining state when I can. I understand that it makes things more
fragile as we have more routes where things could go wrong, but isn't that
the case with adding any interface? I have simplified this about as far as
it can go.

All I am tracking is basically a pointer into the freelists so that we can
remember where we left off, and adding a flag indicating that those
pointers are there. If the flag is cleared we stop using the pointers. We
can also just reset to the list_head when an individual freelist is
manipulated since the list_head itself will never actually go anywhere.
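
A sketch of that lookup, with the flag, array and helper names invented for
illustration (not the actual patch code):

#include <linux/mmzone.h>
#include <linux/bitops.h>

#define ZONE_PAGE_REPORTING_ACTIVE	0	/* invented flag bit */

/* Invented per-(order, migratetype) boundary pointers for the zone
 * currently being reported. */
static struct list_head *reported_boundary[MAX_ORDER][MIGRATE_TYPES];

static struct list_head *
get_unreported_tail(struct zone *zone, unsigned int order, int mt)
{
	/* The boundaries are only valid while reporting is active... */
	if (test_bit(ZONE_PAGE_REPORTING_ACTIVE, &zone->flags))
		return reported_boundary[order][mt];

	/* ...otherwise fall back to the list_head, which never moves. */
	return &zone->free_area[order].free_list[mt];
}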

> Take compaction as an example, the initial implementation of it was dumb
> as rocks and only started maintaining additional state and later poking
> into the page allocator when there was empirical evidence it was necessary.

The issue is that we are trying to do this iteratively. As such I need to
store some sort of state so that once I go off to report the small bunch
of pages I have collected I have some way to quickly get back to where I
was. Without doing that I burn a large number of cycles having to rewalk
the list.

That is why I rewrote this so that we can have the list get completely
shuffled and we don't care as long as our iterator is reset to the
list_head, or the flag indicating that the iterators are active is
cleared.

> The initial implementation of page reporting should be the same, it
> should do no harm at all to users that don't care (hiding behind
> kconfig is not good enough, use static branches) and it should not
> depend on the internal state of the page allocator and ordering of free
> lists for correctness until it's shown it's absolutely necessary.

Perhaps I didn't make it clear, the "Patches applied" I called out in my
cover page was with the kconfig option enabled. The only thing disabled
was that the virtio-balloon did not have the feature enabled in the
hypervisor as I had turned it off in QEMU:

  Test                page_fault1 (THP)     page_fault2
  Baseline         1  1209281.00  +/-0.47%   411314.00  +/-0.42%
                  16  8804587.33  +/-1.80%  3419453.00  +/-1.80%

  Patches applied  1  1209369.67  +/-0.06%   412187.00  +/-0.10%
                  16  8812606.33  +/-0.06%  3435339.33  +/-1.82%

> You say that the zone lock has to be taken in the alternative
> implementation to check if it's still free and sure, that would cost
> but unless you are isolating that page immediately then you are racing
> once the lock is released. If you are isolating immediately, then isolate
> pages in batches to amortise the lock costs.  The details of this could
> be really hard but this approach is essentially saying "everything,
> everywhere should take a small hit so the overhead is not noticeable for
> virtio users" which is a curious choice for a new feature.

The point I am getting at is lock amortization. The alternate approach
cannot really pull it off because it takes the lock, checks for buddy, if
it is present it isolates it, releases the lock, and then goes onto the
next bit. I would think of it as more of a hunt and peck approach to page
reporting as it doesn't know if it is operating on stale data or not since
there is logic to set the bit but nothing to clear it.

With my approach we take the lock and then immediately start reporting
pages in batches. The iterator allows us to jump back to where we were in
the list and resume so that we process the entire zone to completion.

> Regardless of the details of any implementation, the first one should be
> basic, do no harm and be relatively simple giving just a bare interface
> to virtio/qemu/etc. Then optimise it until such point as there is no
> chance but to poke into the core.

I tried doing this without the iterator/boundary pointers. Without them I
was running about 20% slower than I am now. That was still about 10%
faster than the alternative approach was with the same page order limit,
however I didn't really feel it was appropriate for us to be running that
much slower with this feature enabled. The code itself is pretty simple.
We are essentially quarantining the pages within the freelist and
reporting them in batches until they are all quarantined. With any other
approach doing it iteratively would be challenging as the last iteration
would require walking all of memory to find the straggler pages.

I was measuring the overhead for us just doing the reporting by disabling
the madvise but leaving the feature enabled in the hypervisor. The output
of that was:

  Test                page_fault1 (THP)     page_fault2
  Patches enabled  1  1209104.67  +/-0.11%   413067.67  +/-0.43%
   MADV disabled  16  8835481.67  +/-0.29%  3463485.67  +/-0.50%

So the reporting of pages itself has little effect. The part that had
a noticeable effect was when we actually started performing the madvise.
That pulled us down in the THP case but still didn't generate enough
overhead to show up in the standard 4K pages case:

  Test                page_fault1 (THP)     page_fault2
  Patches enabled  1  1210367.67  +/-0.58%   416962.00  +/-0.14%
                  16  8433236.00  +/-0.58%  3437897.67  +/-0.34%

I do plan to iterate on this once it is accepted as I figure there will be
further things we can do to optimize it. I don't assume that this is the
final solution. However in my mind this is a good starting point for
things as the reporting itself has little impact until we actually start
dropping the pages from the hypervisor and forcing them to be faulted back
into the guest.

Thanks.

- Alex




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-07  0:20             ` Dave Hansen
@ 2019-11-07  0:52               ` David Hildenbrand
  2019-11-07 17:12                 ` Dave Hansen
  0 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2019-11-07  0:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Alexander Duyck, Michal Hocko, akpm, aarcange, dan.j.williams,
	konrad.wilk, lcapitulino, mgorman, mm-commits, mst, osalvador,
	pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm



> On 07.11.2019 at 01:20, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 11/6/19 3:33 PM, David Hildenbrand wrote:
>> We want to discuss *design* and *architecture* here, not
>> *implementation details*.
> 
> Could folks remind me what they see as the high-level design delta
> between these two approaches?
> 
> It seems to me like Alex's approach tries to minimize new metadata, but
> imposes some rules on the internals of the allocator.  Nitesh's is more
> disconnected from the allocator, but has the increased complexity of
> managing some new disjoint metadata.
> 
> Right?

Very right AFAIKS. It’s a matter of where to store new metadata and how to
stop the buddy from reusing pages that are currently getting reported.  The
latter does not spill virt specific optimization stuff into the core buddy,
which is why that information has to be tracked differently.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-07  0:20             ` Alexander Duyck
@ 2019-11-07 10:20               ` Mel Gorman
  2019-11-07 16:07                 ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2019-11-07 10:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michal Hocko, David Hildenbrand, akpm, aarcange, dan.j.williams,
	dave.hansen, konrad.wilk, lcapitulino, mm-commits, mst,
	osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm

On Wed, Nov 06, 2019 at 04:20:56PM -0800, Alexander Duyck wrote:
> > > I get that. But v10 was posted in mid September. Back then we had a
> > > discussion about addressing what Mel had mentioned and I had mentioned
> > > then that I had addressed it by allowing compaction to essentially reset
> > > the reporter to get it out of the list so compaction could do this split
> > > and splice tumbling logic.
> > > 
> > 
> > At the time I said "While in theory that could be addressed by always
> > going through an interface maintained by the page allocator, it would be
> > tricky to test the virtio case in particular."
> > 
> > Now, I get that you added an interface for that *but* if compaction was
> > ever updated or yet another approach was taken to deal with it, virtio
> > could get broken. If the page allocator itself has a bug or compaction
> > has a bug, the effect is very visible. While stuff does slip through, it
> > tends to be an obscure corner case. However, when virtio gets broken,
> > it'll be a long time before we get it.
> 
> Specifically all we are doing is walking through a list using a pointer as
> an iterator.  I would think we would likely see the page reporting blow up
> pretty quickly if we were to somehow mess up the list so bad that it still
> had access to a page that was no longer on the list. Other than that if
> the list is just shuffled without resetting the pointer then worst case
> would be that we end up with the reporting being rearmed as soon as we
> were complete due to a batch of unreported pages being shuffled past the
> iterator.
> 

And what I'm saying is that the initial version should have focused
exclusively on the mechanism to report free pages and kept the search as
simple as possible with optimisations on top. By all means track pages
that are already reported and skip them because that is a relatively basic
operation with the caveat that even the page_reported checks should be
behind a static branch if there is no virtio balloon driver loaded.
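
Something along these lines, for example (purely illustrative; the key and
helper names are made up):

/*
 * Sketch: the reported-page test costs nothing unless a consumer such as
 * the virtio-balloon driver has actually registered for reporting.
 */
DEFINE_STATIC_KEY_FALSE(page_reporting_enabled);

static inline bool page_needs_reporting(struct page *page)
{
        if (!static_branch_unlikely(&page_reporting_enabled))
                return false;
        return !PageReported(page);
}

/* Flipped once, when the reporting device registers itself. */
void page_reporting_register(void)
{
        static_branch_enable(&page_reporting_enabled);
}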

The page reporting to a hypervisor is done out of band from any application
so even if it takes a little longer to do that reporting, is there an
impact really? If so, how much and what frequency is this impact incurred?
The primary impact I can see is that free pages are invisible while the
notification takes place so the window for that may be wider if it takes
longer to find enough pages to send in batch but there will always be a
window where the free pages are invisible.

> > <SNIP>
> >
> > What confused me quite a lot is that this is enabled at compile time
> > and then incurs a performance hit whether or not there is a hypervisor
> > involved that even cares. So I don't think the performance angle
> > justifying this approach is a good one because this implementation has
> > issues of its own. Previously I said
> 
> We are adding code. There is always going to be some performance impact.

And that impact should be negligible in so far as possible, parts of
it are protected by branches but as it stands, just building the virtio
balloon driver enables all of this whether the running system is using
virtual machines or not.

> > Adding an API for compaction does not get away from the problem that
> > it'll be fragile to depend on the internal state of the allocator for
> > correctness. Existing users that poke into the state do so as an
> > optimistic shortcut but if it fails, nothing actually breaks. The free
> > list reporting stuff might break and will not be routinely tested.
> 
> I view what I am doing as not being too different from that. I am only
> maintaining state when I can. I understand that it makes things more
> fragile as we have more routes where things could go wrong, but isn't that
> the case with adding any interface. I have simplified this about as far as
> it can go.
> 

The enhanced tracking of state is not necessary initially to simply report
free pages to the hypervisor. 

> All I am tracking is basically a pointer into the freelists so that we can
> remember where we left off, and adding a flag indicating that those
> pointers are there. If the flag is cleared we stop using the pointers. We
> can also just reset to the list_head when an individual freelist is
> manipulated since the list_head itself will never actually go anywhere.
> 
> > Take compaction as an example, the initial implementation of it was dumb
> > as rocks and only started maintaining additional state and later poking
> > into the page allocator when there was empirical evidence it was necessary.
> 
> The issue is that we are trying to do this iteratively. As such I need to
> store some sort of state so that once I go off to report the small bunch
> of pages I have collected I have some way to quickly get back to where I
> was. Without doing that I burn a large number of cycles having to rewalk
> the list.
> 
> That is why I rewrote this so that we can have the list get completely
> shuffled and we don't care as long as our iterator is reset to the
> list_head, or the flag indicating that the iterators are active is
> cleared.
> 

And again, it's not clear that the additional complexity is required.
Specifically, it'll be very hard to tell if the state tracking is
actually helping because excessive list shuffling due to compaction may
mean that a lot of state is being tracked while the full free list ends
up having to be searched anyway.

As it stands, the optimisations are hard-wired into the providing of
the feature itself making it an all or nothing approach and no option to
"merge a bare-bones mechanism that is behind static branches and build
optimisations on top". At least that way, any competing approach could
be based on optimisations alone while the feature is still available.

It also doesn't help that reviewing the series takes a lot of jumping
around. This patch has a kfree of a structure that is not allocated
until a later patch for example.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-07 10:20               ` Mel Gorman
@ 2019-11-07 16:07                 ` Alexander Duyck
  2019-11-08  9:43                   ` Mel Gorman
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2019-11-07 16:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, David Hildenbrand, akpm, aarcange, dan.j.williams,
	dave.hansen, konrad.wilk, lcapitulino, mm-commits, mst,
	osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm

On Thu, 2019-11-07 at 10:20 +0000, Mel Gorman wrote:
> On Wed, Nov 06, 2019 at 04:20:56PM -0800, Alexander Duyck wrote:
> > > > I get that. But v10 was posted in mid September. Back then we had a
> > > > discussion about addressing what Mel had mentioned and I had mentioned
> > > > then that I had addressed it by allowing compaction to essentially reset
> > > > the reporter to get it out of the list so compaction could do this split
> > > > and splice tumbling logic.
> > > > 
> > > 
> > > At the time I said "While in theory that could be addressed by always
> > > going through an interface maintained by the page allocator, it would be
> > > tricky to test the virtio case in particular."
> > > 
> > > Now, I get that you added an interface for that *but* if compaction was
> > > ever updated or yet another approach was taken to deal with it, virtio
> > > could get broken. If the page allocator itself has a bug or compaction
> > > has a bug, the effect is very visible. While stuff does slip through, it
> > > tends to be an obscure corner case. However, when virtio gets broken,
> > > it'll be a long time before we get it.
> > 
> > Specifically all we are doing is walking through a list using a pointer as
> > an iterator.  I would think we would likely see the page reporting blow up
> > pretty quickly if we were to somehow mess up the list so bad that it still
> > had access to a page that was no longer on the list. Other than that if
> > the list is just shuffled without resetting the pointer then worst case
> > would be that we end up with the reporting being rearmed as soon as we
> > were complete due to a batch of unreported pages being shuffled past the
> > iterator.
> > 
> 
> And what I'm saying is that the initial version should have focused
> exclusively on the mechanism to report free pages and kept the search as
> simple as possible with optimisations on top. By all means track pages
> that are already reported and skip them because that is a relatively basic
> operation with the caveat that even the page_reported checks should be
> behind a static branch if there is no virtio balloon driver loaded.

I can split out the list manipulation logic and put it behind a static
branch, however I cannot do that with page_reported. The issue is that
with keeping any state in the page we need to clear it when the page is
allocated. I could do the static branch approach if we were to clear the
bit for all pages, but I thought it better to just do a test and clear in
order to avoid an unnecessary read-modify-write.
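
In other words, the allocation path currently does the equivalent of the
first snippet below, while putting the test behind a static branch would
force something like the second (simplified, using the helpers this series
adds):

/*
 * Test first: pages that were never reported do not pay for a
 * read-modify-write of the flags word on every allocation.
 */
if (PageReported(page))
        __ClearPageReported(page);

/*
 * Versus clearing unconditionally, which would let the test itself hide
 * behind a static branch but adds a write even when the bit is clear.
 */
__ClearPageReported(page);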

> The page reporting to a hypervisor is done out of band from any application
> so even if it takes a little longer to do that reporting, is there an
> impact really? If so, how much and what frequency is this impact incurred?

My main concern has been time with the zone lock held in order to extract
the page from the list. Any time while we are holding the zone lock is
time when another CPU cannot free or allocate a page from that zone.
Without the pointers I have to walk the free lists until I find a non-
reported page in the zone. That is the overhead I am avoiding by
maintaining the pointers since we have to walk the free lists for every
batch of pages we want to pull.

> The primary impact I can see is that free pages are invisible while the
> notification takes place so the window for that may be wider if it takes
> longer to find enough pages to send in batch but there will always be a
> window where the free pages are invisible.

You are correct. There is a window where the pages will be invisible. The
hope is that the impact of that should be fairly small since we only pull
a few pages out at a time. In addition we will not pull past the watermark
since the pages are pulled via __isolate_free_page.

> > > <SNIP>
> > > 
> > > What confused me quite a lot is that this is enabled at compile time
> > > and then incurs a performance hit whether or not there is a hypervisor
> > > involved that even cares. So I don't think the performance angle
> > > justifying this approach is a good one because this implementation has
> > > issues of its own. Previously I said
> > 
> > We are adding code. There is always going to be some performance impact.
> 
> And that impact should be negligible in so far as possible, parts of
> it are protected by branches but as it stands, just building the virtio
> balloon driver enables all of this whether the running system is using
> virtual machines or not.

Agreed the code is present, however that would be the case with a static
branch as well. I tried to use static branches where I thought it made
sense and to just use the page_reported test for the cases where I
couldn't. I can probably add the static branch to a few more spots however
I am not sure it adds much value when it is already behind a page_reported
check.

> > > Adding an API for compaction does not get away from the problem that
> > > it'll be fragile to depend on the internal state of the allocator for
> > > correctness. Existing users that poke into the state do so as an
> > > optimistic shortcut but if it fails, nothing actually breaks. The free
> > > list reporting stuff might break and will not be routinely tested.
> > 
> > I view what I am doing as not being too different from that. I am only
> > maintaining state when I can. I understand that it makes things more
> > fragile as we have more routes where things could go wrong, but isn't that
> > the case with adding any interface. I have simplified this about as far as
> > it can go.
> > 
> 
> The enhanced tracking of state is not necessary initially to simply report
> free pages to the hypervisor. 

Define enhanced tracking? If you are talking about the PageReported bit
then I would disagree. I need to have some way to indicate whether the free
page has been reported or not if I am going to make forward progress while
performing the reporting in iterations. If you are talking about the list
manipulation and boundary pointers then I agree. However there is a pretty
significant performance impact for not having those as we have to rewalk
the free lists for every batch of pages we pull.

> > All I am tracking is basically a pointer into the freelists so that we can
> > remember where we left off, and adding a flag indicating that those
> > pointers are there. If the flag is cleared we stop using the pointers. We
> > can also just reset to the list_head when an individual freelist is
> > manipulated since the list_head itself will never actually go anywhere.
> > 
> > > Take compaction as an example, the initial implementation of it was dumb
> > > as rocks and only started maintaining additional state and later poking
> > > into the page allocator when there was empirical evidence it was necessary.
> > 
> > The issue is that we are trying to do this iteratively. As such I need to
> > store some sort of state so that once I go off to report the small bunch
> > of pages I have collected I have some way to quickly get back to where I
> > was. Without doing that I burn a large number of cycles having to rewalk
> > the list.
> > 
> > That is why I rewrote this so that we can have the list get completely
> > shuffled and we don't care as long as our iterator is reset to the
> > list_head, or the flag indicating that the iterators are active is
> > cleared.
> > 
> 
> And again, it's not clear that the additional complexity is required.
> Specifically, it'll be very hard to tell if the state tracking is
> actually helping because excessive list shuffling due to compaction may
> mean that a lot of state is being tracked while the full free list ends
> up having to be searched anyway.
> 
> As it stands, the optimisations are hard-wired into the providing of
> the feature itself making it an all or nothing approach and no option to
> "merge a bare-bones mechanism that is behind static branches and build
> optimisations on top". At least that way, any competing approach could
> be based on optimisations alone while the feature is still available.

If needed I can split another patch out of patches 3 and 4 that does the
list manipulation. It shouldn't take all that long to do as I have already
done it once for testing to get the 20% regression number. Also as I
mentioned I can put the list manipulation behind a static key, but cannot
do so for the actual testing and clearing of the page reporting bit.

> It also doesn't help that reviewing the series takes a lot of jumping
> around. This patch has a kfree of a structure that is not allocated
> until a later patch for example.

Sorry about that. Patches 3 and 4 were one patch until I split them. I
can fix that so that the page_reporting_reset_zone is either moved to
patch 4 where we actually perform the allocation, or I will move the
allocation to patch 3.




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-07  0:52               ` David Hildenbrand
@ 2019-11-07 17:12                 ` Dave Hansen
  2019-11-07 17:46                   ` Michal Hocko
  2019-11-07 18:46                   ` Qian Cai
  0 siblings, 2 replies; 47+ messages in thread
From: Dave Hansen @ 2019-11-07 17:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Duyck, Michal Hocko, akpm, aarcange, dan.j.williams,
	konrad.wilk, lcapitulino, mgorman, mm-commits, mst, osalvador,
	pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm

On 11/6/19 4:52 PM, David Hildenbrand wrote:
>> It seems to me like Alex's approach tries to minimize new metadata, but
>> imposes some rules on the internals of the allocator.  Nitesh's is more
>> disconnected from the allocator, but has the increased complexity of
>> managing some new disjoint metadata.
>>
>> Right?
> Very right AFAIKS. It’s a matter of where to store new metadata and
> how to stop the buddy from reusing pages that are currently getting
> reported.  The latter does not spill virt specific optimization stuff
> into the core buddy, which is why that information has to be tracked
> differently.

I went back and looked at how much is getting spilled into the core
buddy.  The refactoring (patches 1-2) looks pretty neutral in terms of
complexity.  I think it actually makes things look a wee bit nicer.

Patch 3 does, indeed, poke into the core of the allocator.  There are
really six new sites that need to be poked:

	2 sites in for compaction.c
	1 site in memory_hotplug.c
	3 sites in page_alloc.c

Plus an extra function call argument and the new free_reported_page() in
page_alloc.c.  That's not _great_, of course, but it's not a deal
breaker for me; then again, maybe I've seen too many buddy atrocities over the
years and I'm numb to the pain. :)

Nitesh's patches seem to do this with a single poke at the allocator.
The downside is that since there are new data structures to manage, it
takes more code to create and use them since the buddy structures don't
get reused.

Both approaches seem totally reasonable to me.  I don't like Alex's
pokes into the core of the allocator, but I also don't like Nitesh's
extra data structures and the (yet unimplemented) code that it will take
to make them handle all the things they need to like hotplug.

There's also a common kernel/hypervisor ABI that's *SHARED* between the
two approaches.  That's really the only thing that would marry us to one
approach versus the other forever.

Am I missing something?  What's the harm in merging the implementation
that's ready today?  If a better one comes along, we'll rip out the ~15
lines of code in page_alloc.c and merge the better one.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-07 17:12                 ` Dave Hansen
@ 2019-11-07 17:46                   ` Michal Hocko
  2019-11-07 18:08                     ` Dave Hansen
  2019-11-07 18:12                     ` Alexander Duyck
  2019-11-07 18:46                   ` Qian Cai
  1 sibling, 2 replies; 47+ messages in thread
From: Michal Hocko @ 2019-11-07 17:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David Hildenbrand, Alexander Duyck, akpm, aarcange,
	dan.j.williams, konrad.wilk, lcapitulino, mgorman, mm-commits,
	mst, osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang,
	willy, yang.zhang.wz, linux-mm

On Thu 07-11-19 09:12:01, Dave Hansen wrote:
[...]
> Both approaches seem totally reasonable to me.  I don't like Alex's
> pokes into the core of the allocator, but I also don't like Nitesh's
> extra data structures and the (yet unimplemented) code that it will take
> to make them handle all the things they need to like hotplug.

Well, I would argue that an extra data structure belongs to the user who
is actually using it. Compare that with metadata that is de facto
maintained at the allocator level, which has no real knowledge about the end
user. This is an inherently fragile design as Mel points out in his
earlier email in this thread. I can live with that if that is _really_
necessary but I would rather not go that way if there is another way.

> There's also a common kernel/hypervisor ABI that's *SHARED* between the
> two approaches.  That's really the only thing that would marry us to one
> approach versus the other forever.
> 
> Am I missing something?  What's the harm in merging the implementation
> that's ready today?  If a better one comes along, we'll rip out the ~15
> lines of code in page_alloc.c and merge the better one.

I have asked several times why there is such a push and received no
answer but "this is taking too long" which I honestly do not care much.
Especially when other virt people tend to agree that there is no need to
rush here.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-06 23:33           ` David Hildenbrand
  2019-11-07  0:20             ` Dave Hansen
@ 2019-11-07 18:02             ` Alexander Duyck
  2019-11-07 19:37               ` Nitesh Narayan Lal
  2019-11-07 22:43               ` David Hildenbrand
  1 sibling, 2 replies; 47+ messages in thread
From: Alexander Duyck @ 2019-11-07 18:02 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

On Thu, 2019-11-07 at 00:33 +0100, David Hildenbrand wrote:
> On 06.11.19 18:48, Alexander Duyck wrote:
> > On Wed, 2019-11-06 at 17:54 +0100, Michal Hocko wrote:
> > > On Wed 06-11-19 08:35:43, Alexander Duyck wrote:
> > > > On Wed, 2019-11-06 at 15:09 +0100, David Hildenbrand wrote:
> > > > > > > On 06.11.2019 at 13:16, Michal Hocko <mhocko@kernel.org> wrote:
> > > > > > 
> > > > > > I didn't have time to read through newer versions of this patch series
> > > > > > but I remember there were concerns about this functionality being pulled
> > > > > > into the page allocator previously both by me and Mel [1][2]. Have those been
> > > > > > addressed? I do not see an ack from Mel or any other MM people. Is there
> > > > > > really a consensus that we want something like that living in the
> > > > > > allocator?
> > > > > 
> > > > > I don't think there is. The discussion is still ongoing (although quiet,
> > > > > Nitesh is working on a new version AFAIK). I think we should not rush
> > > > > this.
> > > > 
> > > > How much time is needed to get a review? I waited 2 weeks since posting
> > > > v12 and the only comments I got on the code were from Andrew. Most of this
> > > > hasn't changed much since v10 and that was posted back in mid September. I
> > > > have been down to making small tweaks here and there and haven't had any
> > > > real critiques on the approach since Mel had the comments about conflicts
> > > > with compaction which I addressed by allowing compaction to punt the
> > > > reporter out so that it could split and splice the lists as it walked
> > > > through them.
> > > 
> > > Well, people are busy and MM community is not a large one. I cannot
> > > really help you much other than keep poking those people and give
> > > reasonable arguments so they decide to ack your patch.
> > 
> > I get that. But v10 was posted in mid September. Back then we had a
> > discussion about addressing what Mel had mentioned and I had mentioned
> > then that I had addressed it by allowing compaction to essentially reset
> > the reporter to get it out of the list so compaction could do this split
> > and splice tumbling logic.
> > 
> > > I definitely do not intend to nack this work, I just have maintainability
> > > concerns and considering there is an alternative approach that does not
> > > require to touch page allocator internals and which we need to compare
> > > against then I do not really think there is any need to push something
> > > in right away. Or is there any pressing reason to have this merged right
> > > now?
> > 
> > The alternative approach doesn't touch the page allocator, however it
> > still has essentially the same changes to __free_one_page. I suspect the
> 
> Nitesh is working on Michal's suggestion to use page isolation instead
> AFAIK - which avoids this.

Okay. However it makes it much harder to discuss when we are comparing
against code that isn't public. If the design is being redone do we have
any ETA for when we will have something to actually compare to?

> > performance issue seen is mostly due to the fact that because it doesn't
> > touch the page allocator it is taking the zone lock and probing the page
> > for each set bit to see if the page is still free. As such the performance
> > regression seen gets worse the lower the order used for reporting.
> > 
> > Also I suspect Nitesh's patches are also in need of further review. I have
> > provided feedback however my focus ended up being on more the kernel
> > panics and 30% performance regression rather than debating architecture.
> 
> Please don't take this personally, but I really dislike you talking about
> Nitesh's RFCs (!) and pushing for your approach (although it was you that
> was late to the party!) in that way. If there are problems then please 
> collaborate and fix instead of using the same wrong arguments over and 
> over again.

Since Nitesh is in the middle of doing a full rewrite anyway I don't have
much to compare against except for the previous set, which still needs
fixes.  It is why I mentioned in the cover of the last patch set that I
would prefer to not discuss it since I have no visibility into the patch
set he is now working on.

> a) hotplug/sparse zones: I explained a couple of times why we can ignore 
> that. There was never a reply from you, yet you keep coming up with 
> that. I don't enjoy talking to a wall.

This gets to the heart of how Nitesh's patch set works. It is assuming
that every zone is linear, that there will be no overlap between zones,
and that the zones don't really change. These are key architectural
assumptions that should really be discussed instead of simply dismissed.

I guess part of the difference between us is that I am looking for
something that is production ready and not a proof of concept. It sounds
like you would prefer this work stays in a proof of concept stage for some
time longer.

> b) Locking optimizations: Come on, these are premature optimizations and 
> nothing to dictate your design. *nobody* but you cares about that in an 
> initial approach we get upstream. We can always optimize that.

My concern isn't so much the locking as the fact that it is the hunt and
peck approach through a bitmap that will become increasingly more stale as
you are processing the data. Every bit you have to test for requires
taking a zone lock and then probing to see if the page is still free and
the right size. My concern is how much time is going to be spent with the
zone lock held while other CPUs are waiting on access.

> c) Kernel panics: Come on, we are talking about complicated RFCs here 
> with moving design decisions. We want do discuss *design* and 
> *architecture* here, not *implementation details*.

Then why ask me to compare performance against it? You were the one
pushing for me to test it, not me. If you and Nitesh knew the design
wasn't complete enough to run it why ask me to test it?

Many of the kernel panics for the patch sets in the past have been related
to fundamental architectural issues. For example ignoring things like
NUMA, mangling the free_list by accessing it with the wrong locks held,
etc.

> d) Performance: We want to see a design that fits into the whole 
> architecture cleanly, is maintainable, and provides a benefit. Of 
> course, performance is relevant, but it certainly should not dictate our 
> design of a *virtualization specific optimization feature*. Performance 
> is not everything, otherwise please feel free and rewrite the kernel in 
> ASM and claim it is better because it is faster.

I agree performance is not everything. But when a system grinds down to
60% of what it was originally I find that significant.

> Again, I do value your review and feedback, but I absolutely do not 
> enjoy the way you are trying to push your series here, sorry.

Well I am a bit frustrated as I have had to provide a significant amount
of feedback on Nitesh's patches, and in spite of that I feel like I am
getting nothing in return. I have pointed out the various issues and
opportunities to address the issues. At this point there are sections of
his code that are directly copied from mine[1]. I have done everything I
can to help the patches along but it seems like they aren't getting out of
RFC or proof-of-concept state any time soon. So with that being the case
why not consider his patch set as something that could end up being a
follow-on/refactor instead of an alternative to mine?

> Yes, if we end up finding out that there is real value in your approach, 
> nothing speaks against considering it. But please don't try to hurry and 
> push your series in that way. Please give everybody time to evaluate.

I would love to argue this patch set on the merits. However I really don't
feel like I am getting a fair comparison here, at least from you. Every
other reply on the thread seems to be from you trying to reinforce any
criticism and taking the opportunity to mention that there is another
solution out there. It is fine to fight for your own idea, but at least
let me reply to the criticisms of my own patchset before you pile on. I
would really prefer to discuss Nitesh's patches on a posting of his patch
set rather than here.

[1]: https://lore.kernel.org/lkml/101649ae-58d4-76ee-91f3-42ac1c145c46@redhat.com/



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-07 17:46                   ` Michal Hocko
@ 2019-11-07 18:08                     ` Dave Hansen
  2019-11-07 18:12                     ` Alexander Duyck
  1 sibling, 0 replies; 47+ messages in thread
From: Dave Hansen @ 2019-11-07 18:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Alexander Duyck, akpm, aarcange,
	dan.j.williams, konrad.wilk, lcapitulino, mgorman, mm-commits,
	mst, osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang,
	willy, yang.zhang.wz, linux-mm

On 11/7/19 9:46 AM, Michal Hocko wrote:
> On Thu 07-11-19 09:12:01, Dave Hansen wrote:
> [...]
>> Both approaches seem totally reasonable to me.  I don't like Alex's
>> pokes into the core of the allocator, but I also don't like Nitesh's
>> extra data structures and the (yet unimplemented) code that it will take
>> to make them handle all the things they need to like hotplug.
> 
> Well, I would argue that an extra data structure belongs to the user who
> is actually using it. Compare that with metadata that is de facto
> maintained at the allocator level, which has no real knowledge about the end
> user. This is an inherently fragile design as Mel points out in his
> earlier email in this thread. I can live with that if that is _really_
> necessary but I would rather not go that way if there is another way.

Yeah, I share some of the fragility concern.  But, in the end, I was really
happy with how the solution behaves when things break.  With compaction, for
instance, it seems just fine to throw all the state out the window and
start over.

While it might be fragile, it's also rather tolerant once broken, which
is a nice property to pair with fragility.

>> There's also a common kernel/hypervisor ABI that's *SHARED* between the
>> two approaches.  That's really the only thing that would marry us to one
>> approach versus the other forever.
>>
>> Am I missing something?  What's the harm in merging the implementation
>> that's ready today?  If a better one comes along, we'll rip out the ~15
>> lines of code in page_alloc.c and merge the better one.
> 
> I have asked several times why there is such a push and received no
> answer but "this is taking too long" which I honestly do not care much.
> Especially when other virt people tend to agree that there is no need to
> rush here.

I _think_ the ask here is for Alex to try to do this with _some_ data
structures outside the core allocator's, not necessarily to take
Nitesh's approach and get it fully functional.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-07 17:46                   ` Michal Hocko
  2019-11-07 18:08                     ` Dave Hansen
@ 2019-11-07 18:12                     ` Alexander Duyck
  2019-11-08  9:57                       ` Michal Hocko
  1 sibling, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2019-11-07 18:12 UTC (permalink / raw)
  To: Michal Hocko, Dave Hansen
  Cc: David Hildenbrand, akpm, aarcange, dan.j.williams, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

On Thu, 2019-11-07 at 18:46 +0100, Michal Hocko wrote:
> On Thu 07-11-19 09:12:01, Dave Hansen wrote:
> [...]
> > Both approaches seem totally reasonable to me.  I don't like Alex's
> > pokes into the core of the allocator, but I also don't like Nitesh's
> > extra data structures and the (yet unimplemented) code that it will take
> > to make them handle all the things they need to like hotplug.
> 
> Well, I would argue that an extra data structure belongs to the user who
> is actually using it. Compare that with metadata that is de facto
> maintained at the allocator level, which has no real knowledge about the end
> user. This is an inherently fragile design as Mel points out in his
> earlier email in this thread. I can live with that if that is _really_
> necessary but I would rather not go that way if there is another way.

So to address things I am going to split out the list manipulation into a
separate patch as Mel suggested. So in terms of metadata that will leave
us with 3 pieces: the PageReported bit, the reported_pages statistics, and
the boundary pointers. I will add each one in a separate patch and track
the effect of each as I go. 

> > There's also a common kernel/hypervisor ABI that's *SHARED* between the
> > two approaches.  That's really the only thing that would marry us to one
> > approach versus the other forever.
> > 
> > Am I missing something?  What's the harm in merging the implementation
> > that's ready today?  If a better one comes along, we'll rip out the ~15
> > lines of code in page_alloc.c and merge the better one.
> 
> I have asked several times why there is such a push and received no
> answer but "this is taking too long" which I honestly do not care much.
> Especially when other virt people tend to agree that there is no need to
> rush here.

Part of the rush, at least from my perspective, is that I don't have
indefinite time to work on this. I am sure you are aware that maintaining
an external patch set can be a real chore and I would prefer to have it
merged and then maintain it as a part of the tree. Then other changes can
be rebased on it instead of having to rebase it around other changes that
are going on.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-07 17:12                 ` Dave Hansen
  2019-11-07 17:46                   ` Michal Hocko
@ 2019-11-07 18:46                   ` Qian Cai
  1 sibling, 0 replies; 47+ messages in thread
From: Qian Cai @ 2019-11-07 18:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David Hildenbrand, Alexander Duyck, Michal Hocko, akpm, aarcange,
	dan.j.williams, konrad.wilk, lcapitulino, mgorman, mm-commits,
	mst, osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang,
	willy, yang.zhang.wz, linux-mm



> On Nov 7, 2019, at 12:12 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> Am I missing something?  What's the harm in merging the implementation
> that's ready today?  If a better one comes along, we'll rip out the ~15
> lines of code in page_alloc.c and merge the better one.

No, it is increasingly hard to rip out any code in MM nowadays once merged
(look at the early SLUB debugging code that has existed unused for probably
10 years, yet the bar is high to justify removing it), so it is important to
take the time to get it right the first time by all means.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-07 18:02             ` Alexander Duyck
@ 2019-11-07 19:37               ` Nitesh Narayan Lal
  2019-11-07 22:46                 ` Alexander Duyck
  2019-11-07 22:43               ` David Hildenbrand
  1 sibling, 1 reply; 47+ messages in thread
From: Nitesh Narayan Lal @ 2019-11-07 19:37 UTC (permalink / raw)
  To: Alexander Duyck, David Hildenbrand, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm


On 11/7/19 1:02 PM, Alexander Duyck wrote:
> On Thu, 2019-11-07 at 00:33 +0100, David Hildenbrand wrote:
>> On 06.11.19 18:48, Alexander Duyck wrote:
>>> On Wed, 2019-11-06 at 17:54 +0100, Michal Hocko wrote:
>>>> On Wed 06-11-19 08:35:43, Alexander Duyck wrote:
>>>>> On Wed, 2019-11-06 at 15:09 +0100, David Hildenbrand wrote:
>>>>>>> On 06.11.2019 at 13:16, Michal Hocko <mhocko@kernel.org> wrote:
>>>>>>>
>>>>>>> I didn't have time to read through newer versions of this patch series
>>>>>>> but I remember there were concerns about this functionality being pulled
>>>>>>> into the page allocator previously both by me and Mel [1][2]. Have those been
>>>>>>> addressed? I do not see an ack from Mel or any other MM people. Is there
>>>>>>> really a consensus that we want something like that living in the
>>>>>>> allocator?
>>>>>> I don't think there is. The discussion is still ongoing (although quiet,
>>>>>> Nitesh is working on a new version AFAIK). I think we should not rush
>>>>>> this.
>>>>> How much time is needed to get a review? I waited 2 weeks since posting
>>>>> v12 and the only comments I got on the code were from Andrew. Most of this
>>>>> hasn't changed much since v10 and that was posted back in mid September. I
>>>>> have been down to making small tweaks here and there and haven't had any
>>>>> real critiques on the approach since Mel had the comments about conflicts
>>>>> with compaction which I addressed by allowing compaction to punt the
>>>>> reporter out so that it could split and splice the lists as it walked
>>>>> through them.
>>>> Well, people are busy and MM community is not a large one. I cannot
>>>> really help you much other than keep poking those people and give
>>>> reasonable arguments so they decide to ack your patch.
>>> I get that. But v10 was posted in mid September. Back then we had a
>>> discussion about addressing what Mel had mentioned and I had mentioned
>>> then that I had addressed it by allowing compaction to essentially reset
>>> the reporter to get it out of the list so compaction could do this split
>>> and splice tumbling logic.
>>>
>>>> I definitely do not intend to nack this work, I just have maintainability
>>>> concerns and considering there is an alternative approach that does not
>>>> require to touch page allocator internals and which we need to compare
>>>> against then I do not really think there is any need to push something
>>>> in right away. Or is there any pressing reason to have this merged right
>>>> now?
>>> The alternative approach doesn't touch the page allocator, however it
>>> still has essentially the same changes to __free_one_page. I suspect the
>> Nitesh is working on Michal's suggestion to use page isolation instead
>> AFAIK - which avoids this.
> Okay. However it makes it much harder to discuss when we are comparing
> against code that isn't public. If the design is being redone do we have
> any ETA for when we will have something to actually compare to?

If it had just been the design change, then giving a definite ETA would
have been possible.
However, I also have to fix the performance with (MAX_ORDER - 2). Unlike you,
I need some time to do that.
If I just post the code without fixing the performance there will again be an
unnecessary discussion about the same thing, which doesn't make any sense.


>
>>> performance issue seen is mostly due to the fact that because it doesn't
>>> touch the page allocator it is taking the zone lock and probing the page
>>> for each set bit to see if the page is still free. As such the performance
>>> regression seen gets worse the lower the order used for reporting.
>>>
>>> Also I suspect Nitesh's patches are also in need of further review. I have
>>> provided feedback however my focus ended up being on more the kernel
>>> panics and 30% performance regression rather than debating architecture.
>> Please don't take this personally, but I really dislike you talking about
>> Nitesh's RFCs (!) and pushing for your approach (although it was you that
>> was late to the party!) in that way. If there are problems then please 
>> collaborate and fix instead of using the same wrong arguments over and 
>> over again.
> Since Nitesh is in the middle of doing a full rewrite anyway I don't have
> much to compare against except for the previous set, which still needs
> fixes.  It is why I mentioned in the cover of the last patch set that I
> would prefer to not discuss it since I have no visibility into the patch
> set he is now working on.


Fair point.


>
>> a) hotplug/sparse zones: I explained a couple of times why we can ignore 
>> that. There was never a reply from you, yet you keep coming up with 
>> that. I don't enjoy talking to a wall.
> This gets to the heart of how Nitesh's patch set works. It is assuming
> that every zone is linear, that there will be no overlap between zones,
> and that the zones don't really change. These are key architectural
> assumptions that should really be discussed instead of simply dismissed.


They are not at all dismissed, they are just kept as a future action item.

>
> I guess part of the difference between us is that I am looking for
> something that is production ready and not a proof of concept. It sounds
> like you would prefer this work stays in a proof of concept stage for some
> time longer.

In my opinion, it is more about how many use-cases we want to target
initially.
With your patch-set, I agree we can cover more use-cases where the solution
will fit in.
However, my series might not be suitable for use-cases where
we have memory hotplug or memory restriction. (This will be the case
after I fix the issues in the series.)

>
>> b) Locking optimizations: Come on, these are premature optimizations and 
>> nothing to dictate your design. *nobody* but you cares about that in an 
>> initial approach we get upstream. We can always optimize that.
> My concern isn't so much the locking as the fact that it is the hunt and
> peck approach through a bitmap that will become increasingly more stale as
> you are processing the data. Every bit you have to test for requires
> taking a zone lock and then probing to see if the page is still free and
> the right size. My concern is how much time is going to be spent with the
> zone lock held while other CPUs are waiting on access.

This can be prevented (at least to an extent) by checking if the page is in
the buddy before acquiring the lock as I have suggested previously.
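
Roughly like this (a simplified sketch; REPORTING_MIN_ORDER and the
surrounding scan loop are hypothetical):

static bool try_report_pfn(struct zone *zone, unsigned long pfn)
{
        struct page *page = pfn_to_page(pfn);
        unsigned long flags;
        bool isolated = false;

        /*
         * Cheap, lockless early-out: if the page has already left the
         * buddy, don't take the zone lock for it at all.
         */
        if (!PageBuddy(page))
                return false;

        spin_lock_irqsave(&zone->lock, flags);
        /* Re-check under the lock; the test above was only a hint. */
        if (PageBuddy(page) && page_order(page) >= REPORTING_MIN_ORDER)
                isolated = __isolate_free_page(page, page_order(page));
        spin_unlock_irqrestore(&zone->lock, flags);

        return isolated;
}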

>
>> c) Kernel panics: Come on, we are talking about complicated RFCs here 
>> with moving design decisions. We want to discuss *design* and
>> *architecture* here, not *implementation details*.
> Then why ask me to compare performance against it? You were the one
> pushing for me to test it, not me. If you and Nitesh knew the design
> wasn't complete enough to run it why ask me to test it?
>
> Many of the kernel panics for the patch sets in the past have been related
> to fundamental architectural issues. For example ignoring things like
> NUMA, mangling the free_list by accessing it with the wrong locks held,
> etc.


Obviously we didn't know that earlier; with whatever tests I had tried, I
didn't see any issues with them.
Again, I am trying to learn from my mistakes and appreciate you helping
me out with that.

>
>> d) Performance: We want to see a design that fits into the whole 
>> architecture cleanly, is maintainable, and provides a benefit. Of 
>> course, performance is relevant, but it certainly should not dictate our 
>> design of a *virtualization specific optimization feature*. Performance 
>> is not everything, otherwise please feel free and rewrite the kernel in 
>> ASM and claim it is better because it is faster.
> I agree performance is not everything. But when a system grinds down to
> 60% of what it was originally I find that significant.

60%? In one of your previous emails you suggested that the drop was 30%.

>
>> Again, I do value your review and feedback, but I absolutely do not 
>> enjoy the way you are trying to push your series here, sorry.
> Well I am a bit frustrated as I have had to provide a significant amount
> of feedback on Nitesh's patches, and in spite of that I feel like I am
> getting nothing in return.

Not sure if I understood the meaning here. May I know what you were expecting?
I do try to review your series and share whatever I can.

>  I have pointed out the various issues and
> opportunities to address the issues. At this point there are sections of
> his code that are directly copied from mine[1]. 

I don't think there is any problem with learning from you or your code.
Is there?
As far as giving proper credit is concerned, that was my mistake and I intend
to correct it as I have already mentioned.

> I have done everything I
> can to help the patches along but it seems like they aren't getting out of
> RFC or proof-of-concept state any time soon. 

So one reason for that is definitely the issues you pointed out.
But also, since I started working on this project, I kept getting different
design suggestions, which in my opinion is excellent. However, adopting them
and making those changes might be easy for you, but I have to take my time to
properly understand them before implementing them.

> So with that being the case
> why not consider his patch set as something that could end up being a
> follow-on/refactor instead of an alternative to mine?
>

I have already mentioned that I would like to see the solution which is better
and has a consensus (it doesn't matter where it is coming from).

>> Yes, if we end up finding out that there is real value in your approach, 
>> nothing speaks against considering it. But please don't try to hurry and 
>> push your series in that way. Please give everybody time to evaluate.
> I would love to argue this patch set on the merits. However I really don't
> feel like I am getting a fair comparison here, at least from you. Every
> other reply on the thread seems to be from you trying to reinforce any
> criticism and taking the opportunity to mention that there is another
> solution out there. It is fine to fight for your own idea, but at least
> let me reply to the criticisms of my own patchset before you pile on. I
> would really prefer to discuss Nitesh's patches on a posting of his patch
> set rather than here.
>
> [1]: https://lore.kernel.org/lkml/101649ae-58d4-76ee-91f3-42ac1c145c46@redhat.com/
>
>
-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-07 18:02             ` Alexander Duyck
  2019-11-07 19:37               ` Nitesh Narayan Lal
@ 2019-11-07 22:43               ` David Hildenbrand
  2019-11-08  0:42                 ` Alexander Duyck
  1 sibling, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2019-11-07 22:43 UTC (permalink / raw)
  To: Alexander Duyck, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

[...]
>>> The alternative approach doesn't touch the page allocator, however it
>>> still has essentially the same changes to __free_one_page. I suspect the
>>
>> Nitesh is working on Michal's suggestion to use page isolation instead
>> AFAIK - which avoids this.
> 
> Okay. However it makes it much harder to discuss when we are comparing
> against code that isn't public. If the design is being redone do we have
> any ETA for when we will have something to actually compare to?

Maybe Nitesh got a little bit more careful with sending RFCs because he 
was getting negative vibes due to the prototype quality. I might be 
wrong and he really is only looking into some performance aspects.

> 
>>> performance issue seen is mostly due to the fact that because it doesn't
>>> touch the page allocator it is taking the zone lock and probing the page
>>> for each set bit to see if the page is still free. As such the performance
>>> regression seen gets worse the lower the order used for reporting.
>>>
>>> Also I suspect Nitesh's patches are also in need of further review. I have
>>> provided feedback however my focus ended up being on more the kernel
>>> panics and 30% performance regression rather than debating architecture.
>>
>> Please don't take this personally, but I really dislike you talking about
>> Nitesh's RFCs (!) and pushing for your approach (although it was you that
>> was late to the party!) in that way. If there are problems then please
>> collaborate and fix instead of using the same wrong arguments over and
>> over again.
> 
> Since Nitesh is in the middle of doing a full rewrite anyway I don't have
> much to compare against except for the previous set, which still needs
> fixes.  It is why I mentioned in the cover of the last patch set that I
> would prefer to not discuss it since I have no visibility into the patch
> set he is now working on.

Me too :) I'd love to see how Michal's idea with page isolation worked 
out. But I can understand that Nitesh wants to explore some details first.

> 
>> a) hotplug/sparse zones: I explained a couple of times why we can ignore
>> that. There was never a reply from you, yet you keep coming up with
>> that. I don't enjoy talking to a wall.
> 
> This gets to the heart of how Nitesh's patch set works. It is assuming
> that every zone is linear, that there will be no overlap between zones,
> and that the zones don't really change. These are key architectural
> assumptions that should really be discussed instead of simply dismissed.

IMHO, using bitmaps for each zone is an implementation detail right now. 
Maybe there is a better data structure for tracking this sparse data 
(e.g., sparse bitmaps), or a better way to handle bitmaps. I think this 
is a good start to get something relatively simple implemented (yeah, 
there were some pitfalls in previous versions, maybe page isolation will 
make that less error prone in an RFC).
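
To make that concrete, the per-zone bitmap is essentially indexed by the PFN
offset from the zone start, which is exactly where the sparse/hotplug concern
comes from (illustrative sketch only; the bitmap field is hypothetical):

/*
 * Sketch: one bit per (MAX_ORDER - 2) aligned chunk, relative to the
 * zone start.  zone->free_page_bitmap is a hypothetical field.
 */
static inline unsigned long free_page_bit(struct zone *zone, struct page *page)
{
        return (page_to_pfn(page) - zone->zone_start_pfn) >> (MAX_ORDER - 2);
}

static inline void mark_free_page_hint(struct zone *zone, struct page *page)
{
        /*
         * Freeing a page only sets a hint; the reporting thread later
         * scans for set bits and re-verifies each candidate page under
         * the zone lock before isolating it.
         */
        set_bit(free_page_bit(zone, page), zone->free_page_bitmap);
}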

> 
> I guess part of the difference between us is that I am looking for
> something that is production ready and not a proof of concept. It sounds
> like you would prefer this work stays in a proof of concept stage for some
> time longer.

Certainly not, but I don't think we have to rush. As I said, let's come
to a conclusion on whether we want this in the allocator or not. For me, other
things (e.g., maintainability) are more important. And AFAICT, also for
Michal and Mel.

> 
>> b) Locking optimizations: Come on, these are premature optimizations and
>> nothing to dictate your design. *nobody* but you cares about that in an
>> initial approach we get upstream. We can always optimize that.
> 
> My concern isn't so much the locking as the fact that it is the hunt and
> peck approach through a bitmap that will become increasingly more stale as
> you are processing the data. Every bit you have to test for requires
> taking a zone lock and then probing to see if the page is still free and
> the right size. My concern is how much time is going to be spent with the
> zone lock held while other CPUs are waiting on access.

Valid concerns, really. But I don't think these are road blockers.

> 
>> c) Kernel panics: Come on, we are talking about complicated RFCs here
>> with moving design decisions. We want to discuss *design* and
>> *architecture* here, not *implementation details*.
> 
> Then why ask me to compare performance against it? You were the one
> pushing for me to test it, not me. If you and Nitesh knew the design
> wasn't complete enough to run it why ask me to test it?

The design changed with Michal's comment about page isolation; that was 
afterwards, no?

Your performance comparison was very helpful. I think I said back then 
that I am interested in fundamental performance differences. You 
reported differences; AFAIK Nitesh was able to resolve one (MAX_ORDER - 
1 if I'm not wrong) using implementation changes. I *think* he is still 
looking into another comparison.

> 
> Many of the kernel panics for the patch sets in the past have been related
> to fundamental architectural issues. For example ignoring things like
> NUMA, mangling the free_list by accessing it with the wrong locks held,
> etc.

Yeah, I think Nitesh was still fairly new to the kernel when he started 
working on Rik's ideas. I assume he learned a lot during the last 
months/years :)

> 
>> d) Performance: We want to see a design that fits into the whole
>> architecture cleanly, is maintainable, and provides a benefit. Of
>> course, performance is relevant, but it certainly should not dictate our
>> design of a *virtualization specific optimization feature*. Performance
>> is not everything, otherwise please feel free and rewrite the kernel in
>> ASM and claim it is better because it is faster.
> 
> I agree performance is not everything. But when a system grinds down to
> 60% of what it was originally I find that significant.

I totally agree, that's why I asked for a fundamental performance 
comparison, which helps to make a decision. "is this gain in performance 
worth moving it into the core".

> 
>> Again, I do value your review and feedback, but I absolutely do not
>> enjoy the way you are trying to push your series here, sorry.
> 
> Well I am a bit frustrated as I have had to provide a significant amount
> of feedback on Nitesh's patches, and in spite of that I feel like I am
> getting nothing in return. I have pointed out the various issues and

I can understand the frustration. I reviewed all the parts I feel 
comfortable with (e.g., page flag vs. page type, cleanup patches), and 
left the core buddy review to experts (Mel), because that's not my area 
of expertise (yet, lol). Yeah, MM people are busy.

> opportunities to address the issues. At this point there are sections of
> his code that are directly copied from mine[1]. I have done everything I

Bad: he's not crediting you. Good: Both implementations came to the same 
conclusion virtio-wise.

> can to help the patches along but it seems like they aren't getting out of
> RFC or proof-of-concept state any time soon. So with that being the case

My gut feeling is that with page isolation the RFC stage could be over 
soon. It heavily simplifies locking/blocking pages from getting 
allocated. I might be wrong. But that's what it is when you explore new 
ideas.
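
For reference, the existing page isolation entry points being alluded to here
look roughly like the sketch below. How a reporting series would actually drive
them is exactly the open question, so this is only a sketch; the wrapper name
and the choice of migratetype are illustrative, not taken from any posted patch.

#include <linux/mmzone.h>
#include <linux/page-isolation.h>

/*
 * Sketch: take a PFN range out of circulation, report it to the hypervisor,
 * then give it back. Error handling and the actual reporting step are elided.
 */
static int report_pfn_range(unsigned long start_pfn, unsigned long end_pfn)
{
	int ret;

	/* Mark the pageblocks MIGRATE_ISOLATE so the allocator leaves them alone. */
	ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE, 0);
	if (ret)
		return ret;

	/* ... report the free pages in [start_pfn, end_pfn) to the hypervisor ... */

	/* Return the range to the free lists. */
	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
	return 0;
}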

> why not consider his patch set as something that could end up being a
> follow-on/refactor instead of an alternative to mine?

I guess MM people prefer to start simple and only add core functionality 
when really needed / it can be shown that there is a serious performance 
impact.

> 
>> Yes, if we end up finding out that there is real value in your approach,
>> nothing speaks against considering it. But please don't try to hurry and
>> push your series in that way. Please give everybody time to evaluate.
> 
> I would love to argue this patch set on the merits. However I really don't
> feel like I am getting a fair comparison here, at least from you. Every
> other reply on the thread seems to be from you trying to reinforce any
> criticism and taking the opportunity to mention that there is another
> solution out there. It is fine to fight for your own idea, but at least

"for your own idea" - are you saying Nitesh's approach is my idea? I 
hope not, otherwise I would get credit for Rik's and Nitesh's work by 
simply providing review comments.

Of course it is okay to fight for your own idea.

> let me reply to the criticisms of my own patchset before you pile on. I

Me (+ Michal): Are these core buddy changes really wanted and required. 
Can we evaluate the alternatives properly. (Michal even proposed 
something very similar to Nitesh's approach before even looking into it)

You: Please take my patch set, it is better than the alternatives 
because of X, for X in {RFC quality, sparse zones, locking internals, 
current performance differences}

And all I am requesting is that we do the evaluation, discuss if there 
are really no alternatives, and sort out fundamental issues with 
external tracking.

Michal asked the very same question again at the beginning of this 
thread: "Is there really a consensus"

Reading the replies, "no".

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-07 19:37               ` Nitesh Narayan Lal
@ 2019-11-07 22:46                 ` Alexander Duyck
  0 siblings, 0 replies; 47+ messages in thread
From: Alexander Duyck @ 2019-11-07 22:46 UTC (permalink / raw)
  To: Nitesh Narayan Lal, David Hildenbrand, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

On Thu, 2019-11-07 at 14:37 -0500, Nitesh Narayan Lal wrote:
> On 11/7/19 1:02 PM, Alexander Duyck wrote:
> > On Thu, 2019-11-07 at 00:33 +0100, David Hildenbrand wrote:
> > > On 06.11.19 18:48, Alexander Duyck wrote:
> > > > On Wed, 2019-11-06 at 17:54 +0100, Michal Hocko wrote:
> > > > > On Wed 06-11-19 08:35:43, Alexander Duyck wrote:
> > > > > > On Wed, 2019-11-06 at 15:09 +0100, David Hildenbrand wrote:
> > > > > > > > On 06.11.2019 at 13:16, Michal Hocko <mhocko@kernel.org> wrote:
> > > > > > > > 
> > > > > > > > I didn't have time to read through newer versions of this patch series
> > > > > > > > but I remember there were concerns about this functionality being pulled
> > > > > > > > into the page allocator previously both by me and Mel [1][2]. Have those been
> > > > > > > > addressed? I do not see an ack from Mel or any other MM people. Is there
> > > > > > > > really a consensus that we want something like that living in the
> > > > > > > > allocator?
> > > > > > > I don't think there is. The discussion is still ongoing (although quiet,
> > > > > > > Nitesh is working on a new version AFAIK). I think we should not rush
> > > > > > > this.
> > > > > > How much time is needed to get a review? I waited 2 weeks since posting
> > > > > > v12 and the only comments I got on the code were from Andrew. Most of this
> > > > > > hasn't changed much since v10 and that was posted back in mid September. I
> > > > > > have been down to making small tweaks here and there and haven't had any
> > > > > > real critiques on the approach since Mel had the comments about conflicts
> > > > > > with compaction which I addressed by allowing compaction to punt the
> > > > > > reporter out so that it could split and splice the lists as it walked
> > > > > > through them.
> > > > > Well, people are busy and MM community is not a large one. I cannot
> > > > > really help you much other than keep poking those people and give
> > > > > reasonable arguments so they decide to ack your patch.
> > > > I get that. But v10 was posted in mid September. Back then we had a
> > > > discussion about addressing what Mel had mentioned and I had mentioned
> > > > then that I had addressed it by allowing compaction to essentially reset
> > > > the reporter to get it out of the list so compaction could do this split
> > > > and splice tumbling logic.
> > > > 
> > > > > I definitely do not intend to nack this work, I just have maintainability
> > > > > concerns and considering there is an alternative approach that does not
> > > > > require to touch page allocator internals and which we need to compare
> > > > > against then I do not really think there is any need to push something
> > > > > in right away. Or is there any pressing reason to have this merged right
> > > > > now?
> > > > The alternative approach doesn't touch the page allocator, however it
> > > > still has essentially the same changes to __free_one_page. I suspect the
> > > Nitesh is working on Michals suggestion to use page isolation instead 
> > > AFAIK - which avoids this.
> > Okay. However it makes it much harder to discuss when we are comparing
> > against code that isn't public. If the design is being redone do we have
> > any ETA for when we will have something to actually compare to?
> 
> If it had just been the design change then giving a definite ETA would
> have been possible.
> However, I also have to fix the performance with (MAX_ORDER - 2). Unlike you,
> I need some time to do that.
> If I just post the code without fixing the performance there will again be an
> unnecessary discussion about the same thing which doesn't make any sense.

Understood. However with that being the case I don't think it is really
fair to say that my patch set needs to wait on yours to be completed
before we can review it and decide if it is acceptable for upstream.

With that said I am working on a v14 to address Mel's review feedback. I
will hopefully have that completed and tested by early next week.

> > > > performance issue seen is mostly due to the fact that because it doesn't
> > > > touch the page allocator it is taking the zone lock and probing the page
> > > > for each set bit to see if the page is still free. As such the performance
> > > > regression seen gets worse the lower the order used for reporting.
> > > > 
> > > > Also I suspect Nitesh's patches are also in need of further review. I have
> > > > provided feedback however my focus ended up being more on the kernel
> > > > panics and 30% performance regression rather than debating architecture.
> > > Please don't take this personally, but I really dislike you talking about 
> > > Nitesh's RFCs (!) and pushing for your approach (although it was you that 
> > > was late to the party!) in that way. If there are problems then please 
> > > collaborate and fix instead of using the same wrong arguments over and 
> > > over again.
> > Since Nitesh is in the middle of doing a full rewrite anyway I don't have
> > much to compare against except for the previous set, which still needs
> > fixes.  It is why I mentioned in the cover of the last patch set that I
> > would prefer to not discuss it since I have no visibility into the patch
> > set he is now working on.
> 
> Fair point.
> 
> 
> > > a) hotplug/sparse zones: I explained a couple of times why we can ignore 
> > > that. There was never a reply from you, yet you keep coming up with 
> > > that. I don't enjoy talking to a wall.
> > This gets to the heart of how Nitesh's patch set works. It is assuming
> > that every zone is linear, that there will be no overlap between zones,
> > and that the zones don't really change. These are key architectural
> > assumptions that should really be discussed instead of simply dismissed.
> 
> They are not at all dismissed, they are just kept as a future action item.

Yes, but when we are asked to compare the two solutions pointing out the
areas that are not complete is a valid point. David complained about
talking to a wall, but what does he expect when we are comparing an RFC
versus something that is ready for acceptance. He is going to get a list
of things that still need to be completed and other issues that were
identified with the patch set.



> > I guess part of the difference between us is that I am looking for
> > something that is production ready and not a proof of concept. It sounds
> > like you would prefer this work stays in a proof of concept stage for some
> > time longer.
> 
> In my opinion, it is more about how many use-cases we want to target
> initially.
> With your patch-set, I agree we can cover more use-cases where the solution
> will fit in.
> However, my series might not be suitable for use-cases where
> we have memory hotplug or memory restriction. (This will be the case
> after I fix the issues in the series)

This is the difference between a proof of concept and production code in
my opinion. I have had to go through and cover all the corner cases, make
sure this compiles on architectures other than x86, and try to validate as
much as possible in terms of possible regressions. As such I may have had
to make performance compromises here and there in order to make sure all
those cases are covered.

> > > b) Locking optimizations: Come on, these are premature optimizations and 
> > > nothing to dictate your design. *nobody* but you cares about that in an 
> > > initial approach we get upstream. We can always optimize that.
> > My concern isn't so much the locking as the fact that it is the hunt and
> > peck approach through a bitmap that will become increasingly more stale as
> > you are processing the data. Every bit you have to test for requires
> > taking a zone lock and then probing to see if the page is still free and
> > the right size. My concern is how much time is going to be spent with the
> > zone lock held while other CPUs are waiting on access.
> 
> This can be prevented (at least to an extent) by checking if the page is in
> the buddy before acquiring the lock as I have suggested previously.

Agreed that should help to reduce the pain somewhat. However it still
isn't an exact science since you may find that the page state changes
before you can acquire the zone lock.
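
A minimal sketch of the pre-check versus re-check being discussed; the helper
name is illustrative, and the order comparison uses page_private(), which is
where the buddy allocator keeps the order of a free page.

#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/page-flags.h>
#include <linux/spinlock.h>

/* Sketch: probe one candidate page lock-free, then re-verify under zone->lock. */
static bool try_report_page(struct zone *zone, struct page *page, unsigned int order)
{
	unsigned long flags;
	bool ok = false;

	/* Cheap, racy check without the zone lock; filters most stale bitmap bits. */
	if (!PageBuddy(page))
		return false;

	spin_lock_irqsave(&zone->lock, flags);
	/* The page may have been allocated or merged since the check above. */
	if (PageBuddy(page) && page_private(page) == order) {
		/* ... pull the page off its free list and queue it for reporting ... */
		ok = true;
	}
	spin_unlock_irqrestore(&zone->lock, flags);

	return ok;
}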

> > > c) Kernel panics: Come on, we are talking about complicated RFCs here 
> > > with moving design decisions. We want to discuss *design* and 
> > > *architecture* here, not *implementation details*.
> > Then why ask me to compare performance against it? You were the one
> > pushing for me to test it, not me. If you and Nitesh knew the design
> > wasn't complete enough to run it why ask me to test it?
> > 
> > Many of the kernel panics for the patch sets in the past have been related
> > to fundamental architectural issues. For example ignoring things like
> > NUMA, mangling the free_list by accessing it with the wrong locks held,
> > etc.
> 
> Obviously we didn't know it earlier; with whatever tests I had tried I didn't see
> any issues with them.
> Again, I am trying to learn from my mistakes and appreciate you helping
> me out with that.

I understand that. However I have been burned a few times by this now so I
feel it is valid to point out that there have been ongoing issues such as
this when doing a comparison. Especially when the complaint is that my
approach is "fragile".

> > > d) Performance: We want to see a design that fits into the whole 
> > > architecture cleanly, is maintainable, and provides a benefit. Of 
> > > course, performance is relevant, but it certainly should not dictate our 
> > > design of a *virtualization specific optimization feature*. Performance 
> > > is not everything, otherwise please feel free and rewrite the kernel in 
> > > ASM and claim it is better because it is faster.
> > I agree performance is not everything. But when a system grinds down to
> > 60% of what it was originally I find that significant.
> 
> 60%? In one of your previous emails you suggested that the drop was 30%.

I was rounding down in both cases by basically just swapping the last
digit for 0. As I recall it was something in the mid 30s that I was seeing
for a drop. I was being a bit more generous before, but I was feeling a
bit irritable so I flipped things and rounded the percentage it retained
down.

> > > Again, I do value your review and feedback, but I absolutely do not 
> > > enjoy the way you are trying to push your series here, sorry.
> > Well I am a bit frustrated as I have had to provide a significant amount
> > of feedback on Nitesh's patches, and in spite of that I feel like I am
> > getting nothing in return.
> 
> Not sure if I understood the meaning here. May I know what you were expecting?
> I do try to review your series and share whatever I can.

So for all the review feedback I have provided I feel like I haven't
gotten much back. I have had several iterations of the patch set that got
almost no replies. Then, on top of that, instead of getting suggestions on how
to improve my patch set, what ends up happening at some point is that your
patch set gets brought up and muddies the waters, since we start discussing
the issues with it instead of addressing issues in my patch set.

> >  I have pointed out the various issues and
> > opportunities to address the issues. At this point there are sections of
> > his code that are directly copied from mine[1]. 
> 
> I don't think there is any problem with learning from you or your code.
> Is there?

I don't have any problems with you learning from my code, and the fact is
it shows we are making progress since at this point the virtio-balloon
bits are now essentially identical. The point that I was trying to make is
that I have been contributing to the progress that has been made.

> As far as giving proper credit is concerned, that was my mistake and I intend
> to correct it as I have already mentioned.

I appreciate that.

> > I have done everything I
> > can to help the patches along but it seems like they aren't getting out of
> > RFC or proof-of-concept state any time soon. 
> 
> So one reason for that is definitely the issues you pointed out.
> But also, since I started working on this project, I kept getting different design
> suggestions, which according to me is excellent. However, adopting them and
> making those changes could be easy for you but I have to take my time to
> properly understand them before implementing them.

Understood. In my case I am suffering from the opposite problem. I am
taking the input and iterating pretty quickly at this point. However I
haven't been getting much feedback. Thus my frustration when the patches
are applied and then people start wanting them reverted until they can be
tested against your patch set.

> > So with that being the case
> > why not consider his patch set as something that could end up being a
> > follow-on/refactor instead of an alternative to mine?
> > 
> 
> I have already mentioned that I would like to see the solution which is better
> and has a consensus (It doesn't matter from where it is coming from).

My primary focus is getting a solution in place. I'm not sure if
management has changed much at Red Hat since I was there, but usually there
is a push to get things completed, is there not? In my case I would like to
switch from development to maintenance for memory hinting, and if you end
up with a better approach I would be more than open to refactoring it
later and/or throwing out my code to replace it with yours.

There is a saying "Perfection is the enemy of progress." My concern is
that we are spending so much time trying to find the perfect solution we
aren't going to ever come up with one. After all the concept has been
around since 2011 and we are still debating how to go about implementing
it.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-07 22:43               ` David Hildenbrand
@ 2019-11-08  0:42                 ` Alexander Duyck
  2019-11-08  7:06                   ` David Hildenbrand
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2019-11-08  0:42 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

On Thu, 2019-11-07 at 23:43 +0100, David Hildenbrand wrote:

[...]

> 
> > > Yes, if we end up finding out that there is real value in your approach,
> > > nothing speaks against considering it. But please don't try to hurry and
> > > push your series in that way. Please give everybody time to evaluate.
> > 
> > I would love to argue this patch set on the merits. However I really don't
> > feel like I am getting a fair comparison here, at least from you. Every
> > other reply on the thread seems to be from you trying to reinforce any
> > criticism and taking the opportunity to mention that there is another
> > solution out there. It is fine to fight for your own idea, but at least
> 
> "for your own idea" - are you saying Nitesh's approach is my idea? I 
> hope not, otherwise I would get credit for Rik's and Nitesh's work by 
> simply providing review comments.

Sorry, I was using "your" in the collective sense. I meant Nitesh, Rik,
MST, yourself, and any other folks working on the bitmap approach.

> Of course it is okay to fight for your own idea.
> 
> > let me reply to the criticisms of my own patchset before you pile on. I
> 
> Me (+ Michal): Are these core buddy changes really wanted and required. 
> Can we evaluate the alternatives properly. (Michal even proposed 
> something very similar to Nitesh's approach before even looking into it)

That is part of my frustration. Before I even had a chance to explain the
situation you had already jumped in throwing a "I second that" into the
discussion and insisting on comparisons against Nitesh's patch set which I
have already provided.

Nitesh's most recent patch set caused a kernel panic, and when I fixed
that then it is over 30% worse performance wise than my patch set, and
then when Nitesh restricted the order to MAX_ORDER - 1 and added some
logic to check the buddy before taking the lock then it finally became
comparable.

> You: Please take my patch set, it is better than the alternatives 
> because of X, for X in {RFC quality, sparse zones, locking internals, 
> current performance differences}

I should have replied to Michal's original question and simply stated that
Mel had not replied to the patches in the last month and a half. I half
suspect that is the reason for Andrew applying it. It put some pressure on
others to provide review feedback, which if nothing else I am grateful
for.

You had inserted the need to compare it against Nitesh's patch set, which
based on Nitesh's email is likely going to be a little while since he
cannot give me an ETA.

> And all I am requesting is that we do the evaluation, discuss if there 
> are really no alternatives, and sort out fundamental issues with 
> external tracking.
> 
> Michal asked the very same question again at the beginning of this 
> thread: "Is there really a consensus"

Yes, but he also said he is not nacking the patch. It seemed like he is
deferring to Mel on this so I will try to work with Mel to address his
concerns since he had some feedback that I can act on.

I'll address the comments Mel provided and be submitting a v14 sometime
soon.

> Reading the replies, "no".

I get that you think the bitmap approach is the best approach, but the
fact is it is still invasive, just to different parts of the mm subsystem.
I would argue that one of my concerns about the hotplug and sparse
handling is that skipping those for now is essentially hiding what is
likely to be some invasive code, likely not too different from what I had
to deal with with compaction. At this point he adds more data to the zone
struct than my changes, and I suspect as he progresses that may increase
further.

I do not think it is fair to hold up review and acceptance of this patch
set for performance comparisons with a patch set with no definite ETA.
Ideally we should be able to review and provide feedback on this patch set
on its own. Since Nitesh's code is in part based on my virio-balloon code
anyway it would make more sense to replace my code eventually if/when he
comes up with a better solution. One thing I can do is make sure that my
code is as non-intrusive as possible so that if/when that time comes
reverting it would not be too difficult.





^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-08  0:42                 ` Alexander Duyck
@ 2019-11-08  7:06                   ` David Hildenbrand
  2019-11-08 17:18                     ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2019-11-08  7:06 UTC (permalink / raw)
  To: Alexander Duyck, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

>> "for your own idea" - are you saying Nitesh's approach is my idea? I
>> hope not, otherwise I would get credit for Rik's and Nitesh's work by
>> simply providing review comments.
> 
> Sorry, I was using "your" in the collective sense. I meant Nitesh, Rik,
> MST, yourself, and any other folks working on the bitmap approach.

I guess I was misreading it then :)

> 
>> Of course it is okay to fight for your own idea.
>>
>>> let me reply to the criticisms of my own patchset before you pile on. I
>>
>> Me (+ Michal): Are these core buddy changes really wanted and required.
>> Can we evaluate the alternatives properly. (Michal even proposed
>> something very similar to Nitesh's approach before even looking into it)
> 
> That is part of my frustration. Before I even had a chance to explain the
> situation you had already jumped in throwing a "I second that" into the
> discussion and insisting on comparisons against Nitesh's patch set which I
> have already provided.

The different prototypes Nitesh provided are the only alternatives we 
have. So, to have the discussion that I have been wanting for a long time, we 
should evaluate them?

> 
> Nitesh's most recent patch set caused a kernel panic, and when I fixed
> that then it is over 30% worse performance wise than my patch set, and
> then when Nitesh restricted the order to MAX_ORDER - 1 and added some
> logic to check the buddy before taking the lock then it finally became
> comparable.

Yes, they are prototypes, RFCs. You did the right thing to report the 
issues so Nitesh can look into them.

> 
>> You: Please take my patch set, it is better than the alternatives
>> because of X, for X in {RFC quality, sparse zones, locking internals,
>> current performance differences}
> 
> I should have replied to Michal's original question and simply stated that
> Mel had not replied to the patches in the last month and a half. I half
> suspect that is the reason for Andrew applying it. It put some pressure on
> others to provide review feedback, which if nothing else I am grateful
> for.
> 
> You had inserted the need to compare it against Nitesh's patch set. Which
> based on Nitesh's email is likely going to be a little while since he
> cannot give me an ETA.

So I want us (you, me, Michal, Mel, Dave, ...) to discuss the direction 
we want to go. I'd love to do this on a design level, instead of having 
to wait for any patch set. But I guess this is harder to do? And 
especially as you keep mentioning the performance difference, I think we 
should evaluate whether this is an unsolvable problem or just an issue in the 
current prototype.

I mean people could have a look at Nitesh's older series to see how it 
fundamentally differs from your approach (external tracking). Nitesh might 
have fixed some things in the meantime, and is replacing the "fake 
allocation" by page isolation. But I mean, the general approach should 
be obvious and sufficient for people to make a decision.

> 
>> And all I am requesting is that we do the evaluation, discuss if there
>> are really no alternatives, and sort out fundamental issues with
>> external tracking.
>>
>> Michal asked the very same question again at the beginning of this
>> thread: "Is there really a consensus"
> 
> Yes, but he also said he is not nacking the patch. It seemed like he is

I never NACKed your series either.

> deferring to Mel on this so I will try to work with Mel to address his
> concerns since he had some feedback that I can act on.

Makes perfect sense to me.

> 
> I'll address the comments Mel provided and be submitting a v14 sometime
> soon.
> 
>> Reading the replies, "no".
> 
> I get that you think the bitmap approach is the best approach, but the

No, I think that a simple external tracking is preferable over the core 
buddy modifications we have right now. Kernel panics or fixable 
performance regressions in an RFC are not fundamental issues to me.

> fact is it is still invasive, just to different parts of the mm subsystem.

I'd love to see how it uses the page isolation framework, and only has a 
single hook to queue pages. I don't like the way pages are pulled out of 
the buddy in Nitesh's approach currently. What you have is cleaner.

> I would argue that one of my concerns about the hotplug and sparse
> handling is that skipping those for now is essentially hiding what is
> likely to be some invasive code, likely not too different from what I had
> to deal with with compaction. At this point he adds more data to the zone
> struct than my changes, and I suspect as he progresses that may increase
> further.
> 
> I do not think it is fair to hold up review and acceptance of this patch
> set for performance comparisons with a patch set with no definite ETA.

Michal asked "Is there really a consensus". A consensus that we want 
something like this, not that we want Nitesh's approach. It's just an 
alternative worth discussing.

And if you are reworking your patch set now with Mel, we might get 
another alternative that everybody is pleased with. Nobody is against 
reviewing your series - that's perfect, it's against picking it up and 
sending it upstream. That's my concern and Michal's concern, if I am not wrong.

> Ideally we should be able to review and provide feedback on this patch set
> on its own. Since Nitesh's code is in part based on my virio-balloon code

I definitely agree, but this here is *not* a reply to your patch set but 
to "Re: + mm-introduce-reported-pages.patch added to -mm tree" - picking 
it up for testing. Michal's question is completely valid. And I think the 
discussion started by Michal and me here is completely valid.

> anyway it would make more sense to replace my code eventually if/when he
> comes up with a better solution. One thing I can do is make sure that my
> code is as non-intrusive as possible so that if/when that time comes
> reverting it would not be too difficult.

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-07 16:07                 ` Alexander Duyck
@ 2019-11-08  9:43                   ` Mel Gorman
  2019-11-08 16:17                     ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2019-11-08  9:43 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michal Hocko, David Hildenbrand, akpm, aarcange, dan.j.williams,
	dave.hansen, konrad.wilk, lcapitulino, mm-commits, mst,
	osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm

On Thu, Nov 07, 2019 at 08:07:02AM -0800, Alexander Duyck wrote:
> > And what I'm saying is that the initial version should have focused
> > exclusively on the mechanism to report free pages and kept the search as
> > simple as possible with optimisations on top. By all means track pages
> > that are already reported and skip them because that is a relatively basic
> > operation with the caveat that even the page_reported checks should be
> > behind a static branch if there is no virtio balloon driver loaded.
> 
> I can split out the list manipulation logic and put it behind a static
> branch, however I cannot do that with page_reported. The issue is that
> with keeping any state in the page we need to clear it when the page is
> allocated. I could do the static branch approach if we were to clear the
> bit for all pages, but I thought it better to just do a test and clear in
> order to avoid an unnecessary read-modify-write.
> 

A test and clear (the non-atomic variety) might be fine but I don't see
why the page_reported or zone flag test cannot be behind a static
branch. If no virtio balloon driver with page reporting exists then it
does not matter if we track the state. If it's loaded for the first time,
no page has been updated but that's also ok, it now gets to start. Maybe
you are thinking about what happens if the driver is unloaded? If that
is your concern, do not disable the static branch once it is enabled. I
don't think we need to worry about handling driver unloading efficiently.
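
A minimal sketch of the pattern described above; page_reporting_enabled and the
PageReported()/__ClearPageReported() accessors are assumed names used for
illustration here, not necessarily what the series ends up calling them.

#include <linux/jump_label.h>
#include <linux/page-flags.h>

/* Off by default; flipped on once, the first time a reporting driver registers. */
DEFINE_STATIC_KEY_FALSE(page_reporting_enabled);

/* Hot allocation path: no extra work until reporting has ever been enabled. */
static inline void page_reporting_clear(struct page *page)
{
	if (!static_branch_unlikely(&page_reporting_enabled))
		return;

	/* Non-atomic test-and-clear: skip the RMW when the bit was never set. */
	if (PageReported(page))
		__ClearPageReported(page);
}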

> > The page reporting to a hypervisor is done out of band from any application
> > so even if it takes a little longer to do that reporting, is there an
> > impact really? If so, how much and what frequency is this impact incurred?
> 
> My main concern has been time with the zone lock held in order to extract
> the page from the list. Any time while we are holding the zone lock is
> time when another CPU cannot free or allocate a page from that zone.
> Without the pointers I have to walk the free lists until I find a non-
> reported page in the zone. That is the overhead I am avoiding by
> maintaining the pointers since we have to walk the free lists for every
> batch of pages we want to pull.
> 

I'm not too worried about the zone lock hold time to be honest as long
as it's not long enough to cause soft lockups. The very nature of the
feature is that there is a performance penalty to favour packing more
VMs into a physical host. Even if the zone lock hold wasn't a factor,
the reallocation of pages after they have been freed by the hypervisor
will be a slowdown. Anyone using this feature has already accepted a
performance hit and this is why I'm not massively concerned about the
performance of the reporting itself.
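
For context, the search that the boundary pointers are meant to avoid looks
roughly like this; the function name is illustrative. Reported pages accumulate
at the tail of a free list, so without a remembered boundary every batch has to
walk over them again before reaching an unreported page.

#include <linux/list.h>
#include <linux/mm_types.h>
#include <linux/page-flags.h>

/* Sketch: scan one free list from the tail for the next page to report. */
static struct page *next_page_to_report(struct list_head *free_list)
{
	struct page *page;

	list_for_each_entry_reverse(page, free_list, lru) {
		if (!PageReported(page))
			return page;	/* the entries skipped so far were already reported */
	}
	return NULL;	/* the whole list has been reported */
}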

> > The primary impact I can see is that free pages are invisible while the
> > notification takes place so the window for that may be wider if it takes
> > longer to find enough pages to send in batch but there will always be a
> > window where the free pages are invisible.
> 
> You are correct. There is a window where the pages will be invisible. The
> hope is that the impact of that should be fairly small since we only pull
> a few pages out at a time. In addition we will not pull past the watermark
> since the pages are pulled via __isolate_free_page.
> 

Indeed
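
A rough sketch of the batch pull referred to above. __isolate_free_page() is the
existing helper in mm/ (it fails rather than dig below the watermarks); the
surrounding loop, the PageReported() check and the function name are illustrative.

#include <linux/list.h>
#include <linux/mmzone.h>
#include <linux/page-flags.h>
#include <linux/spinlock.h>

int __isolate_free_page(struct page *page, unsigned int order);	/* mm/internal.h */

/* Sketch: pull up to 'budget' unreported pages of one order off a free list. */
static unsigned int isolate_report_batch(struct zone *zone, unsigned int order,
					 int mt, struct list_head *batch,
					 unsigned int budget)
{
	struct list_head *list = &zone->free_area[order].free_list[mt];
	struct page *page, *next;
	unsigned int pulled = 0;
	unsigned long flags;

	spin_lock_irqsave(&zone->lock, flags);
	list_for_each_entry_safe(page, next, list, lru) {
		if (PageReported(page))
			continue;
		if (!__isolate_free_page(page, order))
			break;		/* would cross the watermark: stop here */
		list_add(&page->lru, batch);
		if (++pulled == budget)
			break;
	}
	spin_unlock_irqrestore(&zone->lock, flags);

	return pulled;
}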

> > > > <SNIP>
> > > > 
> > > > What confused me quite a lot is that this is enabled at compile time
> > > > and then incurs a performance hit whether a hypervisor that
> > > > even cares is involved or not. So I don't think the performance angle
> > > > justifying this approach is a good one because this implementation has
> > > > issues of its own. Previously I said
> > > 
> > > We are adding code. There is always going to be some performance impact.
> > 
> > And that impact should be negligible in so far as possible, parts of
> > it are protected by branches but as it stands, just building the virtio
> > balloon driver enables all of this whether the running system is using
> > virtual machines or not.
> 
> Agreed the code is present, however that would be the case with a static
> branch as well. I tried to use static branches where I thought it made
> sense and to just use the page_reported test for the cases where I
> couldn't. I can probably add the static branch to a few more spots however
> I am not sure it adds much value when it is already behind a page_reported
> check.
> 

It's simply an ideal from my perspective. Other features like cpusets
have zero impact if they are not in use.

> > > > Adding an API for compaction does not get away from the problem that
> > > > it'll be fragile to depend on the internal state of the allocator for
> > > > correctness. Existing users that poke into the state do so as an
> > > > optimistic shortcut but if it fails, nothing actually breaks. The free
> > > > list reporting stuff might and will not be routinely tested.
> > > 
> > > I view what I am doing as not being too different from that. I am only
> > > maintaining state when I can. I understand that it makes things more
> > > fragile as we have more routes where things could go wrong, but isn't that
> > > the case with adding any interface. I have simplified this about as far as
> > > it can go.
> > > 
> > 
> > The enhanced tracking of state is not necessary initially to simply report
> > free pages to the hypervisor. 
> 
> Define enhanced tracking?

Primarily the list manipulations and boundary pointers.

> If you are talking about the PageReported bit
> then I would disagree. I need to have some way to indicate that the free
> page has been reported or not if I am going to make forward progress while
> performing the reporting in iterations.

I'm not concerned about the PageReported bit. I fully accept that we
should not be isolating and re-reporting pages that have already been
handled by the hypervisor.

> > And again, it's not clear that the additional complexity is required.
> > Specifically, it'll be very hard to tell if the state tracking is
> > actually helping because excessive list shuffling due to compaction may
> > mean that a lot of state is being tracked while the full free list ends
> > up having to be searched anyway.
> > 
> > As it stands, the optimisations are hard-wired into the providing of
> > the feature itself making it an all or nothing approach and no option to
> > "merge a bare-bones mechanism that is behind static branches and build
> > optimisations on top". At least that way, any competing approach could
> > be based on optimisations alone while the feature is still available.
> 
> If needed I can split another patch out of patches 3 and 4 that does the
> list manipulation. It shouldn't take all that long to do as I have already
> done it once for testing to get the 20% regression number.

From your perspective, I see it's a bit annoying because in the final
result, the code should be identical. However, it'll be a lot clearer
during review what is required, what level of complexity optimisations
add and the performance of it. The changelog should include what metric
you are using to evaluate the performance, the test case and the delta. It
also will be easier from a debugging perspective as minimally a bisection
could identify if a bug was due to the core mechanism itself or one of
the optimisations.  Finally, it leaves open the possibility that someone
can evaluate a completely different set of optimisations. Whatever the
alternative approaches are, the actual interface to virtio balloon surely
is the same (I don't actually know, I just can't see why the virtio ABI
would depend on how the pages are isolated, tracked and reported).

> Also as I
> mentioned I can put the list manipulation behind a static key, but cannot
> do so for the actual testing and clearing of the page reporting bit.
> 

Still not 100% sure why because until the driver is loaded, the bits
don't matter.


-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-07 18:12                     ` Alexander Duyck
@ 2019-11-08  9:57                       ` Michal Hocko
  2019-11-08 16:43                         ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2019-11-08  9:57 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Dave Hansen, David Hildenbrand, akpm, aarcange, dan.j.williams,
	konrad.wilk, lcapitulino, mgorman, mm-commits, mst, osalvador,
	pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm

On Thu 07-11-19 10:12:21, Alexander Duyck wrote:
> On Thu, 2019-11-07 at 18:46 +0100, Michal Hocko wrote:
[...]
> > I have asked several times why there is such a push and received no
> > answer but "this is taking too long" which I honestly do not care much.
> > Especially when other virt people tend to agree that there is no need to
> > rush here.
> 
> Part of the rush, at least from my perspective, is that I don't have
> indefinite time to work on this.

I fully understand this! And I also feel the frustration. Been through
that several times.

> I am sure you are aware that maintaining
> an external patch set can be a real chore and I would prefer to have it
> merged and then maintain it as a part of the tree.

Sure, keeping the code in sync is an additional burden. Having the code
in just pushes the burden to everybody touching that subsystem in the
future though. This is the maintenance cost we have to consider. Your
approach of integrating a very narrow feature into the core allocator
will require considering that usecase for future changes in the
allocator. Maintaining metadata elsewhere doesn't impose that
maintenance cost.

Can we agree on this at least? Because I feel we are circling around in
this and previous discussions.

> Then other changes can
> be rebased on it instead of having to rebase it around other changes that
> are going on.

Well, that is not a real argument because alternatives are not an
incremental change from the allocator POV. It is a different approach of
maintaining metadata. Sure a different approach could replace your
implementation (if it was merged) but what is the point of merging an
approach that would be replaced? Just because you do not want to
maintain your implementation off tree? That is a poor argument to me.

I completely agree with Mel. Let's start with a simple solution first
(using existing page isolation interfaces sound like a good start to
interact with the page allocator), establish a decent API for virtio
and start optimizing from there.

Last but not least, I would also recommend being more explicit about
workloads which are going to benefit from those performance optimizations.
So far I have only seen some micro benchmark results. Do we have any
real workloads where we can see how your approach behaves, so that we can
compare that to the other approach?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-08  9:43                   ` Mel Gorman
@ 2019-11-08 16:17                     ` Alexander Duyck
  2019-11-08 18:41                       ` Mel Gorman
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2019-11-08 16:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, David Hildenbrand, akpm, aarcange, dan.j.williams,
	dave.hansen, konrad.wilk, lcapitulino, mm-commits, mst,
	osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm

On Fri, 2019-11-08 at 09:43 +0000, Mel Gorman wrote:
> On Thu, Nov 07, 2019 at 08:07:02AM -0800, Alexander Duyck wrote:
> > > And what I'm saying is that the initial version should have focused
> > > exclusively on the mechanism to report free pages and kept the search as
> > > simple as possible with optimisations on top. By all means track pages
> > > that are already reported and skip them because that is a relatively basic
> > > operation with the caveat that even the page_reported checks should be
> > > behind a static branch if there is no virtio balloon driver loaded.
> > 
> > I can split out the list manipulation logic and put it behind a static
> > branch, however I cannot do that with page_reported. The issue is that
> > with keeping any state in the page we need to clear it when the page is
> > allocated. I could do the static branch approach if we were to clear the
> > bit for all pages, but I thought it better to just do a test and clear in
> > order to avoid an unnecessary read-modify-write.
> > 
> 
> A test and clear (the non-atomic variety) might be fine but I don't see
> why the page_reported or zone flag test cannot be behind a static
> branch. If no virtio balloon driver with page reporting exists then it
> does not matter if we track the state. If it's loaded for the first time,
> no page has been updated but that's also ok, it now gets to start. Maybe
> you are thinking about what happens if the driver is unloaded? If that
> is your concern, do not disable the static branch once it is enabled. I
> don't think we need to worry about handling driver unloading efficiently.

Yeah, the static branch confusion was a bit of rigid thinking on my part.
Specifically I was thinking that if the virtio-device is disabled I needed
to disable the static key. Instead I have already updated the code so that
we just enable the branch the first time, and then rely on ph_dev
being set to NULL to disable reporting when the interface is unregistered.
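
A sketch of that lifecycle; the names (ph_dev, page_reporting_register()) follow
the wording above and are illustrative rather than verbatim from the series.

#include <linux/errno.h>
#include <linux/jump_label.h>

DEFINE_STATIC_KEY_FALSE(page_reporting_enabled);

struct page_reporting_dev_info;				/* opaque here */
static struct page_reporting_dev_info *ph_dev;		/* NULL means no reporting driver */

int page_reporting_register(struct page_reporting_dev_info *dev)
{
	if (ph_dev)
		return -EBUSY;
	ph_dev = dev;
	/* Enabled once and never disabled, even if the driver later unregisters. */
	static_branch_enable(&page_reporting_enabled);
	return 0;
}

void page_reporting_unregister(struct page_reporting_dev_info *dev)
{
	/* Reporting simply stops; the static key intentionally stays on. */
	if (ph_dev == dev)
		ph_dev = NULL;
}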

> > > The page reporting to a hypervisor is done out of band from any application
> > > so even if it takes a little longer to do that reporting, is there an
> > > impact really? If so, how much and what frequency is this impact incurred?
> > 
> > My main concern has been time with the zone lock held in order to extract
> > the page from the list. Any time while we are holding the zone lock is
> > time when another CPU cannot free or allocate a page from that zone.
> > Without the pointers I have to walk the free lists until I find a non-
> > reported page in the zone. That is the overhead I am avoiding by
> > maintaining the pointers since we have to walk the free lists for every
> > batch of pages we want to pull.
> > 
> 
> I'm not too worried about the zone lock hold time to be honest as long
> as it's not long enough to cause soft lockups. The very nature of the
> feature is that there is a performance penalty to favour packing more
> VMs into a physical host. Even if the zone lock hold wasn't a factor,
> the reallocation of pages after they have been freed by the hypervisor
> will be a slowdown. Anyone using this feature has already accepted a
> performance hit and this is why I'm not massively concerned about the
> performance of the reporting itself.

Okay.

<snip>

> > > > > What confused me quite a lot is that this is enabled at compile time
> > > > > and then incurs a performance hit whether a hypervisor that
> > > > > even cares is involved or not. So I don't think the performance angle
> > > > > justifying this approach is a good one because this implementation has
> > > > > issues of its own. Previously I said
> > > > 
> > > > We are adding code. There is always going to be some performance impact.
> > > 
> > > And that impact should be negligible in so far as possible, parts of
> > > it are protected by branches but as it stands, just building the virtio
> > > balloon driver enables all of this whether the running system is using
> > > virtual machines or not.
> > 
> > Agreed the code is present, however that would be the case with a static
> > branch as well. I tried to use static branches where I thought it made
> > sense and to just use the page_reported test for the cases where I
> > couldn't. I can probably add the static branch to a few more spots however
> > I am not sure it adds much value when it is already behind a page_reported
> > check.
> > 
> 
> It's simply an ideal from my perspective. Other features like cpusets
> have zero impact if they are not in use.

Got it.

> > > > > Adding an API for compaction does not get away from the problem that
> > > > > it'll be fragile to depend on the internal state of the allocator for
> > > > > correctness. Existing users that poke into the state do so as an
> > > > > optimistic shortcut but if it fails, nothing actually breaks. The free
> > > > > list reporting stuff might and will not be routinely tested.
> > > > 
> > > > I view what I am doing as not being too different from that. I am only
> > > > maintaining state when I can. I understand that it makes things more
> > > > fragile as we have more routes where things could go wrong, but isn't that
> > > > the case with adding any interface. I have simplified this about as far as
> > > > it can go.
> > > > 
> > > 
> > > The enhanced tracking of state is not necessary initially to simply report
> > > free pages to the hypervisor. 
> > 
> > Define enhanced tracking?
> 
> Primarily the list manipulations and boundary pointers.

Okay, those are going to be pulled out into a separate patch. What I may
do is merge patches 3 and 4, and then pull the list manipulation and
boundary logic back out into a 4th patch since pulling out the list
manipulation will shrink patch 3 by quite a bit.

> > If you are talking about the PageReported bit
> > then I would disagree. I need to have some way to indicate that the free
> > page has been reported or not if I am going to make forward progress while
> > performing the reporting in iterations.
> 
> I'm not concerned about the PageReported bit. I fully accept that we
> should not be isolating and re-reporting pages that have already been
> handled by the hypervisor.

Okay. Sounds good.

> > > And again, it's not clear that the additional complexity is required.
> > > Specifically, it'll be very hard to tell if the state tracking is
> > > actually helping because excessive list shuffling due to compaction may
> > > mean that a lot of state is being tracked while the full free list ends
> > > up having to be searched anyway.
> > > 
> > > As it stands, the optimisations are hard-wired into the providing of
> > > the feature itself making it an all or nothing approach and no option to
> > > "merge a bare-bones mechanism that is behind static branches and build
> > > optimisations on top". At least that way, any competing approach could
> > > be based on optimisations alone while the feature is still available.
> > 
> > If needed I can split another patch out of patches 3 and 4 that does the
> > list manipulation. It shouldn't take all that long to do as I have already
> > done it once for testing to get the 20% regression number.
> 
> From your perspective, I see it's a bit annoying because in the final
> result, the code should be identical. However, it'll be a lot clearer
> during review what is required, what level of complexity optimisations
> add and the performance of it. The changelog should include what metric
> you are using to evaluate the performance, the test case and the delta. It
> also will be easier from a debugging perspective as minimally a bisection
> could identify if a bug was due to the core mechanism itself or one of
> the optimisations.  Finally, it leaves open the possibility that someone
> can evaluate a completely different set of optimisations. Whatever the
> alternative approaches are, the actual interface to virtio balloon surely
> is the same (I don't actually know, I just can't see why the virtio ABI
> would depend on how the pages are isolated, tracked and reported).

The virtio-balloon interface is the same at this point between my solution
and Nitesh's. So the only real disagreement in terms of the two solutions
is about keeping the bit in the page and the list manipulation versus the
external bitmap and the hunt and peck approach.




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-08  9:57                       ` Michal Hocko
@ 2019-11-08 16:43                         ` Alexander Duyck
  0 siblings, 0 replies; 47+ messages in thread
From: Alexander Duyck @ 2019-11-08 16:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Hansen, David Hildenbrand, akpm, aarcange, dan.j.williams,
	konrad.wilk, lcapitulino, mgorman, mm-commits, mst, osalvador,
	pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm

On Fri, 2019-11-08 at 10:57 +0100, Michal Hocko wrote:
> On Thu 07-11-19 10:12:21, Alexander Duyck wrote:
> > On Thu, 2019-11-07 at 18:46 +0100, Michal Hocko wrote:
> [...]
> > > I have asked several times why there is such a push and received no
> > > answer but "this is taking too long" which I honestly do not care much.
> > > Especially when other virt people tend to agree that there is no need to
> > > rush here.
> > 
> > Part of the rush, at least from my perspective, is that I don't have
> > indefinite time to work on this.
> 
> I fully understand this! And I also feel the frustration. Been through
> that several times.
> 
> > I am sure you are aware that maintaining
> > an external patch set can be a real chore and I would prefer to have it
> > merged and then maintain it as a part of the tree.
> 
> Sure, keeping the code in sync is an additional burden. Having the code
> in just pushes the burden to everybody touching that subsystem in the
> future though. This is the maintenance cost we have to consider. Your
> approach of integrating a very narrow feature into the core allocator
> will require considering that usecase for future changes in the
> allocator. Maintaining metadata elsewhere doesn't impose that
> maintenance cost.
> 
> Can we agree on this at least? Because feel we are circling around in
> this and previous discussions.

Not really. The problem as I see it with external metadata is that it
creates the opportunity for issues like what I ran into with compaction.
It isn't necessarily maintaining any metadata in the page, but instead it
has a massive effect on things since it is essentially churning the free
lists. In my mind the main difference is just how visible the
intrusiveness is, not necessarily if it is intrusive or not. The end
result is things are becoming more ossified either way.

> > Then other changes can
> > be rebased on it instead of having to rebase it around other changes that
> > are going on.
> 
> Well, that is not a real argument because alternatives are not an
> incremental change from the allocator POV. It is a different approach of
> maintaining metadata. Sure a different approach could replace your
> implementation (if it was merged) but what is the point of merging an
> approach that would be replaced? Just because you do not want to
> maintain your implmentation off tree? That is a poor argument to me.

It is more than the maintenance cost though. So one thing having the code
in the mm tree and linux-next gets me is more visibility and more review.
At this point the code just rots if I am sitting on it waiting for a
better alternative. What is the point in writing it if I am just sitting
on it? I am writing it with the goal of getting it upstream. I need to see
that there is a path for the patch set that ends in that direction or I am
just wasting time.

> I completely agree with Mel. Let's start with a simple solution first
> (using existing page isolation interfaces sound like a good start to
> interact with the page allocator), establish a decent API for virtio
> and start optimizing from there.

This has been brought up a few times but I don't recall seeing it
discussed anywhere. How do you see the page isolation interfaces being
used to handle the free page reporting case?

> Last but not least, I would also recommend to be more explicit about
> workloads which are going to benefit from those performance optimizations.
> So far I have only seen some micro benchmarks results. Do we have any
> real workloads and see how your approach behaves so that we can compare
> that to the other approach?

I think with some of my early versions I had a fairly simple test that
demonstrated the advantage of the approach by basically just starting up
enough VMs to create an overcommit situation and then running memhog on
each one in series, and then timing the second pass through them. If I
recall it was a matter of something like 6 to 7 seconds versus 45 seconds
per VM tested. Would something like that work or do you have another
suggestion?

I'm still somewhat new to doing development in the mm area, so most of my tests
have consisted of variants of things from will-it-scale and the like. I'm
just wondering if there is anything you would recommend?





^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-08  7:06                   ` David Hildenbrand
@ 2019-11-08 17:18                     ` Alexander Duyck
  2019-11-12 13:04                       ` David Hildenbrand
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2019-11-08 17:18 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

On Fri, 2019-11-08 at 08:06 +0100, David Hildenbrand wrote:
> > > "for your own idea" - are you saying Nitesh's approach is my idea? I
> > > hope not, otherwise I would get credit for Rik's and Nitesh's work by
> > > simply providing review comments.
> > 
> > Sorry, I was using "your" in the collective sense. I meant Nitesh, Rik,
> > MST, yourself, and any other folks working on the bitmap approach.
> 
> I guess I was misreading it then :)
> 
> > > Of course it is okay to fight for your own idea.
> > > 
> > > > let me reply to the criticisms of my own patchset before you pile on. I
> > > 
> > > Me (+ Michal): Are these core buddy changes really wanted and required.
> > > Can we evaluate the alternatives properly. (Michal even proposed
> > > something very similar to Nitesh's approach before even looking into it)
> > 
> > That is part of my frustration. Before I even had a chance to explain the
> > situation you had already jumped in throwing a "I second that" into the
> > discussion and insisting on comparisons against Nitesh's patch set which I
> > have already provided.
> 
> The different prototypes Nitesh provided are the only alternatives we 
> have. So, to have the discussion that I have been wanting for a long time, we 
> should evaluate them?
> 
> > Nitesh's most recent patch set caused a kernel panic, and when I fixed
> > that then it is over 30% worse performance wise than my patch set, and
> > then when Nitesh restricted the order to MAX_ORDER - 1 and added some
> > logic to check the buddy before taking the lock then it finally became
> > comparable.
> 
> Yes, they are prototypes, RFCs. You did the right thing to report the 
> issues so Nitesh can look into them.

I feel like you are arguing this both ways. On one side you are saying
that these are alternatives and need to be evaluated. Then on the other
side you say they are RFCs and it isn't fair to hold the outcome of a
performance evaluation against them.

Maybe instead of directly comparing the two approaches we should just look
at defining what requirements need to be met for either approach to be
considered acceptable. Then neither of us necessarily needs to be
comparing things directly and instead we are marching toward a set of
requirements to get to the solution that will work best overall.

> > > You: Please take my patch set, it is better than the alternatives
> > > because of X, for X in {RFC quality, sparse zones, locking internals,
> > > current performance differences}
> > 
> > I should have replied to Michal's original question and simply stated that
> > Mel had not replied to the patches in the last month and a half. I half
> > suspect that is the reason for Andrew applying it. It put some pressure on
> > others to provide review feedback, which if nothing else I am grateful
> > for.
> > 
> > You had inserted the need to compare it against Nitesh's patch set. Which
> > based on Nitesh's email is likely going to be a little while since he
> > cannot give me an ETA.
> 
> So I want us (you, me, Michal, Mel, Dave, ...) to discuss the direction 
> we want to go. I'd love to do this on a design level, instead of having 
> to wait for any patch set. But I guess this is harder to do? And 
> especially as you keep mentioning the performance difference, I think we 
> should evaluate if this is an unsolvable problem or just an issue in the 
> current prototype?
> 
> I mean people could have a look at Niteshs older series to see how it 
> fundamentally differs to your approach (external tracking). Nitesh might 
> have fixed some things in the mean time, and is replacing the "fake 
> allocation" by page isolation. But I mean, the general approach should 
> be obvious and sufficient for people to make a decision.

I think the problem is what is being asked for versus how we are going
about it. Asking for performance comparisons isn't going to really lead to
a design discussion.

One thing that might be interesting, at least to me, might be to just
start a new thread to discuss the options/approaches. I know I haven't
really heard much about the page isolation approach. I would be interested
in hearing how you guys are planning to go about implementing that.

> > > And all I am requesting is that we do the evaluation, discuss if there
> > > are really no alternatives, and sort out fundamental issues with
> > > external tracking.
> > > 
> > > Michal asked the very same question again at the beginning of this
> > > thread: "Is there really a consensus"
> > 
> > Yes, but he also said he is not nacking the patch. It seemed like he is
> 
> I never NACKed your series either.
> 
> > deferring to Mel on this so I will try to work with Mel to address his
> > concerns since he had some feedback that I can act on.
> 
> Makes perfect sense to me.
> 
> > I'll address the comments Mel provided and be submitting a v14 sometime
> > soon.
> > 
> > > Reading the replies, "no".
> > 
> > I get that you think the bitmap approach is the best approach, but the
> 
> No, I think that a simple external tracking is preferable over the core 
> buddy modifications we have right now. Kernel panics or fixable 
> performance regressions in an RFC are not fundamental issues to me.
> 
> > fact is it is still invasive, just to different parts of the mm subsystem.
> 
> I'd love to see how it uses the page isolation framework, and only has a 
> single hook to queue pages. I don't like the way pages are pulled out of 
> the buddy in Niteshs approach currently. What you have is cleaner.

I don't see how you could use the page isolation framework to pull out
free pages. Is there a thread somewhere on the topic that I missed?

> > I would argue that one of my concerns about the hotplug and sparse
> > handling is that by skipping those for now is essentially hiding what is
> > likely to be some invasive code, likely not too different from what I had
> > to deal with with compaction. At this point he adds more data to the zone
> > struct than my changes, and I suspect as he progresses that may increase
> > further. >
> > I do not think it is fair to hold up review and acceptance of this patch
> > set for performance comparisons with a patch set with no definite ETA.
> 
> Michal asked "Is there really a consensus". A consensus that we want 
> something like this, not that we want Nitesh's approach. It's just an 
> alternative worth discussing.
> 
> And if you are reworking your patch set now with Mel, we might get 
> another alternative that everybody is pleased with. Nobody is against 
> reviewing your series - that's perfect, it's against picking it up and 
> sending it upstream. That's my concern and Michal's concern if I am not wrong.

I agree it is not necessarily ready for upstream yet. Thus why I am
working on a v14. My past experience has been that anything accepted at
this stage is going to spend at least a couple months in the mm tree
before it is pushed. However I don't see the issue with it spending some
time in the mm tree and linux-next to get more eyes on it to identify any
potential issues or additional use cases. If anything I welcome the
additional debate, as it allows for additional opportunities for
improvement.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-08 16:17                     ` Alexander Duyck
@ 2019-11-08 18:41                       ` Mel Gorman
  2019-11-08 20:29                         ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2019-11-08 18:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michal Hocko, David Hildenbrand, akpm, aarcange, dan.j.williams,
	dave.hansen, konrad.wilk, lcapitulino, mm-commits, mst,
	osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm

On Fri, Nov 08, 2019 at 08:17:49AM -0800, Alexander Duyck wrote:
> > <SNIP>
> >
> > From your perspective, I see it's a bit annoying because in the final
> > result, the code should be identical. However, it'll be a lot clearer
> > during review what is required, what level of complexity optimisations
> > add and the performance of it. The changelog should include what metric
> > you are using to evaluate the performance, the test case and the delta. It
> > also will be easier from a debugging perspective as minimally a bisection
> > could identify if a bug was due to the core mechanism itself or one of
> > the optimisations.  Finally, it leaves open the possibility that someone
> > can evaluate a completely different set of optimisations. Whatever the
> > alternative approaches are, the actual interface to virtio balloon surely
> > is the same (I don't actually know, I just can't see why the virtio ABI
> > would depend on how the pages are isolated, tracked and reported).
> 
> The virtio-balloon interface is the same at this point between my solution
> and Nitesh's. So the only real disagreement in terms of the two solutions
> is about keeping the bit in the page and the list manipulation versus the
> external bitmap and the hunt and peck approach.
> 

This is good news because it means that when/if Nitesh's approach is ready
that the optimisations can be reverted and the new approach applied and
give a like-like comparison if appropriate. The core feature and interface
to userspace would remain the same and stay available regardless of how
it's optimised. Maybe it's the weekend talking but I think structuring
the series like that will allow forward progress to be made.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-08 18:41                       ` Mel Gorman
@ 2019-11-08 20:29                         ` Alexander Duyck
  2019-11-09 14:57                           ` Mel Gorman
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2019-11-08 20:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, David Hildenbrand, akpm, aarcange, dan.j.williams,
	dave.hansen, konrad.wilk, lcapitulino, mm-commits, mst,
	osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm

On Fri, 2019-11-08 at 18:41 +0000, Mel Gorman wrote:
> On Fri, Nov 08, 2019 at 08:17:49AM -0800, Alexander Duyck wrote:
> > > <SNIP>
> > > 
> > > From your perspective, I see it's a bit annoying because in the final
> > > result, the code should be identical. However, it'll be a lot clearer
> > > during review what is required, what level of complexity optimisations
> > > add and the performance of it. The changelog should include what metric
> > > you are using to evaluate the performance, the test case and the delta. It
> > > also will be easier from a debugging perspective as minimally a bisection
> > > could identify if a bug was due to the core mechanism itself or one of
> > > the optimisations.  Finally, it leaves open the possibility that someone
> > > can evaluate a completely different set of optimisations. Whatever the
> > > alternative approaches are, the actual interface to virtio balloon surely
> > > is the same (I don't actually know, I just can't see why the virtio ABI
> > > would depend on how the pages are isolated, tracked and reported).
> > 
> > The virtio-balloon interface is the same at this point between my solution
> > and Nitesh's. So the only real disagreement in terms of the two solutions
> > is about keeping the bit in the page and the list manipulation versus the
> > external bitmap and the hunt and peck approach.
> > 
> 
> This is good news because it means that when/if Nitesh's approach is ready
> that the optimisations can be reverted and the new approach applied and
> give a like-like comparison if appropriate. The core feature and interface
> to userspace would remain the same and stay available regardless of how
> it's optimised. Maybe it's the weekend talking but I think structuring
> the series like that will allow forward progress to be made.

So quick question.

Any issue with me manipulating the lists like you do with the compaction
code? I ask because most of the overhead I was encountering was likely due
to walking the list so many times. If I do the split/splice style logic
that should reduce the total number of trips through the free lists since
I could push the reported pages to the tail of the list. For now I am
working on that as an alternate patch to the existing reported_boundary
approach just as an experiment.
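
Roughly what I have in mind, modeled on the move_freelist_tail() helper in
mm/compaction.c (sketch only, called with the zone lock held, and the
function name here is just a placeholder):

static void report_rotate_to_tail(struct list_head *freelist,
                                  struct page *page)
{
        LIST_HEAD(sublist);

        /*
         * Everything from the head of the free list up to and including
         * @page has just been reported, so rotate that whole chunk to the
         * tail with one cut/splice instead of moving pages one at a time.
         */
        if (!list_is_last(&page->lru, freelist)) {
                list_cut_position(&sublist, freelist, &page->lru);
                list_splice_tail(&sublist, freelist);
        }
}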

Thanks.

- Alex



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-08 20:29                         ` Alexander Duyck
@ 2019-11-09 14:57                           ` Mel Gorman
  2019-11-10 18:03                             ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2019-11-09 14:57 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michal Hocko, David Hildenbrand, akpm, aarcange, dan.j.williams,
	dave.hansen, konrad.wilk, lcapitulino, mm-commits, mst,
	osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm

On Fri, Nov 08, 2019 at 12:29:47PM -0800, Alexander Duyck wrote:
> On Fri, 2019-11-08 at 18:41 +0000, Mel Gorman wrote:
> > On Fri, Nov 08, 2019 at 08:17:49AM -0800, Alexander Duyck wrote:
> > > > <SNIP>
> > > > 
> > > > From your perspective, I see it's a bit annoying because in the final
> > > > result, the code should be identical. However, it'll be a lot clearer
> > > > during review what is required, what level of complexity optimisations
> > > > add and the performance of it. The changelog should include what metric
> > > > you are using to evaluate the performance, the test case and the delta. It
> > > > also will be easier from a debugging perspective as minimally a bisection
> > > > could identify if a bug was due to the core mechanism itself or one of
> > > > the optimisations.  Finally, it leaves open the possibility that someone
> > > > can evaluate a completely different set of optimisations. Whatever the
> > > > alternative approaches are, the actual interface to virtio balloon surely
> > > > is the same (I don't actually know, I just can't see why the virtio ABI
> > > > would depend on how the pages are isolated, tracked and reported).
> > > 
> > > The virtio-balloon interface is the same at this point between my solution
> > > and Nitesh's. So the only real disagreement in terms of the two solutions
> > > is about keeping the bit in the page and the list manipulation versus the
> > > external bitmap and the hunt and peck approach.
> > > 
> > 
> > This is good news because it means that when/if Nitesh's approach is ready
> > that the optimisations can be reverted and the new approach applied and
> > give a like-like comparison if appropriate. The core feature and interface
> > to userspace would remain the same and stay available regardless of how
> > it's optimised. Maybe it's the weekend talking but I think structuring
> > the series like that will allow forward progress to be made.
> 
> So quick question.
> 
> Any issue with me manipulating the lists like you do with the compaction
> code? I ask because most of the overhead I was encountering was likely due
> to walking the list so many times.

That doesn't surprise me because it was necessary for the fast isolation
in compaction to reduce the overhead when compaction was running at
high frequency.

> If I do the split/splice style logic
> that should reduce the total number of trips through the free lists since
> I could push the reported pages to the tail of the list. For now I am
> working on that as an alternate patch to the existing reported_boundary
> approach just as an experiment.

I don't have a problem with that although it should be split out and shared
between compaction and the virtio balloon if possible. The consequences
are that compaction and the balloon might interfere with each other. That
would increase the overhead of compaction and the balloon if they both
were running at the same time. However, given that the balloon will have
a performance impact anyway, I don't think it's worth worrying about
because functionally it should be fine.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-09 14:57                           ` Mel Gorman
@ 2019-11-10 18:03                             ` Alexander Duyck
  0 siblings, 0 replies; 47+ messages in thread
From: Alexander Duyck @ 2019-11-10 18:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Alexander Duyck, Michal Hocko, David Hildenbrand, Andrew Morton,
	Andrea Arcangeli, Dan Williams, Dave Hansen,
	Konrad Rzeszutek Wilk, lcapitulino, mm-commits,
	Michael S. Tsirkin, Oscar Salvador, Pankaj Gupta, Paolo Bonzini,
	Rik van Riel, Vlastimil Babka, Wang, Wei W, Matthew Wilcox,
	Yang Zhang, linux-mm

On Sat, Nov 9, 2019 at 6:57 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Nov 08, 2019 at 12:29:47PM -0800, Alexander Duyck wrote:
> > On Fri, 2019-11-08 at 18:41 +0000, Mel Gorman wrote:
> > > On Fri, Nov 08, 2019 at 08:17:49AM -0800, Alexander Duyck wrote:
> > > > > <SNIP>
> > > > >
> > > > > From your perspective, I see it's a bit annoying because in the final
> > > > > result, the code should be identical. However, it'll be a lot clearer
> > > > > during review what is required, what level of complexity optimisations
> > > > > add and the performance of it. The changelog should include what metric
> > > > > you are using to evaluate the performance, the test case and the delta. It
> > > > > also will be easier from a debugging perspective as minimally a bisection
> > > > > could identify if a bug was due to the core mechanism itself or one of
> > > > > the optimisations.  Finally, it leaves open the possibility that someone
> > > > > can evaluate a completely different set of optimisations. Whatever the
> > > > > alternative approaches are, the actual interface to virtio balloon surely
> > > > > is the same (I don't actually know, I just can't see why the virtio ABI
> > > > > would depend on how the pages are isolated, tracked and reported).
> > > >
> > > > The virtio-balloon interface is the same at this point between my solution
> > > > and Nitesh's. So the only real disagreement in terms of the two solutions
> > > > is about keeping the bit in the page and the list manipulation versus the
> > > > external bitmap and the hunt and peck approach.
> > > >
> > >
> > > This is good news because it means that when/if Nitesh's approach is ready
> > > that the optimisations can be reverted and the new approach applied and
> > > give a like-like comparison if appropriate. The core feature and interface
> > > to userspace would remain the same and stay available regardless of how
> > > it's optimised. Maybe it's the weekend talking but I think structuring
> > > the series like that will allow forward progress to be made.
> >
> > So quick question.
> >
> > Any issue with me manipulating the lists like you do with the compaction
> > code? I ask because most of the overhead I was encountering was likely due
> > to walking the list so many times.
>
> That doesn't surprise me because it was necessary for the fast isolation
> in compaction to reduce the overhead when compaction was running at
> high frequency.
>
> > If I do the split/splice style logic
> > that should reduce the total number of trips through the free lists since
> > I could push the reported pages to the tail of the list. For now I am
> > working on that as an alternate patch to the existing reported_boundary
> > approach just as an experiment.
>
> I don't have a problem with that although it should be split out and shared
> between compaction and the virtio balloon if possible. The consequences
> are that compaction and the balloon might interfere with each other. That
> would increase the overhead of compaction and the balloon if they both
> were running at the same time. However, given that the balloon will have
> a performance impact anyway, I don't think it's worth worrying about
> because functionally it should be fine.

Actually I may have found a better alternative to achieve the same
result. It looks like there is already a function called
list_rotate_to_front() for dealing with issues like this. Instead of the
complicated logic that was used in the compaction code this seems like
it might be a better fit as it is pretty simple to use. The only check I
have to throw in is one to test whether the entry is first or not; if it
isn't, then we basically pluck the head of the list out and plant it
right in front of the page we want to process on the next iteration.
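
In other words, something along these lines at the point where the walk has
to drop the zone lock (sketch only; page, area and mt come from the
surrounding reporting loop):

        /*
         * Rotate the free list so the next page we want to process becomes
         * the first entry.  The walk can then simply resume from the list
         * head instead of rescanning everything already reported.
         */
        if (!list_is_first(&page->lru, &area->free_list[mt]))
                list_rotate_to_front(&page->lru, &area->free_list[mt]);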

Results so far seem promising. The performance for the non-shuffle
case is on par with the v13 set, and I am testing the shuffle case now
as I suspect that one will likely show more of a regression.

Thanks.

- Alex


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-06 12:16 ` + mm-introduce-reported-pages.patch added to -mm tree Michal Hocko
  2019-11-06 14:09   ` David Hildenbrand
  2019-11-06 16:49   ` Nitesh Narayan Lal
@ 2019-11-11 18:52   ` Nitesh Narayan Lal
  2019-11-11 22:00     ` Alexander Duyck
  2 siblings, 1 reply; 47+ messages in thread
From: Nitesh Narayan Lal @ 2019-11-11 18:52 UTC (permalink / raw)
  To: Michal Hocko, akpm
  Cc: aarcange, alexander.h.duyck, dan.j.williams, dave.hansen, david,
	konrad.wilk, lcapitulino, mgorman, mm-commits, mst, osalvador,
	pagupta, pbonzini, riel, vbabka, wei.w.wang, willy,
	yang.zhang.wz, linux-mm


On 11/6/19 7:16 AM, Michal Hocko wrote:
> I didn't have time to read through newer versions of this patch series
> but I remember there were concerns about this functionality being pulled
> into the page allocator previously both by me and Mel [1][2]. Have those been 
> addressed? I do not see an ack from Mel or any other MM people. Is there
> really a consensus that we want something like that living in the
> allocator?
>
> There has also been a different approach discussed and from [3]
> (referenced by the cover letter) I can only see
>
> : Then Nitesh's solution had changed to the bitmap approach[7]. However it
> : has been pointed out that this solution doesn't deal with sparse memory,
> : hotplug, and various other issues.
>
> which looks more like something to be done than a fundamental
> roadblocks.
>
> [1] http://lkml.kernel.org/r/20190912163525.GV2739@techsingularity.net
> [2] http://lkml.kernel.org/r/20190912091925.GM4023@dhcp22.suse.cz
> [3] http://lkml.kernel.org/r/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com
>
[...]

Hi,

I performed some experiments to find the root cause for the performance
degradation Alexander reported with my v12 patch-set. [1]

I will try to give a brief background of the previous discussion
under v12: (Alexander can correct me if I am missing something).
Alexander suggested two issues with my v12 posting: [2]
(This is excluding the sparse zone and memory hotplug/hotremove support)

- A crash which was caused because I was not using spin_lock_irqsave()
  (Fix suggestion came from Alexander).

- Performance degradation with Alexander's suggested setup. Where we are using
  modified will-it-scale/page_fault with THP, CONFIG_SLAB_FREELIST_RANDOM &
  CONFIG_SHUFFLE_PAGE_ALLOCATOR. When I was using (MAX_ORDER - 2) as the
  PAGE_REPORTING_MIN_ORDER, I also observed significant performance degradation
  (around 20% in the number of threads launched on the 16th vCPU). However, on
  switching the PAGE_REPORTING_MIN_ORDER to (MAX_ORDER - 1), I was able to get
  the performance similar to what Alexander is reporting.

PAGE_REPORTING_MIN_ORDER: is the minimum order of a page to be captured in the
bitmap and get reported to the hypervisor.
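
On the freeing side this is basically just an order check before touching
the bitmap; a simplified sketch (the helper name below is a placeholder,
not what the series actually calls it):

        /* only capture pages that are big enough to be worth reporting */
        if (order < PAGE_REPORTING_MIN_ORDER)
                return;
        zone_set_reported_bit(zone, page_to_pfn(page));  /* placeholder */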

For the discussion where we are comparing the two series, the performance
aspect is more relevant and important.
It turns out that with the current implementation the number of vmexits with
PAGE_REPORTING_MIN_ORDER as pageblock_order or (MAX_ORDER - 2) is significantly
larger when compared to (MAX_ORDER - 1).

One of the reasons could be that the lower order pages are not getting
sufficient time to merge with each other; as a result they are somehow getting
reported with 2 separate reporting requests, hence generating more vmexits.
Whereas with (MAX_ORDER - 1) we don't have that kind of situation, as I never
try to report any page which has order < (MAX_ORDER - 1).

To fix this, I might have to further limit the reporting which could allow the
lower order pages to further merge and hence reduce the VM exits. I will try to
do some experiments to see if I can fix this. In any case, if anyone has a
suggestion I would be more than happy to look in that direction.

Following are the numbers I gathered on a 30GB single NUMA, 16 vCPU guest
affined to a single host-NUMA:

On 16th vCPU:
With PAGE_REPORTING_MIN_ORDER as (MAX_ORDER - 1):
% Dip on the number of Processes = 1.3 %
% Dip on the number of  Threads  = 5.7 %

With PAGE_REPORTING_MIN_ORDER as With (pageblock_order):
% Dip on the number of Processes = 5 %
% Dip on the number of  Threads  = 20 %


Michal's suggestion:
I was able to get a prototype working which uses the page-isolation API:
start_isolate_page_range()/undo_isolate_page_range().
But the issue mentioned above was also evident with it.

Hence, I think before moving to the decision whether I want to use
__isolate_free_page() which isolates pages from the buddy or
start/undo_isolate_page_range() which just marks the page as MIGRATE_ISOLATE,
it is important for me to resolve the above-mentioned issue.


Previous discussions:
More about how we ended up with these two approaches could be found at [3] &
[4] explained by Alexander & David.

[1] https://lore.kernel.org/lkml/20190812131235.27244-1-nitesh@redhat.com/
[2] https://lkml.org/lkml/2019/10/2/425
[3] https://lkml.org/lkml/2019/10/23/1166
[4] https://lkml.org/lkml/2019/9/12/48

-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-11 18:52   ` Nitesh Narayan Lal
@ 2019-11-11 22:00     ` Alexander Duyck
  2019-11-12 15:19       ` Nitesh Narayan Lal
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2019-11-11 22:00 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Michal Hocko, Andrew Morton, Andrea Arcangeli, Alexander Duyck,
	Dan Williams, Dave Hansen, David Hildenbrand,
	Konrad Rzeszutek Wilk, lcapitulino, Mel Gorman, mm-commits,
	Michael S. Tsirkin, Oscar Salvador, Pankaj Gupta, Paolo Bonzini,
	Rik van Riel, Vlastimil Babka, Wang, Wei W, Matthew Wilcox,
	Yang Zhang, linux-mm

On Mon, Nov 11, 2019 at 10:52 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 11/6/19 7:16 AM, Michal Hocko wrote:
> > I didn't have time to read through newer versions of this patch series
> > but I remember there were concerns about this functionality being pulled
> > into the page allocator previously both by me and Mel [1][2]. Have those been
> > addressed? I do not see an ack from Mel or any other MM people. Is there
> > really a consensus that we want something like that living in the
> > allocator?
> >
> > There has also been a different approach discussed and from [3]
> > (referenced by the cover letter) I can only see
> >
> > : Then Nitesh's solution had changed to the bitmap approach[7]. However it
> > : has been pointed out that this solution doesn't deal with sparse memory,
> > : hotplug, and various other issues.
> >
> > which looks more like something to be done than a fundamental
> > roadblocks.
> >
> > [1] http://lkml.kernel.org/r/20190912163525.GV2739@techsingularity.net
> > [2] http://lkml.kernel.org/r/20190912091925.GM4023@dhcp22.suse.cz
> > [3] http://lkml.kernel.org/r/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com
> >
> [...]
>
> Hi,
>
> I performed some experiments to find the root cause for the performance
> degradation Alexander reported with my v12 patch-set. [1]
>
> I will try to give a brief background of the previous discussion
> under v12: (Alexander can correct me if I am missing something).
> Alexander suggested two issues with my v12 posting: [2]
> (This is excluding the sparse zone and memory hotplug/hotremove support)
>
> - A crash which was caused because I was not using spin_lock_irqsave()
>   (Fix suggestion came from Alexander).
>
> - Performance degradation with Alexander's suggested setup. Where we are using
>   modified will-it-scale/page_fault with THP, CONFIG_SLAB_FREELIST_RANDOM &
>   CONFIG_SHUFFLE_PAGE_ALLOCATOR. When I was using (MAX_ORDER - 2) as the
>   PAGE_REPORTING_MIN_ORDER, I also observed significant performance degradation
>   (around 20% in the number of threads launched on the 16th vCPU). However, on
>   switching the PAGE_REPORTING_MIN_ORDER to (MAX_ORDER - 1), I was able to get
>   the performance similar to what Alexander is reporting.
>
> PAGE_REPORTING_MIN_ORDER: is the minimum order of a page to be captured in the
> bitmap and get reported to the hypervisor.
>
> For the discussion where we are comparing the two series, the performance
> aspect is more relevant and important.
> It turns out that with the current implementation the number of vmexit with
> PAGE_REPORTING_MIN_ORDER as pageblock_order or (MAX_ORDER - 2) are significantly
> large when compared to (MAX_ORDER - 1).
>
> One of the reason could be that the lower order pages are not getting sufficient
> time to merge with each other as a result they are somehow getting reported
> with 2 separate reporting requests. Hence, generating more vmexits. Where
> as with (MAX_ORDER - 1) we don't have that kind of situation as I never try
> to report any page which has order < (MAX_ORDER - 1).
>
> To fix this, I might have to further limit the reporting which could allow the
> lower order pages to further merge and hence reduce the VM exits. I will try to
> do some experiments to see if I can fix this. In any case, if anyone has a
> suggestion I would be more than happy to look in that direction.

That doesn't make any sense. My setup using MAX_ORDER - 2, aka
pageblock_order, as the limit doesn't experience the same performance
issues the bitmap solution does. That leads me to believe the issue
isn't that the pages have not had a chance to be merged.

> Following are the numbers I gathered on a 30GB single NUMA, 16 vCPU guest
> affined to a single host-NUMA:
>
> On 16th vCPU:
> With PAGE_REPORTING_MIN_ORDER as (MAX_ORDER - 1):
> % Dip on the number of Processes = 1.3 %
> % Dip on the number of  Threads  = 5.7 %
>
> With PAGE_REPORTING_MIN_ORDER as With (pageblock_order):
> % Dip on the number of Processes = 5 %
> % Dip on the number of  Threads  = 20 %

So I don't hold much faith in the threads numbers. I have seen the
variability be as high as 14% between runs.

> Michal's suggestion:
> I was able to get the prototype which could use page-isolation API:
> start_isolate_page_range()/undo_isolate_page_range() to work.
> But the issue mentioned above was also evident with it.
>
> Hence, I think before moving to the decision whether I want to use
> __isolate_free_page() which isolates pages from the buddy or
> start/undo_isolate_page_range() which just marks the page as MIGRATE_ISOLATE,
> it is important for me to resolve the above-mentioned issue.

I'd be curious how you are avoiding causing memory starvation if you
are isolating ranges of memory that have been recently freed.

> Previous discussions:
> More about how we ended up with these two approaches could be found at [3] &
> [4] explained by Alexander & David.
>
> [1] https://lore.kernel.org/lkml/20190812131235.27244-1-nitesh@redhat.com/
> [2] https://lkml.org/lkml/2019/10/2/425
> [3] https://lkml.org/lkml/2019/10/23/1166
> [4] https://lkml.org/lkml/2019/9/12/48
>

So one thing you may want to consider would be how placement of the
buffers will impact your performance.

One thing I realized I was doing wrong with my approach was scanning
for pages starting at the tail and then working up. It greatly hurt
the efficiency of my search since in the standard case most of the
free memory will be placed at the head and only with shuffling enabled
do I really need to worry about things getting mixed up with the tail.

I suspect you may be similarly making things more difficult for
yourself by placing the reported pages back on the head of the list
instead of placing them at the tail where they will not be reallocated
immediately.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-08 17:18                     ` Alexander Duyck
@ 2019-11-12 13:04                       ` David Hildenbrand
  2019-11-12 18:34                         ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2019-11-12 13:04 UTC (permalink / raw)
  To: Alexander Duyck, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

>> Yes, they are protoypes, RFCs. You did the right thing to report the 
>> issues so Nitesh can look into them.
> 
> I feel like you are arguing this both ways. On one side you are saying
> that these are alternatives and need to be evaluated. Then on the other
> side you say they are RFCs and it isn't fair to hold the outcome of a
> performance evaluation against them.

I guess my point is to identify if there are fundamental performance
issues that cannot be solved easily later. If you have a prototype and
you perform some minor changes to solve them, then I don't consider it a
fundamental problem. It's still a prototype.

> 
> Maybe instead of directly comparing the two approaches we should just look
> at defining what requirements need to be met for either approach to be
> considered acceptable. Then neither of us necessarily needs to be
> comparing things directly and instead we are marching toward a set of
> requirements to get to the solution that will work best overall.

Makes perfect sense to me.

> 
>>>> You: Please take my patch set, it is better than the alternatives
>>>> because of X, for X in {RFC quality, sparse zones, locking internals,
>>>> current performance differences}
>>>
>>> I should have replied to Michal's original question and simply stated that
>>> Mel had not replied to the patches in the last month and a half. I half
>>> suspect that is the reason for Andrew applying it. It put some pressure on
>>> others to provide review feedback, which if nothing else I am grateful
>>> for.
>>>
>>> You had inserted the need to compare it against Nitesh's patch set. Which
>>> based on Nitesh's email is likely going to be a little while since he
>>> cannot give me an ETA.
>>
>> So I want us (you, me, Michal, Mel, Dave, ...) to discuss the direction 
>> we want to go. I'd love to do this on a design level, instead of having 
>> to wait for any patch set. But I guess this is harder to do? And 
>> especially as you keep mentioning the performance difference, I think we 
>> should evaluate if this is an unsolvable problem or just an issue in the 
>> current prototype?
>>
>> I mean people could have a look at Niteshs older series to see how it 
>> fundamentally differs to your approach (external tracking). Nitesh might 
>> have fixed some things in the mean time, and is replacing the "fake 
>> allocation" by page isolation. But I mean, the general approach should 
>> be obvious and sufficient for people to make a decision.
> 
> I think the problem is what is being asked for versus how we are going
> about it. Asking for performance comparisons isn't going to really lead to
> a design discussion.

I tend to agree. It's just me wondering how good of a performance we can
achieve with external tracking :)

> 
> One thing that might be interesting, at least to me, might be to just
> start a new thread to discuss the options/approaches. I know I haven't
> really heard much about the page isolation approach. I would be interested
> in hearing how you guys are planning to go about implementing that.

Yes, we should definitely do that. I was only skimming over the other
mail exchange (won't really be working this week), but I think you
identified some improvements yourself for your approach.

[...]

>>> fact is it is still invasive, just to different parts of the mm subsystem.
>>
>> I'd love to see how it uses the page isolation framework, and only has a 
>> single hook to queue pages. I don't like the way pages are pulled out of 
>> the buddy in Niteshs approach currently. What you have is cleaner.
> 
> I don't see how you could use the page isolation framework to pull out
> free pages. Is there a thread somewhere on the topic that I missed?

It's basically only isolating pages while reporting them, and not
pulling them out of the buddy (IOW, you move the pages to the isolate
queues where nobody is allowed to touch them, and setting the
migratetype properly). This e.g., makes other user of page isolation
(e.g., memory offlining, alloc_contig_range()) play nicely with these
isolated pages. "somebody else just isolated them, please try again."

start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated()
along with a lockless check if the page is free.

I think it should be something like this (ignoring different
migratetypes and such for now)

1. Test lockless if page is free: Not free? Done.
2. start_isolate_page_range(): Busy? Rare race (with other isolate users
or with an allocation). Done.
3. test_pages_isolated()
3a. no? Rare race, page not free anymore. undo_isolate_page_range()
3b. yes? Report, then undo_isolate_page_range()
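
In (very) rough code, assuming the pfn range is pageblock aligned (which it
is for naturally aligned buddy pages of order >= pageblock_order) and with
the reporting callback as a placeholder:

static int report_free_page(struct page *page, unsigned int order, int mt)
{
        unsigned long start_pfn = page_to_pfn(page);
        unsigned long end_pfn = start_pfn + (1UL << order);

        /* 1. lockless check, don't even try if it is no longer free */
        if (!PageBuddy(page))
                return -EAGAIN;

        /*
         * 2. isolate; -EBUSY means somebody else (another isolation user
         * or an allocation) beat us to it, which should be a rare race
         */
        if (start_isolate_page_range(start_pfn, end_pfn, mt, 0))
                return -EBUSY;

        /* 3. is the whole range really isolated and still free? */
        if (!test_pages_isolated(start_pfn, end_pfn, 0))
                report_pfn_range(start_pfn, end_pfn);   /* placeholder */

        /* 3a/3b. either way, give the range back to the buddy */
        undo_isolate_page_range(start_pfn, end_pfn, mt);
        return 0;
}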

If we would run into performance issues with the current page isolation
implementation (esp. locking), I think there are some nice
cleanups/reworks possible of which all current users could benefit
(especially across pageblocks).

> 
>>> I would argue that one of my concerns about the hotplug and sparse
>>> handling is that by skipping those for now is essentially hiding what is
>>> likely to be some invasive code, likely not too different from what I had
>>> to deal with with compaction. At this point he adds more data to the zone
>>> struct than my changes, and I suspect as he progresses that may increase
>>> further. >
>>> I do not think it is fair to hold up review and acceptance of this patch
>>> set for performance comparisons with a patch set with no definite ETA.
>>
>> Michal asked "Is there really a consensus". A consensus that we want 
>> something like this, not that we want Nitesh's approach. It's just an 
>> alternative worth discussing.
>>
>> And if you are reworking your patch set now with Mel, we might get 
>> another alternative that everybody is pleased with. Nobody is against 
>> reviewing your series - that's perfect, it's against picking it up and 
>> sending it upstream. That's my concern and Michal's concern if I am not wrong.
> 
> I agree it is not necessarily ready for upstream yet. Thus why I am
> working on a v14. My past experience has been that anything accepted at
> this state is going to spend at least a couple months in the mm tree
> before it is pushed. However I don't see the issue with it spending some
> time in the mm tree and linux-next to get more eyes on it to identify any
> potential issues or additional use cases. If anything I welcome the
> additional debate, as it allows for additional opportunities for
> improvement.

Indeed.

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-11 22:00     ` Alexander Duyck
@ 2019-11-12 15:19       ` Nitesh Narayan Lal
  2019-11-12 16:18         ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: Nitesh Narayan Lal @ 2019-11-12 15:19 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michal Hocko, Andrew Morton, Andrea Arcangeli, Alexander Duyck,
	Dan Williams, Dave Hansen, David Hildenbrand,
	Konrad Rzeszutek Wilk, lcapitulino, Mel Gorman, mm-commits,
	Michael S. Tsirkin, Oscar Salvador, Pankaj Gupta, Paolo Bonzini,
	Rik van Riel, Vlastimil Babka, Wang, Wei W, Matthew Wilcox,
	Yang Zhang, linux-mm


On 11/11/19 5:00 PM, Alexander Duyck wrote:
> On Mon, Nov 11, 2019 at 10:52 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 11/6/19 7:16 AM, Michal Hocko wrote:
>>> I didn't have time to read through newer versions of this patch series
>>> but I remember there were concerns about this functionality being pulled
>>> into the page allocator previously both by me and Mel [1][2]. Have those been
>>> addressed? I do not see an ack from Mel or any other MM people. Is there
>>> really a consensus that we want something like that living in the
>>> allocator?
>>>
>>> There has also been a different approach discussed and from [3]
>>> (referenced by the cover letter) I can only see
>>>
>>> : Then Nitesh's solution had changed to the bitmap approach[7]. However it
>>> : has been pointed out that this solution doesn't deal with sparse memory,
>>> : hotplug, and various other issues.
>>>
>>> which looks more like something to be done than a fundamental
>>> roadblocks.
>>>
>>> [1] http://lkml.kernel.org/r/20190912163525.GV2739@techsingularity.net
>>> [2] http://lkml.kernel.org/r/20190912091925.GM4023@dhcp22.suse.cz
>>> [3] http://lkml.kernel.org/r/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com
>>>
>> [...]
>>
>> Hi,
>>
>> I performed some experiments to find the root cause for the performance
>> degradation Alexander reported with my v12 patch-set. [1]
>>
>> I will try to give a brief background of the previous discussion
>> under v12: (Alexander can correct me if I am missing something).
>> Alexander suggested two issues with my v12 posting: [2]
>> (This is excluding the sparse zone and memory hotplug/hotremove support)
>>
>> - A crash which was caused because I was not using spin_lock_irqsave()
>>   (Fix suggestion came from Alexander).
>>
>> - Performance degradation with Alexander's suggested setup. Where we are using
>>   modified will-it-scale/page_fault with THP, CONFIG_SLAB_FREELIST_RANDOM &
>>   CONFIG_SHUFFLE_PAGE_ALLOCATOR. When I was using (MAX_ORDER - 2) as the
>>   PAGE_REPORTING_MIN_ORDER, I also observed significant performance degradation
>>   (around 20% in the number of threads launched on the 16th vCPU). However, on
>>   switching the PAGE_REPORTING_MIN_ORDER to (MAX_ORDER - 1), I was able to get
>>   the performance similar to what Alexander is reporting.
>>
>> PAGE_REPORTING_MIN_ORDER: is the minimum order of a page to be captured in the
>> bitmap and get reported to the hypervisor.
>>
>> For the discussion where we are comparing the two series, the performance
>> aspect is more relevant and important.
>> It turns out that with the current implementation the number of vmexit with
>> PAGE_REPORTING_MIN_ORDER as pageblock_order or (MAX_ORDER - 2) are significantly
>> large when compared to (MAX_ORDER - 1).
>>
>> One of the reason could be that the lower order pages are not getting sufficient
>> time to merge with each other as a result they are somehow getting reported
>> with 2 separate reporting requests. Hence, generating more vmexits. Where
>> as with (MAX_ORDER - 1) we don't have that kind of situation as I never try
>> to report any page which has order < (MAX_ORDER - 1).
>>
>> To fix this, I might have to further limit the reporting which could allow the
>> lower order pages to further merge and hence reduce the VM exits. I will try to
>> do some experiments to see if I can fix this. In any case, if anyone has a
>> suggestion I would be more than happy to look in that direction.
> That doesn't make any sense. My setup using MAX_ORDER - 2, aka
> pageblock_order, as the limit doesn't experience the same performance
> issues the bitmap solution does. That leads me to believe the issue
> isn't that the pages have not had a chance to be merged.
>

So, I did run your series as well with a few sysfs variables to see how many
pages of order (MAX_ORDER - 1) or (MAX_ORDER - 2) are reported at the end of
the will-it-scale/page_fault4 test.
What I observed is that the number of (MAX_ORDER - 2) pages which were getting
reported in your case was lower than what has been reported in mine with
pageblock_order.
As you have mentioned below, putting pages in a certain part of the
free list might also have an impact.

>> Following are the numbers I gathered on a 30GB single NUMA, 16 vCPU guest
>> affined to a single host-NUMA:
>>
>> On 16th vCPU:
>> With PAGE_REPORTING_MIN_ORDER as (MAX_ORDER - 1):
>> % Dip on the number of Processes = 1.3 %
>> % Dip on the number of  Threads  = 5.7 %
>>
>> With PAGE_REPORTING_MIN_ORDER as With (pageblock_order):
>> % Dip on the number of Processes = 5 %
>> % Dip on the number of  Threads  = 20 %
> So I don't hold much faith in the threads numbers. I have seen the
> variability be as high as 14% between runs.

That's interesting. Do you see the variability even with an unmodified kernel?
Somehow, for me it seems pretty consistent. However, if you are running with
multiple NUMA nodes it might have a significant impact on the numbers.

For now, I am only running a single NUMA node guest affined to a single
host NUMA node.

>> Michal's suggestion:
>> I was able to get the prototype which could use page-isolation API:
>> start_isolate_page_range()/undo_isolate_page_range() to work.
>> But the issue mentioned above was also evident with it.
>>
>> Hence, I think before moving to the decision whether I want to use
>> __isolate_free_page() which isolates pages from the buddy or
>> start/undo_isolate_page_range() which just marks the page as MIGRATE_ISOLATE,
>> it is important for me to resolve the above-mentioned issue.
> I'd be curious how you are avoiding causing memory starvation if you
> are isolating ranges of memory that have been recently freed.

I would still be marking only 32 pages as MIGRATE_ISOLATE at a time. It is
exactly the same as isolating a limited chunk of pages from the buddy.
For example, if I have a pfn x of order y then I pass
start_isolate_page_range(x, x + (1 << y), mt, 0). So at the end we
will have 32 such entries marked as MIGRATE_ISOLATE.

>> Previous discussions:
>> More about how we ended up with these two approaches could be found at [3] &
>> [4] explained by Alexander & David.
>>
>> [1] https://lore.kernel.org/lkml/20190812131235.27244-1-nitesh@redhat.com/
>> [2] https://lkml.org/lkml/2019/10/2/425
>> [3] https://lkml.org/lkml/2019/10/23/1166
>> [4] https://lkml.org/lkml/2019/9/12/48
>>
> So one thing you may want to consider would be how placement of the
> buffers will impact your performance.
>
> One thing I realized I was doing wrong with my approach was scanning
> for pages starting at the tail and then working up. It greatly hurt
> the efficiency of my search since in the standard case most of the
> free memory will be placed at the head and only with shuffling enabled
> do I really need to worry about things getting mixed up with the tail.
>
> I suspect you may be similarly making things more difficult for
> yourself by placing the reported pages back on the head of the list
> instead of placing them at the tail where they will not be reallocated
> immediately.

hmm, I see. I will try and explore this.

-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-12 15:19       ` Nitesh Narayan Lal
@ 2019-11-12 16:18         ` Alexander Duyck
  2019-11-13 18:39           ` Nitesh Narayan Lal
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2019-11-12 16:18 UTC (permalink / raw)
  To: Nitesh Narayan Lal, Alexander Duyck
  Cc: Michal Hocko, Andrew Morton, Andrea Arcangeli, Dan Williams,
	Dave Hansen, David Hildenbrand, Konrad Rzeszutek Wilk,
	lcapitulino, Mel Gorman, mm-commits, Michael S. Tsirkin,
	Oscar Salvador, Pankaj Gupta, Paolo Bonzini, Rik van Riel,
	Vlastimil Babka, Wang, Wei W, Matthew Wilcox, Yang Zhang,
	linux-mm

On Tue, 2019-11-12 at 10:19 -0500, Nitesh Narayan Lal wrote:
> On 11/11/19 5:00 PM, Alexander Duyck wrote:
> > On Mon, Nov 11, 2019 at 10:52 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> > > On 11/6/19 7:16 AM, Michal Hocko wrote:
> > > > I didn't have time to read through newer versions of this patch series
> > > > but I remember there were concerns about this functionality being pulled
> > > > into the page allocator previously both by me and Mel [1][2]. Have those been
> > > > addressed? I do not see an ack from Mel or any other MM people. Is there
> > > > really a consensus that we want something like that living in the
> > > > allocator?
> > > > 
> > > > There has also been a different approach discussed and from [3]
> > > > (referenced by the cover letter) I can only see
> > > > 
> > > > : Then Nitesh's solution had changed to the bitmap approach[7]. However it
> > > > : has been pointed out that this solution doesn't deal with sparse memory,
> > > > : hotplug, and various other issues.
> > > > 
> > > > which looks more like something to be done than a fundamental
> > > > roadblocks.
> > > > 
> > > > [1] http://lkml.kernel.org/r/20190912163525.GV2739@techsingularity.net
> > > > [2] http://lkml.kernel.org/r/20190912091925.GM4023@dhcp22.suse.cz
> > > > [3] http://lkml.kernel.org/r/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com
> > > > 
> > > [...]
> > > 
> > > Hi,
> > > 
> > > I performed some experiments to find the root cause for the performance
> > > degradation Alexander reported with my v12 patch-set. [1]
> > > 
> > > I will try to give a brief background of the previous discussion
> > > under v12: (Alexander can correct me if I am missing something).
> > > Alexander suggested two issues with my v12 posting: [2]
> > > (This is excluding the sparse zone and memory hotplug/hotremove support)
> > > 
> > > - A crash which was caused because I was not using spin_lock_irqsave()
> > >   (Fix suggestion came from Alexander).
> > > 
> > > - Performance degradation with Alexander's suggested setup. Where we are using
> > >   modified will-it-scale/page_fault with THP, CONFIG_SLAB_FREELIST_RANDOM &
> > >   CONFIG_SHUFFLE_PAGE_ALLOCATOR. When I was using (MAX_ORDER - 2) as the
> > >   PAGE_REPORTING_MIN_ORDER, I also observed significant performance degradation
> > >   (around 20% in the number of threads launched on the 16th vCPU). However, on
> > >   switching the PAGE_REPORTING_MIN_ORDER to (MAX_ORDER - 1), I was able to get
> > >   the performance similar to what Alexander is reporting.
> > > 
> > > PAGE_REPORTING_MIN_ORDER: is the minimum order of a page to be captured in the
> > > bitmap and get reported to the hypervisor.
> > > 
> > > For the discussion where we are comparing the two series, the performance
> > > aspect is more relevant and important.
> > > It turns out that with the current implementation the number of vmexit with
> > > PAGE_REPORTING_MIN_ORDER as pageblock_order or (MAX_ORDER - 2) are significantly
> > > large when compared to (MAX_ORDER - 1).
> > > 
> > > One of the reason could be that the lower order pages are not getting sufficient
> > > time to merge with each other as a result they are somehow getting reported
> > > with 2 separate reporting requests. Hence, generating more vmexits. Where
> > > as with (MAX_ORDER - 1) we don't have that kind of situation as I never try
> > > to report any page which has order < (MAX_ORDER - 1).
> > > 
> > > To fix this, I might have to further limit the reporting which could allow the
> > > lower order pages to further merge and hence reduce the VM exits. I will try to
> > > do some experiments to see if I can fix this. In any case, if anyone has a
> > > suggestion I would be more than happy to look in that direction.
> > That doesn't make any sense. My setup using MAX_ORDER - 2, aka
> > pageblock_order, as the limit doesn't experience the same performance
> > issues the bitmap solution does. That leads me to believe the issue
> > isn't that the pages have not had a chance to be merged.
> > 
> 
> So, I did run your series as well with a few sysfs variables to see how many
> pages of order (MAX_ORDER - 1) or (MAX_ORDER - 2) are reported at the end of
> will-it-scale/page_fault4 test.
> What I observed is the number of (MAX_ORDER - 2) pages which were getting
> reported in your case were lesser than what has been reported in mine with
> pageblock_order.
> As you have mentioned below about putting pages in a certain part of the
> free list might have also an impact.

Another thing you may want to check is how often your notifier is
triggering. One thing I did was to intentionally put a fairly significant
delay from the time the notification is scheduled to when it will start. I
did this because when an application is freeing memory it will take some
time to completely free it, and if it is going to reallocate it anyway
there is no need to rush since it would just invalidate the pages you
reported anyway.
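
Conceptually the scheduling side boils down to something like this
(simplified; the struct and constant names are approximate, the real series
tracks state a bit differently):

/* give freed pages a couple of seconds to be reallocated or merged */
#define PAGE_REPORTING_DELAY    (2 * HZ)

static void page_reporting_request(struct page_reporting_dev_info *prdev)
{
        /* if a pass is already queued, this free event just rides along */
        if (delayed_work_pending(&prdev->work))
                return;

        schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
}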

> > > Following are the numbers I gathered on a 30GB single NUMA, 16 vCPU guest
> > > affined to a single host-NUMA:
> > > 
> > > On 16th vCPU:
> > > With PAGE_REPORTING_MIN_ORDER as (MAX_ORDER - 1):
> > > % Dip on the number of Processes = 1.3 %
> > > % Dip on the number of  Threads  = 5.7 %
> > > 
> > > With PAGE_REPORTING_MIN_ORDER as With (pageblock_order):
> > > % Dip on the number of Processes = 5 %
> > > % Dip on the number of  Threads  = 20 %
> > So I don't hold much faith in the threads numbers. I have seen the
> > variability be as high as 14% between runs.
> 
> That's interesting. Do you see the variability even with an unmodified kernel?
> Somehow, for me it seems pretty consistent. However, if you are running with
> multiple NUMA nodes it might have a significant impact on the numbers.
> 
> For now, I am only running a single NUMA guest affined to a single NUMA
> of host.

My guest should be running in a single node, and yes I saw it with just
the unmodified kernel. I am running on the linux-next 20191031 kernel. It
did occur to me that it seems like the performance for the threads number
recently increased. There might be a guest config option impacting things
as well since I know I have changed a number of variables since then.

> > > Michal's suggestion:
> > > I was able to get the prototype which could use page-isolation API:
> > > start_isolate_page_range()/undo_isolate_page_range() to work.
> > > But the issue mentioned above was also evident with it.
> > > 
> > > Hence, I think before moving to the decision whether I want to use
> > > __isolate_free_page() which isolates pages from the buddy or
> > > start/undo_isolate_page_range() which just marks the page as MIGRATE_ISOLATE,
> > > it is important for me to resolve the above-mentioned issue.
> > I'd be curious how you are avoiding causing memory starvation if you
> > are isolating ranges of memory that have been recently freed.
> 
> I would still be marking only 32 pages as MIGRATE_ISOLATE at a time. It is
> exactly same as that of isolating limited chunk of pages from the buddy.
> For example, if I have a pfn x of order y then I pass
> start_isolate_page_range(x, x + (1 << y), mt, 0). So at the end we
> will have 32 such entries marked as MIGRATE_ISOLATE.

I get that you are isolating the same amount of memory. What I was getting
at is that __isolate_free_page has a check in it to make certain you are
not pulling memory that would put you below the minimum watermark. As far
as I know there isn't anything like that for the page isolation framework
since it is normally used for offlining memory before it is hotplugged
away.
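
The check I am referring to in __isolate_free_page() is roughly this
(paraphrased, not a verbatim copy of mm/page_alloc.c):

        /*
         * Obey watermarks as if the page was being allocated, so reporting
         * can never pull the zone below its min watermark.
         */
        watermark = min_wmark_pages(zone) + (1UL << order);
        if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
                return 0;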

> > > Previous discussions:
> > > More about how we ended up with these two approaches could be found at [3] &
> > > [4] explained by Alexander & David.
> > > 
> > > [1] https://lore.kernel.org/lkml/20190812131235.27244-1-nitesh@redhat.com/
> > > [2] https://lkml.org/lkml/2019/10/2/425
> > > [3] https://lkml.org/lkml/2019/10/23/1166
> > > [4] https://lkml.org/lkml/2019/9/12/48
> > > 
> > So one thing you may want to consider would be how placement of the
> > buffers will impact your performance.
> > 
> > One thing I realized I was doing wrong with my approach was scanning
> > for pages starting at the tail and then working up. It greatly hurt
> > the efficiency of my search since in the standard case most of the
> > free memory will be placed at the head and only with shuffling enabled
> > do I really need to worry about things getting mixed up with the tail.
> > 
> > I suspect you may be similarly making things more difficult for
> > yourself by placing the reported pages back on the head of the list
> > instead of placing them at the tail where they will not be reallocated
> > immediately.
> 
> hmm, I see. I will try and explore this.
> 




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-12 13:04                       ` David Hildenbrand
@ 2019-11-12 18:34                         ` Alexander Duyck
  2019-11-12 21:05                           ` David Hildenbrand
  2019-11-13 18:51                           ` Nitesh Narayan Lal
  0 siblings, 2 replies; 47+ messages in thread
From: Alexander Duyck @ 2019-11-12 18:34 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

On Tue, 2019-11-12 at 14:04 +0100, David Hildenbrand wrote:
> > > > fact is it is still invasive, just to different parts of the mm subsystem.
> > > 
> > > I'd love to see how it uses the page isolation framework, and only has a 
> > > single hook to queue pages. I don't like the way pages are pulled out of 
> > > the buddy in Niteshs approach currently. What you have is cleaner.
> > 
> > I don't see how you could use the page isolation framework to pull out
> > free pages. Is there a thread somewhere on the topic that I missed?
> 
> It's basically only isolating pages while reporting them, and not
> pulling them out of the buddy (IOW, you move the pages to the isolate
> queues where nobody is allowed to touch them, and setting the
> migratetype properly). This e.g., makes other user of page isolation
> (e.g., memory offlining, alloc_contig_range()) play nicely with these
> isolated pages. "somebody else just isolated them, please try again."

How so? If I understand correctly there isn't anything that prevents you
from isolating an already isolated page, is there? Last I knew isolated
pages are still considered "movable" since they are still buddy pages
aren't they?

Also this seems like it would have other implications, since isolating a
page kicks off the memory notifier, so as a result a balloon driver would
then free the pages back out so that they could be isolated, on the
assumption that the region is going offline.

> start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated()
> along with a lockless check if the page is free.

Okay, that part I think I get. However doesn't all that logic more or less
ignore the watermarks? It seems like you could cause an OOM if you don't
have the necessary checks in place for that.

> I think it should be something like this (ignoring different
> migratetypes and such for now)
> 
> 1. Test lockless if page is free: Not free? Done.

So this should help to reduce the likelihood of races in the steps below.
However it might also be useful if the code had some other way to tell that
it is done, other than just making a pass through the bitmap.

One thing I had brought up with Nitesh was the idea of maybe doing some
sort of RCU bitmap type approach. Basically while we hold the zone lock we
could swap out the old bitmap for a new one. We could probably even keep a
counter at the start of the structure so that we could track how many bits
are actually set there. Then it becomes less likely to have a race where
you free a page and set the bit and the hinting thread tests and clears
the bit but doesn't see the freed page since it is not synchronized.
Otherwise your notification setup and reporting thread may need a few smp
barriers added where necessary.
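
Something along these lines is what I have in mind -- purely an untested
sketch, with the structure and helper names invented here for illustration:

/* Hypothetical per-zone reporting bitmap that can be swapped out via RCU. */
struct report_bitmap {
	struct rcu_head rcu;
	atomic_long_t nr_set;		/* number of bits currently set */
	unsigned long nr_bits;
	unsigned long bits[];
};

/*
 * Swap in a new bitmap while holding the zone lock (e.g. around memory
 * hotplug); readers keep using the old bitmap until they leave their RCU
 * read-side critical section, after which it is freed.
 */
static void swap_report_bitmap(struct zone *zone,
			       struct report_bitmap __rcu **slot,
			       struct report_bitmap *new)
{
	struct report_bitmap *old;

	lockdep_assert_held(&zone->lock);
	old = rcu_dereference_protected(*slot, lockdep_is_held(&zone->lock));
	rcu_assign_pointer(*slot, new);
	if (old)
		kfree_rcu(old, rcu);
}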

> 2. start_isolate_page_range(): Busy? Rare race (with other isolate users

Doesn't this have the side effect of draining all the percpu caches in
order to make certain to flush the pages we isolated from there?

> or with an allocation). Done.
> 3. test_pages_isolated()

So I have reviewed the code and I don't see how this could conflict with
other callers isolating the pages. If anything it seems like if another
thread has already isolated the pages you would end up getting a false
positive, reporting the pages, and pulling them back out of isolation.

> 3a. no? Rare race, page not free anymore. undo_isolate_page_range()

I would hope it is rare. However for something like a max order page I
could easily see a piece of it having been pulled out. I would think this
case would be exceedingly expensive since you would have to put back any
pages you had previously moved into isolation.

> 3b. yes? Report, then undo_isolate_page_range()
> 
> If we would run into performance issues with the current page isolation
> implementation (esp. locking), I think there are some nice
> cleanups/reworks possible of which all current users could benefit
> (especially accross pageblocks).

To me this feels a lot like what you had for this solution near the start.
Only now instead of placing the pages into an array you are tracking a
bitmap and then using that bitmap to populate the MIGRATE_ISOLATE lists.

This sounds far more complex to me than it probably needs to be, since just
holding the pages with the buddy type cleared should be enough to make
them temporarily unusable for other threads, and even in your case you are
still having to use the scatterlist in order to hold the pages and track
what you will need to undo the isolation later.
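
Just so we are talking about the same thing, here is my reading of that flow
as a rough sketch -- untested, ignoring migratetypes and watermarks, and with
report_free_range() standing in for whatever actually tells the hypervisor:

/* Sketch only: try to report one free page of the given order. */
static void try_report_free_page(struct page *page, unsigned int order)
{
	unsigned long pfn = page_to_pfn(page);
	unsigned long end_pfn = pfn + (1UL << order);

	/* page isolation works on pageblock granularity */
	if (order < pageblock_order)
		return;

	/* 1. Lockless check: only bother if the page still looks free. */
	if (!PageBuddy(page))
		return;

	/* 2. Isolate the range; a failure means somebody beat us to it. */
	if (start_isolate_page_range(pfn, end_pfn, MIGRATE_MOVABLE, 0))
		return;

	/* 3. Verify the whole range really is isolated free pages. */
	if (!test_pages_isolated(pfn, end_pfn, false))
		report_free_range(pfn, end_pfn);	/* hypothetical hook */

	/* 3a/3b. Either way, give the pages back to the allocator. */
	undo_isolate_page_range(pfn, end_pfn, MIGRATE_MOVABLE);
}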





^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-12 18:34                         ` Alexander Duyck
@ 2019-11-12 21:05                           ` David Hildenbrand
  2019-11-12 22:17                             ` David Hildenbrand
  2019-11-12 22:19                             ` Alexander Duyck
  2019-11-13 18:51                           ` Nitesh Narayan Lal
  1 sibling, 2 replies; 47+ messages in thread
From: David Hildenbrand @ 2019-11-12 21:05 UTC (permalink / raw)
  To: Alexander Duyck, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

On 12.11.19 19:34, Alexander Duyck wrote:
> On Tue, 2019-11-12 at 14:04 +0100, David Hildenbrand wrote:
>>>>> fact is it is still invasive, just to different parts of the mm subsystem.
>>>>
>>>> I'd love to see how it uses the page isolation framework, and only has a 
>>>> single hook to queue pages. I don't like the way pages are pulled out of 
>>>> the buddy in Niteshs approach currently. What you have is cleaner.
>>>
>>> I don't see how you could use the page isolation framework to pull out
>>> free pages. Is there a thread somewhere on the topic that I missed?
>>
>> It's basically only isolating pages while reporting them, and not
>> pulling them out of the buddy (IOW, you move the pages to the isolate
>> queues where nobody is allowed to touch them, and setting the
>> migratetype properly). This e.g., makes other user of page isolation
>> (e.g., memory offlining, alloc_contig_range()) play nicely with these
>> isolated pages. "somebody else just isolated them, please try again."
> 
> How so? If I understand correctly there isn't anything that prevents you
> from isolating an already isolated page, is there? Last I knew isolated

mm/page_isolation.c:set_migratetype_isolate()
...
if (is_migrate_isolate_page(page))
	goto out;
...
-> Currently -EBUSY

> pages are still considered "movable" since they are still buddy pages
> aren't they?

They are neither movable nor unmovable AFAIK. They are temporarily
blocked. E.g., memory offlining currently returns -EBUSY if it cannot
isolate the page range. alloc_contig_range() does the same. Imagine
somebody allocating a gigantic page. You certainly cannot move the pages
that are isolated while allocating the page. But you can signal to the
caller to try again later.

> 
> Also this seems like it would have other implications since isolating a
> page kicks of the memory notifier so as a result a balloon driver would
> then free the pages back out so that they could be isolated with the
> assumption the region is going offline.

Memory notifier? Balloon pages getting freed? No.

The memory notifier is used for onlining/offlining; it is not involved here.

I think what you mean is the "isolate notifier", which is only used by
CMM on PPC.

See https://lkml.org/lkml/2019/10/31/487, where I rip that notifier out.

> 
>> start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated()
>> along with a lockless check if the page is free.
> 
> Okay, that part I think I get. However doesn't all that logic more or less
> ignore the watermarks? It seems like you could cause an OOM if you don't
> have the necessary checks in place for that.

Any approach that temporarily blocks some free pages from getting
allocated will essentially have this issue, no? I think one main design
point to minimize false OOMs was to limit the number of pages we report
at a time. Or what do you propose here in addition to that?

> 
>> I think it should be something like this (ignoring different
>> migratetypes and such for now)
>>
>> 1. Test lockless if page is free: Not free? Done.
> 
> So this should help to reduce the liklihood of races in the steps below.
> However it might also be useful if the code had some other check to see if
> it was done other than just making a pass through the bitmap.

Yes.

> 
> One thing I had brought up with Nitesh was the idea of maybe doing some
> sort of RCU bitmap type approach. Basically while we hold the zone lock we
> could swap out the old bitmap for a new one. We could probably even keep a
> counter at the start of the structure so that we could track how many bits
> are actually set there. Then it becomes less likely of having a race where
> you free a page and set the bit and the hinting thread tests and clears
> the bit but doesn't see the freed page since it is not synchronized.
> Otherwise your notification setup and reporting thread may need a few smp
> barriers added where necessary.

Yes, swapping out the bitmap via RCU would also be a way to make memory
hotplug work.

I was also thinking about a different bitmap approach. Store for each
section a bitmap. Use a meta bitmap with a bit for each section that
contains pages to report. Sparse zones and memory hot(un)plug would not
be a real issue anymore.

One could go one step further and only have a bitmap with a bit for each
section. Only remember that some (large) page was not reported in that
section (e.g., after buddy merging). In the reporting thread, report all
free pages within that section. You could end up reporting the same page
a couple of times, but the question would be if this is relevant at all.
One would have to prototype and measure that.

Long story short, I am not 100% a fan of the current "bitmap per zone"
approach, but it is fairly simple to start with :)
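
To illustrate the per-section idea, roughly (all names invented here, nothing
of this exists yet):

/* Hypothetical per-section reporting state. */
struct section_report_state {
	unsigned long *free_bitmap;	/* one bit per pageblock in the section */
};

/* Meta bitmap: one bit per section, "this section has something to report". */
static unsigned long *sections_to_report;

static void mark_pageblock_to_report(unsigned long pfn)
{
	unsigned long nr = pfn_to_section_nr(pfn);
	struct section_report_state *state = get_report_state(nr); /* hypothetical */
	unsigned long bit = (pfn - section_nr_to_pfn(nr)) >> pageblock_order;

	set_bit(bit, state->free_bitmap);
	set_bit(nr, sections_to_report);
}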

> 
>> 2. start_isolate_page_range(): Busy? Rare race (with other isolate users
> 
> Doesn't this have the side effect of draining all the percpu caches in
> order to make certain to flush the pages we isolated from there?

While alloc_contig_range() e.g., calls lru_add_drain_all(), I don't
think isolation will. Where did you spot something like this in
mm/page_isolation.c?

> 
>> or with an allocation). Done.
>> 3. test_pages_isolated()
> 
> So I have reviewed the code and I don't see how this could conflict with
> other callers isolating the pages. If anything it seems like if another
> thread has already isolated the pages you would end up getting a false
> positive, reporting the pages, and pulling them back out of isolation.

Isolated pages cannot be isolated again. This is tracked via the migratetype.

> 
>> 3a. no? Rare race, page not free anymore. undo_isolate_page_range()
> 
> I would hope it is rare. However for something like a max order page I
> could easily see a piece of it having been pulled out. I would think this
> case would be exceedingly expensive since you would have to put back any
> pages you had previous moved into isolation.

I guess it is rare; there is a tiny window between checking if the page is
free and isolating it. Would have to see that in action.

> 
>> 3b. yes? Report, then undo_isolate_page_range()
>>
>> If we would run into performance issues with the current page isolation
>> implementation (esp. locking), I think there are some nice
>> cleanups/reworks possible of which all current users could benefit
>> (especially accross pageblocks).
> 
> To me this feels a lot like what you had for this solution near the start.
> Only now instead of placing the pages into an array you are tracking a
> bitmap and then using that bitmap to populate the MIGRATE_ISOLATE lists.

Now we have a clean MM interface to do that :) And yes, which data
structure we're using becomes irrelevant.

> 
> This sounds far more complex to me then it probably needs to be since just
> holding the pages with the buddy type cleared should be enough to make
> them temporarily unusable for other threads, and even in your case you are

If you have a page that is not PageBuddy() and not movable within
ZONE_MOVABLE, has_unmovable_pages() will WARN_ON_ONCE(zone_idx(zone) ==
ZONE_MOVABLE). This can be triggered via memory offlining, when
isolating the page range.

If your approach does exactly that (clears PageBuddy() on a page in
ZONE_MOVABLE), it would be a bug. The only safe way is to have the
pageblock(s) isolated.

> still having to use the scatterlist in order to hold the pages and track
> what you will need to undo the isolation later.

I think it is very neat and not complex at all. Page isolation is a nice
feature we have in the kernel. :) It deserves some cleanups, though.

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-12 21:05                           ` David Hildenbrand
@ 2019-11-12 22:17                             ` David Hildenbrand
  2019-11-12 22:19                             ` Alexander Duyck
  1 sibling, 0 replies; 47+ messages in thread
From: David Hildenbrand @ 2019-11-12 22:17 UTC (permalink / raw)
  To: Alexander Duyck, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

On 12.11.19 22:05, David Hildenbrand wrote:
> On 12.11.19 19:34, Alexander Duyck wrote:
>> On Tue, 2019-11-12 at 14:04 +0100, David Hildenbrand wrote:
>>>>>> fact is it is still invasive, just to different parts of the mm subsystem.
>>>>>
>>>>> I'd love to see how it uses the page isolation framework, and only has a 
>>>>> single hook to queue pages. I don't like the way pages are pulled out of 
>>>>> the buddy in Niteshs approach currently. What you have is cleaner.
>>>>
>>>> I don't see how you could use the page isolation framework to pull out
>>>> free pages. Is there a thread somewhere on the topic that I missed?
>>>
>>> It's basically only isolating pages while reporting them, and not
>>> pulling them out of the buddy (IOW, you move the pages to the isolate
>>> queues where nobody is allowed to touch them, and setting the
>>> migratetype properly). This e.g., makes other user of page isolation
>>> (e.g., memory offlining, alloc_contig_range()) play nicely with these
>>> isolated pages. "somebody else just isolated them, please try again."
>>
>> How so? If I understand correctly there isn't anything that prevents you
>> from isolating an already isolated page, is there? Last I knew isolated
> 
> mm/page_isolation.c:set_migratetype_isolate()
> ...
> if (is_migrate_isolate_page(page))
> 	goto out;
> ...
> -> Currently -EBUSY
> 
>> pages are still considered "movable" since they are still buddy pages
>> aren't they?
> 
> They are neither movable nor unmovable AFAIK. They are temporarily
> blocked. E.g., memory offlining currently returns -EBUSY if it cannot
> isolate the page range. alloc_contig_range() does the same. Imagine
> somebody allocating a gigantic page. You certainly cannot move the pages
> that are isolated while allocating the page. But you can signal to the
> caller to try again later.
> 
>>
>> Also this seems like it would have other implications since isolating a
>> page kicks of the memory notifier so as a result a balloon driver would
>> then free the pages back out so that they could be isolated with the
>> assumption the region is going offline.
> 
> Memory notifier? Balloon pages getting freed? No.
> 
> The memory notifier is used for onlining/offlining, it is not involved here.
> 
> I think what you mean is the "isolate notifier", which is only used by
> CMM on PPC.
> 
> See https://lkml.org/lkml/2019/10/31/487, where I rip that notifier out.
> 
>>
>>> start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated()
>>> along with a lockless check if the page is free.
>>
>> Okay, that part I think I get. However doesn't all that logic more or less
>> ignore the watermarks? It seems like you could cause an OOM if you don't
>> have the necessary checks in place for that.
> 
> Any approach that temporarily blocks some free pages from getting
> allocated will essentially have this issue, no? I think one main design
> point to minimize false OOMs was to limit the number of pages we report
> at a time. Or what do you propose here in addition to that?
> 
>>
>>> I think it should be something like this (ignoring different
>>> migratetypes and such for now)
>>>
>>> 1. Test lockless if page is free: Not free? Done.
>>
>> So this should help to reduce the liklihood of races in the steps below.
>> However it might also be useful if the code had some other check to see if
>> it was done other than just making a pass through the bitmap.
> 
> Yes.
> 
>>
>> One thing I had brought up with Nitesh was the idea of maybe doing some
>> sort of RCU bitmap type approach. Basically while we hold the zone lock we
>> could swap out the old bitmap for a new one. We could probably even keep a
>> counter at the start of the structure so that we could track how many bits
>> are actually set there. Then it becomes less likely of having a race where
>> you free a page and set the bit and the hinting thread tests and clears
>> the bit but doesn't see the freed page since it is not synchronized.
>> Otherwise your notification setup and reporting thread may need a few smp
>> barriers added where necessary.
> 
> Yes, swapping out the bitmap via RCU is also be a way to make memory
> hotplug work.
> 
> I was also thinking about a different bitmap approach. Store for each
> section a bitmap. Use a meta bitmap with a bit for each section that
> contains pages to report. Sparse zones and memory hot(un)plug would not
> be a real issue anymore.
> 
> One could go one step further and only have a bitmap with a bit for each
> section. Only remember that some (large) page was not reported in that
> section (e.g., after buddy merging). In the reporting thread, report all
> free pages within that section. You could end up reporting the same page
> a couple of times, but the question would be if this is relevant at all.
> One would have to prototype and measure that.
> 
> Long story short, I am not 100% a fan of the current "bitmap per zone"
> approach but is is fairly simple to start with :)
> 
>>
>>> 2. start_isolate_page_range(): Busy? Rare race (with other isolate users
>>
>> Doesn't this have the side effect of draining all the percpu caches in
>> order to make certain to flush the pages we isolated from there?
> 
> While alloc_contig_range() e.g., calls lru_add_drain_all(), I don't
> think isolation will. Where did you spot something like this in
> mm/page_isolation.c?
> 
>>
>>> or with an allocation). Done.
>>> 3. test_pages_isolated()
>>
>> So I have reviewed the code and I don't see how this could conflict with
>> other callers isolating the pages. If anything it seems like if another
>> thread has already isolated the pages you would end up getting a false
>> positive, reporting the pages, and pulling them back out of isolation.
> 
> Isolated pages cannot be isolated. This is tracked via the migratetype.
> 
>>
>>> 3a. no? Rare race, page not free anymore. undo_isolate_page_range()
>>
>> I would hope it is rare. However for something like a max order page I
>> could easily see a piece of it having been pulled out. I would think this
>> case would be exceedingly expensive since you would have to put back any
>> pages you had previous moved into isolation.
> 
> I guess it is rare, there is a tiny slot between checking if the page is
> free and isolating it. Would have to see that in action.
> 
>>
>>> 3b. yes? Report, then undo_isolate_page_range()
>>>
>>> If we would run into performance issues with the current page isolation
>>> implementation (esp. locking), I think there are some nice
>>> cleanups/reworks possible of which all current users could benefit
>>> (especially accross pageblocks).
>>
>> To me this feels a lot like what you had for this solution near the start.
>> Only now instead of placing the pages into an array you are tracking a
>> bitmap and then using that bitmap to populate the MIGRATE_ISOLATE lists.
> 
> Now we have a clean MM interface to do that :) And yes, which data
> structure we're using becomes irrelevant.
> 
>>
>> This sounds far more complex to me then it probably needs to be since just
>> holding the pages with the buddy type cleared should be enough to make
>> them temporarily unusable for other threads, and even in your case you are
> 
> If you have a page that is not PageBuddy() and not movable within
> ZONE_MOVABLE, has_unmovable_pages() will WARN_ON_ONCE(zone_idx(zone) ==
> ZONE_MOVABLE). This can be triggered via memory offlining, when
> isolating the page range.
> 
> If your approach does exactly that (clear PageBuddy() on a
> ZONE_MOVABLE), it would be a bug. The only safe way is to have the
> pageblock(s) isolated.
> 

Minor correction: only if your page's refcount is > 0.

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-12 21:05                           ` David Hildenbrand
  2019-11-12 22:17                             ` David Hildenbrand
@ 2019-11-12 22:19                             ` Alexander Duyck
  2019-11-12 23:10                               ` David Hildenbrand
  1 sibling, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2019-11-12 22:19 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

On Tue, 2019-11-12 at 22:05 +0100, David Hildenbrand wrote:
> On 12.11.19 19:34, Alexander Duyck wrote:
> > On Tue, 2019-11-12 at 14:04 +0100, David Hildenbrand wrote:
> > > > > > fact is it is still invasive, just to different parts of the mm subsystem.
> > > > > 
> > > > > I'd love to see how it uses the page isolation framework, and only has a 
> > > > > single hook to queue pages. I don't like the way pages are pulled out of 
> > > > > the buddy in Niteshs approach currently. What you have is cleaner.
> > > > 
> > > > I don't see how you could use the page isolation framework to pull out
> > > > free pages. Is there a thread somewhere on the topic that I missed?
> > > 
> > > It's basically only isolating pages while reporting them, and not
> > > pulling them out of the buddy (IOW, you move the pages to the isolate
> > > queues where nobody is allowed to touch them, and setting the
> > > migratetype properly). This e.g., makes other user of page isolation
> > > (e.g., memory offlining, alloc_contig_range()) play nicely with these
> > > isolated pages. "somebody else just isolated them, please try again."
> > 
> > How so? If I understand correctly there isn't anything that prevents you
> > from isolating an already isolated page, is there? Last I knew isolated
> 
> mm/page_isolation.c:set_migratetype_isolate()
> ...
> if (is_migrate_isolate_page(page))
> 	goto out;
> ...
> -> Currently -EBUSY
> 
> > pages are still considered "movable" since they are still buddy pages
> > aren't they?
> 
> They are neither movable nor unmovable AFAIK. They are temporarily
> blocked. E.g., memory offlining currently returns -EBUSY if it cannot
> isolate the page range. alloc_contig_range() does the same. Imagine
> somebody allocating a gigantic page. You certainly cannot move the pages
> that are isolated while allocating the page. But you can signal to the
> caller to try again later.
> 
> > Also this seems like it would have other implications since isolating a
> > page kicks of the memory notifier so as a result a balloon driver would
> > then free the pages back out so that they could be isolated with the
> > assumption the region is going offline.
> 
> Memory notifier? Balloon pages getting freed? No.
> 
> The memory notifier is used for onlining/offlining, it is not involved here.

Sorry, I misread the comment in set_migratetype_isolate.

> I think what you mean is the "isolate notifier", which is only used by
> CMM on PPC.
> 
> See https://lkml.org/lkml/2019/10/31/487, where I rip that notifier out.

Okay.

> > > start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated()
> > > along with a lockless check if the page is free.
> > 
> > Okay, that part I think I get. However doesn't all that logic more or less
> > ignore the watermarks? It seems like you could cause an OOM if you don't
> > have the necessary checks in place for that.
> 
> Any approach that temporarily blocks some free pages from getting
> allocated will essentially have this issue, no? I think one main design
> point to minimize false OOMs was to limit the number of pages we report
> at a time. Or what do you propose here in addition to that?

If you take a look at __isolate_free_page it was performing a check to see
if pulling the page would place us below the minimum watermark for pages.
Odds are you should probably look at somehow incorporating that into the
solution before you pull the page. I have updated my approach to check for
the low watermark with the full capacity of MAX_ORDER - 1 pages before I
start reporting, and then I am using __isolate_free_page which will check
the minimum watermark to make sure I don't cross that.
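
As a rough sketch of that gating check (not the actual patch, and the exact
margin is only an example):

/*
 * Sketch: only start a reporting pass if the zone could spare a full
 * MAX_ORDER - 1 chunk of pages without dropping below the low watermark.
 */
static bool zone_can_report(struct zone *zone)
{
	unsigned long mark = low_wmark_pages(zone) + (1UL << (MAX_ORDER - 1));

	return zone_watermark_ok(zone, 0, mark, 0, 0);
}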

> > > I think it should be something like this (ignoring different
> > > migratetypes and such for now)
> > > 
> > > 1. Test lockless if page is free: Not free? Done.
> > 
> > So this should help to reduce the liklihood of races in the steps below.
> > However it might also be useful if the code had some other check to see if
> > it was done other than just making a pass through the bitmap.
> 
> Yes.
> 
> > One thing I had brought up with Nitesh was the idea of maybe doing some
> > sort of RCU bitmap type approach. Basically while we hold the zone lock we
> > could swap out the old bitmap for a new one. We could probably even keep a
> > counter at the start of the structure so that we could track how many bits
> > are actually set there. Then it becomes less likely of having a race where
> > you free a page and set the bit and the hinting thread tests and clears
> > the bit but doesn't see the freed page since it is not synchronized.
> > Otherwise your notification setup and reporting thread may need a few smp
> > barriers added where necessary.
> 
> Yes, swapping out the bitmap via RCU is also be a way to make memory
> hotplug work.
> 
> I was also thinking about a different bitmap approach. Store for each
> section a bitmap. Use a meta bitmap with a bit for each section that
> contains pages to report. Sparse zones and memory hot(un)plug would not
> be a real issue anymore.

I had thought about that too. The only problem is that the section has to
be power of 2 sized and I don't know if we want to be increasing the size
by 100% in the base case, although I guess there is an 8 byte pad on the
structure if page extensions are enabled.

> One could go one step further and only have a bitmap with a bit for each
> section. Only remember that some (large) page was not reported in that
> section (e.g., after buddy merging). In the reporting thread, report all
> free pages within that section. You could end up reporting the same page
> a couple of times, but the question would be if this is relevant at all.
> One would have to prototype and measure that.
> 
> Long story short, I am not 100% a fan of the current "bitmap per zone"
> approach but is is fairly simple to start with :)

Agreed. Although I worry that a bitmap per section may be even more
complex.

> > > 2. start_isolate_page_range(): Busy? Rare race (with other isolate users
> > 
> > Doesn't this have the side effect of draining all the percpu caches in
> > order to make certain to flush the pages we isolated from there?
> 
> While alloc_contig_range() e.g., calls lru_add_drain_all(), I don't
> think isolation will. Where did you spot something like this in
> mm/page_isolation.c?

At the end of set_migratetype_isolate(). The last thing it does is call
drain_all_pages().

> > > or with an allocation). Done.
> > > 3. test_pages_isolated()
> > 
> > So I have reviewed the code and I don't see how this could conflict with
> > other callers isolating the pages. If anything it seems like if another
> > thread has already isolated the pages you would end up getting a false
> > positive, reporting the pages, and pulling them back out of isolation.
> 
> Isolated pages cannot be isolated. This is tracked via the migratetype.

Thanks. I see that now that you pointed it out above.

> > > 3a. no? Rare race, page not free anymore. undo_isolate_page_range()
> > 
> > I would hope it is rare. However for something like a max order page I
> > could easily see a piece of it having been pulled out. I would think this
> > case would be exceedingly expensive since you would have to put back any
> > pages you had previous moved into isolation.
> 
> I guess it is rare, there is a tiny slot between checking if the page is
> free and isolating it. Would have to see that in action.

Yeah, it probably depends on the number of cores in play as well, since the
likelihood of a collision is probably pretty low.

> > > 3b. yes? Report, then undo_isolate_page_range()
> > > 
> > > If we would run into performance issues with the current page isolation
> > > implementation (esp. locking), I think there are some nice
> > > cleanups/reworks possible of which all current users could benefit
> > > (especially accross pageblocks).
> > 
> > To me this feels a lot like what you had for this solution near the start.
> > Only now instead of placing the pages into an array you are tracking a
> > bitmap and then using that bitmap to populate the MIGRATE_ISOLATE lists.
> 
> Now we have a clean MM interface to do that :) And yes, which data
> structure we're using becomes irrelevant.
> 
> > This sounds far more complex to me then it probably needs to be since just
> > holding the pages with the buddy type cleared should be enough to make
> > them temporarily unusable for other threads, and even in your case you are
> 
> If you have a page that is not PageBuddy() and not movable within
> ZONE_MOVABLE, has_unmovable_pages() will WARN_ON_ONCE(zone_idx(zone) ==
> ZONE_MOVABLE). This can be triggered via memory offlining, when
> isolating the page range.
> 
> If your approach does exactly that (clear PageBuddy() on a
> ZONE_MOVABLE), it would be a bug. The only safe way is to have the
> pageblock(s) isolated.

From what I can tell it looks like if the page is in ZONE_MOVABLE the
buddy flag doesn't even matter since the only thing checked is
PageReserved. There is a check early on in the main loop that will
"continue" if zone_idx(zone) == ZONE_MOVABLE.

The refcount is 0 so that will cause us to "continue" and not be counted
as an unmovable page. The downside is the scan cannot take advantage of
the "PageBuddy" value to skip over us so it just has to skip over the
section one page at a time.

The advantage here is that we can still offline a region that contains
pages that are being reported. I would think that it would fail if the
pages in the region are isolated since, as you pointed out, you get an
-EBUSY when you attempt to isolate a page that is already isolated, and as
such removal will fail, won't it?

> > still having to use the scatterlist in order to hold the pages and track
> > what you will need to undo the isolation later.
> 
> I think it is very neat and not complex at all. Page isolation is a nice
> feature we have in the kernel. :) It deserves some cleanups, though.

We can agree to disagree. At this point you are talking about adding bits
for sections and pages, and in the meantime I am working with zones and
pages. I believe finding free space in the section structure may be much
trickier than finding it in the zone or page has been. Now that I am rid of
the list manipulators my approach may soon surpass the bitmap one in terms
of being less intrusive/complex. :-)





^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-12 22:19                             ` Alexander Duyck
@ 2019-11-12 23:10                               ` David Hildenbrand
  2019-11-13  0:31                                 ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2019-11-12 23:10 UTC (permalink / raw)
  To: Alexander Duyck, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm


>>>> start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated()
>>>> along with a lockless check if the page is free.
>>>
>>> Okay, that part I think I get. However doesn't all that logic more or less
>>> ignore the watermarks? It seems like you could cause an OOM if you don't
>>> have the necessary checks in place for that.
>>
>> Any approach that temporarily blocks some free pages from getting
>> allocated will essentially have this issue, no? I think one main design
>> point to minimize false OOMs was to limit the number of pages we report
>> at a time. Or what do you propose here in addition to that?
> 
> If you take a look at __isolate_free_page it was performing a check to see
> if pulling the page would place us below the minimum watermark for pages.
> Odds are you should probably look at somehow incorporating that into the
> solution before you pull the page. I have updated my approach to check for

Ah, now I see what you mean. Makes sense!

> the low watermark with the full capacity of MAX_ORDER - 1 pages before I
> start reporting, and then I am using __isolate_free_page which will check
> the minimum watermark to make sure I don't cross that.

Yeah, you probably want to check the watermark before doing any 
reporting - I assume.

> 
>>>> I think it should be something like this (ignoring different
>>>> migratetypes and such for now)
>>>>
>>>> 1. Test lockless if page is free: Not free? Done.
>>>
>>> So this should help to reduce the liklihood of races in the steps below.
>>> However it might also be useful if the code had some other check to see if
>>> it was done other than just making a pass through the bitmap.
>>
>> Yes.
>>
>>> One thing I had brought up with Nitesh was the idea of maybe doing some
>>> sort of RCU bitmap type approach. Basically while we hold the zone lock we
>>> could swap out the old bitmap for a new one. We could probably even keep a
>>> counter at the start of the structure so that we could track how many bits
>>> are actually set there. Then it becomes less likely of having a race where
>>> you free a page and set the bit and the hinting thread tests and clears
>>> the bit but doesn't see the freed page since it is not synchronized.
>>> Otherwise your notification setup and reporting thread may need a few smp
>>> barriers added where necessary.
>>
>> Yes, swapping out the bitmap via RCU is also be a way to make memory
>> hotplug work.
>>
>> I was also thinking about a different bitmap approach. Store for each
>> section a bitmap. Use a meta bitmap with a bit for each section that
>> contains pages to report. Sparse zones and memory hot(un)plug would not
>> be a real issue anymore.
> 
> I had thought about that too. The only problem is that the section has to
> be power of 2 sized and I don't know if we want to be increasing the size

... are there sections that are not a power of 2? x86_64: 128MB, s390x: 
256MB, ...

It does not really make sense to have sections that are not a power of 
two, thinking about page tables ... I would really be interested where 
something like that is possible.

> by 100% in the base case, although I guess there is an 8 byte pad on the
> structure if page extensions are enabled.
> 
>> One could go one step further and only have a bitmap with a bit for each
>> section. Only remember that some (large) page was not reported in that
>> section (e.g., after buddy merging). In the reporting thread, report all
>> free pages within that section. You could end up reporting the same page
>> a couple of times, but the question would be if this is relevant at all.
>> One would have to prototype and measure that.
>>
>> Long story short, I am not 100% a fan of the current "bitmap per zone"
>> approach but is is fairly simple to start with :)
> 
> Agreed. Although I worry that a bitmap per section may be even more
> complex.

Slightly, yes.

> 
>>>> 2. start_isolate_page_range(): Busy? Rare race (with other isolate users
>>>
>>> Doesn't this have the side effect of draining all the percpu caches in
>>> order to make certain to flush the pages we isolated from there?
>>
>> While alloc_contig_range() e.g., calls lru_add_drain_all(), I don't
>> think isolation will. Where did you spot something like this in
>> mm/page_isolation.c?
> 
> On the end of set_migratetype_isolate(). The last thing it does is call
> drain_all_pages.

Ahh, missed that, thanks. Yeah, one could probably make that configurable,
because for that use case, where we already expect a free page, we don't
need the drain.

> 
>>>> or with an allocation). Done.
>>>> 3. test_pages_isolated()
>>>
>>> So I have reviewed the code and I don't see how this could conflict with
>>> other callers isolating the pages. If anything it seems like if another
>>> thread has already isolated the pages you would end up getting a false
>>> positive, reporting the pages, and pulling them back out of isolation.
>>
>> Isolated pages cannot be isolated. This is tracked via the migratetype.
> 
> Thanks. I see that now that you pointed it out up above.
> 
>>>> 3a. no? Rare race, page not free anymore. undo_isolate_page_range()
>>>
>>> I would hope it is rare. However for something like a max order page I
>>> could easily see a piece of it having been pulled out. I would think this
>>> case would be exceedingly expensive since you would have to put back any
>>> pages you had previous moved into isolation.
>>
>> I guess it is rare, there is a tiny slot between checking if the page is
>> free and isolating it. Would have to see that in action.
> 
> Yeah, probably depends on the number of cores in play as well since the
> liklihood of a collision is probably pretty low.
> 
>>>> 3b. yes? Report, then undo_isolate_page_range()
>>>>
>>>> If we would run into performance issues with the current page isolation
>>>> implementation (esp. locking), I think there are some nice
>>>> cleanups/reworks possible of which all current users could benefit
>>>> (especially accross pageblocks).
>>>
>>> To me this feels a lot like what you had for this solution near the start.
>>> Only now instead of placing the pages into an array you are tracking a
>>> bitmap and then using that bitmap to populate the MIGRATE_ISOLATE lists.
>>
>> Now we have a clean MM interface to do that :) And yes, which data
>> structure we're using becomes irrelevant.
>>
>>> This sounds far more complex to me then it probably needs to be since just
>>> holding the pages with the buddy type cleared should be enough to make
>>> them temporarily unusable for other threads, and even in your case you are
>>
>> If you have a page that is not PageBuddy() and not movable within
>> ZONE_MOVABLE, has_unmovable_pages() will WARN_ON_ONCE(zone_idx(zone) ==
>> ZONE_MOVABLE). This can be triggered via memory offlining, when
>> isolating the page range.
>>
>> If your approach does exactly that (clear PageBuddy() on a
>> ZONE_MOVABLE), it would be a bug. The only safe way is to have the
>> pageblock(s) isolated.
> 
>  From what I can tell it looks like if the page is in ZONE_MOVABLE the
> buddy flag doesn't even matter since the only thing checked is
> PageReserved. There is a check early on in the main loop that will
> "continue" if zone_idx(zone) == ZONE_MOVABLE.

Very right, missed that :) Reserved pages are a different story.

> 
> The refcount is 0 so that will cause us to "continue" and not be counted
> as an unmovable page. The downside is the scan cannot take advantage of
> the "PageBuddy" value to skip over us so it just has to skip over the
> section one page at a time.
> 
> The advantage here is that we can still offline a region that contains
> pages that are being reported. I would think that it would fail if the

Yes, you can isolate + offline, while the isolation approach would
require actually trying again (right now manually).

> pages in the region are isolated since as you pointed out you get an EBUSY
> when you attempt to isolate a page that is already isolated and as such
> removal will fail won't it?

Right now, yes.

(We should rework that code either way to return -EAGAIN in that case
and let memory offlining try again automatically. But we have to rework
the -EAGAIN vs. -EBUSY handling in the memory offlining code at some point
anyway; I discussed that partially with Michal recently. There is a lot of
cleaning up to do.)

> 
>>> still having to use the scatterlist in order to hold the pages and track
>>> what you will need to undo the isolation later.
>>
>> I think it is very neat and not complex at all. Page isolation is a nice
>> feature we have in the kernel. :) It deserves some cleanups, though.
> 
> We can agree to disagree. At this point you are talking about adding bits
> for sections and pages, and in the meantime I am working with zones and
> pages. I believe finding free space in the section may be much more tricky
> than finding it in the zone or page has been. Now that I am rid of the
> list manipulators my approach may soon surpass the bitmap one in terms of
> being less intrusive/complex.. :-)

I am definitely interested to see that approach :) Good to see that the 
whole discussion in this big thread turned out to be productive.

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-12 23:10                               ` David Hildenbrand
@ 2019-11-13  0:31                                 ` Alexander Duyck
  0 siblings, 0 replies; 47+ messages in thread
From: Alexander Duyck @ 2019-11-13  0:31 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm

On Wed, 2019-11-13 at 00:10 +0100, David Hildenbrand wrote:
> > > > > start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated()
> > > > > along with a lockless check if the page is free.
> > > > 
> > > > Okay, that part I think I get. However doesn't all that logic more or less
> > > > ignore the watermarks? It seems like you could cause an OOM if you don't
> > > > have the necessary checks in place for that.
> > > 
> > > Any approach that temporarily blocks some free pages from getting
> > > allocated will essentially have this issue, no? I think one main design
> > > point to minimize false OOMs was to limit the number of pages we report
> > > at a time. Or what do you propose here in addition to that?
> > 
> > If you take a look at __isolate_free_page it was performing a check to see
> > if pulling the page would place us below the minimum watermark for pages.
> > Odds are you should probably look at somehow incorporating that into the
> > solution before you pull the page. I have updated my approach to check for
> 
> Ah, now I see what you mean. Makes sense!
> 
> > the low watermark with the full capacity of MAX_ORDER - 1 pages before I
> > start reporting, and then I am using __isolate_free_page which will check
> > the minimum watermark to make sure I don't cross that.
> 
> Yeah, you probably want to check the watermark before doing any 
> reporting - I assume.
> 
> > > > > I think it should be something like this (ignoring different
> > > > > migratetypes and such for now)
> > > > > 
> > > > > 1. Test lockless if page is free: Not free? Done.
> > > > 
> > > > So this should help to reduce the liklihood of races in the steps below.
> > > > However it might also be useful if the code had some other check to see if
> > > > it was done other than just making a pass through the bitmap.
> > > 
> > > Yes.
> > > 
> > > > One thing I had brought up with Nitesh was the idea of maybe doing some
> > > > sort of RCU bitmap type approach. Basically while we hold the zone lock we
> > > > could swap out the old bitmap for a new one. We could probably even keep a
> > > > counter at the start of the structure so that we could track how many bits
> > > > are actually set there. Then it becomes less likely of having a race where
> > > > you free a page and set the bit and the hinting thread tests and clears
> > > > the bit but doesn't see the freed page since it is not synchronized.
> > > > Otherwise your notification setup and reporting thread may need a few smp
> > > > barriers added where necessary.
> > > 
> > > Yes, swapping out the bitmap via RCU is also be a way to make memory
> > > hotplug work.
> > > 
> > > I was also thinking about a different bitmap approach. Store for each
> > > section a bitmap. Use a meta bitmap with a bit for each section that
> > > contains pages to report. Sparse zones and memory hot(un)plug would not
> > > be a real issue anymore.
> > 
> > I had thought about that too. The only problem is that the section has to
> > be power of 2 sized and I don't know if we want to be increasing the size
> 
> ... are there sections that are not a power of 2? x86_64: 128MB, s390x: 
> 256MB, ...

No, what I meant was the mem_section structure. It has a hard requirement
that it be power-of-2 sized and it is already 16 bytes, or 32 with page
extensions enabled. There is room for a pad in the page extension case, so
maybe you could squeeze something in there.
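
For reference, this is roughly what the structure looks like today (modulo
config options):

struct mem_section {
	unsigned long section_mem_map;
	struct mem_section_usage *usage;
#ifdef CONFIG_PAGE_EXTENSION
	/* with SPARSEMEM the page_ext pointer lives here instead of pgdat */
	struct page_ext *page_ext;
	unsigned long pad;
#endif
	/* WARNING: must stay a power-of-2 in size for SECTION_ROOT_MASK */
};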

> It does not really make sense to have sections that are not a power of 
> two, thinking about page tables ... I would really be interested where 
> something like that is possible.

Sorry for the confusion on that.

> > 2. start_isolate_page_range(): Busy? Rare race (with other isolate users
> > > > 
> > > > Doesn't this have the side effect of draining all the percpu caches in
> > > > order to make certain to flush the pages we isolated from there?
> > > 
> > > While alloc_contig_range() e.g., calls lru_add_drain_all(), I don't
> > > think isolation will. Where did you spot something like this in
> > > mm/page_isolation.c?
> > 
> > On the end of set_migratetype_isolate(). The last thing it does is call
> > drain_all_pages.
> 
> Ahh, missed that, thanks. Yeah, one could probably make the 
> configurable, because for that use case, where we already expect a free 
> page, we don't need that.

I suppose, but that gets back into adding complexity, as we now have to
special-case isolation to work with page reporting.

<snip>

> > pages in the region are isolated since as you pointed out you get an EBUSY
> > when you attempt to isolate a page that is already isolated and as such
> > removal will fail won't it?
> 
> Right now, yes.
> 
> (we should rework that code either way to return -EAGAIN in that case 
> and let memory offlining try again automatically. But we have to rework 
> the -EAGAIN vs. -EBUSY handling in memory offlining code at one point 
> either way, I discussed that partially with Michal recently. There is a 
> lot of cleaning up to do.)

So it sounds like a cleanup/rewrite of some of the isolation code will be
needed to really get it doing what you want.

Actually I wonder if we couldn't look at something like the
free_reported_page function I did and instead split it up so that it could
be used as an inverse of __isolate_free_page. Maybe something like a
__free_isolated_page.
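
Something like this, just as a sketch of the idea -- using the name suggested
above, and assuming it sits in page_alloc.c next to __isolate_free_page() so
it can use __free_one_page():

/*
 * Hypothetical inverse of __isolate_free_page(): hand an isolated free
 * page of the given order straight back to the buddy.
 */
static void __free_isolated_page(struct page *page, unsigned int order)
{
	struct zone *zone = page_zone(page);

	lockdep_assert_held(&zone->lock);

	/* Return the page to the free lists with its current migratetype. */
	__free_one_page(page, page_to_pfn(page), zone, order,
			get_pageblock_migratetype(page));
}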

> > > > still having to use the scatterlist in order to hold the pages and track
> > > > what you will need to undo the isolation later.
> > > 
> > > I think it is very neat and not complex at all. Page isolation is a nice
> > > feature we have in the kernel. :) It deserves some cleanups, though.
> > 
> > We can agree to disagree. At this point you are talking about adding bits
> > for sections and pages, and in the meantime I am working with zones and
> > pages. I believe finding free space in the section may be much more tricky
> > than finding it in the zone or page has been. Now that I am rid of the
> > list manipulators my approach may soon surpass the bitmap one in terms of
> > being less intrusive/complex.. :-)
> 
> I am definitely interested to see that approach :) Good to see that the 
> whole discussion in this big thread turned out to be productive.

Yeah, when I started working on the patch split Mel wanted, I kind of
realized I was optimizing for the shuffle case, which really shouldn't be
an optimization target. I think I just got focused on it as it was in the
way of some of the initial changes I needed to make to handle the
notifier.





^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-12 16:18         ` Alexander Duyck
@ 2019-11-13 18:39           ` Nitesh Narayan Lal
  0 siblings, 0 replies; 47+ messages in thread
From: Nitesh Narayan Lal @ 2019-11-13 18:39 UTC (permalink / raw)
  To: Alexander Duyck, Alexander Duyck, David Hildenbrand
  Cc: Michal Hocko, Andrew Morton, Andrea Arcangeli, Dan Williams,
	Dave Hansen, Konrad Rzeszutek Wilk, lcapitulino, Mel Gorman,
	mm-commits, Michael S. Tsirkin, Oscar Salvador, Pankaj Gupta,
	Paolo Bonzini, Rik van Riel, Vlastimil Babka, Wang, Wei W,
	Matthew Wilcox, Yang Zhang, linux-mm


On 11/12/19 11:18 AM, Alexander Duyck wrote:
> On Tue, 2019-11-12 at 10:19 -0500, Nitesh Narayan Lal wrote:
>> On 11/11/19 5:00 PM, Alexander Duyck wrote:
>>> On Mon, Nov 11, 2019 at 10:52 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>> On 11/6/19 7:16 AM, Michal Hocko wrote:
>>>>> I didn't have time to read through newer versions of this patch series
>>>>> but I remember there were concerns about this functionality being pulled
>>>>> into the page allocator previously both by me and Mel [1][2]. Have those been
>>>>> addressed? I do not see an ack from Mel or any other MM people. Is there
>>>>> really a consensus that we want something like that living in the
>>>>> allocator?
>>>>>
>>>>> There has also been a different approach discussed and from [3]
>>>>> (referenced by the cover letter) I can only see
>>>>>
>>>>> : Then Nitesh's solution had changed to the bitmap approach[7]. However it
>>>>> : has been pointed out that this solution doesn't deal with sparse memory,
>>>>> : hotplug, and various other issues.
>>>>>
>>>>> which looks more like something to be done than a fundamental
>>>>> roadblocks.
>>>>>
>>>>> [1] http://lkml.kernel.org/r/20190912163525.GV2739@techsingularity.net
>>>>> [2] http://lkml.kernel.org/r/20190912091925.GM4023@dhcp22.suse.cz
>>>>> [3] http://lkml.kernel.org/r/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com
>>>>>
>>>> [...]
>>>>
>>>> Hi,
>>>>
>>>> I performed some experiments to find the root cause for the performance
>>>> degradation Alexander reported with my v12 patch-set. [1]
>>>>
>>>> I will try to give a brief background of the previous discussion
>>>> under v12: (Alexander can correct me if I am missing something).
>>>> Alexander suggested two issues with my v12 posting: [2]
>>>> (This is excluding the sparse zone and memory hotplug/hotremove support)
>>>>
>>>> - A crash which was caused because I was not using spinlock_irqsave()
>>>>   (Fix suggestion came from Alexander).
>>>>
>>>> - Performance degradation with Alexander's suggested setup. Where we are using
>>>>   modified will-it-scale/page_fault with THP, CONFIG_SLAB_FREELIST_RANDOM &
>>>>   CONFIG_SHUFFLE_PAGE_ALLOCATOR. When I was using (MAX_ORDER - 2) as the
>>>>   PAGE_REPORTING_MIN_ORDER, I also observed significant performance degradation
>>>>   (around 20% in the number of threads launched on the 16th vCPU). However, on
>>>>   switching the PAGE_REPORTING_MIN_ORDER to (MAX_ORDER - 1), I was able to get
>>>>   the performance similar to what Alexander is reporting.
>>>>
>>>> PAGE_REPORTING_MIN_ORDER: is the minimum order of a page to be captured in the
>>>> bitmap and get reported to the hypervisor.
>>>>
>>>> For the discussion where we are comparing the two series, the performance
>>>> aspect is more relevant and important.
>>>> It turns out that with the current implementation the number of vmexit with
>>>> PAGE_REPORTING_MIN_ORDER as pageblock_order or (MAX_ORDER - 2) are significantly
>>>> large when compared to (MAX_ORDER - 1).
>>>>
>>>> One of the reason could be that the lower order pages are not getting sufficient
>>>> time to merge with each other as a result they are somehow getting reported
>>>> with 2 separate reporting requests. Hence, generating more vmexits. Where
>>>> as with (MAX_ORDER - 1) we don't have that kind of situation as I never try
>>>> to report any page which has order < (MAX_ORDER - 1).
>>>>
>>>> To fix this, I might have to further limit the reporting which could allow the
>>>> lower order pages to further merge and hence reduce the VM exits. I will try to
>>>> do some experiments to see if I can fix this. In any case, if anyone has a
>>>> suggestion I would be more than happy to look in that direction.
>>> That doesn't make any sense. My setup using MAX_ORDER - 2, aka
>>> pageblock_order, as the limit doesn't experience the same performance
>>> issues the bitmap solution does. That leads me to believe the issue
>>> isn't that the pages have not had a chance to be merged.
>>>
>> So, I did run your series as well with a few syfs variables to see how many
>> pages of order (MAX_ORDER - 1) or (MAX_ORDER - 2) are reported at the end of
>> will-it-scale/page_fault4 test.
>> What I observed is the number of (MAX_ORDER - 2) pages which were getting
>> reported in your case were lesser than what has been reported in mine with
>> pageblock_order.
>> As you have mentioned below about putting pages in a certain part of the
>> free list might have also an impact.
> Another thing you may want to check is how often your notifier is
> triggering. One thing I did was to intentionally put a fairly significant
> delay from the time the notification is scheduled to when it will start. I
> did this because when an application is freeing memory it will take some
> time to completely free it, and if it is going to reallocate it anyway
> there is no need to rush since it would just invalidate the pages you
> reported anyway.

Yes, I agree with this. This could have an impact on the performance.

>
>>>> Following are the numbers I gathered on a 30GB single NUMA, 16 vCPU guest
>>>> affined to a single host-NUMA:
>>>>
>>>> On 16th vCPU:
>>>> With PAGE_REPORTING_MIN_ORDER as (MAX_ORDER - 1):
>>>> % Dip on the number of Processes = 1.3 %
>>>> % Dip on the number of  Threads  = 5.7 %
>>>>
>>>> With PAGE_REPORTING_MIN_ORDER as With (pageblock_order):
>>>> % Dip on the number of Processes = 5 %
>>>> % Dip on the number of  Threads  = 20 %
>>> So I don't hold much faith in the threads numbers. I have seen the
>>> variability be as high as 14% between runs.
>> That's interesting. Do you see the variability even with an unmodified kernel?
>> Somehow, for me it seems pretty consistent. However, if you are running with
>> multiple NUMA nodes it might have a significant impact on the numbers.
>>
>> For now, I am only running a single NUMA guest affined to a single NUMA
>> of host.
> My guest should be running in a single node, and yes I saw it with just
> the unmodified kernel. I am running on the linux-next 20191031 kernel.

I am using Linus' tree and working on top of Linux 5.4-rc5.
Not sure how much difference that will make.

>  It
> did occur to me that it seems like the performance for the threads number
> recently increased. There might be a guest config option impacting things
> as well since I know I have changed a number of variables since then.

This is quite interesting because, if I remember correctly, you reported a
huge degradation of over 30% with my patch-set.
So far, I have been able to reproduce a significant degradation in the number
of threads launched on the 16th vCPU, but not in the number of processes,
which you are observing.
I am wondering if I am still missing something in my test setup.

>
>>>> Michal's suggestion:
>>>> I was able to get the prototype which could use page-isolation API:
>>>> start_isolate_page_range()/undo_isolate_page_range() to work.
>>>> But the issue mentioned above was also evident with it.
>>>>
>>>> Hence, I think before moving to the decision whether I want to use
>>>> __isolate_free_page() which isolates pages from the buddy or
>>>> start/undo_isolate_page_range() which just marks the page as MIGRATE_ISOLATE,
>>>> it is important for me to resolve the above-mentioned issue.
>>> I'd be curious how you are avoiding causing memory starvation if you
>>> are isolating ranges of memory that have been recently freed.
>> I would still be marking only 32 pages as MIGRATE_ISOLATE at a time. It is
>> exactly same as that of isolating limited chunk of pages from the buddy.
>> For example if I have a pfn:x of order y then I pass
>> start_isolate_page_range(x, x+y, mt, 0). So at the end we
>> will have 32 such entries marked as MIGRATE_ISOLATE.
> I get that you are isolating the same amount of memory. What I was getting
> at is that __isolate_free_page has a check in it to make certain you are
> not pulling memory that would put you below the minimum watermark. As far
> as I know there isn't anything like that for the page isolation framework
> since it is normally used for offlining memory before it is hotplugged
> away.

Yes, that is correct. I will have to take care of that explicitly.
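As a rough sketch of what that explicit check could look like, modelled on the
watermark test __isolate_free_page() already performs; the helper names, the
alloc_flags of 0, and the error handling are all illustrative, not code from
the posted series:

#include <linux/mmzone.h>
#include <linux/page-isolation.h>
#include <linux/errno.h>

/* Refuse to isolate a 2^order chunk if doing so would leave the zone
 * below its minimum watermark, in the spirit of __isolate_free_page(). */
static bool report_can_isolate(struct zone *zone, unsigned int order)
{
	unsigned long watermark = min_wmark_pages(zone) + (1UL << order);

	return zone_watermark_ok(zone, 0, watermark, 0, 0);
}

static int report_isolate_chunk(struct zone *zone, unsigned long pfn,
				unsigned int order, int migratetype)
{
	if (!report_can_isolate(zone, order))
		return -ENOMEM;

	/* Mark the range MIGRATE_ISOLATE so it cannot be handed out
	 * while it is being reported.  Note the isolation API works on
	 * pageblock granularity, which real code would have to respect. */
	return start_isolate_page_range(pfn, pfn + (1UL << order),
					migratetype, 0);
}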

>
>>>> Previous discussions:
>>>> More about how we ended up with these two approaches could be found at [3] &
>>>> [4] explained by Alexander & David.
>>>>
>>>> [1] https://lore.kernel.org/lkml/20190812131235.27244-1-nitesh@redhat.com/
>>>> [2] https://lkml.org/lkml/2019/10/2/425
>>>> [3] https://lkml.org/lkml/2019/10/23/1166
>>>> [4] https://lkml.org/lkml/2019/9/12/48
>>>>
>>> So one thing you may want to consider would be how placement of the
>>> buffers will impact your performance.
>>>
>>> One thing I realized I was doing wrong with my approach was scanning
>>> for pages starting at the tail and then working up. It greatly hurt
>>> the efficiency of my search since in the standard case most of the
>>> free memory will be placed at the head and only with shuffling enabled
>>> do I really need to worry about things getting mixed up with the tail.
>>>
>>> I suspect you may be similarly making things more difficult for
>>> yourself by placing the reported pages back on the head of the list
>>> instead of placing them at the tail where they will not be reallocated
>>> immediately.
>> hmm, I see. I will try and explore this.
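To make the placement point concrete, here is a sketch against the 5.4-era
free_area layout (not code from either series) of returning a reported page to
the tail of its list, so allocations keep being satisfied from the head, i.e.
from unreported and likely still-warm pages:

#include <linux/mm.h>
#include <linux/mmzone.h>

/* Sketch: caller holds zone->lock and the page is already a buddy
 * page of the given order; we only choose which end of the list it
 * goes back to. */
static void return_reported_page(struct zone *zone, struct page *page,
				 unsigned int order, int migratetype)
{
	struct free_area *area = &zone->free_area[order];

	/* Tail, not head: the reporting walk starts from the head, and
	 * the allocator also pops from the head, so reported pages sit
	 * at the back until everything else has been handed out. */
	list_add_tail(&page->lru, &area->free_list[migratetype]);
	area->nr_free++;
}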
>>
>
-- 
Thanks
Nitesh




* Re: + mm-introduce-reported-pages.patch added to -mm tree
  2019-11-12 18:34                         ` Alexander Duyck
  2019-11-12 21:05                           ` David Hildenbrand
@ 2019-11-13 18:51                           ` Nitesh Narayan Lal
  1 sibling, 0 replies; 47+ messages in thread
From: Nitesh Narayan Lal @ 2019-11-13 18:51 UTC (permalink / raw)
  To: Alexander Duyck, David Hildenbrand, Michal Hocko
  Cc: akpm, aarcange, dan.j.williams, dave.hansen, konrad.wilk,
	lcapitulino, mgorman, mm-commits, mst, osalvador, pagupta,
	pbonzini, riel, vbabka, wei.w.wang, willy, yang.zhang.wz,
	linux-mm


On 11/12/19 1:34 PM, Alexander Duyck wrote:
> On Tue, 2019-11-12 at 14:04 +0100, David Hildenbrand wrote:
>>>>> fact is it is still invasive, just to different parts of the mm subsystem.
>>>> I'd love to see how it uses the page isolation framework, and only has a 
>>>> single hook to queue pages. I don't like the way pages are pulled out of 
>>>> the buddy in Nitesh's approach currently. What you have is cleaner.
>>> I don't see how you could use the page isolation framework to pull out
>>> free pages. Is there a thread somewhere on the topic that I missed?
>> It's basically only isolating pages while reporting them, and not
>> pulling them out of the buddy (IOW, you move the pages to the isolate
>> queues where nobody is allowed to touch them, and set the
>> migratetype properly). This, e.g., makes other users of page isolation
>> (e.g., memory offlining, alloc_contig_range()) play nicely with these
>> isolated pages: "somebody else just isolated them, please try again."
> How so? If I understand correctly there isn't anything that prevents you
> from isolating an already isolated page, is there? Last I knew isolated
> pages are still considered "movable" since they are still buddy pages,
> aren't they?
>
> Also this seems like it would have other implications, since isolating a
> page kicks off the memory notifier; as a result a balloon driver would
> then free the pages back out so that they could be isolated, on the
> assumption the region is going offline.
>
>> start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated()
>> along with a lockless check if the page is free.
> Okay, that part I think I get. However doesn't all that logic more or less
> ignore the watermarks? It seems like you could cause an OOM if you don't
> have the necessary checks in place for that.
>
>> I think it should be something like this (ignoring different
>> migratetypes and such for now)
>>
>> 1. Test lockless if page is free: Not free? Done.
> So this should help to reduce the likelihood of races in the steps below.
> However, it might also be useful if the code had some other way to tell
> that it is done than just making another pass through the bitmap.
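To make the sequence under discussion concrete, here is a rough sketch of the
lockless-check / isolate / report / undo flow (migratetype handling, watermarks
and the actual reporting to the hypervisor are omitted, and every name below is
illustrative rather than taken from either series):

#include <linux/mm.h>
#include <linux/page-isolation.h>

static int report_free_chunk(unsigned long pfn, unsigned int order,
			     int migratetype)
{
	struct page *page = pfn_to_page(pfn);
	int ret;

	/* 1. Lockless check: for buddy pages the order is kept in
	 *    page_private().  If the page is no longer a free buddy of
	 *    at least this order, someone allocated it -- nothing to do. */
	if (!PageBuddy(page) || page_private(page) < order)
		return 0;

	/* 2. Mark the range MIGRATE_ISOLATE so it cannot be allocated
	 *    while it is being reported. */
	ret = start_isolate_page_range(pfn, pfn + (1UL << order),
				       migratetype, 0);
	if (ret)
		return ret;	/* raced with another user, retry later */

	/* 3. Report the range, e.g. queue it for the virtio device. */

	/* 4. Return the range to the buddy with its old migratetype. */
	undo_isolate_page_range(pfn, pfn + (1UL << order), migratetype);
	return 0;
}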
>
> One thing I had brought up with Nitesh was the idea of maybe doing some
> sort of RCU bitmap type approach. Basically while we hold the zone lock we
> could swap out the old bitmap for a new one. We could probably even keep a
> counter at the start of the structure so that we could track how many bits
> are actually set there. Then it becomes less likely to have a race where
> you free a page and set the bit, and the hinting thread tests and clears
> the bit but doesn't see the freed page since it is not synchronized.
> Otherwise your notification setup and reporting thread may need a few smp
> barriers added where necessary.
>
>
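To illustrate the bitmap-swap idea, a very rough sketch follows; the structures
and functions are invented for this example and are not part of either series:

#include <linux/rcupdate.h>
#include <linux/spinlock.h>
#include <linux/mmzone.h>
#include <linux/atomic.h>
#include <linux/bitops.h>

/* Hypothetical per-zone reporting state: a counter of set bits kept
 * together with the bitmap, published as one unit so readers always
 * see a matching pair. */
struct reporting_bitmap {
	atomic64_t	nr_set;
	unsigned long	bits[];
};

struct reporting_state {
	struct reporting_bitmap __rcu *bitmap;
};

/* Freeing path: note that this pfn-derived bit was freed since the
 * last reporting pass. */
static void reporting_mark_freed(struct reporting_state *state,
				 unsigned long bit)
{
	struct reporting_bitmap *bm;

	rcu_read_lock();
	bm = rcu_dereference(state->bitmap);
	if (bm && !test_and_set_bit(bit, bm->bits))
		atomic64_inc(&bm->nr_set);
	rcu_read_unlock();
}

/* Reporting thread: swap in a fresh bitmap under zone->lock and scan
 * the old one at leisure; pages freed afterwards land in the new
 * bitmap and are picked up on the next pass, which avoids the
 * test-and-clear vs. free race described above. */
static struct reporting_bitmap *
reporting_swap_bitmap(struct reporting_state *state, struct zone *zone,
		      struct reporting_bitmap *fresh)
{
	struct reporting_bitmap *old;
	unsigned long flags;

	spin_lock_irqsave(&zone->lock, flags);
	old = rcu_dereference_protected(state->bitmap,
					lockdep_is_held(&zone->lock));
	rcu_assign_pointer(state->bitmap, fresh);
	spin_unlock_irqrestore(&zone->lock, flags);

	synchronize_rcu();	/* wait out concurrent reporting_mark_freed() */
	return old;		/* caller scans it, then frees it */
}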

One generic request:
I would appreciate it if you could keep me in the cc while discussing.
Otherwise, with the amount of discussion, it's quite easy to go out of sync.


-- 
Thanks
Nitesh




end of thread, other threads:[~2019-11-13 18:52 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20191106000547.juQRi83gi%akpm@linux-foundation.org>
2019-11-06 12:16 ` + mm-introduce-reported-pages.patch added to -mm tree Michal Hocko
2019-11-06 14:09   ` David Hildenbrand
2019-11-06 16:35     ` Alexander Duyck
2019-11-06 16:54       ` Michal Hocko
2019-11-06 17:48         ` Alexander Duyck
2019-11-06 22:11           ` Mel Gorman
2019-11-06 23:38             ` David Hildenbrand
2019-11-07  0:20             ` Alexander Duyck
2019-11-07 10:20               ` Mel Gorman
2019-11-07 16:07                 ` Alexander Duyck
2019-11-08  9:43                   ` Mel Gorman
2019-11-08 16:17                     ` Alexander Duyck
2019-11-08 18:41                       ` Mel Gorman
2019-11-08 20:29                         ` Alexander Duyck
2019-11-09 14:57                           ` Mel Gorman
2019-11-10 18:03                             ` Alexander Duyck
2019-11-06 23:33           ` David Hildenbrand
2019-11-07  0:20             ` Dave Hansen
2019-11-07  0:52               ` David Hildenbrand
2019-11-07 17:12                 ` Dave Hansen
2019-11-07 17:46                   ` Michal Hocko
2019-11-07 18:08                     ` Dave Hansen
2019-11-07 18:12                     ` Alexander Duyck
2019-11-08  9:57                       ` Michal Hocko
2019-11-08 16:43                         ` Alexander Duyck
2019-11-07 18:46                   ` Qian Cai
2019-11-07 18:02             ` Alexander Duyck
2019-11-07 19:37               ` Nitesh Narayan Lal
2019-11-07 22:46                 ` Alexander Duyck
2019-11-07 22:43               ` David Hildenbrand
2019-11-08  0:42                 ` Alexander Duyck
2019-11-08  7:06                   ` David Hildenbrand
2019-11-08 17:18                     ` Alexander Duyck
2019-11-12 13:04                       ` David Hildenbrand
2019-11-12 18:34                         ` Alexander Duyck
2019-11-12 21:05                           ` David Hildenbrand
2019-11-12 22:17                             ` David Hildenbrand
2019-11-12 22:19                             ` Alexander Duyck
2019-11-12 23:10                               ` David Hildenbrand
2019-11-13  0:31                                 ` Alexander Duyck
2019-11-13 18:51                           ` Nitesh Narayan Lal
2019-11-06 16:49   ` Nitesh Narayan Lal
2019-11-11 18:52   ` Nitesh Narayan Lal
2019-11-11 22:00     ` Alexander Duyck
2019-11-12 15:19       ` Nitesh Narayan Lal
2019-11-12 16:18         ` Alexander Duyck
2019-11-13 18:39           ` Nitesh Narayan Lal
