All of lore.kernel.org
 help / color / mirror / Atom feed
* Thoughts on simple scanner approach for free page hinting
@ 2019-04-06  0:09 ` Alexander Duyck
  0 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2019-04-06  0:09 UTC (permalink / raw)
  To: Michael S. Tsirkin, David Hildenbrand, Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

So I am starting this thread as a spot to collect my thoughts on the
current guest free page hinting design as well as point out a few
possible things we could do to improve upon it.

1. The current design isn't likely going to scale well to multiple
VCPUs. The issue specifically is that the zone lock must be held to
pull pages off of the free list and to place them back there once they
have been hinted upon. As a result it would likely make sense to try
to limit ourselves to only having one thread performing the actual
hinting so that we can avoid running into issues with lock contention
between threads.

2. There are currently concerns about the hinting triggering false OOM
situations if too much memory is isolated while it is being hinted. My
thought on this is to simply avoid the issue by only hint on a limited
amount of memory at a time. Something like 64MB should be a workable
limit without introducing much in the way of regressions. However as a
result of this we can easily be overrun while waiting on the host to
process the hinting request. As such we will probably need a way to
walk the free list and free pages after they have been freed instead
of trying to do it as they are freed.

3. Even with the current buffering which is still on the larger side
it is possible to overrun the hinting limits if something causes the
host to stall and a large swath of memory is released. As such we are
still going to need some sort of scanning mechanism or will have to
live with not providing accurate hints.

4. In my opinion, the code overall is likely more complex then it
needs to be. We currently have 2 allocations that have to occur every
time we provide a hint all the way to the host, ideally we should not
need to allocate more memory to provide hints. We should be able to
hold the memory use for a memory hint device constant and simply map
the page address and size to the descriptors of the virtio-ring.

With that said I have a few ideas that may help to address the 4
issues called out above. The basic idea is simple. We use a high water
mark based on zone->free_area[order].nr_free to determine when to wake
up a thread to start hinting memory out of a given free area. From
there we allocate non-"Offline" pages from the free area and assign
them to the hinting queue up to 64MB at a time. Once the hinting is
completed we mark them "Offline" and add them to the tail of the
free_area. Doing this we should cycle the non-"Offline" pages slowly
out of the free_area. In addition the search cost should be minimal
since all of the "Offline" pages should be aggregated to the tail of
the free_area so all pages allocated off of the free_area will be the
non-"Offline" pages until we shift over to them all being "Offline".
This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
since the only real consumer of add_to_free_area_tail is
__free_one_page which uses it to place a page with an order less than
MAX_ORDER - 2 on the tail of a free_area assuming that it should be
freeing the buddy of that page shortly. The only other issue with
adding to tail would be the memory shuffling which was recently added,
but I don't see that as being something that will be enabled in most
cases so we could probably just make the features mutually exclusive,
at least for now.

So if I am not mistaken this would essentially require a couple
changes to the mm infrastructure in order for this to work.

First we would need to split nr_free into two counters, something like
nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
value currently used for nr_free. When we pulled the pages for hinting
we would reduce the nr_freed value and then add back to it when the
pages are returned. When pages are allocated they would increment the
nr_bound value. The idea behind this is that we can record nr_free
when we collect the pages and save it to some local value. This value
could then tell us how many new pages have been added that have not
been hinted upon.

In addition we will need some way to identify which pages have been
hinted on and which have not. The way I believe easiest to do this
would be to overload the PageType value so that we could essentially
have two values for "Buddy" pages. We would have our standard "Buddy"
pages, and "Buddy" pages that also have the "Offline" value set in the
PageType field. Tracking the Online vs Offline pages this way would
actually allow us to do this with almost no overhead as the mapcount
value is already being reset to clear the "Buddy" flag so adding a
"Offline" flag to this clearing should come at no additional cost.

Lastly we would need to create a specialized function for allocating
the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
"Offline" pages. I'm thinking the alloc function it would look
something like __rmqueue_smallest but without the "expand" and needing
to modify the !page check to also include a check to verify the page
is not "Offline". As far as the changes to __free_one_page it would be
a 2 line change to test for the PageType being offline, and if it is
to call add_to_free_area_tail instead of add_to_free_area.

Anyway this email ended up being pretty massive by the time I was
done. Feel free to reply to parts of it and we can break it out into
separate threads of discussion as necessary. I will start working on
coding some parts of this next week.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Thoughts on simple scanner approach for free page hinting
@ 2019-04-06  0:09 ` Alexander Duyck
  0 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2019-04-06  0:09 UTC (permalink / raw)
  To: Michael S. Tsirkin, David Hildenbrand, Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

So I am starting this thread as a spot to collect my thoughts on the
current guest free page hinting design as well as point out a few
possible things we could do to improve upon it.

1. The current design isn't likely going to scale well to multiple
VCPUs. The issue specifically is that the zone lock must be held to
pull pages off of the free list and to place them back there once they
have been hinted upon. As a result it would likely make sense to try
to limit ourselves to only having one thread performing the actual
hinting so that we can avoid running into issues with lock contention
between threads.

2. There are currently concerns about the hinting triggering false OOM
situations if too much memory is isolated while it is being hinted. My
thought on this is to simply avoid the issue by only hint on a limited
amount of memory at a time. Something like 64MB should be a workable
limit without introducing much in the way of regressions. However as a
result of this we can easily be overrun while waiting on the host to
process the hinting request. As such we will probably need a way to
walk the free list and free pages after they have been freed instead
of trying to do it as they are freed.

3. Even with the current buffering which is still on the larger side
it is possible to overrun the hinting limits if something causes the
host to stall and a large swath of memory is released. As such we are
still going to need some sort of scanning mechanism or will have to
live with not providing accurate hints.

4. In my opinion, the code overall is likely more complex then it
needs to be. We currently have 2 allocations that have to occur every
time we provide a hint all the way to the host, ideally we should not
need to allocate more memory to provide hints. We should be able to
hold the memory use for a memory hint device constant and simply map
the page address and size to the descriptors of the virtio-ring.

With that said I have a few ideas that may help to address the 4
issues called out above. The basic idea is simple. We use a high water
mark based on zone->free_area[order].nr_free to determine when to wake
up a thread to start hinting memory out of a given free area. From
there we allocate non-"Offline" pages from the free area and assign
them to the hinting queue up to 64MB at a time. Once the hinting is
completed we mark them "Offline" and add them to the tail of the
free_area. Doing this we should cycle the non-"Offline" pages slowly
out of the free_area. In addition the search cost should be minimal
since all of the "Offline" pages should be aggregated to the tail of
the free_area so all pages allocated off of the free_area will be the
non-"Offline" pages until we shift over to them all being "Offline".
This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
since the only real consumer of add_to_free_area_tail is
__free_one_page which uses it to place a page with an order less than
MAX_ORDER - 2 on the tail of a free_area assuming that it should be
freeing the buddy of that page shortly. The only other issue with
adding to tail would be the memory shuffling which was recently added,
but I don't see that as being something that will be enabled in most
cases so we could probably just make the features mutually exclusive,
at least for now.

So if I am not mistaken this would essentially require a couple
changes to the mm infrastructure in order for this to work.

First we would need to split nr_free into two counters, something like
nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
value currently used for nr_free. When we pulled the pages for hinting
we would reduce the nr_freed value and then add back to it when the
pages are returned. When pages are allocated they would increment the
nr_bound value. The idea behind this is that we can record nr_free
when we collect the pages and save it to some local value. This value
could then tell us how many new pages have been added that have not
been hinted upon.

In addition we will need some way to identify which pages have been
hinted on and which have not. The way I believe easiest to do this
would be to overload the PageType value so that we could essentially
have two values for "Buddy" pages. We would have our standard "Buddy"
pages, and "Buddy" pages that also have the "Offline" value set in the
PageType field. Tracking the Online vs Offline pages this way would
actually allow us to do this with almost no overhead as the mapcount
value is already being reset to clear the "Buddy" flag so adding a
"Offline" flag to this clearing should come at no additional cost.

Lastly we would need to create a specialized function for allocating
the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
"Offline" pages. I'm thinking the alloc function it would look
something like __rmqueue_smallest but without the "expand" and needing
to modify the !page check to also include a check to verify the page
is not "Offline". As far as the changes to __free_one_page it would be
a 2 line change to test for the PageType being offline, and if it is
to call add_to_free_area_tail instead of add_to_free_area.

Anyway this email ended up being pretty massive by the time I was
done. Feel free to reply to parts of it and we can break it out into
separate threads of discussion as necessary. I will start working on
coding some parts of this next week.

Thanks.

- Alex


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-06  0:09 ` Alexander Duyck
  (?)
@ 2019-04-08 12:24 ` Nitesh Narayan Lal
  2019-04-08 15:18     ` Alexander Duyck
  -1 siblings, 1 reply; 31+ messages in thread
From: Nitesh Narayan Lal @ 2019-04-08 12:24 UTC (permalink / raw)
  To: Alexander Duyck, Michael S. Tsirkin, David Hildenbrand
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli


[-- Attachment #1.1: Type: text/plain, Size: 5972 bytes --]


On 4/5/19 8:09 PM, Alexander Duyck wrote:
> So I am starting this thread as a spot to collect my thoughts on the
> current guest free page hinting design as well as point out a few
> possible things we could do to improve upon it.
>
> 1. The current design isn't likely going to scale well to multiple
> VCPUs. The issue specifically is that the zone lock must be held to
> pull pages off of the free list and to place them back there once they
> have been hinted upon. As a result it would likely make sense to try
> to limit ourselves to only having one thread performing the actual
> hinting so that we can avoid running into issues with lock contention
> between threads.
>
> 2. There are currently concerns about the hinting triggering false OOM
> situations if too much memory is isolated while it is being hinted. My
> thought on this is to simply avoid the issue by only hint on a limited
> amount of memory at a time. Something like 64MB should be a workable
> limit without introducing much in the way of regressions. However as a
> result of this we can easily be overrun while waiting on the host to
> process the hinting request. As such we will probably need a way to
> walk the free list and free pages after they have been freed instead
> of trying to do it as they are freed.
>
> 3. Even with the current buffering which is still on the larger side
> it is possible to overrun the hinting limits if something causes the
> host to stall and a large swath of memory is released. As such we are
> still going to need some sort of scanning mechanism or will have to
> live with not providing accurate hints.
>
> 4. In my opinion, the code overall is likely more complex then it
> needs to be. We currently have 2 allocations that have to occur every
> time we provide a hint all the way to the host, ideally we should not
> need to allocate more memory to provide hints. We should be able to
> hold the memory use for a memory hint device constant and simply map
> the page address and size to the descriptors of the virtio-ring.
>
> With that said I have a few ideas that may help to address the 4
> issues called out above. The basic idea is simple. We use a high water
> mark based on zone->free_area[order].nr_free to determine when to wake
> up a thread to start hinting memory out of a given free area. From
> there we allocate non-"Offline" pages from the free area and assign
> them to the hinting queue up to 64MB at a time. Once the hinting is
> completed we mark them "Offline" and add them to the tail of the
> free_area. Doing this we should cycle the non-"Offline" pages slowly
> out of the free_area. 
any ideas about how are you planning to control this?
> In addition the search cost should be minimal
> since all of the "Offline" pages should be aggregated to the tail of
> the free_area so all pages allocated off of the free_area will be the
> non-"Offline" pages until we shift over to them all being "Offline".
> This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
> since the only real consumer of add_to_free_area_tail is
> __free_one_page which uses it to place a page with an order less than
> MAX_ORDER - 2 on the tail of a free_area assuming that it should be
> freeing the buddy of that page shortly. The only other issue with
> adding to tail would be the memory shuffling which was recently added,
> but I don't see that as being something that will be enabled in most
> cases so we could probably just make the features mutually exclusive,
> at least for now.
>
> So if I am not mistaken this would essentially require a couple
> changes to the mm infrastructure in order for this to work.
>
> First we would need to split nr_free into two counters, something like
> nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
> value currently used for nr_free. When we pulled the pages for hinting
> we would reduce the nr_freed value and then add back to it when the
> pages are returned. When pages are allocated they would increment the
> nr_bound value. The idea behind this is that we can record nr_free
> when we collect the pages and save it to some local value. This value
> could then tell us how many new pages have been added that have not
> been hinted upon.
>
> In addition we will need some way to identify which pages have been
> hinted on and which have not. The way I believe easiest to do this
> would be to overload the PageType value so that we could essentially
> have two values for "Buddy" pages. We would have our standard "Buddy"
> pages, and "Buddy" pages that also have the "Offline" value set in the
> PageType field. Tracking the Online vs Offline pages this way would
> actually allow us to do this with almost no overhead as the mapcount
> value is already being reset to clear the "Buddy" flag so adding a
> "Offline" flag to this clearing should come at no additional cost.
>
> Lastly we would need to create a specialized function for allocating
> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
> "Offline" pages. I'm thinking the alloc function it would look
> something like __rmqueue_smallest but without the "expand" and needing
> to modify the !page check to also include a check to verify the page
> is not "Offline". As far as the changes to __free_one_page it would be
> a 2 line change to test for the PageType being offline, and if it is
> to call add_to_free_area_tail instead of add_to_free_area.
Is it possible that once the pages are offline, there is a large
allocation request in the guest needing those offline pages as well?
>
> Anyway this email ended up being pretty massive by the time I was
> done. Feel free to reply to parts of it and we can break it out into
> separate threads of discussion as necessary. I will start working on
> coding some parts of this next week.
>
> Thanks.
>
> - Alex
-- 
Regards
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-08 12:24 ` Nitesh Narayan Lal
@ 2019-04-08 15:18     ` Alexander Duyck
  0 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2019-04-08 15:18 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Michael S. Tsirkin, David Hildenbrand, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Apr 8, 2019 at 5:24 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 4/5/19 8:09 PM, Alexander Duyck wrote:
> > So I am starting this thread as a spot to collect my thoughts on the
> > current guest free page hinting design as well as point out a few
> > possible things we could do to improve upon it.
> >
> > 1. The current design isn't likely going to scale well to multiple
> > VCPUs. The issue specifically is that the zone lock must be held to
> > pull pages off of the free list and to place them back there once they
> > have been hinted upon. As a result it would likely make sense to try
> > to limit ourselves to only having one thread performing the actual
> > hinting so that we can avoid running into issues with lock contention
> > between threads.
> >
> > 2. There are currently concerns about the hinting triggering false OOM
> > situations if too much memory is isolated while it is being hinted. My
> > thought on this is to simply avoid the issue by only hint on a limited
> > amount of memory at a time. Something like 64MB should be a workable
> > limit without introducing much in the way of regressions. However as a
> > result of this we can easily be overrun while waiting on the host to
> > process the hinting request. As such we will probably need a way to
> > walk the free list and free pages after they have been freed instead
> > of trying to do it as they are freed.
> >
> > 3. Even with the current buffering which is still on the larger side
> > it is possible to overrun the hinting limits if something causes the
> > host to stall and a large swath of memory is released. As such we are
> > still going to need some sort of scanning mechanism or will have to
> > live with not providing accurate hints.
> >
> > 4. In my opinion, the code overall is likely more complex then it
> > needs to be. We currently have 2 allocations that have to occur every
> > time we provide a hint all the way to the host, ideally we should not
> > need to allocate more memory to provide hints. We should be able to
> > hold the memory use for a memory hint device constant and simply map
> > the page address and size to the descriptors of the virtio-ring.
> >
> > With that said I have a few ideas that may help to address the 4
> > issues called out above. The basic idea is simple. We use a high water
> > mark based on zone->free_area[order].nr_free to determine when to wake
> > up a thread to start hinting memory out of a given free area. From
> > there we allocate non-"Offline" pages from the free area and assign
> > them to the hinting queue up to 64MB at a time. Once the hinting is
> > completed we mark them "Offline" and add them to the tail of the
> > free_area. Doing this we should cycle the non-"Offline" pages slowly
> > out of the free_area.
> any ideas about how are you planning to control this?

You mean in terms of switching the hinting on/off? The setup should be
pretty simple. Basically we would still need a hook like the one you
added after the allocation to determine where the free page ultimately
landed and to do a check against the high water mark I mentioned.
Basically if there is something like 2X the number of pages needed to
fulfill the 64MB requirement we could then kick off a thread running
on the zone to begin populating the hints and notifying the
virtio-balloon interface. When we can no longer fill the ring we would
simply stop the thread until we get back to the 2X state for nr_freed
versus the last nr_freed value we had hinted upon. It wouldn't be
dissimilar to how we currently handle the Tx path in many NICs where
we shut off hinting.

For examples of doing something like this you could look at the Rx
softIRQ handling in the NIC drivers. Basically the idea there is you
trigger the event once, and then the thread is running until all work
has been completed. The thread itself is limiting itself to only
processing some number of fixed buffers for each request, and when it
can no longer get a full set it stops and waits to be rescheduled by
an interrupt.

> > In addition the search cost should be minimal
> > since all of the "Offline" pages should be aggregated to the tail of
> > the free_area so all pages allocated off of the free_area will be the
> > non-"Offline" pages until we shift over to them all being "Offline".
> > This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
> > since the only real consumer of add_to_free_area_tail is
> > __free_one_page which uses it to place a page with an order less than
> > MAX_ORDER - 2 on the tail of a free_area assuming that it should be
> > freeing the buddy of that page shortly. The only other issue with
> > adding to tail would be the memory shuffling which was recently added,
> > but I don't see that as being something that will be enabled in most
> > cases so we could probably just make the features mutually exclusive,
> > at least for now.
> >
> > So if I am not mistaken this would essentially require a couple
> > changes to the mm infrastructure in order for this to work.
> >
> > First we would need to split nr_free into two counters, something like
> > nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
> > value currently used for nr_free. When we pulled the pages for hinting
> > we would reduce the nr_freed value and then add back to it when the
> > pages are returned. When pages are allocated they would increment the
> > nr_bound value. The idea behind this is that we can record nr_free
> > when we collect the pages and save it to some local value. This value
> > could then tell us how many new pages have been added that have not
> > been hinted upon.
> >
> > In addition we will need some way to identify which pages have been
> > hinted on and which have not. The way I believe easiest to do this
> > would be to overload the PageType value so that we could essentially
> > have two values for "Buddy" pages. We would have our standard "Buddy"
> > pages, and "Buddy" pages that also have the "Offline" value set in the
> > PageType field. Tracking the Online vs Offline pages this way would
> > actually allow us to do this with almost no overhead as the mapcount
> > value is already being reset to clear the "Buddy" flag so adding a
> > "Offline" flag to this clearing should come at no additional cost.
> >
> > Lastly we would need to create a specialized function for allocating
> > the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
> > "Offline" pages. I'm thinking the alloc function it would look
> > something like __rmqueue_smallest but without the "expand" and needing
> > to modify the !page check to also include a check to verify the page
> > is not "Offline". As far as the changes to __free_one_page it would be
> > a 2 line change to test for the PageType being offline, and if it is
> > to call add_to_free_area_tail instead of add_to_free_area.
> Is it possible that once the pages are offline, there is a large
> allocation request in the guest needing those offline pages as well?

It is possible. However the behavior here would be no different from a
NIC driver. NIC drivers will sit on a swath of memory for Rx purposes
waiting for the DMA to occur. Here we are sitting on 64MB which for a
large allocation should not be that significant.

As far as avoiding it, I don't think there is any way we can avoid
such an event completely. There are scenerios where the hitning will
get hung up while sitting on memory for an extended period of time.
That is why I am thinking our best mitigation for now would be to keep
the amount of hinting we are doing confined to something on the
smaller side such as 64M or less which I have already mentioned. By
doing that if we do hit one of the problematic scenarios we should
have minimal impact.

> >
> > Anyway this email ended up being pretty massive by the time I was
> > done. Feel free to reply to parts of it and we can break it out into
> > separate threads of discussion as necessary. I will start working on
> > coding some parts of this next week.
> >
> > Thanks.
> >
> > - Alex
> --
> Regards
> Nitesh
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
@ 2019-04-08 15:18     ` Alexander Duyck
  0 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2019-04-08 15:18 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Michael S. Tsirkin, David Hildenbrand, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Apr 8, 2019 at 5:24 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 4/5/19 8:09 PM, Alexander Duyck wrote:
> > So I am starting this thread as a spot to collect my thoughts on the
> > current guest free page hinting design as well as point out a few
> > possible things we could do to improve upon it.
> >
> > 1. The current design isn't likely going to scale well to multiple
> > VCPUs. The issue specifically is that the zone lock must be held to
> > pull pages off of the free list and to place them back there once they
> > have been hinted upon. As a result it would likely make sense to try
> > to limit ourselves to only having one thread performing the actual
> > hinting so that we can avoid running into issues with lock contention
> > between threads.
> >
> > 2. There are currently concerns about the hinting triggering false OOM
> > situations if too much memory is isolated while it is being hinted. My
> > thought on this is to simply avoid the issue by only hint on a limited
> > amount of memory at a time. Something like 64MB should be a workable
> > limit without introducing much in the way of regressions. However as a
> > result of this we can easily be overrun while waiting on the host to
> > process the hinting request. As such we will probably need a way to
> > walk the free list and free pages after they have been freed instead
> > of trying to do it as they are freed.
> >
> > 3. Even with the current buffering which is still on the larger side
> > it is possible to overrun the hinting limits if something causes the
> > host to stall and a large swath of memory is released. As such we are
> > still going to need some sort of scanning mechanism or will have to
> > live with not providing accurate hints.
> >
> > 4. In my opinion, the code overall is likely more complex then it
> > needs to be. We currently have 2 allocations that have to occur every
> > time we provide a hint all the way to the host, ideally we should not
> > need to allocate more memory to provide hints. We should be able to
> > hold the memory use for a memory hint device constant and simply map
> > the page address and size to the descriptors of the virtio-ring.
> >
> > With that said I have a few ideas that may help to address the 4
> > issues called out above. The basic idea is simple. We use a high water
> > mark based on zone->free_area[order].nr_free to determine when to wake
> > up a thread to start hinting memory out of a given free area. From
> > there we allocate non-"Offline" pages from the free area and assign
> > them to the hinting queue up to 64MB at a time. Once the hinting is
> > completed we mark them "Offline" and add them to the tail of the
> > free_area. Doing this we should cycle the non-"Offline" pages slowly
> > out of the free_area.
> any ideas about how are you planning to control this?

You mean in terms of switching the hinting on/off? The setup should be
pretty simple. Basically we would still need a hook like the one you
added after the allocation to determine where the free page ultimately
landed and to do a check against the high water mark I mentioned.
Basically if there is something like 2X the number of pages needed to
fulfill the 64MB requirement we could then kick off a thread running
on the zone to begin populating the hints and notifying the
virtio-balloon interface. When we can no longer fill the ring we would
simply stop the thread until we get back to the 2X state for nr_freed
versus the last nr_freed value we had hinted upon. It wouldn't be
dissimilar to how we currently handle the Tx path in many NICs where
we shut off hinting.

For examples of doing something like this you could look at the Rx
softIRQ handling in the NIC drivers. Basically the idea there is you
trigger the event once, and then the thread is running until all work
has been completed. The thread itself is limiting itself to only
processing some number of fixed buffers for each request, and when it
can no longer get a full set it stops and waits to be rescheduled by
an interrupt.

> > In addition the search cost should be minimal
> > since all of the "Offline" pages should be aggregated to the tail of
> > the free_area so all pages allocated off of the free_area will be the
> > non-"Offline" pages until we shift over to them all being "Offline".
> > This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
> > since the only real consumer of add_to_free_area_tail is
> > __free_one_page which uses it to place a page with an order less than
> > MAX_ORDER - 2 on the tail of a free_area assuming that it should be
> > freeing the buddy of that page shortly. The only other issue with
> > adding to tail would be the memory shuffling which was recently added,
> > but I don't see that as being something that will be enabled in most
> > cases so we could probably just make the features mutually exclusive,
> > at least for now.
> >
> > So if I am not mistaken this would essentially require a couple
> > changes to the mm infrastructure in order for this to work.
> >
> > First we would need to split nr_free into two counters, something like
> > nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
> > value currently used for nr_free. When we pulled the pages for hinting
> > we would reduce the nr_freed value and then add back to it when the
> > pages are returned. When pages are allocated they would increment the
> > nr_bound value. The idea behind this is that we can record nr_free
> > when we collect the pages and save it to some local value. This value
> > could then tell us how many new pages have been added that have not
> > been hinted upon.
> >
> > In addition we will need some way to identify which pages have been
> > hinted on and which have not. The way I believe easiest to do this
> > would be to overload the PageType value so that we could essentially
> > have two values for "Buddy" pages. We would have our standard "Buddy"
> > pages, and "Buddy" pages that also have the "Offline" value set in the
> > PageType field. Tracking the Online vs Offline pages this way would
> > actually allow us to do this with almost no overhead as the mapcount
> > value is already being reset to clear the "Buddy" flag so adding a
> > "Offline" flag to this clearing should come at no additional cost.
> >
> > Lastly we would need to create a specialized function for allocating
> > the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
> > "Offline" pages. I'm thinking the alloc function it would look
> > something like __rmqueue_smallest but without the "expand" and needing
> > to modify the !page check to also include a check to verify the page
> > is not "Offline". As far as the changes to __free_one_page it would be
> > a 2 line change to test for the PageType being offline, and if it is
> > to call add_to_free_area_tail instead of add_to_free_area.
> Is it possible that once the pages are offline, there is a large
> allocation request in the guest needing those offline pages as well?

It is possible. However the behavior here would be no different from a
NIC driver. NIC drivers will sit on a swath of memory for Rx purposes
waiting for the DMA to occur. Here we are sitting on 64MB which for a
large allocation should not be that significant.

As far as avoiding it, I don't think there is any way we can avoid
such an event completely. There are scenerios where the hitning will
get hung up while sitting on memory for an extended period of time.
That is why I am thinking our best mitigation for now would be to keep
the amount of hinting we are doing confined to something on the
smaller side such as 64M or less which I have already mentioned. By
doing that if we do hit one of the problematic scenarios we should
have minimal impact.

> >
> > Anyway this email ended up being pretty massive by the time I was
> > done. Feel free to reply to parts of it and we can break it out into
> > separate threads of discussion as necessary. I will start working on
> > coding some parts of this next week.
> >
> > Thanks.
> >
> > - Alex
> --
> Regards
> Nitesh
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-08 15:18     ` Alexander Duyck
  (?)
@ 2019-04-08 15:41     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 31+ messages in thread
From: Michael S. Tsirkin @ 2019-04-08 15:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Nitesh Narayan Lal, David Hildenbrand, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Apr 08, 2019 at 08:18:35AM -0700, Alexander Duyck wrote:
> On Mon, Apr 8, 2019 at 5:24 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >
> >
> > On 4/5/19 8:09 PM, Alexander Duyck wrote:
> > > So I am starting this thread as a spot to collect my thoughts on the
> > > current guest free page hinting design as well as point out a few
> > > possible things we could do to improve upon it.
> > >
> > > 1. The current design isn't likely going to scale well to multiple
> > > VCPUs. The issue specifically is that the zone lock must be held to
> > > pull pages off of the free list and to place them back there once they
> > > have been hinted upon. As a result it would likely make sense to try
> > > to limit ourselves to only having one thread performing the actual
> > > hinting so that we can avoid running into issues with lock contention
> > > between threads.
> > >
> > > 2. There are currently concerns about the hinting triggering false OOM
> > > situations if too much memory is isolated while it is being hinted. My
> > > thought on this is to simply avoid the issue by only hint on a limited
> > > amount of memory at a time. Something like 64MB should be a workable
> > > limit without introducing much in the way of regressions. However as a
> > > result of this we can easily be overrun while waiting on the host to
> > > process the hinting request. As such we will probably need a way to
> > > walk the free list and free pages after they have been freed instead
> > > of trying to do it as they are freed.
> > >
> > > 3. Even with the current buffering which is still on the larger side
> > > it is possible to overrun the hinting limits if something causes the
> > > host to stall and a large swath of memory is released. As such we are
> > > still going to need some sort of scanning mechanism or will have to
> > > live with not providing accurate hints.
> > >
> > > 4. In my opinion, the code overall is likely more complex then it
> > > needs to be. We currently have 2 allocations that have to occur every
> > > time we provide a hint all the way to the host, ideally we should not
> > > need to allocate more memory to provide hints. We should be able to
> > > hold the memory use for a memory hint device constant and simply map
> > > the page address and size to the descriptors of the virtio-ring.
> > >
> > > With that said I have a few ideas that may help to address the 4
> > > issues called out above. The basic idea is simple. We use a high water
> > > mark based on zone->free_area[order].nr_free to determine when to wake
> > > up a thread to start hinting memory out of a given free area. From
> > > there we allocate non-"Offline" pages from the free area and assign
> > > them to the hinting queue up to 64MB at a time. Once the hinting is
> > > completed we mark them "Offline" and add them to the tail of the
> > > free_area. Doing this we should cycle the non-"Offline" pages slowly
> > > out of the free_area.
> > any ideas about how are you planning to control this?

I think supplying the 64M value from host is probably reasonable.

> 
> You mean in terms of switching the hinting on/off? The setup should be
> pretty simple. Basically we would still need a hook like the one you
> added after the allocation to determine where the free page ultimately
> landed and to do a check against the high water mark I mentioned.
> Basically if there is something like 2X the number of pages needed to
> fulfill the 64MB requirement we could then kick off a thread running
> on the zone to begin populating the hints and notifying the
> virtio-balloon interface. When we can no longer fill the ring we would
> simply stop the thread until we get back to the 2X state for nr_freed
> versus the last nr_freed value we had hinted upon. It wouldn't be
> dissimilar to how we currently handle the Tx path in many NICs where
> we shut off hinting.
> 
> For examples of doing something like this you could look at the Rx
> softIRQ handling in the NIC drivers. Basically the idea there is you
> trigger the event once, and then the thread is running until all work
> has been completed. The thread itself is limiting itself to only
> processing some number of fixed buffers for each request, and when it
> can no longer get a full set it stops and waits to be rescheduled by
> an interrupt.
> 
> > > In addition the search cost should be minimal
> > > since all of the "Offline" pages should be aggregated to the tail of
> > > the free_area so all pages allocated off of the free_area will be the
> > > non-"Offline" pages until we shift over to them all being "Offline".
> > > This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
> > > since the only real consumer of add_to_free_area_tail is
> > > __free_one_page which uses it to place a page with an order less than
> > > MAX_ORDER - 2 on the tail of a free_area assuming that it should be
> > > freeing the buddy of that page shortly. The only other issue with
> > > adding to tail would be the memory shuffling which was recently added,
> > > but I don't see that as being something that will be enabled in most
> > > cases so we could probably just make the features mutually exclusive,
> > > at least for now.
> > >
> > > So if I am not mistaken this would essentially require a couple
> > > changes to the mm infrastructure in order for this to work.
> > >
> > > First we would need to split nr_free into two counters, something like
> > > nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
> > > value currently used for nr_free. When we pulled the pages for hinting
> > > we would reduce the nr_freed value and then add back to it when the
> > > pages are returned. When pages are allocated they would increment the
> > > nr_bound value. The idea behind this is that we can record nr_free
> > > when we collect the pages and save it to some local value. This value
> > > could then tell us how many new pages have been added that have not
> > > been hinted upon.
> > >
> > > In addition we will need some way to identify which pages have been
> > > hinted on and which have not. The way I believe easiest to do this
> > > would be to overload the PageType value so that we could essentially
> > > have two values for "Buddy" pages. We would have our standard "Buddy"
> > > pages, and "Buddy" pages that also have the "Offline" value set in the
> > > PageType field. Tracking the Online vs Offline pages this way would
> > > actually allow us to do this with almost no overhead as the mapcount
> > > value is already being reset to clear the "Buddy" flag so adding a
> > > "Offline" flag to this clearing should come at no additional cost.
> > >
> > > Lastly we would need to create a specialized function for allocating
> > > the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
> > > "Offline" pages. I'm thinking the alloc function it would look
> > > something like __rmqueue_smallest but without the "expand" and needing
> > > to modify the !page check to also include a check to verify the page
> > > is not "Offline". As far as the changes to __free_one_page it would be
> > > a 2 line change to test for the PageType being offline, and if it is
> > > to call add_to_free_area_tail instead of add_to_free_area.
> > Is it possible that once the pages are offline, there is a large
> > allocation request in the guest needing those offline pages as well?
> 
> It is possible. However the behavior here would be no different from a
> NIC driver. NIC drivers will sit on a swath of memory for Rx purposes
> waiting for the DMA to occur. Here we are sitting on 64MB which for a
> large allocation should not be that significant.
> 
> As far as avoiding it, I don't think there is any way we can avoid
> such an event completely. There are scenerios where the hitning will
> get hung up while sitting on memory for an extended period of time.
> That is why I am thinking our best mitigation for now would be to keep
> the amount of hinting we are doing confined to something on the
> smaller side such as 64M or less which I have already mentioned. By
> doing that if we do hit one of the problematic scenarios we should
> have minimal impact.
> 
> > >
> > > Anyway this email ended up being pretty massive by the time I was
> > > done. Feel free to reply to parts of it and we can break it out into
> > > separate threads of discussion as necessary. I will start working on
> > > coding some parts of this next week.
> > >
> > > Thanks.
> > >
> > > - Alex
> > --
> > Regards
> > Nitesh
> >


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-06  0:09 ` Alexander Duyck
  (?)
  (?)
@ 2019-04-08 16:36 ` David Hildenbrand
  2019-04-08 18:09   ` Nitesh Narayan Lal
  2019-04-08 18:18     ` Alexander Duyck
  -1 siblings, 2 replies; 31+ messages in thread
From: David Hildenbrand @ 2019-04-08 16:36 UTC (permalink / raw)
  To: Alexander Duyck, Michael S. Tsirkin, Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On 06.04.19 02:09, Alexander Duyck wrote:
> So I am starting this thread as a spot to collect my thoughts on the
> current guest free page hinting design as well as point out a few
> possible things we could do to improve upon it.
> 
> 1. The current design isn't likely going to scale well to multiple
> VCPUs. The issue specifically is that the zone lock must be held to
> pull pages off of the free list and to place them back there once they
> have been hinted upon. As a result it would likely make sense to try
> to limit ourselves to only having one thread performing the actual
> hinting so that we can avoid running into issues with lock contention
> between threads.

Makes sense.

> 
> 2. There are currently concerns about the hinting triggering false OOM
> situations if too much memory is isolated while it is being hinted. My
> thought on this is to simply avoid the issue by only hint on a limited
> amount of memory at a time. Something like 64MB should be a workable
> limit without introducing much in the way of regressions. However as a
> result of this we can easily be overrun while waiting on the host to
> process the hinting request. As such we will probably need a way to
> walk the free list and free pages after they have been freed instead
> of trying to do it as they are freed.

We will need such a way in case we care about dropped hinting requests, yes.

> 
> 3. Even with the current buffering which is still on the larger side
> it is possible to overrun the hinting limits if something causes the
> host to stall and a large swath of memory is released. As such we are
> still going to need some sort of scanning mechanism or will have to
> live with not providing accurate hints.

Yes, usually if there is a lot of guest activity, you could however
assume that free pages might get reused either way soon. Of course,
special cases are "freeing XGB and being idle afterwards".

> 
> 4. In my opinion, the code overall is likely more complex then it
> needs to be. We currently have 2 allocations that have to occur every
> time we provide a hint all the way to the host, ideally we should not
> need to allocate more memory to provide hints. We should be able to
> hold the memory use for a memory hint device constant and simply map
> the page address and size to the descriptors of the virtio-ring.

I don't think the two allocations are that complex. The only thing I
consider complex is isolation a lot of pages from different zones etc.
Two allocations, nobody really cares about that. Of course, the fact
that we have to allocate memory from the VCPUs where we currently freed
a page is not optimal. I consider that rather a problem/complex.

Especially you have a point regarding scalability and multiple VCPUs.

> 
> With that said I have a few ideas that may help to address the 4
> issues called out above. The basic idea is simple. We use a high water
> mark based on zone->free_area[order].nr_free to determine when to wake
> up a thread to start hinting memory out of a given free area. From
> there we allocate non-"Offline" pages from the free area and assign
> them to the hinting queue up to 64MB at a time. Once the hinting is
> completed we mark them "Offline" and add them to the tail of the
> free_area. Doing this we should cycle the non-"Offline" pages slowly
> out of the free_area. In addition the search cost should be minimal
> since all of the "Offline" pages should be aggregated to the tail of
> the free_area so all pages allocated off of the free_area will be the
> non-"Offline" pages until we shift over to them all being "Offline".
> This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
> since the only real consumer of add_to_free_area_tail is
> __free_one_page which uses it to place a page with an order less than
> MAX_ORDER - 2 on the tail of a free_area assuming that it should be
> freeing the buddy of that page shortly. The only other issue with
> adding to tail would be the memory shuffling which was recently added,
> but I don't see that as being something that will be enabled in most
> cases so we could probably just make the features mutually exclusive,
> at least for now.
> 
> So if I am not mistaken this would essentially require a couple
> changes to the mm infrastructure in order for this to work.
> 
> First we would need to split nr_free into two counters, something like
> nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
> value currently used for nr_free. When we pulled the pages for hinting
> we would reduce the nr_freed value and then add back to it when the
> pages are returned. When pages are allocated they would increment the
> nr_bound value. The idea behind this is that we can record nr_free
> when we collect the pages and save it to some local value. This value
> could then tell us how many new pages have been added that have not
> been hinted upon.

I can imagine that quite some people will have problems with such
"virtualization specific changes" splattered around core memory
management. Would there be a way to manage this data at a different
place, out of core-mm and somehow work on it via callbacks?

> 
> In addition we will need some way to identify which pages have been
> hinted on and which have not. The way I believe easiest to do this
> would be to overload the PageType value so that we could essentially
> have two values for "Buddy" pages. We would have our standard "Buddy"
> pages, and "Buddy" pages that also have the "Offline" value set in the
> PageType field. Tracking the Online vs Offline pages this way would
> actually allow us to do this with almost no overhead as the mapcount
> value is already being reset to clear the "Buddy" flag so adding a
> "Offline" flag to this clearing should come at no additional cost.

Just nothing here that this will require modifications to kdump
(makedumpfile to be precise and the vmcore information exposed from the
kernel), as kdump only checks for the the actual mapcount value to
detect buddy and offline pages (to exclude them from dumps), they are
not treated as flags.

For now, any mapcount values are really only separate values, meaning
not the separate bits are of interest, like flags would be. Reusing
other flags would make our life a lot easier. E.g. PG_young or so. But
clearing of these is then the problematic part.

Of course we could use in the kernel two values, Buddy and BuddyOffline.
But then we have to check for two different values whenever we want to
identify a buddy page in the kernel.

> 
> Lastly we would need to create a specialized function for allocating
> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
> "Offline" pages. I'm thinking the alloc function it would look
> something like __rmqueue_smallest but without the "expand" and needing
> to modify the !page check to also include a check to verify the page
> is not "Offline". As far as the changes to __free_one_page it would be
> a 2 line change to test for the PageType being offline, and if it is
> to call add_to_free_area_tail instead of add_to_free_area.

As already mentioned, there might be scenarios where the additional
hinting thread might consume too much CPU cycles, especially if there is
little guest activity any you mostly spend time scanning a handful of
free pages and reporting them. I wonder if we can somehow limit the
amount of wakeups/scans for a given period to mitigate this issue.

One main issue I see with your approach is that we need quite a lot of
core memory management changes. This is a problem. I wonder if we can
factor out most parts into callbacks.

E.g. in order to detect where to queue a certain page (front/tail), call
a callback if one is registered, mark/check pages in a core-mm unknown
way as offline etc.

I still wonder if there could be an easier way to combine recording of
hints and one hinting thread, essentially avoiding scanning and some of
the required core-mm changes.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-08 16:36 ` David Hildenbrand
@ 2019-04-08 18:09   ` Nitesh Narayan Lal
  2019-04-08 18:19     ` Michael S. Tsirkin
  2019-04-08 18:27     ` David Hildenbrand
  2019-04-08 18:18     ` Alexander Duyck
  1 sibling, 2 replies; 31+ messages in thread
From: Nitesh Narayan Lal @ 2019-04-08 18:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli,
	Alexander Duyck, Michael S. Tsirkin


[-- Attachment #1.1: Type: text/plain, Size: 9175 bytes --]

On 4/8/19 12:36 PM, David Hildenbrand wrote:
> On 06.04.19 02:09, Alexander Duyck wrote:
>> So I am starting this thread as a spot to collect my thoughts on the
>> current guest free page hinting design as well as point out a few
>> possible things we could do to improve upon it.
>>
>> 1. The current design isn't likely going to scale well to multiple
>> VCPUs. The issue specifically is that the zone lock must be held to
>> pull pages off of the free list and to place them back there once they
>> have been hinted upon. As a result it would likely make sense to try
>> to limit ourselves to only having one thread performing the actual
>> hinting so that we can avoid running into issues with lock contention
>> between threads.
> Makes sense.
>
>> 2. There are currently concerns about the hinting triggering false OOM
>> situations if too much memory is isolated while it is being hinted. My
>> thought on this is to simply avoid the issue by only hint on a limited
>> amount of memory at a time. Something like 64MB should be a workable
>> limit without introducing much in the way of regressions. However as a
>> result of this we can easily be overrun while waiting on the host to
>> process the hinting request. As such we will probably need a way to
>> walk the free list and free pages after they have been freed instead
>> of trying to do it as they are freed.
> We will need such a way in case we care about dropped hinting requests, yes.
>
>> 3. Even with the current buffering which is still on the larger side
>> it is possible to overrun the hinting limits if something causes the
>> host to stall and a large swath of memory is released. As such we are
>> still going to need some sort of scanning mechanism or will have to
>> live with not providing accurate hints.
> Yes, usually if there is a lot of guest activity, you could however
> assume that free pages might get reused either way soon. Of course,
> special cases are "freeing XGB and being idle afterwards".
>
>> 4. In my opinion, the code overall is likely more complex then it
>> needs to be. We currently have 2 allocations that have to occur every
>> time we provide a hint all the way to the host, ideally we should not
>> need to allocate more memory to provide hints. We should be able to
>> hold the memory use for a memory hint device constant and simply map
>> the page address and size to the descriptors of the virtio-ring.
> I don't think the two allocations are that complex. The only thing I
> consider complex is isolation a lot of pages from different zones etc.
> Two allocations, nobody really cares about that. Of course, the fact
> that we have to allocate memory from the VCPUs where we currently freed
> a page is not optimal. I consider that rather a problem/complex.
>
> Especially you have a point regarding scalability and multiple VCPUs.
>
>> With that said I have a few ideas that may help to address the 4
>> issues called out above. The basic idea is simple. We use a high water
>> mark based on zone->free_area[order].nr_free to determine when to wake
>> up a thread to start hinting memory out of a given free area. From
>> there we allocate non-"Offline" pages from the free area and assign
>> them to the hinting queue up to 64MB at a time. Once the hinting is
>> completed we mark them "Offline" and add them to the tail of the
>> free_area. Doing this we should cycle the non-"Offline" pages slowly
>> out of the free_area. In addition the search cost should be minimal
>> since all of the "Offline" pages should be aggregated to the tail of
>> the free_area so all pages allocated off of the free_area will be the
>> non-"Offline" pages until we shift over to them all being "Offline".
>> This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
>> since the only real consumer of add_to_free_area_tail is
>> __free_one_page which uses it to place a page with an order less than
>> MAX_ORDER - 2 on the tail of a free_area assuming that it should be
>> freeing the buddy of that page shortly. The only other issue with
>> adding to tail would be the memory shuffling which was recently added,
>> but I don't see that as being something that will be enabled in most
>> cases so we could probably just make the features mutually exclusive,
>> at least for now.
>>
>> So if I am not mistaken this would essentially require a couple
>> changes to the mm infrastructure in order for this to work.
>>
>> First we would need to split nr_free into two counters, something like
>> nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
>> value currently used for nr_free. When we pulled the pages for hinting
>> we would reduce the nr_freed value and then add back to it when the
>> pages are returned. When pages are allocated they would increment the
>> nr_bound value. The idea behind this is that we can record nr_free
>> when we collect the pages and save it to some local value. This value
>> could then tell us how many new pages have been added that have not
>> been hinted upon.
> I can imagine that quite some people will have problems with such
> "virtualization specific changes" splattered around core memory
> management. Would there be a way to manage this data at a different
> place, out of core-mm and somehow work on it via callbacks?
>
>> In addition we will need some way to identify which pages have been
>> hinted on and which have not. The way I believe easiest to do this
>> would be to overload the PageType value so that we could essentially
>> have two values for "Buddy" pages. We would have our standard "Buddy"
>> pages, and "Buddy" pages that also have the "Offline" value set in the
>> PageType field. Tracking the Online vs Offline pages this way would
>> actually allow us to do this with almost no overhead as the mapcount
>> value is already being reset to clear the "Buddy" flag so adding a
>> "Offline" flag to this clearing should come at no additional cost.
> Just nothing here that this will require modifications to kdump
> (makedumpfile to be precise and the vmcore information exposed from the
> kernel), as kdump only checks for the the actual mapcount value to
> detect buddy and offline pages (to exclude them from dumps), they are
> not treated as flags.
>
> For now, any mapcount values are really only separate values, meaning
> not the separate bits are of interest, like flags would be. Reusing
> other flags would make our life a lot easier. E.g. PG_young or so. But
> clearing of these is then the problematic part.
>
> Of course we could use in the kernel two values, Buddy and BuddyOffline.
> But then we have to check for two different values whenever we want to
> identify a buddy page in the kernel.
>
>> Lastly we would need to create a specialized function for allocating
>> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
>> "Offline" pages. I'm thinking the alloc function it would look
>> something like __rmqueue_smallest but without the "expand" and needing
>> to modify the !page check to also include a check to verify the page
>> is not "Offline". As far as the changes to __free_one_page it would be
>> a 2 line change to test for the PageType being offline, and if it is
>> to call add_to_free_area_tail instead of add_to_free_area.
> As already mentioned, there might be scenarios where the additional
> hinting thread might consume too much CPU cycles, especially if there is
> little guest activity any you mostly spend time scanning a handful of
> free pages and reporting them. I wonder if we can somehow limit the
> amount of wakeups/scans for a given period to mitigate this issue.
>
> One main issue I see with your approach is that we need quite a lot of
> core memory management changes. This is a problem. I wonder if we can
> factor out most parts into callbacks.
>
> E.g. in order to detect where to queue a certain page (front/tail), call
> a callback if one is registered, mark/check pages in a core-mm unknown
> way as offline etc.
>
> I still wonder if there could be an easier way to combine recording of
> hints and one hinting thread, essentially avoiding scanning and some of
> the required core-mm changes.
In order to resolve the scalability issues associated with my
patch-series without compromising with free memory hints, I may explore
the idea described below:
- Use xbitmap (if possible - earlier suggested by Rik and Wei)
corresponding to each zone on a granularity of MAX_ORDER - 2, to track
the freed PFN's.
- Define and use counters corresponding to each zone to monitor the
amount of memory freed.
- As soon as the 64MB free memory threshold is hit wake up the kernel
thread which will scan this xbitmap and try to isolate the pages and
clear the corresponding bits. (We still have to acquire zone lock to
protect the respective xbitmap)
- Report the isolated pages back to the host in a synchronous manner.
I still have to work on several details of this idea including xbitmap,
but first would like to hear any suggestions/thoughts.
>
-- 
Regards
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-08 16:36 ` David Hildenbrand
@ 2019-04-08 18:18     ` Alexander Duyck
  2019-04-08 18:18     ` Alexander Duyck
  1 sibling, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2019-04-08 18:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Apr 8, 2019 at 9:36 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 06.04.19 02:09, Alexander Duyck wrote:
> > So I am starting this thread as a spot to collect my thoughts on the
> > current guest free page hinting design as well as point out a few
> > possible things we could do to improve upon it.
> >
> > 1. The current design isn't likely going to scale well to multiple
> > VCPUs. The issue specifically is that the zone lock must be held to
> > pull pages off of the free list and to place them back there once they
> > have been hinted upon. As a result it would likely make sense to try
> > to limit ourselves to only having one thread performing the actual
> > hinting so that we can avoid running into issues with lock contention
> > between threads.
>
> Makes sense.
>
> >
> > 2. There are currently concerns about the hinting triggering false OOM
> > situations if too much memory is isolated while it is being hinted. My
> > thought on this is to simply avoid the issue by only hint on a limited
> > amount of memory at a time. Something like 64MB should be a workable
> > limit without introducing much in the way of regressions. However as a
> > result of this we can easily be overrun while waiting on the host to
> > process the hinting request. As such we will probably need a way to
> > walk the free list and free pages after they have been freed instead
> > of trying to do it as they are freed.
>
> We will need such a way in case we care about dropped hinting requests, yes.
>
> >
> > 3. Even with the current buffering which is still on the larger side
> > it is possible to overrun the hinting limits if something causes the
> > host to stall and a large swath of memory is released. As such we are
> > still going to need some sort of scanning mechanism or will have to
> > live with not providing accurate hints.
>
> Yes, usually if there is a lot of guest activity, you could however
> assume that free pages might get reused either way soon. Of course,
> special cases are "freeing XGB and being idle afterwards".

If we are assuming pages would get reused soon there wouldn't be as
much value to the free page hinting. Our optimal case is memory that
is going to be sitting free for an extended period of time. I say that
because we are adding overhead by hinting the pages away. The biggest
gain for this patch set is to be hinting before the page would be
swapped out to disk. If we are hinting before then and then reusing
the page we will actually introduce a performance regression as a page
fault is necessary to make the page usable again and that comes at a
cost. On the other hand if we linger too long in hinting we end up
with the page being swapped to disk.

> >
> > 4. In my opinion, the code overall is likely more complex then it
> > needs to be. We currently have 2 allocations that have to occur every
> > time we provide a hint all the way to the host, ideally we should not
> > need to allocate more memory to provide hints. We should be able to
> > hold the memory use for a memory hint device constant and simply map
> > the page address and size to the descriptors of the virtio-ring.
>
> I don't think the two allocations are that complex. The only thing I
> consider complex is isolation a lot of pages from different zones etc.
> Two allocations, nobody really cares about that. Of course, the fact
> that we have to allocate memory from the VCPUs where we currently freed
> a page is not optimal. I consider that rather a problem/complex.
>
> Especially you have a point regarding scalability and multiple VCPUs.

I found the extra allocations made thing much more difficult to
review. Basically we are having to design the virio interface to parse
the scatterlist that is being allocated and sent via the virtio
buffer. I really prefer the way the virtio-ballon was offlining pages
by just mapping them as though they were going to be used as Rx
buffers.

> >
> > With that said I have a few ideas that may help to address the 4
> > issues called out above. The basic idea is simple. We use a high water
> > mark based on zone->free_area[order].nr_free to determine when to wake
> > up a thread to start hinting memory out of a given free area. From
> > there we allocate non-"Offline" pages from the free area and assign
> > them to the hinting queue up to 64MB at a time. Once the hinting is
> > completed we mark them "Offline" and add them to the tail of the
> > free_area. Doing this we should cycle the non-"Offline" pages slowly
> > out of the free_area. In addition the search cost should be minimal
> > since all of the "Offline" pages should be aggregated to the tail of
> > the free_area so all pages allocated off of the free_area will be the
> > non-"Offline" pages until we shift over to them all being "Offline".
> > This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
> > since the only real consumer of add_to_free_area_tail is
> > __free_one_page which uses it to place a page with an order less than
> > MAX_ORDER - 2 on the tail of a free_area assuming that it should be
> > freeing the buddy of that page shortly. The only other issue with
> > adding to tail would be the memory shuffling which was recently added,
> > but I don't see that as being something that will be enabled in most
> > cases so we could probably just make the features mutually exclusive,
> > at least for now.
> >
> > So if I am not mistaken this would essentially require a couple
> > changes to the mm infrastructure in order for this to work.
> >
> > First we would need to split nr_free into two counters, something like
> > nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
> > value currently used for nr_free. When we pulled the pages for hinting
> > we would reduce the nr_freed value and then add back to it when the
> > pages are returned. When pages are allocated they would increment the
> > nr_bound value. The idea behind this is that we can record nr_free
> > when we collect the pages and save it to some local value. This value
> > could then tell us how many new pages have been added that have not
> > been hinted upon.
>
> I can imagine that quite some people will have problems with such
> "virtualization specific changes" splattered around core memory
> management. Would there be a way to manage this data at a different
> place, out of core-mm and somehow work on it via callbacks?

I'm working on the patches for this now. As I mentioned I only really
see 2, maybe 3 total changes.
1. area->nr_free -> area->nr_freed - area->nr_bound
2. Buddy -> Buddy & Offline, free "Offline" to tail.
3. alloc_non-offline

I'll elaborate more on the last two below.

> >
> > In addition we will need some way to identify which pages have been
> > hinted on and which have not. The way I believe easiest to do this
> > would be to overload the PageType value so that we could essentially
> > have two values for "Buddy" pages. We would have our standard "Buddy"
> > pages, and "Buddy" pages that also have the "Offline" value set in the
> > PageType field. Tracking the Online vs Offline pages this way would
> > actually allow us to do this with almost no overhead as the mapcount
> > value is already being reset to clear the "Buddy" flag so adding a
> > "Offline" flag to this clearing should come at no additional cost.
>
> Just nothing here that this will require modifications to kdump
> (makedumpfile to be precise and the vmcore information exposed from the
> kernel), as kdump only checks for the the actual mapcount value to
> detect buddy and offline pages (to exclude them from dumps), they are
> not treated as flags.
>
> For now, any mapcount values are really only separate values, meaning
> not the separate bits are of interest, like flags would be. Reusing
> other flags would make our life a lot easier. E.g. PG_young or so. But
> clearing of these is then the problematic part.
>
> Of course we could use in the kernel two values, Buddy and BuddyOffline.
> But then we have to check for two different values whenever we want to
> identify a buddy page in the kernel.

Actually this may not be working the way you think it is working.
Below is the PageType code:
#define PAGE_TYPE_BASE 0xf0000000
/* Reserve 0x0000007f to catch underflows of page_mapcount */
#define PAGE_MAPCOUNT_RESERVE -128
#define PG_buddy 0x00000080
#define PG_offline 0x00000100
#define PG_kmemcg 0x00000200
#define PG_table 0x00000400

#define PageType(page, flag) \
((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)

If I am reading this correctly when you are testing for the "Buddy"
page type all you are doing is something like:
        page_type & 0xf0000080 = 0xf0000000;

This should work if mapcount is 0xfffffe7f or 0xffffff7f. So if I have
added "Online" to it the value the result will not have changed since
all adding a type really does is clear an additional bit in the
mapcount.

The only real issue I see is in __ClearPage##uname as it requires that
the bit be set when we are clearing it. For now I am just creating
another macro called "__ResetPage##uname" that uses the same test as
the set and then does an "|=" with the bit, and then using that to
reset the "Offline" value.

> >
> > Lastly we would need to create a specialized function for allocating
> > the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
> > "Offline" pages. I'm thinking the alloc function it would look
> > something like __rmqueue_smallest but without the "expand" and needing
> > to modify the !page check to also include a check to verify the page
> > is not "Offline". As far as the changes to __free_one_page it would be
> > a 2 line change to test for the PageType being offline, and if it is
> > to call add_to_free_area_tail instead of add_to_free_area.
>
> As already mentioned, there might be scenarios where the additional
> hinting thread might consume too much CPU cycles, especially if there is
> little guest activity any you mostly spend time scanning a handful of
> free pages and reporting them. I wonder if we can somehow limit the
> amount of wakeups/scans for a given period to mitigate this issue.

That is why I was talking about breaking nr_free into nr_freed and
nr_bound. By doing that I can record the nr_free value to a
virtio-balloon specific location at the start of any walk and should
know exactly now many pages were freed between that call and the next
one. By ordering things such that we place the "Offline" pages on the
tail of the list it should make the search quite fast since we would
just be always allocating off of the head of the queue until we have
hinted everything int he queue. So when we hit the last call to alloc
the non-"Offline" pages and shut down our thread we can use the
nr_freed value that we recorded to know exactly how many pages have
been added that haven't been hinted.

> One main issue I see with your approach is that we need quite a lot of
> core memory management changes. This is a problem. I wonder if we can
> factor out most parts into callbacks.

I think that is something we can't get away from. However if we make
this generic enough there would likely be others beyond just the
virtualization drivers that could make use of the infrastructure. For
example being able to track the rate at which the free areas are
cycling in and out pages seems like something that would be useful
outside of just the virtualization areas.

> E.g. in order to detect where to queue a certain page (front/tail), call
> a callback if one is registered, mark/check pages in a core-mm unknown
> way as offline etc.
>
> I still wonder if there could be an easier way to combine recording of
> hints and one hinting thread, essentially avoiding scanning and some of
> the required core-mm changes.

The concern I have with trying to avoid the scanning by tracking is
that if you fall behind it becomes something where just tracking the
metadata for the page hints would start to become expensive.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
@ 2019-04-08 18:18     ` Alexander Duyck
  0 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2019-04-08 18:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Apr 8, 2019 at 9:36 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 06.04.19 02:09, Alexander Duyck wrote:
> > So I am starting this thread as a spot to collect my thoughts on the
> > current guest free page hinting design as well as point out a few
> > possible things we could do to improve upon it.
> >
> > 1. The current design isn't likely going to scale well to multiple
> > VCPUs. The issue specifically is that the zone lock must be held to
> > pull pages off of the free list and to place them back there once they
> > have been hinted upon. As a result it would likely make sense to try
> > to limit ourselves to only having one thread performing the actual
> > hinting so that we can avoid running into issues with lock contention
> > between threads.
>
> Makes sense.
>
> >
> > 2. There are currently concerns about the hinting triggering false OOM
> > situations if too much memory is isolated while it is being hinted. My
> > thought on this is to simply avoid the issue by only hint on a limited
> > amount of memory at a time. Something like 64MB should be a workable
> > limit without introducing much in the way of regressions. However as a
> > result of this we can easily be overrun while waiting on the host to
> > process the hinting request. As such we will probably need a way to
> > walk the free list and free pages after they have been freed instead
> > of trying to do it as they are freed.
>
> We will need such a way in case we care about dropped hinting requests, yes.
>
> >
> > 3. Even with the current buffering which is still on the larger side
> > it is possible to overrun the hinting limits if something causes the
> > host to stall and a large swath of memory is released. As such we are
> > still going to need some sort of scanning mechanism or will have to
> > live with not providing accurate hints.
>
> Yes, usually if there is a lot of guest activity, you could however
> assume that free pages might get reused either way soon. Of course,
> special cases are "freeing XGB and being idle afterwards".

If we are assuming pages would get reused soon there wouldn't be as
much value to the free page hinting. Our optimal case is memory that
is going to be sitting free for an extended period of time. I say that
because we are adding overhead by hinting the pages away. The biggest
gain for this patch set is to be hinting before the page would be
swapped out to disk. If we are hinting before then and then reusing
the page we will actually introduce a performance regression as a page
fault is necessary to make the page usable again and that comes at a
cost. On the other hand if we linger too long in hinting we end up
with the page being swapped to disk.

> >
> > 4. In my opinion, the code overall is likely more complex then it
> > needs to be. We currently have 2 allocations that have to occur every
> > time we provide a hint all the way to the host, ideally we should not
> > need to allocate more memory to provide hints. We should be able to
> > hold the memory use for a memory hint device constant and simply map
> > the page address and size to the descriptors of the virtio-ring.
>
> I don't think the two allocations are that complex. The only thing I
> consider complex is isolation a lot of pages from different zones etc.
> Two allocations, nobody really cares about that. Of course, the fact
> that we have to allocate memory from the VCPUs where we currently freed
> a page is not optimal. I consider that rather a problem/complex.
>
> Especially you have a point regarding scalability and multiple VCPUs.

I found the extra allocations made thing much more difficult to
review. Basically we are having to design the virio interface to parse
the scatterlist that is being allocated and sent via the virtio
buffer. I really prefer the way the virtio-ballon was offlining pages
by just mapping them as though they were going to be used as Rx
buffers.

> >
> > With that said I have a few ideas that may help to address the 4
> > issues called out above. The basic idea is simple. We use a high water
> > mark based on zone->free_area[order].nr_free to determine when to wake
> > up a thread to start hinting memory out of a given free area. From
> > there we allocate non-"Offline" pages from the free area and assign
> > them to the hinting queue up to 64MB at a time. Once the hinting is
> > completed we mark them "Offline" and add them to the tail of the
> > free_area. Doing this we should cycle the non-"Offline" pages slowly
> > out of the free_area. In addition the search cost should be minimal
> > since all of the "Offline" pages should be aggregated to the tail of
> > the free_area so all pages allocated off of the free_area will be the
> > non-"Offline" pages until we shift over to them all being "Offline".
> > This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
> > since the only real consumer of add_to_free_area_tail is
> > __free_one_page which uses it to place a page with an order less than
> > MAX_ORDER - 2 on the tail of a free_area assuming that it should be
> > freeing the buddy of that page shortly. The only other issue with
> > adding to tail would be the memory shuffling which was recently added,
> > but I don't see that as being something that will be enabled in most
> > cases so we could probably just make the features mutually exclusive,
> > at least for now.
> >
> > So if I am not mistaken this would essentially require a couple
> > changes to the mm infrastructure in order for this to work.
> >
> > First we would need to split nr_free into two counters, something like
> > nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
> > value currently used for nr_free. When we pulled the pages for hinting
> > we would reduce the nr_freed value and then add back to it when the
> > pages are returned. When pages are allocated they would increment the
> > nr_bound value. The idea behind this is that we can record nr_free
> > when we collect the pages and save it to some local value. This value
> > could then tell us how many new pages have been added that have not
> > been hinted upon.
>
> I can imagine that quite some people will have problems with such
> "virtualization specific changes" splattered around core memory
> management. Would there be a way to manage this data at a different
> place, out of core-mm and somehow work on it via callbacks?

I'm working on the patches for this now. As I mentioned I only really
see 2, maybe 3 total changes.
1. area->nr_free -> area->nr_freed - area->nr_bound
2. Buddy -> Buddy & Offline, free "Offline" to tail.
3. alloc_non-offline

I'll elaborate more on the last two below.

> >
> > In addition we will need some way to identify which pages have been
> > hinted on and which have not. The way I believe easiest to do this
> > would be to overload the PageType value so that we could essentially
> > have two values for "Buddy" pages. We would have our standard "Buddy"
> > pages, and "Buddy" pages that also have the "Offline" value set in the
> > PageType field. Tracking the Online vs Offline pages this way would
> > actually allow us to do this with almost no overhead as the mapcount
> > value is already being reset to clear the "Buddy" flag so adding a
> > "Offline" flag to this clearing should come at no additional cost.
>
> Just nothing here that this will require modifications to kdump
> (makedumpfile to be precise and the vmcore information exposed from the
> kernel), as kdump only checks for the the actual mapcount value to
> detect buddy and offline pages (to exclude them from dumps), they are
> not treated as flags.
>
> For now, any mapcount values are really only separate values, meaning
> not the separate bits are of interest, like flags would be. Reusing
> other flags would make our life a lot easier. E.g. PG_young or so. But
> clearing of these is then the problematic part.
>
> Of course we could use in the kernel two values, Buddy and BuddyOffline.
> But then we have to check for two different values whenever we want to
> identify a buddy page in the kernel.

Actually this may not be working the way you think it is working.
Below is the PageType code:
#define PAGE_TYPE_BASE 0xf0000000
/* Reserve 0x0000007f to catch underflows of page_mapcount */
#define PAGE_MAPCOUNT_RESERVE -128
#define PG_buddy 0x00000080
#define PG_offline 0x00000100
#define PG_kmemcg 0x00000200
#define PG_table 0x00000400

#define PageType(page, flag) \
((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)

If I am reading this correctly when you are testing for the "Buddy"
page type all you are doing is something like:
        page_type & 0xf0000080 = 0xf0000000;

This should work if mapcount is 0xfffffe7f or 0xffffff7f. So if I have
added "Online" to it the value the result will not have changed since
all adding a type really does is clear an additional bit in the
mapcount.

The only real issue I see is in __ClearPage##uname as it requires that
the bit be set when we are clearing it. For now I am just creating
another macro called "__ResetPage##uname" that uses the same test as
the set and then does an "|=" with the bit, and then using that to
reset the "Offline" value.

> >
> > Lastly we would need to create a specialized function for allocating
> > the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
> > "Offline" pages. I'm thinking the alloc function it would look
> > something like __rmqueue_smallest but without the "expand" and needing
> > to modify the !page check to also include a check to verify the page
> > is not "Offline". As far as the changes to __free_one_page it would be
> > a 2 line change to test for the PageType being offline, and if it is
> > to call add_to_free_area_tail instead of add_to_free_area.
>
> As already mentioned, there might be scenarios where the additional
> hinting thread might consume too much CPU cycles, especially if there is
> little guest activity any you mostly spend time scanning a handful of
> free pages and reporting them. I wonder if we can somehow limit the
> amount of wakeups/scans for a given period to mitigate this issue.

That is why I was talking about breaking nr_free into nr_freed and
nr_bound. By doing that I can record the nr_free value to a
virtio-balloon specific location at the start of any walk and should
know exactly now many pages were freed between that call and the next
one. By ordering things such that we place the "Offline" pages on the
tail of the list it should make the search quite fast since we would
just be always allocating off of the head of the queue until we have
hinted everything int he queue. So when we hit the last call to alloc
the non-"Offline" pages and shut down our thread we can use the
nr_freed value that we recorded to know exactly how many pages have
been added that haven't been hinted.

> One main issue I see with your approach is that we need quite a lot of
> core memory management changes. This is a problem. I wonder if we can
> factor out most parts into callbacks.

I think that is something we can't get away from. However if we make
this generic enough there would likely be others beyond just the
virtualization drivers that could make use of the infrastructure. For
example being able to track the rate at which the free areas are
cycling in and out pages seems like something that would be useful
outside of just the virtualization areas.

> E.g. in order to detect where to queue a certain page (front/tail), call
> a callback if one is registered, mark/check pages in a core-mm unknown
> way as offline etc.
>
> I still wonder if there could be an easier way to combine recording of
> hints and one hinting thread, essentially avoiding scanning and some of
> the required core-mm changes.

The concern I have with trying to avoid the scanning by tracking is
that if you fall behind it becomes something where just tracking the
metadata for the page hints would start to become expensive.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-08 18:09   ` Nitesh Narayan Lal
@ 2019-04-08 18:19     ` Michael S. Tsirkin
  2019-04-08 18:29         ` Alexander Duyck
  2019-04-08 18:27     ` David Hildenbrand
  1 sibling, 1 reply; 31+ messages in thread
From: Michael S. Tsirkin @ 2019-04-08 18:19 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: David Hildenbrand, kvm list, LKML, linux-mm, Paolo Bonzini,
	lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli,
	Alexander Duyck

On Mon, Apr 08, 2019 at 02:09:59PM -0400, Nitesh Narayan Lal wrote:
> On 4/8/19 12:36 PM, David Hildenbrand wrote:
> > On 06.04.19 02:09, Alexander Duyck wrote:
> >> So I am starting this thread as a spot to collect my thoughts on the
> >> current guest free page hinting design as well as point out a few
> >> possible things we could do to improve upon it.
> >>
> >> 1. The current design isn't likely going to scale well to multiple
> >> VCPUs. The issue specifically is that the zone lock must be held to
> >> pull pages off of the free list and to place them back there once they
> >> have been hinted upon. As a result it would likely make sense to try
> >> to limit ourselves to only having one thread performing the actual
> >> hinting so that we can avoid running into issues with lock contention
> >> between threads.
> > Makes sense.
> >
> >> 2. There are currently concerns about the hinting triggering false OOM
> >> situations if too much memory is isolated while it is being hinted. My
> >> thought on this is to simply avoid the issue by only hint on a limited
> >> amount of memory at a time. Something like 64MB should be a workable
> >> limit without introducing much in the way of regressions. However as a
> >> result of this we can easily be overrun while waiting on the host to
> >> process the hinting request. As such we will probably need a way to
> >> walk the free list and free pages after they have been freed instead
> >> of trying to do it as they are freed.
> > We will need such a way in case we care about dropped hinting requests, yes.
> >
> >> 3. Even with the current buffering which is still on the larger side
> >> it is possible to overrun the hinting limits if something causes the
> >> host to stall and a large swath of memory is released. As such we are
> >> still going to need some sort of scanning mechanism or will have to
> >> live with not providing accurate hints.
> > Yes, usually if there is a lot of guest activity, you could however
> > assume that free pages might get reused either way soon. Of course,
> > special cases are "freeing XGB and being idle afterwards".
> >
> >> 4. In my opinion, the code overall is likely more complex then it
> >> needs to be. We currently have 2 allocations that have to occur every
> >> time we provide a hint all the way to the host, ideally we should not
> >> need to allocate more memory to provide hints. We should be able to
> >> hold the memory use for a memory hint device constant and simply map
> >> the page address and size to the descriptors of the virtio-ring.
> > I don't think the two allocations are that complex. The only thing I
> > consider complex is isolation a lot of pages from different zones etc.
> > Two allocations, nobody really cares about that. Of course, the fact
> > that we have to allocate memory from the VCPUs where we currently freed
> > a page is not optimal. I consider that rather a problem/complex.
> >
> > Especially you have a point regarding scalability and multiple VCPUs.
> >
> >> With that said I have a few ideas that may help to address the 4
> >> issues called out above. The basic idea is simple. We use a high water
> >> mark based on zone->free_area[order].nr_free to determine when to wake
> >> up a thread to start hinting memory out of a given free area. From
> >> there we allocate non-"Offline" pages from the free area and assign
> >> them to the hinting queue up to 64MB at a time. Once the hinting is
> >> completed we mark them "Offline" and add them to the tail of the
> >> free_area. Doing this we should cycle the non-"Offline" pages slowly
> >> out of the free_area. In addition the search cost should be minimal
> >> since all of the "Offline" pages should be aggregated to the tail of
> >> the free_area so all pages allocated off of the free_area will be the
> >> non-"Offline" pages until we shift over to them all being "Offline".
> >> This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
> >> since the only real consumer of add_to_free_area_tail is
> >> __free_one_page which uses it to place a page with an order less than
> >> MAX_ORDER - 2 on the tail of a free_area assuming that it should be
> >> freeing the buddy of that page shortly. The only other issue with
> >> adding to tail would be the memory shuffling which was recently added,
> >> but I don't see that as being something that will be enabled in most
> >> cases so we could probably just make the features mutually exclusive,
> >> at least for now.
> >>
> >> So if I am not mistaken this would essentially require a couple
> >> changes to the mm infrastructure in order for this to work.
> >>
> >> First we would need to split nr_free into two counters, something like
> >> nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
> >> value currently used for nr_free. When we pulled the pages for hinting
> >> we would reduce the nr_freed value and then add back to it when the
> >> pages are returned. When pages are allocated they would increment the
> >> nr_bound value. The idea behind this is that we can record nr_free
> >> when we collect the pages and save it to some local value. This value
> >> could then tell us how many new pages have been added that have not
> >> been hinted upon.
> > I can imagine that quite some people will have problems with such
> > "virtualization specific changes" splattered around core memory
> > management. Would there be a way to manage this data at a different
> > place, out of core-mm and somehow work on it via callbacks?
> >
> >> In addition we will need some way to identify which pages have been
> >> hinted on and which have not. The way I believe easiest to do this
> >> would be to overload the PageType value so that we could essentially
> >> have two values for "Buddy" pages. We would have our standard "Buddy"
> >> pages, and "Buddy" pages that also have the "Offline" value set in the
> >> PageType field. Tracking the Online vs Offline pages this way would
> >> actually allow us to do this with almost no overhead as the mapcount
> >> value is already being reset to clear the "Buddy" flag so adding a
> >> "Offline" flag to this clearing should come at no additional cost.
> > Just nothing here that this will require modifications to kdump
> > (makedumpfile to be precise and the vmcore information exposed from the
> > kernel), as kdump only checks for the the actual mapcount value to
> > detect buddy and offline pages (to exclude them from dumps), they are
> > not treated as flags.
> >
> > For now, any mapcount values are really only separate values, meaning
> > not the separate bits are of interest, like flags would be. Reusing
> > other flags would make our life a lot easier. E.g. PG_young or so. But
> > clearing of these is then the problematic part.
> >
> > Of course we could use in the kernel two values, Buddy and BuddyOffline.
> > But then we have to check for two different values whenever we want to
> > identify a buddy page in the kernel.
> >
> >> Lastly we would need to create a specialized function for allocating
> >> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
> >> "Offline" pages. I'm thinking the alloc function it would look
> >> something like __rmqueue_smallest but without the "expand" and needing
> >> to modify the !page check to also include a check to verify the page
> >> is not "Offline". As far as the changes to __free_one_page it would be
> >> a 2 line change to test for the PageType being offline, and if it is
> >> to call add_to_free_area_tail instead of add_to_free_area.
> > As already mentioned, there might be scenarios where the additional
> > hinting thread might consume too much CPU cycles, especially if there is
> > little guest activity any you mostly spend time scanning a handful of
> > free pages and reporting them. I wonder if we can somehow limit the
> > amount of wakeups/scans for a given period to mitigate this issue.
> >
> > One main issue I see with your approach is that we need quite a lot of
> > core memory management changes. This is a problem. I wonder if we can
> > factor out most parts into callbacks.
> >
> > E.g. in order to detect where to queue a certain page (front/tail), call
> > a callback if one is registered, mark/check pages in a core-mm unknown
> > way as offline etc.
> >
> > I still wonder if there could be an easier way to combine recording of
> > hints and one hinting thread, essentially avoiding scanning and some of
> > the required core-mm changes.
> In order to resolve the scalability issues associated with my
> patch-series without compromising with free memory hints, I may explore
> the idea described below:
> - Use xbitmap (if possible - earlier suggested by Rik and Wei)
> corresponding to each zone on a granularity of MAX_ORDER - 2, to track
> the freed PFN's.

MAX_ORDER - 2 is what? 2Mbyte?

> - Define and use counters corresponding to each zone to monitor the
> amount of memory freed.
> - As soon as the 64MB free memory threshold is hit wake up the kernel
> thread which will scan this xbitmap and try to isolate the pages and
> clear the corresponding bits. (We still have to acquire zone lock to
> protect the respective xbitmap)

So that's 32 pages then? I'd say just keep them in an array,
list, tree or hash, bitmap is for when you have nots of pages.


> - Report the isolated pages back to the host in a synchronous manner.
> I still have to work on several details of this idea including xbitmap,
> but first would like to hear any suggestions/thoughts.
> >
> -- 
> Regards
> Nitesh
> 







-- 
MST

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-08 18:09   ` Nitesh Narayan Lal
  2019-04-08 18:19     ` Michael S. Tsirkin
@ 2019-04-08 18:27     ` David Hildenbrand
  1 sibling, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2019-04-08 18:27 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli,
	Alexander Duyck, Michael S. Tsirkin

On 08.04.19 20:09, Nitesh Narayan Lal wrote:
> On 4/8/19 12:36 PM, David Hildenbrand wrote:
>> On 06.04.19 02:09, Alexander Duyck wrote:
>>> So I am starting this thread as a spot to collect my thoughts on the
>>> current guest free page hinting design as well as point out a few
>>> possible things we could do to improve upon it.
>>>
>>> 1. The current design isn't likely going to scale well to multiple
>>> VCPUs. The issue specifically is that the zone lock must be held to
>>> pull pages off of the free list and to place them back there once they
>>> have been hinted upon. As a result it would likely make sense to try
>>> to limit ourselves to only having one thread performing the actual
>>> hinting so that we can avoid running into issues with lock contention
>>> between threads.
>> Makes sense.
>>
>>> 2. There are currently concerns about the hinting triggering false OOM
>>> situations if too much memory is isolated while it is being hinted. My
>>> thought on this is to simply avoid the issue by only hint on a limited
>>> amount of memory at a time. Something like 64MB should be a workable
>>> limit without introducing much in the way of regressions. However as a
>>> result of this we can easily be overrun while waiting on the host to
>>> process the hinting request. As such we will probably need a way to
>>> walk the free list and free pages after they have been freed instead
>>> of trying to do it as they are freed.
>> We will need such a way in case we care about dropped hinting requests, yes.
>>
>>> 3. Even with the current buffering which is still on the larger side
>>> it is possible to overrun the hinting limits if something causes the
>>> host to stall and a large swath of memory is released. As such we are
>>> still going to need some sort of scanning mechanism or will have to
>>> live with not providing accurate hints.
>> Yes, usually if there is a lot of guest activity, you could however
>> assume that free pages might get reused either way soon. Of course,
>> special cases are "freeing XGB and being idle afterwards".
>>
>>> 4. In my opinion, the code overall is likely more complex then it
>>> needs to be. We currently have 2 allocations that have to occur every
>>> time we provide a hint all the way to the host, ideally we should not
>>> need to allocate more memory to provide hints. We should be able to
>>> hold the memory use for a memory hint device constant and simply map
>>> the page address and size to the descriptors of the virtio-ring.
>> I don't think the two allocations are that complex. The only thing I
>> consider complex is isolation a lot of pages from different zones etc.
>> Two allocations, nobody really cares about that. Of course, the fact
>> that we have to allocate memory from the VCPUs where we currently freed
>> a page is not optimal. I consider that rather a problem/complex.
>>
>> Especially you have a point regarding scalability and multiple VCPUs.
>>
>>> With that said I have a few ideas that may help to address the 4
>>> issues called out above. The basic idea is simple. We use a high water
>>> mark based on zone->free_area[order].nr_free to determine when to wake
>>> up a thread to start hinting memory out of a given free area. From
>>> there we allocate non-"Offline" pages from the free area and assign
>>> them to the hinting queue up to 64MB at a time. Once the hinting is
>>> completed we mark them "Offline" and add them to the tail of the
>>> free_area. Doing this we should cycle the non-"Offline" pages slowly
>>> out of the free_area. In addition the search cost should be minimal
>>> since all of the "Offline" pages should be aggregated to the tail of
>>> the free_area so all pages allocated off of the free_area will be the
>>> non-"Offline" pages until we shift over to them all being "Offline".
>>> This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
>>> since the only real consumer of add_to_free_area_tail is
>>> __free_one_page which uses it to place a page with an order less than
>>> MAX_ORDER - 2 on the tail of a free_area assuming that it should be
>>> freeing the buddy of that page shortly. The only other issue with
>>> adding to tail would be the memory shuffling which was recently added,
>>> but I don't see that as being something that will be enabled in most
>>> cases so we could probably just make the features mutually exclusive,
>>> at least for now.
>>>
>>> So if I am not mistaken this would essentially require a couple
>>> changes to the mm infrastructure in order for this to work.
>>>
>>> First we would need to split nr_free into two counters, something like
>>> nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
>>> value currently used for nr_free. When we pulled the pages for hinting
>>> we would reduce the nr_freed value and then add back to it when the
>>> pages are returned. When pages are allocated they would increment the
>>> nr_bound value. The idea behind this is that we can record nr_free
>>> when we collect the pages and save it to some local value. This value
>>> could then tell us how many new pages have been added that have not
>>> been hinted upon.
>> I can imagine that quite some people will have problems with such
>> "virtualization specific changes" splattered around core memory
>> management. Would there be a way to manage this data at a different
>> place, out of core-mm and somehow work on it via callbacks?
>>
>>> In addition we will need some way to identify which pages have been
>>> hinted on and which have not. The way I believe easiest to do this
>>> would be to overload the PageType value so that we could essentially
>>> have two values for "Buddy" pages. We would have our standard "Buddy"
>>> pages, and "Buddy" pages that also have the "Offline" value set in the
>>> PageType field. Tracking the Online vs Offline pages this way would
>>> actually allow us to do this with almost no overhead as the mapcount
>>> value is already being reset to clear the "Buddy" flag so adding a
>>> "Offline" flag to this clearing should come at no additional cost.
>> Just nothing here that this will require modifications to kdump
>> (makedumpfile to be precise and the vmcore information exposed from the
>> kernel), as kdump only checks for the the actual mapcount value to
>> detect buddy and offline pages (to exclude them from dumps), they are
>> not treated as flags.
>>
>> For now, any mapcount values are really only separate values, meaning
>> not the separate bits are of interest, like flags would be. Reusing
>> other flags would make our life a lot easier. E.g. PG_young or so. But
>> clearing of these is then the problematic part.
>>
>> Of course we could use in the kernel two values, Buddy and BuddyOffline.
>> But then we have to check for two different values whenever we want to
>> identify a buddy page in the kernel.
>>
>>> Lastly we would need to create a specialized function for allocating
>>> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
>>> "Offline" pages. I'm thinking the alloc function it would look
>>> something like __rmqueue_smallest but without the "expand" and needing
>>> to modify the !page check to also include a check to verify the page
>>> is not "Offline". As far as the changes to __free_one_page it would be
>>> a 2 line change to test for the PageType being offline, and if it is
>>> to call add_to_free_area_tail instead of add_to_free_area.
>> As already mentioned, there might be scenarios where the additional
>> hinting thread might consume too much CPU cycles, especially if there is
>> little guest activity any you mostly spend time scanning a handful of
>> free pages and reporting them. I wonder if we can somehow limit the
>> amount of wakeups/scans for a given period to mitigate this issue.
>>
>> One main issue I see with your approach is that we need quite a lot of
>> core memory management changes. This is a problem. I wonder if we can
>> factor out most parts into callbacks.
>>
>> E.g. in order to detect where to queue a certain page (front/tail), call
>> a callback if one is registered, mark/check pages in a core-mm unknown
>> way as offline etc.
>>
>> I still wonder if there could be an easier way to combine recording of
>> hints and one hinting thread, essentially avoiding scanning and some of
>> the required core-mm changes.
> In order to resolve the scalability issues associated with my
> patch-series without compromising with free memory hints, I may explore
> the idea described below:
> - Use xbitmap (if possible - earlier suggested by Rik and Wei)
> corresponding to each zone on a granularity of MAX_ORDER - 2, to track
> the freed PFN's.
> - Define and use counters corresponding to each zone to monitor the
> amount of memory freed.
> - As soon as the 64MB free memory threshold is hit wake up the kernel
> thread which will scan this xbitmap and try to isolate the pages and
> clear the corresponding bits. (We still have to acquire zone lock to
> protect the respective xbitmap)
> - Report the isolated pages back to the host in a synchronous manner.
> I still have to work on several details of this idea including xbitmap,
> but first would like to hear any suggestions/thoughts.

As discussed offline, I think this is the right approach to follow.

a) We have an asynchronous hinting thread as suggested by Alex.
b) We record hints and only have to scan a xbitmap, not some random lists.
c) We have minimal core mm changes

Best of both worlds and sounds simple.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-08 18:19     ` Michael S. Tsirkin
@ 2019-04-08 18:29         ` Alexander Duyck
  0 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2019-04-08 18:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, David Hildenbrand, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Apr 8, 2019 at 11:19 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Apr 08, 2019 at 02:09:59PM -0400, Nitesh Narayan Lal wrote:
> > On 4/8/19 12:36 PM, David Hildenbrand wrote:
> > > On 06.04.19 02:09, Alexander Duyck wrote:
> > >> So I am starting this thread as a spot to collect my thoughts on the
> > >> current guest free page hinting design as well as point out a few
> > >> possible things we could do to improve upon it.
> > >>
> > >> 1. The current design isn't likely going to scale well to multiple
> > >> VCPUs. The issue specifically is that the zone lock must be held to
> > >> pull pages off of the free list and to place them back there once they
> > >> have been hinted upon. As a result it would likely make sense to try
> > >> to limit ourselves to only having one thread performing the actual
> > >> hinting so that we can avoid running into issues with lock contention
> > >> between threads.
> > > Makes sense.
> > >
> > >> 2. There are currently concerns about the hinting triggering false OOM
> > >> situations if too much memory is isolated while it is being hinted. My
> > >> thought on this is to simply avoid the issue by only hint on a limited
> > >> amount of memory at a time. Something like 64MB should be a workable
> > >> limit without introducing much in the way of regressions. However as a
> > >> result of this we can easily be overrun while waiting on the host to
> > >> process the hinting request. As such we will probably need a way to
> > >> walk the free list and free pages after they have been freed instead
> > >> of trying to do it as they are freed.
> > > We will need such a way in case we care about dropped hinting requests, yes.
> > >
> > >> 3. Even with the current buffering which is still on the larger side
> > >> it is possible to overrun the hinting limits if something causes the
> > >> host to stall and a large swath of memory is released. As such we are
> > >> still going to need some sort of scanning mechanism or will have to
> > >> live with not providing accurate hints.
> > > Yes, usually if there is a lot of guest activity, you could however
> > > assume that free pages might get reused either way soon. Of course,
> > > special cases are "freeing XGB and being idle afterwards".
> > >
> > >> 4. In my opinion, the code overall is likely more complex then it
> > >> needs to be. We currently have 2 allocations that have to occur every
> > >> time we provide a hint all the way to the host, ideally we should not
> > >> need to allocate more memory to provide hints. We should be able to
> > >> hold the memory use for a memory hint device constant and simply map
> > >> the page address and size to the descriptors of the virtio-ring.
> > > I don't think the two allocations are that complex. The only thing I
> > > consider complex is isolation a lot of pages from different zones etc.
> > > Two allocations, nobody really cares about that. Of course, the fact
> > > that we have to allocate memory from the VCPUs where we currently freed
> > > a page is not optimal. I consider that rather a problem/complex.
> > >
> > > Especially you have a point regarding scalability and multiple VCPUs.
> > >
> > >> With that said I have a few ideas that may help to address the 4
> > >> issues called out above. The basic idea is simple. We use a high water
> > >> mark based on zone->free_area[order].nr_free to determine when to wake
> > >> up a thread to start hinting memory out of a given free area. From
> > >> there we allocate non-"Offline" pages from the free area and assign
> > >> them to the hinting queue up to 64MB at a time. Once the hinting is
> > >> completed we mark them "Offline" and add them to the tail of the
> > >> free_area. Doing this we should cycle the non-"Offline" pages slowly
> > >> out of the free_area. In addition the search cost should be minimal
> > >> since all of the "Offline" pages should be aggregated to the tail of
> > >> the free_area so all pages allocated off of the free_area will be the
> > >> non-"Offline" pages until we shift over to them all being "Offline".
> > >> This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
> > >> since the only real consumer of add_to_free_area_tail is
> > >> __free_one_page which uses it to place a page with an order less than
> > >> MAX_ORDER - 2 on the tail of a free_area assuming that it should be
> > >> freeing the buddy of that page shortly. The only other issue with
> > >> adding to tail would be the memory shuffling which was recently added,
> > >> but I don't see that as being something that will be enabled in most
> > >> cases so we could probably just make the features mutually exclusive,
> > >> at least for now.
> > >>
> > >> So if I am not mistaken this would essentially require a couple
> > >> changes to the mm infrastructure in order for this to work.
> > >>
> > >> First we would need to split nr_free into two counters, something like
> > >> nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
> > >> value currently used for nr_free. When we pulled the pages for hinting
> > >> we would reduce the nr_freed value and then add back to it when the
> > >> pages are returned. When pages are allocated they would increment the
> > >> nr_bound value. The idea behind this is that we can record nr_free
> > >> when we collect the pages and save it to some local value. This value
> > >> could then tell us how many new pages have been added that have not
> > >> been hinted upon.
> > > I can imagine that quite some people will have problems with such
> > > "virtualization specific changes" splattered around core memory
> > > management. Would there be a way to manage this data at a different
> > > place, out of core-mm and somehow work on it via callbacks?
> > >
> > >> In addition we will need some way to identify which pages have been
> > >> hinted on and which have not. The way I believe easiest to do this
> > >> would be to overload the PageType value so that we could essentially
> > >> have two values for "Buddy" pages. We would have our standard "Buddy"
> > >> pages, and "Buddy" pages that also have the "Offline" value set in the
> > >> PageType field. Tracking the Online vs Offline pages this way would
> > >> actually allow us to do this with almost no overhead as the mapcount
> > >> value is already being reset to clear the "Buddy" flag so adding a
> > >> "Offline" flag to this clearing should come at no additional cost.
> > > Just nothing here that this will require modifications to kdump
> > > (makedumpfile to be precise and the vmcore information exposed from the
> > > kernel), as kdump only checks for the the actual mapcount value to
> > > detect buddy and offline pages (to exclude them from dumps), they are
> > > not treated as flags.
> > >
> > > For now, any mapcount values are really only separate values, meaning
> > > not the separate bits are of interest, like flags would be. Reusing
> > > other flags would make our life a lot easier. E.g. PG_young or so. But
> > > clearing of these is then the problematic part.
> > >
> > > Of course we could use in the kernel two values, Buddy and BuddyOffline.
> > > But then we have to check for two different values whenever we want to
> > > identify a buddy page in the kernel.
> > >
> > >> Lastly we would need to create a specialized function for allocating
> > >> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
> > >> "Offline" pages. I'm thinking the alloc function it would look
> > >> something like __rmqueue_smallest but without the "expand" and needing
> > >> to modify the !page check to also include a check to verify the page
> > >> is not "Offline". As far as the changes to __free_one_page it would be
> > >> a 2 line change to test for the PageType being offline, and if it is
> > >> to call add_to_free_area_tail instead of add_to_free_area.
> > > As already mentioned, there might be scenarios where the additional
> > > hinting thread might consume too much CPU cycles, especially if there is
> > > little guest activity any you mostly spend time scanning a handful of
> > > free pages and reporting them. I wonder if we can somehow limit the
> > > amount of wakeups/scans for a given period to mitigate this issue.
> > >
> > > One main issue I see with your approach is that we need quite a lot of
> > > core memory management changes. This is a problem. I wonder if we can
> > > factor out most parts into callbacks.
> > >
> > > E.g. in order to detect where to queue a certain page (front/tail), call
> > > a callback if one is registered, mark/check pages in a core-mm unknown
> > > way as offline etc.
> > >
> > > I still wonder if there could be an easier way to combine recording of
> > > hints and one hinting thread, essentially avoiding scanning and some of
> > > the required core-mm changes.
> > In order to resolve the scalability issues associated with my
> > patch-series without compromising with free memory hints, I may explore
> > the idea described below:
> > - Use xbitmap (if possible - earlier suggested by Rik and Wei)
> > corresponding to each zone on a granularity of MAX_ORDER - 2, to track
> > the freed PFN's.
>
> MAX_ORDER - 2 is what? 2Mbyte?

Yeah, it should be 2MB assuming that MAX_ORDER - 1 is 4MB.

I'm still not a fan of us adding to much additional tracking data. The
problem with xbitmap is that you are going to have to protect that
with the zone lock and then scan it. Basically all you are going to do
is spread the pain out over all the allocate and free tasks since they
are going to have to update this bitmap now in addition to everything
else.

The way I see it we are adding additional overhead. We shouldn't need
much more than a single long per free_area that we maintain in the
virtio_balloon driver.

> > - Define and use counters corresponding to each zone to monitor the
> > amount of memory freed.
> > - As soon as the 64MB free memory threshold is hit wake up the kernel
> > thread which will scan this xbitmap and try to isolate the pages and
> > clear the corresponding bits. (We still have to acquire zone lock to
> > protect the respective xbitmap)
>
> So that's 32 pages then? I'd say just keep them in an array,
> list, tree or hash, bitmap is for when you have nots of pages.

The xbitmap I think is for the free page tracking. The problem is this
could build up to tons of pages while we are waiting on hints to
complete if we have a thread that is dumping a large amount of free
pages.

> > - Report the isolated pages back to the host in a synchronous manner.
> > I still have to work on several details of this idea including xbitmap,
> > but first would like to hear any suggestions/thoughts.

I'm still not a fan of trying to keep track of the free page metadata
in real-time. It is going to be far more expensive to have every free
and alloc have to update the extra piece of data than to just come
through after the fact and scan the new pages that have been added.

If we assume the free_area is populated such that the standard
alloc/free is always adding and removing from the head, then having
our offlining thread pulling from head and adding to tail would
naturally make things function such that our "walk" to get free pages
should be extremely inexpensive since head will always be the recent
stuff, and tail will build up with the offline stuff until there is no
recent stuff left and all the pages are offline.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
@ 2019-04-08 18:29         ` Alexander Duyck
  0 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2019-04-08 18:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, David Hildenbrand, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Apr 8, 2019 at 11:19 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Apr 08, 2019 at 02:09:59PM -0400, Nitesh Narayan Lal wrote:
> > On 4/8/19 12:36 PM, David Hildenbrand wrote:
> > > On 06.04.19 02:09, Alexander Duyck wrote:
> > >> So I am starting this thread as a spot to collect my thoughts on the
> > >> current guest free page hinting design as well as point out a few
> > >> possible things we could do to improve upon it.
> > >>
> > >> 1. The current design isn't likely going to scale well to multiple
> > >> VCPUs. The issue specifically is that the zone lock must be held to
> > >> pull pages off of the free list and to place them back there once they
> > >> have been hinted upon. As a result it would likely make sense to try
> > >> to limit ourselves to only having one thread performing the actual
> > >> hinting so that we can avoid running into issues with lock contention
> > >> between threads.
> > > Makes sense.
> > >
> > >> 2. There are currently concerns about the hinting triggering false OOM
> > >> situations if too much memory is isolated while it is being hinted. My
> > >> thought on this is to simply avoid the issue by only hint on a limited
> > >> amount of memory at a time. Something like 64MB should be a workable
> > >> limit without introducing much in the way of regressions. However as a
> > >> result of this we can easily be overrun while waiting on the host to
> > >> process the hinting request. As such we will probably need a way to
> > >> walk the free list and free pages after they have been freed instead
> > >> of trying to do it as they are freed.
> > > We will need such a way in case we care about dropped hinting requests, yes.
> > >
> > >> 3. Even with the current buffering which is still on the larger side
> > >> it is possible to overrun the hinting limits if something causes the
> > >> host to stall and a large swath of memory is released. As such we are
> > >> still going to need some sort of scanning mechanism or will have to
> > >> live with not providing accurate hints.
> > > Yes, usually if there is a lot of guest activity, you could however
> > > assume that free pages might get reused either way soon. Of course,
> > > special cases are "freeing XGB and being idle afterwards".
> > >
> > >> 4. In my opinion, the code overall is likely more complex then it
> > >> needs to be. We currently have 2 allocations that have to occur every
> > >> time we provide a hint all the way to the host, ideally we should not
> > >> need to allocate more memory to provide hints. We should be able to
> > >> hold the memory use for a memory hint device constant and simply map
> > >> the page address and size to the descriptors of the virtio-ring.
> > > I don't think the two allocations are that complex. The only thing I
> > > consider complex is isolation a lot of pages from different zones etc.
> > > Two allocations, nobody really cares about that. Of course, the fact
> > > that we have to allocate memory from the VCPUs where we currently freed
> > > a page is not optimal. I consider that rather a problem/complex.
> > >
> > > Especially you have a point regarding scalability and multiple VCPUs.
> > >
> > >> With that said I have a few ideas that may help to address the 4
> > >> issues called out above. The basic idea is simple. We use a high water
> > >> mark based on zone->free_area[order].nr_free to determine when to wake
> > >> up a thread to start hinting memory out of a given free area. From
> > >> there we allocate non-"Offline" pages from the free area and assign
> > >> them to the hinting queue up to 64MB at a time. Once the hinting is
> > >> completed we mark them "Offline" and add them to the tail of the
> > >> free_area. Doing this we should cycle the non-"Offline" pages slowly
> > >> out of the free_area. In addition the search cost should be minimal
> > >> since all of the "Offline" pages should be aggregated to the tail of
> > >> the free_area so all pages allocated off of the free_area will be the
> > >> non-"Offline" pages until we shift over to them all being "Offline".
> > >> This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
> > >> since the only real consumer of add_to_free_area_tail is
> > >> __free_one_page which uses it to place a page with an order less than
> > >> MAX_ORDER - 2 on the tail of a free_area assuming that it should be
> > >> freeing the buddy of that page shortly. The only other issue with
> > >> adding to tail would be the memory shuffling which was recently added,
> > >> but I don't see that as being something that will be enabled in most
> > >> cases so we could probably just make the features mutually exclusive,
> > >> at least for now.
> > >>
> > >> So if I am not mistaken this would essentially require a couple
> > >> changes to the mm infrastructure in order for this to work.
> > >>
> > >> First we would need to split nr_free into two counters, something like
> > >> nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
> > >> value currently used for nr_free. When we pulled the pages for hinting
> > >> we would reduce the nr_freed value and then add back to it when the
> > >> pages are returned. When pages are allocated they would increment the
> > >> nr_bound value. The idea behind this is that we can record nr_free
> > >> when we collect the pages and save it to some local value. This value
> > >> could then tell us how many new pages have been added that have not
> > >> been hinted upon.
> > > I can imagine that quite some people will have problems with such
> > > "virtualization specific changes" splattered around core memory
> > > management. Would there be a way to manage this data at a different
> > > place, out of core-mm and somehow work on it via callbacks?
> > >
> > >> In addition we will need some way to identify which pages have been
> > >> hinted on and which have not. The way I believe easiest to do this
> > >> would be to overload the PageType value so that we could essentially
> > >> have two values for "Buddy" pages. We would have our standard "Buddy"
> > >> pages, and "Buddy" pages that also have the "Offline" value set in the
> > >> PageType field. Tracking the Online vs Offline pages this way would
> > >> actually allow us to do this with almost no overhead as the mapcount
> > >> value is already being reset to clear the "Buddy" flag so adding a
> > >> "Offline" flag to this clearing should come at no additional cost.
> > > Just nothing here that this will require modifications to kdump
> > > (makedumpfile to be precise and the vmcore information exposed from the
> > > kernel), as kdump only checks for the the actual mapcount value to
> > > detect buddy and offline pages (to exclude them from dumps), they are
> > > not treated as flags.
> > >
> > > For now, any mapcount values are really only separate values, meaning
> > > not the separate bits are of interest, like flags would be. Reusing
> > > other flags would make our life a lot easier. E.g. PG_young or so. But
> > > clearing of these is then the problematic part.
> > >
> > > Of course we could use in the kernel two values, Buddy and BuddyOffline.
> > > But then we have to check for two different values whenever we want to
> > > identify a buddy page in the kernel.
> > >
> > >> Lastly we would need to create a specialized function for allocating
> > >> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
> > >> "Offline" pages. I'm thinking the alloc function it would look
> > >> something like __rmqueue_smallest but without the "expand" and needing
> > >> to modify the !page check to also include a check to verify the page
> > >> is not "Offline". As far as the changes to __free_one_page it would be
> > >> a 2 line change to test for the PageType being offline, and if it is
> > >> to call add_to_free_area_tail instead of add_to_free_area.
> > > As already mentioned, there might be scenarios where the additional
> > > hinting thread might consume too much CPU cycles, especially if there is
> > > little guest activity any you mostly spend time scanning a handful of
> > > free pages and reporting them. I wonder if we can somehow limit the
> > > amount of wakeups/scans for a given period to mitigate this issue.
> > >
> > > One main issue I see with your approach is that we need quite a lot of
> > > core memory management changes. This is a problem. I wonder if we can
> > > factor out most parts into callbacks.
> > >
> > > E.g. in order to detect where to queue a certain page (front/tail), call
> > > a callback if one is registered, mark/check pages in a core-mm unknown
> > > way as offline etc.
> > >
> > > I still wonder if there could be an easier way to combine recording of
> > > hints and one hinting thread, essentially avoiding scanning and some of
> > > the required core-mm changes.
> > In order to resolve the scalability issues associated with my
> > patch-series without compromising with free memory hints, I may explore
> > the idea described below:
> > - Use xbitmap (if possible - earlier suggested by Rik and Wei)
> > corresponding to each zone on a granularity of MAX_ORDER - 2, to track
> > the freed PFN's.
>
> MAX_ORDER - 2 is what? 2Mbyte?

Yeah, it should be 2MB assuming that MAX_ORDER - 1 is 4MB.

I'm still not a fan of us adding to much additional tracking data. The
problem with xbitmap is that you are going to have to protect that
with the zone lock and then scan it. Basically all you are going to do
is spread the pain out over all the allocate and free tasks since they
are going to have to update this bitmap now in addition to everything
else.

The way I see it we are adding additional overhead. We shouldn't need
much more than a single long per free_area that we maintain in the
virtio_balloon driver.

> > - Define and use counters corresponding to each zone to monitor the
> > amount of memory freed.
> > - As soon as the 64MB free memory threshold is hit wake up the kernel
> > thread which will scan this xbitmap and try to isolate the pages and
> > clear the corresponding bits. (We still have to acquire zone lock to
> > protect the respective xbitmap)
>
> So that's 32 pages then? I'd say just keep them in an array,
> list, tree or hash, bitmap is for when you have nots of pages.

The xbitmap I think is for the free page tracking. The problem is this
could build up to tons of pages while we are waiting on hints to
complete if we have a thread that is dumping a large amount of free
pages.

> > - Report the isolated pages back to the host in a synchronous manner.
> > I still have to work on several details of this idea including xbitmap,
> > but first would like to hear any suggestions/thoughts.

I'm still not a fan of trying to keep track of the free page metadata
in real-time. It is going to be far more expensive to have every free
and alloc have to update the extra piece of data than to just come
through after the fact and scan the new pages that have been added.

If we assume the free_area is populated such that the standard
alloc/free is always adding and removing from the head, then having
our offlining thread pulling from head and adding to tail would
naturally make things function such that our "walk" to get free pages
should be extremely inexpensive since head will always be the recent
stuff, and tail will build up with the offline stuff until there is no
recent stuff left and all the pages are offline.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-08 18:18     ` Alexander Duyck
  (?)
@ 2019-04-08 18:40     ` David Hildenbrand
  2019-04-08 20:10         ` Alexander Duyck
  -1 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2019-04-08 18:40 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

>>>
>>> In addition we will need some way to identify which pages have been
>>> hinted on and which have not. The way I believe easiest to do this
>>> would be to overload the PageType value so that we could essentially
>>> have two values for "Buddy" pages. We would have our standard "Buddy"
>>> pages, and "Buddy" pages that also have the "Offline" value set in the
>>> PageType field. Tracking the Online vs Offline pages this way would
>>> actually allow us to do this with almost no overhead as the mapcount
>>> value is already being reset to clear the "Buddy" flag so adding a
>>> "Offline" flag to this clearing should come at no additional cost.
>>
>> Just nothing here that this will require modifications to kdump
>> (makedumpfile to be precise and the vmcore information exposed from the
>> kernel), as kdump only checks for the the actual mapcount value to
>> detect buddy and offline pages (to exclude them from dumps), they are
>> not treated as flags.
>>
>> For now, any mapcount values are really only separate values, meaning
>> not the separate bits are of interest, like flags would be. Reusing
>> other flags would make our life a lot easier. E.g. PG_young or so. But
>> clearing of these is then the problematic part.
>>
>> Of course we could use in the kernel two values, Buddy and BuddyOffline.
>> But then we have to check for two different values whenever we want to
>> identify a buddy page in the kernel.
> 
> Actually this may not be working the way you think it is working.

Trust me, I know how it works. That's why I was giving you the notice.

Read the first paragraph again and ignore the others. I am only
concerned about makedumpfile that has to be changed.

PAGE_OFFLINE_MAPCOUNT_VALUE
PAGE_BUDDY_MAPCOUNT_VALUE

Once you find out how these values are used, you should understand what
has to be changed and where.

>>>
>>> Lastly we would need to create a specialized function for allocating
>>> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
>>> "Offline" pages. I'm thinking the alloc function it would look
>>> something like __rmqueue_smallest but without the "expand" and needing
>>> to modify the !page check to also include a check to verify the page
>>> is not "Offline". As far as the changes to __free_one_page it would be
>>> a 2 line change to test for the PageType being offline, and if it is
>>> to call add_to_free_area_tail instead of add_to_free_area.
>>
>> As already mentioned, there might be scenarios where the additional
>> hinting thread might consume too much CPU cycles, especially if there is
>> little guest activity any you mostly spend time scanning a handful of
>> free pages and reporting them. I wonder if we can somehow limit the
>> amount of wakeups/scans for a given period to mitigate this issue.
> 
> That is why I was talking about breaking nr_free into nr_freed and
> nr_bound. By doing that I can record the nr_free value to a
> virtio-balloon specific location at the start of any walk and should
> know exactly now many pages were freed between that call and the next
> one. By ordering things such that we place the "Offline" pages on the
> tail of the list it should make the search quite fast since we would
> just be always allocating off of the head of the queue until we have
> hinted everything int he queue. So when we hit the last call to alloc
> the non-"Offline" pages and shut down our thread we can use the
> nr_freed value that we recorded to know exactly how many pages have
> been added that haven't been hinted.
> 
>> One main issue I see with your approach is that we need quite a lot of
>> core memory management changes. This is a problem. I wonder if we can
>> factor out most parts into callbacks.
> 
> I think that is something we can't get away from. However if we make
> this generic enough there would likely be others beyond just the
> virtualization drivers that could make use of the infrastructure. For
> example being able to track the rate at which the free areas are
> cycling in and out pages seems like something that would be useful
> outside of just the virtualization areas.

Might be, but might be the other extreme, people not wanting such
special cases in core mm. I assume the latter until I see a very clear
design where such stuff has been properly factored out.

> 
>> E.g. in order to detect where to queue a certain page (front/tail), call
>> a callback if one is registered, mark/check pages in a core-mm unknown
>> way as offline etc.
>>
>> I still wonder if there could be an easier way to combine recording of
>> hints and one hinting thread, essentially avoiding scanning and some of
>> the required core-mm changes.
> 
> The concern I have with trying to avoid the scanning by tracking is
> that if you fall behind it becomes something where just tracking the
> metadata for the page hints would start to become expensive.

Depends, if it is mostly only marking a bit in a bitmap, it should in
general not be too much of an issue. As usual, the datastructure used is
the important bit.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-08 18:29         ` Alexander Duyck
  (?)
@ 2019-04-08 18:58         ` David Hildenbrand
  -1 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2019-04-08 18:58 UTC (permalink / raw)
  To: Alexander Duyck, Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini,
	lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

>>> - Define and use counters corresponding to each zone to monitor the
>>> amount of memory freed.
>>> - As soon as the 64MB free memory threshold is hit wake up the kernel
>>> thread which will scan this xbitmap and try to isolate the pages and
>>> clear the corresponding bits. (We still have to acquire zone lock to
>>> protect the respective xbitmap)
>>
>> So that's 32 pages then? I'd say just keep them in an array,
>> list, tree or hash, bitmap is for when you have nots of pages.
> 
> The xbitmap I think is for the free page tracking. The problem is this
> could build up to tons of pages while we are waiting on hints to
> complete if we have a thread that is dumping a large amount of free
> pages.
> 
>>> - Report the isolated pages back to the host in a synchronous manner.
>>> I still have to work on several details of this idea including xbitmap,
>>> but first would like to hear any suggestions/thoughts.
> 
> I'm still not a fan of trying to keep track of the free page metadata
> in real-time. It is going to be far more expensive to have every free
> and alloc have to update the extra piece of data than to just come
> through after the fact and scan the new pages that have been added.

Tracking metadata separately is a very good starting point. While
integration into core mm in some form e.g. like you describe would be
desirable long term, not messing too much with core-mm is the lower
hanging fruit.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-08 18:40     ` David Hildenbrand
@ 2019-04-08 20:10         ` Alexander Duyck
  0 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2019-04-08 20:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Apr 8, 2019 at 11:40 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>>
> >>> In addition we will need some way to identify which pages have been
> >>> hinted on and which have not. The way I believe easiest to do this
> >>> would be to overload the PageType value so that we could essentially
> >>> have two values for "Buddy" pages. We would have our standard "Buddy"
> >>> pages, and "Buddy" pages that also have the "Offline" value set in the
> >>> PageType field. Tracking the Online vs Offline pages this way would
> >>> actually allow us to do this with almost no overhead as the mapcount
> >>> value is already being reset to clear the "Buddy" flag so adding a
> >>> "Offline" flag to this clearing should come at no additional cost.
> >>
> >> Just nothing here that this will require modifications to kdump
> >> (makedumpfile to be precise and the vmcore information exposed from the
> >> kernel), as kdump only checks for the the actual mapcount value to
> >> detect buddy and offline pages (to exclude them from dumps), they are
> >> not treated as flags.
> >>
> >> For now, any mapcount values are really only separate values, meaning
> >> not the separate bits are of interest, like flags would be. Reusing
> >> other flags would make our life a lot easier. E.g. PG_young or so. But
> >> clearing of these is then the problematic part.
> >>
> >> Of course we could use in the kernel two values, Buddy and BuddyOffline.
> >> But then we have to check for two different values whenever we want to
> >> identify a buddy page in the kernel.
> >
> > Actually this may not be working the way you think it is working.
>
> Trust me, I know how it works. That's why I was giving you the notice.
>
> Read the first paragraph again and ignore the others. I am only
> concerned about makedumpfile that has to be changed.
>
> PAGE_OFFLINE_MAPCOUNT_VALUE
> PAGE_BUDDY_MAPCOUNT_VALUE
>
> Once you find out how these values are used, you should understand what
> has to be changed and where.

Ugh. Is there an official repo I am supposed to refer to for makedumpfile?

As far as the changes needed I don't think this would necessitate
additional exports. We could probably just get away with having
makedumpfile generate a new value by simply doing an "&" of the two
values to determine what an offline buddy would be. If need be I can
submit a patch for that. I find it kind of annoying that the kernel is
handling identifying these bits one way, and makedumpfile is doing it
another way. It should have been setup to handle this all the same
way.

>
> >>>
> >>> Lastly we would need to create a specialized function for allocating
> >>> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
> >>> "Offline" pages. I'm thinking the alloc function it would look
> >>> something like __rmqueue_smallest but without the "expand" and needing
> >>> to modify the !page check to also include a check to verify the page
> >>> is not "Offline". As far as the changes to __free_one_page it would be
> >>> a 2 line change to test for the PageType being offline, and if it is
> >>> to call add_to_free_area_tail instead of add_to_free_area.
> >>
> >> As already mentioned, there might be scenarios where the additional
> >> hinting thread might consume too much CPU cycles, especially if there is
> >> little guest activity any you mostly spend time scanning a handful of
> >> free pages and reporting them. I wonder if we can somehow limit the
> >> amount of wakeups/scans for a given period to mitigate this issue.
> >
> > That is why I was talking about breaking nr_free into nr_freed and
> > nr_bound. By doing that I can record the nr_free value to a
> > virtio-balloon specific location at the start of any walk and should
> > know exactly now many pages were freed between that call and the next
> > one. By ordering things such that we place the "Offline" pages on the
> > tail of the list it should make the search quite fast since we would
> > just be always allocating off of the head of the queue until we have
> > hinted everything int he queue. So when we hit the last call to alloc
> > the non-"Offline" pages and shut down our thread we can use the
> > nr_freed value that we recorded to know exactly how many pages have
> > been added that haven't been hinted.
> >
> >> One main issue I see with your approach is that we need quite a lot of
> >> core memory management changes. This is a problem. I wonder if we can
> >> factor out most parts into callbacks.
> >
> > I think that is something we can't get away from. However if we make
> > this generic enough there would likely be others beyond just the
> > virtualization drivers that could make use of the infrastructure. For
> > example being able to track the rate at which the free areas are
> > cycling in and out pages seems like something that would be useful
> > outside of just the virtualization areas.
>
> Might be, but might be the other extreme, people not wanting such
> special cases in core mm. I assume the latter until I see a very clear
> design where such stuff has been properly factored out.

The only real pain point I am seeing right now is the assumptions
makedumpfile is currently making about how mapcount is being used to
indicate pagetype. If we patch it to fix it most of the other bits are
minor.

> >
> >> E.g. in order to detect where to queue a certain page (front/tail), call
> >> a callback if one is registered, mark/check pages in a core-mm unknown
> >> way as offline etc.
> >>
> >> I still wonder if there could be an easier way to combine recording of
> >> hints and one hinting thread, essentially avoiding scanning and some of
> >> the required core-mm changes.
> >
> > The concern I have with trying to avoid the scanning by tracking is
> > that if you fall behind it becomes something where just tracking the
> > metadata for the page hints would start to become expensive.
>
> Depends, if it is mostly only marking a bit in a bitmap, it should in
> general not be too much of an issue. As usual, the datastructure used is
> the important bit.

Right, but that is a shared bitmap. It means there is yet another
cacheline that will be touched for allocating and freeing pages and
that is going to add cost. That cost is even worse if we are talking
about using the existing setup that was only tracking the pages that
was freed and then having to search for the page buddy from the
original order up to MAX_ORDER - 1.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
@ 2019-04-08 20:10         ` Alexander Duyck
  0 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2019-04-08 20:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Apr 8, 2019 at 11:40 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>>
> >>> In addition we will need some way to identify which pages have been
> >>> hinted on and which have not. The way I believe easiest to do this
> >>> would be to overload the PageType value so that we could essentially
> >>> have two values for "Buddy" pages. We would have our standard "Buddy"
> >>> pages, and "Buddy" pages that also have the "Offline" value set in the
> >>> PageType field. Tracking the Online vs Offline pages this way would
> >>> actually allow us to do this with almost no overhead as the mapcount
> >>> value is already being reset to clear the "Buddy" flag so adding a
> >>> "Offline" flag to this clearing should come at no additional cost.
> >>
> >> Just nothing here that this will require modifications to kdump
> >> (makedumpfile to be precise and the vmcore information exposed from the
> >> kernel), as kdump only checks for the the actual mapcount value to
> >> detect buddy and offline pages (to exclude them from dumps), they are
> >> not treated as flags.
> >>
> >> For now, any mapcount values are really only separate values, meaning
> >> not the separate bits are of interest, like flags would be. Reusing
> >> other flags would make our life a lot easier. E.g. PG_young or so. But
> >> clearing of these is then the problematic part.
> >>
> >> Of course we could use in the kernel two values, Buddy and BuddyOffline.
> >> But then we have to check for two different values whenever we want to
> >> identify a buddy page in the kernel.
> >
> > Actually this may not be working the way you think it is working.
>
> Trust me, I know how it works. That's why I was giving you the notice.
>
> Read the first paragraph again and ignore the others. I am only
> concerned about makedumpfile that has to be changed.
>
> PAGE_OFFLINE_MAPCOUNT_VALUE
> PAGE_BUDDY_MAPCOUNT_VALUE
>
> Once you find out how these values are used, you should understand what
> has to be changed and where.

Ugh. Is there an official repo I am supposed to refer to for makedumpfile?

As far as the changes needed I don't think this would necessitate
additional exports. We could probably just get away with having
makedumpfile generate a new value by simply doing an "&" of the two
values to determine what an offline buddy would be. If need be I can
submit a patch for that. I find it kind of annoying that the kernel is
handling identifying these bits one way, and makedumpfile is doing it
another way. It should have been setup to handle this all the same
way.

>
> >>>
> >>> Lastly we would need to create a specialized function for allocating
> >>> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
> >>> "Offline" pages. I'm thinking the alloc function it would look
> >>> something like __rmqueue_smallest but without the "expand" and needing
> >>> to modify the !page check to also include a check to verify the page
> >>> is not "Offline". As far as the changes to __free_one_page it would be
> >>> a 2 line change to test for the PageType being offline, and if it is
> >>> to call add_to_free_area_tail instead of add_to_free_area.
> >>
> >> As already mentioned, there might be scenarios where the additional
> >> hinting thread might consume too much CPU cycles, especially if there is
> >> little guest activity any you mostly spend time scanning a handful of
> >> free pages and reporting them. I wonder if we can somehow limit the
> >> amount of wakeups/scans for a given period to mitigate this issue.
> >
> > That is why I was talking about breaking nr_free into nr_freed and
> > nr_bound. By doing that I can record the nr_free value to a
> > virtio-balloon specific location at the start of any walk and should
> > know exactly now many pages were freed between that call and the next
> > one. By ordering things such that we place the "Offline" pages on the
> > tail of the list it should make the search quite fast since we would
> > just be always allocating off of the head of the queue until we have
> > hinted everything int he queue. So when we hit the last call to alloc
> > the non-"Offline" pages and shut down our thread we can use the
> > nr_freed value that we recorded to know exactly how many pages have
> > been added that haven't been hinted.
> >
> >> One main issue I see with your approach is that we need quite a lot of
> >> core memory management changes. This is a problem. I wonder if we can
> >> factor out most parts into callbacks.
> >
> > I think that is something we can't get away from. However if we make
> > this generic enough there would likely be others beyond just the
> > virtualization drivers that could make use of the infrastructure. For
> > example being able to track the rate at which the free areas are
> > cycling in and out pages seems like something that would be useful
> > outside of just the virtualization areas.
>
> Might be, but might be the other extreme, people not wanting such
> special cases in core mm. I assume the latter until I see a very clear
> design where such stuff has been properly factored out.

The only real pain point I am seeing right now is the assumptions
makedumpfile is currently making about how mapcount is being used to
indicate pagetype. If we patch it to fix it most of the other bits are
minor.

> >
> >> E.g. in order to detect where to queue a certain page (front/tail), call
> >> a callback if one is registered, mark/check pages in a core-mm unknown
> >> way as offline etc.
> >>
> >> I still wonder if there could be an easier way to combine recording of
> >> hints and one hinting thread, essentially avoiding scanning and some of
> >> the required core-mm changes.
> >
> > The concern I have with trying to avoid the scanning by tracking is
> > that if you fall behind it becomes something where just tracking the
> > metadata for the page hints would start to become expensive.
>
> Depends, if it is mostly only marking a bit in a bitmap, it should in
> general not be too much of an issue. As usual, the datastructure used is
> the important bit.

Right, but that is a shared bitmap. It means there is yet another
cacheline that will be touched for allocating and freeing pages and
that is going to add cost. That cost is even worse if we are talking
about using the existing setup that was only tracking the pages that
was freed and then having to search for the page buddy from the
original order up to MAX_ORDER - 1.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-08 20:10         ` Alexander Duyck
  (?)
@ 2019-04-08 20:47         ` David Hildenbrand
  -1 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2019-04-08 20:47 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On 08.04.19 22:10, Alexander Duyck wrote:
> On Mon, Apr 8, 2019 at 11:40 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>>>>
>>>>> In addition we will need some way to identify which pages have been
>>>>> hinted on and which have not. The way I believe easiest to do this
>>>>> would be to overload the PageType value so that we could essentially
>>>>> have two values for "Buddy" pages. We would have our standard "Buddy"
>>>>> pages, and "Buddy" pages that also have the "Offline" value set in the
>>>>> PageType field. Tracking the Online vs Offline pages this way would
>>>>> actually allow us to do this with almost no overhead as the mapcount
>>>>> value is already being reset to clear the "Buddy" flag so adding a
>>>>> "Offline" flag to this clearing should come at no additional cost.
>>>>
>>>> Just nothing here that this will require modifications to kdump
>>>> (makedumpfile to be precise and the vmcore information exposed from the
>>>> kernel), as kdump only checks for the the actual mapcount value to
>>>> detect buddy and offline pages (to exclude them from dumps), they are
>>>> not treated as flags.
>>>>
>>>> For now, any mapcount values are really only separate values, meaning
>>>> not the separate bits are of interest, like flags would be. Reusing
>>>> other flags would make our life a lot easier. E.g. PG_young or so. But
>>>> clearing of these is then the problematic part.
>>>>
>>>> Of course we could use in the kernel two values, Buddy and BuddyOffline.
>>>> But then we have to check for two different values whenever we want to
>>>> identify a buddy page in the kernel.
>>>
>>> Actually this may not be working the way you think it is working.
>>
>> Trust me, I know how it works. That's why I was giving you the notice.
>>
>> Read the first paragraph again and ignore the others. I am only
>> concerned about makedumpfile that has to be changed.
>>
>> PAGE_OFFLINE_MAPCOUNT_VALUE
>> PAGE_BUDDY_MAPCOUNT_VALUE
>>
>> Once you find out how these values are used, you should understand what
>> has to be changed and where.
> 
> Ugh. Is there an official repo I am supposed to refer to for makedumpfile?
> 
> As far as the changes needed I don't think this would necessitate
> additional exports. We could probably just get away with having
> makedumpfile generate a new value by simply doing an "&" of the two
> values to determine what an offline buddy would be. If need be I can
> submit a patch for that. I find it kind of annoying that the kernel is
> handling identifying these bits one way, and makedumpfile is doing it
> another way. It should have been setup to handle this all the same
> way.

Here you go:

https://sourceforge.net/p/makedumpfile/code/ci/master/tree/

for now we only had one type at a time, so it wasn't an issue. E.g.
Buddy or Offline were never used in combination with other types. They
had distinct mapcount values.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-08 20:10         ` Alexander Duyck
  (?)
  (?)
@ 2019-04-08 20:51         ` David Hildenbrand
  2019-04-08 21:20           ` David Hildenbrand
  -1 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2019-04-08 20:51 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On 08.04.19 22:10, Alexander Duyck wrote:
> On Mon, Apr 8, 2019 at 11:40 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>>>>
>>>>> In addition we will need some way to identify which pages have been
>>>>> hinted on and which have not. The way I believe easiest to do this
>>>>> would be to overload the PageType value so that we could essentially
>>>>> have two values for "Buddy" pages. We would have our standard "Buddy"
>>>>> pages, and "Buddy" pages that also have the "Offline" value set in the
>>>>> PageType field. Tracking the Online vs Offline pages this way would
>>>>> actually allow us to do this with almost no overhead as the mapcount
>>>>> value is already being reset to clear the "Buddy" flag so adding a
>>>>> "Offline" flag to this clearing should come at no additional cost.
>>>>
>>>> Just nothing here that this will require modifications to kdump
>>>> (makedumpfile to be precise and the vmcore information exposed from the
>>>> kernel), as kdump only checks for the the actual mapcount value to
>>>> detect buddy and offline pages (to exclude them from dumps), they are
>>>> not treated as flags.
>>>>
>>>> For now, any mapcount values are really only separate values, meaning
>>>> not the separate bits are of interest, like flags would be. Reusing
>>>> other flags would make our life a lot easier. E.g. PG_young or so. But
>>>> clearing of these is then the problematic part.
>>>>
>>>> Of course we could use in the kernel two values, Buddy and BuddyOffline.
>>>> But then we have to check for two different values whenever we want to
>>>> identify a buddy page in the kernel.
>>>
>>> Actually this may not be working the way you think it is working.
>>
>> Trust me, I know how it works. That's why I was giving you the notice.
>>
>> Read the first paragraph again and ignore the others. I am only
>> concerned about makedumpfile that has to be changed.
>>
>> PAGE_OFFLINE_MAPCOUNT_VALUE
>> PAGE_BUDDY_MAPCOUNT_VALUE
>>
>> Once you find out how these values are used, you should understand what
>> has to be changed and where.
> 
> Ugh. Is there an official repo I am supposed to refer to for makedumpfile?
> 
> As far as the changes needed I don't think this would necessitate
> additional exports. We could probably just get away with having
> makedumpfile generate a new value by simply doing an "&" of the two
> values to determine what an offline buddy would be. If need be I can
> submit a patch for that. I find it kind of annoying that the kernel is
> handling identifying these bits one way, and makedumpfile is doing it
> another way. It should have been setup to handle this all the same
> way.
> 
>>
>>>>>
>>>>> Lastly we would need to create a specialized function for allocating
>>>>> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
>>>>> "Offline" pages. I'm thinking the alloc function it would look
>>>>> something like __rmqueue_smallest but without the "expand" and needing
>>>>> to modify the !page check to also include a check to verify the page
>>>>> is not "Offline". As far as the changes to __free_one_page it would be
>>>>> a 2 line change to test for the PageType being offline, and if it is
>>>>> to call add_to_free_area_tail instead of add_to_free_area.
>>>>
>>>> As already mentioned, there might be scenarios where the additional
>>>> hinting thread might consume too much CPU cycles, especially if there is
>>>> little guest activity any you mostly spend time scanning a handful of
>>>> free pages and reporting them. I wonder if we can somehow limit the
>>>> amount of wakeups/scans for a given period to mitigate this issue.
>>>
>>> That is why I was talking about breaking nr_free into nr_freed and
>>> nr_bound. By doing that I can record the nr_free value to a
>>> virtio-balloon specific location at the start of any walk and should
>>> know exactly now many pages were freed between that call and the next
>>> one. By ordering things such that we place the "Offline" pages on the
>>> tail of the list it should make the search quite fast since we would
>>> just be always allocating off of the head of the queue until we have
>>> hinted everything int he queue. So when we hit the last call to alloc
>>> the non-"Offline" pages and shut down our thread we can use the
>>> nr_freed value that we recorded to know exactly how many pages have
>>> been added that haven't been hinted.
>>>
>>>> One main issue I see with your approach is that we need quite a lot of
>>>> core memory management changes. This is a problem. I wonder if we can
>>>> factor out most parts into callbacks.
>>>
>>> I think that is something we can't get away from. However if we make
>>> this generic enough there would likely be others beyond just the
>>> virtualization drivers that could make use of the infrastructure. For
>>> example being able to track the rate at which the free areas are
>>> cycling in and out pages seems like something that would be useful
>>> outside of just the virtualization areas.
>>
>> Might be, but might be the other extreme, people not wanting such
>> special cases in core mm. I assume the latter until I see a very clear
>> design where such stuff has been properly factored out.
> 
> The only real pain point I am seeing right now is the assumptions
> makedumpfile is currently making about how mapcount is being used to
> indicate pagetype. If we patch it to fix it most of the other bits are
> minor.

I'll be curious how splitting etc. will be handled. Especially if you
want to set Offline for all affected sub pages.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-08 20:51         ` David Hildenbrand
@ 2019-04-08 21:20           ` David Hildenbrand
  2019-04-08 21:56               ` Alexander Duyck
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2019-04-08 21:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On 08.04.19 22:51, David Hildenbrand wrote:
> On 08.04.19 22:10, Alexander Duyck wrote:
>> On Mon, Apr 8, 2019 at 11:40 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>>>>>
>>>>>> In addition we will need some way to identify which pages have been
>>>>>> hinted on and which have not. The way I believe easiest to do this
>>>>>> would be to overload the PageType value so that we could essentially
>>>>>> have two values for "Buddy" pages. We would have our standard "Buddy"
>>>>>> pages, and "Buddy" pages that also have the "Offline" value set in the
>>>>>> PageType field. Tracking the Online vs Offline pages this way would
>>>>>> actually allow us to do this with almost no overhead as the mapcount
>>>>>> value is already being reset to clear the "Buddy" flag so adding a
>>>>>> "Offline" flag to this clearing should come at no additional cost.
>>>>>
>>>>> Just nothing here that this will require modifications to kdump
>>>>> (makedumpfile to be precise and the vmcore information exposed from the
>>>>> kernel), as kdump only checks for the the actual mapcount value to
>>>>> detect buddy and offline pages (to exclude them from dumps), they are
>>>>> not treated as flags.
>>>>>
>>>>> For now, any mapcount values are really only separate values, meaning
>>>>> not the separate bits are of interest, like flags would be. Reusing
>>>>> other flags would make our life a lot easier. E.g. PG_young or so. But
>>>>> clearing of these is then the problematic part.
>>>>>
>>>>> Of course we could use in the kernel two values, Buddy and BuddyOffline.
>>>>> But then we have to check for two different values whenever we want to
>>>>> identify a buddy page in the kernel.
>>>>
>>>> Actually this may not be working the way you think it is working.
>>>
>>> Trust me, I know how it works. That's why I was giving you the notice.
>>>
>>> Read the first paragraph again and ignore the others. I am only
>>> concerned about makedumpfile that has to be changed.
>>>
>>> PAGE_OFFLINE_MAPCOUNT_VALUE
>>> PAGE_BUDDY_MAPCOUNT_VALUE
>>>
>>> Once you find out how these values are used, you should understand what
>>> has to be changed and where.
>>
>> Ugh. Is there an official repo I am supposed to refer to for makedumpfile?
>>
>> As far as the changes needed I don't think this would necessitate
>> additional exports. We could probably just get away with having
>> makedumpfile generate a new value by simply doing an "&" of the two
>> values to determine what an offline buddy would be. If need be I can
>> submit a patch for that. I find it kind of annoying that the kernel is
>> handling identifying these bits one way, and makedumpfile is doing it
>> another way. It should have been setup to handle this all the same
>> way.
>>
>>>
>>>>>>
>>>>>> Lastly we would need to create a specialized function for allocating
>>>>>> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
>>>>>> "Offline" pages. I'm thinking the alloc function it would look
>>>>>> something like __rmqueue_smallest but without the "expand" and needing
>>>>>> to modify the !page check to also include a check to verify the page
>>>>>> is not "Offline". As far as the changes to __free_one_page it would be
>>>>>> a 2 line change to test for the PageType being offline, and if it is
>>>>>> to call add_to_free_area_tail instead of add_to_free_area.
>>>>>
>>>>> As already mentioned, there might be scenarios where the additional
>>>>> hinting thread might consume too much CPU cycles, especially if there is
>>>>> little guest activity any you mostly spend time scanning a handful of
>>>>> free pages and reporting them. I wonder if we can somehow limit the
>>>>> amount of wakeups/scans for a given period to mitigate this issue.
>>>>
>>>> That is why I was talking about breaking nr_free into nr_freed and
>>>> nr_bound. By doing that I can record the nr_free value to a
>>>> virtio-balloon specific location at the start of any walk and should
>>>> know exactly now many pages were freed between that call and the next
>>>> one. By ordering things such that we place the "Offline" pages on the
>>>> tail of the list it should make the search quite fast since we would
>>>> just be always allocating off of the head of the queue until we have
>>>> hinted everything int he queue. So when we hit the last call to alloc
>>>> the non-"Offline" pages and shut down our thread we can use the
>>>> nr_freed value that we recorded to know exactly how many pages have
>>>> been added that haven't been hinted.
>>>>
>>>>> One main issue I see with your approach is that we need quite a lot of
>>>>> core memory management changes. This is a problem. I wonder if we can
>>>>> factor out most parts into callbacks.
>>>>
>>>> I think that is something we can't get away from. However if we make
>>>> this generic enough there would likely be others beyond just the
>>>> virtualization drivers that could make use of the infrastructure. For
>>>> example being able to track the rate at which the free areas are
>>>> cycling in and out pages seems like something that would be useful
>>>> outside of just the virtualization areas.
>>>
>>> Might be, but might be the other extreme, people not wanting such
>>> special cases in core mm. I assume the latter until I see a very clear
>>> design where such stuff has been properly factored out.
>>
>> The only real pain point I am seeing right now is the assumptions
>> makedumpfile is currently making about how mapcount is being used to
>> indicate pagetype. If we patch it to fix it most of the other bits are
>> minor.
> 
> I'll be curious how splitting etc. will be handled. Especially if you
> want to set Offline for all affected sub pages.
> 

Answering that myself, I guess you are planning to change the buddy to
basically copy the offline value to sub-pages when splitting, also
attaching them to the tail of the list instead of the head.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-08 21:20           ` David Hildenbrand
@ 2019-04-08 21:56               ` Alexander Duyck
  0 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2019-04-08 21:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Apr 8, 2019 at 2:21 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.04.19 22:51, David Hildenbrand wrote:
> > On 08.04.19 22:10, Alexander Duyck wrote:
> >> On Mon, Apr 8, 2019 at 11:40 AM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>>>>>
> >>>>>> In addition we will need some way to identify which pages have been
> >>>>>> hinted on and which have not. The way I believe easiest to do this
> >>>>>> would be to overload the PageType value so that we could essentially
> >>>>>> have two values for "Buddy" pages. We would have our standard "Buddy"
> >>>>>> pages, and "Buddy" pages that also have the "Offline" value set in the
> >>>>>> PageType field. Tracking the Online vs Offline pages this way would
> >>>>>> actually allow us to do this with almost no overhead as the mapcount
> >>>>>> value is already being reset to clear the "Buddy" flag so adding a
> >>>>>> "Offline" flag to this clearing should come at no additional cost.
> >>>>>
> >>>>> Just nothing here that this will require modifications to kdump
> >>>>> (makedumpfile to be precise and the vmcore information exposed from the
> >>>>> kernel), as kdump only checks for the the actual mapcount value to
> >>>>> detect buddy and offline pages (to exclude them from dumps), they are
> >>>>> not treated as flags.
> >>>>>
> >>>>> For now, any mapcount values are really only separate values, meaning
> >>>>> not the separate bits are of interest, like flags would be. Reusing
> >>>>> other flags would make our life a lot easier. E.g. PG_young or so. But
> >>>>> clearing of these is then the problematic part.
> >>>>>
> >>>>> Of course we could use in the kernel two values, Buddy and BuddyOffline.
> >>>>> But then we have to check for two different values whenever we want to
> >>>>> identify a buddy page in the kernel.
> >>>>
> >>>> Actually this may not be working the way you think it is working.
> >>>
> >>> Trust me, I know how it works. That's why I was giving you the notice.
> >>>
> >>> Read the first paragraph again and ignore the others. I am only
> >>> concerned about makedumpfile that has to be changed.
> >>>
> >>> PAGE_OFFLINE_MAPCOUNT_VALUE
> >>> PAGE_BUDDY_MAPCOUNT_VALUE
> >>>
> >>> Once you find out how these values are used, you should understand what
> >>> has to be changed and where.
> >>
> >> Ugh. Is there an official repo I am supposed to refer to for makedumpfile?
> >>
> >> As far as the changes needed I don't think this would necessitate
> >> additional exports. We could probably just get away with having
> >> makedumpfile generate a new value by simply doing an "&" of the two
> >> values to determine what an offline buddy would be. If need be I can
> >> submit a patch for that. I find it kind of annoying that the kernel is
> >> handling identifying these bits one way, and makedumpfile is doing it
> >> another way. It should have been setup to handle this all the same
> >> way.
> >>
> >>>
> >>>>>>
> >>>>>> Lastly we would need to create a specialized function for allocating
> >>>>>> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
> >>>>>> "Offline" pages. I'm thinking the alloc function it would look
> >>>>>> something like __rmqueue_smallest but without the "expand" and needing
> >>>>>> to modify the !page check to also include a check to verify the page
> >>>>>> is not "Offline". As far as the changes to __free_one_page it would be
> >>>>>> a 2 line change to test for the PageType being offline, and if it is
> >>>>>> to call add_to_free_area_tail instead of add_to_free_area.
> >>>>>
> >>>>> As already mentioned, there might be scenarios where the additional
> >>>>> hinting thread might consume too much CPU cycles, especially if there is
> >>>>> little guest activity any you mostly spend time scanning a handful of
> >>>>> free pages and reporting them. I wonder if we can somehow limit the
> >>>>> amount of wakeups/scans for a given period to mitigate this issue.
> >>>>
> >>>> That is why I was talking about breaking nr_free into nr_freed and
> >>>> nr_bound. By doing that I can record the nr_free value to a
> >>>> virtio-balloon specific location at the start of any walk and should
> >>>> know exactly now many pages were freed between that call and the next
> >>>> one. By ordering things such that we place the "Offline" pages on the
> >>>> tail of the list it should make the search quite fast since we would
> >>>> just be always allocating off of the head of the queue until we have
> >>>> hinted everything int he queue. So when we hit the last call to alloc
> >>>> the non-"Offline" pages and shut down our thread we can use the
> >>>> nr_freed value that we recorded to know exactly how many pages have
> >>>> been added that haven't been hinted.
> >>>>
> >>>>> One main issue I see with your approach is that we need quite a lot of
> >>>>> core memory management changes. This is a problem. I wonder if we can
> >>>>> factor out most parts into callbacks.
> >>>>
> >>>> I think that is something we can't get away from. However if we make
> >>>> this generic enough there would likely be others beyond just the
> >>>> virtualization drivers that could make use of the infrastructure. For
> >>>> example being able to track the rate at which the free areas are
> >>>> cycling in and out pages seems like something that would be useful
> >>>> outside of just the virtualization areas.
> >>>
> >>> Might be, but might be the other extreme, people not wanting such
> >>> special cases in core mm. I assume the latter until I see a very clear
> >>> design where such stuff has been properly factored out.
> >>
> >> The only real pain point I am seeing right now is the assumptions
> >> makedumpfile is currently making about how mapcount is being used to
> >> indicate pagetype. If we patch it to fix it most of the other bits are
> >> minor.
> >
> > I'll be curious how splitting etc. will be handled. Especially if you
> > want to set Offline for all affected sub pages.
> >
>
> Answering that myself, I guess you are planning to change the buddy to
> basically copy the offline value to sub-pages when splitting, also
> attaching them to the tail of the list instead of the head.

Yes that was the ultimate plan. I'm still debating the best place to
pull it from though. For now I am looking at just sampling the Offline
value before calling del_page_from_free_area as I had that currently
clearing the Offline flag when I was clearing the buddy. Then I was
just passing that to the expand function and having it set the Offline
flag.

Since expand is only called if the lower orders are empty there isn't
any point in adding to tail since the list is empty so head == tail
anyway.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
@ 2019-04-08 21:56               ` Alexander Duyck
  0 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2019-04-08 21:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Apr 8, 2019 at 2:21 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.04.19 22:51, David Hildenbrand wrote:
> > On 08.04.19 22:10, Alexander Duyck wrote:
> >> On Mon, Apr 8, 2019 at 11:40 AM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>>>>>
> >>>>>> In addition we will need some way to identify which pages have been
> >>>>>> hinted on and which have not. The way I believe easiest to do this
> >>>>>> would be to overload the PageType value so that we could essentially
> >>>>>> have two values for "Buddy" pages. We would have our standard "Buddy"
> >>>>>> pages, and "Buddy" pages that also have the "Offline" value set in the
> >>>>>> PageType field. Tracking the Online vs Offline pages this way would
> >>>>>> actually allow us to do this with almost no overhead as the mapcount
> >>>>>> value is already being reset to clear the "Buddy" flag so adding a
> >>>>>> "Offline" flag to this clearing should come at no additional cost.
> >>>>>
> >>>>> Just nothing here that this will require modifications to kdump
> >>>>> (makedumpfile to be precise and the vmcore information exposed from the
> >>>>> kernel), as kdump only checks for the the actual mapcount value to
> >>>>> detect buddy and offline pages (to exclude them from dumps), they are
> >>>>> not treated as flags.
> >>>>>
> >>>>> For now, any mapcount values are really only separate values, meaning
> >>>>> not the separate bits are of interest, like flags would be. Reusing
> >>>>> other flags would make our life a lot easier. E.g. PG_young or so. But
> >>>>> clearing of these is then the problematic part.
> >>>>>
> >>>>> Of course we could use in the kernel two values, Buddy and BuddyOffline.
> >>>>> But then we have to check for two different values whenever we want to
> >>>>> identify a buddy page in the kernel.
> >>>>
> >>>> Actually this may not be working the way you think it is working.
> >>>
> >>> Trust me, I know how it works. That's why I was giving you the notice.
> >>>
> >>> Read the first paragraph again and ignore the others. I am only
> >>> concerned about makedumpfile that has to be changed.
> >>>
> >>> PAGE_OFFLINE_MAPCOUNT_VALUE
> >>> PAGE_BUDDY_MAPCOUNT_VALUE
> >>>
> >>> Once you find out how these values are used, you should understand what
> >>> has to be changed and where.
> >>
> >> Ugh. Is there an official repo I am supposed to refer to for makedumpfile?
> >>
> >> As far as the changes needed I don't think this would necessitate
> >> additional exports. We could probably just get away with having
> >> makedumpfile generate a new value by simply doing an "&" of the two
> >> values to determine what an offline buddy would be. If need be I can
> >> submit a patch for that. I find it kind of annoying that the kernel is
> >> handling identifying these bits one way, and makedumpfile is doing it
> >> another way. It should have been setup to handle this all the same
> >> way.
> >>
> >>>
> >>>>>>
> >>>>>> Lastly we would need to create a specialized function for allocating
> >>>>>> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
> >>>>>> "Offline" pages. I'm thinking the alloc function it would look
> >>>>>> something like __rmqueue_smallest but without the "expand" and needing
> >>>>>> to modify the !page check to also include a check to verify the page
> >>>>>> is not "Offline". As far as the changes to __free_one_page it would be
> >>>>>> a 2 line change to test for the PageType being offline, and if it is
> >>>>>> to call add_to_free_area_tail instead of add_to_free_area.
> >>>>>
> >>>>> As already mentioned, there might be scenarios where the additional
> >>>>> hinting thread might consume too much CPU cycles, especially if there is
> >>>>> little guest activity any you mostly spend time scanning a handful of
> >>>>> free pages and reporting them. I wonder if we can somehow limit the
> >>>>> amount of wakeups/scans for a given period to mitigate this issue.
> >>>>
> >>>> That is why I was talking about breaking nr_free into nr_freed and
> >>>> nr_bound. By doing that I can record the nr_free value to a
> >>>> virtio-balloon specific location at the start of any walk and should
> >>>> know exactly now many pages were freed between that call and the next
> >>>> one. By ordering things such that we place the "Offline" pages on the
> >>>> tail of the list it should make the search quite fast since we would
> >>>> just be always allocating off of the head of the queue until we have
> >>>> hinted everything int he queue. So when we hit the last call to alloc
> >>>> the non-"Offline" pages and shut down our thread we can use the
> >>>> nr_freed value that we recorded to know exactly how many pages have
> >>>> been added that haven't been hinted.
> >>>>
> >>>>> One main issue I see with your approach is that we need quite a lot of
> >>>>> core memory management changes. This is a problem. I wonder if we can
> >>>>> factor out most parts into callbacks.
> >>>>
> >>>> I think that is something we can't get away from. However if we make
> >>>> this generic enough there would likely be others beyond just the
> >>>> virtualization drivers that could make use of the infrastructure. For
> >>>> example being able to track the rate at which the free areas are
> >>>> cycling in and out pages seems like something that would be useful
> >>>> outside of just the virtualization areas.
> >>>
> >>> Might be, but might be the other extreme, people not wanting such
> >>> special cases in core mm. I assume the latter until I see a very clear
> >>> design where such stuff has been properly factored out.
> >>
> >> The only real pain point I am seeing right now is the assumptions
> >> makedumpfile is currently making about how mapcount is being used to
> >> indicate pagetype. If we patch it to fix it most of the other bits are
> >> minor.
> >
> > I'll be curious how splitting etc. will be handled. Especially if you
> > want to set Offline for all affected sub pages.
> >
>
> Answering that myself, I guess you are planning to change the buddy to
> basically copy the offline value to sub-pages when splitting, also
> attaching them to the tail of the list instead of the head.

Yes that was the ultimate plan. I'm still debating the best place to
pull it from though. For now I am looking at just sampling the Offline
value before calling del_page_from_free_area as I had that currently
clearing the Offline flag when I was clearing the buddy. Then I was
just passing that to the expand function and having it set the Offline
flag.

Since expand is only called if the lower orders are empty there isn't
any point in adding to tail since the list is empty so head == tail
anyway.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-06  0:09 ` Alexander Duyck
                   ` (2 preceding siblings ...)
  (?)
@ 2019-04-09  2:44 ` Michael S. Tsirkin
  2019-04-09  7:05   ` David Hildenbrand
  -1 siblings, 1 reply; 31+ messages in thread
From: Michael S. Tsirkin @ 2019-04-09  2:44 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Hildenbrand, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Fri, Apr 05, 2019 at 05:09:45PM -0700, Alexander Duyck wrote:
> In addition we will need some way to identify which pages have been
> hinted on and which have not. The way I believe easiest to do this
> would be to overload the PageType value so that we could essentially
> have two values for "Buddy" pages. We would have our standard "Buddy"
> pages, and "Buddy" pages that also have the "Offline" value set in the
> PageType field. Tracking the Online vs Offline pages this way would
> actually allow us to do this with almost no overhead as the mapcount
> value is already being reset to clear the "Buddy" flag so adding a
> "Offline" flag to this clearing should come at no additional cost.

It bothers me a bit that this doesn't scale to multiple hint types
if we ever need them. Would it be better to have two
free lists: hinted and non-hinted one?


-- 
MST

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-09  2:44 ` Michael S. Tsirkin
@ 2019-04-09  7:05   ` David Hildenbrand
  0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2019-04-09  7:05 UTC (permalink / raw)
  To: Michael S. Tsirkin, Alexander Duyck
  Cc: Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini,
	lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On 09.04.19 04:44, Michael S. Tsirkin wrote:
> On Fri, Apr 05, 2019 at 05:09:45PM -0700, Alexander Duyck wrote:
>> In addition we will need some way to identify which pages have been
>> hinted on and which have not. The way I believe easiest to do this
>> would be to overload the PageType value so that we could essentially
>> have two values for "Buddy" pages. We would have our standard "Buddy"
>> pages, and "Buddy" pages that also have the "Offline" value set in the
>> PageType field. Tracking the Online vs Offline pages this way would
>> actually allow us to do this with almost no overhead as the mapcount
>> value is already being reset to clear the "Buddy" flag so adding a
>> "Offline" flag to this clearing should come at no additional cost.
> 
> It bothers me a bit that this doesn't scale to multiple hint types
> if we ever need them. Would it be better to have two
> free lists: hinted and non-hinted one?

That would imply having to change all places trying to allocate memory
to have a look at both lists? I think that could be factored out. I
assume keeping track of the amount of pages would feel more naturally
than having split counters for the existing lists.

I think I'd actually prefer something like that, avoiding mixing page
types and working with different types per list.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-06  0:09 ` Alexander Duyck
                   ` (3 preceding siblings ...)
  (?)
@ 2019-04-09  9:20 ` David Hildenbrand
  2019-04-09 13:31   ` Michael S. Tsirkin
  -1 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2019-04-09  9:20 UTC (permalink / raw)
  To: Alexander Duyck, Michael S. Tsirkin, Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

> With that said I have a few ideas that may help to address the 4
> issues called out above. The basic idea is simple. We use a high water
> mark based on zone->free_area[order].nr_free to determine when to wake
> up a thread to start hinting memory out of a given free area. From
> there we allocate non-"Offline" pages from the free area and assign
> them to the hinting queue up to 64MB at a time. Once the hinting is
> completed we mark them "Offline" and add them to the tail of the
> free_area. Doing this we should cycle the non-"Offline" pages slowly
> out of the free_area. In addition the search cost should be minimal
> since all of the "Offline" pages should be aggregated to the tail of
> the free_area so all pages allocated off of the free_area will be the
> non-"Offline" pages until we shift over to them all being "Offline".
> This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
> since the only real consumer of add_to_free_area_tail is
> __free_one_page which uses it to place a page with an order less than
> MAX_ORDER - 2 on the tail of a free_area assuming that it should be
> freeing the buddy of that page shortly. The only other issue with
> adding to tail would be the memory shuffling which was recently added,
> but I don't see that as being something that will be enabled in most
> cases so we could probably just make the features mutually exclusive,
> at least for now.
> 
> So if I am not mistaken this would essentially require a couple
> changes to the mm infrastructure in order for this to work.
> 
> First we would need to split nr_free into two counters, something like
> nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
> value currently used for nr_free. When we pulled the pages for hinting
> we would reduce the nr_freed value and then add back to it when the
> pages are returned. When pages are allocated they would increment the
> nr_bound value. The idea behind this is that we can record nr_free
> when we collect the pages and save it to some local value. This value
> could then tell us how many new pages have been added that have not
> been hinted upon.
> 
> In addition we will need some way to identify which pages have been
> hinted on and which have not. The way I believe easiest to do this
> would be to overload the PageType value so that we could essentially
> have two values for "Buddy" pages. We would have our standard "Buddy"
> pages, and "Buddy" pages that also have the "Offline" value set in the
> PageType field. Tracking the Online vs Offline pages this way would
> actually allow us to do this with almost no overhead as the mapcount
> value is already being reset to clear the "Buddy" flag so adding a
> "Offline" flag to this clearing should come at no additional cost.

BTW I like the idea of allocating pages that have already been hinted as
last "choice", allocating pages that have not been hinted yet first.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-09  9:20 ` David Hildenbrand
@ 2019-04-09 13:31   ` Michael S. Tsirkin
  2019-04-09 13:36     ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Michael S. Tsirkin @ 2019-04-09 13:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Apr 09, 2019 at 11:20:36AM +0200, David Hildenbrand wrote:
> BTW I like the idea of allocating pages that have already been hinted as
> last "choice", allocating pages that have not been hinted yet first.

OK I guess but note this is just a small window during which
not all pages have been hinted.

So if we actually think this has value then we need
to design something that will desist and not drop pages
in steady state too.

-- 
MST

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-09 13:31   ` Michael S. Tsirkin
@ 2019-04-09 13:36     ` David Hildenbrand
  2019-04-09 13:37       ` Michael S. Tsirkin
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2019-04-09 13:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On 09.04.19 15:31, Michael S. Tsirkin wrote:
> On Tue, Apr 09, 2019 at 11:20:36AM +0200, David Hildenbrand wrote:
>> BTW I like the idea of allocating pages that have already been hinted as
>> last "choice", allocating pages that have not been hinted yet first.
> 
> OK I guess but note this is just a small window during which
> not all pages have been hinted.

Yes, good point. It might sound desirable but might be completely
irrelevant in practice.

> 
> So if we actually think this has value then we need
> to design something that will desist and not drop pages
> in steady state too.

By dropping, you mean dropping hints of e.g. MAX_ORDER - 1 or e.g. not
reporting MAX_ORDER - 3?

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-09 13:36     ` David Hildenbrand
@ 2019-04-09 13:37       ` Michael S. Tsirkin
  2019-04-09 13:43         ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Michael S. Tsirkin @ 2019-04-09 13:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Apr 09, 2019 at 03:36:08PM +0200, David Hildenbrand wrote:
> On 09.04.19 15:31, Michael S. Tsirkin wrote:
> > On Tue, Apr 09, 2019 at 11:20:36AM +0200, David Hildenbrand wrote:
> >> BTW I like the idea of allocating pages that have already been hinted as
> >> last "choice", allocating pages that have not been hinted yet first.
> > 
> > OK I guess but note this is just a small window during which
> > not all pages have been hinted.
> 
> Yes, good point. It might sound desirable but might be completely
> irrelevant in practice.
> 
> > 
> > So if we actually think this has value then we need
> > to design something that will desist and not drop pages
> > in steady state too.
> 
> By dropping, you mean dropping hints of e.g. MAX_ORDER - 1 or e.g. not
> reporting MAX_ORDER - 3?

I mean the issue is host unmaps the pages from guest right?  That is
what makes hinted pages slower than non-hinted ones.  If we do not want
that to happen for some pages, then either host can defer acting on the
hint, or we can defer hinting.

> -- 
> 
> Thanks,
> 
> David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-09 13:37       ` Michael S. Tsirkin
@ 2019-04-09 13:43         ` David Hildenbrand
  2019-04-09 14:03           ` Michael S. Tsirkin
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2019-04-09 13:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On 09.04.19 15:37, Michael S. Tsirkin wrote:
> On Tue, Apr 09, 2019 at 03:36:08PM +0200, David Hildenbrand wrote:
>> On 09.04.19 15:31, Michael S. Tsirkin wrote:
>>> On Tue, Apr 09, 2019 at 11:20:36AM +0200, David Hildenbrand wrote:
>>>> BTW I like the idea of allocating pages that have already been hinted as
>>>> last "choice", allocating pages that have not been hinted yet first.
>>>
>>> OK I guess but note this is just a small window during which
>>> not all pages have been hinted.
>>
>> Yes, good point. It might sound desirable but might be completely
>> irrelevant in practice.
>>
>>>
>>> So if we actually think this has value then we need
>>> to design something that will desist and not drop pages
>>> in steady state too.
>>
>> By dropping, you mean dropping hints of e.g. MAX_ORDER - 1 or e.g. not
>> reporting MAX_ORDER - 3?
> 
> I mean the issue is host unmaps the pages from guest right?  That is
> what makes hinted pages slower than non-hinted ones.  If we do not want
> that to happen for some pages, then either host can defer acting on the
> hint, or we can defer hinting.

Ah right, I think what Alex mentioned is that pages processed in the
hypervisor via MADVISE_FREE will be set RO, so the kernel can detect if
they will actually be used again, resulting int

1. A pagefault if written and the page(s) have not been reused by the host
2. A pagefault if read/written if the page(s) have been reused by the host

Now, assuming hinting is fast,  most pages will be hinted right away and
therefore result in pagefaults. I think this is what you meant.

Deferring processing in the hypervisor cannot be done after a request
has been marked as processed and handed back to the guest. (otherwise
pages might already get reused)

So pages would have to "age" in the guest instead before they might be
worth hinting. Marking pages as "Offline" alone won't help. Agreed.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Thoughts on simple scanner approach for free page hinting
  2019-04-09 13:43         ` David Hildenbrand
@ 2019-04-09 14:03           ` Michael S. Tsirkin
  0 siblings, 0 replies; 31+ messages in thread
From: Michael S. Tsirkin @ 2019-04-09 14:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Apr 09, 2019 at 03:43:58PM +0200, David Hildenbrand wrote:
> On 09.04.19 15:37, Michael S. Tsirkin wrote:
> > On Tue, Apr 09, 2019 at 03:36:08PM +0200, David Hildenbrand wrote:
> >> On 09.04.19 15:31, Michael S. Tsirkin wrote:
> >>> On Tue, Apr 09, 2019 at 11:20:36AM +0200, David Hildenbrand wrote:
> >>>> BTW I like the idea of allocating pages that have already been hinted as
> >>>> last "choice", allocating pages that have not been hinted yet first.
> >>>
> >>> OK I guess but note this is just a small window during which
> >>> not all pages have been hinted.
> >>
> >> Yes, good point. It might sound desirable but might be completely
> >> irrelevant in practice.
> >>
> >>>
> >>> So if we actually think this has value then we need
> >>> to design something that will desist and not drop pages
> >>> in steady state too.
> >>
> >> By dropping, you mean dropping hints of e.g. MAX_ORDER - 1 or e.g. not
> >> reporting MAX_ORDER - 3?
> > 
> > I mean the issue is host unmaps the pages from guest right?  That is
> > what makes hinted pages slower than non-hinted ones.  If we do not want
> > that to happen for some pages, then either host can defer acting on the
> > hint, or we can defer hinting.
> 
> Ah right, I think what Alex mentioned is that pages processed in the
> hypervisor via MADVISE_FREE will be set RO, so the kernel can detect if
> they will actually be used again, resulting int
> 
> 1. A pagefault if written and the page(s) have not been reused by the host
> 2. A pagefault if read/written if the page(s) have been reused by the host
> 
> Now, assuming hinting is fast,  most pages will be hinted right away and
> therefore result in pagefaults. I think this is what you meant.
> 
> Deferring processing in the hypervisor cannot be done after a request
> has been marked as processed and handed back to the guest. (otherwise
> pages might already get reused)
> 
> So pages would have to "age" in the guest instead before they might be
> worth hinting. Marking pages as "Offline" alone won't help. Agreed.

Right. I don't see this as a blocker though. We can think about
strategies for addressing this after we have basic hinting
in place.

> -- 
> 
> Thanks,
> 
> David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2019-04-09 14:03 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-06  0:09 Thoughts on simple scanner approach for free page hinting Alexander Duyck
2019-04-06  0:09 ` Alexander Duyck
2019-04-08 12:24 ` Nitesh Narayan Lal
2019-04-08 15:18   ` Alexander Duyck
2019-04-08 15:18     ` Alexander Duyck
2019-04-08 15:41     ` Michael S. Tsirkin
2019-04-08 16:36 ` David Hildenbrand
2019-04-08 18:09   ` Nitesh Narayan Lal
2019-04-08 18:19     ` Michael S. Tsirkin
2019-04-08 18:29       ` Alexander Duyck
2019-04-08 18:29         ` Alexander Duyck
2019-04-08 18:58         ` David Hildenbrand
2019-04-08 18:27     ` David Hildenbrand
2019-04-08 18:18   ` Alexander Duyck
2019-04-08 18:18     ` Alexander Duyck
2019-04-08 18:40     ` David Hildenbrand
2019-04-08 20:10       ` Alexander Duyck
2019-04-08 20:10         ` Alexander Duyck
2019-04-08 20:47         ` David Hildenbrand
2019-04-08 20:51         ` David Hildenbrand
2019-04-08 21:20           ` David Hildenbrand
2019-04-08 21:56             ` Alexander Duyck
2019-04-08 21:56               ` Alexander Duyck
2019-04-09  2:44 ` Michael S. Tsirkin
2019-04-09  7:05   ` David Hildenbrand
2019-04-09  9:20 ` David Hildenbrand
2019-04-09 13:31   ` Michael S. Tsirkin
2019-04-09 13:36     ` David Hildenbrand
2019-04-09 13:37       ` Michael S. Tsirkin
2019-04-09 13:43         ` David Hildenbrand
2019-04-09 14:03           ` Michael S. Tsirkin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.