On 4/5/19 8:09 PM, Alexander Duyck wrote:
> So I am starting this thread as a spot to collect my thoughts on the
> current guest free page hinting design as well as point out a few
> possible things we could do to improve upon it.
>
> 1. The current design isn't likely going to scale well to multiple
> VCPUs. The issue specifically is that the zone lock must be held to
> pull pages off of the free list and to place them back there once they
> have been hinted upon. As a result it would likely make sense to try
> to limit ourselves to only having one thread performing the actual
> hinting so that we can avoid running into issues with lock contention
> between threads.
>
> 2. There are currently concerns about the hinting triggering false OOM
> situations if too much memory is isolated while it is being hinted. My
> thought on this is to simply avoid the issue by only hint on a limited
> amount of memory at a time. Something like 64MB should be a workable
> limit without introducing much in the way of regressions. However as a
> result of this we can easily be overrun while waiting on the host to
> process the hinting request. As such we will probably need a way to
> walk the free list and free pages after they have been freed instead
> of trying to do it as they are freed.
>
> 3. Even with the current buffering which is still on the larger side
> it is possible to overrun the hinting limits if something causes the
> host to stall and a large swath of memory is released. As such we are
> still going to need some sort of scanning mechanism or will have to
> live with not providing accurate hints.
>
> 4. In my opinion, the code overall is likely more complex then it
> needs to be. We currently have 2 allocations that have to occur every
> time we provide a hint all the way to the host, ideally we should not
> need to allocate more memory to provide hints. We should be able to
> hold the memory use for a memory hint device constant and simply map
> the page address and size to the descriptors of the virtio-ring.
>
> With that said I have a few ideas that may help to address the 4
> issues called out above. The basic idea is simple. We use a high water
> mark based on zone->free_area[order].nr_free to determine when to wake
> up a thread to start hinting memory out of a given free area. From
> there we allocate non-"Offline" pages from the free area and assign
> them to the hinting queue up to 64MB at a time. Once the hinting is
> completed we mark them "Offline" and add them to the tail of the
> free_area. Doing this we should cycle the non-"Offline" pages slowly
> out of the free_area. 
any ideas about how are you planning to control this?
> In addition the search cost should be minimal
> since all of the "Offline" pages should be aggregated to the tail of
> the free_area so all pages allocated off of the free_area will be the
> non-"Offline" pages until we shift over to them all being "Offline".
> This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
> since the only real consumer of add_to_free_area_tail is
> __free_one_page which uses it to place a page with an order less than
> MAX_ORDER - 2 on the tail of a free_area assuming that it should be
> freeing the buddy of that page shortly. The only other issue with
> adding to tail would be the memory shuffling which was recently added,
> but I don't see that as being something that will be enabled in most
> cases so we could probably just make the features mutually exclusive,
> at least for now.
>
> So if I am not mistaken this would essentially require a couple
> changes to the mm infrastructure in order for this to work.
>
> First we would need to split nr_free into two counters, something like
> nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
> value currently used for nr_free. When we pulled the pages for hinting
> we would reduce the nr_freed value and then add back to it when the
> pages are returned. When pages are allocated they would increment the
> nr_bound value. The idea behind this is that we can record nr_free
> when we collect the pages and save it to some local value. This value
> could then tell us how many new pages have been added that have not
> been hinted upon.
>
> In addition we will need some way to identify which pages have been
> hinted on and which have not. The way I believe easiest to do this
> would be to overload the PageType value so that we could essentially
> have two values for "Buddy" pages. We would have our standard "Buddy"
> pages, and "Buddy" pages that also have the "Offline" value set in the
> PageType field. Tracking the Online vs Offline pages this way would
> actually allow us to do this with almost no overhead as the mapcount
> value is already being reset to clear the "Buddy" flag so adding a
> "Offline" flag to this clearing should come at no additional cost.
>
> Lastly we would need to create a specialized function for allocating
> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
> "Offline" pages. I'm thinking the alloc function it would look
> something like __rmqueue_smallest but without the "expand" and needing
> to modify the !page check to also include a check to verify the page
> is not "Offline". As far as the changes to __free_one_page it would be
> a 2 line change to test for the PageType being offline, and if it is
> to call add_to_free_area_tail instead of add_to_free_area.
Is it possible that once the pages are offline, there is a large
allocation request in the guest needing those offline pages as well?
>
> Anyway this email ended up being pretty massive by the time I was
> done. Feel free to reply to parts of it and we can break it out into
> separate threads of discussion as necessary. I will start working on
> coding some parts of this next week.
>
> Thanks.
>
> - Alex
-- 
Regards
Nitesh