Re: Thoughts on simple scanner approach for free page hinting

From: David Hildenbrand <david@redhat.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	Nitesh Narayan Lal <nitesh@redhat.com>,
	kvm list <kvm@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	lcapitulino@redhat.com, pagupta@redhat.com, wei.w.wang@intel.com,
	Yang Zhang <yang.zhang.wz@gmail.com>,
	Rik van Riel <riel@surriel.com>,
	dodgen@google.com, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	dhildenb@redhat.com, Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: Thoughts on simple scanner approach for free page hinting
Date: Mon, 8 Apr 2019 20:40:11 +0200
Message-ID: <ef4c219f-6686-f5f6-fd22-d1da0b1720f3@redhat.com>
In-Reply-To: <CAKgT0UfbVS2iupbf4Dfp91PAdgHNHwZ-RNyL=mcPsS_68Ly_9Q@mail.gmail.com>

>>>
>>> In addition we will need some way to identify which pages have been
>>> hinted on and which have not. The way I believe easiest to do this
>>> would be to overload the PageType value so that we could essentially
>>> have two values for "Buddy" pages. We would have our standard "Buddy"
>>> pages, and "Buddy" pages that also have the "Offline" value set in the
>>> PageType field. Tracking the Online vs Offline pages this way would
>>> actually allow us to do this with almost no overhead as the mapcount
>>> value is already being reset to clear the "Buddy" flag so adding a
>>> "Offline" flag to this clearing should come at no additional cost.
>>
>> Just noting here that this will require modifications to kdump
>> (makedumpfile, to be precise, and the vmcore information exposed by the
>> kernel): kdump only checks the actual mapcount value to detect buddy
>> and offline pages (to exclude them from dumps); the values are not
>> treated as flags.
>>
>> For now, the mapcount values are really only distinct values, meaning
>> the individual bits are not of interest, as they would be with flags.
>> Reusing other flags would make our life a lot easier, e.g. PG_young or
>> so. But clearing these is then the problematic part.
>>
>> Of course we could use in the kernel two values, Buddy and BuddyOffline.
>> But then we have to check for two different values whenever we want to
>> identify a buddy page in the kernel.
> 
> Actually this may not be working the way you think it is working.

Trust me, I know how it works. That's why I was giving you the notice.

Read the first paragraph again and ignore the others. My only concern
is that makedumpfile has to be changed.

PAGE_OFFLINE_MAPCOUNT_VALUE
PAGE_BUDDY_MAPCOUNT_VALUE

Once you find out how these values are used, you should understand what
has to be changed and where.
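
To illustrate the difference between comparing whole values and testing
flags, here is a minimal sketch. The bit assignments and all function
names are assumptions made up for illustration; they are not taken from
the kernel or from makedumpfile. It only assumes that a page type is
recorded by clearing a bit in an otherwise all-ones 32-bit field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments, chosen only for this illustration. */
#define PG_BUDDY_BIT    0x00000080u
#define PG_OFFLINE_BIT  0x00000100u

/* A type is "set" by clearing its bit in an all-ones field. */
#define BUDDY_VALUE          (~PG_BUDDY_BIT)
#define OFFLINE_VALUE        (~PG_OFFLINE_BIT)
#define BUDDY_OFFLINE_VALUE  (~(PG_BUDDY_BIT | PG_OFFLINE_BIT))

/* Dump-tool style check: compares the whole value against one constant. */
static bool dump_is_buddy_exact(uint32_t type_field)
{
	return type_field == BUDDY_VALUE;
}

/* Flag style check: only looks at the single bit of interest. */
static bool is_buddy_flag(uint32_t type_field)
{
	return (type_field & PG_BUDDY_BIT) == 0;
}
```

With these assumptions, a page carrying both types has the value
BUDDY_OFFLINE_VALUE. is_buddy_flag() still reports it as a buddy page,
but dump_is_buddy_exact() does not, so a tool doing the exact comparison
would no longer exclude such a page from the dump.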

>>>
>>> Lastly we would need to create a specialized function for allocating
>>> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
>>> "Offline" pages. I'm thinking the alloc function would look
>>> something like __rmqueue_smallest but without the "expand" and needing
>>> to modify the !page check to also include a check to verify the page
>>> is not "Offline". As far as the changes to __free_one_page it would be
>>> a 2 line change to test for the PageType being offline, and if it is
>>> to call add_to_free_area_tail instead of add_to_free_area.
>>
>> As already mentioned, there might be scenarios where the additional
>> hinting thread might consume too many CPU cycles, especially if there is
>> little guest activity and you mostly spend time scanning a handful of
>> free pages and reporting them. I wonder if we can somehow limit the
>> amount of wakeups/scans for a given period to mitigate this issue.
> 
> That is why I was talking about breaking nr_free into nr_freed and
> nr_bound. By doing that I can record the nr_free value to a
> virtio-balloon specific location at the start of any walk and should
> know exactly how many pages were freed between that call and the next
> one. By ordering things such that we place the "Offline" pages on the
> tail of the list it should make the search quite fast since we would
> just be always allocating off of the head of the queue until we have
> hinted everything in the queue. So when we hit the last call to alloc
> the non-"Offline" pages and shut down our thread we can use the
> nr_freed value that we recorded to know exactly how many pages have
> been added that haven't been hinted.
> 
>> One main issue I see with your approach is that we need quite a lot of
>> core memory management changes. This is a problem. I wonder if we can
>> factor out most parts into callbacks.
> 
> I think that is something we can't get away from. However if we make
> this generic enough there would likely be others beyond just the
> virtualization drivers that could make use of the infrastructure. For
> example being able to track the rate at which the free areas are
> cycling in and out pages seems like something that would be useful
> outside of just the virtualization areas.

Might be, but it might just as well be the other extreme: people not
wanting such special cases in core mm. I assume the latter until I see a
very clear design where such stuff has been properly factored out.

> 
>> E.g. in order to detect where to queue a certain page (front/tail), call
>> a callback if one is registered, mark/check pages in a core-mm unknown
>> way as offline etc.
>>
>> I still wonder if there could be an easier way to combine recording of
>> hints and one hinting thread, essentially avoiding scanning and some of
>> the required core-mm changes.
> 
> The concern I have with trying to avoid the scanning by tracking is
> that if you fall behind it becomes something where just tracking the
> metadata for the page hints would start to become expensive.

Depends. If it is mostly only marking a bit in a bitmap, it should in
general not be too much of an issue. As usual, the data structure used is
the important bit.
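
As a rough sketch of what "marking a bit in a bitmap" could look like,
assuming free memory is tracked in fixed-size chunks identified by an
index: all names and sizes below are hypothetical, and the code is plain
single-threaded C with no locking or atomics, which a real
implementation would need.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sizes, chosen only for this illustration. */
#define HINT_NR_CHUNKS  1024u
#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)
#define HINT_NR_WORDS   ((HINT_NR_CHUNKS + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct hint_bitmap {
	unsigned long words[HINT_NR_WORDS];
};

/* Free path: recording a hint is a single bit set. */
static void hint_record_free(struct hint_bitmap *bm, size_t chunk)
{
	if (chunk >= HINT_NR_CHUNKS)
		return;
	bm->words[chunk / BITS_PER_WORD] |= 1UL << (chunk % BITS_PER_WORD);
}

/* Clear one bit and return whether it was set. */
static bool hint_test_and_clear(struct hint_bitmap *bm, size_t chunk)
{
	unsigned long mask;
	bool was_set;

	if (chunk >= HINT_NR_CHUNKS)
		return false;
	mask = 1UL << (chunk % BITS_PER_WORD);
	was_set = (bm->words[chunk / BITS_PER_WORD] & mask) != 0;
	bm->words[chunk / BITS_PER_WORD] &= ~mask;
	return was_set;
}

/* Hinting thread: report and clear every recorded chunk, skipping empty words. */
static size_t hint_drain(struct hint_bitmap *bm, void (*report)(size_t chunk))
{
	size_t reported = 0;

	for (size_t w = 0; w < HINT_NR_WORDS; w++) {
		if (bm->words[w] == 0)
			continue;
		for (size_t b = 0; b < BITS_PER_WORD; b++) {
			size_t chunk = w * BITS_PER_WORD + b;

			if (hint_test_and_clear(bm, chunk)) {
				if (report)
					report(chunk);
				reported++;
			}
		}
	}
	return reported;
}
```

The metadata stays at one bit per chunk no matter how far the hinting
thread falls behind, and freeing the same chunk twice before a drain
costs nothing extra.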

-- 

Thanks,

David / dhildenb