From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-1.0 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8AFCDC10F13 for ; Mon, 8 Apr 2019 18:27:55 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 4BD1220830 for ; Mon, 8 Apr 2019 18:27:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729212AbfDHS1y (ORCPT ); Mon, 8 Apr 2019 14:27:54 -0400 Received: from mx1.redhat.com ([209.132.183.28]:42656 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728741AbfDHS1x (ORCPT ); Mon, 8 Apr 2019 14:27:53 -0400 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id BBEA6821F5; Mon, 8 Apr 2019 18:27:52 +0000 (UTC) Received: from [10.36.116.113] (ovpn-116-113.ams2.redhat.com [10.36.116.113]) by smtp.corp.redhat.com (Postfix) with ESMTP id 5D0491001E92; Mon, 8 Apr 2019 18:27:42 +0000 (UTC) Subject: Re: Thoughts on simple scanner approach for free page hinting To: Nitesh Narayan Lal Cc: kvm list , LKML , linux-mm , Paolo Bonzini , lcapitulino@redhat.com, pagupta@redhat.com, wei.w.wang@intel.com, Yang Zhang , Rik van Riel , dodgen@google.com, Konrad Rzeszutek Wilk , dhildenb@redhat.com, Andrea Arcangeli , Alexander Duyck , "Michael S. Tsirkin" References: <01d5f4e8-742b-33f5-6d91-0c7c396d1cfc@redhat.com> <6f097f31-abc7-f56c-199c-dc167331f6b9@redhat.com> From: David Hildenbrand Openpgp: preference=signencrypt Autocrypt: addr=david@redhat.com; prefer-encrypt=mutual; keydata= xsFNBFXLn5EBEAC+zYvAFJxCBY9Tr1xZgcESmxVNI/0ffzE/ZQOiHJl6mGkmA1R7/uUpiCjJ dBrn+lhhOYjjNefFQou6478faXE6o2AhmebqT4KiQoUQFV4R7y1KMEKoSyy8hQaK1umALTdL QZLQMzNE74ap+GDK0wnacPQFpcG1AE9RMq3aeErY5tujekBS32jfC/7AnH7I0v1v1TbbK3Gp XNeiN4QroO+5qaSr0ID2sz5jtBLRb15RMre27E1ImpaIv2Jw8NJgW0k/D1RyKCwaTsgRdwuK Kx/Y91XuSBdz0uOyU/S8kM1+ag0wvsGlpBVxRR/xw/E8M7TEwuCZQArqqTCmkG6HGcXFT0V9 PXFNNgV5jXMQRwU0O/ztJIQqsE5LsUomE//bLwzj9IVsaQpKDqW6TAPjcdBDPLHvriq7kGjt WhVhdl0qEYB8lkBEU7V2Yb+SYhmhpDrti9Fq1EsmhiHSkxJcGREoMK/63r9WLZYI3+4W2rAc UucZa4OT27U5ZISjNg3Ev0rxU5UH2/pT4wJCfxwocmqaRr6UYmrtZmND89X0KigoFD/XSeVv jwBRNjPAubK9/k5NoRrYqztM9W6sJqrH8+UWZ1Idd/DdmogJh0gNC0+N42Za9yBRURfIdKSb B3JfpUqcWwE7vUaYrHG1nw54pLUoPG6sAA7Mehl3nd4pZUALHwARAQABzSREYXZpZCBIaWxk ZW5icmFuZCA8ZGF2aWRAcmVkaGF0LmNvbT7CwX4EEwECACgFAljj9eoCGwMFCQlmAYAGCwkI BwMCBhUIAgkKCwQWAgMBAh4BAheAAAoJEE3eEPcA/4Na5IIP/3T/FIQMxIfNzZshIq687qgG 8UbspuE/YSUDdv7r5szYTK6KPTlqN8NAcSfheywbuYD9A4ZeSBWD3/NAVUdrCaRP2IvFyELj xoMvfJccbq45BxzgEspg/bVahNbyuBpLBVjVWwRtFCUEXkyazksSv8pdTMAs9IucChvFmmq3 jJ2vlaz9lYt/lxN246fIVceckPMiUveimngvXZw21VOAhfQ+/sofXF8JCFv2mFcBDoa7eYob s0FLpmqFaeNRHAlzMWgSsP80qx5nWWEvRLdKWi533N2vC/EyunN3HcBwVrXH4hxRBMco3jvM m8VKLKao9wKj82qSivUnkPIwsAGNPdFoPbgghCQiBjBe6A75Z2xHFrzo7t1jg7nQfIyNC7ez MZBJ59sqA9EDMEJPlLNIeJmqslXPjmMFnE7Mby/+335WJYDulsRybN+W5rLT5aMvhC6x6POK z55fMNKrMASCzBJum2Fwjf/VnuGRYkhKCqqZ8gJ3OvmR50tInDV2jZ1DQgc3i550T5JDpToh dPBxZocIhzg+MBSRDXcJmHOx/7nQm3iQ6iLuwmXsRC6f5FbFefk9EjuTKcLMvBsEx+2DEx0E UnmJ4hVg7u1PQ+2Oy+Lh/opK/BDiqlQ8Pz2jiXv5xkECvr/3Sv59hlOCZMOaiLTTjtOIU7Tq 7ut6OL64oAq+zsFNBFXLn5EBEADn1959INH2cwYJv0tsxf5MUCghCj/CA/lc/LMthqQ773ga uB9mN+F1rE9cyyXb6jyOGn+GUjMbnq1o121Vm0+neKHUCBtHyseBfDXHA6m4B3mUTWo13nid 0e4AM71r0DS8+KYh6zvweLX/LL5kQS9GQeT+QNroXcC1NzWbitts6TZ+IrPOwT1hfB4WNC+X 2n4AzDqp3+ILiVST2DT4VBc11Gz6jijpC/KI5Al8ZDhRwG47LUiuQmt3yqrmN63V9wzaPhC+ xbwIsNZlLUvuRnmBPkTJwwrFRZvwu5GPHNndBjVpAfaSTOfppyKBTccu2AXJXWAE1Xjh6GOC 8mlFjZwLxWFqdPHR1n2aPVgoiTLk34LR/bXO+e0GpzFXT7enwyvFFFyAS0Nk1q/7EChPcbRb hJqEBpRNZemxmg55zC3GLvgLKd5A09MOM2BrMea+l0FUR+PuTenh2YmnmLRTro6eZ/qYwWkC u8FFIw4pT0OUDMyLgi+GI1aMpVogTZJ70FgV0pUAlpmrzk/bLbRkF3TwgucpyPtcpmQtTkWS gDS50QG9DR/1As3LLLcNkwJBZzBG6PWbvcOyrwMQUF1nl4SSPV0LLH63+BrrHasfJzxKXzqg rW28CTAE2x8qi7e/6M/+XXhrsMYG+uaViM7n2je3qKe7ofum3s4vq7oFCPsOgwARAQABwsFl BBgBAgAPBQJVy5+RAhsMBQkJZgGAAAoJEE3eEPcA/4NagOsP/jPoIBb/iXVbM+fmSHOjEshl KMwEl/m5iLj3iHnHPVLBUWrXPdS7iQijJA/VLxjnFknhaS60hkUNWexDMxVVP/6lbOrs4bDZ NEWDMktAeqJaFtxackPszlcpRVkAs6Msn9tu8hlvB517pyUgvuD7ZS9gGOMmYwFQDyytpepo YApVV00P0u3AaE0Cj/o71STqGJKZxcVhPaZ+LR+UCBZOyKfEyq+ZN311VpOJZ1IvTExf+S/5 lqnciDtbO3I4Wq0ArLX1gs1q1XlXLaVaA3yVqeC8E7kOchDNinD3hJS4OX0e1gdsx/e6COvy qNg5aL5n0Kl4fcVqM0LdIhsubVs4eiNCa5XMSYpXmVi3HAuFyg9dN+x8thSwI836FoMASwOl C7tHsTjnSGufB+D7F7ZBT61BffNBBIm1KdMxcxqLUVXpBQHHlGkbwI+3Ye+nE6HmZH7IwLwV W+Ajl7oYF+jeKaH4DZFtgLYGLtZ1LDwKPjX7VAsa4Yx7S5+EBAaZGxK510MjIx6SGrZWBrrV TEvdV00F2MnQoeXKzD7O4WFbL55hhyGgfWTHwZ457iN9SgYi1JLPqWkZB0JRXIEtjd4JEQcx +8Umfre0Xt4713VxMygW0PnQt5aSQdMD58jHFxTk092mU+yIHj5LeYgvwSgZN4airXk5yRXl SE+xAvmumFBY Organization: Red Hat GmbH Message-ID: <23bf2735-c61e-9ade-30f9-4d314617283e@redhat.com> Date: Mon, 8 Apr 2019 20:27:41 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0 MIME-Version: 1.0 In-Reply-To: <6f097f31-abc7-f56c-199c-dc167331f6b9@redhat.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.28]); Mon, 08 Apr 2019 18:27:52 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 08.04.19 20:09, Nitesh Narayan Lal wrote: > On 4/8/19 12:36 PM, David Hildenbrand wrote: >> On 06.04.19 02:09, Alexander Duyck wrote: >>> So I am starting this thread as a spot to collect my thoughts on the >>> current guest free page hinting design as well as point out a few >>> possible things we could do to improve upon it. >>> >>> 1. The current design isn't likely going to scale well to multiple >>> VCPUs. The issue specifically is that the zone lock must be held to >>> pull pages off of the free list and to place them back there once they >>> have been hinted upon. As a result it would likely make sense to try >>> to limit ourselves to only having one thread performing the actual >>> hinting so that we can avoid running into issues with lock contention >>> between threads. >> Makes sense. >> >>> 2. There are currently concerns about the hinting triggering false OOM >>> situations if too much memory is isolated while it is being hinted. My >>> thought on this is to simply avoid the issue by only hint on a limited >>> amount of memory at a time. Something like 64MB should be a workable >>> limit without introducing much in the way of regressions. However as a >>> result of this we can easily be overrun while waiting on the host to >>> process the hinting request. As such we will probably need a way to >>> walk the free list and free pages after they have been freed instead >>> of trying to do it as they are freed. >> We will need such a way in case we care about dropped hinting requests, yes. >> >>> 3. Even with the current buffering which is still on the larger side >>> it is possible to overrun the hinting limits if something causes the >>> host to stall and a large swath of memory is released. As such we are >>> still going to need some sort of scanning mechanism or will have to >>> live with not providing accurate hints. >> Yes, usually if there is a lot of guest activity, you could however >> assume that free pages might get reused either way soon. Of course, >> special cases are "freeing XGB and being idle afterwards". >> >>> 4. In my opinion, the code overall is likely more complex then it >>> needs to be. We currently have 2 allocations that have to occur every >>> time we provide a hint all the way to the host, ideally we should not >>> need to allocate more memory to provide hints. We should be able to >>> hold the memory use for a memory hint device constant and simply map >>> the page address and size to the descriptors of the virtio-ring. >> I don't think the two allocations are that complex. The only thing I >> consider complex is isolation a lot of pages from different zones etc. >> Two allocations, nobody really cares about that. Of course, the fact >> that we have to allocate memory from the VCPUs where we currently freed >> a page is not optimal. I consider that rather a problem/complex. >> >> Especially you have a point regarding scalability and multiple VCPUs. >> >>> With that said I have a few ideas that may help to address the 4 >>> issues called out above. The basic idea is simple. We use a high water >>> mark based on zone->free_area[order].nr_free to determine when to wake >>> up a thread to start hinting memory out of a given free area. From >>> there we allocate non-"Offline" pages from the free area and assign >>> them to the hinting queue up to 64MB at a time. Once the hinting is >>> completed we mark them "Offline" and add them to the tail of the >>> free_area. Doing this we should cycle the non-"Offline" pages slowly >>> out of the free_area. In addition the search cost should be minimal >>> since all of the "Offline" pages should be aggregated to the tail of >>> the free_area so all pages allocated off of the free_area will be the >>> non-"Offline" pages until we shift over to them all being "Offline". >>> This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages >>> since the only real consumer of add_to_free_area_tail is >>> __free_one_page which uses it to place a page with an order less than >>> MAX_ORDER - 2 on the tail of a free_area assuming that it should be >>> freeing the buddy of that page shortly. The only other issue with >>> adding to tail would be the memory shuffling which was recently added, >>> but I don't see that as being something that will be enabled in most >>> cases so we could probably just make the features mutually exclusive, >>> at least for now. >>> >>> So if I am not mistaken this would essentially require a couple >>> changes to the mm infrastructure in order for this to work. >>> >>> First we would need to split nr_free into two counters, something like >>> nr_freed and nr_bound. You could use nr_freed - nr_bound to get the >>> value currently used for nr_free. When we pulled the pages for hinting >>> we would reduce the nr_freed value and then add back to it when the >>> pages are returned. When pages are allocated they would increment the >>> nr_bound value. The idea behind this is that we can record nr_free >>> when we collect the pages and save it to some local value. This value >>> could then tell us how many new pages have been added that have not >>> been hinted upon. >> I can imagine that quite some people will have problems with such >> "virtualization specific changes" splattered around core memory >> management. Would there be a way to manage this data at a different >> place, out of core-mm and somehow work on it via callbacks? >> >>> In addition we will need some way to identify which pages have been >>> hinted on and which have not. The way I believe easiest to do this >>> would be to overload the PageType value so that we could essentially >>> have two values for "Buddy" pages. We would have our standard "Buddy" >>> pages, and "Buddy" pages that also have the "Offline" value set in the >>> PageType field. Tracking the Online vs Offline pages this way would >>> actually allow us to do this with almost no overhead as the mapcount >>> value is already being reset to clear the "Buddy" flag so adding a >>> "Offline" flag to this clearing should come at no additional cost. >> Just nothing here that this will require modifications to kdump >> (makedumpfile to be precise and the vmcore information exposed from the >> kernel), as kdump only checks for the the actual mapcount value to >> detect buddy and offline pages (to exclude them from dumps), they are >> not treated as flags. >> >> For now, any mapcount values are really only separate values, meaning >> not the separate bits are of interest, like flags would be. Reusing >> other flags would make our life a lot easier. E.g. PG_young or so. But >> clearing of these is then the problematic part. >> >> Of course we could use in the kernel two values, Buddy and BuddyOffline. >> But then we have to check for two different values whenever we want to >> identify a buddy page in the kernel. >> >>> Lastly we would need to create a specialized function for allocating >>> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue >>> "Offline" pages. I'm thinking the alloc function it would look >>> something like __rmqueue_smallest but without the "expand" and needing >>> to modify the !page check to also include a check to verify the page >>> is not "Offline". As far as the changes to __free_one_page it would be >>> a 2 line change to test for the PageType being offline, and if it is >>> to call add_to_free_area_tail instead of add_to_free_area. >> As already mentioned, there might be scenarios where the additional >> hinting thread might consume too much CPU cycles, especially if there is >> little guest activity any you mostly spend time scanning a handful of >> free pages and reporting them. I wonder if we can somehow limit the >> amount of wakeups/scans for a given period to mitigate this issue. >> >> One main issue I see with your approach is that we need quite a lot of >> core memory management changes. This is a problem. I wonder if we can >> factor out most parts into callbacks. >> >> E.g. in order to detect where to queue a certain page (front/tail), call >> a callback if one is registered, mark/check pages in a core-mm unknown >> way as offline etc. >> >> I still wonder if there could be an easier way to combine recording of >> hints and one hinting thread, essentially avoiding scanning and some of >> the required core-mm changes. > In order to resolve the scalability issues associated with my > patch-series without compromising with free memory hints, I may explore > the idea described below: > - Use xbitmap (if possible - earlier suggested by Rik and Wei) > corresponding to each zone on a granularity of MAX_ORDER - 2, to track > the freed PFN's. > - Define and use counters corresponding to each zone to monitor the > amount of memory freed. > - As soon as the 64MB free memory threshold is hit wake up the kernel > thread which will scan this xbitmap and try to isolate the pages and > clear the corresponding bits. (We still have to acquire zone lock to > protect the respective xbitmap) > - Report the isolated pages back to the host in a synchronous manner. > I still have to work on several details of this idea including xbitmap, > but first would like to hear any suggestions/thoughts. As discussed offline, I think this is the right approach to follow. a) We have an asynchronous hinting thread as suggested by Alex. b) We record hints and only have to scan a xbitmap, not some random lists. c) We have minimal core mm changes Best of both worlds and sounds simple. -- Thanks, David / dhildenb