On 2/8/19 12:58 PM, Alexander Duyck wrote:
> On Thu, Feb 7, 2019 at 12:50 PM Nitesh Narayan Lal wrote:
>>
>> On 2/7/19 12:43 PM, Alexander Duyck wrote:
>>> On Tue, Feb 5, 2019 at 3:21 PM Michael S. Tsirkin wrote:
>>>> On Tue, Feb 05, 2019 at 04:54:03PM -0500, Nitesh Narayan Lal wrote:
>>>>> On 2/5/19 3:45 PM, Michael S. Tsirkin wrote:
>>>>>> On Mon, Feb 04, 2019 at 03:18:53PM -0500, Nitesh Narayan Lal wrote:
>>>>>>> This patch enables the kernel to scan the per-cpu array and
>>>>>>> compress it by removing the repetitive/re-allocated pages.
>>>>>>> Once the per-cpu array is completely filled with pages in the
>>>>>>> buddy, it wakes up the per-cpu kernel thread, which re-scans
>>>>>>> the entire array while acquiring the zone lock corresponding
>>>>>>> to the page being scanned. If the page is still free and
>>>>>>> present in the buddy, it tries to isolate the page and add it
>>>>>>> to another per-cpu array.
>>>>>>>
>>>>>>> Once this scanning process is complete, and if any isolated
>>>>>>> pages were added to the new per-cpu array, the kernel thread
>>>>>>> invokes hyperlist_ready().
>>>>>>>
>>>>>>> In hyperlist_ready() a hypercall is made to report these pages
>>>>>>> to the host using the virtio-balloon framework. In order to do
>>>>>>> so, another virtqueue 'hinting_vq' is added to the balloon
>>>>>>> framework. As the host frees all the reported pages, the
>>>>>>> kernel thread returns them back to the buddy.
>>>>>>>
>>>>>>> Signed-off-by: Nitesh Narayan Lal
>>>>>> This looks kind of like what early iterations of Wei's patches did.
>>>>>>
>>>>>> But this has lots of issues, for example you might end up with
>>>>>> a hypercall per 4K page.
>>>>>> So in the end, he switched over to just reporting only
>>>>>> MAX_ORDER - 1 pages.
>>>>> You mean that I should only capture/attempt to isolate pages
>>>>> with order MAX_ORDER - 1?
>>>>>> Would that be a good idea for you too?
>>>>> Will it help if we have a threshold value based on the amount of
>>>>> memory captured instead of the number of entries/pages in the
>>>>> array?
>>>> This is what Wei's patches do at least.
>>> So in the solution I had posted I was looking more at
>>> HUGETLB_PAGE_ORDER and above as the size of pages to provide the
>>> hints on [1]. The advantage to doing that is that you can also
>>> avoid fragmenting huge pages, which in turn can cause what looks
>>> like a memory leak as the memory subsystem attempts to reassemble
>>> huge pages [2]. In my mind a 2MB page makes good sense in terms of
>>> the size of things to be performing hints on, as anything smaller
>>> than that is just going to end up being a bunch of extra work and
>>> causing a bunch of fragmentation.
>> In my opinion, in any implementation, the page size to store before
>> reporting depends on the allocation pattern of the workload running
>> in the guest.
> I suggest you take a look at item 2 that I had called out in the
> previous email. There are known issues with providing hints smaller
> than THP using MADV_DONTNEED or MADV_FREE. Specifically, what will
> happen is that you end up breaking up a higher-order transparent
> huge page, backfilling a few holes with other pages, but then the
> memory allocation subsystem attempts to reassemble the larger THP
> page, resulting in an application exhibiting behavior similar to a
> memory leak while not actually allocating memory, since it is
> sitting on fragments of THP pages.

I will look into this.
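Just to make sure I follow item 2, something like this untested
sketch is what skipping sub-THP pages at capture time would look
like (the helper name is made up, it is not from the posted patch):

        #include <linux/mm.h>
        #include <linux/hugetlb.h>

        /*
         * Hypothetical capture-time filter: only record freed pages of
         * at least huge-page order (2MB on x86), so a host-side
         * MADV_DONTNEED/MADV_FREE never lands inside a THP range and
         * forces the guest to reassemble the huge page later.
         */
        static inline bool page_worth_hinting(unsigned int order)
        {
                return order >= HUGETLB_PAGE_ORDER;
        }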
>
> Also while I am thinking of it, I haven't noticed anywhere that you
> are handling the case of a device assigned to the guest. That seems
> like a spot where we are going to have to stop hinting as well,
> aren't we? Otherwise we would need to redo the memory mapping of the
> guest in the IOMMU every time a page is evicted and replaced.

I haven't explored such a use case as of now, but I will definitely
explore it.

>
>> I am also planning to try Michael's suggestion of using
>> MAX_ORDER - 1. However, I am still thinking about a workload which
>> I can use to test its effectiveness.
> You might want to look at doing something like min(MAX_ORDER - 1,
> HUGETLB_PAGE_ORDER). I know for x86 a 2MB page is the upper limit
> for THP, which is the most likely to be used page size with the
> guest.

Sure, thanks for the suggestion. (A rough sketch of that threshold is
below, after the thread.)

>
>>> The only issue with limiting things on an arbitrary boundary like
>>> that is that you have to hook into the buddy allocator to catch
>>> the cases where a page has been merged up into that range.
>> I don't think I understood your comment completely. In any case, we
>> have to rely on the buddy for merging the pages.
>>> [1] https://lkml.org/lkml/2019/2/4/903
>>> [2] https://blog.digitalocean.com/transparent-huge-pages-and-alternative-memory-allocators/
>> --
>> Regards
>> Nitesh
>>

--
Regards
Nitesh
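A minimal, untested sketch of the min(MAX_ORDER - 1,
HUGETLB_PAGE_ORDER) threshold suggested above (the helper name is
made up):

        #include <linux/kernel.h>
        #include <linux/mmzone.h>
        #include <linux/hugetlb.h>

        /*
         * Hypothetical hinting threshold: never hint below THP size
         * (to avoid splitting huge pages), and never above the
         * largest order the buddy allocator actually manages.
         */
        static inline unsigned int hint_min_order(void)
        {
                return min_t(unsigned int, MAX_ORDER - 1,
                             HUGETLB_PAGE_ORDER);
        }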