On 3/12/19 5:13 PM, Alexander Duyck wrote:
> On Tue, Mar 12, 2019 at 12:46 PM Nitesh Narayan Lal wrote:
>> On 3/8/19 4:39 PM, Alexander Duyck wrote:
>>> On Fri, Mar 8, 2019 at 11:39 AM Nitesh Narayan Lal wrote:
>>>> On 3/8/19 2:25 PM, Alexander Duyck wrote:
>>>>> On Fri, Mar 8, 2019 at 11:10 AM Nitesh Narayan Lal wrote:
>>>>>> On 3/8/19 1:06 PM, Alexander Duyck wrote:
>>>>>>> On Thu, Mar 7, 2019 at 6:32 PM Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Mar 07, 2019 at 02:35:53PM -0800, Alexander Duyck wrote:
>>>>>>>>> The only other thing I still want to try and see if I can do is
>>>>>>>>> to add a jiffies value to the page private data in the case of
>>>>>>>>> the buddy pages.
>>>>>>>> Actually there's one extra thing I think we should do, and that
>>>>>>>> is make sure we do not leave less than X% of the free memory at
>>>>>>>> a time. This way the chances of triggering an OOM are lower.
>>>>>>> If nothing else we could probably look at doing a watermark of
>>>>>>> some sort so we have to have X amount of memory free but not
>>>>>>> hinted before we will start providing the hints. It would just be
>>>>>>> a matter of tracking how much memory we have hinted on versus the
>>>>>>> amount of memory that has been pulled from that pool.
>>>>>> This is to avoid a false OOM in the guest?
>>>>> Partially, though it would still be possible. Basically it would
>>>>> just be a way of determining when we have hinted "enough". It
>>>>> doesn't do us much good to be hinting on free memory if the guest
>>>>> is already constrained and just going to reallocate the memory
>>>>> shortly after we hinted on it. The idea is that with a watermark we
>>>>> can avoid hinting until we start having pages that are actually
>>>>> going to stay free for a while.
>>>>>
>>>>>>> It is another reason why we probably want a bit in the buddy
>>>>>>> pages somewhere to indicate if a page has been hinted or not, as
>>>>>>> we can then use that to determine if we have to account for it in
>>>>>>> the statistics.
>>>>>> The one benefit which I can see of having an explicit bit is that
>>>>>> it will help us to have a single hook away from the hot path
>>>>>> within the buddy merging code (just like your arch_merge_page) and
>>>>>> still avoid duplicate hints while releasing pages.
>>>>>>
>>>>>> I still have to check PG_idle and PG_young which you mentioned,
>>>>>> but I don't think we can reuse any existing bits.
>>>>> Those are bits that are already there for 64b. I think those exist
>>>>> in the page extension for 32b systems. If I am not mistaken they
>>>>> are only used in VMA mapped memory. What I was getting at is that
>>>>> those are the bits we could think about reusing.
>>>>>
>>>>>> If we really want to have something like a watermark, then can't
>>>>>> we use zone->free_pages before isolating to see how many free
>>>>>> pages are there and put a threshold on it? (__isolate_free_page()
>>>>>> does a similar thing, but on a per-request basis.)
>>>>> Right. That is only part of it though, since that tells you how
>>>>> many free pages are there. But how many of those free pages are
>>>>> hinted? That is the part we would need to track separately and then
>>>>> compare to free_pages to determine if we need to start hinting on
>>>>> more memory or not.
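Something like this is how I picture that accounting (a minimal
sketch only; free_hinted_pages and HINTING_WATERMARK are made-up
names, and the counter would have to be updated wherever hinted pages
enter or leave the free lists):

    /*
     * Sketch: decide whether a zone has enough free-but-unhinted
     * memory to make another round of hinting worthwhile.
     */
    static bool zone_ready_for_hinting(struct zone *zone)
    {
            unsigned long free = zone_page_state(zone, NR_FREE_PAGES);
            unsigned long hinted = zone->free_hinted_pages; /* made up */

            /*
             * free - hinted is the "free but not hinted" pool; only
             * start a hinting round once it exceeds the high watermark
             * plus some slack, so we do not hint pages the guest is
             * about to reallocate anyway.
             */
            return free > hinted + high_wmark_pages(zone) + HINTING_WATERMARK;
    }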
>>>> Only pages which are isolated will be hinted, and once a page is
>>>> isolated it will not be counted in the zone free pages.
>>>> Feel free to correct me if I am wrong.
>>> You are correct up to here. When we isolate the page it isn't
>>> counted against the free pages. However, after we complete the hint
>>> we end up taking it out of isolation and returning it to the "free"
>>> state, so it will be counted against the free pages.
>>>
>>>> If I am understanding it correctly, you only want to hint the idle
>>>> pages, is that right?
>>> Getting back to the ideas from our earlier discussion, we had 3
>>> stages for things: free but not hinted, isolated due to hinting, and
>>> free and hinted. So what we would need to do is identify the size of
>>> the first pool that is free and not hinted by knowing the total
>>> number of free pages, and then subtract the size of the pages that
>>> are hinted and still free.
>> To summarize, for now, I think it makes sense to stick with the
>> current approach, as this way we can avoid any locking in the
>> allocation path and reduce the number of hypercalls by hinting on a
>> bunch of MAX_ORDER - 1 pages at a time.
> I'm not sure what you are talking about by "avoid any locking in the
> allocation path". Are you talking about the spin on the idle bit? If
> so, then yes.

Yeap!

> However I have been testing your patches and I was correct in the
> assumption that you forgot to handle the zone lock when you were
> freeing pages via __free_one_page.

Yes, these fixes are still on my list, along with addressing the
comments you provided in the code. (One of them is to fix
release_buddy_pages().)

> I just did a quick copy/paste of your zone lock handling from the
> guest_free_page_hinting function into the release_buddy_pages
> function and then I was able to enable multiple CPUs without any
> issues.
>
>> For the next step, other than the comments received in the code and
>> what I mentioned in the cover email, I would like to do the
>> following:
>> 1. Explore the watermark idea suggested by Alex and bring down
>> memhog execution time if possible.
> So there are a few things that are hurting us on the memhog test:
> 1. The current QEMU patch is only madvising 4K pages at a time; this
> is disabling THP and hurts the test.

Makes sense, thanks for pointing this out.

>
> 2. The fact that we madvise the pages away makes it so that we have
> to fault the page back in in order to use it for the memhog test. In
> order to avoid that penalty we may want to see if we can introduce
> some sort of "timeout" on the pages so that we are only hinting away
> old pages that have not been used for some period of time.

Possibly using MADV_FREE should also help with this; I will try it as
well. If we could come up with a bit which we could reuse, then we
may be able to tackle this issue easily. I will look into this.
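For reference, the QEMU-side change I have in mind would look roughly
like this (a sketch under my assumptions, not the actual patch;
region_addr/region_len stand in for the host-virtual range backing
the hinted MAX_ORDER - 1 block):

    #include <sys/mman.h>

    /*
     * Sketch: discard a hinted block with one madvise() call covering
     * the whole region, rather than 4K at a time, so THP mappings are
     * not broken up. MADV_FREE leaves the pages mapped and only
     * reclaims them under host memory pressure, which avoids the
     * fault-back penalty; fall back to MADV_DONTNEED where MADV_FREE
     * is unavailable.
     */
    static int discard_hinted_block(void *region_addr, size_t region_len)
    {
    #ifdef MADV_FREE
            if (madvise(region_addr, region_len, MADV_FREE) == 0)
                    return 0;
    #endif
            return madvise(region_addr, region_len, MADV_DONTNEED);
    }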
>
> 3. Currently we are still doing a large amount of processing in the
> page free path. Ideally we should look at getting away from trying
> to do so much per-cpu work and instead just have some small tasks
> that put the data needed in the page, and then have a separate
> thread walking the free_list checking that data, isolating the
> pages, hinting them, and then returning them back to the free_list.

I will probably defer this analysis until we have the other things
fixed. I can then evaluate/compare the performance impact of both
approaches and choose between them.

>
>> 2. Benchmark hinting v/s non-hinting more extensively.
>> Let me know if you have any specific suggestions in terms of the
>> tools I can run to do the same. (I am planning to run at least
>> netperf, hackbench and stress for this.)
> So I have been running the memhog 32g test and the will-it-scale
> page_fault1 test as my primary two tests for this so far.
>
> What I have seen so far has been pretty promising. I had to do some
> build fixes, fixes to QEMU to hint on the full size page instead of
> the 4K page, and fixes for locking, so this isn't exactly your
> original patch set, but with all that I am seeing data comparable to
> the original patch set I had.
>
> For memhog 32g I am seeing performance similar to a VM that was
> freshly booted. I make that the comparison because you will have to
> take page faults on a fresh boot as you access additional memory.
> However, after the first run the runtime drops from 22s to 20s
> without the hinting enabled.
>
> The big one that probably still needs some work will be the
> multi-cpu scaling. With each CPU taking the zone lock to pull pages
> out and put them back in the free list, I am seeing what looks like
> about a 10% drop in the page_fault1 test. Here are the results as I
> have seen so far on a 16 cpu 32G VM:
>
> -- baseline --
> ./runtest.py page_fault1
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,522242,93.73,514965,93.74,522242
> 2,929433,87.48,857280,87.50,1044484
> 3,1360651,81.25,1214224,81.48,1566726
> 4,1693709,75.01,1437156,76.33,2088968
> 5,2062392,68.77,1743294,70.78,2611210
> 6,2271363,62.54,1787238,66.75,3133452
> 7,2564479,56.33,1924684,61.77,3655694
> 8,2699897,50.09,2205783,54.28,4177936
> 9,2931697,43.85,2135788,50.20,4700178
> 10,2939384,37.63,2258725,45.04,5222420
> 11,3039010,31.41,2209401,41.04,5744662
> 12,3022976,25.19,2177655,35.68,6266904
> 13,3015683,18.98,2123546,31.73,6789146
> 14,2921798,12.77,2160489,27.30,7311388
> 15,2846758,6.51,1815036,17.40,7833630
> 16,2703146,0.36,2121018,18.21,8355872
>
> -- modified rh patchset --
> ./runtest.py page_fault1
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,527216,93.72,517459,93.70,527216
> 2,911239,87.48,843278,87.51,1054432
> 3,1295059,81.22,1193523,81.61,1581648
> 4,1649332,75.02,1439403,76.17,2108864
> 5,1985780,68.81,1745556,70.44,2636080
> 6,2174751,62.56,1769433,66.84,3163296
> 7,2433273,56.33,2121777,58.46,3690512
> 8,2537356,50.17,1901743,57.23,4217728
> 9,2737689,43.87,1859179,54.17,4744944
> 10,2718474,37.65,2188891,43.69,5272160
> 11,2743381,31.47,2205112,38.00,5799376
> 12,2738717,25.26,2117281,38.09,6326592
> 13,2643648,19.06,1887956,35.31,6853808
> 14,2598001,12.92,1916544,27.87,7381024
> 15,2498325,6.70,1992580,26.10,7908240
> 16,2424587,0.45,2137742,21.37,8435456
>
> As we discussed earlier, it would probably be good to focus on only
> pulling something like 4 to 8 (MAX_ORDER - 1) pages per round of
> hinting.

I agree that I should bring down the size of the page set on which I
am working.

> You might also look at only working one zone at a time. Then what
> you could do is look at placing the pages you have already hinted on
> at the tail end of the free_list and pull a new set of pages out to
> hint on.

I think for this we still need a way to check if a particular page is
hinted or not.

> You could do this all in one shot while holding the zone lock.
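For reference, the kind of per-zone round I picture for this looks
roughly as below (a very rough sketch; get_unhinted_page,
hint_pages_to_host and return_hinted_page are made-up names standing
in for "find a free page without the hinted bit set", "batched
hypercall", and "set the hinted bit and put the page back at the tail
of its free_list"):

    #define HINT_BATCH 8    /* 4 to 8 MAX_ORDER - 1 pages per round */

    static void hint_one_round(struct zone *zone)
    {
            struct page *batch[HINT_BATCH];
            unsigned int i, n = 0;

            spin_lock(&zone->lock);
            while (n < HINT_BATCH) {
                    struct page *page = get_unhinted_page(zone);

                    if (!page)
                            break;
                    /* Isolation takes the page off the free_list so
                     * the allocator cannot hand it out while the hint
                     * is in flight; it can fail a watermark check. */
                    if (!__isolate_free_page(page, MAX_ORDER - 1))
                            break;
                    batch[n++] = page;
            }
            spin_unlock(&zone->lock);

            /* One hypercall for the whole batch, outside the lock. */
            hint_pages_to_host(batch, n);

            spin_lock(&zone->lock);
            for (i = 0; i < n; i++)
                    return_hinted_page(zone, batch[i]);
            spin_unlock(&zone->lock);
    }

--
Regards
Nitesh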