From: Nitesh Narayan Lal
Subject: Re: [RFC][Patch v9 2/6] KVM: Enables the kernel to isolate guest free pages
Date: Wed, 13 Mar 2019 07:54:26 -0400
Message-ID: <1ae522f1-1e98-9eef-324c-29585fe574d6@redhat.com>
To: Alexander Duyck
Cc: "Michael S. Tsirkin", David Hildenbrand, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino@redhat.com, pagupta@redhat.com, wei.w.wang@intel.com, Yang Zhang, Rik van Riel, dodgen@google.com, Konrad Rzeszutek Wilk, dhildenb@redhat.com, Andrea Arcangeli
References: <20190306155048.12868-1-nitesh@redhat.com> <20190306155048.12868-3-nitesh@redhat.com> <2d9ae889-a9b9-7969-4455-ff36944f388b@redhat.com> <22e4b1cd-38a5-6642-8cbe-d68e4fcbb0b7@redhat.com> <78b604be-2129-a716-a7a6-f5b382c9fb9c@redhat.com> <20190307212845-mutt-send-email-mst@kernel.org> <17d2afa6-556e-ec73-40dc-beac536b3f20@redhat.com> <8f692047-4750-6827-1ee0-d3d354788f09@redhat.com> <41ae8afe-72c9-58e6-0cbb-9375c91ce37a@redhat.com>
List-Id: kvm.vger.kernel.org

On 3/12/19 5:13 PM, Alexander Duyck wrote:
> On Tue, Mar 12, 2019 at 12:46 PM Nitesh Narayan Lal wrote:
>> On 3/8/19 4:39 PM, Alexander Duyck wrote:
>>> On Fri, Mar 8, 2019 at 11:39 AM Nitesh Narayan Lal wrote:
>>>> On 3/8/19 2:25 PM, Alexander Duyck wrote:
>>>>> On Fri, Mar 8, 2019 at 11:10 AM Nitesh Narayan Lal wrote:
>>>>>> On 3/8/19 1:06 PM, Alexander Duyck wrote:
>>>>>>> On Thu, Mar 7, 2019 at 6:32 PM Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Mar 07, 2019 at 02:35:53PM -0800, Alexander Duyck wrote:
>>>>>>>>> The only other thing I still want to try and see if I can do is to add
>>>>>>>>> a jiffies value to the page private data in the case of the buddy
>>>>>>>>> pages.
>>>>>>>> Actually there's one extra thing I think we should do, and that is make
>>>>>>>> sure we do not leave less than X% of the free memory at a time.
>>>>>>>> This way the chances of triggering an OOM are lower.
>>>>>>> If nothing else we could probably look at doing a watermark of some
>>>>>>> sort so we have to have X amount of memory free but not hinted before
>>>>>>> we will start providing the hints. It would just be a matter of
>>>>>>> tracking how much memory we have hinted on versus the amount of memory
>>>>>>> that has been pulled from that pool.
>>>>>> This is to avoid a false OOM in the guest?
>>>>> Partially, though it would still be possible. Basically it would just
>>>>> be a way of determining when we have hinted "enough". Basically it
>>>>> doesn't do us much good to be hinting on free memory if the guest is
>>>>> already constrained and just going to reallocate the memory shortly
>>>>> after we hinted on it. The idea is with a watermark we can avoid
>>>>> hinting until we start having pages that are actually going to stay
>>>>> free for a while.
>>>>>
>>>>>>> It is another reason why we
>>>>>>> probably want a bit in the buddy pages somewhere to indicate if a page
>>>>>>> has been hinted or not as we can then use that to determine if we have
>>>>>>> to account for it in the statistics.
>>>>>> The one benefit which I can see of having an explicit bit is that it
>>>>>> will help us to have a single hook away from the hot path within buddy
>>>>>> merging code (just like your arch_merge_page) and still avoid duplicate
>>>>>> hints while releasing pages.
>>>>>>
>>>>>> I still have to check PG_idle and PG_young which you mentioned but I
>>>>>> don't think we can reuse any existing bits.
>>>>> Those are bits that are already there for 64b. I think those exist in
>>>>> the page extension for 32b systems. If I am not mistaken they are only
>>>>> used in VMA mapped memory. What I was getting at is that those are the
>>>>> bits we could think about reusing.
>>>>>
>>>>>> If we really want to have something like a watermark, then can't we use
>>>>>> zone->free_pages before isolating to see how many free pages are there
>>>>>> and put a threshold on it? (__isolate_free_page() does a similar thing
>>>>>> but it does that on a per-request basis.)
>>>>> Right. That is only part of it though since that tells you how many
>>>>> free pages are there. But how many of those free pages are hinted?
>>>>> That is the part we would need to track separately and then
>>>>> compare to free_pages to determine if we need to start hinting on more
>>>>> memory or not.
>>>> Only pages which are isolated will be hinted, and once a page is
>>>> isolated it will not be counted in the zone free pages.
>>>> Feel free to correct me if I am wrong.
>>> You are correct up to here. When we isolate the page it isn't counted
>>> against the free pages. However after we complete the hint we end up
>>> taking it out of isolation and returning it to the "free" state, so it
>>> will be counted against the free pages.
>>>
>>>> If I am understanding it correctly you only want to hint the idle pages,
>>>> is that right?
>>> Getting back to the ideas from our earlier discussion, we had 3 stages
>>> for things: free but not hinted, isolated due to hinting, and free and
>>> hinted. So what we would need to do is identify the size of the first
>>> pool that is free and not hinted by knowing the total number of free
>>> pages, and then subtract the size of the pages that are hinted and
>>> still free.
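
For the watermark bookkeeping, something along these lines is roughly
what I would try first (rough, untested sketch in kernel style;
zone_ready_for_hinting(), HINTING_WATERMARK_PAGES and the free_hinted
count are made-up names and do not exist today):

#include <linux/mmzone.h>
#include <linux/vmstat.h>

/* Made-up threshold: how many free, not-yet-hinted pages a zone should
 * accumulate before another hinting round is worth starting. */
#define HINTING_WATERMARK_PAGES	(16UL << (MAX_ORDER - 1))

/*
 * Sketch only: free_hinted would be a new counter tracking pages that
 * are back on the free list but have already been hinted.  It would be
 * incremented when a hinted page is returned to the free list and
 * decremented when such a page is allocated again.
 */
static bool zone_ready_for_hinting(struct zone *zone,
				   unsigned long free_hinted)
{
	unsigned long free = zone_page_state(zone, NR_FREE_PAGES);
	unsigned long unhinted = (free > free_hinted) ?
				 free - free_hinted : 0;

	return unhinted > HINTING_WATERMARK_PAGES;
}

A new hinting round would then only kick off while the free-but-not-
hinted pool stays above the threshold.
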
>> To summarize, for now, I think it makes sense to stick with the current
>> approach as this way we can avoid any locking in the allocation path and
>> reduce the number of hypercalls for a bunch of MAX_ORDER - 1 pages.
> I'm not sure what you are talking about by "avoid any locking in the
> allocation path". Are you talking about the spin on the idle bit? If so,
> then yes.

Yeap!

> However I have been testing your patches and I was correct
> in the assumption that you forgot to handle the zone lock when you
> were freeing pages in __free_one_page.

Yes, these are the remaining steps, apart from the comments you provided
in the code. (One of them is to fix release_buddy_page().)

> I just did a quick copy/paste from your
> zone lock handling from the guest_free_page_hinting function into the
> release_buddy_pages function and then I was able to enable multiple
> CPUs without any issues.
>
>> For the next step, other than the comments received in the code and what
>> I mentioned in the cover email, I would like to do the following:
>> 1. Explore the watermark idea suggested by Alex and bring down memhog
>> execution time if possible.
> So there are a few things that are hurting us on the memhog test:
> 1. The current QEMU patch is only madvising 4K pages at a time, which
> disables THP and hurts the test.

Makes sense, thanks for pointing this out.

>
> 2. The fact that we madvise the pages away makes it so that we have to
> fault the page back in in order to use it for the memhog test. In
> order to avoid that penalty we may want to see if we can introduce
> some sort of "timeout" on the pages so that we are only hinting away
> old pages that have not been used for some period of time.

Possibly using MADV_FREE should also help with this; I will try it as
well (a rough host-side sketch of what I mean is further down in this
mail). If we could come up with some bit which we could reuse, then we
may be able to tackle this issue easily. I will look into this.

>
> 3. Currently we are still doing a large amount of processing in the
> page free path. Ideally we should look at getting away from trying to
> do so much per-cpu work and instead just have some small tasks that
> put the data needed in the page, and then have a separate thread
> walking the free_list checking that data, isolating the pages, hinting
> them, and then returning them back to the free_list.

I will probably defer this analysis for now. Once we have the other
things fixed, I can evaluate/compare the performance impact of both
approaches and choose between them.

>
>> 2. Benchmark hinting vs. non-hinting more extensively.
>> Let me know if you have any specific suggestions in terms of the tools I
>> can run to do the same. (I am planning to run at least netperf, hackbench
>> and stress for this.)
> So I have been running the memhog 32g test and the will-it-scale
> page_fault1 test as my primary two tests for this so far.
>
> What I have seen so far has been pretty promising. I had to do some
> build fixes, fixes to QEMU to hint on the full size page instead of 4K
> page, and fixes for locking so this isn't exactly your original patch
> set, but with all that I am seeing data comparable to the original
> patch set I had.
>
> For memhog 32g I am seeing performance similar to a VM that was fresh
> booted. I make that the comparison because you will have to take page
> faults on a fresh boot as you access additional memory. However after
> the first run the runtime drops from 22s to 20s without the
> hinting enabled.
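
To spell out what I meant above about MADV_FREE and hinting the full
(MAX_ORDER - 1) sized chunk, here is an illustrative host-side sketch
(not the actual QEMU code; hint_free_region() is a made-up helper and
the HVA/length are assumed to come from the hinting request):

#include <sys/mman.h>
#include <stddef.h>

/*
 * Hint one guest-free chunk back to the host.  Hinting the full
 * (MAX_ORDER - 1) sized region instead of 4K pieces keeps THP mappings
 * intact, and MADV_FREE lets the host reclaim the memory lazily, so
 * the guest does not always pay the fault-in penalty when it reuses
 * the pages soon afterwards.  Falls back to MADV_DONTNEED where
 * MADV_FREE is not available or fails.
 */
static int hint_free_region(void *hva, size_t len)
{
#ifdef MADV_FREE
	if (madvise(hva, len, MADV_FREE) == 0)
		return 0;
#endif
	return madvise(hva, len, MADV_DONTNEED);
}
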
>
> The big one that probably still needs some work will be the multi-cpu
> scaling. With the per-cpu locking for the zone lock to pull pages out,
> and put them back in the free list, I am seeing what looks like about a
> 10% drop in the page_fault1 test. Here are the results as I have seen
> so far on a 16 cpu 32G VM:
>
> -- baseline --
> ./runtest.py page_fault1
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,522242,93.73,514965,93.74,522242
> 2,929433,87.48,857280,87.50,1044484
> 3,1360651,81.25,1214224,81.48,1566726
> 4,1693709,75.01,1437156,76.33,2088968
> 5,2062392,68.77,1743294,70.78,2611210
> 6,2271363,62.54,1787238,66.75,3133452
> 7,2564479,56.33,1924684,61.77,3655694
> 8,2699897,50.09,2205783,54.28,4177936
> 9,2931697,43.85,2135788,50.20,4700178
> 10,2939384,37.63,2258725,45.04,5222420
> 11,3039010,31.41,2209401,41.04,5744662
> 12,3022976,25.19,2177655,35.68,6266904
> 13,3015683,18.98,2123546,31.73,6789146
> 14,2921798,12.77,2160489,27.30,7311388
> 15,2846758,6.51,1815036,17.40,7833630
> 16,2703146,0.36,2121018,18.21,8355872
>
> -- modified rh patchset --
> ./runtest.py page_fault1
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,527216,93.72,517459,93.70,527216
> 2,911239,87.48,843278,87.51,1054432
> 3,1295059,81.22,1193523,81.61,1581648
> 4,1649332,75.02,1439403,76.17,2108864
> 5,1985780,68.81,1745556,70.44,2636080
> 6,2174751,62.56,1769433,66.84,3163296
> 7,2433273,56.33,2121777,58.46,3690512
> 8,2537356,50.17,1901743,57.23,4217728
> 9,2737689,43.87,1859179,54.17,4744944
> 10,2718474,37.65,2188891,43.69,5272160
> 11,2743381,31.47,2205112,38.00,5799376
> 12,2738717,25.26,2117281,38.09,6326592
> 13,2643648,19.06,1887956,35.31,6853808
> 14,2598001,12.92,1916544,27.87,7381024
> 15,2498325,6.70,1992580,26.10,7908240
> 16,2424587,0.45,2137742,21.37,8435456
>
> As we discussed earlier, it would probably be good to focus on only
> pulling something like 4 to 8 (MAX_ORDER - 1) pages per round of
> hinting.

I agree that I should bring down the number of pages on which I am
working per round.

> You might also look at only working one zone at a time. Then
> what you could do is look at placing the pages you have already hinted
> on at the tail end of the free_list and pull a new set of pages out to
> hint on.

I think for this we still need a way to check if a particular page is
hinted or not.

> You could do this all in one shot while holding the zone
> lock.
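
Just to make sure I am following, you mean something along these lines
(pseudo-ish and untested; zone and order are assumed to be in scope,
only the MIGRATE_MOVABLE list is shown, and PageHinted() plus
HINTING_BATCH are hypothetical names standing in for the new page bit
and the per-round batch size):

	struct page *isolated_pages[HINTING_BATCH];
	struct page *page, *next;
	struct free_area *area = &zone->free_area[order];
	struct list_head *list = &area->free_list[MIGRATE_MOVABLE];
	unsigned long budget = area->nr_free;
	unsigned long flags;
	unsigned int isolated = 0;

	spin_lock_irqsave(&zone->lock, flags);
	list_for_each_entry_safe(page, next, list, lru) {
		if (!budget--)
			break;
		if (PageHinted(page)) {		/* hypothetical page bit */
			/* Already hinted: park it at the tail so the
			 * next scan finds unhinted pages first. */
			list_move_tail(&page->lru, list);
			continue;
		}
		if (__isolate_free_page(page, order))
			isolated_pages[isolated++] = page;
		if (isolated >= HINTING_BATCH)	/* e.g. 4-8 blocks */
			break;
	}
	spin_unlock_irqrestore(&zone->lock, flags);

	/* ...hint isolated_pages[0..isolated), mark each one hinted,
	 * then retake zone->lock and return them to the tail of the
	 * free list... */

If that is the idea, then the hinted bit (or some equivalent marker) is
exactly what we would need to skip already-hinted blocks cheaply.
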
-- 
Regards
Nitesh