From: Nitesh Narayan Lal
To: Alexander Duyck, "Michael S. Tsirkin"
Cc: David Hildenbrand, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino@redhat.com, pagupta@redhat.com, wei.w.wang@intel.com, Yang Zhang, Rik van Riel, dodgen@google.com, Konrad Rzeszutek Wilk, dhildenb@redhat.com, Andrea Arcangeli
Subject: Re: [RFC][Patch v9 0/6] KVM: Guest Free Page Hinting
Date: Mon, 25 Mar 2019 10:27:46 -0400
Organization: Red Hat Inc,
Message-ID: <99b9fa88-17b1-f2a9-7dd4-7a8f6e790d30@redhat.com>
In-Reply-To: <6ab9b763-ac90-b3db-3712-79a20c949d5d@redhat.com>
References: <20190306155048.12868-1-nitesh@redhat.com>
 <20190306110501-mutt-send-email-mst@kernel.org>
 <20190306130955-mutt-send-email-mst@kernel.org>
 <4bd54f8b-3e9a-3493-40be-668962282431@redhat.com>
 <6d744ed6-9c1c-b29f-aa32-d38387187b74@redhat.com>
 <6709bb82-5e99-019d-7de0-3fded385b9ac@redhat.com>
 <6ab9b763-ac90-b3db-3712-79a20c949d5d@redhat.com>

On 3/20/19 9:18 AM, Nitesh Narayan Lal wrote:
> On 3/19/19 1:59 PM, Nitesh Narayan Lal wrote:
>> On 3/19/19 1:38 PM, Alexander Duyck wrote:
>>> On Tue, Mar 19, 2019 at 9:04 AM Nitesh Narayan Lal wrote:
>>>> On 3/19/19 9:33 AM, David Hildenbrand wrote:
>>>>> On 18.03.19 16:57, Nitesh Narayan Lal wrote:
>>>>>> On 3/14/19 12:58 PM, Alexander Duyck wrote:
>>>>>>> On Thu, Mar 14, 2019 at 9:43 AM Nitesh Narayan Lal wrote:
>>>>>>>> On 3/6/19 1:12 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Wed, Mar 06, 2019 at 01:07:50PM -0500, Nitesh Narayan Lal wrote:
>>>>>>>>>> On 3/6/19 11:09 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Wed, Mar 06, 2019 at 10:50:42AM -0500, Nitesh Narayan Lal wrote:
>>>>>>>>>>>> The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables the guests with no page cache to rapidly free and reclaims memory to and from the host respectively.
>>>>>>>>>>>>
>>>>>>>>>>>> Benefit:
>>>>>>>>>>>> With this patch-series, in our test-case, executed on a single system and single NUMA node with 15GB memory, we were able to successfully launch 5 guests(each with 5 GB memory) when page hinting was enabled and 3 without it. (Detailed explanation of the test procedure is provided at the bottom under Test - 1).
>>>>>>>>>>>>
>>>>>>>>>>>> Changelog in v9:
>>>>>>>>>>>> * Guest free page hinting hook is now invoked after a page has been merged in the buddy.
>>>>>>>>>>>> * Free pages only with order FREE_PAGE_HINTING_MIN_ORDER(currently defined as MAX_ORDER - 1) are captured.
>>>>>>>>>>>> * Removed kthread which was earlier used to perform the scanning, isolation & reporting of free pages.
>>>>>>>>>>>> * Pages, captured in the per cpu array are sorted based on the zone numbers. This is to avoid redundancy of acquiring zone locks.
>>>>>>>>>>>> * Dynamically allocated space is used to hold the isolated guest free pages.
>>>>>>>>>>>> * All the pages are reported asynchronously to the host via virtio driver.
>>>>>>>>>>>> * Pages are returned back to the guest buddy free list only when the host response is received.
>>>>>>>>>>>>
>>>>>>>>>>>> Pending items:
>>>>>>>>>>>> * Make sure that the guest free page hinting's current implementation doesn't break hugepages or device assigned guests.
>>>>>>>>>>>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device side support. (It is currently missing)
>>>>>>>>>>>> * Compare reporting free pages via vring with vhost.
>>>>>>>>>>>> * Decide between MADV_DONTNEED and MADV_FREE.
>>>>>>>>>>>> * Analyze overall performance impact due to guest free page hinting.
>>>>>>>>>>>> * Come up with proper/traceable error-message/logs.
>>>>>>>>>>>>
>>>>>>>>>>>> Tests:
>>>>>>>>>>>> 1. Use-case - Number of guests we can launch
>>>>>>>>>>>>
>>>>>>>>>>>> NUMA Nodes = 1 with 15 GB memory
>>>>>>>>>>>> Guest Memory = 5 GB
>>>>>>>>>>>> Number of cores in guest = 1
>>>>>>>>>>>> Workload = test allocation program allocates 4GB memory, touches it via memset and exits.
>>>>>>>>>>>> Procedure =
>>>>>>>>>>>> The first guest is launched and once its console is up, the test allocation program is executed with 4 GB memory request (Due to this the guest occupies almost 4-5 GB of memory in the host in a system without page hinting). Once this program exits at that time another guest is launched in the host and the same process is followed. We continue launching the guests until a guest gets killed due to low memory condition in the host.
>>>>>>>>>>>>
>>>>>>>>>>>> Results:
>>>>>>>>>>>> Without hinting = 3
>>>>>>>>>>>> With hinting = 5
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Hackbench
>>>>>>>>>>>> Guest Memory = 5 GB
>>>>>>>>>>>> Number of cores = 4
>>>>>>>>>>>> Number of tasks    Time with Hinting    Time without Hinting
>>>>>>>>>>>> 4000               19.540               17.818
>>>>>>>>>>>>
>>>>>>>>>>> How about memhog btw?
>>>>>>>>>>> Alex reported:
>>>>>>>>>>>
>>>>>>>>>>>     My testing up till now has consisted of setting up 4 8GB VMs on a system
>>>>>>>>>>>     with 32GB of memory and 4GB of swap. To stress the memory on the system I
>>>>>>>>>>>     would run "memhog 8G" sequentially on each of the guests and observe how
>>>>>>>>>>>     long it took to complete the run. The observed behavior is that on the
>>>>>>>>>>>     systems with these patches applied in both the guest and on the host I was
>>>>>>>>>>>     able to complete the test with a time of 5 to 7 seconds per guest. On a
>>>>>>>>>>>     system without these patches the time ranged from 7 to 49 seconds per
>>>>>>>>>>>     guest. I am assuming the variability is due to time being spent writing
>>>>>>>>>>>     pages out to disk in order to free up space for the guest.
>>>>>>>>>>>
>>>>>>>>>> Here are the results:
>>>>>>>>>>
>>>>>>>>>> Procedure: 3 Guests of size 5GB is launched on a single NUMA node with
>>>>>>>>>> total memory of 15GB and no swap. In each of the guest, memhog is run
>>>>>>>>>> with 5GB. Post-execution of memhog, Host memory usage is monitored by
>>>>>>>>>> using Free command.
>>>>>>>>>>
>>>>>>>>>> Without Hinting:
>>>>>>>>>>             Time of execution    Host used memory
>>>>>>>>>> Guest 1:    45 seconds           5.4 GB
>>>>>>>>>> Guest 2:    45 seconds           10 GB
>>>>>>>>>> Guest 3:    1 minute             15 GB
>>>>>>>>>>
>>>>>>>>>> With Hinting:
>>>>>>>>>>             Time of execution    Host used memory
>>>>>>>>>> Guest 1:    49 seconds           2.4 GB
>>>>>>>>>> Guest 2:    40 seconds           4.3 GB
>>>>>>>>>> Guest 3:    50 seconds           6.3 GB
>>>>>>>>> OK so no improvement. OTOH Alex's patches cut time down to 5-7 seconds
>>>>>>>>> which seems better. Want to try testing Alex's patches for comparison?
>>>>>>>>>
>>>>>>>> I realized that the last time I reported the memhog numbers, I didn't
>>>>>>>> enable the swap due to which the actual benefits of the series were not
>>>>>>>> shown.
>>>>>>>> I have re-run the test by including some of the changes suggested by
>>>>>>>> Alexander and David:
>>>>>>>>     * Reduced the size of the per-cpu array to 32 and minimum hinting
>>>>>>>>       threshold to 16.
>>>>>>>>     * Reported length of isolated pages along with start pfn, instead of
>>>>>>>>       the order from the guest.
>>>>>>>>     * Used the reported length to madvise the entire length of address
>>>>>>>>       instead of a single 4K page.
>>>>>>>>     * Replaced MADV_DONTNEED with MADV_FREE.
>>>>>>>>
>>>>>>>> Setup for the test:
>>>>>>>> NUMA node:1
>>>>>>>> Memory: 15GB
>>>>>>>> Swap: 4GB
>>>>>>>> Guest memory: 6GB
>>>>>>>> Number of core: 1
>>>>>>>>
>>>>>>>> Process: A guest is launched and memhog is run with 6GB. As its
>>>>>>>> execution is over next guest is launched. Everytime memhog execution
>>>>>>>> time is monitored.
>>>>>>>> Results:
>>>>>>>>     Without Hinting:
>>>>>>>>                 Time of execution
>>>>>>>>     Guest1:     22s
>>>>>>>>     Guest2:     24s
>>>>>>>>     Guest3:     1m29s
>>>>>>>>
>>>>>>>>     With Hinting:
>>>>>>>>                 Time of execution
>>>>>>>>     Guest1:     24s
>>>>>>>>     Guest2:     25s
>>>>>>>>     Guest3:     28s
>>>>>>>>
>>>>>>>> When hinting is enabled swap space is not used until memhog with 6GB is
>>>>>>>> ran in 6th guest.
>>>>>>> So one change you may want to make to your test setup would be to
>>>>>>> launch the tests sequentially after all the guests all up, instead of
>>>>>>> combining the test and guest bring-up. In addition you could run
>>>>>>> through the guests more than once to determine a more-or-less steady
>>>>>>> state in terms of the performance as you move between the guests after
>>>>>>> they have hit the point of having to either swap or pull MADV_FREE
>>>>>>> pages.
>>>>>> I tried running memhog as you suggested, here are the results:
>>>>>> Setup for the test:
>>>>>> NUMA node:1
>>>>>> Memory: 15GB
>>>>>> Swap: 4GB
>>>>>> Guest memory: 6GB
>>>>>> Number of core: 1
>>>>>>
>>>>>> Process: 3 guests are launched and memhog is run with 6GB. Results are
>>>>>> monitored after 1st-time execution of memhog. Memhog is launched
>>>>>> sequentially in each of the guests and time is observed after the
>>>>>> execution of all 3 memhog is over.
>>>>>>
>>>>>> Results:
>>>>>> Without Hinting
>>>>>>     Time of Execution
>>>>>>     1. 6m48s
>>>>>>     2. 6m9s
>>>>>>
>>>>>> With Hinting
>>>>>> Array size:16 Minimum Threshold:8
>>>>>>     1. 2m57s
>>>>>>     2. 2m20s
>>>>>>
>>>>>> The memhog execution time in the case of hinting is still not that low
>>>>>> as we would have expected. This is due to the usage of swap space.
>>>>>> Although wrt to non-hinting when swap used space is around 3.5G, with
>>>>>> hinting it remains to around 1.1-1.5G.
>>>>>> I did try using a zone free page barrier which prevented hinting when
>>>>>> free pages of order HINTING_ORDER goes below 256. This further brings
>>>>>> down the swap usage to 100-150 MB. The tricky part of this approach is
>>>>>> to configure this barrier condition for different guests.
>>>>>>
>>>>>> Array size:16 Minimum Threshold:8
>>>>>>     1. 1m16s
>>>>>>     2. 1m41s
>>>>>>
>>>>>> Note: Memhog time does seem to vary a little bit on every boot with or
>>>>>> without hinting.
>>>>>>
>>>>> I don't quite understand yet why "hinting more pages" (no free page
>>>>> barrier) should result in a higher swap usage in the hypervisor
>>>>> (1.1-1.5GB vs. 100-150 MB). If we are "hinting more pages" I would have
>>>>> guessed that runtime could get slower, but not that we need more swap.
>>>>>
>>>>> One theory:
>>>>>
>>>>> If you hint all MAX_ORDER - 1 pages, at one point it could be that all
>>>>> "remaining" free pages are currently isolated to be hinted. As MM needs
>>>>> more pages for a process, it will fallback to using "MAX_ORDER - 2"
>>>>> pages and so on. These pages, when they are freed, you won't hint
>>>>> anymore unless they get merged. But after all they won't get merged
>>>>> because they can't be merged (otherwise they wouldn't be "MAX_ORDER - 2"
>>>>> after all right from the beginning).
>>>>>
>>>>> Try hinting a smaller granularity to see if this could actually be the case.
>>>> So I have two questions in my mind after looking at the results now:
>>>> 1. Why swap is coming into the picture when hinting is enabled?
>>>> 2. Same to what you have raised.
>>>> For the 1st question, I think the answer is: (correct me if I am wrong.)
>>>> Memhog while writing the memory does free memory but the pages it frees
>>>> are of a lower order which doesn't merge until the memhog write
>>>> completes. After which we do get the MAX_ORDER - 1 page from the buddy
>>>> resulting in hinting.
>>>> As all 3 memhog are running parallelly we don't get free memory until
>>>> one of them completes.
>>>> This does explain that when 3 guests each of 6GB on a 15GB host tries to
>>>> run memhog with 6GB parallelly, swap comes into the picture even if
>>>> hinting is enabled.
>>> Are you running them in parallel or sequentially?
>> I was running them parallelly but then I realized to see any benefits,
>> in that case, I should have run less number of guests.
>>> I had suggested
>>> running them serially so that the previous one could complete and free
>>> the memory before the next one allocated memory. In that setup you
>>> should see the guests still swapping without hints, but with hints the
>>> guest should free the memory up before the next one starts using it.
>> Yeah, I just realized this. Thanks for the clarification.
>>> If you are running them in parallel then you are going to see things
>>> going to swap because memhog does like what the name implies and it
>>> will use all of the memory you give it. It isn't until it completes
>>> that the memory is freed.
>>>
>>>> This doesn't explain why putting a barrier or avoid hinting reduced the
>>>> swap usage. It seems I possibly had a wrong impression of the delaying
>>>> hinting idea which we discussed.
>>>> As I was observing the value of the swap at the end of the memhog
>>>> execution which is logically incorrect. I will re-run the test and
>>>> observe the highest swap usage during the entire execution of memhog for
>>>> hinting vs non-hinting.
>>> So one option you may look at if you are wanting to run the tests in
>>> parallel would be to limit the number of tests you have running at the
>>> same time. If you have 15G of memory and 6G per guest you should be
>>> able to run 2 sessions at a time without going to swap, however if you
>>> run all 3 then you are likely going to be going to swap even with
>>> hinting.
>>>
>>> - Alex
> Here are the updated numbers excluding the guest bring-up cost:
> Setup for the test-
> NUMA node:1
> Memory: 15GB
> Swap: 4GB
> Guest memory: 6GB
> Number of core: 1
> Process: 3 guests are launched and memhog is run serially with 6GB.
> Results:
> Without Hinting
>             Time of Execution
> Guest1:     56s
> Guest2:     45s
> Guest3:     3m41s
>
> With Hinting
> Guest1:     46s
> Guest2:     45s
> Guest3:     49s
>

I performed some experiments to see whether the current implementation of
hinting breaks THP. I used the AnonHugePages counter to track the THP pages
currently in use, and memhog as the guest workload.

Setup:
Host Size: 30GB (No swap)
Guest Size: 15GB
THP Size: 2MB

Process: Different kernels are installed in the guest so that it hints at
different granularities (MAX_ORDER - 1, MAX_ORDER - 2 and MAX_ORDER - 3).
"memhog 15G" is run multiple times in the same guest while AnonHugePages
usage is observed on the host.

Observation: There is no THP split for hinting granularities MAX_ORDER - 1
and MAX_ORDER - 2, whereas for hinting granularity MAX_ORDER - 3 the THP
does split, irrespective of whether MADV_FREE or MADV_DONTNEED is used.
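
One possible reading of this observation, not verified in the experiments above, is a simple size argument: a hint that covers less than one 2MB THP can only discard part of a huge page, which would force a split. The sketch below works through that arithmetic and also shows one way the AnonHugePages counter could be sampled. It assumes 4 KiB base pages and MAX_ORDER = 11 (the usual x86_64 default for kernels of this period); the constants and the helper names hint_block_kb, may_split_thp and anon_huge_pages_kb are illustrative and are not part of the patch series.

```python
import re

PAGE_SIZE_KB = 4      # assumption: 4 KiB base pages (x86_64)
MAX_ORDER = 11        # assumption: x86_64 default for kernels of this period
THP_SIZE_KB = 2048    # THP size from the setup above (2MB)


def hint_block_kb(order: int) -> int:
    """Size in kB of one buddy block of the given order."""
    return PAGE_SIZE_KB << order


def may_split_thp(order: int) -> bool:
    """True if a hinted block is smaller than one THP, so that a hint
    would cover only part of a huge page."""
    return hint_block_kb(order) < THP_SIZE_KB


def anon_huge_pages_kb(meminfo_text: str) -> int:
    """Extract the AnonHugePages value (in kB) from /proc/meminfo contents."""
    match = re.search(r"^AnonHugePages:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("AnonHugePages line not found")
    return int(match.group(1))
```

Under these assumptions, MAX_ORDER - 1 and MAX_ORDER - 2 correspond to 4MB and 2MB blocks, while MAX_ORDER - 3 corresponds to 1MB, which is smaller than the 2MB THP. On the host, anon_huge_pages_kb() could be fed the contents of /proc/meminfo before and after each memhog run to see whether the counter drops.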
-- 
Regards
Nitesh