On 6/3/19 2:04 PM, Michael S. Tsirkin wrote: > On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote: >> This patch series proposes an efficient mechanism for communicating free memory >> from a guest to its hypervisor. It especially enables guests with no page cache >> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to >> rapidly hand back free memory to the hypervisor. >> This approach has a minimal impact on the existing core-mm infrastructure. > Could you help us compare with Alex's series? > What are the main differences? Sorry for the late reply, but I haven't been feeling too well during the last week. The main differences are that this series uses a bitmap to track pages that should be hinted to the hypervisor, while Alexander's series tracks it directly in core-mm. Also in order to prevent duplicate hints Alexander's series uses a newly defined page flag whereas I have added another argument to __free_one_page. For these reasons, Alexander's series is relatively more core-mm invasive, while this series is lightweight (e.g., LOC). We'll have to see if there are real performance differences. I'm planning on doing some further investigations/review/testing/... once I'm back on track. > >> Measurement results (measurement details appended to this email): >> * With active page hinting, 3 more guests could be launched each of 5 GB(total >> 5 vs. 2) on a 15GB (single NUMA) system without swapping. >> * With active page hinting, on a system with 15 GB of (single NUMA) memory and >> 4GB of swap, the runtime of "memhog 6G" in 3 guests (run sequentially) resulted >> in the last invocation to only need 37s compared to 3m35s without page hinting. >> >> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps. >> A new hook after buddy merging is used to set the bits in the bitmap. >> Currently, the bits are only cleared when pages are hinted, not when pages are >> re-allocated. >> >> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A >> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory >> threshold is met, trying to isolate and report pages that are still free. >> >> The isolated pages are reported via virtio-balloon, which is responsible for >> sending batched pages to the host synchronously. Once the hypervisor processed >> the hinting request, the isolated pages are returned back to the buddy. >> >> The key changes made in this series compared to v9[1] are: >> * Pages only in the chunks of "MAX_ORDER - 2" are reported to the hypervisor to >> not break up the THP. >> * At a time only a set of 16 pages can be isolated and reported to the host to >> avoids any false OOMs. >> * page_hinting.c is moved under mm/ from virt/kvm/ as the feature is dependent >> on virtio and not on KVM itself. This would enable any other hypervisor to use >> this feature by implementing virtio devices. >> * The sysctl variable is replaced with a virtio-balloon parameter to >> enable/disable page-hinting. >> >> Pending items: >> * Test device assigned guests to ensure that hinting doesn't break it. >> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device side support. >> * Compare reporting free pages via vring with vhost. >> * Decide between MADV_DONTNEED and MADV_FREE. >> * Look into memory hotplug, more efficient locking, possible races when >> disabling. >> * Come up with proper/traceable error-message/logs. >> * Minor reworks and simplifications (e.g., virtio protocol). >> >> Benefit analysis: >> 1. Use-case - Number of guests that can be launched without swap usage >> NUMA Nodes = 1 with 15 GB memory >> Guest Memory = 5 GB >> Number of cores in guest = 1 >> Workload = test allocation program allocates 4GB memory, touches it via memset >> and exits. >> Procedure = >> The first guest is launched and once its console is up, the test allocation >> program is executed with 4 GB memory request (Due to this the guest occupies >> almost 4-5 GB of memory in the host in a system without page hinting). Once >> this program exits at that time another guest is launched in the host and the >> same process is followed. It is continued until the swap is not used. >> >> Results: >> Without hinting = 3, swap usage at the end 1.1GB. >> With hinting = 5, swap usage at the end 0. >> >> 2. Use-case - memhog execution time >> Guest Memory = 6GB >> Number of cores = 4 >> NUMA Nodes = 1 with 15 GB memory >> Process: 3 Guests are launched and the ‘memhog 6G’ execution time is monitored >> one after the other in each of them. >> Without Hinting - Guest1:47s, Guest2:53s, Guest3:3m35s, End swap usage: 3.5G >> With Hinting - Guest1:40s, Guest2:44s, Guest3:37s, End swap usage: 0 >> >> Performance analysis: >> 1. will-it-scale's page_faul1: >> Guest Memory = 6GB >> Number of cores = 24 >> >> Without Hinting: >> tasks,processes,processes_idle,threads,threads_idle,linear >> 0,0,100,0,100,0 >> 1,315890,95.82,317633,95.83,317633 >> 2,570810,91.67,531147,91.94,635266 >> 3,826491,87.54,713545,88.53,952899 >> 4,1087434,83.40,901215,85.30,1270532 >> 5,1277137,79.26,916442,83.74,1588165 >> 6,1503611,75.12,1113832,79.89,1905798 >> 7,1683750,70.99,1140629,78.33,2223431 >> 8,1893105,66.85,1157028,77.40,2541064 >> 9,2046516,62.50,1179445,76.48,2858697 >> 10,2291171,58.57,1209247,74.99,3176330 >> 11,2486198,54.47,1217265,75.13,3493963 >> 12,2656533,50.36,1193392,74.42,3811596 >> 13,2747951,46.21,1185540,73.45,4129229 >> 14,2965757,42.09,1161862,72.20,4446862 >> 15,3049128,37.97,1185923,72.12,4764495 >> 16,3150692,33.83,1163789,70.70,5082128 >> 17,3206023,29.70,1174217,70.11,5399761 >> 18,3211380,25.62,1179660,69.40,5717394 >> 19,3202031,21.44,1181259,67.28,6035027 >> 20,3218245,17.35,1196367,66.75,6352660 >> 21,3228576,13.26,1129561,66.74,6670293 >> 22,3207452,9.15,1166517,66.47,6987926 >> 23,3153800,5.09,1172877,61.57,7305559 >> 24,3184542,0.99,1186244,58.36,7623192 >> >> With Hinting: >> 0,0,100,0,100,0 >> 1,306737,95.82,305130,95.78,306737 >> 2,573207,91.68,530453,91.92,613474 >> 3,810319,87.53,695281,88.58,920211 >> 4,1074116,83.40,880602,85.48,1226948 >> 5,1308283,79.26,1109257,81.23,1533685 >> 6,1501987,75.12,1093661,80.19,1840422 >> 7,1695300,70.99,1104207,79.03,2147159 >> 8,1901523,66.85,1193613,76.90,2453896 >> 9,2051288,62.73,1200913,76.22,2760633 >> 10,2275771,58.60,1192992,75.66,3067370 >> 11,2435016,54.48,1191472,74.66,3374107 >> 12,2623114,50.35,1196911,74.02,3680844 >> 13,2766071,46.22,1178589,73.02,3987581 >> 14,2932163,42.10,1166414,72.96,4294318 >> 15,3000853,37.96,1177177,72.62,4601055 >> 16,3113738,33.85,1165444,70.54,4907792 >> 17,3132135,29.77,1165055,68.51,5214529 >> 18,3175121,25.69,1166969,69.27,5521266 >> 19,3205490,21.61,1159310,65.65,5828003 >> 20,3220855,17.52,1171827,62.04,6134740 >> 21,3182568,13.48,1138918,65.05,6441477 >> 22,3130543,9.30,1128185,60.60,6748214 >> 23,3087426,5.15,1127912,55.36,7054951 >> 24,3099457,1.04,1176100,54.96,7361688 >> >> [1] https://lkml.org/lkml/2019/3/6/413 >> -- Regards Nitesh