On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
>> This patch series proposes an efficient mechanism for communicating free memory
>> from a guest to its hypervisor. It especially enables guests with no page cache
>> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
>> rapidly hand back free memory to the hypervisor.
>> This approach has a minimal impact on the existing core-mm infrastructure.
> Could you help us compare with Alex's series?
> What are the main differences?
Sorry for the late reply, but I haven't been feeling too well during the
last week.

The main differences are that this series uses a bitmap to track pages
that should be hinted to the hypervisor, while Alexander's series tracks
it directly in core-mm. Also in order to prevent duplicate hints
Alexander's series uses a newly defined page flag whereas I have added
another argument to __free_one_page.
For these reasons, Alexander's series is relatively more core-mm
invasive, while this series is lightweight (e.g., LOC). We'll have to
see if there are real performance differences.

I'm planning on doing some further investigations/review/testing/...
once I'm back on track.
>
>> Measurement results (measurement details appended to this email):
>> * With active page hinting, 3 more guests could be launched each of 5 GB(total 
>> 5 vs. 2) on a 15GB (single NUMA) system without swapping.
>> * With active page hinting, on a system with 15 GB of (single NUMA) memory and
>> 4GB of swap, the runtime of "memhog 6G" in 3 guests (run sequentially) resulted
>> in the last invocation to only need 37s compared to 3m35s without page hinting.
>>
>> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
>> A new hook after buddy merging is used to set the bits in the bitmap.
>> Currently, the bits are only cleared when pages are hinted, not when pages are
>> re-allocated.
>>
>> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
>> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
>> threshold is met, trying to isolate and report pages that are still free.
>>
>> The isolated pages are reported via virtio-balloon, which is responsible for
>> sending batched pages to the host synchronously. Once the hypervisor processed
>> the hinting request, the isolated pages are returned back to the buddy.
>>
>> The key changes made in this series compared to v9[1] are:
>> * Pages only in the chunks of "MAX_ORDER - 2" are reported to the hypervisor to
>> not break up the THP.
>> * At a time only a set of 16 pages can be isolated and reported to the host to
>> avoids any false OOMs.
>> * page_hinting.c is moved under mm/ from virt/kvm/ as the feature is dependent
>> on virtio and not on KVM itself. This would enable any other hypervisor to use
>> this feature by implementing virtio devices.
>> * The sysctl variable is replaced with a virtio-balloon parameter to
>> enable/disable page-hinting.
>>
>> Pending items:
>> * Test device assigned guests to ensure that hinting doesn't break it.
>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device side support.
>> * Compare reporting free pages via vring with vhost.
>> * Decide between MADV_DONTNEED and MADV_FREE.
>> * Look into memory hotplug, more efficient locking, possible races when
>> disabling.
>> * Come up with proper/traceable error-message/logs.
>> * Minor reworks and simplifications (e.g., virtio protocol).
>>
>> Benefit analysis:
>> 1. Use-case - Number of guests that can be launched without swap usage
>> NUMA Nodes = 1 with 15 GB memory
>> Guest Memory = 5 GB
>> Number of cores in guest = 1
>> Workload = test allocation program allocates 4GB memory, touches it via memset
>> and exits.
>> Procedure =
>> The first guest is launched and once its console is up, the test allocation
>> program is executed with 4 GB memory request (Due to this the guest occupies
>> almost 4-5 GB of memory in the host in a system without page hinting). Once
>> this program exits at that time another guest is launched in the host and the
>> same process is followed. It is continued until the swap is not used.
>>
>> Results:
>> Without hinting = 3, swap usage at the end 1.1GB.
>> With hinting = 5, swap usage at the end 0.
>>
>> 2. Use-case - memhog execution time
>> Guest Memory = 6GB
>> Number of cores = 4
>> NUMA Nodes = 1 with 15 GB memory
>> Process: 3 Guests are launched and the ‘memhog 6G’ execution time is monitored
>> one after the other in each of them.
>> Without Hinting - Guest1:47s, Guest2:53s, Guest3:3m35s, End swap usage: 3.5G
>> With Hinting - Guest1:40s, Guest2:44s, Guest3:37s, End swap usage: 0
>>
>> Performance analysis:
>> 1. will-it-scale's page_faul1:
>> Guest Memory = 6GB
>> Number of cores = 24
>>
>> Without Hinting:
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,315890,95.82,317633,95.83,317633
>> 2,570810,91.67,531147,91.94,635266
>> 3,826491,87.54,713545,88.53,952899
>> 4,1087434,83.40,901215,85.30,1270532
>> 5,1277137,79.26,916442,83.74,1588165
>> 6,1503611,75.12,1113832,79.89,1905798
>> 7,1683750,70.99,1140629,78.33,2223431
>> 8,1893105,66.85,1157028,77.40,2541064
>> 9,2046516,62.50,1179445,76.48,2858697
>> 10,2291171,58.57,1209247,74.99,3176330
>> 11,2486198,54.47,1217265,75.13,3493963
>> 12,2656533,50.36,1193392,74.42,3811596
>> 13,2747951,46.21,1185540,73.45,4129229
>> 14,2965757,42.09,1161862,72.20,4446862
>> 15,3049128,37.97,1185923,72.12,4764495
>> 16,3150692,33.83,1163789,70.70,5082128
>> 17,3206023,29.70,1174217,70.11,5399761
>> 18,3211380,25.62,1179660,69.40,5717394
>> 19,3202031,21.44,1181259,67.28,6035027
>> 20,3218245,17.35,1196367,66.75,6352660
>> 21,3228576,13.26,1129561,66.74,6670293
>> 22,3207452,9.15,1166517,66.47,6987926
>> 23,3153800,5.09,1172877,61.57,7305559
>> 24,3184542,0.99,1186244,58.36,7623192
>>
>> With Hinting:
>> 0,0,100,0,100,0
>> 1,306737,95.82,305130,95.78,306737
>> 2,573207,91.68,530453,91.92,613474
>> 3,810319,87.53,695281,88.58,920211
>> 4,1074116,83.40,880602,85.48,1226948
>> 5,1308283,79.26,1109257,81.23,1533685
>> 6,1501987,75.12,1093661,80.19,1840422
>> 7,1695300,70.99,1104207,79.03,2147159
>> 8,1901523,66.85,1193613,76.90,2453896
>> 9,2051288,62.73,1200913,76.22,2760633
>> 10,2275771,58.60,1192992,75.66,3067370
>> 11,2435016,54.48,1191472,74.66,3374107
>> 12,2623114,50.35,1196911,74.02,3680844
>> 13,2766071,46.22,1178589,73.02,3987581
>> 14,2932163,42.10,1166414,72.96,4294318
>> 15,3000853,37.96,1177177,72.62,4601055
>> 16,3113738,33.85,1165444,70.54,4907792
>> 17,3132135,29.77,1165055,68.51,5214529
>> 18,3175121,25.69,1166969,69.27,5521266
>> 19,3205490,21.61,1159310,65.65,5828003
>> 20,3220855,17.52,1171827,62.04,6134740
>> 21,3182568,13.48,1138918,65.05,6441477
>> 22,3130543,9.30,1128185,60.60,6748214
>> 23,3087426,5.15,1127912,55.36,7054951
>> 24,3099457,1.04,1176100,54.96,7361688
>>
>> [1] https://lkml.org/lkml/2019/3/6/413
>>
-- 
Regards
Nitesh